Distributed GraphLab: a Framework for Machine Learning and Data Mining in the Cloud (2012)
May 7, 2024

GraphLab is a programming model for machine learning and data mining. GraphLab takes a directed graph, a set of vertices, and update functions as input. Every update function accepts a vertex and returns a set of vertices. GraphLab selects a vertex from the set, applies a function to it, and adds the resulting set of vertices back into the set, repeating this process until the set is empty. GraphLab supports parallel execution of update functions on multi-processor machines. Distributed GraphLab extends GraphLab by distributing parts of the graph across the memory of multiple machines, and it uses Chandy-Lamport's snapshot algorithm for fault tolerance.
The GraphLab graph is a directed graph where arbitrary data, such as model parameters, can be associated with each vertex and edge. An update function receives a vertex, its adjacent vertices, and its adjacent edges, collectively referred to as the scope. While a function can update data within its scope, it cannot alter the structure of the graph. Given a set of vertices, GraphLab selects a vertex, applies the relevant update function to that vertex's scope, then adds any resulting vertices back into the set, repeating until the set is empty.
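To make the execution model concrete, here is a minimal, single-threaded sketch in Python. The class and function names are hypothetical, not GraphLab's actual API: a scheduler set of vertices is drained by repeatedly applying an update function to a vertex's scope and re-scheduling whatever vertices the function returns.

```python
from collections import deque

# A minimal, single-threaded sketch of the GraphLab execution model.
# Names here are hypothetical; they are not GraphLab's real API.
class Graph:
    def __init__(self):
        self.vertex_data = {}   # vertex id -> arbitrary data (e.g. model parameters)
        self.edge_data = {}     # (src, dst) -> arbitrary data
        self.out_edges = {}     # vertex id -> set of successor ids
        self.in_edges = {}      # vertex id -> set of predecessor ids

    def add_vertex(self, v, data):
        self.vertex_data[v] = data
        self.out_edges.setdefault(v, set())
        self.in_edges.setdefault(v, set())

    def add_edge(self, src, dst, data):
        self.edge_data[(src, dst)] = data
        self.out_edges[src].add(dst)
        self.in_edges[dst].add(src)

    def scope(self, v):
        """The vertex, its adjacent vertices, and its adjacent edges."""
        neighbors = self.out_edges[v] | self.in_edges[v]
        edges = {e: d for e, d in self.edge_data.items() if v in e}
        return v, neighbors, edges


def run(graph, update, initial_vertices):
    """Drain the scheduler set: apply `update` to a vertex's scope and
    re-schedule whatever vertices it returns, until the set is empty."""
    pending = deque(initial_vertices)
    scheduled = set(initial_vertices)
    while pending:
        v = pending.popleft()
        scheduled.discard(v)
        for u in update(graph, *graph.scope(v)):   # update may modify scope data
            if u not in scheduled:                 # avoid scheduling duplicates
                scheduled.add(u)
                pending.append(u)
```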
There are three consistency models to ensure serializability.
The full consistency model prevents the scopes of concurrently running update functions from overlapping.
The edge consistency model allows update functions to have exclusive read-write access to their vertex and the adjacent edges, but read-only access to adjacent vertices.
The vertex consistency model permits all update functions to run in parallel.
The following figure, cited from the paper, compares the lock scopes of the consistency models.
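The difference between the models can also be summarized by which parts of the scope each one locks exclusively versus for shared reads. The helper below is a purely illustrative sketch (hypothetical function, not GraphLab code) that maps a scope to the lock requests each model implies.

```python
def lock_plan(v, neighbors, adjacent_edges, model):
    """Hypothetical helper: which scope elements each consistency model
    locks for writing (exclusive) versus reading (shared)."""
    if model == "full":
        # The entire scope is exclusive: no overlapping scopes run concurrently.
        return {"write": {v} | set(neighbors) | set(adjacent_edges), "read": set()}
    if model == "edge":
        # Vertex and adjacent edges are exclusive; adjacent vertices are read-only.
        return {"write": {v} | set(adjacent_edges), "read": set(neighbors)}
    if model == "vertex":
        # Only the vertex itself is exclusive; updates otherwise run in parallel.
        return {"write": {v}, "read": set()}
    raise ValueError(f"unknown consistency model: {model}")
```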
Distributed GraphLab divides the GraphLab graph into subgraphs and distributes them across machines.
Each subgraph is stored as a separate file, referred to as an atom, in a distributed storage system. Each atom file contains a compressed binary journal of graph-generating commands, such as AddEdge(42->314, edata) and AddVertex(500, vdata).
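As a rough illustration (with Python's pickle and zlib standing in for whatever serialization and compression GraphLab actually uses), an atom can be thought of as a compressed command journal that a worker replays to reconstruct its subgraph.

```python
import pickle
import zlib

# Hypothetical sketch of an atom file: a compressed binary journal of
# graph-generating commands that a worker replays to rebuild its subgraph.
journal = [
    ("AddVertex", 500, {"rank": 0.15}),        # AddVertex(500, vdata)
    ("AddVertex", 42, {"rank": 0.15}),
    ("AddVertex", 314, {"rank": 0.15}),
    ("AddEdge", (42, 314), {"weight": 1.0}),   # AddEdge(42->314, edata)
]

# Writing the atom: serialize and compress the command journal.
atom_blob = zlib.compress(pickle.dumps(journal))

# Loading the atom on a worker: decompress and replay the commands in order.
vertices, edges = {}, {}
for command in pickle.loads(zlib.decompress(atom_blob)):
    if command[0] == "AddVertex":
        _, vid, vdata = command
        vertices[vid] = vdata
    elif command[0] == "AddEdge":
        _, (src, dst), edata = command
        edges[(src, dst)] = edata
```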
The Distributed GraphLab engine ensures that the execution of update functions follows the consistency model. There are two engines: the chromatic engine and the distributed locking engine. The chromatic engine limits parallel execution of update functions by constructing a vertex coloring that assigns a color to each vertex such that no adjacent vertices share the same color; vertices of the same color can then be updated in parallel without violating edge consistency. While the chromatic engine satisfies the distributed GraphLab abstraction, it does not provide sufficient scheduling flexibility for many applications. The distributed locking engine addresses these limitations through distributed mutual exclusion, extending the shared-memory approach of associating a readers-writer lock with each vertex.
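To illustrate the chromatic engine's idea, the sketch below uses a simple greedy coloring (the paper does not prescribe this particular coloring algorithm) and then updates each color group as one phase; since no two vertices in a phase are adjacent, running them concurrently cannot violate edge consistency.

```python
# Hypothetical sketch of the chromatic engine's scheduling idea.
def greedy_coloring(neighbors):
    """neighbors: dict mapping each vertex to the set of its adjacent vertices.
    Returns a color per vertex such that no two adjacent vertices share one."""
    color = {}
    for v in neighbors:
        taken = {color[u] for u in neighbors[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color


def chromatic_sweep(neighbors, update):
    """Run one sweep over all vertices, color by color. Vertices of the same
    color are never adjacent, so updating them concurrently cannot violate
    edge consistency; each phase is executed sequentially here for clarity."""
    color = greedy_coloring(neighbors)
    for c in sorted(set(color.values())):
        for v in (v for v, col in color.items() if col == c):
            update(v)   # in the real engine, this inner loop runs in parallel
```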