# Overview: GraphMat

## Vertex programming

In this model, the entire algorithm can be expressed as operations on a single vertex and its edges. This programming model is generally considered easy to use and not overly restrictive, but implementation performance can vary significantly in practice. GraphMat has been shown to have the best performance amongst many different software graph frameworks on a single node.

### What?

Traditionally, when we implement a graph algorithm, we can freely iterate through all vertices, all edges, and maybe create some additional data structure. Basically, we have a global view of the graph and everything else. In the vertex-centric model, however, we write algorithms from a single vertex's point of view. For any vertex, the only information it has is its neighbor list and its own properties, which varies by algorithm depending on what needs to be stored!

### How?

A vertex-centric algorithm consists of a so-called vertex program. This program is executed by each vertex. It can manipulate any local information, send messages to neighboring vertices, and receive messages from them. An algorithm runs iteratively. In each iteration, the vertex program is executed by the vertices, and messages are exchanged between vertices. The algorithm terminates when no message are sent from any, vertex indicating a halt.

### Why?

This model seems to be more restrictive, and it is. However, it makes distributed graph processing much easier.

• The algorithm can be parallelized because all vertices can execute the vertex program in parallel.

• With a limited set of operations available to the vertex program, distributed graph processing frameworks can be developed. This would enable all kinds of features you want to see in a distributed system. This also improves the programmer's productivity because we only need to write a few lines of code.

• As a bonus, this can enable some additional optimizations because the computation and communication patterns are constrained.

### Sparse Matrix vector multiplication

From Wiki: When storing and manipulating sparse matrices on a computer, it is beneficial and often necessary to use specialized algorithms and data structures that take advantage of the sparse structure of the matrix. Specialized computers have been made for sparse matrices, as they are common in the machine learning field. Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices as processing and memory are wasted on the zeros. Sparse data is by nature more easily compressed and thus requires significantly less storage. Some very large sparse matrices are infeasible to manipulate using standard dense-matrix algorithms.

``````function sparse_matrix_mult(A,x) =
{sum({v * x[i] : (i,v) in row})
: row in A};
``````

A common operation on sparse matrices is to multiply them by a dense vector. In such an operation, the result is the dot-product of each sparse row of the matrix with the dense vector.

This code takes each index-value pair `(i,v)` in the sparse row, multiplies `v` with the `i`th value of `x`, and sums the results. The work and depth is easily calculated using the performance rules. If `n` is the number of non-zeros in the row, then the depth of the computation is the depth of the sum, which is `O(log n)`, and the work is the sum of the work across the elements, which is `O(n)`.

## GraphMat Processing Model: Surface

Simplified GraphMat processing model

``````For each Vertex V
For each incoming edge E(U,V) from active vertex U
Res <- Process_Edge (E_weight, U_prop, [OPTIONAL]V_prop)
V_temp <- Reduce (V_temp, Res)
End
End
For each Vertex V,
V_prop <- Apply (V_temp, V_prop, V_const)
End
``````

Fig. 1 Slightly different from GraphMat that integrates `Send_message` with `Apply`

Fig. 1 shows a vertex programming example using three essential operations

• `Process_Edge`
• `Reduce`
• `Apply`

As previously stated, in a vertex program, each vertex has an application-specific vertex property that is updated iteratively. For all incoming edges from active vertices (vertices whose states were updated in the last iteration), the corresponding vertex processes each edge separately using `Process_Edge` and performs a reduction using `Reduce` to form a single value. Lastly, the reduced value, its vertex property, and a constant vertex property are used to update the current state of the vertex using `Apply`. Iterations proceed until there are no more active vertices or until a maximum number of iterations is specified.

## GraphMat Processing Model: Deep

GraphMat functions by taking vertex programs and mapping them to high performance sparse matrix operations in the backend. We thus get the productivity benefits of a vertex programming framework without sacrificing performance.

GraphMat is based on the idea that graph analytics via vertex programming can be performed through a backend that supports only sparse matrix operations. GraphMat takes graph algorithms written as vertex programs and performs generalized sparse matrix vector multiplication on them (iteratively in many cases).

This is possible as edge traversals from a set of vertices can be written as a sparse matrix-sparse vector multiplication routines on the graph adjacency matrix (or its transpose). Example to get out-degrees of a graph

### Mapping Vertex to Generalized SPMV

It's important to note that while vertex programs can have slightly different semantics, they are all equivalent in terms of expressibility.

As previously stated, a typical vertex program has a state associated with each vertex that is updated iteratively. Each iteration starts with a subset of vertices that are active. A vertex receiving such messages from its neighbors processes each message separately and reduces them to a single value. The reduced value is used to update the current state of the vertex. The iterative process continues for a fixed number of iterations or until no vertices change state. This follows the Bulk-synchronous parallel model.

Bulk Synchronous Parallel (BSP) is a programming model and computation framework for parallel computing. Computation is divided into a sequence of supersteps. In each superstep, a set of processes, running the same code, executes concurrently and creates messages that are sent to other processes. The superstep ends when all the computation in the superstep is complete and all messages have been sent. A barrier synchronization at the end of the superstep ensures that all messages have been transmitted (but not yet delivered to the processes). The next superstep begins with the delivery of all those messages to the processes, that then execute their superstep and send messages that will be delivered at the start of the next superstep. This process continues until all processors vote to halt.

BSP is a model that comes up in the context of large multi-core machines, sharing main memory. M/R comes up in the context of distributing over many small machines, where it is not possible to share memory.

### Specifications for a graph program

• Each vertex has a user-defined property data that is initialized (based on the algorithm used).

• A set of vertices are marked active.

• The user-defined function `Send_Message()` reads the vertex data and produces a message object (done for each active vertex)

• `Process_Message()` reads the message object, edge data along which the message arrived, and the destination vertex data and produces a processed message for that edge.

• `Reduce()` is typically a commutative function taking in the processed messages for a vertex and producing a single reduced value.

• `Apply()` reads the reduced value and modifies its vertex data (done for each vertex that receives a message).

• `Send_Message()` can be called to scatter along in and/or out edges.

## Example: Single Source Shortest Path (SSSP)

We want to calculate the shortest path to all vertices from a source vertex `A`.

• At a given iteration, we generate a sparse vector using `Send_Message()` on the active vertices. The message is the shortest distance to the vertex calculated so far.
• `Process_Message()` adds this message to the edge length
• `Reduce()` performs a `min` operation
• `Process_Edge()` and `Reduce()` together form a sparse matrix sparse vector multiply operation replacing traditional SPMV multiply operation with addition and SPMV addition with `min` respectively.

## SPMV Implementations

In this framework, a vertex program with `Process_Message` and `Reduce` can be written as a generalized SPMV. We can also partition this matrix into many chunks to improve parallelism and load balancing.

Generalized SPMV

``````function SPMV(Graph G, SparseVector x, Process_Message, Reduce)
y <- new SparseVector()
for j in G^T.column_indices do
if j is present in x then
for k in G^T .column_j do
result <- Process_Message(x_j, G.edge_value(k,j), G.getVertexProperty(k))
y_k <- Reduce(y_k, result)
return y
``````

Assuming that the graph adjacency matrix transpose G^T is stored in a Compressed Sparse Column (CSC) format

We implement SPMV by traversing the non-zero columns in G^T. If a particular column j has a corresponding non-zero at position j in the sparse vector, then the elements in the column are processed and values accumulated in the output vector y.

### Overall framework

The set of active vertices is maintained using a boolean array for performance reasons. In each iteration, this array is scanned to find the active vertices and a sparse vector of messages is generated. Then, a generalized SPMV is performed using this vector. The resulting output vector is used to update the state of vertices. If any vertices change state, they are marked active for the next iteration. The algorithm continues for a user-specified maximum number of iterations or until convergence.

### GraphMat overview `x`,`y`, are sparse vectors

``````function Run_Graph_Program(Graph G, GraphProgram P)
for i = 1 to MaxIterations do
for v = 1 to Vertices do
if v is active then
x_v <- P.Process_Message, P.Reduce)
Reset active for all vertices
for j = 1 to y.length do
v <- y.getVertex(j)
old_vertexproperty <- G.getVertexProperty(v)
G.setVertexProperty(v, y.getValye(j), P.Apply)
if G.getVertexProperty(v) != old_vertexproperty then
v set to active
if Number of active vertices == 0 then
break
``````

GraphMat follows an iterative process of `Send_Message` (lines 3-5), SPMV (line 6), and `Apply` (lines 8-13). Each such iteration is a superstep. GraphMat does not work on graphs that are too large to fit in main memory, as it assumes in-memory graph processing.

## Key characteristics that must be accounted for when considering graph domain accelerators

• Graph analytics applications are often memory latency/bandwidth bound

• Graph traversals often require many memory accesses relative to only small amounts of computation.

General purpose processors are not the ideal platform for executing such applications. Their inefficiencies include

• Waste of off-chip memory bandwidth from inefficient memory access granularity
• ineffective on-chip memory usage
• mismatch in execution granularity

Graphicionado pipeline is inspired by the vertex programming paradigm coupled with a few reconfigurable blocks.

• A specialized graph analytics processing hardware pipeline that employs datatype and memory subsystem specializations while offering workload-specific reconfigurable blocks called Graphicionado.

• An in-depth tour of the various microarchitecture optimizations to provide performance and energy efficiency, techniques to extract more parallelism, and tactics to support large-scale real-world graphs using slicing and partitioning.

We previously discussed the GraphMat framework, now we delve into Graphicionado: a hardware accelerator specialized for processing graph analytics algorithms. For better programming and flexibility, Graphicionado inherits the advantage of GraphMat. As in software graph processing frameworks, Graphicionado allows users to express specific graph algorithms by defining three computations

• `Process_Edge`
• `Reduce`
• `Apply`

In addition, it transparently handles all necessary data movement and communication on-chip and off-chip to support those operations. Graphicionado overcomes the limitations of GraphMat (and others) by applying specializations on the compute pipeline and the memory subsystem.

### Graphicionado Base Graph Processing Model

Graphicionado takes an input graph in the coordinate format. In this format, a graph is represented as a list of vertices where each vertex `v` is associated with a vertex property `VProperty`, and each edge `e` is associated with a 3-tuple (`srcid`, `dstid`, `weight`) indexed by the edge id `eid`. This edge list is sorted according to the `srcid` and then the `dstid`. Before the input graph can be f ed into the Graphicionado processing pipeline, some preprocessing of the input graph is done: an `EdgeIDTable` is constructed and stored in memory. This array stores the `eid` of the first edge of each vertex to allow streaming accesses of the edges starting at a particular vertex. Graphicionado also uses memory to store the vertex property array `VProperty`, the temporary vertex property array `VTempProperty`, and the constant vertex property array `VConst` associated with all vertices.

#### Processing Phase

``````for (int i=0; i < ActiveVertexCount; i++) {
Vertex src = ActiveVertex[i]; //Sequential Vertex Read
int eid = EdgeIDTable[src.id]; //Edge ID Read
Edge e = Edges[eid]; //Edge Read
while (e.srcid == src.id) {
dst.prop = VProperty[e.dstid]; //[Optional] Random Vertex Read
VertexProperty res = Process_Edge(e.weight, src.prop, dst.prop);
VertexProperty temp = VTempProperty[e.dstid]; //Random Vertex Read
temp = Reduce(temp, res);
VTempProperty[e.dstid] = temp; //Random Vertex Write
}
}
//Reset ActiveVertex and ActiveVertexCount
``````

In this phase, all outgoing edges from every active vertex are examined and the necessary user-defined computations, `Process-Edge` and `Reduce`, and updates to the associated vertex properties are calculated and stored into the temporary vertex property array `VTempProperty`. This phase is only terminated when all active vertices are processed. For some graph analytics workloads such as Collaborative Filtering, not only does the property associated with the source vertex need to be read and manipulated, but also the property associated with the destination vertex, as shown in the pseudocode (line 6),

#### Apply Phase

``````for (int i=1; i < TotalVertexCount; i++) {
VertexProperty vprop = VProperty[i]; //Sequential Vertex Read
VertexProperty temp = VTempProperty[i]; //Sequential Vertex Read
VertexProperty vconst = VConst[i];
temp = Apply(vprop, temp, vconst);
VProperty[i] = temp; //Sequential Vertex Write
if (temp != vprop) {
Vertex v;
v.id = i;
v.prop = temp;
ActiveVertex[ActiveVertexCount++] = v; //Sequential Vertex Write
}
}
``````

In this phase, properties associated with all vertices are updated using the user-defined `Apply` function. `Apply` uses input values from the vertex property array `VProperty`, the constant vertex property array `VConst` stored in memory and the temporary vertex property array `VTempProperty` computed from the processing phase to make necessary updates to the vertex property array. The `ActiveVertex` array keeps track of which active vertices changed their property values in this phase.

### Bandwidth

As the number of cores on chips increase, off-chip bandwidth is becoming a big wall. The off-chip bandwidth is divided between traffic to memory and traffic from memory. The traffic from memory is mainly due to cache misses. The traffic to memory occurs due to blocks evicted from last level cache to be sent down to the memory.

From the last level on-chip cache to memory is due to the writes sent to the memory whenever a block in LLC is evicted and is dirty. The block chosen to be evicted depends on the replacement policy of LLC. the victim block is then written to the memory.

## Microarchitecture

Each module corresponds to one or more lines of pseudocode written above, and their physical characteristics are obtained using the implementation methodology described later on.

### Modules

The Sequential Vertex Read performs sequential memory read accesses given a starting address. It buffers one cacheline worth of data and passes the vertex properties one at a time to the output queue for the consumption of the next pipe stage. This module is used in stage `P1` of the `Processing` phase, and `A1` and `A2` of the `Apply` phase to read `VProperty` and `VTempProperty`.

The Sequential Vertex Write performs sequential memory writes given a starting address. It takes the data to be stored from the input queue and writes it to its internal buffer. When its internal buffer has a cacheline worth of data, it performs a write request to memory. This module is used to store the updated `VProperty` in stage `A4` and `ActiveVertex` in stage `A5` during the `Apply` phase.

The Random Vertex Read/Write performs random reads and writes given a vertex id. It is used to read the destination `VProperty` in stage `P4`, read and write `VTempProperty` in stage `P7` and `P9`.

The EdgeID Read performs a read from the preconstructed `EdgeIdTable` given a vertex id and outputs the `eid` of the first edge associated with the input vertex id. Implementations of this module are different for the base and the optimized Graphicionado pipeline.

The Edge Read performs random and sequential reads of the edge data given an `eid`. The initial access is random but subsequent accesses are sequential. All edge data streamed in are examined for their `srcid`. The module continues fetching edge data from the memory as long as the `srcid` of the fetched edges matches the `srcid` of the edges being processed.

The Atomic Update performs the update of the destination `VProperty` in three steps. First, the module reads the `VProperty` in stage `P7`, then it performs a `Reduce` computation in stage `P8`, and finally it writes the modified `VProperty` back to memory in stage `P9`. Since this is a read-modify-write update, the process needs to be atomic for the same vertex (i.e. the same vertex cannot be in more than one of these pipeline stages at the same time). This hardware module enforces the atomicity in stage `P6` by stalling the pipeline and blocking the issue of the vertex id being processed when violation of the condition is detected. We use a small 4-way associative CAM-like structure to enforce such atomicity (16 4-byte entries).

### Custom Modules

As previously described, Graphicionado uses three user-defined custom computations, which express the target graph algorithm. There are several options to implement these custom computations:

• Fully reconfigurable fabric (FPGA on-chip/on-package)
• Programmable ALUs
• Fully custom implementations on chip for a set of algorithms.

## Optimizations

There are many algorithms (such as PageRank and CF) that treat all vertices as active. From this, we can make optimizations to the pipeline that satisfy these specific algorithms, essentially all that don't operate with frontiers of active vertices.