About Compressed Matrices

Once upon a time, there was a problem that inspired a quest. That quest created Mendz.Graphs, which is now a .NET Core class library that grew from a simple implementation of G = (V, E) into one that also provides features to represent a graph as a list, a dense matrix, a sparse matrix and, quite recently, a compressed matrix.

Did I just say "compressed"?! Weren't the sparse matrices the solution to represent large graphs? Is this another one of your "but wait, there's more!" kinda thing, Mendz?

Yes, yes and yes!

The sparse matrices save memory by storing only what needs to be stored. Apparently, there are many ways to save more memory. Way more!

Let's start by reading some online articles that discuss data compression. Wikipedia has an article about it. How Stuff Works has a pretty extensive series on compression, too. This WordPress article, though, is perhaps the most concise and easiest to understand.
Whenever there is a redundancy in data, lossless compression can compress the data. Therefore lossless compression exploits data redundancy. -- George Dallas
In a nutshell, the most basic lossless approach is really what I need. If the values in the matrix are redundant, there must be a way to store them that takes advantage of their redundancy. For example, in a (1, 0, 0)-adjacency matrix, 1 is the only possible non-zero value. Thus, you can represent a huge 10000 x 10000 matrix with a couple of lists or so containing the coordinates where the 1's are. If the matrix entries are redundant, the reduction in memory use can be quite significant compared to even the sparse matrices themselves.
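
To make that concrete, here is a minimal sketch (illustration only, not Mendz.Graphs code) of a (1, 0, 0)-adjacency matrix stored as nothing more than the coordinates of its 1 entries; the matrix size and the coordinate list are made-up values.

using System;
using System.Collections.Generic;

public static class CoordinateListDemo
{
    public static void Main()
    {
        // A huge 10000 x 10000 (1, 0, 0)-adjacency matrix...
        int rows = 10000, columns = 10000;

        // ...stored as just the coordinates of the 1's (hypothetical edges).
        // Every coordinate not in this list is implicitly 0.
        var ones = new List<(int row, int column)> { (0, 1), (1, 2), (9999, 0) };

        // Reading an entry becomes a membership check.
        bool IsSet(int row, int column) => ones.Contains((row, column));

        Console.WriteLine(IsSet(0, 1)); // True
        Console.WriteLine(IsSet(5, 5)); // False
        Console.WriteLine($"{rows} x {columns} matrix held as {ones.Count} coordinate pairs.");
    }
}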

Compressed Value Storage (CVS)

I am calling this approach compressed value storage (CVS). CVS is made up of four components (A, LA, L, S):
  • A is a list[NDNZ] of distinct non-zero values;
  • LA is a list[NDNZ] of lists[Ng] containing the linear indexes where Ai appears;
  • L is the indicator that specifies whether the coordinates-to-linear-index calculation uses row-major order or column-major order; and
  • S is the (int rows, int columns) tuple size of the matrix.
The size of A and LA is NDNZ (number of distinct non-zero elements). The elements in A align with the elements in LA: LAi is the list of Ai's linear indexes in the matrix. The total number of linear indexes in LA is ΣNg = NNZ (number of non-zero elements). L is a scalar which can be 0 (row-major order) or 1 (column-major order).
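
For example, take a small made-up 3 x 3 matrix that holds 5 at coordinates (0, 0), (0, 2) and (2, 0), and 7 at (1, 1). Using row-major order, those coordinates map to linear indexes 0, 2, 6 and 4, so A = (5, 7), LA = ((0, 2, 6), (4)), L = 0 and S = (3, 3).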

In Mendz.Library.Matrices, CVS can be represented as a tuple:

(
    List<int> value, 
    List<List<int>> linearIndex, 
    MatrixLinearIndexMode linearIndexMode,
    (int rows, int columns) size
)

Where A is List<int> value, LA is List<List<int>> linearIndex, L is MatrixLinearIndexMode linearIndexMode and S is (int rows, int columns) size.
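
The same made-up 3 x 3 example from earlier would look like the following sketch (illustration only; it assumes a using System.Collections.Generic; directive and the MatrixLinearIndexMode enum defined next).

// The 3 x 3 matrix being compressed:
// | 5 0 5 |
// | 0 7 0 |
// | 5 0 0 |
var cvs = (
    value: new List<int> { 5, 7 },                          // A: distinct non-zero values (NDNZ = 2)
    linearIndex: new List<List<int>>                        // LA: one list of linear indexes per value
    {
        new List<int> { 0, 2, 6 },                          // where 5 appears
        new List<int> { 4 }                                 // where 7 appears (ΣNg = NNZ = 4)
    },
    linearIndexMode: MatrixLinearIndexMode.RowMajorOrder,   // L
    size: (rows: 3, columns: 3)                             // S
);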

MatrixLinearIndexMode is an enumeration.

MatrixLinearIndexMode.cs
namespace Mendz.Library.Matrices
{
    public enum MatrixLinearIndexMode
    {
        RowMajorOrder,
        ColumnMajorOrder
    }
}

This code defines the MatrixLinearIndexMode enum, which has two possible values: RowMajorOrder (0) and ColumnMajorOrder (1). MatrixLinearIndexMode indicates how the linear indexes should be interpreted and how they can be converted back to coordinates when needed. You see, instead of saving the actual matrix coordinates, CVS saves their equivalent matrix linear indexes. In row-major order, the linear index of (row, column) is row * columns + column; in column-major order, it is column * rows + row. In C#, an int is a 32-bit value type, which uses 4 bytes of memory. A coordinate is a pair of integers (int row, int column), thus using 8 bytes of memory. By converting each coordinate pair to its linear index equivalent, which is a single int, the memory use per entry is cut down to 4 bytes.
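
The conversion itself is plain arithmetic. The helpers below are hypothetical (they are not necessarily the Mendz.Library API); they only sketch the two formulas.

public static class LinearIndexDemo
{
    // (row, column) -> linear index, given the matrix size and the mode.
    public static int ToLinearIndex(int row, int column,
        (int rows, int columns) size, MatrixLinearIndexMode mode) =>
        mode == MatrixLinearIndexMode.RowMajorOrder
            ? row * size.columns + column   // walk the matrix row by row
            : column * size.rows + row;     // walk the matrix column by column

    // Linear index -> (row, column), reversing the same formulas.
    public static (int row, int column) ToCoordinates(int linearIndex,
        (int rows, int columns) size, MatrixLinearIndexMode mode) =>
        mode == MatrixLinearIndexMode.RowMajorOrder
            ? (linearIndex / size.columns, linearIndex % size.columns)
            : (linearIndex % size.rows, linearIndex / size.rows);
}

For example, in a 3 x 4 matrix, the coordinates (2, 1) map to linear index 2 * 4 + 1 = 9 in row-major order and 1 * 3 + 2 = 5 in column-major order.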

CVS can be compared to the common compressed storage formats for sparse matrices, such as CRS (compressed row storage) and CCS (compressed column storage). You can read more about these formats online. These common compressed matrix formats are supported by many existing math libraries that support matrix operations.

CVS vs. CRS and CCS

CRS and CCS are basically tuples of three lists.
  • CRS is (value[NNZ], rowPointer[Ni+1], columnIndex[NNZ])
  • CCS is (value[NNZ], columnPointer[Ni+1], rowIndex[NNZ])
In comparison, CVS is a tuple of two lists (although the second list is a list of lists), a scalar and a tuple.
  • CVS is (value[NDNZ], linearIndex[NDNZ][ΣNg=NNZ], linearIndexMode, (int rows, int columns) size)
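
To put numbers on that, take the same made-up 3 x 3 matrix from the CVS example above (NNZ = 4, NDNZ = 2, Ni = 3). CRS stores value = (5, 5, 7, 5), rowPointer = (0, 2, 3, 4) and columnIndex = (0, 2, 1, 0), or 12 stored integers. CVS stores value = (5, 7) and linearIndex = ((0, 2, 6), (4)), or 6 stored integers, plus the mode scalar and the size tuple.
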
Observe that in CVS, only linearIndex[NDNZ][ΣNg=NNZ] is guaranteed to hit NNZ. CRS and CCS each have two lists guaranteed to hit NNZ. Thus, CRS/CCS can use more memory than CVS. In addition, CVS value[NDNZ] can be significantly smaller than the CRS/CCS value[NNZ] (e.g., for a (1, 0, 0)-adjacency matrix, the difference is 1 for CVS vs. NNZ for CRS/CCS). Likewise, CVS value[NDNZ] can also be smaller than the CRS/CCS *Pointer[Ni+1].

CRS/CCS will use memory for (2 * NNZ) + (Ni + 1) elements. CVS will use memory for NDNZ + NNZ elements (+ a scalar + a tuple value). Thus, CVS saves the extra memory used by CRS/CCS for (NNZ - NDNZ) + (Ni + 1) elements.
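
At scale, that adds up. For a hypothetical 10000 x 10000 (1, 0, 0)-adjacency matrix with 50,000 edges: NNZ = 50,000, NDNZ = 1 and Ni + 1 = 10,001, so CRS/CCS needs (2 * 50,000) + 10,001 = 110,001 elements while CVS needs 1 + 50,000 = 50,001 elements (plus the scalar and the size tuple), a saving of 60,000 elements.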
