About Compressed Value Storage (CVS)

There is not much information to find online about compressed value storage (CVS); CRS and CCS seem to be the most popular matrix compression formats. So here I am writing a little more about CVS, so that there is something to find online about it.

Compressed value storage (CVS) applies basic lossless compression to a matrix by pairing each distinct non-zero value with the list of its matrix linear indexes. CVS supports compression in row-major order and in column-major order. The basic representation of CVS is a tuple of 4 values:
  • A - an unordered list[NDNZ] of the distinct non-zero values in the matrix.
  • LA - a list[NDNZ] such that LAi is Ai's ordered list[Ng] of linear indexes in the matrix.
  • L - an indicator of the matrix linear indexing mode: 0 for row-major order, and 1 for column-major order.
  • S - the (int rows, int columns) tuple size of the matrix.
NDNZ means "number of distinct non-zeroes". NNZ means "number of non-zeroes". In LA, the total number of linear indexes ΣNg = NNZ.
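To make the tuple concrete, here is a minimal sketch (in Python, not the Mendz.Library implementation) of building the (A, LA, L, S) tuple from a dense matrix in row-major order:

```python
def to_cvs(matrix):
    """Build the CVS tuple (A, LA, L, S) from a dense matrix, row-major (L = 0)."""
    rows, columns = len(matrix), len(matrix[0])
    value_indexes = {}  # distinct non-zero value -> ordered list of linear indexes
    for r in range(rows):
        for c in range(columns):
            value = matrix[r][c]
            if value != 0:
                # row-major linear index: columnIndex + (rowIndex * size.columns)
                value_indexes.setdefault(value, []).append(c + r * columns)
    A = list(value_indexes.keys())     # the NDNZ distinct non-zero values
    LA = list(value_indexes.values())  # LA[i] is A[i]'s linear indexes
    return (A, LA, 0, (rows, columns))

A, LA, L, S = to_cvs([[0, 5, 0],
                      [5, 0, 7]])
# A = [5, 7]; LA = [[1, 3], [5]]; ΣNg = 2 + 1 = 3 = NNZ
```

Note how the two 5s collapse into a single entry in A, which is where the compression of redundant values comes from.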

CVS stores the linear indexes in LAi instead of the actual coordinates to save memory. In C#, for example, an int is a 32-bit value type, which uses 4 bytes of memory (8 bits = 1 byte). Coordinates are a pair of two integers (int row, int column), thus using 8 bytes of memory. By converting coordinates to their linear index equivalent, a single int, the memory use per entry is cut down from 8 bytes to 4.

The inclusion of L and S is relevant in the conversion of coordinates to/from their equivalent matrix linear indexes.
  • If L is 0 (row-major order):
    • The conversion from coordinates to linear index is columnIndex + (rowIndex * size.columns).
    • The conversion from linear index to coordinates is (linearIndex / size.columns, linearIndex % size.columns).
  • If L is 1 (column-major order):
    • The conversion from coordinates to linear index is rowIndex + (columnIndex * size.rows).
    • The conversion from linear index to coordinates is (linearIndex % size.rows, linearIndex / size.rows).
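The conversions above can be sketched as two small functions (a Python illustration, not the library's implementation), one per direction, each branching on the linear indexing mode L:

```python
def to_linear_index(row, column, L, size):
    """Convert (row, column) coordinates to a linear index under mode L."""
    rows, columns = size
    if L == 0:  # row-major order
        return column + row * columns
    return row + column * rows  # column-major order

def to_coordinates(linear_index, L, size):
    """Convert a linear index back to (row, column) coordinates under mode L."""
    rows, columns = size
    if L == 0:  # row-major order
        return (linear_index // columns, linear_index % columns)
    return (linear_index % rows, linear_index // rows)  # column-major order

size = (2500, 2500)
li = to_linear_index(3, 7, 0, size)  # row-major: 7 + 3 * 2500 = 7507
assert to_coordinates(li, 0, size) == (3, 7)
```

The round trip only holds when both directions use the same L and S, which is exactly why CVS carries them in its tuple.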
In order for CVS to be interpreted correctly, the same linear indexing mode used to create the CVS should be used in operations. Thus, a CVS should clearly identify whether it is in row-major order or in column-major order. Algorithms can be written to work with CVS in either order, as long as the linear indexing mode is evaluated and used within the algorithm itself.

Compression

In my tests (using Visual Studio 2017's built-in profiler) with a DOK sparse (1, 0, 0)-adjacency matrix (Mendz.Library.Matrices.CoordinatesKeyedConcurrentSparseMatrix) sized 2500x2500 and containing 2500 NNZ, CVS can achieve a 13:1 compression ratio, or space savings of 92%. For the same sparse matrix instance, CRS/CCS can achieve a 4:1 compression ratio, or space savings of 76%. My tests show that the amount of memory used by CVS can be about 39% of the amount used by CRS/CCS.

There are many apparent reasons for this:
  • CVS uses memory for NDNZ values + ΣNg (= NNZ) linear indexes + the L scalar + the (int, int) size tuple + LA[NDNZ] list overhead -- where the scalar plus the tuple may cost approx. 12+ bytes. CRS/CCS use memory for (2 * NNZ) + Ni+1. Assuming the scalar and tuple bytes are insignificant, CVS saves the memory that CRS/CCS use for NNZ + Ni+1 - (NDNZ + LA[NDNZ] overhead).
  • NNZ can be significantly > NDNZ. For example, in a (1, 0, 0)-adjacency matrix with 2500 NNZ, the CVS NDNZ = 1.
  • Ni+1 can be significantly > NDNZ. For example, in a 2500x2500 (1, 0, 0)-adjacency matrix, Ni+1 = 2501, compared to the CVS NDNZ = 1.
  • CVS is a lossless compression that exploits redundancy. For matrices with no redundant entries, such that NDNZ = NNZ, and assuming the scalar and tuple bytes are insignificant, CVS saves the memory that CRS/CCS use for Ni+1 - (NDNZ + LA[NDNZ] overhead).
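A rough back-of-the-envelope check of the adjacency matrix example, assuming 4-byte ints and ignoring the scalar, tuple, and per-list overheads (so these are lower bounds, not the profiler's measurements):

```python
# 2500x2500 (1, 0, 0)-adjacency matrix with 2500 non-zeroes, all equal to 1.
NNZ, NDNZ, Ni_plus_1 = 2500, 1, 2501

# CVS stores NDNZ distinct values plus NNZ linear indexes.
cvs_bytes = 4 * (NDNZ + NNZ)           # 4 * 2501 = 10004 bytes

# CRS stores NNZ values, NNZ column indexes, and Ni+1 row pointers.
crs_bytes = 4 * (2 * NNZ + Ni_plus_1)  # 4 * 7501 = 30004 bytes

ratio = cvs_bytes / crs_bytes          # roughly one third
```

The estimate puts CVS at about a third of the CRS footprint for this matrix, in the same ballpark as the measured 39% once real-world overheads are added back in.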
As of this writing, I have no data to share with regards to performance and resource utilization during compression (and decompression). My crude tests in this regard suggest that CVS can match, if not beat, CRS/CCS.

I highly encourage developers who are adept at these kinds of tests and profiling to run them -- I would also appreciate seeing your results.

Operations

Looking at the matrix-vector multiplication implementations in Mendz.Library.Matrices, CVS does not necessarily sacrifice access/traversal complexity and performance. For operations optimized to work with matrix linear indexes, CVS offers access/traversal logic comparable to accessing/traversing a DOK sparse matrix. The possible compromise is in operations where the linear indexes need to be translated to their equivalent coordinates, which can add extra CPU/memory use inline to support the calculations.
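To illustrate that compromise, here is a sketch of matrix-vector multiplication over a CVS tuple (a Python illustration, not the Mendz.Library.Matrices extension method): each linear index is translated back to coordinates inline, which is the extra work the paragraph above refers to.

```python
def cvs_multiply(cvs, vector):
    """Multiply a CVS-compressed matrix by a vector, translating indexes inline."""
    A, LA, L, size = cvs
    rows, columns = size
    result = [0] * rows
    for value, linear_indexes in zip(A, LA):
        for li in linear_indexes:
            # inline linear index -> coordinates translation, per mode L
            if L == 0:  # row-major order
                row, column = li // columns, li % columns
            else:       # column-major order
                row, column = li % rows, li // rows
            result[row] += value * vector[column]
    return result

# [[0, 5, 0], [5, 0, 7]] in row-major CVS form, times [1, 2, 3]:
cvs = ([5, 7], [[1, 3], [5]], 0, (2, 3))
# row 0: 5*2 = 10; row 1: 5*1 + 7*3 = 26
```

Note that the traversal visits each non-zero exactly once, like a DOK walk; only the divide/modulo translation per entry is added cost.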

So far, I only have matrix-vector multiplication to work with in Mendz.Library.Matrices. I observed that the first call to the extension methods seems to take longer than subsequent calls to the same extension method. Based on my crude tests, and ignoring the first calls to the extension methods, it seems that the overall performance of CVS matrix-vector multiplication can match, if not beat, CRS/CCS (noting, by the way, that CCS seems to perform better than CRS).

I highly encourage developers who are adept at such tests and profiling to run them -- I would also appreciate seeing your results.

Learning more...

I am exploring other operations that I can implement for CVS, CRS and CCS, so that I can compare them side-by-side. I am currently looking at this for inspiration: http://web.stanford.edu/~wfsharpe/mia/mat/mia_mat2.htm. Wish me luck!
