Sparse Matrix Magic

One of the hardest parts of using R with large datasets is fitting them into RAM. R needs to keep all the data it is working with in memory and you can quickly fill up a few GB with only a couple of big models. This post will talk about using the sparse matrix class from the Matrix package to improve memory use, and when you might be able to do it.

How R Uses Matrices

You are using matrices whenever R builds a model, such as a linear regression. The most common way to build models is by preparing data in a data.frame, a model formula and then using them in a modelling function. For example, we can build the simple linear model below. This will give us the model object, cars_lm. This is a list containing all kinds of vectors and values that can be used by functions like predict() and summary() to tell us the results of the model.

#Load an example dataset into a data.frame
cars_data <- mtcars

# This formula will look for how the cyl, disp, hp and wt columns relate to mpg.
cars_formula <- "mpg ~ cyl + disp + hp + wt"

# Create the model with the linear model (lm) function
cars_lm <- lm(formula = cars_formula, data = cars_data)

summary(cars_lm)
## 
## Call:
## lm(formula = cars_formula, data = cars_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0562 -1.4636 -0.4281  1.2854  5.8269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.82854    2.75747  14.807 1.76e-14 ***
## cyl         -1.29332    0.65588  -1.972 0.058947 .  
## disp         0.01160    0.01173   0.989 0.331386    
## hp          -0.02054    0.01215  -1.691 0.102379    
## wt          -3.85390    1.01547  -3.795 0.000759 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared:  0.8486, Adjusted R-squared:  0.8262 
## F-statistic: 37.84 on 4 and 27 DF,  p-value: 1.061e-10

What this code conceals is how R calculates the model using a model matrix. One of the first steps that lm() will perform is to use the model formula and the data to create the model matrix; a way of representing all of the variables that need to be fitted. R allows us to do the same thing ourselves with the function model.matrix(). The model matrix is then used as an input to the function lm.fit(), which is the function that does the hard work of figuring out the coefficients to use in the model.

cars_model_matrix <- model.matrix(as.formula(cars_formula), cars_data)

head(cars_model_matrix)
##                   (Intercept) cyl disp  hp    wt
## Mazda RX4                   1   6  160 110 2.620
## Mazda RX4 Wag               1   6  160 110 2.875
## Datsun 710                  1   4  108  93 2.320
## Hornet 4 Drive              1   6  258 110 3.215
## Hornet Sportabout           1   8  360 175 3.440
## Valiant                     1   6  225 105 3.460
cars_lm_fit <- lm.fit(x = cars_model_matrix, y = cars_data[["mpg"]])

summary(cars_lm_fit)
##               Length Class  Mode   
## coefficients   5     -none- numeric
## residuals     32     -none- numeric
## effects       32     -none- numeric
## rank           1     -none- numeric
## fitted.values 32     -none- numeric
## assign         5     -none- numeric
## qr             5     qr     list   
## df.residual    1     -none- numeric

The output of lm.fit() is not the same as lm(). It does not have the class attribute “lm”, so the summary function will simply treat cars_lm_fit as a generic list. The list still contains the coefficients that are needed to fit the model and predict values for new data, it just wont work with functions that expect an lm object.

The Sparse Matrix

The model matrix we created above is not sparse. It contains values in all of its rows and columns and they are all needed to build the model. Other models do not behave as nicely as this. If the data includes a categorical variable, that is one which has different ‘levels’, it will make a very different model matrix. As an example we will add a categorical variable to the cars data called type with the possible levels of “class a”, “class b” and “class c”. This is then added to the formula and fitted against.

cars_data[["type"]] <- factor(x = c("class a", "class a", "class b", "class c", "class b",
                                    "class a", "class a", "class b", "class c", "class b",
                                    "class a", "class a", "class b", "class c", "class b",
                                    "class a", "class a", "class b", "class c", "class b",
                                    "class a", "class a", "class b", "class c", "class b",
                                    "class a", "class a", "class b", "class c", "class b",
                                    "class a", "class a"
                                    )
                              , levels = c("class a", "class b", "class c")
                              )

cars_formula_type <- "mpg ~ cyl + disp + hp + wt + type"

# Create the model with the linear model (lm) function
cars_lm_type <- lm(formula = cars_formula_type, data = cars_data)
summary(cars_lm_type)
## 
## Call:
## lm(formula = cars_formula_type, data = cars_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2237 -1.2012 -0.1439  0.6997  5.0205 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.510828   2.888704  13.678 4.14e-13 ***
## cyl         -1.348446   0.644211  -2.093  0.04664 *  
## disp         0.008863   0.011744   0.755  0.45749    
## hp          -0.016236   0.012162  -1.335  0.19389    
## wt          -3.563300   1.032325  -3.452  0.00199 ** 
## typeclass b  1.726290   1.006648   1.715  0.09874 .  
## typeclass c  0.409291   1.237442   0.331  0.74359    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.464 on 25 degrees of freedom
## Multiple R-squared:  0.8652, Adjusted R-squared:  0.8329 
## F-statistic: 26.75 on 6 and 25 DF,  p-value: 9.874e-10

The type is fitted as if it were 2 variables; ‘typeclass b’ and ‘typeclass c’. The model assumes that ‘class a’ is the default and does not require its own coefficient. This behavior is based on the order of the factor levels, it will always use the first level.

We can see what this does to the model matrix below.

cars_model_matrix_type <- model.matrix(as.formula(cars_formula_type), cars_data)

head(cars_model_matrix_type)
##                   (Intercept) cyl disp  hp    wt typeclass b typeclass c
## Mazda RX4                   1   6  160 110 2.620           0           0
## Mazda RX4 Wag               1   6  160 110 2.875           0           0
## Datsun 710                  1   4  108  93 2.320           1           0
## Hornet 4 Drive              1   6  258 110 3.215           0           1
## Hornet Sportabout           1   8  360 175 3.440           1           0
## Valiant                     1   6  225 105 3.460           0           0
cars_lm_fit_type <- lm.fit(x = cars_model_matrix_type, y = cars_data[["mpg"]])

summary(cars_lm_fit_type)
##               Length Class  Mode   
## coefficients   7     -none- numeric
## residuals     32     -none- numeric
## effects       32     -none- numeric
## rank           1     -none- numeric
## fitted.values 32     -none- numeric
## assign         7     -none- numeric
## qr             5     qr     list   
## df.residual    1     -none- numeric

The model matrix now includes two extra columns; ‘typeclass b’ and ‘typeclass c’. These are 1 if type is class b or class c respectively, and only one of them can be anything other than 0 in a row.

If a variable has a single variable with 50 factors it will add 49 columns to the model matrix. In each row, only a single column will be 1 and all of the others will be 0. The default matrix in R takes just as much memory to record a 1, a 0 or any other reasonably small integer. The amount of RAM it takes to hold and manipulate the matrix increases rapidly, just by including different factors.

A sparse matrix from the Matrix package does not suffer from this problem. It does not record any cell with a value of 0. Instead, it records the location and values of all non-zero values. This extra overhead means that it will be larger than a non-sparse matrix if it is used to store a matrix that is full of data. For a matrix absolutely stuffed with 0’s, though, it can make a matrix that takes 1 GB of RAM fit into a couple of MB. Perfect for models dealing with lots of factors.

Using Sparse Matrix in Models

In a perfect world the function you are using to build your model will have an argument like sparse=TRUE to tell the model to use a sparse model matrix. Check the documentation for the function to see if it talks about it.

If your function doesn’t include built-in support for a sparse matrix you may be able to force it to use them. Look for a function ending in .fit like lm.fit, since this is commonly the function that will be acting on the model matrix. If this is available then you can create a sparse model matrix yourself using the sparse.model.matrix function and use it as the input to the fitting function. However, it is not garanteed to work.

library(Matrix)
cars_model_matrix_type <- sparse.model.matrix(as.formula(cars_formula_type), cars_data)
head(cars_model_matrix_type)
## 6 x 7 sparse Matrix of class "dgCMatrix"
##                   (Intercept) cyl disp  hp    wt typeclass b typeclass c
## Mazda RX4                   1   6  160 110 2.620           .           .
## Mazda RX4 Wag               1   6  160 110 2.875           .           .
## Datsun 710                  1   4  108  93 2.320           1           .
## Hornet 4 Drive              1   6  258 110 3.215           .           1
## Hornet Sportabout           1   8  360 175 3.440           1           .
## Valiant                     1   6  225 105 3.460           .           .
cars_lm_fit_type <- lm.fit(x = cars_model_matrix_type, y = cars_data[["mpg"]])
## Error in lm.fit(x = cars_model_matrix_type, y = cars_data[["mpg"]]): INTEGER() can only be applied to a 'integer', not a 'NULL'

This example failed because lm.fit does not accept sparse matrices as inputs. There are other packages, such as SparseM that include this capability. Try searching CRAN and Google for your modelling function and “sparse matrix” and you may be in luck.

A sparse matrix is not always going to use less memory than the default, but remember it the next time you work with factors.