One of the hardest parts of using R with large datasets is fitting them into RAM. R needs to keep all the data it is working with in memory and you can quickly fill up a few GB with only a couple of big models. This post will talk about using the sparse matrix class from the Matrix package to improve memory use, and when you might be able to do it.

## How R Uses Matrices

You are using matrices whenever R builds a model, such as a linear regression. The most common way to build models is by preparing data in a data.frame, a model formula and then using them in a modelling function. For example, we can build the simple linear model below. This will give us the model object, *cars_lm*. This is a list containing all kinds of vectors and values that can be used by functions like *predict()* and *summary()* to tell us the results of the model.

```
#Load an example dataset into a data.frame
cars_data <- mtcars
# This formula will look for how the cyl, disp, hp and wt columns relate to mpg.
cars_formula <- "mpg ~ cyl + disp + hp + wt"
# Create the model with the linear model (lm) function
cars_lm <- lm(formula = cars_formula, data = cars_data)
summary(cars_lm)
```

```
##
## Call:
## lm(formula = cars_formula, data = cars_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## hp -0.02054 0.01215 -1.691 0.102379
## wt -3.85390 1.01547 -3.795 0.000759 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
```

What this code conceals is how R calculates the model using a *model matrix*. One of the first steps that *lm()* will perform is to use the model formula and the data to create the model matrix; a way of representing all of the variables that need to be fitted. R allows us to do the same thing ourselves with the function model.matrix().
The model matrix is then used as an input to the function *lm.fit()*, which is the function that does the hard work of figuring out the coefficients to use in the model.

```
cars_model_matrix <- model.matrix(as.formula(cars_formula), cars_data)
head(cars_model_matrix)
```

```
## (Intercept) cyl disp hp wt
## Mazda RX4 1 6 160 110 2.620
## Mazda RX4 Wag 1 6 160 110 2.875
## Datsun 710 1 4 108 93 2.320
## Hornet 4 Drive 1 6 258 110 3.215
## Hornet Sportabout 1 8 360 175 3.440
## Valiant 1 6 225 105 3.460
```

```
cars_lm_fit <- lm.fit(x = cars_model_matrix, y = cars_data[["mpg"]])
summary(cars_lm_fit)
```

```
## Length Class Mode
## coefficients 5 -none- numeric
## residuals 32 -none- numeric
## effects 32 -none- numeric
## rank 1 -none- numeric
## fitted.values 32 -none- numeric
## assign 5 -none- numeric
## qr 5 qr list
## df.residual 1 -none- numeric
```

The output of *lm.fit()* is not the same as *lm()*. It does not have the class attribute “lm”, so the summary function will simply treat *cars_lm_fit* as a generic list. The list still contains the coefficients that are needed to fit the model and predict values for new data, it just wont work with functions that expect an lm object.

## The Sparse Matrix

The model matrix we created above is not sparse. It contains values in all of its rows and columns and they are all needed to build the model. Other models do not behave as nicely as this. If the data includes a categorical variable, that is one which has different ‘levels’, it will make a very different model matrix. As an example we will add a categorical variable to the cars data called *type* with the possible levels of “class a”, “class b” and “class c”. This is then added to the formula and fitted against.

```
cars_data[["type"]] <- factor(x = c("class a", "class a", "class b", "class c", "class b",
"class a", "class a", "class b", "class c", "class b",
"class a", "class a", "class b", "class c", "class b",
"class a", "class a", "class b", "class c", "class b",
"class a", "class a", "class b", "class c", "class b",
"class a", "class a", "class b", "class c", "class b",
"class a", "class a"
)
, levels = c("class a", "class b", "class c")
)
cars_formula_type <- "mpg ~ cyl + disp + hp + wt + type"
# Create the model with the linear model (lm) function
cars_lm_type <- lm(formula = cars_formula_type, data = cars_data)
summary(cars_lm_type)
```

```
##
## Call:
## lm(formula = cars_formula_type, data = cars_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2237 -1.2012 -0.1439 0.6997 5.0205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.510828 2.888704 13.678 4.14e-13 ***
## cyl -1.348446 0.644211 -2.093 0.04664 *
## disp 0.008863 0.011744 0.755 0.45749
## hp -0.016236 0.012162 -1.335 0.19389
## wt -3.563300 1.032325 -3.452 0.00199 **
## typeclass b 1.726290 1.006648 1.715 0.09874 .
## typeclass c 0.409291 1.237442 0.331 0.74359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.464 on 25 degrees of freedom
## Multiple R-squared: 0.8652, Adjusted R-squared: 0.8329
## F-statistic: 26.75 on 6 and 25 DF, p-value: 9.874e-10
```

The type is fitted as if it were 2 variables; ‘typeclass b’ and ‘typeclass c’. The model assumes that ‘class a’ is the default and does not require its own coefficient. This behavior is based on the order of the factor levels, it will always use the first level.

We can see what this does to the model matrix below.

```
cars_model_matrix_type <- model.matrix(as.formula(cars_formula_type), cars_data)
head(cars_model_matrix_type)
```

```
## (Intercept) cyl disp hp wt typeclass b typeclass c
## Mazda RX4 1 6 160 110 2.620 0 0
## Mazda RX4 Wag 1 6 160 110 2.875 0 0
## Datsun 710 1 4 108 93 2.320 1 0
## Hornet 4 Drive 1 6 258 110 3.215 0 1
## Hornet Sportabout 1 8 360 175 3.440 1 0
## Valiant 1 6 225 105 3.460 0 0
```

```
cars_lm_fit_type <- lm.fit(x = cars_model_matrix_type, y = cars_data[["mpg"]])
summary(cars_lm_fit_type)
```

```
## Length Class Mode
## coefficients 7 -none- numeric
## residuals 32 -none- numeric
## effects 32 -none- numeric
## rank 1 -none- numeric
## fitted.values 32 -none- numeric
## assign 7 -none- numeric
## qr 5 qr list
## df.residual 1 -none- numeric
```

The model matrix now includes two extra columns; ‘typeclass b’ and ‘typeclass c’. These are 1 if `type`

is class b or class c respectively, and only one of them can be anything other than 0 in a row.

If a variable has a *single* variable with 50 factors it will add 49 columns to the model matrix. In each row, only a single column will be 1 and all of the others will be 0. The default matrix in R takes just as much memory to record a 1, a 0 or any other reasonably small integer. The amount of RAM it takes to hold and manipulate the matrix increases rapidly, just by including different factors.

A **sparse matrix** from the Matrix package does not suffer from this problem. It does not record any cell with a value of 0. Instead, it records the location and values of all non-zero values. This extra overhead means that it will be larger than a non-sparse matrix if it is used to store a matrix that is full of data. For a matrix absolutely stuffed with 0’s, though, it can make a matrix that takes 1 GB of RAM fit into a couple of MB. Perfect for models dealing with lots of factors.

## Using Sparse Matrix in Models

In a perfect world the function you are using to build your model will have an argument like `sparse=TRUE`

to tell the model to use a sparse model matrix. Check the documentation for the function to see if it talks about it.

If your function doesn’t include built-in support for a sparse matrix you may be able to force it to use them. Look for a function ending in *.fit* like `lm.fit`

, since this is commonly the function that will be acting on the model matrix. If this is available then you can create a sparse model matrix yourself using the `sparse.model.matrix`

function and use it as the input to the fitting function. However, it is not garanteed to work.

```
library(Matrix)
cars_model_matrix_type <- sparse.model.matrix(as.formula(cars_formula_type), cars_data)
head(cars_model_matrix_type)
```

```
## 6 x 7 sparse Matrix of class "dgCMatrix"
## (Intercept) cyl disp hp wt typeclass b typeclass c
## Mazda RX4 1 6 160 110 2.620 . .
## Mazda RX4 Wag 1 6 160 110 2.875 . .
## Datsun 710 1 4 108 93 2.320 1 .
## Hornet 4 Drive 1 6 258 110 3.215 . 1
## Hornet Sportabout 1 8 360 175 3.440 1 .
## Valiant 1 6 225 105 3.460 . .
```

`cars_lm_fit_type <- lm.fit(x = cars_model_matrix_type, y = cars_data[["mpg"]])`

`## Error in lm.fit(x = cars_model_matrix_type, y = cars_data[["mpg"]]): INTEGER() can only be applied to a 'integer', not a 'NULL'`

This example failed because lm.fit does not accept sparse matrices as inputs. There are other packages, such as `SparseM`

that include this capability. Try searching CRAN and Google for your modelling function and “sparse matrix” and you may be in luck.

A sparse matrix is not always going to use less memory than the default, but remember it the next time you work with factors.