R factoid: matrices are faster than data frames (when accessed element by element)

Matrix vs Data Frame

In R there are two obvious ways to store a rectangular grid of numbers: a matrix or a data frame. A data frame can hold a different type in each column, so it is richer than a matrix, which has a single type for all elements. Internally a matrix is just a vector with a dimension attribute.
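As a small illustration of the "vector with a dimension attribute" remark, the sketch below builds a matrix by hand from a plain vector. The names v, d, i and j are made up for this example and do not appear elsewhere in the post.

```r
# A matrix is an atomic vector plus a "dim" attribute.
v <- 1:6
dim(v) <- c(2, 3)          # v is now a 2 x 3 matrix, filled column by column
is.matrix(v)
attributes(v)              # the only attribute is $dim

# Element [i, j] sits at position i + (j - 1) * nrow(v) of the underlying vector.
i <- 2
j <- 3
v[i, j] == v[i + (j - 1) * nrow(v)]

# A data frame, by contrast, is a list with one element per column.
d <- as.data.frame(v)
is.list(d)
length(d)                  # number of columns
```

Because a matrix element is a single offset into one vector, while a data frame element first requires selecting a column from a list, it is plausible that element-by-element access costs more for the data frame.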

A data frame can be accessed as if it were a matrix using the notation df[i,j].

When we access a large number of elements a[i,j] one at a time (unvectorized), however, there is a clear difference in timing.

Let's first create a 100x100 matrix.

# create a 100 x 100 matrix with random numbers
n <- 100
m <- 100
a <- matrix(runif(n * m, 0, 100), nrow = m, ncol = n)
str(a)

##  num [1:100, 1:100] 85.1 17.7 34.9 54.2 51 ...

Now copy the data into a data frame:

df <- as.data.frame(a)
tibble::glimpse(df[, 1:10])

## Observations: 100
## Variables: 10
## $ V1  <dbl> 85.065933, 17.728731, 34.860332, 54.248920, 50.977486, 99....
## $ V2  <dbl> 5.813586, 56.404924, 41.151549, 32.029510, 40.857083, 81.8...
## $ V3  <dbl> 4.452198, 34.878083, 34.398440, 90.276525, 26.451508, 48.9...
## $ V4  <dbl> 43.496058, 95.162642, 65.250058, 83.941005, 55.507790, 17....
## $ V5  <dbl> 58.70423852, 39.08486762, 73.75686767, 51.61834429, 72.739...
## $ V6  <dbl> 34.55152, 36.22242, 12.87432, 74.17734, 32.98368, 93.05312...
## $ V7  <dbl> 96.112254, 55.291468, 8.653270, 28.452267, 18.926653, 80.6...
## $ V8  <dbl> 45.4440507, 20.2528389, 98.3861469, 90.8775842, 14.8820597...
## $ V9  <dbl> 3.2462605, 10.2364068, 90.3974374, 5.3393112, 41.1585281, ...
## $ V10 <dbl> 39.686012, 74.177476, 67.095580, 96.879985, 16.446060, 97....

Here we compare the memory use. They are roughly the same.

pryr::object_size(a)
pryr::object_size(df)

## 80.2 kB
## 91 kB
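The sizes can be sanity-checked with a little arithmetic: 100 x 100 doubles at 8 bytes each is 80,000 bytes, and the rest is overhead. The sketch below uses base R's utils::object.size() rather than pryr, which is a substitution on my part; the two functions do not always report exactly the same number.

```r
n <- 100
m <- 100
a  <- matrix(runif(n * m, 0, 100), nrow = m, ncol = n)
df <- as.data.frame(a)

raw_bytes <- n * m * 8        # 80000 bytes for the numbers themselves
raw_bytes

utils::object.size(a)         # a little more: vector header plus the dim attribute
utils::object.size(df)        # more again: one header per column, names, row names
```

The data frame carries a separate vector for each of the 100 columns plus column and row names, which would account for it being somewhat larger than the matrix.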

Let’s create a function that randomly accesses one million elements. To be able to check that both versions do the same work, we sum these elements.

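K <- 1e6
rowi <- sample(m, size = K, replace = TRUE)  # row indices (the matrix has m rows)
colj <- sample(n, size = K, replace = TRUE)  # column indices (the matrix has n columns)
f <- function(a) {
  s <- 0
  for (k in 1:K) {
    s <- s + a[rowi[k], colj[k]]
  }
  return(s)
}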
# same result: we compute the same thing
f(a)
f(df)

## [1] 49894303
## [1] 49894303

The results of calling f(a) and f(df) are the same, but the data frame version takes much more time:

system.time(f(a))
system.time(f(df))

##    user  system elapsed 
##    2.74    0.00    2.75 
##    user  system elapsed 
##   50.71    0.02   50.81
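One way to see where the time goes is to keep the loop but change only how the data frame element is reached. The sketch below is an assumption-laden variant, not part of the original experiment: g() is a hypothetical helper that first picks the column with [[ and then the row, K is reduced to 1e4 so it runs quickly, and a seed is set for repeatability. On most machines g(df) tends to be noticeably quicker than f(df), but no timings are claimed here.

```r
set.seed(1)                       # assumption: fixed seed, not in the original
m <- 100
n <- 100
a  <- matrix(runif(m * n, 0, 100), nrow = m, ncol = n)
df <- as.data.frame(a)

K <- 1e4                          # assumption: much smaller than the 1e6 used above
rowi <- sample(m, size = K, replace = TRUE)
colj <- sample(n, size = K, replace = TRUE)

# access pattern from the post: x[i, j]
f <- function(x) {
  s <- 0
  for (k in seq_len(K)) s <- s + x[rowi[k], colj[k]]
  s
}

# hypothetical variant for data frames: select the column, then the row
g <- function(x) {
  s <- 0
  for (k in seq_len(K)) s <- s + x[[colj[k]]][rowi[k]]
  s
}

all.equal(f(a), f(df))            # same elements, same sum
all.equal(f(df), g(df))           # the variant computes the same sum
system.time(f(df))
system.time(g(df))
```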

Matrix vs Data Frame

A data frame can be accessed as if it is a matrix using the notation df[i,j].

When we access a large number of elements a[i,j] one at a time (unvectorized), however, there is a clear difference in timing.

Let's first create a 100 x 100 matrix.

# create a 100 x 100 matrix with random numbers
m <- 100
n <- 100
a <- matrix(runif(m * n, 0, 100), nrow = m, ncol = n)
str(a)

## num [1:100, 1:100] 16.6 43.1 86.7 44.4 11.1 ...

Now copy the data into a data frame using as.data.frame(a). I could have used data.frame(a) instead (that would yield different column names X1,X2,...).

df <- as.data.frame(a)
df[1:5, 1:5]

##         V1       V2       V3          V4       V5
## 1 16.63825 37.27615 91.70009 31.16568013 60.35773
## 2 43.07365 26.23530 45.98191 27.83072027 94.46351
## 3 86.72209 14.70443 55.62293 51.58801596 83.38001
## 4 44.37759 93.33136 41.71945  0.06141574 75.64287
## 5 11.13246 14.10785 75.97597 51.34233858 54.39535

Here we compare the memory use. They are roughly the same.

pryr::object_size(a)
pryr::object_size(df)

## 80.2 kB
## 91 kB

Let's create a function that randomly accesses one million elements. To be able to check that we do the same thing we sum these elements.

K <- 1e6
rowi <- sample(m, size = K, replace = TRUE)
colj <- sample(n, size = K, replace = TRUE)
f <- function(a) {
   s <- 0
   for (k in 1:K) {
      s <- s + a[rowi[k], colj[k]]
   }
   return(s)
}
# same result: we compute the same thing
f(a)
f(df)

## [1] 49990017
## [1] 49990017

The results of calling f(a) and f(df) are the same, but the data frame version takes much more time:

system.time(f(a))
system.time(f(df))

##    user  system elapsed
##    3.18    0.00    3.20
##    user  system elapsed
##   58.39    0.07   59.31
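The timings above are for unvectorized access. When the loop is not essential, the same sum can be obtained in one step by indexing with a two-column matrix of (row, column) pairs. This is a sketch of that alternative, not something the experiment above measured; idx, s_mat and s_df are names introduced here, and the seed is an assumption.

```r
set.seed(1)                       # assumption: fixed seed, not in the original
m <- 100
n <- 100
a  <- matrix(runif(m * n, 0, 100), nrow = m, ncol = n)
df <- as.data.frame(a)

K <- 1e6
rowi <- sample(m, size = K, replace = TRUE)
colj <- sample(n, size = K, replace = TRUE)

idx <- cbind(rowi, colj)          # K x 2 matrix: one (row, column) pair per row

s_mat <- sum(a[idx])              # vectorized lookup on the matrix
s_df  <- sum(as.matrix(df)[idx])  # convert the data frame once, then do the same

all.equal(s_mat, s_df)
system.time(sum(a[idx]))
system.time(sum(as.matrix(df)[idx]))
```

Converting the data frame once with as.matrix() only makes sense when all columns share a type, as they do here; with mixed column types the conversion would coerce everything to a common type.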

Note: don't try this function on a data.table. The behavior and rules of indexing on a data.table are slightly different. Although we can use:

library(data.table)
dt <- data.table(df)
dt[1, 2]

##          V2
## 1: 37.27615

the indexing as used in function f() does not behave the way we are used to when working with data frames and matrices.
