Aim of the Unit

This Unit aims to introduce some built-in functions of R that are used to repeat functions over a number of objects.

Repeating things

One of the most interesting features of R is the possibility to repeat things. For instance, with a data-frame, it is possible to apply a function:

over each element
over columns (column-wise)
over rows (row-wise)
over rows/columns by groups

R functions used to repeat things are:

Function	Arguments	Description
`apply()`		Applies a function row-wise or column-wise.
	`X`	a matrix (or an object that can be coerced to one, such as a data-frame).
	`MARGIN`	1 stands for row-wise, 2 for column-wise.
	`FUN`	the function to be applied.
`lapply()`		Works with lists and returns a list.
`sapply()`		Works as `lapply()` but simplifies the result to a vector or matrix when possible.
`tapply()`		Works as `sapply()` but by groups.
	`INDEX`	a list of one or more factors.
	`FUN`	the function to be applied.
`mapply()`		Works as `sapply()` but takes several lists or vectors and applies the function to their elements in parallel.
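
The table lists mapply(), which is not demonstrated elsewhere in this Unit. As a minimal sketch (the vectors `base` and `exponent` are made up for illustration), it calls the function on the first elements of its arguments, then on the second elements, and so on:

```r
# Hypothetical inputs, chosen only for illustration
base     <- c(2, 3, 4)
exponent <- c(1, 2, 3)

# The function receives base[1] and exponent[1], then base[2] and exponent[2], ...
powers <- mapply(function(b, e) b^e, base, exponent)
powers   # 2, 9, 64
```

For this simple case the vectorized expression base^exponent gives the same values; mapply() becomes useful when the function itself is not vectorized.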

Apply to each element

Arithmetic operators and many built-in functions in R are vectorized: when applied to a vector, they act on each element in turn, without any explicit repetition.

Consider the vector:

X <- c(2,3,4)

If you want the square of each element, just type:

X^2

## [1]  4  9 16

If you have a data-frame:

X <- data.frame(A = c(2,3,4), B = c(3,4,5))
X

##   A B
## 1 2 3
## 2 3 4
## 3 4 5

the square of each element is given by:

X^2

##    A  B
## 1  4  9
## 2  9 16
## 3 16 25
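
The same element-wise behaviour holds for many built-in functions, not only for `^`. A small sketch, reusing the values above under the new names `V` and `D` (introduced here only so that the example is self-contained):

```r
V <- c(2, 3, 4)
D <- data.frame(A = c(2, 3, 4), B = c(3, 4, 5))

sqrt(V)      # square root of each element of the vector
log(D)       # natural logarithm of each element of the data-frame
V * 10 + 1   # arithmetic is also applied element by element: 21 31 41
```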

Apply

Often, it is more important to calculate statistics row-wise or column-wise. This can be done with the function apply().

Given a dataset:

A <- rnorm(n = 4, mean = 10 , sd = 1)
B <- rnorm(n = 4, mean = 20 , sd = 2)
C <- rnorm(n = 4, mean = 30 , sd = 3)
X <- round(data.frame(A, B, C), digits = 2); X

##       A     B     C
## 1  8.37 21.26 25.45
## 2 10.02 19.38 35.35
## 3  9.23 18.65 26.12
## 4 10.62 18.64 29.51

it is possible to calculate the marginal row means by:

apply(X, MARGIN = 1, mean)

## [1] 18.36000 21.58333 18.00000 19.59000

or the marginal column means by:

apply(X, MARGIN = 2, mean)

##       A       B       C 
##  9.5600 19.4825 29.1075

The function apply() can also be used with user-defined functions:

apply(X, MARGIN = 2, 
      function(x) round(
        sd(x)/mean(x), 
        digits = 3))

##     A     B     C 
## 0.102 0.063 0.155
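
Two further details of apply() may be worth a sketch. Arguments given after the function are passed on to it, and a function that returns several values yields a matrix. The matrix `M` below is made up for illustration and is not the dataset used above:

```r
# Hypothetical 2 x 3 matrix with one missing value
M <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2,
            dimnames = list(c("r1", "r2"), c("A", "B", "C")))

# na.rm = TRUE is forwarded to mean()
col_means <- apply(M, MARGIN = 2, mean, na.rm = TRUE)

# range() returns two values per column, so the result is a 2 x 3 matrix
col_ranges <- apply(M, MARGIN = 2, range, na.rm = TRUE)
```

Here `col_means` holds 1.5, 4 and 5.5 for columns A, B and C, and each column of `col_ranges` holds the minimum and maximum of the corresponding column of `M`.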

Lapply

lapply() executes a function over the elements of a list and returns a list. Since a data-frame is a list of columns, it also works on data-frames. In that case, the results are the same as with apply(X, MARGIN = 2, FUN = ...), but the output is a list with as many elements as there are columns in the data-frame:

LPLY <- lapply(X, function(x) round(
         sd(x)/mean(x), 
         digits = 3))
str(LPLY)

## List of 3
##  $ A: num 0.102
##  $ B: num 0.063
##  $ C: num 0.155

LPLY

## $A
## [1] 0.102
## 
## $B
## [1] 0.063
## 
## $C
## [1] 0.155

Sometimes it is more convenient to use the function sapply() instead, which simplifies the output to a vector (or to a matrix, when the function returns more than one value per element):

SPLY <- sapply(X, function(x) round(
         sd(x)/mean(x), 
         digits = 3))
str(SPLY)

##  Named num [1:3] 0.102 0.063 0.155
##  - attr(*, "names")= chr [1:3] "A" "B" "C"

SPLY

##     A     B     C 
## 0.102 0.063 0.155

By the way, the same result could be obtained with the function apply():

apply(X, 
      MARGIN = 2, 
      function(x) round(
         sd(x)/mean(x), 
         digits = 3))

##     A     B     C 
## 0.102 0.063 0.155
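
Whether sapply() returns a vector or a matrix depends on what the applied function returns. A minimal sketch with a made-up data-frame `D` (not the dataset above):

```r
D <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30))

one_value  <- sapply(D, mean)    # one value per column: named vector
two_values <- sapply(D, range)   # two values per column: 2 x 2 matrix

is.matrix(one_value)    # FALSE
is.matrix(two_values)   # TRUE
```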

Tapply

The function tapply() works like sapply(), but the function is applied separately to the groups defined by a factor. This is especially useful in factorial experiments, where one is interested in the averages over the levels of one or more factors.

Given a factorial design:

FactorA <- rep(seq(from = 1, 
                   to = 2, 
                   by = 1), 
               each = 4)
FactorB <- rep(seq(from = 1, 
                   to = 2, 
                   by = 1), 
               each = 2, 
               times = 2)
Response1 <- 5*FactorA - 
             2*FactorB + 
             FactorA*FactorB + 
             rnorm(n = length(FactorA), 
                   mean = 0, 
                   sd = 1)
df <- data.frame(FactorA = as.factor(FactorA), 
                 FactorB = as.factor(FactorB), 
                 Response1 = Response1) 
df

##   FactorA FactorB Response1
## 1       1       1  4.456599
## 2       1       1  2.906168
## 3       1       2  3.506017
## 4       1       2  3.736970
## 5       2       1 10.219816
## 6       2       1  9.484414
## 7       2       2  9.882215
## 8       2       2 12.331149

it is possible to calculate the averages of the measured variable Response1 grouped by FactorA:

tapply(df$Response1, df$FactorA, mean)

##         1         2 
##  3.651439 10.479398

or by FactorB:

tapply(df$Response1, df$FactorB, mean)

##        1        2 
## 6.766749 7.364088
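
Since `INDEX` can be a list of several factors, tapply() can also return one value per combination of levels. The sketch below uses a small made-up data-frame `d` with fixed values, not the simulated design above:

```r
# Made-up data with two factors and a response
d <- data.frame(
  FactorA = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  FactorB = factor(c(1, 1, 2, 2, 1, 1, 2, 2)),
  y       = c(4, 2, 3, 5, 10, 8, 9, 13)
)

# A list of two factors gives a matrix with one mean per combination of levels
cell_means <- tapply(d$y, list(A = d$FactorA, B = d$FactorB), mean)
cell_means
```

The rows of `cell_means` correspond to the levels of FactorA and the columns to the levels of FactorB.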

Aggregate

A convenient alternative to tapply() is aggregate(). It splits the data into subsets according to one or more factors, applies a function to each subset, and returns the result as a data-frame.

aggregate(df$Response1, 
          by = list(df$FactorA), 
          mean)

##   Group.1         x
## 1       1  3.651439
## 2       2 10.479398

The function can also combine several factors:

aggregate(df$Response1, 
          by = list(FactorA = df$FactorA, 
                    FactorB = df$FactorB),
          FUN = mean)

##   FactorA FactorB         x
## 1       1       1  3.681384
## 2       2       1  9.852115
## 3       1       2  3.621494
## 4       2       2 11.106682

It also accepts formula notation:

aggregate(Response1 ~ FactorA*FactorB, 
          data = df,
          mean)

##   FactorA FactorB Response1
## 1       1       1  3.681384
## 2       2       1  9.852115
## 3       1       2  3.621494
## 4       2       2 11.106682
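
With the formula interface, several response variables can be summarised in one call by wrapping them in cbind(). A minimal sketch on a made-up data-frame `d` (the columns `y1` and `y2` are hypothetical):

```r
# Made-up data with one factor and two responses
d <- data.frame(
  FactorA = factor(c(1, 1, 2, 2)),
  y1      = c(1, 3, 5, 7),
  y2      = c(10, 30, 50, 70)
)

# One row per level of FactorA, one column of means per response
res <- aggregate(cbind(y1, y2) ~ FactorA, data = d, FUN = mean)
res
```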

Problem n.2: import multiple .csv files simultaneously

Suppose we have a folder containing multiple .csv files with the same columns. It is possible to import them all at once and stack them into a single data-frame with the following script:

# Get the file names (pattern is a regular expression, not a glob)
files <- list.files(pattern = "\\.csv$")
# First apply read.csv to each file, then stack the results with rbind
myfiles <- do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
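
To try the idea without real data, the sketch below first writes two small files into a temporary folder and then imports them. The folder name `csv_demo`, the file contents and the extra `source` column are all assumptions made for illustration:

```r
# Create two small example files in a temporary folder (made-up data)
folder <- file.path(tempdir(), "csv_demo")
dir.create(folder, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(0.5, 1.5)),
          file.path(folder, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(2.5, 3.5)),
          file.path(folder, "b.csv"), row.names = FALSE)

# full.names = TRUE returns paths that can be read from any working directory
files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

# Read each file and record which file every row came from
tables <- lapply(files, function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d$source <- basename(f)
  d
})

# Stack the individual data-frames into one
combined <- do.call(rbind, tables)
```

Keeping the file name in a column makes it possible to trace each row back to its origin after the files have been stacked.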

Looping