This unit introduces some of R's built-in functions for applying a function repeatedly over a number of objects.
One of the most interesting features of R is how easily operations can be repeated. For instance, with a data-frame, it is possible to apply a function to every row or every column.
The R functions used to repeat operations are:
| Function | Arguments | Description |
|---|---|---|
| apply() | | Applies a function row-wise or column-wise. |
| | X | Name of the matrix (or data-frame). |
| | MARGIN | 1 stands for row-wise, 2 for column-wise. |
| | FUN | The function to be applied. |
| lapply() | | Works with lists and returns a list. |
| sapply() | | Works like lapply() but simplifies the result to a vector or matrix where possible. |
| tapply() | | Works like sapply() but by groups. |
| | INDEX | A list of one or more factors. |
| | FUN | The function to be applied. |
| mapply() | | Works like sapply() but over the corresponding elements of several lists or vectors. |
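Since mapply() is the only function in the table that takes several inputs at once, a minimal sketch may help. The vectors `base` and `exponent` are illustrative names, not part of the original unit:

```r
# Two parallel vectors: mapply() walks through them element by element
base     <- c(2, 3, 4)
exponent <- c(1, 2, 3)

# The function receives one element of each vector per call,
# so this computes 2^1, 3^2 and 4^3
powers <- mapply(function(b, e) b^e, base, exponent)
powers
```

The first call uses the first element of each vector, the second call the second elements, and so on.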
Applying a function to each element of a vector is the default behaviour in R, because most operators and mathematical functions are vectorized.
Consider the vector:
X <- c(2,3,4)
If you want the square of each element, just type:
X^2
## [1] 4 9 16
If you have a data-frame:
X <- data.frame(A = c(2,3,4), B = c(3,4,5))
X
## A B
## 1 2 3
## 2 3 4
## 3 4 5
the square of each element is given by:
X^2
## A B
## 1 4 9
## 2 9 16
## 3 16 25
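To make the idea of element-wise application explicit, the following sketch compares the vectorized operator with an explicit call to sapply(). The variable name `v` is illustrative:

```r
v <- c(2, 3, 4)

# Vectorized form: the operator is applied to every element at once
squared_vectorized <- v^2

# Explicit form: sapply() calls the function once per element
squared_explicit <- sapply(v, function(x) x^2)

# Both approaches give the same values
all.equal(squared_vectorized, squared_explicit)
```

For simple arithmetic the vectorized form is shorter and usually faster. The explicit form becomes useful when the function is not vectorized.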
Often, it is more useful to calculate statistics row-wise or column-wise. This can be done with the function apply():
Given a dataset:
A <- rnorm(n = 4, mean = 10 , sd = 1)
B <- rnorm(n = 4, mean = 20 , sd = 2)
C <- rnorm(n = 4, mean = 30 , sd = 3)
X <- round(data.frame(A, B, C), digits = 2); X
## A B C
## 1 8.37 21.26 25.45
## 2 10.02 19.38 35.35
## 3 9.23 18.65 26.12
## 4 10.62 18.64 29.51
it is possible to calculate the marginal row means by:
apply(X, MARGIN = 1, mean)
## [1] 18.36000 21.58333 18.00000 19.59000
or the marginal column means by:
apply(X, MARGIN = 2, mean)
## A B C
## 9.5600 19.4825 29.1075
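For the specific case of means, base R also provides the shortcuts rowMeans() and colMeans(). The sketch below rebuilds the data-frame from the rounded values printed above (an assumption, since the original values were random) and checks that the shortcuts agree with apply():

```r
# Rounded values copied from the printed data-frame
X <- data.frame(A = c(8.37, 10.02, 9.23, 10.62),
                B = c(21.26, 19.38, 18.65, 18.64),
                C = c(25.45, 35.35, 26.12, 29.51))

row_means <- rowMeans(X)   # same values as apply(X, MARGIN = 1, mean)
col_means <- colMeans(X)   # same values as apply(X, MARGIN = 2, mean)

# Compare the values only, ignoring any names attached to the results
all.equal(unname(row_means), unname(apply(X, MARGIN = 1, mean)))
all.equal(unname(col_means), unname(apply(X, MARGIN = 2, mean)))
```

apply() remains the general tool, because it accepts any function and not only the mean.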
The function apply() can also be used with user-defined functions:
# Coefficient of variation (sd / mean) of each column
apply(X, MARGIN = 2,
      function(x) round(sd(x) / mean(x), digits = 3))
## A B C
## 0.102 0.063 0.155
lapply() executes a function over each element of a list and returns a list. Since a data-frame is a list of columns, it also works on data-frames. In that case, the result is the same as apply(X, MARGIN = 2, FUN = ...), but the output is a list with as many elements as there are columns in the data-frame:
LPLY <- lapply(X, function(x) round(sd(x) / mean(x), digits = 3))
str(LPLY)
## List of 3
## $ A: num 0.102
## $ B: num 0.063
## $ C: num 0.155
LPLY
## $A
## [1] 0.102
##
## $B
## [1] 0.063
##
## $C
## [1] 0.155
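lapply() is most useful when the elements differ in length, so they cannot be stored as columns of a data-frame. A minimal sketch, with an illustrative list named `samples`:

```r
# A list whose elements have different lengths
samples <- list(small = c(1, 2, 3),
                large = c(10, 20, 30, 40))

# mean() is called once per element; the names of the list are kept
means <- lapply(samples, mean)
str(means)
```

The result is again a list, with one element per element of the input.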
Sometimes it is more convenient to use the function sapply() instead, which simplifies the output to a vector or, when the function returns several values per element, to a matrix:
SPLY <- sapply(X, function(x) round(sd(x) / mean(x), digits = 3))
str(SPLY)
## Named num [1:3] 0.102 0.063 0.155
## - attr(*, "names")= chr [1:3] "A" "B" "C"
SPLY
## A B C
## 0.102 0.063 0.155
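The shape of the sapply() output depends on what the function returns. The sketch below, which rebuilds the data-frame from the rounded values printed earlier (an assumption), shows a case that yields a matrix, and the stricter relative vapply():

```r
# Rounded values copied from the printed data-frame
X <- data.frame(A = c(8.37, 10.02, 9.23, 10.62),
                B = c(21.26, 19.38, 18.65, 18.64),
                C = c(25.45, 35.35, 26.12, 29.51))

# range() returns two values (min, max) per column,
# so sapply() simplifies the result to a 2 x 3 matrix
ranges <- sapply(X, range)
ranges

# vapply() works like sapply() but checks every result against a template,
# here a single numeric value per column
cv <- vapply(X, function(x) sd(x) / mean(x), FUN.VALUE = numeric(1))
round(cv, digits = 3)
```

vapply() is often preferred inside functions and scripts, because it stops with an error when a result does not have the expected type or length.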
By the way, the same result can be obtained with the function apply():
apply(X,
      MARGIN = 2,
      function(x) round(sd(x) / mean(x), digits = 3))
## A B C
## 0.102 0.063 0.155
The function tapply() works like sapply(), but the function is applied to groups of values defined by a factor. This is especially useful in factorial experiments, where one is interested in the averages over the levels of one or more factors.
Given a factorial design:
FactorA <- rep(seq(from = 1, to = 2, by = 1), each = 4)
FactorB <- rep(seq(from = 1, to = 2, by = 1), each = 2, times = 2)
# Simulated response: main effects, interaction and random noise
Response1 <- 5 * FactorA - 2 * FactorB + FactorA * FactorB +
  rnorm(n = length(FactorA), mean = 0, sd = 1)
df <- data.frame(FactorA   = as.factor(FactorA),
                 FactorB   = as.factor(FactorB),
                 Response1 = Response1)
df
## FactorA FactorB Response1
## 1 1 1 4.456599
## 2 1 1 2.906168
## 3 1 2 3.506017
## 4 1 2 3.736970
## 5 2 1 10.219816
## 6 2 1 9.484414
## 7 2 2 9.882215
## 8 2 2 12.331149
it is possible to calculate the averages of the measured variable Response1 grouped by FactorA:
tapply(df$Response1, df$FactorA, mean)
## 1 2
## 3.651439 10.479398
or by FactorB:
tapply(df$Response1, df$FactorB, mean)
## 1 2
## 6.766749 7.364088
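The INDEX argument can also be a list of several factors, in which case tapply() returns one value per combination of levels. The sketch below rebuilds the data-frame from the values printed above (an assumption, since the original response was random). The names `A` and `B` in the list are illustrative labels:

```r
# Values copied from the printed data-frame
df <- data.frame(
  FactorA   = factor(rep(1:2, each = 4)),
  FactorB   = factor(rep(1:2, each = 2, times = 2)),
  Response1 = c(4.456599, 2.906168, 3.506017, 3.736970,
                10.219816, 9.484414, 9.882215, 12.331149)
)

# One mean per cell of the design: rows are levels of FactorA,
# columns are levels of FactorB
cell_means <- tapply(df$Response1,
                     list(A = df$FactorA, B = df$FactorB),
                     mean)
cell_means
```

The result is a matrix, which is convenient for reading off the cell means of a two-way design.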
A convenient alternative to tapply() is aggregate(). It splits the data into subsets according to one or more factors, applies a function to each subset, and returns the result as a data-frame.
aggregate(df$Response1,
          by = list(df$FactorA),
          FUN = mean)
## Group.1 x
## 1 1 3.651439
## 2 2 10.479398
The function can combine several factors:
# Take the response and the grouping factors from df explicitly
aggregate(df$Response1,
          by = list(FactorA = df$FactorA,
                    FactorB = df$FactorB),
          FUN = mean)
## FactorA FactorB x
## 1 1 1 3.681384
## 2 2 1 9.852115
## 3 1 2 3.621494
## 4 2 2 11.106682
It also accepts the formula notation:
aggregate(Response1 ~ FactorA * FactorB,
          data = df,
          FUN = mean)
## FactorA FactorB Response1
## 1 1 1 3.681384
## 2 2 1 9.852115
## 3 1 2 3.621494
## 4 2 2 11.106682
Suppose we have a folder containing multiple .csv files with the same columns. They can all be imported and stacked into a single data-frame with the following script:
# Get the names of the .csv files in the working directory
# (pattern is a regular expression, not a shell wildcard)
files <- list.files(pattern = "\\.csv$")
# First apply read.csv() to each file, then stack the results with rbind()
myfiles <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
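The import script can be tried without any existing data by first writing a few small files to a temporary folder. In this sketch the folder name `csv_demo`, the file names and the columns `id` and `value` are all hypothetical:

```r
# Create a temporary folder with two small example files
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(0.5, 1.5)),
          file.path(demo_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(2.5, 3.5)),
          file.path(demo_dir, "part2.csv"), row.names = FALSE)

# full.names = TRUE returns complete paths, so the files can be read
# without changing the working directory
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list of data-frames, then stack them row-wise
tables   <- lapply(files, read.csv, stringsAsFactors = FALSE)
combined <- do.call(rbind, tables)
combined
```

Keeping the intermediate list `tables` makes it easy to inspect a single file when rbind() fails, for example because the column names differ.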