This Unit aims to build a dataset from scratch.
A data-frame is a fundamental data structure in R. It is a collection of variables (by columns) and samples (by rows).
Data-frames can be built by the following functions:
Function | Arguments | Description |
---|---|---|
c() | | Combine values into a vector or list (e.g. c(1:5) gives: 1, 2, 3, 4, 5) |
rbind() | | Combine vectors or data-frames by rows |
cbind() | | Combine vectors or data-frames by columns |
rep() | | Replicate a list (or vector, matrix, etc.) |
rep() | times | Number of times to repeat the whole list |
rep() | length.out | Exact number of elements in the output |
rep() | each | Number of times to repeat each element of a list |
seq() | | Create regular sequences of numbers |
seq() | from | Starting value of a sequence |
seq() | to | End value of a sequence |
seq() | by | Increment value of a sequence |
seq() | length.out | Exact length of a sequence |
sample() | | Take a random sample from a vector or list |
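As a quick illustration of the functions in the table, the sketch below assembles a small data-frame from vectors. All object names (ids, group, dose, extra, small_df, shuffled) are hypothetical and chosen only for this example.

```r
# Hypothetical object names, used only for illustration
ids   <- c(1:4)                              # 1 2 3 4
group <- rep(c("a", "b"), each = 2)          # "a" "a" "b" "b"
dose  <- seq(from = 0, to = 1.5, by = 0.5)   # 0.0 0.5 1.0 1.5

# data.frame() keeps columns of different types side by side
small_df <- data.frame(id = ids, group = group, dose = dose)

# rbind() appends the rows of a second data-frame with the same columns
extra    <- data.frame(id = 5, group = "c", dose = 2)
small_df <- rbind(small_df, extra)

# cbind() appends a new column
small_df <- cbind(small_df, block = rep(1, times = nrow(small_df)))

# sample() shuffles the row order (the result changes at every run)
shuffled <- small_df[sample(nrow(small_df)), ]
```

Note that cbind() applied to plain vectors of different types would return a character matrix, which is why the data-frame is created with data.frame() first.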
In this Unit, you will learn how to build an experimental design with three categorical factors and three responses, each affected by some experimental noise, like the following table:
FactorA | FactorB | FactorC | Response1 | Response2 | Response3 |
---|---|---|---|---|---|
1 | 1 | 1 | 3 | 6 | 9 |
1 | 1 | 2 | 5 | 10 | 15 |
1 | 2 | 1 | 2 | 5 | 8 |
1 | 2 | 2 | 4 | 9 | 14 |
2 | 1 | 1 | 6 | 10 | 14 |
2 | 1 | 2 | 8 | 14 | 20 |
2 | 2 | 1 | 6 | 10 | 14 |
2 | 2 | 2 | 8 | 14 | 20 |
An experimenter aims to understand the effect of a factor on a measurable variable. Factors may be fixed at certain levels (e.g. “all the experiments are performed at three different temperatures”), or taken randomly (e.g. the measurable variable is recorded at randomly chosen temperatures). In R, factors are handled with the following functions:
Function | Arguments | Description |
---|---|---|
factor() | | Encode a vector as a factor |
factor() | x | A vector of data |
factor() | levels | A vector of the unique values that x can take |
factor() | labels | A character vector of labels for the levels |
factor() | ordered | Logical: should the levels be treated as ordered, in the order given? |
as.factor() | | Coerce its argument to a factor |
is.factor() | | Return TRUE if its argument is a factor, FALSE otherwise |
as.ordered() | | Coerce its argument to an ordered factor |
is.ordered() | | Return TRUE if its argument is an ordered factor, FALSE otherwise |
levels() | | levels(x) returns the levels of x; levels(x) <- c("a", ...) sets them |
labels() | | Find a suitable set of labels from an object, for use in printing or plotting |
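The following sketch shows how the arguments levels and ordered, and the function levels(), behave on a small character vector. The vector temp and its values are hypothetical.

```r
# Hypothetical temperature settings, used only for illustration
temp <- c("low", "high", "medium", "low", "high")

# Without 'levels', the levels are sorted alphabetically
f <- factor(temp)
levels(f)             # "high" "low" "medium"

# Give the levels explicitly and mark the factor as ordered
f_ord <- factor(temp,
                levels = c("low", "medium", "high"),
                ordered = TRUE)
is.ordered(f_ord)     # TRUE
f_ord < "high"        # TRUE FALSE TRUE TRUE FALSE

# levels()<- renames the levels in place
levels(f_ord) <- c("L", "M", "H")
as.character(f_ord)   # "L" "H" "M" "L" "H"
```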
factor()
For instance, the sequence in the column named FactorA can be reproduced with:
FactorA <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 4)
FactorA
## [1] 1 1 1 1 2 2 2 2
However, the object FactorA is a vector:
is.factor(FactorA)
## [1] FALSE
is.vector(FactorA)
## [1] TRUE
To encode such an object as a factor, use the function factor or as.factor:
FactorA <- factor(FactorA)
str(FactorA)
## Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2
It is possible to use more expressive factor names with the argument labels:
FactorA <- factor(FactorA,
                  levels = c(1, 2),
                  labels = c("red", "green"))
str(FactorA)
## Factor w/ 2 levels "red","green": 1 1 1 1 2 2 2 2
Although the levels are now displayed as character labels, the underlying integer codes (1 and 2) are the same as before.
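The relabelled factor can be inspected with as.integer(), as.character() and levels(), as in this minimal sketch. It rebuilds FactorA in a single step so that it runs on its own.

```r
# Same factor as above, built in one step
FactorA <- factor(rep(1:2, each = 4),
                  levels = c(1, 2),
                  labels = c("red", "green"))

as.integer(FactorA)     # integer codes: 1 1 1 1 2 2 2 2
as.character(FactorA)   # labels: "red" "red" "red" "red" "green" ...
levels(FactorA)         # "red" "green"
```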
Experimental responses can be simulated as a linear combination of the factor levels. Moreover, it is possible to add some random noise to simulate experimental uncertainty.
Responses can be simulated with random numbers. In R, there are several functions for the generation of random numbers. A couple of these functions are the following:
Function | Arguments | Description |
---|---|---|
rnorm() | | Normally distributed random numbers |
rnorm() | n | Number of values to be drawn |
rnorm() | mean | Mean of the Gaussian population |
rnorm() | sd | Standard deviation of the Gaussian population |
runif() | | Uniformly distributed random numbers |
runif() | min | Lower limit of the distribution |
runif() | max | Upper limit of the distribution |
Response1 <- rnorm(n = length(FactorA),
                   mean = 10,
                   sd = 1)
Response1
## [1] 10.594103 10.087919 9.918756 10.982557 9.927781 9.877853 10.521454
## [8] 9.820152
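Because these numbers are random, they differ at every run. A minimal sketch of how to make the draws reproducible with set.seed() is given below; the seed value and the names n_obs, resp_norm and resp_unif are arbitrary choices for this example.

```r
set.seed(123)   # arbitrary seed, chosen only for this example

n_obs <- 8      # hypothetical number of observations

# Gaussian values around a mean of 10
resp_norm <- rnorm(n = n_obs, mean = 10, sd = 1)

# Uniform values between 9 and 11
resp_unif <- runif(n = n_obs, min = 9, max = 11)

# Resetting the seed reproduces the same draws
set.seed(123)
identical(resp_norm, rnorm(n = n_obs, mean = 10, sd = 1))   # TRUE
```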
Any number of responses and factors can be set as shown before. Here is the code for a factorial experiment:
FactorA <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 4)
FactorB <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 2,
               times = 2)
FactorC <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               times = 4)
Response1 <- 2*FactorA -
             2*FactorB +
             2*FactorC +
             FactorA*FactorB
Response2 <- 3*FactorA -
             2*FactorB +
             4*FactorC +
             FactorA*FactorB
Response3 <- 4*FactorA -
             2*FactorB +
             6*FactorC +
             FactorA*FactorB
df <- data.frame(A = as.factor(FactorA),
                 B = as.factor(FactorB),
                 C = as.factor(FactorC),
                 R1 = Response1,
                 R2 = Response2,
                 R3 = Response3)
df
## A B C R1 R2 R3
## 1 1 1 1 3 6 9
## 2 1 1 2 5 10 15
## 3 1 2 1 2 5 8
## 4 1 2 2 4 9 14
## 5 2 1 1 6 10 14
## 6 2 1 2 8 14 20
## 7 2 2 1 6 10 14
## 8 2 2 2 8 14 20
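The responses in this table are deterministic. One possible way to add experimental noise, sketched here for the first response only, is to sum normally distributed values with zero mean. The noise level (sd = 0.5), the seed and the names Response1_noisy and df_noisy are assumptions made for this example.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

FactorA <- rep(1:2, each = 4)
FactorB <- rep(1:2, each = 2, times = 2)
FactorC <- rep(1:2, times = 4)

# Noise-free response, same formula as Response1 above
Response1 <- 2*FactorA - 2*FactorB + 2*FactorC + FactorA*FactorB

# Add Gaussian noise; sd = 0.5 is an assumed noise level
Response1_noisy <- Response1 + rnorm(n = length(Response1),
                                     mean = 0,
                                     sd = 0.5)

df_noisy <- data.frame(A  = as.factor(FactorA),
                       B  = as.factor(FactorB),
                       C  = as.factor(FactorC),
                       R1 = Response1_noisy)
```

A larger sd gives noisier responses; the same approach applies to the other two responses.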
A preliminary overview of data is given by the following functions:
Function | Description |
---|---|
head() | Shows the first rows of the dataset (six by default) |
tail() | Shows the last rows of the dataset (six by default) |
class() | Shows the type of data |
dim() | Shows the dimension of the dataset (rows x columns) |
ncol() | Number of columns |
nrow() | Number of rows |
length() | Number of elements in a vector |
str() | Shows the structure of the dataset |
names() | Shows the column names of a dataset |
summary.default() | Shows some basic information on the dataset, such as the column names, lengths and classes |
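A few of these functions are sketched below on a small stand-alone data-frame. The object demo and its contents are hypothetical, chosen so that the snippet runs on its own.

```r
# Small hypothetical data-frame, used only for illustration
demo <- data.frame(A  = factor(rep(1:2, each = 4)),
                   R1 = c(3, 5, 2, 4, 6, 8, 6, 8))

dim(demo)           # 8 2
nrow(demo)          # 8
ncol(demo)          # 2
length(demo$R1)     # 8
class(demo)         # "data.frame"
class(demo$A)       # "factor"
names(demo)         # "A" "R1"
tail(demo, n = 3)   # last three rows instead of the default six
```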
For instance, to check the first few rows of the data-frame:
head(df)
## A B C R1 R2 R3
## 1 1 1 1 3 6 9
## 2 1 1 2 5 10 15
## 3 1 2 1 2 5 8
## 4 1 2 2 4 9 14
## 5 2 1 1 6 10 14
## 6 2 1 2 8 14 20
To check the data structure:
str(df)
## 'data.frame': 8 obs. of 6 variables:
## $ A : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2
## $ B : Factor w/ 2 levels "1","2": 1 1 2 2 1 1 2 2
## $ C : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2
## $ R1: num 3 5 2 4 6 8 6 8
## $ R2: num 6 10 5 9 10 14 10 14
## $ R3: num 9 15 8 14 14 20 14 20
To get a summary of the data-frame:
summary.default(df)
## Length Class Mode
## A 8 factor numeric
## B 8 factor numeric
## C 8 factor numeric
## R1 8 -none- numeric
## R2 8 -none- numeric
## R3 8 -none- numeric
To change the column names:
names(df) <- c("FactorA",
               "FactorB",
               "FactorC",
               "Resp1",
               "Resp2",
               "Resp3")
df
## FactorA FactorB FactorC Resp1 Resp2 Resp3
## 1 1 1 1 3 6 9
## 2 1 1 2 5 10 15
## 3 1 2 1 2 5 8
## 4 1 2 2 4 9 14
## 5 2 1 1 6 10 14
## 6 2 1 2 8 14 20
## 7 2 2 1 6 10 14
## 8 2 2 2 8 14 20
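When only one column needs a new name, it can be selected by its current name instead of retyping the whole vector. This is a minimal sketch on a hypothetical data-frame called demo.

```r
# Hypothetical data-frame, used only for illustration
demo <- data.frame(A = 1:2, B = 3:4, R1 = c(0.5, 1.5))

# Rename only the column currently called "R1"
names(demo)[names(demo) == "R1"] <- "Resp1"
names(demo)   # "A" "B" "Resp1"
```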