People generally have data to read into R. You probably have your data in an Excel spreadsheet. The simplest way to load it into R is to save a copy of the data as a comma separated values (csv) file and work with that.
It is actually possible to read directly from Excel (see the gdata package, which has a read.xls function, and see this page for other alternatives). This is usually more hassle than it’s worth, and going through a comma separated file is easy enough.
To load the data into R:
data <- read.csv("data/seed_root_herbivores.csv")
(this doesn’t usually produce any output — the data is “just there” now).
Clicking the little table icon next to the data in the Workspace browser will view the data. Running View(data) will do the same thing.
The data variable contains a data.frame object. It is a number of columns of the same length, arranged like a matrix. That sentence is tricky, for reasons that will become apparent.
Often, looking at the first few rows is all you need to remind yourself about what is in a data set.
You can get a vector of names of columns:
You can get the number of rows:
and the number of columns:
The length function also returns the number of columns, not the number of rows, which is surprising to most people. There is a logical (if not good) reason for this, which we will get to later.
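As a minimal sketch of these inspection functions, the block below builds a small made-up data.frame and calls each one on it. The toy object and its values are invented for illustration and are not taken from the seed herbivore file.

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Plot   = c("plot-2", "plot-2", "plot-4", "plot-4"),
  Height = c(31, 41, 42, 64),
  Weight = c(4.16, 5.82, 3.51, 7.16)
)

head(toy, 2)   # first two rows
names(toy)     # character vector of column names
nrow(toy)      # number of rows: 4
ncol(toy)      # number of columns: 3
length(toy)    # also 3, the number of columns
```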
Aside from issues around factors and character vectors (that we’ll cover shortly) this is most of what you need to know about loading data.
However, it’s useful to know things about saving it.
Think about the type of the data in each column: is it a TRUE/FALSE, or a category, or a count, or a measurement?
data2 <- data
data2$Height <- data2$Height * 1.1
write.table(data2, "data/seed_root_herbivores.txt", sep=";", row.names=FALSE)
rm(data2)
The file data/seed_root_herbivores.txt has almost the same data, but in semicolon separated format (it does have the same number of rows and columns). Look at the ?read.table help page and work out how to load this file in.
Hint: look at the sep and header arguments of read.table. To compare the result with data, == tests for equality and != tests for inequality.
data2 <- read.table("data/seed_root_herbivores.txt", header=TRUE, sep=";")
data2 == data # or data2 != data
The point here is that many of the functions and operators in R will try to do the Right Thing, depending on what you give them.
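Leaving out header=TRUE won’t work, because the default arguments of read.table and read.csv are different for the header: read.csv defaults to header=TRUE, while read.table treats the first line of this file as data unless you pass header=TRUE.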
tmp <- read.table("data/seed_root_herbivores.txt", sep=";")
head(tmp)
Notice that a fake header (V1, V2, etc) has been created and the actual header is now the first row of data.
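To see the header behaviour without touching the real file, here is a small sketch that writes a made-up data.frame to a temporary file and reads it back three ways. The toy data and the temporary path are assumptions for illustration only.

```r
# Hypothetical toy data written to a temporary file, so no real file is touched.
toy <- data.frame(Plot = c("plot-2", "plot-4"), Height = c(31, 42))
path <- tempfile(fileext = ".txt")
write.table(toy, path, sep = ";", row.names = FALSE)

# With header=TRUE the first line becomes the column names.
with.header <- read.table(path, header = TRUE, sep = ";")
names(with.header)

# Without it, read.table invents V1, V2 and treats the header as data.
no.header <- read.table(path, sep = ";")
names(no.header)
nrow(no.header)   # one more row than the original

# read.csv assumes a header by default, so only sep needs changing.
via.csv <- read.csv(path, sep = ";")
names(via.csv)

unlink(path)      # clean up the temporary file
```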
Looking at your data
There are other ways of looking at your data. The summary function works with most types, and gives a by-column summary of the data set.
So, we see there is an issue in the file. How do we get to it?
There are a bunch of different ways of extracting bits of your data.
Get the column
This does almost the same thing
This is the main difference: if the column name is in a variable, then $ won’t work, while [[ will. Let’s define a variable v that has the name of the first column as its value:
v <- "Plot"
We can extract this column of the data set using the [[ notation (data[[v]]), but using the $ notation won’t work, as data$v will look for the column called v. This returns NULL to indicate that the column does not exist.
Be aware that $ does partial matching on column names: data$P will “expand” to make data$Plot, but data$S will return NULL because that is ambiguous. Always use the full name!
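The difference between $ and [[ is easy to check on a small example. This sketch uses a made-up data.frame (toy and its values are hypothetical) that borrows three of the column names used here, so that the partial matching behaves the same way.

```r
# Hypothetical toy data with two columns that both start with "S".
toy <- data.frame(
  Plot           = c("plot-2", "plot-4"),
  Seed.herbivore = c(TRUE, FALSE),
  Seed.heads     = c(83, 117)
)

v <- "Plot"
toy[[v]]    # looks up the value of v, so this is the Plot column
toy$v       # looks for a column literally called "v": NULL

toy$P       # partial match: only Plot starts with "P"
toy$S       # NULL: both Seed.herbivore and Seed.heads start with "S"
```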
Single square brackets also index the data, but do so differently. This returns a data.frame with one column:
This returns a data.frame with two columns:
(I’m just using head here to keep the output under control. If you actually wanted a data.frame like this you might do
data.sub <- data[c("Plot", "Weight")]
and then continue to use the new data.sub object.)
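Here is a short sketch of what single square brackets return, next to [[ for contrast. The toy data.frame is made up for illustration.

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Plot   = c("plot-2", "plot-4", "plot-6"),
  Height = c(31, 42, 64),
  Weight = c(4.16, 3.51, 7.16)
)

one.col  <- toy["Plot"]                # data.frame with one column
two.cols <- toy[c("Plot", "Weight")]   # data.frame with two columns
plot.vec <- toy[["Plot"]]              # a plain vector, not a data.frame

class(one.col)             # "data.frame"
dim(two.cols)              # 3 rows, 2 columns
is.data.frame(plot.vec)    # FALSE
```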
The difference between [ and [[ can be confusing.
The best explanation I have seen is to imagine that the thing you are subsetting is a train with a bunch of carriages.
[x] returns a new train with the carriages represented by the variable x. So train[c(1,2)] returns a train with just the first two carriages, and train[1] returns a train with just the first carriage. The [[ operator gets the contents of a single carriage. So train[[1]] gets the contents of the first carriage, and train[[c(1,2)]] doesn’t make any sense.
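The train analogy can be tried directly with a list, which indexes the same way as a data.frame. The train object and its contents below are invented for illustration.

```r
# A list standing in for the train; each element is one carriage.
train <- list(first = "coal", second = 1:3, third = c(TRUE, FALSE))

two.carriages <- train[c(1, 2)]   # a shorter train: a list of length 2
one.carriage  <- train[1]         # still a train: a list of length 1
contents      <- train[[1]]       # the contents of the first carriage: "coal"

is.list(one.carriage)   # TRUE
is.list(contents)       # FALSE
```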
Plotting is covered in the next R module, but it’s one of the best things about R so I can’t resist showing how to do it:
Here is a histogram of the height variable:
(it will appear in the bottom right of your screen)
Here is a scatter plot of Height vs weight:
The order of arguments is x-variable, y-variable.
There is an alternative interface using R’s “formulae” (you’ll see this a lot in statistical models with Drew). Read this as “Height is a function of Weight”. It makes nicer axis labels, too.
plot(Height ~ Weight, data)
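If you don’t have the data loaded, the three plotting calls can be tried on a made-up data.frame. This is only a sketch: toy and its values are hypothetical, and the calls are my best guess at the ones described, using base R’s hist and plot.

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Height = c(31, 41, 42, 64, 25, 52),
  Weight = c(4.16, 5.82, 3.51, 7.16, 2.90, 6.25)
)

h <- hist(toy$Height)           # histogram of one variable
plot(toy$Weight, toy$Height)    # x-variable first, then y-variable
plot(Height ~ Weight, toy)      # formula interface: y ~ x
```

The second and third calls draw the same points. The formula version labels the axes Weight and Height rather than toy$Weight and toy$Height.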
Here is a series of bivariate plots for height, weight and the number of seed heads:
pairs(data[c("Height", "Weight", "Seed.heads")])
The take-home is that R makes it very easy to create graphs, and most people who use it casually just make plots of whatever they’re looking at. The plots can vary from quick and dirty like this to really beautiful pieces of art.
Extracting a row always returns a new data.frame:
data[10,]
data[10:20,]
data[c(1, 5, 10),]
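A quick way to convince yourself of this is to check the results on a small made-up data.frame (toy is hypothetical):

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Plot   = c("plot-2", "plot-4", "plot-6", "plot-8"),
  Height = c(31, 42, 64, 25)
)

one.row   <- toy[2, ]          # a data.frame with one row
some.rows <- toy[c(1, 3), ]    # a data.frame with rows 1 and 3
a.range   <- toy[2:4, ]        # a data.frame with rows 2 to 4

is.data.frame(one.row)         # TRUE, even for a single row
rownames(some.rows)            # "1" "3": the original row names are kept
```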
Be careful with indexing by location.
The above all index by name or by location (index). However, you generally want to avoid referencing by number in your saved code, e.g.:
data.height <- data[[5]]
This is because if you change the order of the columns in your spreadsheet (or add or delete a column), everything that depends on data.height may change. You may also see people do this in their code:
data.height <- data[,5]
This should really be avoided. Indexing by name is much more robust and easier to read later on, even if it is more typing at first.
data.height <- data$Height
data.height <- data[["Height"]]
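To make the risk concrete, here is a sketch that simulates someone reordering the columns. The toy data.frame and the new column order are made up for illustration.

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Plot   = c("plot-2", "plot-4"),
  Height = c(31, 42),
  Weight = c(4.16, 3.51)
)

by.position <- toy[[2]]          # Height, but only while it is column 2
by.name     <- toy[["Height"]]

# Simulate someone reordering the spreadsheet columns.
reordered <- toy[c("Plot", "Weight", "Height")]

identical(reordered[[2]], by.position)       # FALSE: column 2 is now Weight
identical(reordered[["Height"]], by.name)    # TRUE: the name still finds Height
```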
When should you index by location?
When you are computing the indices. As an example: suppose that you wanted every other row (perhaps you’re trying to generate a nonrandom sample of your data?). Remember seq from above? We can generate a sequence of integers 1, 3, …, up to the last (or second to last) row in our data set like this:
idx <- seq(1, nrow(data), by=2)
Then subset like this:
data.oddrows <- data[idx,]
Our new data set has half the rows of the old data set:
Because row names are preserved, you can see the odd numbers in the row names.
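The same steps on a made-up data.frame with seven rows (toy is hypothetical) show what happens when the number of rows is odd:

```r
# Hypothetical toy data with an odd number of rows.
toy <- data.frame(
  Plot   = paste0("plot-", 1:7),
  Height = c(31, 41, 42, 64, 25, 52, 38)
)

idx <- seq(1, nrow(toy), by = 2)   # 1, 3, 5, 7
toy.oddrows <- toy[idx, ]

nrow(toy.oddrows)       # 4, because 7 rows do not split evenly
rownames(toy.oddrows)   # "1" "3" "5" "7"
```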
Indexing by a logical vector is one of the most powerful ways of indexing.
Remember our data mismatch:
data != data2
There is one entry in the Height column that disagrees. How can we extract the row that the mismatch is on?
We could do it by index:
But that requires us to look for the error, note the row, write it down, etc. Boring, and computers are less error prone than humans. Plus, I just said that we should not do that.
This is a logical vector that indicates where the entries in vector 1 disagree with vector 2:
data.differ <- data$Height != data2$Height
We can index by this; it will return the rows for which the value is TRUE.
You can convert from a logical (TRUE/FALSE) vector to an integer vector with the which function.
This can be really useful.
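Here is a self-contained sketch of the whole idea on made-up data. The toy and toy2 objects are hypothetical, and the altered row (row 3) is chosen arbitrarily for the example.

```r
# Hypothetical toy data; toy2 is a copy with one altered entry.
toy  <- data.frame(Plot = paste0("plot-", 1:5), Height = c(31, 41, 42, 64, 25))
toy2 <- toy
toy2$Height[3] <- toy2$Height[3] * 1.1   # arbitrary choice: alter row 3

toy.differ <- toy$Height != toy2$Height  # FALSE FALSE TRUE FALSE FALSE
toy[toy.differ, ]                        # the row where the two disagree
which(toy.differ)                        # 3, the position of the TRUE value
toy[!toy.differ, ]                       # every row that agrees
```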
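- Return all the rows in data where both data sets have the same value for Height.
- Return all the rows in data where Plot is "plot-8".
data.same <- data$Height == data2$Height
data[data.same,]
Alternatively, read !x as “not x”, so data[!data.differ,] returns the same rows.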
data[data$Plot == "plot-8",]
Subsetting can be useful when you want to look at bits of your data. For example, all the rows where the Height is greater than 10 and there was a seed herbivore:
data[data$Height > 10 & data$Seed.herbivore,]
The & operator here is a logical “and” (read x & y as “x and y”):
- TRUE & TRUE is TRUE
- TRUE & FALSE is FALSE
- FALSE & TRUE is FALSE
- FALSE & FALSE is FALSE
In contrast, the | operator is a logical “or” (read x | y as “x or y”):
- TRUE | TRUE is TRUE
- TRUE | FALSE is TRUE
- FALSE | TRUE is TRUE
- FALSE | FALSE is FALSE
The other, less common, operator is the exclusive or:
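In base R the exclusive or is the xor function. As a small sketch, the vectors x and y below (made up for illustration) cover all four combinations, so each line prints a full truth table:

```r
# All four combinations of TRUE and FALSE.
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

x & y       # TRUE FALSE FALSE FALSE
x | y       # TRUE  TRUE  TRUE FALSE
xor(x, y)   # FALSE TRUE  TRUE FALSE: exactly one of the two is TRUE
!x          # FALSE FALSE TRUE  TRUE
```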
So you can do all sorts of crazy things like
data[data$Plot == "plot-2" & data$Seed.herbivore & data$Root.herbivore,]
and get all the cases in plot 2 where there were both seed herbivores and root herbivores. Or
data[data$Height > 75 & (data$Seed.herbivore | data$Root.herbivore),]
and get all the plants that are quite tall in treatments with either a seed herbivore or a root herbivore (or both).
You can build these up if you want:
idx.tall <- data$Height > 75
idx.herbivore <- data$Seed.herbivore | data$Root.herbivore
idx.select <- idx.tall & idx.herbivore
data[idx.select,]
whatever you find easiest to read and write.
“Programs must be written for people to read, and only incidentally for machines to execute” (“Structure and Interpretation of Computer Programs” by Abelson and Sussman).
Using the subset function to simplify writing complex subsets
There is a function subset that may help you write complex subsets.
subset(data, Height > 75 & (Seed.herbivore | Root.herbivore))
This can help, especially interactively, but it can also bite you. It is not always obvious where the values of the variables in the second argument are coming from. For example:
subset(data, idx.tall & (Seed.herbivore | Root.herbivore))
This works fine, because it found idx.tall in the workspace rather than in the data.frame. So when you read your code, you need to think carefully about which values are coming from the data.frame and which are coming from elsewhere.
This is an unfortunate example of a function designed to be used by beginners, but that is only really understandable once you understand more of what is going on. You’ll see it used widely, and it can simplify things. But be careful.
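Here is a sketch of how the lookup can surprise you, using a made-up data.frame. The toy object, its values and the stray Height variable are all hypothetical.

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Height         = c(31, 80, 90, 64),
  Seed.herbivore = c(TRUE, FALSE, TRUE, FALSE)
)

# Height and Seed.herbivore are both found inside toy.
subset(toy, Height > 75 & Seed.herbivore)

# A variable outside toy that shares a column name is ignored:
# the column wins, so the value 1000 is never used.
Height <- 1000
nrow(subset(toy, Height > 75))   # 2, not 0

# A name that is not a column is looked up outside the data.frame.
idx.tall <- toy$Height > 75
subset(toy, idx.tall & Seed.herbivore)
```

Both subsets with Seed.herbivore return the same single row, but they get there by different routes, which is exactly what makes this hard to read later.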
It is easy to add new columns, perhaps based on old ones:
data$small.plant <- data$Height < 50
head(data)
You can delete a column by setting it to NULL:
data$small.plant <- NULL
head(data)
In this data set, the last column contains the number of seeds in 25 seed heads. However, there weren’t always 25 seed heads on a plant:
data[data$Seed.heads < 25,]
In these three cases, the column contains the number of seeds over all seed heads.
How do we compute the mean number of seeds per seed head?
data$Seeds.per.head <- data$Seeds.in.25.heads / 25
idx.few.heads <- data$Seed.heads < 25
data$Seeds.per.head[idx.few.heads] <-
  data$Seeds.in.25.heads[idx.few.heads] / data$Seed.heads[idx.few.heads]
R generally offers several ways of doing things:
alternative <- data$Seeds.in.25.heads / pmin(data$Seed.heads, 25)
alternative == data$Seeds.per.head
Use the all function to determine if all values are TRUE:
all(alternative == data$Seeds.per.head)
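The alternative works because pmin takes the minimum element by element, unlike min. A small sketch with made-up counts (the seed.heads and seeds vectors are hypothetical) shows the difference:

```r
# Hypothetical counts: the second plant has fewer than 25 seed heads.
seed.heads <- c(83, 12, 40)
seeds      <- c(500, 180, 750)

divisor  <- pmin(seed.heads, 25)   # elementwise minimum: 25 12 25
per.head <- seeds / divisor        # 20 15 30

min(seed.heads, 25)                # contrast: min collapses to one number, 12
```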
Given this vector with the first five letters of the alphabet:
x <- c("a", "b", "c", "d", "e")
Repeat the first letter once, the second letter twice, etc.
x[c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)]
rep is incredibly useful, and can be used in many ways. See the help page (?rep).
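As one possible answer to the exercise, rep can build the same result without typing out the indices. This is a sketch of one approach, not the only one.

```r
x <- c("a", "b", "c", "d", "e")

by.index <- x[c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)]
by.rep   <- rep(x, times = 1:5)   # element i is repeated i times

identical(by.index, by.rep)       # TRUE

rep(x, times = 2)   # the whole vector twice
rep(x, each = 2)    # each element twice in a row
```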
This material was adapted from Rich FitzJohn’s 2013 Intro to R module.