We've reached the point in learning R where we can now afford to focus on efficiency over "whatever works." A prime example of this is the

**apply**functions, which are powerful tools to quickly analyze data. The functions can find slices of data frames, matrices, or lists and rapidly perform calculations on them. These functions make it simple to perform analyses like finding the variance of all rows of a matrix, or calculating the mean of all individuals that meet conditions X, Y, and Z in your data, or to feed each element of a vector into a complex equation. With these functions in hand, you will have the tools to move beyond introductory knowledge of R and into more specialized or advanced analyses.*[An obligatory comment before we get started: I initially framed this post as a counter to how slow***for loops**are, but to my surprise, for loops were actually consistently faster than**apply.**The catch is that the for loop has to be written well, i.e. with pre-allocated space and avoiding overwriting variables on every iteration. I'll go into detail on this in the next series of R posts: "Random R." Stay tuned.]**This is the fifth in a series of posts on R. This post covers:**

**- apply, sapply, lapply, tapply**(pronounced "apply," "s-apply," "l-apply," and "t-apply")

**- sweep**

Previous posts covered 1) an introduction to R, 2) using R to understand distributions, plotting, and linear regression; 3) for loops and random walks; and 4) functions and if statements. The most important code from those posts is shown below. Following the formatting of the previous entry, I'll have color-coded

*[ ;-) ]*text so it's easier to read:**quoted text**in green;**for****loops**in dark orange;**if**,**else**,**else if**and**function**in goldenrod; and**defined functions**in purple. Code will always be**bolded and blue**(especially if inside paragraphs where I'm explaining something) and comments in**orange**or**red**.**c() # Concatenate**

**x <- c(5, 10) # x is now a two-element vector**

**x[x > 7] # Subset the values of x that are greater than 7**

data <- rnorm(100) # Assign 100 randomly-drawn values from a normal

data <- rnorm(100) # Assign 100 randomly-drawn values from a normal

**# distribution with mean zero and standard deviation 1 to**

**# the object "data"**

**hist(data) # Plot a histogram of the "data" object**

**y <- runif(100) # Assign 100 randomly-drawn values from a uniform**

**# distribution with minimum zero and maximum 1 to the**

**# object "y"**

**plot(y ~ data) # Make a scatterplot of**

*y*as a function of*data*

for(i in 1:length(x)){ # For i in the sequence 1 to the length of x...

for(i in 1:length(x)){ # For i in the sequence 1 to the length of x...

**print(i) # ...print i**

**}**

**our.mean <- function(x){sum(x) / length(x)} # Create a mean function**

**if(1 + 1 < 5){print("Yup, less than 5")} # Print the statement if 1 + 1 < 5 is true**

__________________________________________________________________

**Apply**

The simplest of the apply functions is the one the family is named after. Let's revisit the following plot from the third R post on for loops.

A random walk, if you remember, is a series of random steps. The walk can be in as many dimensions as you want; the ten walks shown above are in one dimension (where there's only one movement axis: here, it's up/down). It seems like the variance in the walks increases with time: the more you move to the right, the more spread out the walks are. How could we quantify this?

In case you missed it, here's how we generated the data up there. But this time, let's run 100 random walks instead of 10.

In case you missed it, here's how we generated the data up there. But this time, let's run 100 random walks instead of 10.

**start <- 0**

mean.move <- 0

sd.move <- 1

n.steps <- 100000

n.indivs <- 100 # The number of random walks you want to run

# Create an empty matrix that we'll fill with the random walk values

WALKS <- matrix(NA, nrow = n.indivs, ncol = n.steps)

# Start all the walks at the origin

WALKS[, 1] <- start

# Perform the random walks

for(i in 2:n.steps){

WALKS[, i] <- WALKS[, i - 1] + # This column looks at the previous column...

rnorm(n.indivs, mean.move, sd.move) # 10 random values are generated, one for each row.

# If you did rnorm(1, ...), you would generate 10 identical random walks

# because the same value would be added to each walk

print(i)

}

mean.move <- 0

sd.move <- 1

n.steps <- 100000

n.indivs <- 100 # The number of random walks you want to run

# Create an empty matrix that we'll fill with the random walk values

WALKS <- matrix(NA, nrow = n.indivs, ncol = n.steps)

# Start all the walks at the origin

WALKS[, 1] <- start

# Perform the random walks

for(i in 2:n.steps){

WALKS[, i] <- WALKS[, i - 1] + # This column looks at the previous column...

rnorm(n.indivs, mean.move, sd.move) # 10 random values are generated, one for each row.

# If you did rnorm(1, ...), you would generate 10 identical random walks

# because the same value would be added to each walk

print(i)

}

We have a 100 x 100,000 matrix, where each row is a random walk and each column is a time step. To show how the variance of the walks changes over time, we want to take the variance of each column. One way to do this would be to create a for loop that iterates over each column. Below,

**WALKS**is our matrix with the random walk data.**walk.variance <- c()**

**for(i in 1:ncol(WALKS)){**

**walk.variance[i] <- var(WALKS[, i])**

**print(i)**

**}**

This works, but there's a simpler way to do this.

*[I'll no longer say there's a faster way to do this; see the note at the start of this post!]*Let's use**apply**, which only requires one line of code.**walk.variance <- apply(WALKS, 2, var)**

**apply**'s inputs are the matrix, whether you want rows

**(apply(..., 1, ...))**or columns

**(apply(..., 2, ...))**, and the function you want to apply to that dimension of the matrix.

*[Note: you can also do this for higher-dimensional arrays, e.g. a 3-dimensional "book" where you want to summarize each "page." That would be apply(..., 3, ...)]*

If we plot that out, we see a nice linear relationship between the variance and how long we've run the random walk.

You're also not limited in what function you run across each column or row (as long as the function can take a vector input). Using what we learned in post #4 of our R series, let's apply a more complicated function to each row of the matrix. This function will find the mean of the six highest values of each random walk, then subtract it from the range. No real reason why. Just flexing a little R muscle!

**# Find the mean of the six highest values of the input, then subtract it from the range**

**fun.times <- function(x){mean(head(x)) - (range(x)[2] - range(x)[1])}**

**lots.of.fun <- apply(WALKS, 1, fun.times)**

**____________________________________________________________**

**sweep**

**apply**is great for performing a calculation on each row or column of a matrix, but what if you want to, say, subtract an

*entire vector*across each column of a matrix? You don't just have one value that's getting subtracted by the whole matrix. You have a bunch of values, and the different rows/columns in the matrix get a different value applied to them. For this, you'll need

**sweep.**

Here's a simple example that doesn't require

**sweep**. You're subtracting one value from the entire matrix.

**WALKS.adjusted <- WALKS - mean(WALKS)**

Here, the new object (

**WALKS.**

**adjusted**) has been centered around the mean of

**WALKS**. This isn't incredibly informative, though, because the mean of

**WALKS**condenses the information from all the random walks

*and*all the time points of those walks into one value.

With

**sweep**, we can be more nuanced. Let's create a new matrix,

**WALKS.adjusted2**, where all of the walks are centered around the mean of all the random walks

*for that particular time point.*Walks that are have particularly high values will have positive values, average walks will be around zero, and walks with low values will have negative values.

**WALKS.adjusted2 <- sweep(WALKS, 2, colMeans(WALKS), "-")**

**____________________________________________________________**

**sapply**

**apply**works great on matrices, but it doesn't work if you want to perform a calculation on every element of a vector, or if you want to perform a calculation on only select columns of a matrix or data frame.

**sapply**is the answer here.

First, a vector example. Say you have a list of probabilities, and you want to simulate outcomes based on those probabilities. You have 50 people, and they each have some probability of marathoning the eight

*Harry Potter*movies back to back in a massive sleepover party with two B-list celebrities of their choosing, friendly law enforcement officers, and children's science fiction

*Animorphs*author K.A. Applegate in attendance.

**# 50 random values drawn from a uniform distribution between 0 and 1**

**probs <- runif(50, 0, 1)**

**# The possible outcomes**

**possibilities <- c("It's marathon time!", "Sorry, I actually have this thing I have to go to instead.")**

**# A function that takes a probability of watching the movie and randomly chooses whether you see it or not based on your probability**

**outcome <- function(x){sample(possibilities, 1, prob = c(x, 1-x))}**

**# The outcomes**

**sapply(probs, FUN = outcome)**

Note that the outcomes will differ a bit every time because you're dealing with

*probabilities*, and so there's an element of randomness with who gets chosen. A probability of 0.5 will give a "marathoning" outcome 50% of the time and a "sorry" outcome the other half, for example.

Great. Now let's say that we're hosting our

*Harry Potter*marathon. It's a great party, Halle Berry and Officer Chuck are enjoying the homemade marshmallow snacks, we're content. But a nagging thought persists:

*could we use*

**sapply**to find the sums of the numeric columns of a data frame that's has columns with numbers, factors, and characters?When we Google those exact words, Google asks if we meant

**lapply**, which makes us uncomfortable because we haven't gotten that far into this blog post yet. Fortunately, we

*can*use

**sapply**to analyze solely the numeric columns.

As an ambitious high school student, we don't have any data to brag to K.A. Applegate about, so we'll just have to use a random number generator to create some.

**M <- data.frame(rnorm(20), # A random variable: height change**

**rnorm(20,2), # A random variable: weight change**

sort(rep(c("A","B"),10)), # The treatment: A or B

sort(rep(c("A","B"),10)), # The treatment: A or B

**rep(1:5, 4)) # The trial number**

**colnames(M) <- c("height_change", "weight_change", "treatment", "trial")**

**# Tell R that the trial column is a category, not numbers**

**M$trial <- as.factor(M$trial)**

If we try to use

**apply**

**(M, 2, sum)**, we'll get an error because not all of the columns are numeric. We can specify which columns we want with

**sapply**.

**num.cols <- sapply(M, is.numeric)**

**sapply(which(num.cols == T), function(x){sum(M[, x])})**

Alternatively, we could also get our result with

**apply(M[1:2, ], 2, sum)**. If we tried just doing

**sum(M[1:2, ])**, it would give us one value: the grand sum of all the values in all columns.

**____________________________________________________________**

**tapply**

While you can use

**sapply**on data frames,

**tapply**is a friendlier tool. Let's say we accidentally spilled salsa all over our phone when the tortilla chip broke mid-conversation with K.A. Applegate. We know we shouldn't have been using our phone as a plate, that was so silly, the creator of

*Animorphs*who inspired millions, possibly billions of children and young adults to pursue biology and never give up against insurmountable odds is suddenly pretending she has to take an important call. Well, at least we still have R.

Reinforce your chips during salsa dipping when in important social situations. |

So here we are, salsa-drenched phone in hand. Let's say we want to conduct an experiment examining how difficult it is to read

*The Headbanging Behaviorist*on phones slathered in different brands, spiciness, colors, and quantities of salsa. We measure difficulty reading in seconds required to quickly scroll through the blog post and then give a "Like" on Facebook. (Side note: it may be necessary to remove values of infinity from the data set.)

With

**tapply**, we can cut up our data frame to look at

*just*the effects of Tostitos at all quantities, or

*just*"Very Hot" Green Mountain Gringo. We can apply summary statistics to see if there's an effect of red versus clear salsa when your phone is so buried in salsa the screen has gone black ( > 3.0 kg).

**As you'll see,**

**tapply**is fantastic for data frames because it can "read" characters within the data frame, as opposed to

**apply**or

**sapply**which can only look at the column or row level.

**# Generate the data by replicating the character inputs (rep) and then scrambling them (sample)**

**our.salsas <- sample(rep(c("Tostitos", "Green Mountain Gringo", "Newman's Own", "Goya", "Old El Paso"), 200))**

**spiciness <- sample(rep(c("Mild", "Medium", "Hot", "Very Hot!"), 250))**

**colors <- sample(rep(c("Red", "Green", "Clear"), 250))**

**quantities <- sample(rep(c("0 - 1.0 kg", "1.0 - 2.0 kg", "2.0 - 3.0 kg", "> 3.0 kg"), 250))**

**time_to_read <- rpois(1000, 2)**

the.data <- data.frame(our.salsas, spiciness, colors, quantities, time_to_read)

# Convert every variable (except time_to_read) to factors

our.salsas <- as.factor(our.salsas)

spiciness <- as.factor(spiciness)

colors <- as.factor(colors)

quantities <- as.factor(quantities)

the.data <- data.frame(our.salsas, spiciness, colors, quantities, time_to_read)

# Convert every variable (except time_to_read) to factors

our.salsas <- as.factor(our.salsas)

spiciness <- as.factor(spiciness)

colors <- as.factor(colors)

quantities <- as.factor(quantities)

Now let's use

**tapply**to find the mean

*time to read*column as a function of the

*spiciness*column.

**with(M, tapply(time_to_read, spiciness, mean))**

To keep the code clean, I've used the

**with()**command. Otherwise, because

**time_to_read**and

**spiciness**aren't global variables (they're columns within the variable

**M**); we'd have to write

**the.data$time_to_read**and

**the.data$spiciness**to call them. Just a bit cleaner this way.

The first argument in

**tapply**is the numeric data column you're interested in. The second argument is how you want to cross-section those data, and the third argument is the calculation you're performing.

By editing the second argument a bit, we can have

**tapply**provide us a more nuanced answer:

**with(M, tapply(time_to_read, list(spiciness, color_of_salsa), mean))**

With the code above, we can see the mean time to read as a function of the spiciness and color of our salsa.

**__________________________________________________________**

**lapply**

The last function I'll mention is

**lapply.**With

**lapply**, you can perform a calculation on every object in a list. Lists are vectors containing objects that can be vectors themselves as well, or matrices, or character strings, etc. Here's an example of a list:

**opinions <- c("I loved it!", "It was ok...", "Nah.")**

**random.numbers.with.no.real.purpose <- rnorm(10)**

**the.matrix <- matrix(runif(100, 0, 1), ncol = 5, nrow = 10)**

**our.list <- list(opinions, random.numbers.with.no.real.purpose, the.matrix)**

When you run

**our.list**, you'll see the three objects, one after the other. If you had some function that could operate on the three objects in the list, you could use

**lapply**on them. But I really think we need some closure on our

*Harry Potter*viewing party, so let's go back there.

In an effort to get on the front page of /r/dataisbeautiful, we decide to collect some data at our party. Every 30 minutes, we poll the audience on how much fun they're having. (It'd be solid 10/10's throughout the night, but play along and imagine there's some variation.) With these data, we build a matrix where each row is a party participant and each column is a time point (3:10am, 3:20am, etc.). The data are how much fun person i is having at time point j.

**have.fun <- matrix(round(runif(5000, 0, 10)), ncol = 100, nrow = 50)**

What if we want to know what moments each person was having at least 9/10 fun? We could just write

**which(have.fun > 9)**, but that would convert

**have.fun**into a vector and then give us the indices of the vector where the value is greater than 9. So for our 100 x 50 matrix, we'd get answers like "9" and "738" and "2111," which don't have any identifying information on who was happy when.

This sounds like a job for

**apply**, so we run this:

**passed.threshold <- apply(have.fun, 1, function(x){which(x > 9)})**

We now have a list, where each object is a person, and the values are the time points at which they scored at least a 9/10 on having fun. However, we can now use

**lapply**to get more nuanced information from

**passed.threshold**.

**# Find the first time point that passed the threshold**

**lapply(passed.threshold, min)**

**# Get a summary of the time points for each person that passed the threshold**

**lapply(passed.threshold, summary)**

Note that these commands return their contents as a list. If you'd prefer they were returned as a vector, we can use

**sapply**instead.

**sapply(passed.threshold, min)**

**__________________________________________________________**

Thanks for reading the "Introduction to R" series! Next up will be "Random R," where I'll take on random projects (like quantifying the computation time of

**for loops**versus

**apply**). Stay tuned.

Cheers,

-Matt

Salsa photo: drodd.com

## No comments:

## Post a Comment