Thursday, July 13, 2017

Random R: for loops vs. apply

Welcome to the first Random R post, where we ask random programming questions and use R to figure them out. In this post we'll look at the computational efficiency of for loops versus the apply function. 

For loops are a way of executing a command again and again, with one (or more) variable value changed each time. Let's say we have data on a bunch of people for food-related spending over one week. (And that yes, some days they spent only $0.11 and they never spent more than $10. Bear with me.) The first four columns of our matrix look like this:

If we want to know the average amount each person spent that week, we can take the mean of every column. If our matrix is called our.matrix, we could type out: 

mean(our.matrix[, 1])
mean(our.matrix[, 2])
mean(our.matrix[, 3])
mean(our.matrix[, 4])

But a simpler way to do this, especially if you have a lot of columns, would be to iterate through the columns with a for loop, like this:

for(i in 1:ncol(our.matrix)){
    mean(our.matrix[, i])

Here, the value of "i" is being replaced with the number 1, 2, 3, etc. up to the matrix's number of columns. 

Conceptually, this is pretty easy to understand. For loops are straightforward and for many applications, they're an excellent way to get stuff done. (We'll use nested loops in the next section, for example.) But one of the first things a beginning coder might hear about for loops is that as soon as you get the hang of them, it's a good idea to move onto more elegant functions that are simpler and faster. (This DataCamp intro to loops post, for example, mentions alternatives in the same sentence it introduces for loops!) One such "elegant" command is called apply. 

apply(our.matrix, 2, mean)

The apply command above does the same thing as the two preceding chunks of code, but it's much shorter and considered better practice to use. We're making our final script length shorter and the code above is easier to read. But what about what's going on under the hood? Does your computer find the above command actually faster to compute? Let's find out.

At its core, our methods will involve creating a matrix, performing a calculation on every column in the matrix with either a for loop or apply, and timing how long the process took. We'll then compare the computation times, which will be our measure of efficiency. Before we get started, however, we'll address three additional points:

1. Examine the role of matrix size
If there are differences in computational efficiency between for loops and apply, the differences will be more pronounced for larger matrices. In other words, if you want to know if you or your girlfriend is a better endurance runner, you'll get a more accurate answer if you run a marathon than if you race to that plate of Nachos on the table. (Save yourself an argument about fast-twitch versus slow-twitch muscle and just agree that watching Michael Phelps YouTube videos basically counts as exercise anyway.) So as we pit our competing methods against each other, we want to give them a task that will maximize the difference in their effectiveness.

But the nice thing about coding is that it's often really easy to get a more nuanced answer to our question with just a few more lines of code. So let's ask how the difference in effectiveness between for loops and apply changes with the size of the matrix. Maybe there's effectively no difference until you're dealing with matrices the size of what Facebook knows about your personal life, or maybe there are differences in efficiency right from the start. To look at this, we'll keep the number of columns constant at 1,000 but we'll vary the size of each column from 2 rows to 1,000.

2. Vary how difficult the computation is
Maybe our results will depend on what computation, exactly, we're performing on our matrices. We'll use a simple computation (just finding the mean) and a more complicated one (finding the mean of the six smallest values, which requires sorting and subsetting too).  

3. Minimize the role of chance
At the core of statistics is that there are innumerable random forces swaying the results in any data collection we perform. Maybe that bird preening itself just had an itch and isn't actually exhibiting some complex self recognition. Maybe the person misread the survey question and doesn't actually think the capital of Montana is the sun. One way we address this randomness we can't control for is through replication. We give a survey to lots of people; we look at lots of birds. One sun-lover is probably a mistake, but if everyone in the survey thought the sun was the capital, then we need to sit down and reevaluate how Montana is being portrayed in the media. So in our code, we'll run our simulation 10 times to account for randomness within our computer's processing time.

[The code to carry out these ideas is at the bottom of the post. Otherwise, let's look at some figures.]

1. Simple calculation
When we just want to know the mean of each column of a matrix, it turns out that for loops are actually faster than apply. 

In the figure above, the gray and red colors correspond to for loops and apply, respectively. The lines are the mean values for 10 runs, and the shaded regions are the means + 1 standard error. (If the envelopes don't overlap, we can be fairly confident that the differences we see are statistically significant.)

The y-axis corresponds to the time it took to perform the calculation. Lower values means the calculation was faster. As you move to the right on the x-axis, you're seeing the computational time for increasingly larger matrices. We see the gray and red lines diverge because the differences in computational efficiency grow larger as we give R an increasingly difficult task. For relatively small matrices, though, there's essentially no difference in computational efficiency between the two methods.

In summary, we have evidence that for loops outcompete apply for a simple task on matrices above roughly 300,000 cells. Below that size, though (which is pretty big), you won't see any difference in how fast the calculation is run. Even for a matrix with a million data points, an average laptop can find the mean of each column in less than 1/20 of a second.

2. More complex calculation
Here, we're asking R to find the mean of the six lowest values of each column of a matrix. This requires R to first order the column values from smallest to largest, then pay attention to only the smallest six values, then take the mean of those values. According to our figure, we find any differences between for loops and apply to be negligible.

Maybe for incredibly large matrices, for loops are slightly more efficient, as seen by the shaded regions not overlapping at the far right of the figure. However, we should still be careful in interpreting this figure; while the error envelopes don't overlap for some matrix sizes (e.g. ~720k cells, 1 million), they do for other matrix sizes (e.g. 600k, 825k), meaning the differences we're seeing are likely due to chance. 

Ten replicates of our simulation is clearly not enough to reject the idea that there's any difference between for loops and the apply function here. Let's rerun the code with more replicates. To save time, we'll focus on small and large matrices and skip intermediate sizes.

With the additional clarity of 50 replicates, we can see that for very small matrices, apply appears to outperform for loops. Any differences quickly become negligible as we increase matrix size. For large matrices, we see a fairly consistent superiority of for loops when the matrices are greater than 900,000 cells, meaning our original intuition seems to hold. This indicates that for loops are more efficient at performing this complicated function than apply, but only when the matrices are huge. Meanwhile, apply might actually be stronger for relatively small matrices, which many users are more likely to be dealing with.

So we see that for loops often have a slight edge over apply in terms of speed, especially for simple calculations or for huge matrices. Cool, but this leaves us with a few questions.

1. Why are for loops more efficient?
The results above came as a surprise to me because I'd always heard about how inefficient for loops could be. It turns out I'd missed the bit that should follow that statement: for loops can be inefficient, but not if you allocate your memory well. I won't write the code here, but a massively inefficient way to write a for loop is to have a variable that you overwrite with every iteration. That basically forces R to recall the result of every previous iteration, for every iteration. If you look at the code at the end of the post, you'll see that we instead created empty matrices whose cells were filled with values. 

So we didn't mess up, but that doesn't fully explain why for loops are faster. According to this thread on Stack Exchange, a for loop can outcompete apply here because for loops use the vectorized indexes of a matrix, while apply converts each row into a vector, performs the calculations, and then has to convert the results back into an output. 

[For those of you who are really into the technical details, lapply is an exception and is actually faster than for loops because more of its calculations are brought down into the language C and performed there.]

2. If for loops are generally faster, why bother with apply?
The figures in this post have shown that apply is usually a bit slower than for loops. But does a difference of 0.01 seconds, or even 0.1 seconds, matter? If you're performing an evolutionary simulation on huge populations for thousands of generations, and you're doing this multiple times with different model parameters... then sure, maybe those 0.1 seconds add up. But the clarity in reading apply commands is far more important, I would argue. If you're sharing your code with a collaborator, or writing code you might read again in the future, you want to be as clear as possible. Concise code is better than a sprawling mess.

Thanks for reading. If you have any suggestions for a fun R project, shoot me an e-mail.


Code for this post
[Note: I'm just including the code for the complicated calculation. For the simple calculation, replace mean(head(sort(data[, k]))) with mean(data[,k]) in the nested for loop.]

# Before we start
  replicates <- 10                                 # Higher numbers take longer but decrease noise

  n.rows <- seq(2, 1000, by = 10)  # The range in matrix sizes

  loop.times <- matrix(NA, nrow = replicates, ncol = length(n.rows))
  apply.times <- matrix(NA, nrow = replicates, ncol = length(n.rows))

# Run the analysis
for(i in 1:replicates){
   for(j in 1:length(n.rows)){
      data <- matrix(rnorm(1000 * n.rows[j]), ncol = 1000, nrow = n.rows[j])
      # Method 1: for loop
      start.time <- Sys.time()
      for(k in 1:ncol(data)){
        mean(head(sort(data[, k])))
      loop.times[i, j] <- difftime(Sys.time(), start.time)

      # Method 2: apply
      start.time <- Sys.time()
      apply(data, 2, function(x){mean(head(sort(x)))})
      apply.times[i, j] <- difftime(Sys.time(), start.time)

    print(paste0("Replicate ", i, "| iteration ", j))

# Prepare for plotting

size <- values * 1000

# Get the mean times and SE for loops versus apply
m.loop <- apply(loop.times, 2, mean)
se.loop <- apply(loop.times, 2, sd) / sqrt(replicates)

m.apply <- apply(apply.times, 2, mean)
se.apply <- apply(apply.times, 2, sd) / sqrt(replicates)

# Create the error envelopes. Note that the x-coordinate is the same
coord.x <- c(size, size[length(size)], rev(size))
coord.y.loop <- c(m.loop + se.loop, m.loop[length(m.loop)] - se.loop[length(se.loop)], rev(m.loop - se.loop))

coord.y.apply <- c(m.apply + se.apply, m.apply[length(m.apply)] - se.apply[length(se.apply)], rev(m.apply - se.apply))  

# Plot it
# 1st: plot the mean loop times

plot(size, m.loop, type = 'l', ylim = c(min(m.loop - se.loop), max(m.apply + se.apply)),
     main = "Time required to find mean of 6 lowest values\nin each column of a matrix", xlab = "Size of matrix (N cells)", ylab = "Time (s)",
     cex.main = 1.5, cex.lab = 1.2, font.lab = 2)

# Add the error envelope
polygon(coord.x, coord.y.loop, col = "gray80", border = NA)
lines(size, m.loop)

# Add the mean apply times
lines(size, m.apply, col = "red")
polygon(coord.x, c.y.apply, col = rgb(1, 0.1, 0.1, 0.2), border = NA)

# Add the legend
par(font = 2)
legend("topleft", col = 1:2, pch = 19, cex = 0.8, c("For loop", "Apply"), bty = 'n')

Wednesday, February 1, 2017

Introduction to R: 5 - The apply functions

We've reached the point in learning R where we can now afford to focus on efficiency over "whatever works." A prime example of this is the apply functions, which are powerful tools to quickly analyze data. The functions can find slices of data frames, matrices, or lists and rapidly perform calculations on them. These functions make it simple to perform analyses like finding the variance of all rows of a matrix, or calculating the mean of all individuals that meet conditions X, Y, and Z in your data, or to feed each element of a vector into a complex equation. With these functions in hand, you will have the tools to move beyond introductory knowledge of R and into more specialized or advanced analyses.

[An obligatory comment before we get started: I initially framed this post as a counter to how slow for loops are, but to my surprise, for loops were actually consistently faster than apply. The catch is that the for loop has to be written well, i.e. with pre-allocated space and avoiding overwriting variables on every iteration. I'll go into detail on this in the next series of R posts: "Random R." Stay tuned.]

This is the fifth in a series of posts on R. This post covers:
- apply, sapply, lapply, tapply (pronounced "apply," "s-apply," "l-apply," and "t-apply")
- sweep

Previous posts covered 1) an introduction to R, 2) using R to understand distributions, plotting, and linear regression; 3) for loops and random walks; and 4) functions and if statements. The most important code from those posts is shown below. Following the formatting of the previous entry, I'll have color-coded [ ;-) ] text so it's easier to read: quoted text in green; for loops in dark orange; if, else, else if and function in goldenrod; and defined functions in purple. Code will always be bolded and blue (especially if inside paragraphs where I'm explaining something) and comments in orange or red.

     c()                                     # Concatenate
     x <- c(5, 10)                   # x is now a two-element vector
     x[x > 7]                            # Subset the values of x that are greater than 7

    data <- rnorm(100)     # Assign 100 randomly-drawn values from a normal 

                                                # distribution with mean zero and standard deviation 1 to  
                                                # the object "data"
    hist(data)                        # Plot a histogram of the "data" object
    y <- runif(100)              # Assign 100 randomly-drawn values from a uniform 
                                                # distribution with minimum zero and maximum 1 to the 
                                                # object "y"   

    plot(y ~ data)                 # Make a scatterplot of y as a function of data

    for(i in 1:length(x)){    # For i in the sequence 1 to the length of x...

        print(i)                          # ...print i

    our.mean <- function(x){sum(x) / length(x)} # Create a mean function 
    if(1 + 1 < 5){print("Yup, less than 5")}  # Print the statement if 1 + 1 < 5 is true


The simplest of the apply functions is the one the family is named after. Let's revisit the following plot from the third R post on for loops

A random walk, if you remember, is a series of random steps. The walk can be in as many dimensions as you want; the ten walks shown above are in one dimension (where there's only one movement axis: here, it's up/down). It seems like the variance in the walks increases with time: the more you move to the right, the more spread out the walks are. How could we quantify this?

In case you missed it, here's how we generated the data up there. But this time, let's run 100 random walks instead of 10.

     start <- 0
     mean.move <- 0
     sd.move <- 1
     n.steps <- 100000
     n.indivs <- 100             # The number of random walks you want to run

     # Create an empty matrix that we'll fill with the random walk values
     WALKS <- matrix(NA, nrow = n.indivs, ncol = n.steps)
     # Start all the walks at the origin
     WALKS[, 1] <- start

     # Perform the random walks
     for(i in 2:n.steps){
        WALKS[, i] <- WALKS[, i - 1] +                        # This column looks at the previous column...

        rnorm(n.indivs, mean.move, sd.move)  # 10 random values are generated, one for each row.
                                                                                        # If you did rnorm(1, ...), you would generate 10 identical random walks
                                                                                       # because the same value would be added to each walk

We have a 100 x 100,000 matrix, where each row is a random walk and each column is a time step. To show how the variance of the walks changes over time, we want to take the variance of each column. One way to do this would be to create a for loop that iterates over each column. Below, WALKS is our matrix with the random walk data.

walk.variance <- c()
for(i in 1:ncol(WALKS)){
    walk.variance[i] <- var(WALKS[, i])

This works, but there's a simpler way to do this. [I'll no longer say there's a faster way to do this; see the note at the start of this post!] Let's use apply, which only requires one line of code.

walk.variance <- apply(WALKS, 2, var)

apply's inputs are the matrix, whether you want rows (apply(..., 1, ...)) or columns (apply(..., 2, ...)), and the function you want to apply to that dimension of the matrix. [Note: you can also do this for higher-dimensional arrays, e.g. a 3-dimensional "book" where you want to summarize each "page." That would be apply(..., 3, ...)

If we plot that out, we see a nice linear relationship between the variance and how long we've run the random walk.

You're also not limited in what function you run across each column or row (as long as the function can take a vector input). Using what we learned in post #4 of our R series, let's apply a more complicated function to each row of the matrix. This function will find the mean of the six highest values of each random walk, then subtract it from the range. No real reason why. Just flexing a little R muscle!

# Find the mean of the six highest values of the input, then subtract it from the range
fun.times <- function(x){mean(head(x)) - (range(x)[2] - range(x)[1])} <- apply(WALKS, 1, fun.times) 

apply is great for performing a calculation on each row or column of a matrix, but what if you want to, say, subtract an entire vector across each column of a matrix? You don't just have one value that's getting subtracted by the whole matrix. You have a bunch of values, and the different rows/columns in the matrix get a different value applied to them. For this, you'll need sweep.

Here's a simple example that doesn't require sweep. You're subtracting one value from the entire matrix.

WALKS.adjusted <- WALKS - mean(WALKS)

Here, the new object (WALKS.adjusted) has been centered around the mean of WALKS. This isn't incredibly informative, though, because the mean of WALKS condenses the information from all the random walks and all the time points of those walks into one value. 

With sweep, we can be more nuanced. Let's create a new matrix, WALKS.adjusted2, where all of the walks are centered around the mean of all the random walks for that particular time point. Walks that are have particularly high values will have positive values, average walks will be around zero, and walks with low values will have negative values.

WALKS.adjusted2 <- sweep(WALKS, 2, colMeans(WALKS), "-") 


apply works great on matrices, but it doesn't work if you want to perform a calculation on every element of a vector, or if you want to perform a calculation on only select columns of a matrix or data frame. sapply is the answer here.

First, a vector example. Say you have a list of probabilities, and you want to simulate outcomes based on those probabilities. You have 50 people, and they each have some probability of marathoning the eight Harry Potter movies back to back in a massive sleepover party with two B-list celebrities of their choosing, friendly law enforcement officers, and children's science fiction Animorphs author K.A. Applegate in attendance. 

# 50 random values drawn from a uniform distribution between 0 and 1
probs <- runif(50, 0, 1)

# The possible outcomes
possibilities <- c("It's marathon time!", "Sorry, I actually have this thing I have to go to instead.")
# A function that takes a probability of watching the movie and randomly chooses whether you see it or not based on your probability 
outcome <- function(x){sample(possibilities, 1, prob = c(x, 1-x))}

# The outcomes 
sapply(probs, FUN = outcome)

Note that the outcomes will differ a bit every time because you're dealing with probabilities, and so there's an element of randomness with who gets chosen. A probability of 0.5 will give a "marathoning" outcome 50% of the time and a "sorry" outcome the other half, for example.
Great. Now let's say that we're hosting our Harry Potter marathon. It's a great party, Halle Berry and Officer Chuck are enjoying the homemade marshmallow snacks, we're content. But a nagging thought persists: could we use sapply to find the sums of the numeric columns of a data frame that's has columns with numbers, factors, and characters?  

When we Google those exact words, Google asks if we meant lapply, which makes us uncomfortable because we haven't gotten that far into this blog post yet. Fortunately, we can use sapply to analyze solely the numeric columns.

As an ambitious high school student, we don't have any data to brag to K.A. Applegate about, so we'll just have to use a random number generator to create some. 

M <- data.frame(rnorm(20),     # A random variable: height change
                                 rnorm(20,2),   # A random variable: weight change
                                 sort(rep(c("A","B"),10)), # The treatment: A or B

                                 rep(1:5, 4))                            # The trial number
colnames(M) <- c("height_change", "weight_change", "treatment", "trial")

# Tell R that the trial column is a category, not numbers 
M$trial <- as.factor(M$trial)

If we try to use apply(M, 2, sum), we'll get an error because not all of the columns are numeric. We can specify which columns we want with sapply

num.cols <- sapply(M, is.numeric)

sapply(which(num.cols == T), function(x){sum(M[, x])})

Alternatively, we could also get our result with apply(M[1:2, ], 2, sum). If we tried just doing sum(M[1:2, ]), it would give us one value: the grand sum of all the values in all columns. 


While you can use sapply on data frames, tapply is a friendlier tool. Let's say we accidentally spilled salsa all over our phone when the tortilla chip broke mid-conversation with K.A. Applegate. We know we shouldn't have been using our phone as a plate, that was so silly, the creator of Animorphs who inspired millions, possibly billions of children and young adults to pursue biology and never give up against insurmountable odds is suddenly pretending she has to take an important call. Well, at least we still have R.
Reinforce your chips during salsa
dipping when in important
social situations.

So here we are, salsa-drenched phone in hand. Let's say we want to conduct an experiment examining how difficult it is to read The Headbanging Behaviorist on phones slathered in different brands, spiciness, colors, and quantities of salsa. We measure difficulty reading in seconds required to quickly scroll through the blog post and then give a "Like" on Facebook. (Side note: it may be necessary to remove values of infinity from the data set.)

With tapply, we can cut up our data frame to look at just the effects of Tostitos at all quantities, or just "Very Hot" Green Mountain Gringo. We can apply summary statistics to see if there's an effect of red versus clear salsa when your phone is so buried in salsa the screen has gone black ( > 3.0 kg). As you'll see, tapply is fantastic for data frames because it can "read" characters within the data frame, as opposed to apply or sapply which can only look at the column or row level.

# Generate the data by replicating the character inputs (rep) and then scrambling them (sample)
our.salsas <- sample(rep(c("Tostitos", "Green Mountain Gringo", "Newman's Own", "Goya", "Old El Paso"), 200))
spiciness <- sample(rep(c("Mild", "Medium", "Hot", "Very Hot!"), 250))
colors <- sample(rep(c("Red", "Green", "Clear"), 250))
quantities <- sample(rep(c("0 - 1.0 kg", "1.0 - 2.0 kg", "2.0 - 3.0 kg", "> 3.0 kg"), 250))  
time_to_read <- rpois(1000, 2) <- data.frame(our.salsas, spiciness, colors, quantities, time_to_read)

# Convert every variable (except time_to_read) to factors
our.salsas <- as.factor(our.salsas)
spiciness <- as.factor(spiciness)
colors <- as.factor(colors)
quantities <- as.factor(quantities)

Now let's use tapply to find the mean time to read column as a function of the spiciness column.

with(M, tapply(time_to_read, spiciness, mean)) 

To keep the code clean, I've used the with() command. Otherwise, because time_to_read and spiciness aren't global variables (they're columns within the variable M); we'd have to write$time_to_read and$spiciness to call them. Just a bit cleaner this way.

The first argument in tapply is the numeric data column you're interested in. The second argument is how you want to cross-section those data, and the third argument is the calculation you're performing. 

By editing the second argument a bit, we can have tapply provide us a more nuanced answer:

with(M, tapply(time_to_read, list(spiciness, color_of_salsa), mean))

With the code above, we can see the mean time to read as a function of the spiciness and color of our salsa.


The last function I'll mention is lapply. With lapply, you can perform a calculation on every object in a list. Lists are vectors containing objects that can be vectors themselves as well, or matrices, or character strings, etc. Here's an example of a list:

opinions <- c("I loved it!", "It was ok...", "Nah.") <- rnorm(10)
the.matrix <- matrix(runif(100, 0, 1), ncol = 5, nrow = 10)

our.list <- list(opinions,, the.matrix) 

When you run our.list, you'll see the three objects, one after the other. If you had some function that could operate on the three objects in the list, you could use lapply on them. But I really think we need some closure on our Harry Potter viewing party, so let's go back there.

In an effort to get on the front page of /r/dataisbeautiful, we decide to collect some data at our party. Every 30 minutes, we poll the audience on how much fun they're having. (It'd be solid 10/10's throughout the night, but play along and imagine there's some variation.) With these data, we build a matrix where each row is a party participant and each column is a time point (3:10am, 3:20am, etc.). The data are how much fun person i is having at time point j. <- matrix(round(runif(5000, 0, 10)), ncol = 100, nrow = 50)
What if we want to know what moments each person was having at least 9/10 fun?  We could just write which( > 9), but that would convert into a vector and then give us the indices of the vector where the value is greater than 9. So for our 100 x 50 matrix, we'd get answers like "9" and "738" and "2111," which don't have any identifying information on who was happy when.

This sounds like a job for apply, so we run this:

passed.threshold <- apply(, 1, function(x){which(x > 9)})

We now have a list, where each object is a person, and the values are the time points at which they scored at least a 9/10 on having fun. However, we can now use lapply to get more nuanced information from passed.threshold.

# Find the first time point that passed the threshold
lapply(passed.threshold, min)

# Get a summary of the time points for each person that passed the threshold
lapply(passed.threshold, summary)   

Note that these commands return their contents as a list. If you'd prefer they were returned as a vector, we can use sapply instead. 

sapply(passed.threshold, min)  

Thanks for reading the "Introduction to R" series! Next up will be "Random R," where I'll take on random projects (like quantifying the computation time of for loops versus apply). Stay tuned.


Salsa photo: