Sunday, April 24, 2016

Introduction to R: 3 - For loops and random walks

When I was a teenager, I didn't mind extremely repetitive, mind-numbing labor. To document my musical tastes, every few months I'd gather data from my iTunes library, some of which involved manually counting the number of songs that I had listened to 0, 1, 2, 3, up to 6 times so I could create a histogram in Excel. I loved office work like stapling papers for hours, and pipetting was one of my favorite aspects of the evolutionary development lab I briefly worked in during college. (Side note: my favorite aspect was feeding the opossums, which made me realize I loved animal behavior.)

While I still don't mind the sort of work where you turn off your brain and do repetitive work (*cough cough* fixing fish trajectories by hand...), there are some points where you just can't resort to human willpower to get through an analysis. Fortunately, computers are fantastic for work that would otherwise be menial and soul-draining. They don't get bored as far as we can tell, they're much faster, and they don't make mistakes if the code is correct.

This is the third in series of posts on R. This post covers:
- For loops
- Random walks

The first post was an introduction to R, and the second was on using R to understand distributions, plotting, and linear regression. The most important code from those posts is listed below.

     c()                              # "Concatenate," or combine. This is how you tell R that there are
                                        # multiple elements.

     x <- c(5, 10)           # x is now a two-element vector of the numbers 5 and 10
     x[x > 7]                    # Subset. This shows you only the elements of x that are greater than 7

     data <- rnorm(100) # Draw 100 random values from a normal distribution with mean = 0 

                                                          # and standard deviation = 1.
# Plot a histogram of the data object
     plot(y ~ x)                   # Plot a scatterplot of y as a function of x

For loops
For loops are basically a way to iterate, or to run a command repeatedly. (Think of the word "reiterate," for example: "I had to reiterate to my cat to not use my Gmail without permission. It's the third time this week. I just don't think he's old enough to use the internet unsupervised.")

Here's the simplest for loop possible
     for(i in 1:10){


It reads "for i in the sequence 1 to 10, print i." It's the same as if you wrote this: 


A for loop finds wherever you placed an "i" within the loop and changes that to the ith value in the sequence you give it (here it's 1:10). The first value in the sequence above is 1, the second is 2, etc. 

The sequence you give the for loop doesn't have to be linear. You could write whatever bunch of numbers you want. Writing this...
     bubbles <- c(1, -80, 4, NA, 100, -Inf, 0)
for(i in bubbles){

... is the same as writing this:

You'll also get the same thing if you tell R that you want the ith value of bubbles. This is a subtle distinction. Instead of looking at the actual numbers in bubbles, it will go through the 1st value, then the 2nd, then the 3rd, until it reaches the end. I'd recommend doing it this way in the future.

     for(i in 1:length(bubbles)){  # "For i in 1 to 7 (the number of values in bubbles)"
         print(bubbles[i])                  # "Print the ith value of bubbles."

The above code is the same thing as writing:

Combine for loops with the paste() command for character values

All of the values in bubbles above are numbers, which makes things simple. It's only slightly more challenging if you have a bunch of words, or you're trying to iterate a sentence.

Let's say you want to type out "Today is April 1," Today is April 2," "Today is April 3," etc., all the way up to "Today is April 30." Maybe you really love April or something. In R, you could literally write:

     "Today is April 1"
     "Today is April 2"
     "Today is April 3"
     "Today is April 4"

... and so on. That would be 30 lines of code. Manageable, but not great. Alternatively, you could use a for loop combined with the paste() command.
     for(i in 1:30){
        print(paste("Today is April", i))

If you wrote print("Today is April i"), it will say "Today is April i" instead of replacing the i with whatever number you're on. paste() lets you input values into character strings. For example:
     names <- c("Matt", "George", "Andre")
     for(i in 1:length(names)){
       paste("My name is", names[i])

Real-world example
I work with animal tracking data in my research. I'll film experiments involving fish swimming in a tank. These videos then get fed through tracking software that identifies the positions and orientations of all individuals. These data are in the form of four matrices: the x- and y-coordinates for each fish for each frame, as well as the x- and y-components of the unit vector of their heading. Each row of a matrix is one individual, and each column is a time point.

From these simple matrices, I can calculate more interesting measurements like the speed and distance to nearest neighbor for every fish. Below is the code for calculating the speed time series for each individual.

# First create an empty matrix with the same dimensions as the other matrices 
# - Here, we just use the x-coordinate matrix, but any of the others would work, too.
     s <- matrix(NA, ncol = ncol(xs), nrow = nrow(xs))

# Calculate how far a fish moved between time points. Iterate over all fish
     for(i in 1:nrow(xs)){                                    # For each individual
        s[i, ] <- c(NA,                                               # No speed at first time point
        sqrt(diff(xs[i, ])^2 - diff(ys[i, ])^2))  # Pythagorean theorem
        print(i)                                                           # Say what iteration we're on

# Plot Individual 6's trajectory from time point 1000 to 2000
    focal.indiv <- 6
    time.start <- 1000
    time.end <- 2000

    plot(s[focal.indiv, time.start:time.end],
             main = paste("Individual ", focal.indiv, " speed from t = ", time.start,
             " to t = ", time.end, sep = "")

Random walks
For loops are ideal for when you need to iterate a computation. A simple example of this is a random walk. A random walk is a series of random steps. It's a time series that serves as a null expectation: "what should we expect to see if literally nothing but noise is happening?" It's a baseline that you can then compare to stock market prices, animal movement, gas molecule paths, and more.

Here, these peaks and valleys aren't caused by any extrinsic factor; they're just random fluctuations. The size of these fluctuations will depend on how much noise you add into the random walk: 

And finally, remember that "random" means it's not going to be the same every time.

There are two steps to building a random walk in R: first, you set the starting point and the rules for the steps. Then, you run a for loop to take each step. Below is the code to build a simple one-dimensional random walk like the plots above.

# Parameters
     start <- 0                 # The starting point of the random walk
     mean.move <- 0   # For each step, is there a bias to move in a certain direction?  

                                        # A positive value means the random walk will tend to move 
                                        # upwards, a negative value will make it tend downward, and 
                                        # zero is unbiased.
     sd.move <- 1          # The standard deviation of the step size. The higher this is, 

                                         # the more the random walk will bounce around.
     t <- 100000             # The number of steps you want the walk to take overall

# Begin the walk at the starting point 
      walk <- start

# Run the random walk
      for(i in 2:t){                                             # Start at 2 because we already have our  

                                                                           # starting point
         walk[i] <- walk[i - 1] +                      # Take the previous position...

         rnorm(1, mean.move, sd.move)   # ... and add a random value
         print(i)                                                   # Tell us what iteration we're on

Finally, to visualize our results, we use the plot() command.  
     plot(walk, type = 'l')

In the above walk, we use a normal distribution as the source of our step sizes. It means that most of our steps will be concentrated around mean.move, with large step sizes (in either direction) occurring less frequently. We can easily use a different distribution like a long-tailed distribution (e.g. for a Lévy flight) or a Poisson. The plot below shows such a skewed step size distribution and what a two-dimensional random walk with steps drawn from this distribution looks like. The code to do this is at the end of this post. [There's a bit more to mention with the direction of travel as well as the step sizes, but I'll skip the details here.]

Don't use for loops :-)
The first thing you learn about for loops is what they are. The second thing you learn about for loops is that they're incredibly computationally inefficient. While computers are very good at thinking about many things simultaneously, for loops force your computer to focus on one thing at a time. This is usually not a big deal: for small computations like the ones listed above, we rarely encounter a situation where for loops won't work well.

For loops become a burden when you're doing thousands of calculations and speed is essential. Imagine an evolutionary model where you're simulating predator-prey population dynamics under various levels of resource availability. You'll want to run the model for long enough to see what the evolutionary trajectory looks like, so maybe you're looking at 100,000 generations. You'll also want to run the model several times to see whether the populations always reach the same outcome, though, or if something weird like a premature random extinction occurred. 

Your computer will have to perform all the calculations you give it for a generation, for every generation, for every run of the model... for every combination of parameter values. You'll need to repeat this process for every level of resource availability you care about, e.g. 5 patches, 10 patches, 15 patches, and 5 units of food / patch, 10 units/patch, 20, etc. Without exaggerating, your model could literally require hundreds of years of computational time to complete if the code isn't efficient.


That's why R blog post #5 will be on the apply() functions, which are one way to avoid for loops! I'll first cover function(), though, which is a fantastic tool in R, especially when combined with apply().

Code for plots:

First random walk
     start <- 0                      # Starting position
     mean.move <- 0        # The step distribution will be unbiased, i.e. centered at zero
     sd.move <- 1                # The standard deviation of step sizes
     n.steps <- 100000     # The number of steps
     walk <- start               # Start the walk at the origin

     for(i in 2:n.steps){
        walk[i] <- walk[i - 1] +                      # This step is the previous step plus...
        rnorm(1, mean.move, sd.move)  # A randomly drawn step from a normal distribution with
                                                                          # the mean and SD we specified above
        print(i)                                                   # Print the iteration we're on

     plot(walk, type = 'l', col = "darkcyan", lwd = 2, xlab = "Time", ylab = "Position", las = 1,
          main = "A random walk", cex.main = 1.5, cex.lab = 1.2, font.lab = 2)

Random walks with differing noise
     start <- 0
     n.steps <- 1e5

     mean.step <- 0
     sd.step1 <- 0.01
     sd.step2 <- 0.5

     # Start both walks at zero
     walk1 <- walk2 <- start

     for(i in 2:n.steps){
        walk1[i] <- walk1[i - 1] + rnorm(1, mean.step, sd.step1)
        walk2[i] <- walk2[i - 1] + rnorm(1, mean.step, sd.step2)

     # Plot it

     par(mfrow = c(1,2))
     par(oma = c(0,0,2,0))    # Create space at the top of the window for a grand title

     plot(walk1, type = 'l', col = "dodgerblue", main = paste("SD =", sd.step1), font.lab = 2,
          xlab = "Time", ylab = "Position", ylim = c(min(walk2), max(walk2)))
     plot(walk2, type = 'l', col = "dodgerblue", main = paste("SD =", sd.step2), font.lab = 2,
          xlab = "Time", ylab = "Position", ylim = c(min(walk2), max(walk2)))
     mtext(outer = T, "Random walks with differing noise", cex = 2, font = 2)

     par(oma = array(0, 4)) # Make your graphics window look normal again
Multiple random walks
     start <- 0
     mean.move <- 0
     sd.move <- 1
     n.steps <- 100000
     n.indivs <- 10             # The number of random walks you want to run

     # Create an empty matrix that we'll fill with the random walk values
     WALK <- matrix(NA, nrow = n.indivs, ncol = n.steps)
     # Start all the walks at the origin
     WALK[, 1] <- start

     # Perform the random walks
     for(i in 2:n.steps){
        WALK[, i] <- WALK[, i - 1] +                        # This column looks at the previous column...

        rnorm(n.indivs, mean.move, sd.move)  # 10 random values are generated, one for each row.
                                                                                        # If you did rnorm(1, ...), you would generate 10 identical random walks
                                                                                       # because the same value would be added to each walk

# First create an empty plot, then add each walk with a for loop

     plot(NA, ylim = c(min(WALK), max(WALK)), xlim = c(0, n.steps), cex.main = 1.5, cex.lab = 1.2,
        xlab = "Time", ylab = "Position", las = 1, font.lab = 2, main = "10 random walks")

     for(i in 1:nrow(WALK)){
       lines(WALK[i,], col = rainbow(n.indivs)[i])

Two-dimensional random walk
# Create the distribution of step sizes. A beta distribution works well for something strongly skewed
     step.dist <- rbeta(10000, 0.1, 1) # Very likely to take small step; unlikely to take big step

# Set the parameters for the (Gaussian) turning angle distribution

     mean.turn <- 0
     sd.turn <- pi/4

# Starting positions

     xs <- ys <- 0
     BO <- runif(1, -pi, pi)
    # Could also start at zero. Here: random orientation chosen

# Run it
     for(i in 2:t){   
         # Update the heading and step size
         BO[i] <- BO[i - 1] + rnorm(1, mean.turn, sd.turn)
        step.size <- sample(step.dist, 1)

        # Move
       xs[i] <- xs[i - 1] + cos(BO[i]) * step.size   # cosine gives you the x-component of the unit vector
        ys[i] <- ys[i - 1] + sin(BO[i]) * step.size    # sine gives you the y-component

No comments:

Post a Comment