Sunday, May 15, 2016

Introduction to R: 4 - Functions and if statements

A critical step in growing up is learning to think for yourself. With so much conflicting information available from polarized media outlets, non-interacting academic schools of thought, and corporations with various shades of biased self-interest, the definition of the "right thing to do" or the "truth" is rarely clear or may not even objectively exist. Being able to think critically and independently is therefore a crucial skill to develop for academia, industry, and yourself.

Anyway, in this post you'll follow along with my code and let me do the thinking for you, as we put off being a free thinker for another day. But in a baby step towards independent thought, in this post you'll learn how to create your own functions and if statements, as opposed to relying on code that someone in the past already wrote for you. So let's get started with reading code that someone in the past already wrote for you. ;-)

This is the fourth in a series of posts on R. This post covers:
- Creating your own functions
- If statements
- (Briefly: logical operators)
- Integrating functions and if statements into for loops

Previous posts included 1) an introduction to R, 2) using R to understand distributions, plotting, and linear regression; and 3) for loops and random walks. The most important code from those posts is shown below. Because we're beginning to cover more complicated code, from this post onward, I'll be highlighting quoted text in green; for loops in dark orange; if, else, else if and function in goldenrod; and defined functions in purple. Code will always be bolded and blue (especially if inside paragraphs where I'm explaining something) and comments in orange or red. (Don't worry about memorizing this. It's more to allow your brain to chunk the code into discrete segments when you look at it.)

     c()                                     # Concatenate
     x <- c(5, 10)                   # x is now a two-element vector
     x[x > 7]                            # Subset the values of x that are greater than 7

    data <- rnorm(100)     # Assign 100 randomly-drawn values from a normal 

                                                # distribution with mean zero and standard deviation 1 to  
                                                # the object "data"
    hist(data)                        # Plot a histogram of the data
    y <- runif(100)              # Assign 100 randomly-drawn values from a uniform 
                                                # distribution with minimum zero and maximum 1 to the 
                                                # object "y"   

    plot(y ~ data)                 # Make a scatterplot of y as a function of data

    for(i in 1:length(x)){    # For i in the sequence 1 to the length of x...

        print(i)                          # ...print i
    }
__________________________________________________________________ Creating functions

A function is basically a box with some machinery inside. You give a function an input, the box rumbles around for a bit, and then it returns an output. Here's an example of a simple function:

     f(x) = x + 1

Here, f(x) is the output and x is the input. If your input was 5, the function would look like this:

     f(x) = x + 1
     f(5) = 5 + 1
     f(5) = 6
 
In R, we can give this function a name and then call on it later.

      add.function <- function(x){x + 1}

add.function is the name of our function. function is the command in R that says "I'm going to make a function." The stuff inside the curly brackets is the function we wrote. Similarly to how for loops find wherever you wrote an "i" and replace it with a number, your function will find wherever you wrote "x" and replace it with the value or vector you feed into it. (You also don't have to use "x." function(N){N + 1} is identical to our add.function.)

What's nice about writing a function is that now you can just say add.function(5) or add.function(1:1000) and R will run your inputs (here: "5" and "1 through 1000") through that equation for you. And if you ever forget the code that went into creating this function, just type add.function and R will print the full machinery.

Our own mean function
Here's something slightly more complicated. R has a built-in function for finding the mean of a vector or matrix, but we can easily write our own version.

     our.mean <- function(x){sum(x) / length(x)}

We can check it by creating a vector with some values and comparing the outputs from R's mean and our own.

     values <- seq(from = 4, to = 6, by = 0.5)
     our.mean(values)
     mean(values)   

More advanced mechanics
You are essentially unhindered in what you make into a function. You can have multiple inputs that interact with each other, you can specify default values, you can nest functions within each other... let your imagination go nuts.

     multiply <- function(x, y){x * y}
     multiply(5, 1:5)

     mega.multiply <- function(x, y){multiply(x, y) * x * y}
     mega.multiply(1e5, 1e-5)

     super.duper <- function(x, y, z = 1){ (x + y) / z }  

In super.duper, the x and y inputs are required, but z is optional. If you don't include it, R will automatically set z = 1. Conversely, if you write function(x, y, z) and then don't specify z when you call the function (e.g. super.duper(1, 1)), R will give you an error message.

__________________________________________________________________
If statements

Let's say you want R to run a command only if a certain condition is met. You're curious if 1 + 1 is less than 5, for example, because you skipped that day of kindergarten and you're finally ready to address this gap in your knowledge... using R. 

     if(1 + 1 < 5){print("Yup, less than 5")}

R will print the statement only if 1+1 is less than 5. To avoid spoilers, I won't post the output from R; it's on you to figure this one out! ;-)

We can easily turn this if statement into a function that checks whether any input is less than 5:

     less.than.five <- function(x){if(x < 5){return("Yup, less than 5")}}
     less.than.five(3)
     less.than.five(10) 

(We use return here instead of print for a subtle distinction that I'm not sure I fully understand... both will take an output from the internal environment of the function (whatever is inside the curly brackets) and bring it into the global environment for you to see. Functions in R have hidden return commands that give you the result of the last calculation performed by the function; mean, for example, doesn't show you the calculations it does to find the average, it just gives you the average. Specifically writing return is useful in case there are multiple calculations required but you care about a particular one. print just announces its contents if triggered, which caused me some trouble in the for loop we'll go through at the end of the post. Here's some more info from a Code Academy forum on some Python thing called PygLatin.)

Note that nothing happens when the input to less.than.five is greater or equal to five. That's because we didn't tell R what to do if this situation occurs, so it does nothing. (Computationally, this can be pretty useful in a script, like if you want to make a change to a variable if a condition is met but to just move on if the condition isn't met.) Here, a little extra code will address what happens when x > 5:

     less.than.five <- function(x){if(x < 5){
            return("Yup, less than 5")} else if (x == 5) {
            return("No, it's 5")} else {
            return("No, greater than 5")}}  
 
In English, this reads as:
1 - If x < 5, return "Yup, less than 5."
2 - If x is not less than 5, check if x = 5. If it does, return "No, it's 5."
3 - If neither of the previous conditions were met, return "No, greater than 5."

The else if statement is basically another if statement, but it also incorporates the previous if statement. This isn't crucial here, but it'd be important if, for example, you had one if statement looking for "anything less than 10" and another looking for "anything greater than 5." The values 5 through 9 would trigger both if statements, which could be a problem if they have conflicting actions. Here, else if is saying "if the previous if statement wasn't met, check if x equals 5." Finally, the else tells R that if neither of the previous if statements were met, follow this action.

In point #2, we use x == 5  because that notation asks R if the statement is true or false. If we wrote x = 5, we'd be assigning the value 5 to x, which causes an error within the if statement. You can use the logical operators <, >, <=, >=, ==, |, &, and ! to ask R if a statement is true. 

     x == 4                                       # "x equals 4, true or false?"
     x != 4                                        # "x does not equal 4, true or false?"
     x > mega.multiply(10, 4)  # "x is greater than the output of this function, true or 
                                                         # false?"
     x < 4 | x > 2                             # "x is less than 4 OR x is greater than 2, true or false?"
     x < 4 & x > 2                           # "x is less than 4 AND x is greater than 2, true or 
                                                        # false?"   

__________________________________________________________________
Integrating function, if, and for loops

In the previous blog post, we talked about for loops and random walks. Much like a frustrating cumulative exam in school, let's incorporate everything we've learned to answer a tougher question: how do we write a function that finds the median?

No, using R's built-in median function won't cut it in this tutorial. Let's write our own. (median does, however, serve as a nice check to see whether our function is doing what we think it is.) 

As a brief review of summary statistics, when we have a set of numbers, we often want a description of the "average" value of the set, or what we would expect if we drew a random value. Common descriptors are the mean, median, and mode. The median is the value that divides the set such that half of the values are below the median, and half are above. The median doesn't care about the actual values in the set; it will just find the middle-ranked one. It's therefore insensitive to outliers that would otherwise wreck the output from the mean. (If Bill Gates was visiting a high school classroom, the mean income of the people in the classroom would be millions of dollars, while the median would correctly be closer to zero.)

There are a few steps to finding the median. Let's write them here to clarify the plan of attack. We start with a set of values. Then: 

1 - Sort the values from lowest to highest
  
2a - If there are an odd number of values, the middle value (where half the values lie below this point and half the values lie above) is the median 

2b - If there are an even number of values, find the middle two values. Take their mean. This value is the median.

This is how the above would look like in R:

     med <- function(x){
        if(length(x) %% 2 == 1){                              # "If the number of values is odd"
           return(sort(x)[length(x) / 2 + 1/2]) }
        if(length(x) %% 2 == 0){                             # "If the number of values is even"
           return(mean(sort(x)[c(length(x) / 2, length(x) / 2 + 1)])) }
      }  

 A more detailed explanation:

     med <- function(x){                # med will be a function that takes one input

     # Number of values is odd  
     if(length(x) %% 2 == 1){   # If the length of the input, divided by 2, leaves a 
                                                        # remainder of 1
        return(                                 # Return whatever will result from inside these 
                                                        # parentheses
        sort(x)[                                # Sort the values of x from lowest to highest and 
                                                        # find the value that...
        length(x)/2 + 1/2])}         # ...is 1/2 + the length of the input divided by 2 (i.e. 
                                                        # the middle)

     # Number of values is even
     if(length(x) %% 2 == 0){   # If the length of the input, divided by 2, leaves a 
                                                         # remainder of 0
         return(mean(                    # Return the mean of...
         sort(x)[c(length(x) / 2,  # ...the sorted values of x, specifically the 
                                                         # "low" middle value...
         length(x)/2 + 1)]))}          #... and the "high" middle value


Combining all that with a for loop
Nice work! Finally, let's create a one-dimensional random walk that uses if statements and our median function. Let's say we want to impose boundaries on our random walk such that if it wanders too far from the origin, we restart it at the median of the walk so far.

     # Set the parameters
     n.steps <- 1e5                 # 100,000 steps (1.0 x 10^5)
     mean.step <- 0              # The random walk is unbiased (no preference up or 
                                                 # down)
     sd.step <-0.5                  # The standard deviation of step sizes (steps are drawn
                                                 # from a normal distribution)

     up.boundary <- 50       # The upper boundary of our walk
     down.boundary <- -50 # The lower boundary

     # Prepare variables
     walk <- 0                         # Start the walk at zero
     shifts <- c()                    # Create an empty variable to note when the walk hits
                                                # boundary

     # The random walk 
     for(i in 2:n.steps){
        # First, take a step
        walk[i] <- walk[i-1] + rnorm(1, mean.step, sd.step)
        # If the walk went above the upper boundary or below the lower boundary...
         if(walk[i] > up.boundary | walk[i] < down.boundary){ 
            walk[i] <- med(walk)                 # The walk resets at the median value of the 
                                                                         # walk up to this point
            shifts[i] <- i                                    # We also note the step at which this occurred
        }
         print(i)                                                # Tell us what iteration of the walk we're on
    } 

     # Plot it
     plot(walk, type = 'l', main = "A supervised random walk", las = 1, xlab =  
         "Time", ylab = "Position", cex.main = 1.8, cex.lab = 1.4, font.lab = 2, col =  
         "gray40")
     abline(v = shifts, col = "blue")
     par(font = 2)
     legend("topleft", lty = 1, col = "blue", legend = "Shifts", cex = 0.8)

Thanks for reading. In the next post, we'll cover the apply functions, which are very powerful methods for performing many calculations at once and efficiently analyzing matrices and data frames.

 Cheers,
-Matt