A critical step in growing up is learning to think for yourself. With so much conflicting information available from polarized media outlets, non-interacting academic schools of thought, and corporations with various shades of biased self-interest, the definition of the "right thing to do" or the "truth" is rarely clear or may not even objectively exist. Being able to think critically and independently is therefore a crucial skill to develop for academia, industry, and yourself.

Anyway, in this post you'll follow along with my code and let me do the thinking for you, as we put off being a free thinker for another day. But in a baby step towards independent thought, in this post you'll learn how to create

*your own*functions and if statements, as opposed to relying on code that someone in the past already wrote for you. So let's get started with reading code that someone in the past already wrote for you. ;-)**This is the fourth in a series of posts on R. This post covers:**

**- Creating your own functions**

**- If statements**

**-**

*(Briefly: logical operators)***- Integrating functions and if statements into for loops**

Previous posts included 1) an introduction to R, 2) using R to understand distributions, plotting, and linear regression; and 3) for loops and random walks. The most important code from those posts is shown below. Because we're beginning to cover more complicated code, from this post onward, I'll be highlighting

**quoted text**in green;**for****loops**in dark orange;**if**,**else**,**else if**and**function**in goldenrod; and**defined functions**in purple. Code will always be**bolded and blue**(especially if inside paragraphs where I'm explaining something) and comments in**orange**or**red**. (Don't worry about memorizing this. It's more to allow your brain to chunk the code into discrete segments when you look at it.)**c() # Concatenate**

**x <- c(5, 10) # x is now a two-element vector**

**x[x > 7] # Subset the values of x that are greater than 7**

data <- rnorm(100) # Assign 100 randomly-drawn values from a normal

data <- rnorm(100) # Assign 100 randomly-drawn values from a normal

**# distribution with mean zero and standard deviation 1 to**

**# the object "data"**

**hist(data) # Plot a histogram of the data**

**y <- runif(100) # Assign 100 randomly-drawn values from a uniform**

**# distribution with minimum zero and maximum 1 to the**

**# object "y"**

**plot(y ~ data) # Make a scatterplot of**

*y*as a function of*data*

for(i in 1:length(x)){ # For i in the sequence 1 to the length of x...

for(i in 1:length(x)){ # For i in the sequence 1 to the length of x...

**print(i) # ...print i**

**}**

__________________________________________________________________

**Creating functions**

A function is basically a box with some machinery inside. You give a function an input, the box rumbles around for a bit, and then it returns an output. Here's an example of a simple function:

**f(x) = x + 1**

Here, f(x) is the output and x is the input. If your input was 5, the function would look like this:

In R, we can give this function a name and then call on it later.

**f(x) = x + 1****f(5) = 5 + 1****f(5) = 6**In R, we can give this function a name and then call on it later.

**add.function <- function(x){x + 1}****add.function**is the name of our function.

**function**is the command in R that says "I'm going to make a function." The stuff inside the curly brackets is the function we wrote. Similarly to how

**for loops**find wherever you wrote an "i" and replace it with a number, your function will find wherever you wrote "x" and replace it with the value or vector you feed into it. (You also don't have to use "x."

**function(N){N + 1}**is identical to our

**add.function**.)

What's nice about writing a function is that now you can just say

Here's something slightly more complicated. R has a built-in function for finding the mean of a vector or matrix, but we can easily write our own version.

We can check it by creating a vector with some values and comparing the outputs from R's mean and our own.

You are essentially unhindered in what you make into a function. You can have multiple inputs that interact with each other, you can specify default values, you can nest functions within each other... let your imagination go nuts.

In

__________________________________________________________________

Let's say you want R to run a command only if a certain condition is met. You're curious if 1 + 1 is less than 5, for example, because you skipped that day of kindergarten and you're finally ready to address this gap in your knowledge... using R.

R will print the statement only if 1+1 is less than 5. To avoid spoilers, I won't post the output from R; it's on you to figure this one out! ;-)

We can easily turn this

Note that nothing happens when the input to

**add.function(5)**or**add.function(1:1000)**and R will run your inputs (here: "5" and "1 through 1000") through that equation for you. And if you ever forget the code that went into creating this function, just type**add.function**and R will print the full machinery.**Our own mean function**Here's something slightly more complicated. R has a built-in function for finding the mean of a vector or matrix, but we can easily write our own version.

**our.mean <- function(x){sum(x) / length(x)}**We can check it by creating a vector with some values and comparing the outputs from R's mean and our own.

**values <- seq(from = 4, to = 6, by = 0.5)****our.mean(values)****mean(values)****More advanced mechanics**You are essentially unhindered in what you make into a function. You can have multiple inputs that interact with each other, you can specify default values, you can nest functions within each other... let your imagination go nuts.

**multiply <- function(x, y){x * y}****multiply(5, 1:5)**

**mega.multiply <- function(x, y){multiply(x, y) * x * y}****mega.multiply(1e5, 1e-5)**

**super.duper <- function(x, y, z = 1){ (x + y) / z }**In

**super.duper**, the x and y inputs are required, but z is optional. If you don't include it, R will automatically set z = 1. Conversely, if you write**function(x, y, z)**and then don't specify z when you call the function (e.g.**super.duper(1, 1)**), R will give you an error message.__________________________________________________________________

**If statements**Let's say you want R to run a command only if a certain condition is met. You're curious if 1 + 1 is less than 5, for example, because you skipped that day of kindergarten and you're finally ready to address this gap in your knowledge... using R.

**if(1 + 1 < 5){print("Yup, less than 5")}**R will print the statement only if 1+1 is less than 5. To avoid spoilers, I won't post the output from R; it's on you to figure this one out! ;-)

We can easily turn this

**if**statement into a function that checks whether any input is less than 5:**less.than.five <- function(x){if(x < 5){return("Yup, less than 5")}}****less.than.five(3)****less.than.five(10)***(We use***return**here*instead of***print***for a subtle distinction that I'm not sure I fully understand...***both will take an output from***the*__internal environment__of the function (whatever is inside the curly brackets) and bring it into the__global environment__for you to see*. Functions in R have hidden***print**just announces its contents if triggered, which caused me some trouble in the**return**commands that give you the result of the last calculation performed by the function;**mean**, for example, doesn't show you the calculations it does to find the average, it just gives you the average. Specifically writing**return**is useful in case there are multiple calculations required but you care about a particular one.**for loop**we'll go through at the end of the post.*Here's some more info from a Code Academy forum on some Python thing called PygLatin.)*Note that nothing happens when the input to

**less.than.five**is greater or equal to five. That's because we didn't tell R what to do if this situation occurs, so it does nothing. (Computationally, this can be pretty useful in a script, like if you want to make a change to a variable if a condition is met but to just move on if the condition isn't met.) Here, a little extra code will address what happens when x__>__5:**less.than.five <- function(x){if(x < 5){****return("Yup, less than 5")} else if (x == 5) {****return("No, it's 5")} else {****return("No, greater than 5")}}**
In English, this reads as:

1 - If x < 5, return "Yup, less than 5."

2 - If x is not less than 5, check if x = 5. If it does, return "No, it's 5."

3 - If neither of the previous conditions were met, return "No, greater than 5."

The

In point #2, we use x == 5 because that notation asks R if the statement is true or false. If we wrote x = 5, we'd be assigning the value 5 to x, which causes an error within the

__________________________________________________________________

In the previous blog post, we talked about for loops and random walks. Much like a frustrating cumulative exam in school, let's incorporate everything we've learned to answer a tougher question: how do we write a function that finds the median?

No, using R's built-in

As a brief review of summary statistics, when we have a set of numbers, we often want a description of the "average" value of the set, or what we would expect if we drew a random value. Common descriptors are the mean, median, and mode. The median is the value that divides the set such that half of the values are below the median, and half are above. The median doesn't care about the actual values in the set; it will just find the middle-ranked one. It's therefore insensitive to outliers that would otherwise wreck the output from the mean. (If Bill Gates was visiting a high school classroom, the mean income of the people in the classroom would be millions of dollars, while the median would correctly be closer to zero.)

There are a few steps to finding the median. Let's write them here to clarify the plan of attack. We start with a set of values. Then:

1 - Sort the values from lowest to highest

2a - If there are an

2b - If there are an

This is how the above would look like in R:

A more detailed explanation:

The

**else****if**statement is basically another**if**statement, but it also incorporates the previous**if**statement. This isn't crucial here, but it'd be important if, for example, you had one**i****f**statement looking for "anything less than 10" and another looking for "anything greater than 5." The values 5 through 9 would trigger both**if**statements, which could be a problem if they have conflicting actions. Here,**else if**is saying "if the previous**if**statement wasn't met, check if x equals 5." Finally, the**else**tells R that if neither of the previous**if**statements were met, follow this action.In point #2, we use x == 5 because that notation asks R if the statement is true or false. If we wrote x = 5, we'd be assigning the value 5 to x, which causes an error within the

**if**statement. You can use the**logical operators**<, >, <=, >=, ==, |, &, and ! to ask R if a statement is true.**x == 4 # "x equals 4, true or false?"****x != 4 # "x does***not*equal 4, true or false?"**x > mega.multiply(10, 4) # "x is greater than the output of this function, true or****# false?"****x < 4 | x > 2 # "x is less than 4 OR x is greater than 2, true or false?"****x < 4 & x > 2 # "x is less than 4 AND x is greater than 2, true or****# false?"**__________________________________________________________________

**Integrating function, if, and for loops**In the previous blog post, we talked about for loops and random walks. Much like a frustrating cumulative exam in school, let's incorporate everything we've learned to answer a tougher question: how do we write a function that finds the median?

No, using R's built-in

**median**function won't cut it in this tutorial. Let's write our own. (**median***does*, however, serve as a nice check to see whether our function is doing what we think it is.)As a brief review of summary statistics, when we have a set of numbers, we often want a description of the "average" value of the set, or what we would expect if we drew a random value. Common descriptors are the mean, median, and mode. The median is the value that divides the set such that half of the values are below the median, and half are above. The median doesn't care about the actual values in the set; it will just find the middle-ranked one. It's therefore insensitive to outliers that would otherwise wreck the output from the mean. (If Bill Gates was visiting a high school classroom, the mean income of the people in the classroom would be millions of dollars, while the median would correctly be closer to zero.)

There are a few steps to finding the median. Let's write them here to clarify the plan of attack. We start with a set of values. Then:

1 - Sort the values from lowest to highest

2a - If there are an

**odd**number of values, the middle value (where half the values lie below this point and half the values lie above) is the median2b - If there are an

**even**number of values, find the middle two values. Take their mean. This value is the median.This is how the above would look like in R:

**med <- function(x){****if(length(x) %% 2 == 1){ # "If the number of values is odd"****return(sort(x)[length(x) / 2 + 1/2]) }****if(length(x) %% 2 == 0){ # "If the number of values is even"****return(mean(sort(x)[c(length(x) / 2, length(x) / 2 + 1)])) }****}**A more detailed explanation:

**med <- function(x){ # med will be a function that takes one input**

**# Number of values is odd**

**if(length(x) %% 2 == 1){ # If the length of the input, divided by 2, leaves a**

**# remainder of 1**

**return( # Return whatever will result from inside these**

**# parentheses**

**sort(x)[ # Sort the values of x from lowest to highest and**

**# find the value that...**

**length(x)/2 + 1/2])} # ...is 1/2 + the length of the input divided by 2 (i.e.**

**# the middle)**

**# Number of values is even**

**if(length(x) %% 2 == 0){ # If the length of the input, divided by 2, leaves a**

**# remainder of 0**

**return(mean( # Return the mean of...**

**sort(x)[c(length(x) / 2, # ...the sorted values of x, specifically the**

**# "low" middle value...**

**length(x)/2 + 1)]))} #... and the "high" middle value**

**Combining all that with a for loop**Nice work! Finally, let's create a one-dimensional random walk that uses

**if**statements and our median function. Let's say we want to impose boundaries on our random walk such that if it wanders too far from the origin, we restart it at the median of the walk so far.

**# Set the parameters**

**n.steps <- 1e5 # 100,000 steps (1.0 x 10^5)**

**mean.step <- 0 # The random walk is unbiased (no preference up or**

**# down)**

**sd.step <-0.5 # The standard deviation of step sizes (steps are drawn**

**# from a normal distribution)**

**up.boundary <- 50 # The upper boundary of our walk**

**down.boundary <- -50 # The lower boundary**

**# Prepare variables**

**walk <- 0 # Start the walk at zero**

**shifts <- c() # Create an empty variable to note when the walk hits a**

**# boundary**

**# The random walk**

**for(i in 2:n.steps){**

**# First, take a step**

**walk[i] <- walk[i-1] + rnorm(1, mean.step, sd.step)**

**# If the walk went above the upper boundary or below the lower boundary...**

**if(walk[i] > up.boundary | walk[i] < down.boundary){**

**walk[i] <- med(walk) # The walk resets at the median value of the**

**# walk up to this point**

**shifts[i] <- i # We also note the step at which this occurred**

**}**

**print(i) # Tell us what iteration of the walk we're on**

**}**

**# Plot it**

**plot(walk, type = 'l', main = "A supervised random walk", las = 1, xlab =**

**"Time", ylab = "Position", cex.main = 1.8, cex.lab = 1.4, font.lab = 2, col =**

**"gray40")**

**abline(v = shifts, col = "blue")**

**par(font = 2)**

**legend("topleft", lty = 1, col = "blue", legend = "Shifts", cex = 0.8)**

Thanks for reading. In the next post, we'll cover the

**apply**functions, which are very powerful methods for performing many calculations at once and efficiently analyzing matrices and data frames.

Cheers,

-Matt

## No comments:

## Post a Comment