Tuesday, December 15, 2015

Introduction to R: 1 - What is R?

Hi everyone,

I've gradually fallen in love with the programming language R and I use it every day for PhD work. However, it took me forever to get started with the language. What do you mean you just type commands into a blinking cursor? Where are the buttons? How do you even load your data into the program? Why can't I see my data? Everything was so much easier in Excel!

For the occasional metalhead who wants to learn to program, or the ambitious high school student who somehow found this blog instead of R Bloggers, here's a baby steps, 1 MPH introduction to R. This post is for people with little to no programming experience: I'll explain what R is and some very simple introductory commands to run in it. In later posts, I'll explain how to get over the learning curve in R so you can find ways to easily teach yourself.

This is the first post in a series on R. This post answers:
- What is R?
- How do I load an Excel file into R?
- How do I find the answers to my questions about R?

What is R? Why not Excel?
R is a programming language. Programming allows you to talk more directly to your computer and have it run any command you can imagine. If you want your computer to do a certain statistical analysis, for example, you don't need to wait until an engineer at Microsoft decides to include it in the next version of Excel. (For some trickier research-based questions, you might be waiting a long time.) Through R, you can tell your computer to do exactly what you have in mind.

By talking to your computer in a language closer to what it normally thinks in, you can ask it to do things that would just be too hard to do by hand or even in Excel. Let's say, for example, that you're a fruit aficionado and want to know how much various fruits cost, on average.


This one's pretty easy to do by hand, and probably even easier in Excel. The mean of every row here would tell you how much the fruit cost, on average, over three days.

That's great, but let's say you want to ask a tougher question: how much does it matter where you buy your fruit from? Do some stores vary in the prices of their fruit more than others? You ride your environmentally-friendly bike to a few stores every day, suspiciously jot down fruit prices on a notepad, and leave without ever buying anything. You end up with a table like the one below:


You analyze this and discover that Fruits'R'us is the clear winner for affordable produce, though if you're not into surprises when you weigh your oranges, you should shop at Food Palace.

However, you're not satisfied. You quit your job to bike around the U.S. and record fruit prices in stores in all 48 contiguous states. You dedicate two years to this project and go through several notebooks and pens as you collect your data. You finally return home, go to your parents' basement and turn on their computer, wait for Windows 95 to finish booting up, and finally begin inputting your data into Excel.

Figure 1. An apple.
Unlike during your travels, you hit a roadblock. Just from looking at the data, fruit clearly costs more in the winter, but only in some states. The prices seem to fluctuate from day to day, but it's hard to tell if the price changes are the same size and the same for all fruit. What's worse, it's hard to even get a general idea of the data because you have 11,000 price measurements for 3,650 stores over 730 days. The Excel file itself takes a few minutes to scroll through.

Thankfully, it's almost the holidays so you ask your parents for R for Christmas. Your mom, a wealthy diplomat, buys you a gaming computer. You thank her, download R from the internet, and start running analyses on your data.

In all seriousness, R - and programming in general - lets you efficiently run analyses that would otherwise be just too hard. In R, for example, you can quickly and easily ask your computer to answer the following questions:
- What is the variance in pear prices in Texas in the summer months versus winter months?
- What is the distribution of apple prices in Food Palace on holidays?
- How spatially correlated are fruit prices? How correlated are those prices over time?
- If we forecast pear prices into the future, how will Illinois compare to Delaware in 2017?

And perhaps most importantly, it's easy to make beautiful graphs to visualize your computer's responses. I'll talk about that in the next blog post.

Do I have to pay for R? How do I get it?
R is free. You can download it from the R website. I infinitely recommend using RStudio, a working environment for R. It organizes your R windows (so plots are always in the same place, for example), and makes the code easier to work with.


Why is R free?
I don't know. Maybe it's part of that "knowledge is more important than money" mentality academics have. It might simply be because it's more efficient to collaborate and exchange information with other researchers if everyone uses the same programming language. If you're a researcher with a limited budget and there are two equally good programming languages, you'll probably start using the one that's free. At any rate, it's very cool that R developers have made it a priority for R to be accessible to anyone who wants it.

I took a class in college that used MATLAB, so I'll just keep using that.
Sure, go ahead. However, note that coding languages like MATLAB, Mathematica, SPSS, SAS, and Stata all require paid licenses. It's unlikely to be a problem if you're a grad student at a well-funded university, but don't code yourself into a corner: many industry jobs don't want to pay the tens of thousands of dollars for a license, so it might be a safer bet to learn free software like R or Python.

I'm uncomfortable with a coding language whose name has only one syllable, but I still want my language to be free.
You look like a great candidate for the languages Python, Java, C++, Fortran, or Julia. These languages are all just as good as R in many ways, but I've learned too much R to be unbiased in this blog post.

So why do you use R?
R is (currently) unparalleled in its ability to easily run complicated statistical tests and to produce beautiful data visualization. Instead of coding a non-linear least squares regression from scratch, you can download R code that'll do it for you. Similarly, investing a little time into learning how R plots data can let you produce almost any visualization you can imagine. Finally, R's community is sufficiently large that websites like Stack Overflow constantly have people asking and answering questions about how to code something in R. 99% of my solutions to R questions come from searching through this community. 

However, I use R because it fits my research needs: statistics and data visualization. If I were running computation-heavy evolutionary simulations, C++ would be a better bet. If I wanted to do engineering work, MATLAB or Python would be stronger.
__________________________________________________________________
So I downloaded R. How do I do anything?
If you're coming from click-based statistical programs like Excel or SPSS, seeing a program that's just a command line can be a bit of a shock. Think of R as a language and less as a program. Of course you can't say anything in German when you first start learning, and mastering all the tones in Mandarin can be frustrating (or hilarious).


Hopefully you're in RStudio right now. In the TOP LEFT window, you can type whatever you want, and hitting enter doesn't make R run the code. This allows you to write several lines of code before you run anything, which is essential for trickier programming. To run a line of code in this window, press Ctrl + Enter on Windows or Cmd + Enter on Mac.

The BOTTOM LEFT window is the console, where you can talk directly to R. When you open RStudio, there's some text here about R and what version you have. Typing here and then pressing ENTER makes R run what you wrote. This is nice if you just want to type a quick command to check something. For now, let's focus on this window.

(The TOP RIGHT window is useful for having extra information about commands, like what arguments they take. For me, the BOTTOM RIGHT window generally displays plots.)

Below are the very first things you should do in R before you try any data analysis. Anything in blue can be directly copied into R and run.

Numbers
5
5 + 5

Yes, literally just type the number 5 into that box at the bottom of the screen and hit enter. Unsurprisingly, R says "5" back to you, confirming that 5 = 5 after all. Now try 5 + 5. Great work.

x <- 5
x

The <- arrow thing is the equals sign in R. (You can use an actual equals sign if you want, but essentially everyone uses the arrow.) By typing x <- 5 and hitting enter, you've made a variable x that has the value 5. R will remember this until you tell it to forget, or you close R. Now if you ask R to tell you what x is, it will say 5.
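To make the "remember until you tell it to forget" part concrete, here's a minimal sketch. The variable name my_number is made up for illustration; exists() and rm() are functions that come with R.

```r
# Assign with the arrow, then update with an equals sign: both store a value
my_number <- 5
my_number = my_number + 1
my_number                                 # 6

# exists() checks whether R currently remembers a variable with this name
remembered_before <- exists("my_number")
remembered_before                         # TRUE

# rm() tells R to forget the variable
rm(my_number)
remembered_after <- exists("my_number")
remembered_after                          # FALSE
```

If you ask for my_number after the rm() line, R will complain that the object is not found, which is R's way of saying it forgot.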

Commands in R
Congratulations, you just coded! If you want the formal induction into being a coder, apparently you're supposed to do this:

print("Hello world!")

print() is a command that, well, prints whatever is inside the parentheses. It's straightforward when you're printing literally what's inside of the command, but we can make it more interesting like this:

y <- "Hello world!"
print(y)

print() is nice but you probably won't use it that often relative to other functions. A critically important function to know is the concatenate function, or c(). This tells R that there are multiple elements to remember.

z <- c(1, 2, 4, 8)
z

We can now run some pretty standard analyses on that vector of numbers.

mean(z)
median(z)
sd(z)
min(z)
max(z)

You could also just run it directly on the numbers if you prefer.

mean(c(1, 2, 4, 8))
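If you're curious what these functions are doing under the hood, here's a rough sketch that rebuilds mean() and sd() by hand from simpler pieces. The names n_values, mean_by_hand, and sd_by_hand are just ones I picked; sum(), length(), and sqrt() are built into R.

```r
z <- c(1, 2, 4, 8)

# The mean is the total divided by how many numbers there are
n_values <- length(z)                  # 4
mean_by_hand <- sum(z) / n_values      # 15 / 4 = 3.75

# The (sample) standard deviation: square the distances from the mean,
# add them up, divide by n - 1, and take the square root
sd_by_hand <- sqrt(sum((z - mean_by_hand)^2) / (n_values - 1))

# all.equal() checks that two numbers match (allowing tiny rounding error)
all.equal(mean_by_hand, mean(z))       # TRUE
all.equal(sd_by_hand, sd(z))           # TRUE
```

Notice that z - mean_by_hand subtracts the mean from every element of z at once. R does this kind of element-by-element math automatically.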

Loading data into R
This one caused me so much confusion when I was first learning R. It involves thinking a bit like a computer.

Step 1: Save the data in a way R will understand
This post from R Bloggers goes into intricate detail on all the file formats R will accept and how to load them. For me, the most relevant format has been .CSV, or "comma separated values."

Say you have an Excel sheet you want to open in R. In Excel:

- Make sure there are no spaces in the column names. Change the names from e.g. "Time (seconds)" to "Time_sec" or "Time.sec"

- File --> Save As --> Save as Type --> CSV (Comma delimited)

- Excel will say some features of the workbook might be lost. Say that Yes, you do want to keep using CSV format

- When you exit, it will ask if you want to save your changes. Go ahead and save, even if you didn't make any changes
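Why bother renaming the columns? By default, read.csv() cleans up column names for you with a function called make.names(), and the results can be ugly. Here's a small sketch of what happens to the example names from the first bullet (cleaned_names is a name I made up):

```r
# make.names() swaps characters R doesn't allow in names (spaces,
# parentheses, ...) for periods; underscores and periods are left alone
cleaned_names <- make.names(c("Time (seconds)", "Time_sec", "Time.sec"))
cleaned_names
# "Time..seconds." "Time_sec"       "Time.sec"
```

So R will still load a column called "Time (seconds)", but you'd be stuck typing Time..seconds. every time you wanted to use it.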

Step 2: Specify the working directory in R
Now you need to tell R where to find the data. When you use R, it focuses on one particular folder on your computer at a time, and you have to tell it which folder to look at. The folder R is looking at is called the "working directory." You can find out where you currently are by typing getwd(). You can change the working directory with the setwd() command. If you're on a Windows computer, your data are on the Desktop, and your name is Matt, you can write this to get to the Desktop:

setwd("C:/Users/Matt/Desktop")
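Since your folders won't match mine, here's a sketch you can run on any computer. It uses tempdir(), a scratch folder R creates for each session, as a stand-in for the Desktop. The name old_dir is just one I picked.

```r
# Remember where R is currently looking so we can go back afterwards
old_dir <- getwd()

# Point R at a different folder (a temporary scratch folder, as an example)
setwd(tempdir())
getwd()          # now prints the path of the scratch folder

# list.files() shows what R can see in the working directory
list.files()

# Go back to where we started
setwd(old_dir)
```

Note the forward slashes in the setwd() example above. R wants "/" in paths, even on Windows.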

Step 3: Load the data
This step will involve creating a variable called "data" and using the read.csv() function.

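data <- read.csv("Data.csv", header = TRUE)

If there's a file called Data.csv in the working directory (here, the Desktop), R will read it and assign it to the variable "data." The header = TRUE argument tells R that the top row of the data is column names. (If you just imported a table of numbers with no header, for example, you could say header = FALSE instead. You'll often see these shortened to T and F, which works but is less safe because T and F can be overwritten like any other variable.)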
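If you don't have a CSV handy, here's a self-contained sketch that writes a tiny made-up file to R's temporary folder and reads it back both ways, with TRUE and FALSE spelled out. The file name fruit_example.csv, its contents, and the variable names are all invented for illustration.

```r
# Write a small CSV file to R's scratch folder
csv_path <- file.path(tempdir(), "fruit_example.csv")
writeLines(c("fruit,price", "apple,0.50", "pear,0.75"), csv_path)

# With a header: the first line becomes the column names
with_header <- read.csv(csv_path, header = TRUE)
names(with_header)   # "fruit" "price"
nrow(with_header)    # 2

# Without a header: R invents column names and treats line 1 as data
no_header <- read.csv(csv_path, header = FALSE)
names(no_header)     # "V1" "V2"
nrow(no_header)      # 3
```

If your data ever has one more row than you expected and a column called V1, a missing header setting is a likely culprit.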

Step 4: Look at the data
Now you can look at the data. You could just type data and hit enter, but R will display everything, so if your data is that 11,000 x 5 matrix from your fruit travels from earlier, your screen will become overwhelmed with numbers. A better option is to look at only part of the data.

head(data)
tail(data)
data[1:5, c(2, 4)]

The head() and tail() commands tell R to show only the first or last 6 rows of the data (6 is the default).

The last command, data[1:5, c(2, 4)], offers you more fine-tuned control. The brackets [ ] let you subset the data, which means selecting only part of it. The first argument, 1:5, means "rows 1, 2, 3, 4, and 5." The second argument, c(2, 4), means "columns 2 and 4."

If you wanted to create a new variable for only columns 1, 3 to 5, and 7 of the data, you could write something like this:

data2 <- data[ , c(1, 3:5, 7)]

The empty first argument means "all rows."
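Here's a sketch of subsetting that you can run without loading any file. The fruit_prices table and its numbers are made up. Besides column numbers, it also shows that you can pick columns by name, which is often easier to read later.

```r
# A small made-up data set, built directly in R
fruit_prices <- data.frame(
  store = c("A", "B", "C", "D"),
  apple = c(0.50, 0.55, 0.60, 0.45),
  pear  = c(0.75, 0.80, 0.70, 0.85),
  plum  = c(0.30, 0.35, 0.25, 0.40)
)

# Rows 1 and 2, columns 2 and 4 (apple and plum)
first_rows <- fruit_prices[1:2, c(2, 4)]
first_rows

# All rows, but only the columns named "store" and "pear"
by_name <- fruit_prices[ , c("store", "pear")]
by_name

# The $ sign pulls out a single column as a vector of values
mean(fruit_prices$apple)
```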

Finally, here are three commands to get a feel for the data:

dim(data)
summary(data)
str(data)

dim() will tell you the dimensions of your data, i.e. how many rows and columns there are. summary() will summarize each column of the data, giving you values like the first quartile, median, etc. str() is more useful for when some of your columns have text like "Treatment A," "Treatment B" and other columns have numbers.
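To see these on something small, here's a sketch with a made-up table that mixes a text column and a number column (trial_data and its values are invented for illustration):

```r
# A tiny made-up data set: one text column, one numeric column
trial_data <- data.frame(
  treatment = c("Treatment A", "Treatment B", "Treatment A"),
  time_sec  = c(12.1, 15.3, 11.8)
)

# Rows first, then columns
dim(trial_data)     # 3 2
nrow(trial_data)    # 3
ncol(trial_data)    # 2

# str() lists each column with its type and first few values
str(trial_data)

# summary() also works on a single column
summary(trial_data$time_sec)
```

How str() labels the text column depends on your version of R, so don't worry if your output doesn't match someone else's exactly.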

__________________________________________________________________
I still don't really know how to do anything in R.
That's ok. Again, think of R as a language instead of a program. It takes a while to gain fluency, but the more you invest in learning, the easier it'll be to say what you're thinking. 

One of the most important things for me when I first started learning R was to learn where to find answers to questions I had. Let's say you found a function but don't know how to use it:

?mean

This will bring up R's Help file for the mean() function.  There you can find what the function does, as well as what arguments R is looking for.

Say you don't know what the function is called in R. Let's say you're trying to find the command for standard deviation:

??"standard deviation"

This will give you a list of possible functions that match this. The "stats::sd" option is what you're looking for. "stats" refers to the package in R, and "sd" is the command.
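That package::command notation isn't only for help pages. You can type it directly, as in this small sketch (z is the same vector from earlier; the comments mention the long-form equivalents of ? and ??, which are help() and help.search()):

```r
z <- c(1, 2, 4, 8)

# These two lines do the same thing: the "stats" package is loaded
# automatically when R starts, so you can leave off the prefix
stats::sd(z)
sd(z)

# Ask R which package a function lives in
environmentName(environment(sd))   # "stats"

# For reference, ?mean is shorthand for help("mean"), and
# ??"standard deviation" is shorthand for help.search("standard deviation")
```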

Finally, say you're looking for a function for standard error and ??"standard error" only gives you really complicated-sounding options. Type the following into Google:

standard error in R

The first link, not surprisingly, takes you to Stack Overflow, where someone asked this exact question in 2011. The answer is that base R doesn't have a built-in function for standard error, but it's really easy to write one. I'll cover writing your own functions in a future blog post. When in doubt, Google what you're trying to do, followed by "in R." This is honestly the easiest way to find out how to code something in R.
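As a small preview, here's a minimal sketch of what such a function could look like, assuming the usual definition of the standard error of the mean: the standard deviation divided by the square root of the number of observations. The name standard_error is one I made up, and this version doesn't handle missing values.

```r
# Standard error of the mean: sd divided by the square root of the sample size
standard_error <- function(x) {
  sd(x) / sqrt(length(x))
}

z <- c(1, 2, 4, 8)
standard_error(z)
```

Once you've run the function definition, standard_error() works like any other command for the rest of your R session.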
__________________________________________________________________
What are other resources for learning R besides The Headbanging Behaviorist? No offense.
Here are some invaluable resources that have helped me learn.


This book was exactly what I needed when I started learning in 2012. I needed something for an absolute beginner, and this book helped me get over that initial learning curve.


Quick-R provides a very useful overview of basic functions in R. I visit their page on graphical parameters all the time.


The writers behind R-bloggers are incredibly helpful. While they won't necessarily provide the well-rounded introduction to R you might need, they're very useful for coding a random, specific analysis that might be hard to find elsewhere. Their 2005 post on shading a polygon was exactly what I needed for the analysis in this blog post. I follow them on Twitter and will read the occasional article that pops up and is relevant to me.



I've never gone to Stack Overflow's website directly, but I always end up there from Googling questions about R. Honestly, that's the easiest way to learn how to run a particular analysis in R. Google it and then see what Stack Overflow suggests.

Thanks for reading! This is the first post in a series on R. The next post will be on plotting and simple statistical tests. 

Best,
-Matt

 Image credits:
- R Studio: rstudio.com
- Apple: dreamatico.com
- An Introduction to R for Biologists: amazon.com
- Stack Overflow: stackoverflow.com
 
