For me, coding redefines possibility. Technology now allows data analysis in ways never before possible, as well as a new avenue for human creativity. The best example of this, I believe, is data visualization. Producing effective and interesting graphs can not only tell the science better - it can draw viewers in who wouldn't otherwise read the science.

In this post, we'll use R's excellent plotting capabilities to learn about normal distributions, linear regression, and a bit more. With constant feedback from the plots in R, we'll start learning R syntax along the way.

**This is the second post in a series on R. This post covers:**
- How do I generate random data?
- How do I plot data?
- How do I run linear regression?

The first post in the series is an introduction to R. It covers importing data, viewing data, assigning variables, and the help function. The most important code from that post is listed here:

**c()**
*# "Concatenate," or combine. This is how you tell R that there are*
*# multiple elements.*

**x <- c(5, 10)**
*# x is now a two-element vector of the numbers 5 and 10*

**x[x > 7]**
*# Subset. This shows you only the elements of x that are greater than 7*

If you're determined to start learning at this post instead of the previous one, you can download R here, and you can download RStudio here.

Finally, to give you a sample of the graphical possibilities in R, below are a few beautiful graphs I found online. All were produced in R. For my courteous Facebook friends supporting my blog but not terribly interested in coding: enjoy some pretty pictures!

**Miguel Rios: every geotagged tweet 2009 - 2013**

**Sample data adapted from plot.ly: 3D surface plot**

**Ramnath Vaidyanathan: baseball strikeouts using the R package Shiny**

(Also, here's a great link comparing the raw R exports and the final publication graphics for the book *London: The Information Capital*.)

__________________________________________________________________

**Random data: histograms**

**Introduction**
Randomness might seem like a weird place to start, but it's actually very useful for learning about distributions and plotting. Let's start with the normal - also called Gaussian - distribution.

The above plots are a density plot and a histogram, which are pretty similar ways of showing the data. Density plots are particularly useful if you have a *lot* of data and are estimating the underlying distribution generating that data. Histograms divide the data into bins and show how many points are in each bin. These distributions reflect data that were generated with one line of code:

**data <- rnorm(10000000, 0, 1)**

Yup, I generated 10 million random values just like that. I probably could have made the same point with the graphs above using just 1,000 or 10,000 values... but too late now! You can see these values by typing **data** and hitting enter, but R will show you *all* the values... so I wouldn't do that unless you want to feel like you're in the Matrix. Use **head(data)** or **summary(data)** instead.

**The normal distribution**
The two most important parameters characterizing a normal distribution are the **mean** and **standard deviation.** The mean is the "average" value, which you can see as the peak of the distributions above. The standard deviation is the "spread" of the data. The **rnorm()** command takes three arguments: how many numbers you want, and the mean and standard deviation of the distribution you're drawing from.

**Here's where R gets fun.** Why trust anything *I'm* telling you about normal distributions? I'm some random person on the internet. You now know enough about R to figure out anything you want about the Gaussian. Instead of puzzling over some abstract definition of standard deviation you read in a textbook, for example, why not generate some distributions and see for yourself?

**# STANDARD DEVIATION AND THE NORMAL DISTRIBUTION**

**# Tell R to place two plots side by side**
**par(mfrow = c(1, 2))**

*# Generate the data*
**x <- rnorm(1000, 0, 1)**
*# 1000 values, mean = 0, standard deviation = 1*
**y <- rnorm(1000, 0, 5)**
*# 1000 values, mean = 0, standard deviation = 5*

*# Plot it!*
**plot(density(x), xlim = c(-20, 20), ylim = c(0, 0.4))**
**plot(density(y), xlim = c(-20, 20), ylim = c(0, 0.4))**

*(Note: my plots are a bit cleaned up compared to the code above. The code I used for all plots is at the end of this blog post.)*

*(Note #2: the **xlim** and **ylim** arguments above specify the limits on the x- and y-axes, respectively, which is important for showing this comparison. Otherwise, R scales each plot axis to zoom in as much as possible on the data.)*

As is strikingly clear, having a small standard deviation concentrates most of the distribution around the mean: most of the values are similar. A large standard deviation, on the other hand, means there is a lot of spread in the data.

**Proportion of the normal distribution within a standard deviation**
How much of a normal distribution is supposed to be within one standard deviation of the mean? Instead of looking it up on Wikipedia, you can figure it out through R by **subsetting** and using the **length** of the data variable:

**length(x[x > mean(x) - sd(x) & x < mean(x) + sd(x)]) / length(x)**

This code might seem like a lot, but it can be broken into a few simple parts:

**length(**
**# How many values are there?**
**x[x > mean(x) - sd(x)**
**# Subset x to only values > 1 SD below the mean...**
**&**
**# and...**
**x < mean(x) + sd(x)])**
**# ... less than one SD above the mean.**
**/ length(x)**
*# Now, divide all that by the total to get a proportion.*

It should be about 68%. Visually, that looks like the plot below. Unsurprisingly, because the standard deviation is 1, the shaded region covers data values from -1 to 1.

You can modify the code to confirm that ~95% of the distribution lies within 2 standard deviations of the mean, too.
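For example, here's one way that modification could look (a quick sketch using the same subsetting trick, on a fresh sample):

```r
# Proportion of a normal sample within 2 standard deviations of its mean
x <- rnorm(100000)
length(x[x > mean(x) - 2 * sd(x) & x < mean(x) + 2 * sd(x)]) / length(x)
# roughly 0.95
```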

_________________________________________________________________

The point of all this is...

**Try things for yourself.**


Curiosity is a much better R instructor than anyone on the planet. Think of things, try them in R, learn R as a byproduct.

There's plenty to mess around with. You can create bimodal distributions, trimodal distributions, see how the smoothness of the density plot changes with the number of samples, try uniform and binomial distributions... go for it.

**# MULTIMODAL DISTRIBUTIONS**

**# Create bimodal and trimodal distributions**
**bimod <- c(rnorm(1000, -2, 1), rnorm(1000, 2, 1))**
**trimod <- c(bimod, rnorm(1000, 6, 1))**

**# Plot them**
**hist(bimod, breaks = 50)**
**# Divide the data into 50 bins**
**hist(trimod, breaks = 75)**
*# More bins because there are more data*

**# ALTERNATE DISTRIBUTIONS**

*# Uniform distribution*
**unif <- runif(1000, 0, 1)**
*# Number of samples, minimum, maximum*

*# Binomial distribution*
**binom <- rbinom(1000, 10, 0.1)**
*# Number of samples, number of trials per sample,*
*# probability of success*

*# Poisson distribution*
**pois <- rpois(1000, 10)**
*# Number of samples, expected value*

*# Low-sample normal distribution*
**norm <- rnorm(10, 0, 1)**

**# Plot them**
**par(mfrow = c(2, 2))**
**hist(unif, breaks = 50, col = "dodgerblue", main = "Uniform distribution")**
**hist(binom, col = "dodgerblue", main = "Binomial distribution")**
**hist(pois, col = "dodgerblue", main = "Poisson distribution")**
**hist(norm, col = "dodgerblue", main = "Low-sample normal distribution")**

__________________________________________________________________

**Random data: scatter plots**

**Introduction**
There've been an awful lot of histograms in this post so far, so let's end with scatter plots. Let's first create a data set. We'll use data we create ourselves so we know precisely what we're putting into the analysis before we run anything. In other posts, we'll use data provided by R.

Say you're an ambitious yet eccentric graduate student interested in the relationship between people's opinions of Newark Airport, and how much time they spend watching cat videos every day. Let's also assume there's an insanely strong correlation between these two variables, so strong that if only Governor Christie knew, he'd pull out of the presidential race to focus his attention on finding ways to get more cat lovers flying out of EWR instead of JFK.

We'll start by creating a perfect relationship between the two, and then adding noise to represent real data. We'll alter the strength of this noise to better understand statistical significance in regression.

**Data generation**
We first create the "time spent looking at cats" data. We'll assume that if we had this data for all 320 million Americans, it'd be a normal distribution with a mean of 15 minutes and a standard deviation of 10. In our experiment, we get a random sample of 100 Americans.

**time.cat <- rnorm(100, 15, 10)**

**time.cat** is now a variable that represents the number of minutes 100 people spent watching cat videos. Each element of this vector (i.e. each number) is the amount of time a different person spent. However, note that you can't have a *negative* amount of time spent watching videos. Let's change those negative values to zero.

**time.cat[time.cat < 0] = 0**

This line took the subset of **time.cat** values that were less than zero and changed them to zero.

Now, let's generate the data for opinions on Newark airport. Let's assume it's a uniform distribution from 1 to 10.

**newark.opinion <- runif(100, 1, 10)**

To create a strong relationship between the two variables, let's sort both in ascending order. *[Note: never do this with real data! It creates a relationship that doesn't exist between the variables.]*

**time.cat2 <- sort(time.cat)**
**new2 <- sort(newark.opinion)**

If you plot these data, you get a perfect relationship! Wow! It's amazing how great fabricated data look. The command below is saying "plot time spent watching cat videos *as a function of* your view on Newark Airport."

**plot(time.cat2 ~ new2)**

**Noise unravels the relationship**
Let's now create three more data sets with varying levels of noise. We generate noise by adding random values to the cat video data to make the relationship between the variables weaker. Note: it's important we use lots of random numbers, as opposed to a constant: adding/subtracting a constant would maintain the relationship but shift it up or down the y-axis.

We generate 100 random values with **runif()** below, varying how far from zero the distribution stretches. For the second and third sets, we add noise to the previous set.

**noise1 <- time.cat2 + runif(100, -10, 10)**

**noise1[noise1 < 0] = 0**

**noise2 <- noise1 + runif(100, -10, 10)**

**noise2[noise2 < 0] = 0**

**noise3 <- noise2 + runif(100, -10, 10)**

**noise3[noise3 < 0] = 0**

**# Plot it**
**par(mfrow = c(2, 2))**

**plot(time.cat2 ~ new2)**

**plot(noise1 ~ new2)**

**plot(noise2 ~ new2)**

**plot(noise3 ~ new2)**

How does the addition of noise change the relationship between opinion of Newark Airport and time spent watching cat videos? We can quantify the strength of the relationship by using **linear regression.** Linear regression finds the line through the data that minimizes the sum of the squared vertical distances of the points from the line (the **residuals**). This line represents the relationship between the two variables, and linear regression can tell you whether that relationship is positive or negative (and how strongly), as well as whether the relationship is statistically significant. *[Note: there are a few assumptions you have to make when you do linear regression, such as independent samples, homoskedasticity, and that the underlying relationship is actually linear. Definitely check these before you run a linear regression on real data.]*

The command in R to build a **linear model** is **lm()**. We'll use this to run our regressions.

**reg1 <- lm(time.cat2 ~ new2)**

**reg2 <- lm(noise1 ~ new2)**

**reg3 <- lm(noise2 ~ new2)**

**reg4 <- lm(noise3 ~ new2)**

By using the **summary()** command, we can look at the outcome of each test. That'll give you something like this:

The Pr(>|t|) tells you the **p-value** for the linear regression, which is unsurprisingly insanely significant for our sorted data. This is saying that if there were actually no relationship and you redid your experiment thousands of times, fewer than a 2 x 10^-16 proportion of those experiments would get results as extreme as yours or greater. The cutoff for "statistical significance" in p-values is 0.05 *(though that's a huge can of worms on whether the cutoff should be much, much lower).*

Meanwhile, the **R-squared** values at the bottom of the output refer to the strength of the relationship: the proportion of the variance in the data that the model explains. The minimum possible R-squared value is 0, which would look like a flat line going through the data: as you increase or decrease one variable, there's no effect on the other. The maximum R-squared value is 1, which would be a perfect positive or negative relationship.
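If you want to grab these numbers directly instead of reading them off the printed output, here's a minimal sketch on simulated data (the variables here are invented for illustration, not the cat-video data):

```r
# Fit a regression on simulated data and pull out what summary() reports
set.seed(1)
x <- runif(100, 1, 10)            # predictor
y <- 2 * x + rnorm(100, 0, 3)     # true positive relationship plus noise
fit <- summary(lm(y ~ x))
fit$coefficients["x", "Pr(>|t|)"] # the p-value for the slope
fit$r.squared                     # proportion of variance explained (0 to 1)
```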
Using the **summary()** command on our increasingly noisy data sets shows that, indeed, adding more noise weakens the relationship between the variables. However, it takes three rounds of adding noise before the relationship isn't statistically significant anymore, probably because the original relationship was unrealistically strong.
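You can watch this happen by pulling the slope's p-value straight out of each summary. Here's a self-contained sketch (using normal noise of increasing spread, rather than the post's cumulative uniform noise):

```r
# Watch the slope p-value change as the noise level increases
set.seed(2)
predictor <- sort(runif(100, 1, 10))
response <- sort(rnorm(100, 15, 10))  # sorted, so the noiseless relationship is very strong
p_for_noise <- sapply(c(0, 10, 25, 50), function(noise_sd) {
  noisy <- response + rnorm(100, 0, noise_sd)
  summary(lm(noisy ~ predictor))$coefficients[2, "Pr(>|t|)"]
})
p_for_noise   # p-values typically climb toward (and past) 0.05 as noise_sd grows
```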
For one final figure, let's look at the "Noise #1" and "Noise #3" data sets. How well does linear regression fit a line through the data?

*Here, I color-coded the data points based on their residual value, i.e. how far they are from the regression line. The histograms below show the distribution of residuals. Because we can't have negative time spent watching cat videos, there's a cutoff on how negative the residuals can be. As the data grow noisier, however, we get larger positive residuals.*

*[Note that the colors don't correspond perfectly to the scatterplots... I tried!]*

__________________________________________________________________

**Summary**

This post covered a lot: generating random data, altering the parameters of the normal distribution, creating a few other types of distributions, scatter plots, noise, and a bit of linear regression.

The main theme of this post, and one that I think is core to coding, is to **try things out.** Just try things while coding; you almost never get new code right on the first try, and that's fine. Think of coding as a language you can use to answer increasingly sophisticated and interesting questions. If you need to learn how to do a certain analysis in R *this week*, then go ahead and read up on the specific commands. If you have time, though, let your imagination have a bit of fun. Wander into territory you've never tried before and learn whatever R you need to answer that question. That's a lot more fun, I think, than memorizing commands and saving them for one day you might need them.

- Matt

p.s. one final bit of advice. Say you spent hundreds of hours collecting data and it turns out there's no relationship at all between Americans' opinion of Newark Airport and how much time they spend watching cat videos every day. Setting the regression line width to 500 in the **abline()** command can completely obscure your data and might make you feel better. :-)

**Code for generating the plots in this blog post:**

(This doesn't include the fancy ones at the start, unfortunately!)

__Normal distribution: density plot and histogram__

*# Generate the data*
data <- rnorm(1e7)

*# Set two plots side by side, and allow space for an overall header*
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))

*# Density plot*
plot(density(data), las = 1, main = "Density plot", lwd = 3, xlab = "Value", col = "deepskyblue4",
font.lab = 2, xlim = c(-5, 5))

*# Histogram*
hist(data, breaks = 50, las = 1, xlab = "Value", main = "Histogram", freq = F, col = "deepskyblue4",
font.lab = 2, xlim = c(-5, 5))

*# Title*
mtext("Normal distribution", outer = T, cex = 2, font = 2)

__Normal distribution: sd = 1, sd = 5__

**# Generate the data**
x <- rnorm(1000, 0, 1)
y <- rnorm(1000, 0, 5)

**# Plot it**
par(mfrow = c(1, 2))
plot(density(x), xlim = c(-20, 20), ylim = c(0, 0.4), lwd = 2, font.lab = 2, las = 1,
main = "Standard deviation = 1", xlab = "Value")
plot(density(y), xlim = c(-20, 20), ylim = c(0, 0.4), lwd = 2, font.lab = 2, las = 1,
main = "Standard deviation = 5", xlab = "Value")

__Shaded-in normal distribution__

*# Generate the data and save a variable of the density plot*
x <- rnorm(1000)
d <- density(x)

*# Find the x-values within the density plot that are at the -1 SD and +1 SD mark*
m <- max(which(d$x <= mean(x) - sd(x)))
m2 <- min(which(d$x >= mean(x) + sd(x)))

*# Create the coordinates for the shaded region*
coord.x <- c(d$x[m], d$x[m:m2], d$x[m2])
coord.y <- c(0, d$y[m:m2], 0)

*# Plot it*
par(mfrow = c(1, 1))
plot(d, lwd = 2, main = "1 standard deviation within the mean", xlab = "Values", las = 1,
font.lab = 2)
polygon(coord.x, coord.y, col = "darkorchid4")

__Alternate distributions__

*# Uniform distribution*
unif <- runif(1000, 0, 1)
*# Number of samples, minimum, maximum*

*# Binomial distribution*
binom <- rbinom(1000, 10, 0.1)
*# Number of samples, number of trials per sample, probability of success*

*# Poisson distribution*
pois <- rpois(1000, 10)
*# Number of samples, expected value*

*# Low-sample normal distribution*
norm <- rnorm(10, 0, 1)

*# Plot them*
par(mfrow = c(2, 2))
hist(unif, breaks = 50, col = "dodgerblue", main = "Uniform distribution", xlab = "Value",
las = 1, font.lab = 2, cex.main = 2, cex.lab = 1.3)
hist(binom, col = "dodgerblue", main = "Binomial distribution", xlab = "Number of successes",
las = 1, font.lab = 2, cex.main = 2, cex.lab = 1.3)
hist(pois, col = "dodgerblue", main = "Poisson distribution", xlab = "Value",
las = 1, font.lab = 2, cex.main = 2, cex.lab = 1.3)
hist(norm, col = "dodgerblue", main = "Low-sample normal distribution", xlab = "Value",
las = 1, font.lab = 2, cex.main = 2, cex.lab = 1.3)

__The perfect relationship__

*# Load library. If you don't have this package: install.packages('RColorBrewer')*
**library(RColorBrewer)**

*# Create the color palette*
**colors <- colorRampPalette(c("red", "orange", "yellow", "green", "blue", "purple"))(n = 100)**

*# Plot it*
**plot(time.cat2 ~ new2, main = "The perfect relationship", xlab = "Opinion of Newark Airport", col = colors, pch = 19, ylab = "Time watching cat videos (min)", las = 1, font.lab = 2, cex.main = 1.4, cex = 1.2)**

*# Add a margin around the points*
**points(time.cat2 ~ new2, cex = 1.2)**

*# Add a legend. Bold the text inside the legend*
**par(font = 2)**

*# Break the legend text into 2 lines*
**legend("topleft", legend = "If only real data\n were this nice...", pch = 4, pt.cex = 1.7)**

__Noisy scatter plots__

*# Load library*
**library(RColorBrewer)**

*# Create the color palette*
**colors <- colorRampPalette(c("red","orange","blue"))(n = 100)**

*# Combine the data into a matrix for ease in plotting*
**plots <- cbind(time.cat2, noise1, noise2, noise3)**
**titles <- c("No noise", "Noise #1", "Noise #2", "Noise #3")**

*# Use a for loop to avoid writing all the figure specifications 4 times*
**par(mfrow = c(2, 2))**
**for(i in 1:ncol(plots)){**
plot(plots[, i] ~ new2, main = titles[i], pch = 19, ylim = c(0,150), las = 1,
col = colors, cex = 1.2, xlab = "Opinion of Newark Airport", ylab = "Watch time (min)",
font.lab = 2, cex.main = 2, cex.lab = 1.3)
}

__Residuals__

*# Load library (unnecessary if you already did in this R session)*
**library(RColorBrewer)**

*# Run the linear regressions for Noise #1 and Noise #3 data*
reg2 <- lm(noise1 ~ new2)
**reg4 <- lm(noise3 ~ new2)**

**par(mfrow = c(2,2))**

*# Create a vector of color values from light blue to black*
cols <- colorRampPalette(c("lightblue","black"))(n = 101)

*# Scatter plot #1*
**plot(noise1 ~ new2, pch = 19, col = cols[round(abs(reg2$resid))], cex = 1.6, font.lab = 2,**
las = 1, ylim = c(0, 130), main = "Noise #1", xlab = "Opinion of Newark Airport",
ylab = "Watch time (min)", cex.main = 2, cex.lab = 1.3)

*# Add the linear regression line*
**abline(reg2, lwd = 2)**

*# Scatter plot #2. Colors didn't work unless I added 1... don't know why*
**plot(noise3 ~ new2, pch = 19, col = cols[round(abs(reg4$resid + 1))], cex = 1.6, font.lab = 2,**
las = 1, ylim = c(0, 130), main = "Noise #3", xlab = "Opinion of Newark Airport",
ylab = "Watch time (min)", cex.main = 2, cex.lab = 1.3)
abline(reg4, lwd = 2)

*# Second color vector. I cheated a bit to make the histograms look closer to scatter plots*
**cols2 <- colorRampPalette(c("#4A5C62", "lightblue", "black", "black"))(n = 20)**

*# Histograms*
**hist(reg2$resid, breaks = 10, col = cols2, xlim = c(-100, 100), freq = F, ylim = c(0, 0.025),**
las = 1, main = "Noise #1 residuals", xlab = "Residual value", cex.main = 2, cex.lab = 1.3,
font.lab = 2)
hist(reg4$resid, breaks = 20, col = cols2, xlim = c(-100, 100), freq = F, ylim = c(0, 0.025),
las = 1, main = "Noise #3 residuals", xlab = "Residual value", cex.main = 2, cex.lab = 1.3,
font.lab = 2)

__Shameful abline__

**par(mfrow = c(1,1))**

plot(noise3 ~ new2, pch = 19, cex = 1.6, font.lab = 2,

las = 1, ylim = c(0, 130), main = "Noise #3", xlab = "Opinion of Newark Airport",

ylab = "Watch time (min)", cex.main = 2, cex.lab = 1.3)

abline(reg4, lwd = 500, col = "purple")


**Image credits:**

- Twitter map: Miguel Rios

- 3D surface plot: plot.ly

- Baseball time series: Ramnath Vaidyanathan

- Kitten: theheightsanimalhospital.com
