Marc A. Beer
Simple linear regression is useful for
understanding the relationship between an unbounded continuous response
variable and a single explanatory variable. Such simple relationships are
also relatively easy to simulate in R.
Before we begin, let’s load in some useful R packages.
library(lme4)
library(lmerTest)
library(rstanarm)
library(ggplot2)
library(patchwork)
library(cowplot)
library(ggpubr)
library(knitr)
We might simulate an explanatory or “X” variable by
making a random draw of size N from a uniform distribution governed by
specified minimum and maximum values.
set.seed(2)
#set a sample size
N <- 3000
#draw from a uniform distribution with specified min and max values
xmin <- 5
xmax <- 25
x <- runif(n=N, min=xmin, max=xmax)
Let’s plot the resulting random draw as a
histogram. You’ll notice that the prevalence of X values is similar for
each bin, which is what we expect from a uniform distribution.
ggplot(data.frame(x=x), aes(x = x)) +
geom_histogram(aes(y = after_stat(density)),
colour = 1, fill = "slateblue1", bins=N/100, alpha=0.5) +
geom_density()+
labs(x="X", y="Density")+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))
Note that we could have simulated X values using
different distributions, and it may be informative to play around with
this. For example, if we expect that extremely high or low X values will
be rarer than middling values (a sensible expectation), we might prefer
the normal distribution.
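For example, a draw from a normal distribution might look like the short sketch below; the mean and standard deviation here are arbitrary illustrative choices, and we will not use these values in the rest of the tutorial.
#draw x values from a normal distribution instead
#(the mean and sd are arbitrary illustrative choices)
x_norm <- rnorm(n=N, mean=15, sd=3)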
Great, we have our X values. Let’s simulate a
response variable Y that has a simple linear relationship with X. To do
this, we will need to specify a Beta coefficient (B) that relates X
values to mean, or expected, values of Y.
Note that we say expected values of Y because we will later introduce noise, or random error, to the expected Y values when we generate a dataset mimicking real data. Thus, our final dataset containing observed Y values will depart from these expected Y values. However, we will see that the general trend, and indeed the expected Y values themselves, will ultimately be preserved.
#specify a beta coefficient
B <- 5
We can also specify an intercept (I) describing the
expected value of Y when X=0.
#specify an intercept
I <- 10
Having specified a Beta coefficient (B) and
intercept (I), we can now calculate expected values of Y using a simple
function. We’ll call our expected values of Y “y_e” because we still
need to modify them somehow…
#calculate expected values of y ("y_e")
y_e <- B*x + I
Let’s plot our expected values of Y in relation
to X.
#create a data frame of x and y_e values
df_x_ye <- data.frame(x=x, y_e=y_e)
#plot expected y (y_e) against x
ggplot()+
geom_point(data=df_x_ye, mapping=aes(x=x, y=y_e), shape=21, colour="black", fill="slateblue1", size=2)+
xlim(min(x)-2.5, max(x)+2.5)+
ylim(0, max(y_e)+2.5)+
labs(x="X", y="Expected Y")+
theme_bw()+
stat_cor(data=df_x_ye, mapping=aes(x=x, y=y_e), method = "pearson")
If you’ve carried out regression analysis on real data,
you might think these data look a bit odd. The correlation between X and
Y is unrealistically high: in fact, Pearson’s r = 1, meaning X
perfectly predicts Y. However, recall that these are expected values
of Y.
As mentioned earlier, real data concerning correlated
variables frequently show some magnitude of random error (or noise).
This random error can be introduced by measurement or simply the
randomness of nature itself, among other causes. We should introduce
random error to generate a more realistic dataset.
Because random error is… random, it should preserve
the expected value of Y. That is, the error values added to each
expected Y value should be centered on zero. For simple linear
regression, we further assume that the random error is normally
distributed with constant variance (i.e., the error variance does not
depend on the expected value of y). We can generate random error by
drawing from a normal distribution with a mean value of zero and a
standard deviation governing the magnitude of variation in error values.
Here, I have specified the error standard deviation as sqrt(var(y_e)/2), which is equivalent to specifying the error variance as var(y_e)/2. More explicitly, we have set the error variance to half of the variance in the expected Y values (var(y_e)). We will return to this in just a moment when we discuss how the addition of random error alters the appearance of the relationship between Y and X.
set.seed(20)
#generate y values by adding random error to y_e
y <- y_e + rnorm(n=N, mean=0, sd=sqrt(var(y_e)/2))
#create new data frame merging x, ye, and y
df_x_ye_y <- data.frame(x=x, y_e=y_e, y=y)
#re-plot expected y (y_e) vs x
ye_plot<-ggplot()+
geom_point(data=df_x_ye_y, mapping=aes(x=x, y=y_e), shape=21, colour="black", fill="slateblue1", size=2)+
xlim(min(x)-2.5, max(x)+2.5)+
ylim(min(y)-2.5, max(y)+2.5)+
labs(x="X", y="Expected Y")+
theme_bw()+
stat_cor(data=df_x_ye_y, mapping=aes(x=x, y=y_e), method = "pearson")
#plot observed y vs x
y_plot<-ggplot()+
geom_point(data=df_x_ye_y, mapping=aes(x=x, y=y), shape=21, colour="black", fill="slateblue1", size=2, alpha=0.5)+
geom_line(data=df_x_ye_y, mapping=aes(x=x, y=y_e), colour="tan1", linewidth=1.5)+
xlim(min(x)-2.5, max(x)+2.5)+
ylim(min(y)-2.5, max(y)+2.5)+
labs(x="X", y="Observed Y")+
theme_bw()+
stat_cor(data=df_x_ye_y, mapping=aes(x=x, y=y), method = "pearson", digits=3)
#combine plots
cowplot::plot_grid(ye_plot, y_plot, nrow=2, ncol=1, vjust=-0.25)
We can see that considerable noise has been added to the
data, but the general trend has been preserved. Indeed, in the lower
plot, the orange line depicts the original expected Y values that we
simulated (also shown as points in the top plot). Note that the value of
Pearson’s r for the simulated observed Y values is now ~0.82; Pearson’s
r has decreased.
Recall that when we added random error to the expected Y
values, we used a normal distribution with a standard deviation of
sqrt(var(y_e)/2). By specifying the error variance as half the variance
of the expected Y values, we generated data in which X can explain 2/3 of
the variance in Y, with the remaining 1/3 governed by random error: the
total variance of Y is approximately var(y_e) + var(y_e)/2 = 1.5*var(y_e),
of which var(y_e), or 2/3, is attributable to X. How does
this relate to Pearson’s r? Pearson’s r is equal to the square root of
the proportion of variance in Y explained by X. By squaring Pearson’s r, we can
confirm that X explains roughly 2/3 of the variance in Y. Let’s do that now.
#re-calculate the value for Pearson's r and confirm that it is the value shown in the lower panel of the above figure
pearson_r <- cor(x,y)
#square the value of Pearson's r and confirm that it is ~2/3
pearson_r_squared <- cor(x,y)^2
#print the values together
cat(paste("Pearson's r = ", pearson_r, "\n", "Squared Pearson's r = ", pearson_r_squared, sep=""))
## Pearson's r = 0.817767578544923
## Squared Pearson's r = 0.668743812519226
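As a quick sanity check, we can also compute the proportion of variance in Y that our simulation settings imply X should explain: the variance of the expected Y values divided by the total variance (the variance of the expected Y values plus the error variance).
#theoretical proportion of variance in y explained by x, given our simulation settings
var(y_e) / (var(y_e) + var(y_e)/2)
## [1] 0.6666667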
Note that due to our finite sample size and the nature
of introducing randomness, we should expect some deviation from an exact
value of 2/3. However, if we repeated the entirety of the previous
sections numerous times and took the mean squared Pearson’s r across the
replicates, we should obtain a value very close to 2/3.
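One rough way to sketch that idea is to wrap the simulation in replicate() and average the squared Pearson’s r across replicates; the 1000 replicates used below are an arbitrary choice.
#repeat the simulation many times and average the squared Pearson's r
#(1000 replicates is an arbitrary choice; this reuses N, xmin, xmax, B, and I from above)
set.seed(30)
r2_reps <- replicate(1000, {
  x_i <- runif(n=N, min=xmin, max=xmax)
  ye_i <- B*x_i + I
  y_i <- ye_i + rnorm(n=N, mean=0, sd=sqrt(var(ye_i)/2))
  cor(x_i, y_i)^2
})
#mean squared Pearson's r across replicates; should be very close to 2/3
mean(r2_reps)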
Congratulations!
We’ve successfully simulated an X variable and a Y
variable with added random error. In addition, we’ve managed to control
the amount of variance in Y explained by X versus random error. This
skill is useful, for example, for carrying out a power analysis when
some preliminary information is known about the relationships between
variables.
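As a minimal sketch of what such a power analysis might look like, we could simulate many small datasets with a weaker relationship and ask how often the slope is detected at alpha = 0.05; the sample size, Beta coefficient, and error standard deviation below are arbitrary illustrative choices.
#sketch of a simple power analysis: how often is the slope significant at alpha = 0.05?
#(n_small, b_small, and the error sd are arbitrary illustrative choices)
set.seed(40)
n_small <- 20
b_small <- 0.5
signif_reps <- replicate(1000, {
  x_i <- runif(n=n_small, min=xmin, max=xmax)
  y_i <- b_small*x_i + I + rnorm(n=n_small, mean=0, sd=10)
  summary(lm(y_i ~ x_i))$coefficients["x_i", "Pr(>|t|)"] < 0.05
})
#estimated power is the proportion of replicates with a significant slope
mean(signif_reps)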
We could stop here, or we could confirm that the
parameters we specified (B, I, y_e, and the error variance) can be
recovered using simple linear regression. Let’s do that now.
Base R allows us to carry out simple linear
regression using the function lm().
#carry out simple linear regression
mod1 <- lm(y ~ x, data=df_x_ye_y)
Now that we’ve created a linear model, we can
use the function summary() on the model object to get a summary of
the information specified and estimated in the model.
#view model summary
summary(mod1)
##
## Call:
## lm(formula = y ~ x, data = df_x_ye_y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.332 -13.870 0.052 13.836 73.649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.00823 1.03658 9.655 <2e-16 ***
## x 4.98774 0.06411 77.797 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.54 on 2998 degrees of freedom
## Multiple R-squared: 0.6687, Adjusted R-squared: 0.6686
## F-statistic: 6052 on 1 and 2998 DF, p-value: < 2.2e-16
The summary includes details about the coefficients,
namely the intercept (which we simulated under the name “I”) and
the Beta coefficient for X (“B”).
Let’s compare the true parameter values (i.e., the values specified during simulation) with those estimated by the model, starting with the B coefficient and the intercept I. We can obtain these coefficients from a model object using the coefficients() function.
#compare true and model values for B coefficient and I (intercept)
mod1_coef <- data.frame(c(I, B), coefficients(mod1))
colnames(mod1_coef) <- c("TRUE", "MODEL")
rownames(mod1_coef) <- c("Intercept", "B")
kable(mod1_coef)
|           | TRUE | MODEL     |
|-----------|------|-----------|
| Intercept | 10   | 10.008225 |
| B         | 5    | 4.987741  |
As we can see, the true and model values for B and I are
pretty close. You may try increasing the sample size during simulation
to see whether the parameter estimates returned by the model improve.
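As a quick sketch of that exercise, we might re-simulate with a 10-fold larger sample size (an arbitrary choice) and re-fit the model.
#re-simulate with a 10-fold larger sample size and compare the coefficient estimates
set.seed(3)
x_big <- runif(n=N*10, min=xmin, max=xmax)
ye_big <- B*x_big + I
y_big <- ye_big + rnorm(n=N*10, mean=0, sd=sqrt(var(ye_big)/2))
coefficients(lm(y_big ~ x_big))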
What about the variance in Y explained by X and random error? Let’s access the R-squared portion of the model summary, which contains an estimate of the variance in Y explained by X (previously discussed as squared Pearson’s r). The random error variance can be calculated as the complement.
#obtain the R-squared value of the model
y_x_variance <- summary(mod1)$r.squared
#obtain the random error variance
error_variance <- 1 - summary(mod1)$r.squared
#print the two values together
cat(paste("R-squared (variance in y explained by x) = ", y_x_variance, "\n", "Error variance = ", error_variance, sep=""))
## R-squared (variance in y explained by x) = 0.668743812519226
## Error variance = 0.331256187480774
The values estimated are close to the expected values of
2/3 for the variance in Y explained by X and 1/3 for the random error
variance.
Another interesting set of values returned by the linear model is the so-called “fitted values.” These refer to the expected value of Y given a value of X, based on the parameter values estimated in the model. Sound familiar? We saw previously that the linear model did a good job of recovering the true parameter values that we specified during simulation, so these fitted values should be very close to the expected Y values we simulated. If the fitted values were identical to the simulated expected Y values, they would fall perfectly along a 1:1 line when plotted against each other. Let’s plot them against each other now.
#create a data frame of x, y_e, y, and model fitted values
df_x_ye_y_fitted <- data.frame(x=x, y_e=y_e, y=y, mod1_fitted=mod1$fitted.values)
#plot fitted values against y_e
ggplot()+
geom_point(data=df_x_ye_y_fitted, mapping=aes(x=y_e, y=mod1_fitted), shape=21, colour="black", fill="slateblue1", size=2)+
geom_abline(slope=1, intercept=0, colour="orangered")+
labs(x="Expected Y", y="Model fitted Y")+
theme_bw()
The red line depicts a 1:1 relationship. The points
depart slightly from this line because the estimated Beta coefficient and
intercept depart slightly from the true simulated values. In general,
however, our model did a good job of recovering the simulated expected Y
values.
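As one last check on what the fitted values represent, we can confirm that they equal the estimated intercept plus the estimated Beta coefficient multiplied by X.
#confirm that the fitted values equal the estimated intercept plus the estimated slope times x
fitted_manual <- coefficients(mod1)[1] + coefficients(mod1)[2]*x
all.equal(unname(fitted_manual), unname(mod1$fitted.values))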
This concludes the tutorial for simulating data on the
basis of a simple linear relationship between a response variable Y and
explanatory variable X. Together, we 1) simulated data that fit the
assumptions of simple linear regression and 2) confirmed that a linear
model returns fairly accurate estimates of the true simulated parameter
values.