Multiple linear regression

From Wikiversity
Completion status: this resource is ~75% complete.
Educational level: this is a tertiary (university) resource.
Resource type: this resource consists of notes.

This learning resource summarises the main teaching points about multiple linear regression (MLR), including key concepts, principles, assumptions, and how to conduct and interpret MLR analyses.

Prerequisites:

  1. Correlation
  2. Linear regression.

For additional resources, see the See also section.

What is MLR?

Assumptions

View the accompanying screencast: [1]
  1. Level of measurement
    1. IVs: MLR involves two or more continuous (interval or ratio) or dichotomous variables (may require recoding into dummy variables)
    2. DV: One continuous (interval or ratio) variable
  2. Sample size (some rules of thumb):
    1. Total N based on ratio of cases to IVs:
      1. Min. 5 cases per predictor (5:1) (basically, you need enough data to provide reliable correlation estimates)
      2. Ideally 20 cases per predictor (20:1), with an overall N of at least 100; this allows sufficient power to detect a medium ES of R² = .13 (Francis, p. 128)
    2. Total N based on a constant plus ratio of cases to IVs:
      1. Tabachnick and Fidell (2007) suggest that N should ideally be 50 + 8(k) for testing a full regression model or 104 + k when testing individual predictors (where k is the number of IVs).
  3. Normality
    1. Check the univariate descriptive statistics (M, SD, skewness and kurtosis)
    2. Check the histograms with a normal curve imposed
    3. Estimates of correlations will be more reliable and stable when the variables are normally distributed
  4. Linearity
    1. Are the bivariate relationships linear?
    2. Check scatterplots and correlations between the DV (Y) and each of the IVs (Xs)
    3. Check for influence of bivariate outliers
  5. Homoscedasticity
    1. Are the bivariate distributions reasonably evenly spread about the line of best fit?
    2. Check scatterplots between Y and each of the Xs and/or check a scatterplot of the standardised residuals (ZRESID) against the standardised predicted values (ZPRED)
  6. Multicollinearity
    1. Is there multicollinearity between the IVs? Predictors should not be overly correlated with one another. Ways to check:
      1. Examine bivariate correlations and scatterplots between each of the IVs (i.e., are the predictors overly correlated e.g., above .7?).
      2. Check the collinearity statistics in the coefficients table:
        1. The Variance Inflation Factor (VIF) should be low (< ~3-10) and/or
        2. Tolerance should be high (> .1 to .3) (Note that TOL=1/VIF so only one needs to be used).
  7. Multivariate outliers
    1. Check whether there are influential multivariate outlying cases using Mahalanobis' Distance (MD) & Cook’s D (CD).
    2. Using SPSS: Linear Regression - Save - Mahalanobis and Cook's D - OK
      1. New variables called mah_1 and coo_1 will be added to the data file.
      2. Check the Residuals Statistics table in the output for the maximum MD and CD.
    3. The maximum MD should not exceed the critical chi-square value with degrees of freedom (df) equal to number of predictors, with critical alpha =.001. CD should not be greater than 1.
    4. If outliers are detected, check each case, and consider removing the case from the analysis.
    5. See also Francis 5.1.4.2 Screening for influential case
  8. Normality of residuals
    1. Residuals are more likely to be normally distributed if each of the variables is normally distributed
    2. Check histograms of all variables in an analysis
    3. Normally distributed variables will enhance the MLR solution
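The collinearity and multivariate outlier checks in the list above can also be sketched outside SPSS. The following is a minimal illustration with NumPy, not a reproduction of SPSS output: the function names (vif, tolerance, mahalanobis_sq, flag_outliers) and the small data set are hypothetical, and it assumes that the squared Mahalanobis distance is the quantity compared against the critical chi-square value.

```python
import numpy as np

# Critical chi-square values at alpha = .001, keyed by df (number of predictors).
CHI2_CRIT_001 = {1: 10.828, 2: 13.816, 3: 16.266, 4: 18.467, 5: 20.515}


def vif(X):
    """Variance Inflation Factor for each predictor (columns of X, k >= 2)."""
    X = np.asarray(X, dtype=float)
    # VIF_j is the j-th diagonal element of the inverse of the
    # correlation matrix of the predictors.
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def tolerance(X):
    """Tolerance is the reciprocal of VIF (TOL = 1 / VIF)."""
    return 1.0 / vif(X)


def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each case from the predictor centroid."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", centred, inv_cov, centred)


def flag_outliers(X):
    """Indices of cases whose squared distance exceeds the critical value."""
    X = np.asarray(X, dtype=float)
    critical = CHI2_CRIT_001[X.shape[1]]  # df = number of predictors
    return np.flatnonzero(mahalanobis_sq(X) > critical)


# Hypothetical data: 8 cases measured on 2 predictors.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3],
              [5, 6], [6, 5], [7, 8], [8, 7]], dtype=float)
print("VIF:", vif(X))
print("Tolerance:", tolerance(X))
print("Max squared MD:", mahalanobis_sq(X).max())
print("Flagged cases:", flag_outliers(X))
```

Flagged cases are candidates for inspection rather than automatic removal, in line with the advice to check each case.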
See also
  1. Allen & Bennett 13.3.2.1 Assumptions (pp. 178-179)
  2. Francis 5.1.4 Practical Issues and Assumptions (pp. 126-128)
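
The sample size rules of thumb listed under Assumptions reduce to simple arithmetic on the number of IVs (k). As a rough sketch, the hypothetical helper below returns the N suggested by each rule; the function name and dictionary labels are illustrative only.

```python
def sample_size_rules(k):
    """Suggested total N for k predictors under each rule of thumb."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "minimum (5:1)": 5 * k,
        "ideal (20:1, N >= 100)": max(20 * k, 100),
        "full model (50 + 8k)": 50 + 8 * k,
        "individual predictors (104 + k)": 104 + k,
    }


# Example: a model with 6 IVs.
for rule, n in sample_size_rules(6).items():
    print(f"{rule}: N = {n}")
```

When the rules disagree, the larger N is the more conservative choice.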

Types

There are several types of MLR, including:

Type Characteristics
Direct (or Standard)
  • All IVs are entered simultaneously
Hierarchical
  • IVs are entered in steps, i.e., some before others
  • Interpret: R² change, F change
Forward
  • The software enters IVs one by one until there are no more significant IVs to be entered
Backward
  • The software removes IVs one by one until there are no more non-significant IVs to remove
Stepwise
  • A combination of Forward and Backward MLR
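
For hierarchical MLR, the R² change and F change between two steps can be computed by fitting the smaller and larger models and comparing them. The sketch below assumes ordinary least squares with an intercept, uses NumPy, and runs on simulated data; the function names (r_squared, r2_change) are hypothetical.

```python
import numpy as np


def r_squared(X, y):
    """R-squared from an ordinary least squares fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    return 1.0 - (residuals @ residuals) / ((y - y.mean()) ** 2).sum()


def r2_change(X_step1, X_step2, y):
    """Compare two nested models.

    X_step2 must contain the step 1 IVs plus the IVs added at step 2.
    """
    n = len(y)
    k1 = np.asarray(X_step1).shape[1]
    k2 = np.asarray(X_step2).shape[1]
    r2_1, r2_2 = r_squared(X_step1, y), r_squared(X_step2, y)
    df1, df2 = k2 - k1, n - k2 - 1
    f_change = ((r2_2 - r2_1) / df1) / ((1.0 - r2_2) / df2)
    return {"r2_step1": r2_1, "r2_step2": r2_2,
            "r2_change": r2_2 - r2_1, "f_change": f_change, "df": (df1, df2)}


# Simulated data: 60 cases, 3 IVs; the third IV is entered at step 2.
rng = np.random.default_rng(0)
n = 60
X_all = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * X_all[:, 0] + 0.8 * X_all[:, 2] + rng.normal(size=n)
result = r2_change(X_all[:, :2], X_all, y)
print(result)
```

The F change would then be evaluated against an F distribution with the returned degrees of freedom.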

Results

  • MLR analyses produce several diagnostic and outcome statistics which are summarised below and are important to understand.
  • Make sure that you know how to find and interpret these statistics in statistical software output.

Correlations

Examine the linear correlations (usually as a correlation matrix, but also view the scatterplots) between:

  • IVs
  • each IV and the DV
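
As a small illustration, both sets of correlations can be read off a single correlation matrix. The sketch below uses NumPy with made-up data; the function name correlation_summary and the variable names are hypothetical.

```python
import numpy as np


def correlation_summary(ivs, dv):
    """Return (IV-by-IV correlation matrix, correlation of each IV with the DV)."""
    ivs = np.asarray(ivs, dtype=float)
    dv = np.asarray(dv, dtype=float)
    k = ivs.shape[1]
    # Put the DV in the last column, then correlate all columns.
    r = np.corrcoef(np.column_stack([ivs, dv]), rowvar=False)
    return r[:k, :k], r[:k, k]


# Made-up data: 5 cases, 2 IVs, 1 DV.
ivs = np.array([[2, 7], [4, 5], [5, 6], [7, 2], [9, 3]], dtype=float)
dv = np.array([3, 5, 6, 8, 11], dtype=float)

iv_iv, iv_dv = correlation_summary(ivs, dv)
print("Correlations between IVs:\n", iv_iv.round(2))
print("Correlation of each IV with the DV:", iv_dv.round(2))
```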

Effect sizes

R

  1. (Big) R is the multiple correlation coefficient for the relationship between the set of predictor variables and the outcome variable.
  2. Interpretation is similar to that for little r (the linear correlation between two variables); however, R can only range from 0 to 1, with 0 indicating no relationship and 1 a perfect relationship. Larger values of R indicate more variance explained in the DV.
  3. R can be squared and interpreted as for r², with a rough rule of thumb being .1 (small), .3 (medium), and .5 (large). These R² values would indicate 10%, 30%, and 50% of the variance in the DV explained, respectively.
  4. When generalising findings to the population, the R² for a sample tends to overestimate the R² of the population. Thus, adjusted R² is recommended when generalising from a sample. This value is adjusted downward based on the sample size and the number of predictors; the smaller the sample size, the greater the reduction.
  5. The statistical significance of R can be examined using an F test and its corresponding p level.
  6. Reporting example: R² = .32, F(6, 217) = 19.50, p < .001
    1. "6, 217" refers to the degrees of freedom: the first value is the number of predictors (k) and the second is N - k - 1. For more information, see about halfway down this page
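
The relationships among R², adjusted R², and the F test can be made concrete with a short calculation. This is a sketch using the standard formulas, with hypothetical function names and made-up input values (R² = .25, N = 100, k = 4); it is not tied to any particular software output.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n cases and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """F ratio for the overall model, with its degrees of freedom."""
    df1, df2 = k, n - k - 1
    f = (r2 / df1) / ((1.0 - r2) / df2)
    return f, df1, df2


# Made-up example values.
r2, n, k = 0.25, 100, 4
f, df1, df2 = f_statistic(r2, n, k)
print(f"R2 = {r2:.2f}, adjusted R2 = {adjusted_r2(r2, n, k):.3f}")
print(f"F({df1}, {df2}) = {f:.2f}")
```

Holding R² and k constant, a smaller N gives a lower adjusted R², which matches the point that the reduction is greater for smaller samples.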

Cohen's ƒ²
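
As a brief sketch, Cohen's ƒ² can be obtained from R² as R² / (1 - R²). The helper names below are hypothetical, and the labels use Cohen's conventional benchmarks of .02 (small), .15 (medium), and .35 (large). An R² of .13 gives an ƒ² of about .15, the conventional medium effect.

```python
def cohens_f2(r2):
    """Cohen's f-squared effect size from R-squared."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in the range [0, 1)")
    return r2 / (1.0 - r2)


def describe_f2(f2):
    """Label an f-squared value using Cohen's conventional benchmarks."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "negligible"


for r2 in (0.02, 0.13, 0.26, 0.50):
    f2 = cohens_f2(r2)
    print(f"R2 = {r2:.2f} -> f2 = {f2:.3f} ({describe_f2(f2)})")
```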

Coefficients

An MLR analysis produces several useful statistics about each of the predictors. These regression coefficients are usually presented in a table which may include:

  • B (unstandardised) - used for building a prediction equation
  • β (standardised) - indicates the relative strength of the predictors on a scale that usually ranges from -1 to 1
  • Partial correlations - indicate the unique correlations between each IV and the DV
  • Semi-partial (part) correlations - similar to partial correlations; squaring this value provides the proportion of variance in the DV uniquely explained by each IV
  • t, p - indicate the statistical significance of each IV
  • Confidence intervals - indicate the probable range of population values for the regression coefficients
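
To show how B and β relate, the sketch below fits an ordinary least squares model with NumPy and converts each unstandardised coefficient to a standardised one by rescaling with the standard deviations of the IV and the DV. The function name regression_coefficients and the simulated data are hypothetical.

```python
import numpy as np


def regression_coefficients(X, y):
    """Return (intercept, unstandardised B, standardised beta) from OLS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    intercept, b = coefs[0], coefs[1:]
    # beta_j = B_j * SD(x_j) / SD(y)
    beta = b * X.std(axis=0, ddof=1) / y.std(ddof=1)
    return intercept, b, beta


# Simulated data: 50 cases, 2 IVs measured on very different scales.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 50), rng.normal(0, 10, 50)])
y = 3.0 + 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1, 50)

intercept, b, beta = regression_coefficients(X, y)
print("Intercept:", round(float(intercept), 3))
print("B:", b.round(3))
print("beta:", beta.round(3))
```

Because the two IVs are on different scales, the B values are not directly comparable, whereas the β values are expressed in standard deviation units.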

Equation

  • A prediction equation can be derived from the regression coefficients in an MLR analysis.
  • The equation is of the form

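\hat{Y} = b_1 x_1 + b_2 x_2 + \dots + b_k x_k + a (for predicted values) or
Y = b_1 x_1 + b_2 x_2 + \dots + b_k x_k + a + e (for observed values)

where a is the intercept (constant), each b_i is the unstandardised coefficient (B) for predictor x_i, k is the number of IVs, and e is the residual.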
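
As a worked illustration of using a prediction equation with more than one predictor, the sketch below computes a predicted value and the corresponding residual for one case. The intercept, coefficients, and scores are made up, and the function name predict is hypothetical.

```python
def predict(intercept, coefficients, values):
    """Predicted Y: the intercept plus the sum of each B times its X value."""
    if len(coefficients) != len(values):
        raise ValueError("need one value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Made-up model with two predictors: a = 2.0, b1 = 0.5, b2 = -1.0.
a = 2.0
b = [0.5, -1.0]

# One case with x1 = 4.0, x2 = 1.0 and an observed Y of 3.5.
case = [4.0, 1.0]
observed = 3.5

predicted = predict(a, b, case)    # 2.0 + 0.5 * 4.0 - 1.0 * 1.0 = 3.0
residual = observed - predicted    # e = Y - predicted Y = 0.5
print(predicted, residual)
```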

Residuals

A residual is the difference between the actual value of a DV and its predicted value. Each case will have a residual for each MLR analysis. Three key assumptions can be tested using plots of residuals:

  1. Linearity: IVs are linearly related to DV
  2. Normality of residuals
  3. Equal variances (Homoscedasticity)
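
The values plotted in a residual plot can be computed directly. The sketch below assumes one common definition: standardised predicted values are z-scores of the predicted values, and standardised residuals are the residuals divided by the standard error of the estimate. The function name and the simulated data are hypothetical, and the plotting step itself is left out.

```python
import numpy as np


def standardised_pred_resid(X, y):
    """Return (standardised predicted values, standardised residuals)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    predicted = design @ coefs
    residuals = y - predicted
    # Standard error of the estimate: sqrt(SS_residual / (N - k - 1)).
    std_error = np.sqrt((residuals @ residuals) / (n - k - 1))
    zpred = (predicted - predicted.mean()) / predicted.std(ddof=1)
    zresid = residuals / std_error
    return zpred, zresid


# Simulated data: 40 cases, 2 IVs.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = 1.0 + 0.6 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(size=40)

zpred, zresid = standardised_pred_resid(X, y)
# Cases with large absolute standardised residuals deserve a closer look.
print("Largest |standardised residual|:", np.abs(zresid).max().round(2))
```

Plotting zresid against zpred and looking for curvature or a funnel shape corresponds to the linearity and homoscedasticity checks listed above; a histogram of zresid addresses normality of residuals.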

Power

Advanced concepts

Writing up

When writing up the results of an MLR, consider describing:

  • Assumptions: How were they tested? To what extent were the assumptions met?
  • Correlations: What are they? Consider the correlations between the IVs and the DV separately from the correlations among the IVs.
  • Regression coefficients: Report a table and interpret
  • Causality: Be aware of the limitations of the analysis - it may be consistent with a causal relationship, but it is unlikely to prove causality

FAQ

What if there are univariate outliers?

Basically, explore and consider what the implications might be: do these "outliers" affect the assumptions? A lot depends on how "outliers" are defined. It is probably better to consider distributions in terms of the shape of the histogram, skewness, and kurtosis, and whether these values are unduly affecting the estimates of linear relations between variables. Ultimately, the researcher needs to decide whether the outliers are so severe that they are unduly influencing the results of analyses or whether they are relatively benign. If unsure, explore, test, and try the analyses with and without these values. If still unsure, be conservative and remove the data points or recode the data.

See also

Search for Multiple linear regression on Wikipedia.

External links