Multiple linear regression/Assumptions

View the accompanying screencast: [1]

Level of measurement

  1. DV: A normally distributed interval or ratio variable
  2. IVs: Two or more normally distributed interval or ratio variables or dichotomous variables. Note that it may be necessary to recode non-normal interval or ratio IVs, or multichotomous categorical or ordinal IVs, into dichotomous variables or a series of dummy variables.
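
To illustrate the recoding of a multichotomous categorical IV into a series of dummy variables, here is a minimal Python sketch. The function name `dummy_code`, the example variable `marital`, and the choice of reference category are all hypothetical and chosen only for illustration; statistical packages provide their own recoding tools.

```python
def dummy_code(values, reference=None):
    """Recode a categorical variable into k - 1 dummy (0/1) variables.

    `reference` is the category coded 0 on every dummy. It defaults to the
    first category in sorted order (an arbitrary, illustrative choice).
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]
    if reference not in categories:
        raise ValueError("reference category not found in values")
    others = [c for c in categories if c != reference]
    # One 0/1 variable per non-reference category.
    return {c: [1 if v == c else 0 for v in values] for c in others}


# Hypothetical IV with three categories.
marital = ["single", "married", "divorced", "married", "single"]
dummies = dummy_code(marital, reference="single")
# dummies == {"divorced": [0, 0, 1, 0, 0], "married": [0, 1, 0, 1, 0]}
```

A variable with k categories yields k - 1 dummy variables, because the reference category is identified by zeros on all of them.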

Sample size

Enough data is needed to provide reliable estimates of the correlations. As the number of IVs increases, more inferential tests are conducted, so more data is needed; otherwise, the estimates of the regression line are likely to be unstable and unlikely to replicate if the study is repeated.

Some rules of thumb:

  1. Use at least 50 cases plus at least 10 to 20 times as many cases as there are IVs.
  2. Green (1991) and Tabachnick and Fidell (2007):
    1. 50 + 8(k) for testing an overall regression model and
    2. 104 + k when testing individual predictors (where k is the number of IVs)
    3. These rules are based on detecting a medium effect size (β ≥ .20), with critical α ≤ .05 and power of 80%.

For greater accuracy, conduct study-specific power and sample size calculations (e.g., use the A-priori Sample Size Calculator for Multiple Regression; note that this calculator uses f² for the anticipated effect size, so see its Formulas link for how to convert R² to f²).
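
The 50 + 8(k) and 104 + k rules of thumb, and the conversion from R² to f², can be sketched in a few lines of Python. The function names are hypothetical. The conversion uses Cohen's definition f² = R² / (1 - R²). Taking the larger of the two rule-of-thumb values is an assumption for the case where both the overall model and the individual predictors will be tested.

```python
def min_n_overall_model(k):
    """Rule of thumb for testing the overall model: N >= 50 + 8k."""
    return 50 + 8 * k


def min_n_individual_predictors(k):
    """Rule of thumb for testing individual predictors: N >= 104 + k."""
    return 104 + k


def r2_to_f2(r2):
    """Convert R^2 to Cohen's f^2 = R^2 / (1 - R^2)."""
    if not 0 <= r2 < 1:
        raise ValueError("R^2 must be in the interval [0, 1)")
    return r2 / (1 - r2)


k = 5  # hypothetical number of IVs
# Assumption: both kinds of test are planned, so use the larger value.
required_n = max(min_n_overall_model(k), min_n_individual_predictors(k))
f2 = r2_to_f2(0.13)  # hypothetical anticipated R^2
```

With five IVs, the two rules give 90 and 109 cases respectively, so this sketch suggests at least 109 cases. A dedicated power analysis remains preferable to either rule.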

Normality

  1. Check the univariate descriptive statistics (M, SD, skewness and kurtosis). As a general guide, skewness and kurtosis should be between -1 and +1.
  2. Check the histograms with a normal curve superimposed.
  3. Note: Be wary of (i.e., avoid!) using inferential tests of normality (e.g., the Shapiro–Wilk test); they are notoriously overly sensitive for the purposes of regression.
  4. Normally distributed variables will enhance the MLR solution. Estimates of correlations will be more reliable and stable when the variables are normally distributed, but regression is reasonably robust to minor to moderate deviations from normality when moderate to large sample sizes are involved. Also examine scatterplots for bivariate outliers, because non-normal univariate data may make bivariate and multivariate outliers more likely.
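
The skewness and kurtosis guide of -1 to +1 can be checked with a short Python sketch. The function names and data are hypothetical. Note the assumption: the code uses the simple moment-based formulas (with excess kurtosis, so a normal distribution scores 0), whereas statistical packages such as SPSS apply small-sample adjustments, so their values will differ slightly.

```python
def skewness_and_kurtosis(values):
    """Return moment-based skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("variable has zero variance")
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skew, excess_kurtosis


def within_guide(values, limit=1.0):
    """True if both skewness and excess kurtosis lie between -limit and +limit."""
    skew, kurt = skewness_and_kurtosis(values)
    return abs(skew) < limit and abs(kurt) < limit


# Hypothetical variables: one roughly symmetric, one strongly right-skewed.
symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]
skewed = [1, 1, 1, 1, 2, 2, 3, 20]
symmetric_ok = within_guide(symmetric)
skewed_ok = within_guide(skewed)
```

A numeric check like this complements, rather than replaces, inspecting the histograms.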

Linearity

Check scatterplots between the DV (Y) and each of the IVs (Xs) to determine linearity:

  1. Are there any bivariate outliers? If so, consider removing the outliers.
  2. Are there any non-linear relationships? If so, consider using a more appropriate type of regression.

Homoscedasticity

Based on the scatterplots between the IVs and the DV:

  1. Are the bivariate distributions reasonably evenly spread about the line of best fit?
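  2. Homoscedasticity can also be checked via the residuals (see Normality of residuals): the scatterplot of predicted values against residuals should show a reasonably even spread.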

Multicollinearity

IVs should not be overly correlated with one another. Ways to check:

  1. Steps are shown in this screencast: [2]
  2. Examine bivariate correlations and scatterplots between each of the IVs (i.e., are the predictors overly correlated - above ~.7?).
  3. Check the collinearity statistics in the coefficients table:
    1. Various recommendations for acceptable levels of VIF and Tolerance have been published.
    2. Variance Inflation Factor (VIF) should be low (< 3 to 10) or
    3. Tolerance should be high (> .1 to .3)
    4. Note that VIF and Tolerance have a reciprocal relationship (i.e., TOL=1/VIF), so only one of the indicators needs to be used.
  4. For more information, see [3]
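
The reciprocal relationship between Tolerance and VIF can be illustrated for the simplest case of exactly two IVs. This is a minimal sketch under that assumption: with two IVs, the R² from regressing one IV on the other equals their squared bivariate correlation, so Tolerance = 1 - r². With three or more IVs, each IV's Tolerance instead comes from regressing that IV on all of the other IVs, which this sketch does not cover. The function names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def tolerance_and_vif(r):
    """Tolerance and VIF for a model with exactly two IVs correlated at r."""
    tolerance = 1 - r ** 2
    if tolerance == 0:
        raise ValueError("IVs are perfectly correlated; VIF is undefined")
    return tolerance, 1 / tolerance  # VIF = 1 / Tolerance


# Hypothetical scores on two IVs.
iv1 = [1, 2, 3, 4, 5]
iv2 = [2, 1, 4, 3, 5]
r = pearson_r(iv1, iv2)
tolerance, vif = tolerance_and_vif(r)
```

Here the two IVs correlate at .8, above the ~.7 guide, giving a Tolerance of .36 and a VIF of about 2.78.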

Multivariate outliers

Check whether there are influential MVOs using Mahalanobis' Distance (MD) and/or Cook’s D (CD):

  1. Steps are shown in these screencasts: [4][5][6]
  2. SPSS: Linear Regression - Save - Mahalanobis (can also include Cook's D)
    1. After execution, new variables called mah_1 (and coo_1) will be added to the data file.
    2. In the output, check the Residuals Statistics table for the maximum MD and CD.
    3. The maximum MD should not exceed the critical chi-square value with degrees of freedom (df) equal to the number of predictors, with critical alpha = .001. CD should not be greater than 1.
  3. If outliers are detected:
    1. Go to the data file, sort the data in descending order by mah_1, identify the cases with mah_1 distances above the critical value, and consider why these cases have been flagged (these cases will each have an unusual combination of responses for the variables in the analysis, so check their responses).
    2. Remove these cases and re-run the MLR.
      1. If the results are very similar (e.g., similar R² and coefficients for each of the predictors), then it is best to use the original results (i.e., including the multivariate outliers).
      2. If the results are different when the MVOs are not included, then these cases probably have had undue influence and it is best to report the results without these cases.
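
The critical chi-square value for MD can be computed rather than looked up in a table. The sketch below uses only the Python standard library: it evaluates the chi-square upper-tail probability with a standard recurrence for the incomplete gamma function and finds the critical value by bisection. The function names and the example values standing in for mah_1 and coo_1 are hypothetical. Where SciPy is available, `scipy.stats.chi2.ppf(1 - alpha, df)` gives the same critical value.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)             # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # closed form for df = 1
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q


def chi2_critical(df, alpha=0.001):
    """Critical chi-square value whose upper-tail probability equals alpha."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):            # bisection
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def flag_cases(md_values, cd_values, n_predictors, alpha=0.001):
    """Indices of cases with MD above the critical value or Cook's D above 1."""
    critical = chi2_critical(n_predictors, alpha)
    return [i for i, (md, cd) in enumerate(zip(md_values, cd_values))
            if md > critical or cd > 1]


# Hypothetical saved values for four cases from a model with three predictors.
mah_1 = [2.1, 17.3, 5.4, 9.8]
coo_1 = [0.02, 0.40, 1.30, 0.10]
flagged = flag_cases(mah_1, coo_1, n_predictors=3)
```

With three predictors the critical value is about 16.27, so the second case is flagged on MD and the third on Cook's D. Flagged cases should still be inspected before any decision to remove them.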

Normality of residuals

The residuals should be normally distributed around 0.

  1. Residuals are more likely to be normally distributed if each of the variables is normally distributed, so check normality first.
  2. There are three ways of visualising residuals. In SPSS - Analyze - Regression - Linear - Plots:
    1. Scatterplot: ZPRED on the X-axis and ZRESID on the Y-axis
    2. Histogram: Check on
    3. Normal probability plot: Check on
  3. The scatterplot should have no pattern (i.e., be a "blob").
  4. The histogram should be normally distributed.
  5. The points in the normal probability plot should fall along the diagonal line.
  6. If the residuals are not normally distributed, there is probably something wrong with the distribution of one or more variables, so re-check them.