Multiple linear regression/Assumptions

From Wikiversity

Assumptions

View the accompanying screencast: [1]
  1. Level of measurement
    1. IVs: MLR involves two or more continuous (interval or ratio) or dichotomous variables (may require recoding into dummy variables)
    2. DV: One continuous (interval or ratio) variable
  2. Sample size (some rules of thumb):
    1. Total N based on ratio of cases to IVs:
      1. Min. 5 cases per predictor (5:1) (basically, you need enough data to provide reliable correlation estimates)
      2. Ideally 20 cases per predictor (20:1), with an overall N of at least 100; this allows sufficient power to detect a medium effect size (ES) of R² = .13 (Francis, p. 128)
    2. Total N based on a constant plus ratio of cases to IVs:
      1. Tabachnick and Fidell (2007) suggest that N should ideally be at least 50 + 8k for testing a full regression model, or at least 104 + k when testing individual predictors (where k is the number of IVs).
  3. Normality
    1. Check the univariate descriptive statistics (M, SD, skewness and kurtosis)
    2. Check the histograms with a normal curve superimposed
    3. Estimates of correlations will be more reliable and stable when the variables are normally distributed
  4. Linearity
    1. Are the bivariate relationships linear?
    2. Check scatterplots and correlations between the DV (Y) and each of the IVs (Xs)
    3. Check for influence of bivariate outliers
  5. Homoscedasticity
    1. Are the bivariate distributions reasonably evenly spread about the line of best fit?
    2. Check scatterplots between Y and each of the Xs and/or check the scatterplot of the residuals (ZRESID) against the predicted values (ZPRED)
  6. Multicollinearity
    1. Is there multicollinearity between the IVs? Predictors should not be overly correlated with one another. Ways to check:
      1. Examine bivariate correlations and scatterplots between each pair of IVs (i.e., are any predictors overly correlated, e.g., r above .7?).
      2. Check the collinearity statistics in the coefficients table:
        1. The Variance Inflation Factor (VIF) should be low (suggested cut-offs range from about 3 to 10) and/or
        2. Tolerance (TOL) should be high (suggested cut-offs range from about .1 to .3). Note that TOL = 1/VIF, so only one of the two needs to be checked.
  7. Multivariate outliers
    1. Check whether there are influential multivariate outlying cases using Mahalanobis' Distance (MD) and Cook's D (CD).
    2. Using SPSS: Linear Regression - Save - Mahalanobis and Cook's D - OK
      1. New variables called mah_1 and coo_1 will be added to the data file.
      2. Check the Residuals Statistics table in the output for the maximum MD and CD.
    3. The maximum MD should not exceed the critical chi-square value with degrees of freedom equal to the number of predictors, using a critical alpha of .001. CD should not be greater than 1.
    4. If outliers are detected, check each case, and consider removing the case from the analysis.
    5. See also Francis 5.1.4.2 Screening for influential case
  8. Normality of residuals
    1. Residuals are more likely to be normally distributed if each of the variables is normally distributed
    2. Check histograms of all variables in an analysis
    3. Normally distributed variables will enhance the MLR solution
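The sample size rules of thumb in item 2 are simple arithmetic, and can be sketched as follows. This is only an illustration: the function name `required_n` and the dictionary keys are made up for this example, and the rules themselves remain rough guides rather than a substitute for a power analysis.

```python
def required_n(k: int) -> dict:
    """Rule-of-thumb minimum sample sizes for a regression with k predictors (IVs)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        # Bare minimum: 5 cases per predictor
        "min_5_per_iv": 5 * k,
        # Ideal: 20 cases per predictor, with an overall N of at least 100
        "ideal_20_per_iv": max(20 * k, 100),
        # Tabachnick and Fidell (2007): testing the full regression model
        "full_model_50_plus_8k": 50 + 8 * k,
        # Tabachnick and Fidell (2007): testing individual predictors
        "individual_predictors_104_plus_k": 104 + k,
    }


# Example: a model with 4 IVs
rules = required_n(4)
for rule, n in rules.items():
    print(f"{rule}: N >= {n}")
```

With 4 IVs the rules give minimums of 20, 100, 82 and 108 cases respectively, so the more conservative rules would call for a little over 100 cases.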
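The collinearity statistics in item 6 can also be computed by hand, which may help show where they come from. The sketch below assumes the standard result that the VIF for each predictor is the corresponding diagonal element of the inverse of the predictors' correlation matrix; the helper names (`pearson`, `invert`, `collinearity_stats`) and the example data are hypothetical, and in practice the values would be read from the SPSS coefficients table.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def collinearity_stats(predictors):
    """Map each predictor name to (VIF, tolerance).

    predictors: dict of name -> list of values (IVs only, not the DV).
    """
    names = list(predictors)
    corr = [[pearson(predictors[a], predictors[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: (inv[i][i], 1.0 / inv[i][i]) for i, name in enumerate(names)}


# Hypothetical data: two predictors that correlate at r = .8
ivs = {"x1": [1, 2, 3, 4, 5], "x2": [2, 1, 4, 3, 5]}
for name, (vif, tol) in collinearity_stats(ivs).items():
    print(f"{name}: VIF = {vif:.2f}, TOL = {tol:.2f}")
```

In this two-predictor example r = .8, so VIF = 1 / (1 - .64), about 2.78, and TOL = .36. This illustrates that a bivariate correlation above .7 does not necessarily push the VIF past the more lenient cut-offs.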
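For the multivariate outlier check in item 7, the critical chi-square value usually comes from a table. As a rough sketch, it can instead be computed with the standard library alone, using a series expansion of the chi-square distribution function and bisection. The function names and the example MD and CD values below are hypothetical; the two lists stand in for the mah_1 and coo_1 variables that SPSS saves.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF, via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half_x / (a + n)
        total += term
        n += 1
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))


def chi2_critical(df, alpha=0.001):
    """Critical chi-square value for the given df and alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1 - alpha:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def flag_cases(md, cd, n_predictors, alpha=0.001, cd_cutoff=1.0):
    """Return the indices of cases whose MD exceeds the critical value or whose CD exceeds the cutoff."""
    crit = chi2_critical(n_predictors, alpha)
    return [i for i, (m, c) in enumerate(zip(md, cd)) if m > crit or c > cd_cutoff]


# Hypothetical saved values for three cases from a model with 3 predictors
mah_1 = [2.1, 17.9, 5.4]
coo_1 = [0.02, 0.35, 1.2]
print(f"critical MD (df = 3, alpha = .001): {chi2_critical(3):.3f}")
print("cases to inspect:", flag_cases(mah_1, coo_1, n_predictors=3))
```

With 3 predictors the critical value is about 16.27, so the second case is flagged on MD and the third on CD. Flagged cases should still be inspected individually before any decision to remove them.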
See also
  1. Allen & Bennett 13.3.2.1 Assumptions (pp. 178-179)
  2. Francis 5.1.4 Practical Issues and Assumptions (pp. 126-128)