Outliers and restricted range
1. Data file: aggr.sav
2. This dataset was collected by Bernd Heubeck (Division of Psychology, ANU). It compares a sample of 89 "normal" children, aged 8 to 14 years, from Western Sydney with a sample of 89 "clinic" children from the same area who had been referred to a children's psychiatric clinic. The variable clinic is coded 1 for normal children and 2 for clinic children. Aggressiveness ratings of each child were obtained independently from fathers (faggr) and mothers (maggr). The aggressiveness rating scale ranges from 0 (low) to 40 (high).
3. To what extent do mothers' and fathers' Aggressive Behaviour ratings (maggr and faggr respectively) agree with one another for the normal sample? (i.e., what is the product-moment correlation and what % of variance does one variable explain in the other variable?)
• Select only the normal cases (Data - Select Cases - If clinic = 1 - Filtered)
• Scatterplot maggr by faggr
• Chart editor - add bins for datapoints
• Correlation - bivariate
• r = .56, r2 = .31 (i.e., 31%), p < .001, N = 89
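The SPSS output above can be cross-checked by hand. The following is a minimal Python sketch of the product-moment correlation and the proportion of variance explained. The function name pearson_r and the short rating lists are assumptions made up for illustration; they are not the aggr.sav data. For the reported r = .56, squaring gives .3136, i.e. about 31%.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up ratings for illustration only (NOT the aggr.sav data).
maggr = [2, 5, 3, 8, 6, 1, 4]
faggr = [3, 4, 2, 7, 7, 2, 3]

r = pearson_r(maggr, faggr)
r_squared = r ** 2  # proportion of variance one variable explains in the other
print(f"r = {r:.2f}, r2 = {r_squared:.2f} ({r_squared:.0%} of variance)")
```

Squaring r is all that is needed to move from the correlation to the % of variance explained, which is why r2 shrinks much faster than r when the correlation drops.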
4. What happens if you remove the outliers (children in the normal sample with maggr and faggr ratings over 20)? Why?
• To identify the cases which are outliers you can sort the data in descending order by maggr and faggr. Then manually delete the two cases (cases 4 and 15 after sorting) from the normal sample with maggr and faggr ratings over 20, and re-run the scatterplot and correlations.
• r = .38, r2 = .14, p < .001, N = 87
• r drops from .56 to .38 because the outliers were "in-line" with the line of best fit and were therefore inflating the correlation. Outliers can also deflate the correlation to the extent that they lie perpendicular to the line of best fit. (If you don't understand this, go back to exploring the effect of an outlier in regressp.exe.)
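The effect of an outlier's position can also be explored numerically. The sketch below is a rough illustration under stated assumptions: the six base points and the two added outliers are invented for the demonstration and are not taken from aggr.sav, and pearson_r is a hypothetical helper name. One outlier is extreme on both variables (in-line with the trend); the other is extreme on one variable only (off the trend).

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented base data with a weak positive trend (NOT the aggr.sav data).
base_x = [1, 2, 3, 4, 5, 6]
base_y = [3, 1, 4, 2, 5, 3]

# Outlier that is high on both variables, i.e. roughly in-line with the trend.
inline_x, inline_y = base_x + [15], base_y + [14]

# Outlier that is high on y but low on x, i.e. well off the trend.
off_x, off_y = base_x + [1], base_y + [14]

r_base = pearson_r(base_x, base_y)
r_inline = pearson_r(inline_x, inline_y)
r_off = pearson_r(off_x, off_y)

print(f"base points only:     r = {r_base:.2f}")
print(f"with in-line outlier: r = {r_inline:.2f}")
print(f"with off-line outlier: r = {r_off:.2f}")
```

A single in-line point pulls r upward, while a single off-line point pulls it downward, which is the same pattern regressp.exe lets you explore interactively.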
5. Now include the rest of the sample. Examine a new scatterplot, and compute the r and r2. What has happened? Why?
• To restore the deleted outliers, close the datafile without saving, then reopen it.
• r = .68, r2 = .46, N = 178
• r is greater than .38 and about three times as much variance is now explained (46% versus 14%). The reason is RANGE RESTRICTION in the normal sample, which tends to contain only low ratings. By including the clinic sample, we now have high aggressiveness ratings as well, so the range is no longer restricted.
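Range restriction can be illustrated with a small simulation. This is only a sketch under invented assumptions: the group means, spreads, noise level, sample sizes, and the helper names simulate_group and pearson_r are all made up for the demonstration and do not describe Heubeck's data. Each simulated child has a latent aggressiveness score that both parents rate with independent error; the "normal" group is given low, tightly clustered scores and the "clinic" group higher, more spread-out scores.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def simulate_group(n, mean, sd, noise_sd=3.0):
    """Return (mother, father) rating lists that share a latent child score.

    Values are not clamped to the 0-40 scale; this is a toy model only.
    """
    mothers, fathers = [], []
    for _ in range(n):
        latent = rng.gauss(mean, sd)  # child's "true" aggressiveness
        mothers.append(latent + rng.gauss(0, noise_sd))
        fathers.append(latent + rng.gauss(0, noise_sd))
    return mothers, fathers

# Assumed parameters: low, narrow range for "normal"; higher, wider for "clinic".
normal_m, normal_f = simulate_group(1000, mean=5, sd=3)
clinic_m, clinic_f = simulate_group(1000, mean=15, sd=6)

r_normal = pearson_r(normal_m, normal_f)
r_all = pearson_r(normal_m + clinic_m, normal_f + clinic_f)

print(f"normal group only (restricted range): r = {r_normal:.2f}")
print(f"normal + clinic (full range):         r = {r_all:.2f}")
```

The rating error is identical in both runs; only the spread of the underlying scores changes, so the higher combined correlation comes from the wider range rather than from better agreement between parents.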