Effect size

Educational level: this is a tertiary (university) resource.

Completion status: this resource is ~75% complete.

In statistical inference, an effect size is a measure of the strength of the relationship between two variables. Effect sizes are useful descriptive statistics: they provide a standard metric for comparing results across studies and are therefore central to meta-analysis.

When reporting statistical significance for an inferential test, effect size(s) should also be reported. The American Psychological Association (APA) Task Force on Statistical Inference emphasizes this point (Wilkinson & APA Task Force on Statistical Inference, 1999): "reporting and interpreting effect sizes in the context of previously reported effects is essential to good research" (p. 599, emphasis added).^[1]^[2]

This page provides an undergraduate-level introduction to effect sizes and their usage.

Why use effect sizes?

An inferential test may be statistically significant (i.e., the result would be unlikely if there were no true effect), but this doesn’t necessarily indicate how large the effect is.

Conversely, there may be notable effects that are non-significant, especially in low-powered tests.

Effect sizes are influenced by sample size and by the number of groups/conditions (e.g., see Murray and Dosser, 1987). Therefore, caution is needed when interpreting effect sizes and when comparing them between studies. Effect sizes give additional information about the effects of independent variables within the same study, but should not be used to compare effects between studies unless sample size and study design are carefully taken into account.

Types

Some commonly used effect sizes are:

Correlational

All correlations are effect sizes
r or r² (bivariate linear correlation or correlation squared)
R or R² (multiple correlation or coefficient of determination)
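
As a minimal sketch of how r and r² are obtained, the following computes the bivariate Pearson correlation from raw scores using only the Python standard library. The function name `pearson_r` and the data (hours of study and test scores for five students) are hypothetical and chosen purely for illustration.

```python
import math

def pearson_r(x, y):
    """Return the Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_r(hours, scores)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")  # r = 0.775, r squared = 0.600
```

Here r² = .60 would be read as 60% of the variance in one variable being shared with the other.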

Mean difference

Cohen's d, Hedges' g, and other standard-deviation-unit effect sizes express the difference between two means (M) in standard deviation units:
- Cohen's d = (M₂ – M₁) / σ (when the population standard deviation, σ, is known)
- Cohen's d = (M₂ – M₁) / SD_pooled (when the pooled sample standard deviation is used instead)
Not readily available in SPSS, so use a separate calculator, e.g., Cohensd.xls
Interpreting:
- Negative value = negative difference/effect
- 0 = no difference/effect
- Positive value = positive difference/effect
$\eta^{2}$ and $\eta_{p}^{2}$ (total variance explained and partial variance explained in ANOVA; each IV will have its own $\eta_{p}^{2}$; total $\eta^{2}$ is equivalent to R²). See also: Eta-squared
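
The Cohen's d formula above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the page does not define SD_pooled, so the sketch assumes the usual pooled estimate, in which each group's sample variance is weighted by its degrees of freedom. The function name `cohens_d` and the data are hypothetical.

```python
import math
import statistics

def cohens_d(group1, group2):
    """Return (M2 - M1) / SD_pooled for two independent groups.

    Assumption: SD_pooled weights each sample variance by its degrees
    of freedom (n - 1); the source page does not spell this out.
    """
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(group2)
    sd_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(group2) - statistics.mean(group1)) / sd_pooled

# Hypothetical scores, for illustration only
control = [2, 4, 4, 5, 5]
treatment = [4, 5, 6, 6, 9]

print(round(cohens_d(control, treatment), 2))  # 1.26 (treatment mean is higher)
print(round(cohens_d(treatment, control), 2))  # -1.26 (sign follows group order)
```

The sign depends only on which group is treated as M₂, which matches the interpretation of negative and positive values given above.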

Interpreting standardised mean differences

There are no agreed standards for how to interpret an effect size (ES). Interpretation is ultimately subjective.

Cohen (1977):
- .2 = small
- .5 = moderate
- .8 = large
Wolf (1986):
- .25 = educationally significant
- .50 = practically significant (therapeutic)
Standardised Mean ESs are proportional, e.g., .40 is twice as much change as .20
The meaning of effect size depends on context:
- A small ES can be impressive if, e.g., a variable is difficult to change (e.g. a personality construct) and/or very valuable (e.g. an increase in life expectancy).
- A large ES doesn’t necessarily mean that there is any practical value e.g. if it isn’t related to the aims of the investigation (e.g. religious orientation).
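
For convenience, Cohen's (1977) rules of thumb can be expressed as a small lookup. This is only a sketch: treating the listed values as lower cut-offs, applying them to the absolute value of d, and the label "below small" are assumptions made here for illustration. As noted above, such labels are no substitute for interpreting the effect in context.

```python
def cohen_label(d):
    """Map a standardised mean difference to Cohen's (1977) verbal labels.

    Assumption: .2, .5 and .8 are treated as lower bounds of each band,
    and the sign of d is ignored when judging magnitude.
    """
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "below small"  # hypothetical label for values under .2

for d in (0.1, 0.3, 0.67, -0.9):
    print(d, cohen_label(d))
```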

Converting effect sizes

Equating d and r

For equal and large sample sizes the following formulae are appropriate:

Eq. 1

$r={\sqrt {\frac {d^{2}}{d^{2}+4}}}$

and conversely,

Eq. 2

$d={\frac {2r}{\sqrt {1-r^{2}}}}$

as discussed by Cohen (1965, 1988), Friedman (1968), Glass, McGraw, and Smith (1981), Rosenthal (1984), and Wolf (1986).
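
Eq. 1 and Eq. 2 can be sketched as a pair of Python functions. The names `d_to_r` and `r_to_d` are hypothetical. One deliberate deviation is labelled in the code: Eq. 1 as written always yields a non-negative r, whereas this sketch keeps the sign of d so that the two conversions are inverses of each other.

```python
import math

def d_to_r(d):
    """Eq. 1 (equal, large samples).

    Assumption: written as d / sqrt(d^2 + 4) so that the sign of d is
    preserved; Eq. 1 itself gives the magnitude only.
    """
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r):
    """Eq. 2: convert a correlation back to a standardised mean difference."""
    return 2 * r / math.sqrt(1 - r ** 2)

r = d_to_r(0.5)
print(round(r, 4))          # 0.2425
print(round(r_to_d(r), 4))  # 0.5 (round trip recovers the original d)
```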

The correct formula for unequal or small sample sizes is given by Aaron, Kromrey, and Ferron (1998)^[3]:

Eq. 3

$r^{2}={\frac {d^{2}}{d^{2}+(n_{1}+n_{2}-2)\left({\frac {1}{n_{1}}}+{\frac {1}{n_{2}}}\right)}}$

In the case that $n_{1}\approx n_{2}$ and both are large enough that ${\frac {n_{i}-1}{n_{i}}}\approx 1$, the term $(n_{1}+n_{2}-2)({\frac {1}{n_{1}}}+{\frac {1}{n_{2}}})$ approaches 4 and this reduces to Eq. 1.
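
A short sketch of Eq. 3 shows how much the sample-size correction can matter. The function name `d_to_r_unequal` and the sample sizes are hypothetical; as in the equal-n case, the sign of d is carried through to r, which is a choice made here rather than something Eq. 3 specifies.

```python
import math

def d_to_r_unequal(d, n1, n2):
    """Eq. 3: convert d to r, allowing for unequal or small group sizes."""
    # Sample-size term; it approaches 4 when n1 and n2 are equal and large
    a = (n1 + n2 - 2) * (1 / n1 + 1 / n2)
    return d / math.sqrt(d ** 2 + a)

# Hypothetical group sizes, for illustration only
print(round(d_to_r_unequal(0.5, 10, 40), 4))        # 0.2
print(round(d_to_r_unequal(0.5, 10000, 10000), 4))  # 0.2425, matching Eq. 1
```

With groups of 10 and 40 the sample-size term is 6 rather than 4, so the same d corresponds to a noticeably smaller r than Eq. 1 would suggest.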

Expressing d as a %

Look up a standard normal table (or use an auto-calculator such as [2]) for the area under the curve to the left of a given d value, e.g., a d of .5 corresponds to 0.6915, which indicates that the average treated person scored higher than about 69% of the untreated group. [3]
1. "Effect size can also be thought of as the average percentile standing of the average treatment (or experimental) participant relative to the average untreated (or control) participant. A d of 0.0 indicates that the mean of the treatment group is at the 50th percentile of the control group. A d of 0.8 indicates that the mean of the treatment group is at the 79th percentile of the control group. An effect size of 1.7 indicates that the mean of the treatment group is at the 95.5 percentile of the untreated group."[4]
2. "Effect sizes can also be interpreted in terms of the percent of nonoverlap of the treatment group's scores with those of the untreated group. A d of 0.0 indicates that the distribution of scores for the treatment group overlaps completely with the distribution of scores for the control group, there is 0% of nonoverlap. A d of 0.8 indicates a nonoverlap of 47.4% in the two distributions. A d of 1.7 indicates a nonoverlap of 75.4% in the two distributions."[5]
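
Both interpretations can be reproduced without a printed table by using the standard normal distribution in Python's `statistics` module. This is a sketch that assumes normally distributed scores with equal variances in both groups. The non-overlap function uses Cohen's U1 measure, which is assumed here to be the statistic behind the quoted figures because it reproduces them. The function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def percentile_standing(d):
    """Percentile of the untreated group at which the treated mean falls."""
    return 100 * STANDARD_NORMAL.cdf(d)

def percent_nonoverlap(d):
    """Cohen's U1: percentage of non-overlap between the two distributions.

    Assumption: both groups are normal with equal variances.
    """
    p = STANDARD_NORMAL.cdf(abs(d) / 2)
    return 100 * (2 * p - 1) / p

for d in (0.0, 0.5, 0.8, 1.7):
    print(d, round(percentile_standing(d), 1), round(percent_nonoverlap(d), 1))
```

For d = 0.8 this gives roughly the 79th percentile and 47.4% non-overlap, and for d = 1.7 roughly the 95.5th percentile and 75.4% non-overlap, in line with the quoted values.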

Exercise

Q: 20 athletes rate their personal playing ability, M = 3.4 (SD = .6) (on a scale of 1 to 5). After an intensive training program, the players rate their personal playing ability again, M = 3.8 (SD = .6). What is the ES? How good was the intervention? (For simplicity, this example uses the same SD for both occasions.)

A: Standardised mean effect size = (M₂ - M₁) / SD_pooled = (3.8 - 3.4) / .6 = .4 / .6 ≈ .67, i.e., a moderate-to-large change over time.
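
The arithmetic in the answer can be checked from the summary statistics alone. The sketch below follows the exercise's own approach and assumes that, with the same number of ratings on each occasion, SD_pooled is the root mean square of the two SDs (which is simply .6 when both SDs are .6). The function name `d_from_summary` is hypothetical.

```python
import math

def d_from_summary(m1, sd1, m2, sd2):
    """Standardised mean difference from means and SDs.

    Assumption: equal numbers of scores on both occasions, so the pooled
    SD is the root mean square of the two SDs.
    """
    sd_pooled = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m2 - m1) / sd_pooled

# Values taken from the exercise: before M = 3.4, after M = 3.8, SD = .6
d = d_from_summary(3.4, 0.6, 3.8, 0.6)
print(round(d, 2))  # 0.67
```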

FAQ

(Partial) Eta-squared vs. Cohen's d

In ANOVA, is reporting partial eta-squared sufficient, or should Cohen's d also be reported?

Partial eta-squared and Cohen's d provide two different types of effect size and both may be appropriate and useful in reporting the results of ANOVA.

Partial eta-squared indicates the proportion (often reported as a %) of the variance in the Dependent Variable (DV) attributable to a particular Independent Variable (IV), after setting aside the variance explained by any other IVs in the model. If the model has more than one IV, then report the partial eta-squared for each.
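
As an illustration of the "variance attributable to an IV" idea, the sketch below computes eta-squared for a one-way design as the between-groups sum of squares divided by the total sum of squares. With a single IV, eta-squared and partial eta-squared are the same number; designs with several IVs need the sums of squares from a full ANOVA, which this sketch does not attempt. The function name `eta_squared` and the data are hypothetical.

```python
def eta_squared(groups):
    """Eta-squared for a one-way design: SS_between / SS_total."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    # Total variability of every score around the grand mean
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    # Variability of the group means around the grand mean, weighted by group size
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total

# Hypothetical scores for three levels of one IV, for illustration only
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(round(eta_squared(groups), 4))  # 0.8125
```

In this made-up example about 81% of the variance in the DV is associated with group membership.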

Cohen's d indicates the size of the difference between two means in standard deviation units.

To be thorough, one should report partial eta-squared for each IV and Cohen's ds for each pairwise comparison of interest (e.g., if posthoc tests or planned comparisons were conducted, then Cohen's d should be provided for each contrast).

If F is non-significant for an IV, then reporting the partial eta-squared is sufficient and appropriate. If F is significant and you go on to do planned contrasts or posthoc tests, then each of these pairwise comparisons should also be accompanied by Cohen's d effect sizes.

Note that if the IV of interest only has two levels, then the partial eta-squared and the Cohen's d will communicate similar information but do so using different scales of measurement (% and SD units respectively).

Power and effect sizes in psychology

Ward (2002)^[4] examined articles in 3 psychology journals to assess the current status of statistical power and effect size measures:

Journal of Personality and Social Psychology
Journal of Consulting and Clinical Psychology
Journal of Abnormal Psychology

Ward (2002) found that:

7% of studies estimated or discussed statistical power.
30% calculated ES measures.
The average ES across studies was medium.
Current research designs typically do not have sufficient power to detect such an ES.
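
To give a feel for what "sufficient power" means for a medium effect, the following sketch approximates the power of a two-group comparison. It is a rough normal approximation, not an exact t-test calculation: it assumes two independent groups of equal size, a two-sided test, and it ignores the negligible chance of significance in the wrong direction. The function name `approx_power` and the group sizes are hypothetical and are not taken from Ward (2002).

```python
import math
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect a standardised mean difference d.

    Assumption: normal approximation for two equal-sized independent
    groups and a two-sided test at the given alpha.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is d
    expected_z = d * math.sqrt(n_per_group / 2)
    return z.cdf(expected_z - z_crit)

# Medium effect (d = .5) with hypothetical group sizes
for n in (20, 64):
    print(n, round(approx_power(0.5, n), 2))
```

Under these assumptions, 20 participants per group gives power of roughly .35 for a medium effect, while about 64 per group is needed to reach roughly .80.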

Summary

Effect sizes indicate the amount of difference or strength of relationship. They have historically been underutilized. Inferential tests should be accompanied by effect sizes and confidence intervals.

Commonly used effect sizes include Cohen’s d and r.

Whilst rules of thumb for interpreting effect sizes are available, it is best to compare effect sizes with similar or related studies.

Standardized mean effect sizes are not available in SPSS – hand calculate or use a spreadsheet calculator.

References

[1] [1]
[2] Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
[3] Aaron, B., Kromrey, J. D., & Ferron, J. M. (1998, November). Equating r-based and d-based effect-size indices: Problems with a commonly recommended formula. Paper presented at the annual meeting of the Florida Educational Research Association, Orlando, FL. (ERIC Document Reproduction Service No. ED433353)
[4] Ward, R. M. (2002). Highly significant findings in psychology: A power and effect size survey.

Coe, R. (2002). It’s the effect size, stupid: What effect size is and why it is important. Paper presented at the Annual Conference of British Education Research Association, University of Essex England.

Thalheimer, W., & Cook, S. (2002, August). How to calculate effect sizes from published research articles: A simplified methodology.

External links

EffectSizeFAQ.com Jargon-free guidelines for interpreting the substantive significance of research results.
Neill, J. T. (2008). Why use effect sizes instead of significance testing in program evaluation? wilderdom.com.
