In statistical inference, an effect size is a measure of the strength of the relationship between two variables. Effect sizes are a useful descriptive statistic. Effect sizes provide a standard metric for comparing across studies and thus are critical to meta-analysis.
When reporting statistical significance for an inferential test, effect size(s) should also be reported. This is emphasised by the American Psychological Association (APA) Task Force on Statistical Inference (Wilkinson & APA Task Force on Statistical Inference, 1999): "reporting and interpreting effect sizes in the context of previously reported effects is essential to good research" (p. 599, emphasis added).
This page provides an undergraduate-level introduction to effect sizes and their usage.
Why use effect sizes?
An inferential test may be statistically significant (i.e., unlikely to have occurred by chance), but this doesn’t necessarily indicate how large the effect is.
Conversely, an effect may be notable yet non-significant, especially in low-powered tests.
Effect sizes are influenced by sample size and by the number of groups/conditions (e.g., see Murray and Dosser, 1987). Therefore, caution must be taken when interpreting effect sizes and when comparing them between different studies. Effect sizes can give additional information about the effects of independent variables within the same study, but should not be used to compare effects between studies unless sample size and study design are carefully taken into account.
Some commonly used effect sizes are:
- All correlations are effect sizes
- r or r² (bivariate linear correlation or correlation squared)
- R or R² (multiple correlation or coefficient of determination)
- Cohen's d, Hedges' g, or other forms of standard deviation unit effect size which provide the difference between two means in standard deviation units:
- A standardised measure of the difference between two means:
- Cohen's d = (M2 - M1) / σ
- Cohen's d = (M2 - M1) / SDpooled
- Not readily available in SPSS, so use a separate calculator e.g., Cohensd.xls
- -ve = negative difference/effect
- 0 = no difference/effect
- +ve = positive difference/effect
- η² and partial η² (total variance explained and partial variance explained in ANOVA; each IV will have a partial η²; total η² is equivalent to R²). See also: Eta-squared
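The Cohen's d formula listed above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names are hypothetical, the pooled SD is assumed to weight each group's variance by its degrees of freedom, and the input values are made up for demonstration.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled SD of two groups (assumption: variances weighted by n - 1)."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def cohens_d(m1, m2, sd1, n1, sd2, n2):
    """Standardised mean difference: (M2 - M1) / SDpooled."""
    return (m2 - m1) / pooled_sd(sd1, n1, sd2, n2)

# Illustrative values only (not taken from any study)
d = cohens_d(m1=10.0, m2=12.0, sd1=4.0, n1=30, sd2=4.0, n2=30)
print(round(d, 2))  # 0.5
```

The sign follows the order of subtraction: swapping the two groups gives -0.5, i.e., a negative difference/effect.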
Interpreting standardised mean differences
There are no agreed standards for how to interpret an ES. Interpretation is ultimately subjective. The best approach is to compare with other studies. Some rules of thumb have nevertheless been suggested:
- Cohen (1977):
- .2 = small
- .5 = moderate
- .8 = large
- Wolf (1986):
- .25 = educationally significant
- .50 = practically significant (therapeutic)
- Standardised Mean ESs are proportional, e.g., .40 is twice as much change as .20
- The meaning of an effect size depends on context:
- A small ES can be impressive if, e.g., a variable is difficult to change (e.g. a personality construct) and/or very valuable (e.g. an increase in life expectancy).
- A large ES doesn’t necessarily mean that there is any practical value e.g., if it isn’t related to the aims of the investigation (e.g. religious orientation).
Converting effect sizes
Equating d and r
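A commonly recommended formula for converting d to r is r = d / √(d² + 4), as discussed by Cohen (1965, 1988), Friedman (1968), Glass, McGraw, and Smith (1981), Rosenthal (1984), and Wolf (1986).
For a correction on this formula, see Aaron, Kromrey, and Ferron (1998).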
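As a rough illustration, the widely quoted conversion r = d / √(d² + 4) can be sketched in Python as follows. The function names are hypothetical, and the formula is assumed here to apply to two groups of equal size; Aaron, Kromrey, and Ferron (1998) describe problems with it in other cases.

```python
import math

def d_to_r(d):
    """Convert Cohen's d to r (assumption: two equal-sized groups)."""
    return d / math.sqrt(d**2 + 4)

def r_to_d(r):
    """Algebraic inverse of d_to_r; requires -1 < r < 1."""
    return 2 * r / math.sqrt(1 - r**2)

for d in (0.2, 0.5, 0.8):
    print(d, round(d_to_r(d), 3))
```

Under this formula a d of .5 corresponds to an r of roughly .24, so the two metrics should not be compared against the same rules of thumb.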
Expressing d as a %
- Look up a standard normal table (or use an auto-calculator) for the area under the curve for a given d value, e.g., a d of .5 corresponds to 0.6915, which indicates that the average treated person scored higher than about 69% of the untreated (control) group.
- "Effect size can also be thought of as the average percentile standing of the average treatment (or experimental) participant relative to the average untreated (or control) participant. A d of 0.0 indicates that the mean of the treatment group is at the 50th percentile of the control group. A d of 0.8 indicates that the mean of the treatment group is at the 79th percentile of the control group. An effect size of 1.7 indicates that the mean of the treatment group is at the 95.5 percentile of the untreated group."
- "Effect sizes can also be interpreted in terms of the percent of nonoverlap of the treatment group's scores with those of the untreated group. A d of 0.0 indicates that the distribution of scores for the treatment group overlaps completely with the distribution of scores for the control group, there is 0% of nonoverlap. A d of 0.8 indicates a nonoverlap of 47.4% in the two distributions. A d of 1.7 indicates a nonoverlap of 75.4% in the two distributions."
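The percentile and nonoverlap figures quoted above can be approximated with a short Python sketch. It assumes both groups are normally distributed with equal SDs, and it assumes the nonoverlap figure refers to Cohen's U1 statistic; the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1

def percentile_standing(d):
    """Proportion of the control group scoring below the treatment group mean."""
    return STD_NORMAL.cdf(d)

def nonoverlap_u1(d):
    """Cohen's U1: proportion of the combined distributions that does not overlap."""
    p = STD_NORMAL.cdf(abs(d) / 2)
    return (2 * p - 1) / p

for d in (0.0, 0.5, 0.8, 1.7):
    print(f"d = {d}: percentile = {percentile_standing(d):.1%}, "
          f"nonoverlap = {nonoverlap_u1(d):.1%}")
```

For d = 0.8 this gives a percentile of about 78.8% and a nonoverlap of about 47.4%, in line with the figures quoted above.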
Q: 20 athletes rate their personal playing ability, M = 3.4 (SD = .6) (on a scale of 1 to 5). After an intensive training program, the players rate their personal playing ability again, M = 3.8 (SD = .6). What is the ES? How good was the intervention? (For simplicity, this example uses the same SD for both occasions.)
A: Standardised mean effect size = (M2 - M1) / SDpooled = (3.8 - 3.4) / .6 = .4 / .6 = .67, i.e., a moderate-to-large change over time.
(Partial) Eta-squared vs. Cohen's d
In ANOVA, is reporting partial eta-squared sufficient, or should Cohen's d also be reported?
Partial eta-squared indicates the % of variance in the DV attributable to a particular IV. If the model has more than one IV, then report the partial eta-squared for each.
Cohen's d indicates the size of difference between two means in standard deviation units.
To be thorough, one should report partial eta-squared for each IV and Cohen's ds for each pairwise comparison of interest (e.g., if post-hoc tests or planned comparisons were conducted, then Cohen's d should be provided for each contrast).
If F is non-significant for an IV, then reporting the partial eta-squared is sufficient and appropriate. If F is significant and you go on to do planned contrasts or post-hoc tests, then each of these pairwise comparisons should also be accompanied by Cohen's d effect sizes.
Note that if the IV of interest only has two levels, then the partial eta-squared and the Cohen's d will communicate similar information but do so using different scales of measurement (% and SD units respectively).
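For readers who want to compute partial eta-squared by hand, the following Python sketch shows two equivalent routes: from the sums of squares, and from a reported F ratio with its degrees of freedom. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Partial eta-squared = SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)

def partial_eta_squared_from_f(f_value, df_effect, df_error):
    """Same quantity recovered from F, since F = (SS_effect/df_effect) / (SS_error/df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Illustrative values only: SS_effect = 20, SS_error = 80, df = (1, 38)
print(partial_eta_squared(20.0, 80.0))             # 0.2
f_value = (20.0 / 1) / (80.0 / 38)                 # F implied by those sums of squares
print(round(partial_eta_squared_from_f(f_value, 1, 38), 3))  # 0.2
```

The second route is handy when an article reports only F and its degrees of freedom.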
Power and effect sizes in psychology
Ward (2002) examined articles in 3 psychology journals to assess the current status of statistical power and effect size measures:
- Journal of Personality and Social Psychology
- Journal of Consulting and Clinical Psychology
- Journal of Abnormal Psychology
Ward (2002) found that:
- 7% of studies estimate or discuss statistical power.
- 30% calculate ES measures.
- The average ES across studies was medium.
- Current research designs typically do not have sufficient power to detect such an ES.
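To give a feel for what "sufficient power" demands, the sketch below estimates the sample size per group needed to detect a given d. It is not Ward's (2002) method: it assumes a two-tailed comparison of two independent, equal-sized groups and uses a normal approximation, which slightly underestimates the n required by an exact t-test calculation. The function name and default values are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-tailed, two-group comparison (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

Under these assumptions a medium effect (d = .5) needs roughly 63 participants per group for 80% power, and a small effect (d = .2) needs several hundred.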
Commonly used effect sizes include Cohen’s d and r.
Whilst rules of thumb for interpreting effect sizes are available, it is best to compare effect sizes with similar or related studies.
Standardised mean effect sizes are not available in SPSS – hand calculate or use a spreadsheet calculator.
- Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
- Aaron, B., Kromrey, J. D., & Ferron, J. M. (1998, November). Equating r-based and d-based effect-size indices: Problems with a commonly recommended formula. Paper presented at the annual meeting of the Florida Educational Research Association, Orlando, FL. (ERIC Document Reproduction Service No. ED433353)
- Ward, R. M. (2002). Highly significant findings in psychology: A power and effect size survey.
- Coe, R. (2002). It’s the effect size, stupid: What effect size is and why it is important. Paper presented at the Annual Conference of the British Educational Research Association, University of Essex, England.
- Thalheimer, W., & Cook, S. (2002, August). How to calculate effect sizes from published research articles: A simplified methodology.
- Effect size data analysis tutorial
- Effect size (Wikipedia)
- Confidence interval
- Statistical power
- Statistical significance
- EffectSizeFAQ.com Jargon-free guidelines for interpreting the substantive significance of research results.
- Neill, J. T. (2008). Why use effect sizes instead of significance testing in program evaluation? wilderdom.com.