Reliability

This page focuses on psychometric reliability in the context of Evidence-Based Assessment. There are other more general and comprehensive discussions of reliability on Wikipedia and elsewhere.

Reliability describes the reproducibility of a score. It is expressed as a coefficient that ranges from 0 (no reliable variance) to 1.0 (perfect reliability). Conceptually, the number can be thought of as the correlation of the score with itself across repeated measurements. In classical test theory, it is the proportion of observed-score variance that is due to the "true" score, which equals the squared correlation between the observed score and the true score.
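The classical test theory view can be illustrated with a small simulation. The sketch below assumes the usual model in which an observed score is a true score plus independent random error; the function name, sample size, and standard deviations are hypothetical choices for illustration, not values from any real instrument. It compares the theoretical ratio of true-score variance to observed-score variance with two empirical estimates.

```python
import random
from statistics import fmean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_reliability(n=20000, true_sd=2.0, error_sd=1.0, seed=0):
    """Simulate observed = true + error (hypothetical parameters)."""
    rng = random.Random(seed)
    true = [rng.gauss(0, true_sd) for _ in range(n)]
    # Two "parallel" administrations: same true score, fresh error each time.
    form_a = [t + rng.gauss(0, error_sd) for t in true]
    form_b = [t + rng.gauss(0, error_sd) for t in true]
    return {
        # True-score variance divided by observed-score variance.
        "theoretical": true_sd**2 / (true_sd**2 + error_sd**2),
        # Correlation of the score "with itself" across two administrations.
        "parallel_forms_r": pearson(form_a, form_b),
        # Squared correlation of the observed score with the true score.
        "observed_true_r_squared": pearson(form_a, true) ** 2,
    }


result = simulate_reliability()
print(result)
```

With these assumed standard deviations the theoretical reliability is 0.8, and both empirical estimates should land close to that value, differing only by sampling error.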

There are different facets of reliability. Internal consistency looks at whether different parts of the scale measure the same thing. One way of looking at internal consistency is to split the items into two halves at random and then look at the correlation between the two half scores (split-half reliability). Cronbach's alpha is the most widely used index of internal consistency reliability. Conceptually, alpha is the average of all possible split-half reliabilities. Internal consistency is the most widely reported form of reliability because it is the most convenient and least expensive to estimate, not because it is always the most appropriate choice.
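Both indices can be computed directly from an item-by-respondent table. The sketch below is a minimal illustration using made-up responses (five respondents, four items); the data and function names are hypothetical. Alpha uses the standard variance formula, and the split-half estimate applies the Spearman-Brown correction, a common step that adjusts for each half being only half as long as the full scale.

```python
import random
from statistics import fmean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows are respondents, columns are items."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half(rows, seed=0):
    """Random split-half reliability with Spearman-Brown correction."""
    k = len(rows[0])
    items = list(range(k))
    random.Random(seed).shuffle(items)  # random assignment of items to halves
    half_a, half_b = items[: k // 2], items[k // 2:]
    score_a = [sum(row[j] for j in half_a) for row in rows]
    score_b = [sum(row[j] for j in half_b) for row in rows]
    r = pearson(score_a, score_b)
    return 2 * r / (1 + r)  # step up from half length to full length


# Hypothetical responses: 5 respondents x 4 items.
responses = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
    [5, 4, 5, 5],
    [3, 3, 4, 3],
]

alpha = cronbach_alpha(responses)
half = split_half(responses)
print(round(alpha, 2))  # 0.96 for this made-up data
print(round(half, 2))   # depends on which random split is drawn
```

A different seed gives a different split and therefore a slightly different split-half value, which is one reason alpha is usually preferred as a summary.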

Evaluating norms and reliability

Rubric for evaluating norms and reliability for assessments (extending Hunsley & Mash, 2008, 2018; * indicates a new construct or category)
Criterion	Adequate	Good	Excellent	Too Good
Norms	Mean and standard deviation for total score (and subscores if relevant) from a large, relevant clinical sample	Mean and standard deviation for total score (and subscores if relevant) from multiple large, relevant samples, at least one clinical and one nonclinical	Same as “good,” but must be from representative sample (i.e., random sampling, or matching to census data)	Not a concern
Internal consistency (Cronbach's alpha, split half, etc.)	Most evidence shows Cronbach's alpha values of .70 to .79	Most reported alphas .80 to .89	Most reported alphas ≥ .90	Alpha is also tied to scale length and content coverage; very high alphas may indicate that the scale is longer than needed, or that it has a very narrow scope
Inter-rater reliability	Most evidence shows kappas of .60-.74, or intraclass correlations of .70-.79	Most reported kappas of .75-.84, ICCs of .80-.89	Most kappas ≥ .85, or ICCs ≥ .90	Very high levels of agreement are often achieved by re-rating from audio or transcript
Test-retest reliability (stability)	Most evidence shows test-retest correlations ≥ .70 over a period of several days or weeks	Most evidence shows test-retest correlations ≥ .70 over a period of several months	Most evidence shows test-retest correlations ≥ .70 over a year or longer	Key consideration is appropriate time interval; many constructs would not be stable for years at a time
*Repeatability	Bland-Altman plots (Bland & Altman, 1986) show small bias and/or weak trends; coefficient of repeatability is tolerable compared to clinical benchmarks (Vaz, Falkmer, Passmore, Parsons, & Andreou, 2013)	Bland-Altman plots and corresponding regressions show no significant bias and no significant trends; coefficient of repeatability is tolerable	Bland-Altman plots and corresponding regressions show no significant bias and no significant trends across multiple studies; coefficient of repeatability is small enough that it is not clinically concerning	Not a concern
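
The quantities behind the repeatability row can be sketched numerically. The example below uses made-up test and retest scores and a hypothetical function name. It assumes one common convention, in which the coefficient of repeatability is 1.96 times the standard deviation of the paired differences; other sources define it slightly differently. The trend is summarized as the slope from regressing the differences on the pair means, without a significance test.

```python
from statistics import fmean, stdev


def bland_altman(first, second):
    """Bias, limits of agreement, repeatability, and trend for paired scores."""
    diffs = [b - a for a, b in zip(first, second)]
    means = [(a + b) / 2 for a, b in zip(first, second)]
    bias = fmean(diffs)        # average change from first to second score
    sd_diff = stdev(diffs)     # sample SD of the differences
    cr = 1.96 * sd_diff        # assumed convention for the coefficient
    # Ordinary least squares slope of difference on mean (the "trend").
    grand_mean = fmean(means)
    sxx = sum((m - grand_mean) ** 2 for m in means)
    sxy = sum((m - grand_mean) * (d - bias) for m, d in zip(means, diffs))
    return {
        "bias": bias,
        "lower_limit": bias - cr,
        "upper_limit": bias + cr,
        "coefficient_of_repeatability": cr,
        "trend_slope": sxy / sxx,
    }


# Hypothetical scores for six people assessed twice.
time1 = [10, 12, 15, 9, 20, 14]
time2 = [11, 11, 16, 10, 19, 15]

summary = bland_altman(time1, time2)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Whether a given bias or coefficient counts as tolerable is a clinical judgment made against benchmarks for the specific measure, so the numbers here only show the mechanics.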