In psychological measurement, reliability refers to the reproducibility or consistency of scores from a measure in a particular sample.
We commonly describe cars, for example, in terms of their reliability. A reliable car is one that starts every time ignition is attempted. An unreliable car fails to start some of the time.
There are several types of reliability that are commonly examined in psychological measurement:
- Internal consistency (internal reliability): Correlation among the items in a multi-item measurement scale, indicating whether the items are measuring the "same thing"
- Test-retest reliability: Correlation between scores on the same measure from the same participants at different times
- Inter-rater reliability: Correlation between scores assigned by different raters (observers); used for observational measures
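The three types above can be illustrated with a small sketch. All scores below are made-up illustration data, and the helper names `pearson_r` and `cronbach_alpha` are hypothetical. Cronbach's alpha is used here as one common summary of internal consistency; the list above does not name a specific coefficient, so treat that choice as an assumption.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]  # total score per participant
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Made-up data: 3 items answered by the same 5 participants.
items = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 4],
    [4, 2, 5, 3, 3],
]
alpha = cronbach_alpha(items)

# Made-up data: total scores for the same 5 participants at two time points.
time_1 = [13, 8, 14, 7, 11]
time_2 = [12, 9, 14, 8, 10]
test_retest = pearson_r(time_1, time_2)

# Made-up data: two observers rating the same 5 participants.
rater_a = [3, 5, 2, 4, 4]
rater_b = [3, 4, 2, 5, 4]
inter_rater = pearson_r(rater_a, rater_b)

print(f"internal consistency (alpha): {alpha:.2f}")
print(f"test-retest correlation:      {test_retest:.2f}")
print(f"inter-rater correlation:      {inter_rater:.2f}")
```

With only five participants these estimates would be very imprecise in practice; the point is only to show which pairs of scores each type of reliability compares.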
Other sources of variation can also influence the reproducibility of scores. The most fully developed methods for studying them are generalizability theory and item response theory.
Current thinking, as well as the Standards for Educational and Psychological Testing (AERA, 2014), emphasizes that reliability is not an intrinsic property of a measure. Instead, reliability describes how scores perform in a particular sample or setting. When the measure is used with someone similar to the sample used to evaluate its reliability, the scores are likely to perform similarly. When the person differs from the samples used to develop and benchmark the measure, the reliability estimates may not generalize. For example, a spelling test designed for eighth graders may do a good job of ranking new eighth graders on spelling performance, but the same test might not show good reliability for fourth graders or college students.
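One way to see why reliability depends on the sample is the classical test theory ratio, which this section does not state explicitly: reliability is the share of observed-score variance that reflects true differences between people. The sketch below uses hypothetical standard deviations, and it assumes the same amount of measurement error in both groups, which need not hold for a real test.

```python
def reliability(true_sd, error_sd):
    """Classical test theory ratio: true variance / (true + error variance)."""
    true_var = true_sd ** 2
    error_var = error_sd ** 2
    return true_var / (true_var + error_var)


# Hypothetical values, for illustration only.
ERROR_SD = 5.0  # assumed identical in both groups because the test is the same

# Eighth graders: spelling ability varies widely on this test.
eighth_graders = reliability(true_sd=10.0, error_sd=ERROR_SD)

# Fourth graders: most words are too hard, so true scores bunch together.
fourth_graders = reliability(true_sd=3.0, error_sd=ERROR_SD)

print(f"eighth graders: {eighth_graders:.2f}")
print(f"fourth graders: {fourth_graders:.2f}")
```

Under these assumptions the same test yields a much lower reliability in the group whose true scores vary less, even though nothing about the test itself has changed.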