





© Lee Harvey 2012–2019

Page updated 17 June, 2019

Citation reference: Harvey, L., 2012–2019, Researching the Real World, available at
All rights belong to author.


A Guide to Methodology

CASE STUDY Test-retest reliability (Thornberry and Krohn, 2000)

Terence Thornberry and Marvin Krohn explored self-reporting as a technique for measuring involvement in delinquent and criminal behavior. As part of their analysis, they reviewed the reliability of self-reporting, drawing on analyses in the literature that had used test-retest approaches. Thornberry and Krohn (2000, pp. 46–48) wrote as follows:

Thus, we will focus on the test-retest method of assessing reliability. This approach is quite straightforward. A sample of respondents is administered a self-reported delinquency inventory (the test); then, after a short interval, the same inventory is readministered (the retest). In doing this, the same questions and the same reference period should be used at both times.

It is also important to pay attention to the time lag between the test and the retest. If it is too short, answers to the retest likely will be a function of memory; respondents are likely to remember what they said the first time and simply repeat it. If so, estimates of reliability would be inflated. On the other hand, if the time period between the test and the retest is too long, responses to the retest would probably be less accurate than those to the test simply because of memory decay. In this case, the reliability of the scale would be underestimated. There is no hard and fast rule for assessing the appropriateness of this lag, but the optimal time lag appears to be in the range of 1 to 4 weeks.

The simplest way of deriving a reliability coefficient for the test-retest method is to correlate the first and second sets of responses. The correlations should be reasonably high, preferably in the range of 0.70 or greater.
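
The correlation step described above can be illustrated with a minimal sketch. It assumes that each respondent has a single numeric score at the test and at the retest and that a Pearson correlation is an adequate coefficient. The function name and the six pairs of scores are hypothetical and are not data from any of the studies cited.

```python
from math import sqrt


def pearson_r(test, retest):
    """Pearson correlation between paired test and retest scores."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    n = len(test)
    mean_t = sum(test) / n
    mean_r = sum(retest) / n
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((t - mean_t) * (r - mean_r) for t, r in zip(test, retest))
    var_t = sum((t - mean_t) ** 2 for t in test)
    var_r = sum((r - mean_r) ** 2 for r in retest)
    if var_t == 0 or var_r == 0:
        raise ValueError("correlation is undefined when one set of scores is constant")
    return cov / sqrt(var_t * var_r)


# Hypothetical delinquency scores for six respondents (illustration only)
test_scores = [0, 2, 5, 1, 8, 3]
retest_scores = [0, 3, 4, 1, 7, 3]

r = pearson_r(test_scores, retest_scores)
print(f"test-retest r = {r:.2f}")
print("meets the 0.70 guideline" if r >= 0.70 else "below the 0.70 guideline")
```

The 0.70 threshold in the last line is the guideline quoted above, not a property of the calculation itself.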

A number of studies have assessed the test-retest reliability of self-reported delinquency measures. In general, the results of these studies indicate that these measures are acceptably reliable. The reliability coefficients vary somewhat, depending on the number and types of delinquent acts included in the index and the scoring procedures used (e.g., simple frequencies or ever-variety scores). But scores well above 0.80 are common. In summarizing previous literature in this area, Huizinga and Elliott (1986, p. 300) stated:

Test-retest reliabilities in the 0.85–0.99 range were reported by several studies employing various scoring schemes and numbers of items and using test-retest intervals of from less than 1 hour to over 2 months (Kulik et al., 1968; Belson, 1968; Hindelang et al., 1981; Braukmann et al., 1979; Patterson and Loeber, 1982; Skolnick et al., 1981; Clark and Tifft, 1966; Broder and Zimmerman, 1978).

Perhaps the most comprehensive assessment of the psychometric properties of the self-report method was conducted by Hindelang, Hirschi, and Weis (1981). Their self-report inventory was quite extensive, consisting of 69 items divided into the following major subindexes: official contacts, serious crimes, delinquency, drugs, and school and family offenses. To see whether the method of administration matters, some subjects were interviewed and others responded on a questionnaire. For both types of administration, some subjects responded anonymously and others were asked to provide their names.

To maximize variation in the level of delinquency, the study sample was selected from three different populations in Seattle, Washington. The first consisted of students without an official record of delinquency attending Seattle schools. The second consisted of adolescents with a police record but no court record, and the third group consisted of adolescents with a juvenile court record. Within these three major strata, subjects were further stratified by gender, race, and, among the whites, socioeconomic status.

Several self-reported measures of delinquency were created. The major ones include an ever-variety score (the number of delinquent acts the respondents report ever having committed), a last-year variety score (the same type of measure for the past year), and a last-year frequency score (the total number of times respondents report committing each of the delinquent acts).
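
The three scores can be sketched as follows. This assumes each respondent's answers are stored as a mapping from a delinquent act to the number of times it was reported. The item names, the counts and the function names are hypothetical and are not taken from the Seattle inventory.

```python
def variety_score(responses):
    """Number of different acts reported at least once."""
    return sum(1 for count in responses.values() if count > 0)


def frequency_score(responses):
    """Total number of times the acts were reported, summed over all items."""
    return sum(responses.values())


# Hypothetical answers from one respondent (illustration only)
ever = {"theft": 3, "vandalism": 1, "truancy": 10, "assault": 0}
last_year = {"theft": 0, "vandalism": 1, "truancy": 4, "assault": 0}

ever_variety = variety_score(ever)                # acts ever committed
last_year_variety = variety_score(last_year)      # acts committed in the past year
last_year_frequency = frequency_score(last_year)  # total acts in the past year

print(ever_variety, last_year_variety, last_year_frequency)
```

A variety score counts each act once however often it was committed, whereas a frequency score can be dominated by a single high-volume item (here, truancy).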

As indicated earlier, internal consistency methods can be used to assess the reliability of self-reported responses. The classic way of doing so is with Cronbach's alpha. Although mindful of the limitations of internal consistency approaches, Hindelang, Hirschi, and Weis (1981) report alpha coefficients for a variety of demographic subgroups and for the ever-variety, last-year variety, and last-year frequency scores. The coefficients range from 0.76 to 0.93. Most of the coefficients are above 0.8, and 8 of the 18 coefficients are above 0.9.
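
For readers unfamiliar with Cronbach's alpha, a minimal sketch of the standard formula follows: alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. It assumes one row of item scores per respondent and uses sample variances. The function name and the data are hypothetical and are not the Seattle figures.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(rows) < 2:
        raise ValueError("need at least two respondents")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("each respondent needs the same number (>= 2) of items")
    # Variance of each item across respondents (columns of the data)
    item_variances = [variance(column) for column in zip(*rows)]
    # Variance of each respondent's total score
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical yes/no (1/0) answers: five respondents, three items
answers = [
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
]

alpha = cronbach_alpha(answers)
print(f"alpha = {alpha:.2f}")
```

Alpha rises when items vary together across respondents, which is why it is read as a measure of internal consistency rather than of stability over time.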

Hindelang, Hirschi, and Weis (1981) also estimated test-retest reliabilities for these three self-report measures for each of the demographic subgroups. Unfortunately, only 45 minutes elapsed between the test and the retest, so it is quite possible that the retest responses are strongly influenced by memory effects. Nevertheless, they report substantial degrees of reliability for the self-report measures. Indeed, most of the test-retest correlations are above 0.9.

Thus, whether an internal consistency or test-retest approach is used, the Seattle data indicate a substantial degree of reliability for a basic self-reported delinquency measure. Hindelang, Hirschi, and Weis (1981, p. 82) point out that reliability scores of this magnitude are higher than those typically associated with many attitudinal measures and conclude that 'the overall implication is that in many of the relations examined by researchers, the delinquency dimension is more reliably measured than are many of the attitudinal dimensions studied in the research'.

The other major assessment of the psychometric properties of the self-report method was conducted by Huizinga and Elliott, using data taken from the well-known National Youth Survey. NYS began in 1976 with a nationally representative sample of 1,725 American youths between the ages of 11 and 17. At the fifth interview, 177 respondents were randomly selected and reinterviewed approximately 4 weeks after their initial assessment. Based on these data, Huizinga and Elliott (1986) estimated test-retest reliability scores for the general delinquency index and for several subindexes. They also estimated reliability coefficients for frequency scores and for variety scores.

The general delinquency index appears to have an acceptable level of reliability. The test-retest correlations are 0.75 for the frequency score and 0.84 for the variety score. For the various subindexes—ranging from public disorder offenses to the much more serious index offenses—the reliabilities vary from a low of 0.52 (for the frequency measure of felony theft) to a high of 0.93 (for the frequency measure of illegal services). In total, Huizinga and Elliott (1986) report 22 estimates of test-retest reliability—across indexes and across frequency and variety scores—and the mean reliability coefficient is 0.74.

Another way of assessing the level of test-retest reliability is by estimating the percentage of the sample who changed their frequency responses by two or less. If the measure is highly reliable, one would expect few such changes. For most subindexes, there appears to be acceptable precision and reliability based on this measure. For example, for index offenses, 97 percent of the respondents changed their answers by two delinquent acts or less. Huizinga and Elliott (1986, p. 303) summarize these results as follows:

Scales representing more serious, less frequently occurring offenses (index offenses, felony assault, felony theft, robbery), have the highest precision, with 96 to 100 percent agreement, followed by the less serious offenses (minor assault, minor theft, property damage), with 80 to 95 percent agreement. The public disorder and status scales have lower reliabilities (in the 40 to 70 percent agreement range), followed finally by the general SRD [self-reported delinquency] scale, which, being a composite of the other scales, not surprisingly has the lowest test-retest agreement.
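
The agreement measure used here, the percentage of respondents whose frequency response changed by two acts or less between test and retest, can be sketched as follows. The function name, the default tolerance parameter and the ten pairs of responses are hypothetical. The sketch assumes one frequency count per respondent at each administration and is not a reconstruction of the National Youth Survey analysis.

```python
def percent_agreement(test, retest, tolerance=2):
    """Percentage of respondents whose answer changed by `tolerance` or less."""
    if len(test) != len(retest) or not test:
        raise ValueError("need two non-empty lists of equal length")
    within = sum(1 for t, r in zip(test, retest) if abs(t - r) <= tolerance)
    return 100.0 * within / len(test)


# Hypothetical frequency responses for ten respondents (illustration only)
test_counts = [0, 0, 1, 5, 0, 2, 0, 0, 3, 0]
retest_counts = [0, 1, 1, 9, 0, 2, 0, 0, 1, 0]

agreement = percent_agreement(test_counts, retest_counts)
print(f"{agreement:.0f} percent changed by two acts or less")
```

Because rarely committed offenses produce many zero responses at both administrations, a scale of this kind can show high agreement even when its test-retest correlation is modest, which is consistent with the pattern in the summary quoted above.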

Huizinga and Elliott also report little evidence of differential reliability across various subgroups. They found no consistent differences across sex, race, class, place of residence, or delinquency level in terms of test-retest reliabilities.

(Adapted from Thornberry and Krohn, 2000, pp. 46–48)

