core definition

A measure is reliable if it is consistent, that is, if it measures the same thing in the same way each time.

explanatory context

Questionnaires or structured interview schedules are assumed to be reliable data collection instruments because they ask all respondents a standard set of questions in a given order. Even setting aside the likelihood that an interviewer will make mistakes or deviate from the set questions, or that a respondent will read the questions on a questionnaire in a different order, there is still doubt about the reliability of such data collection instruments. A metric ruler may be a reliable tool for measuring length, as it measures the same thing in the same way each time, but a set of questions, which after all relies on communication, is unlikely to be reliable in the measurement of, for example, attitudes.

There is no sure way of testing reliability. The problem is, of course, that the subject matter of the questionnaire is not inert. One cannot test the reliability of a questionnaire by administering it to the same sample again and hoping for the same results because the sample itself will have changed from the first to the second testing.


There are various ways of assessing reliability. By far the best is to divide the responses on the test instrument into two groups at random and correlate the answers on each half for the sample. A high correlation coefficient suggests, although by no means confirms, a reliable instrument. For more information on assessing reliability see Researching the Real World Section
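The split-half procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes each respondent's answers are already coded as numbers, and the function names, the fixed random seed and the sample data are all hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers.

    Raises ZeroDivisionError if either list has no variation.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_correlation(responses, seed=0):
    """Randomly split the items into two halves and correlate the half totals.

    responses: one list of numeric item scores per respondent
    (all lists the same length). The seed is an assumption made
    here only so that the random split is repeatable.
    """
    n_items = len(responses[0])
    items = list(range(n_items))
    random.Random(seed).shuffle(items)
    half_a, half_b = items[: n_items // 2], items[n_items // 2:]
    totals_a = [sum(r[i] for i in half_a) for r in responses]
    totals_b = [sum(r[i] for i in half_b) for r in responses]
    return pearson(totals_a, totals_b)


# Made-up answers from five respondents to four attitude items (1-5 scale).
sample = [
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 1],
]
r = split_half_correlation(sample)
```

A value of r close to 1 would suggest, but as noted above not confirm, that the two halves of the instrument are behaving consistently for this sample.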

analytical review

Colorado State University (1993–2013) defines the following:

Reliability: The extent to which a measure, procedure or instrument yields the same result on repeated trials.

Equivalency Reliability: The extent to which two items measure identical concepts at an identical level of difficulty.

Stability Reliability: The agreement of measuring instruments over time.

Synchronic Reliability: The similarity of observations within the same time frame; it is not about the similarity of things observed.

Trochim (2006) states:

In research, the term reliability means 'repeatability' or 'consistency'. A measure is considered reliable if it would give us the same result over and over again (assuming that what we are measuring isn't changing!).... There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:

• Inter-Rater or Inter-Observer Reliability Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.

• Test-Retest Reliability Used to assess the consistency of a measure from one time to another.

• Parallel-Forms Reliability Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.

• Internal Consistency Reliability Used to assess the consistency of results across items within a test.
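As an illustration of the last of these classes, internal consistency is often summarised with Cronbach's alpha. Trochim's text quoted here does not specify an estimator, so the choice of alpha, the function name and the data layout below are assumptions made for the sake of a short sketch.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a set of numeric item scores.

    responses: one list of item scores per respondent
    (all lists the same length, at least two items and two respondents).
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

Values nearer 1 indicate that the items vary together across respondents; like the split-half correlation, a high value suggests rather than proves that the items measure a single characteristic.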

Wojtczak (2002), in the Glossary of Medical Education Terms, states:

Trust in the accuracy or provision of one's results; in the case of tests, it is an expression of the precision, consistency and reproducibility of measurements. Ideally, measurements should be the same when repeated by the same person or made by different assessors. In tests, contributing factors to reliability are the consistency of marking, the quality of test and test items, and the type and size of the sample. Satisfactory reliability of objective tests can be achieved by having large numbers of well-constructed test items marked by computer. Reliability is characterized by the stability, equivalence, and homogeneity of test.

Stability or test-retest reliability is the degree to which the same test produces the same results when repeated under the same conditions;

Equivalence or alternate-form reliability is the degree to which alternate forms of the same measurement instrument produce the same result;

Homogeneity is the extent to which various items legitimately team together to measure a single characteristic, such as a desired attitude.


In a clinical examination, obtaining reliability depends on three variables: the students, the examiners and the patients. Such complexity makes it difficult to reproduce a comparable situation for tests of clinical skill and clinical problem-solving. In a reliable assessment procedure, the variability due to the patient and the examiner should be removed. Wherever possible, a subjective approach to marking should be replaced by a more objective one and students should be tested by a number of examiners. It is important to note that students are usually examined using different patients, which may enhance the performance of some students and harm the performance of others. Therefore, tests which aim to assess clinical skills and clinical problem-solving have to contain many samples of student performance if they are to achieve adequate levels of reliability. The development of the multi-station objective structured clinical examination (OSCE) represents an effort to do so.
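The point that more samples of performance are needed for adequate reliability can be illustrated with the Spearman-Brown formula, which is not mentioned in the source and is offered here only as one standard way of estimating the effect. It assumes that the added samples (for example, extra OSCE stations) are comparable to the existing ones; the function name and the figures used are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor.

    reliability: reliability of the current test (between 0 and 1).
    length_factor: new length divided by current length,
    e.g. 2 for twice as many comparable stations.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical example: a short examination with reliability 0.5,
# extended to two and then four times as many comparable stations.
doubled = spearman_brown(0.5, 2)
quadrupled = spearman_brown(0.5, 4)
```

Under these assumptions the predicted reliability rises with each increase in length but by progressively smaller amounts.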

associated issues


related areas

See also


Researching the Real World Section 1.9


