8.1 Introduction to surveys
8.2 Methodological approaches
8.3 Doing survey research
8.4 Statistical Analysis
8.4.1 Descriptive statistics
8.4.2 Exploring relationships
8.4.3 Analysing samples
8.4.3.1 Generalising from samples
8.4.3.2 Dealing with sampling error
8.4.3.3 Confidence limits
8.4.3.4 Statistical significance
8.4.3.5 Hypothesis testing
8.4.3.6 Significance tests
8.4.3.6.1 Parametric tests of significance
8.4.3.6.2 Non-parametric tests of significance
8.4.3.6.2.1 Chi-square test
8.4.3.6.2.2 Mann-Whitney U test
8.4.3.6.2.3 Kolmogorov-Smirnov Test
8.4.3.6.2.4 H test
8.4.3.6.2.5 Sign test
8.4.3.6.2.6 Wilcoxon test
8.4.3.6.2.6.1 A note on the Walsh test
8.4.3.6.2.7 Friedman test
8.4.3.6.2.8 Q test
8.4.3.7 Summary of significance testing and association: an example
8.4.4 Report writing
8.5 Summary and conclusion
When to use the Wilcoxon test
The Wilcoxon test compares two related samples of any size. It is a refinement of the basic Sign test. The Sign test considers matched pairs by allocating a positive or negative sign to indicate the direction of the change within each pair. The Wilcoxon test also takes account of the size of the change. Consequently, the Wilcoxon test can only be used on ordered metric scale data, interval scale data, or ratio scale data, because the absolute sizes of the changes are compared.
The Wilcoxon test is preferable to the parametric t-test for matched pairs in some circumstances, as it does not assume normal distributions.
Hypotheses
The null hypothesis (H0) tested by the Wilcoxon test is that the two matched samples come from the same distribution. In other words, changes in the matched pairs are randomly distributed.
The alternative hypothesis (HA) is that the matched samples indicate a change, either in one specified direction (a one-tail test) or in either direction (a two-tail test).
What we need to know
All the values of the matched pairs are required to use the Wilcoxon test.
Sampling distribution
The sampling distribution is based on the null hypothesis that positive and negative changes are equally likely to occur when the changes are random. If the actual distribution of changes is significantly different, the null hypothesis is rejected.
The test works by taking the absolute value of each change and ranking these values in order from 1, the lowest, to n (the sample size), while retaining the direction (sign) of the change. Ties are awarded an average rank, as in the Mann-Whitney test.
The ranks for all the negative changes are summed, and likewise the ranks for all the positive changes. The smallest of the two sums of ranks is used in subsequent analysis and is designated as W.
The sampling distribution of the sum of the ranks approximates a normal distribution when the sample size is greater than or equal to fifteen.
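The ranking procedure described above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `signed_rank_sums` and the five example changes are hypothetical and do not come from the text.

```python
def signed_rank_sums(changes):
    """Rank the absolute changes and sum the ranks by sign.

    Returns (positive_sum, negative_sum, n), where n counts non-zero changes.
    """
    # 'No change' pairs are discarded before ranking.
    nonzero = [c for c in changes if c != 0]
    # Indices ordered by absolute size of change, smallest first.
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    start = 0
    while start < len(order):
        # Extend the run while the next absolute change ties with this one.
        end = start
        while (end + 1 < len(order)
               and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]])):
            end += 1
        # Positions start..end are 0-based, so they hold ranks start+1..end+1;
        # every tied item receives the average of those ranks.
        average_rank = (start + 1 + end + 1) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    positive_sum = sum(r for r, c in zip(ranks, nonzero) if c > 0)
    negative_sum = sum(r for r, c in zip(ranks, nonzero) if c < 0)
    return positive_sum, negative_sum, len(nonzero)


# Hypothetical changes for five matched pairs (one pair shows no change).
example_changes = [3, -1, 1, 4, 0]
positive_sum, negative_sum, n = signed_rank_sums(example_changes)
W = min(positive_sum, negative_sum)  # the smallest sum of ranks
print(positive_sum, negative_sum, n, W)  # 8.5 1.5 4 1.5
```

The two changes of absolute size 1 share ranks 1 and 2, so each receives 1.5. As a check, the two sums always add up to n(n+1)/2, here 10.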
Assumptions
There are no assumptions as to the shape of the population when using the Wilcoxon test. However, the test is concerned with changes in scores and any 'no change' item is discarded. If there are a lot of unchanged scores, the test becomes inapplicable.
Testing statistic
(1) Small samples, 6 to 14 in size: the testing statistic is W, where W is the smallest sum of ranks.
(2) Large samples of 15 or more: the testing statistic is z, as the sum of ranks approximates a normal distribution as n increases:
z = (W - mean of W)/standard deviation of W
i.e. z = (W - µW)/σW
where mean of W = µW = (n(n+1))/4
and standard deviation of W = σW = √((n(n+1)(2n+1))/24)
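The large-sample statistic can be sketched as a short Python function. The name `wilcoxon_z` and the values W = 50 and n = 20 are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def wilcoxon_z(w, n):
    """Normal approximation for the Wilcoxon sum of ranks W with n non-zero changes."""
    mean_w = n * (n + 1) / 4                          # mean of W
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # standard deviation of W
    return (w - mean_w) / sd_w


# Hypothetical illustration: W = 50 from n = 20 non-zero changes.
z = wilcoxon_z(50, 20)
print(round(z, 3))  # -2.053
```

Here the mean of W is 105 and the standard deviation is √717.5, about 26.79, so z is about -2.05.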
Critical values
1. Small samples: the critical value can be approximated from the following formulae:
0.05 significance level (two-tail): (n² - 7n + 10)/5
0.01 significance level (two-tail): (11n²/60) - 2n + 5
Thus for samples between 6 and 14 the null hypothesis is rejected if calculated W is equal to or less than the value in the table below (2TT = two-tail test, 1TT = one-tail test).
N | 0.05 significance level (2TT) / 0.025 significance level (1TT) | 0.01 significance level (2TT) / 0.005 significance level (1TT)
6 | 0.5 | -
7 | 2.0 | -
8 | 3.5 | 0.5
9 | 5.5 | 1.5
10 | 8.0 | 3.0
11 | 10.5 | 5.0
12 | 14.0 | 7.0
13 | 17.5 | 10.0
14 | 21.5 | 13.0
2. Large samples. Critical values are as for all z tests and may be derived from areas under the normal curve. See Section 8.4.3.5 for some typical critical values for the z test.
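The two small-sample approximation formulae above can be evaluated directly, as in the sketch below. The function names are hypothetical. The raw values are printed unrounded to a table step: the tabulated figures appear to be these values rounded to a multiple of 0.5, and a negative raw value appears to correspond to a "-" entry, but both readings are assumptions and not something the text states.

```python
def approx_critical_w_05(n):
    """Approximate critical W at the 0.05 level (two-tail): (n^2 - 7n + 10)/5."""
    return (n ** 2 - 7 * n + 10) / 5


def approx_critical_w_01(n):
    """Approximate critical W at the 0.01 level (two-tail): 11n^2/60 - 2n + 5."""
    return 11 * n ** 2 / 60 - 2 * n + 5


# Raw (unrounded) approximations for the small-sample range n = 6 to 14.
for n in range(6, 15):
    print(n, round(approx_critical_w_05(n), 2), round(approx_critical_w_01(n), 2))
```

For n = 10, for example, the formulae give 8.0 and about 3.33, against tabulated values of 8.0 and 3.0.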
Worked examples
1. Small samples
Ten students were given a mathematics test after week five of their course and the test was repeated (with different questions but the same topics) two weeks later. Do the scores below suggest that retesting improves scores?
Student | Test 1 score | Test 2 score | Change | Rank
1 | 85 | 94 | +9 | 7
2 | 92 | 84 | -8 | 6
3 | 83 | 93 | +10 | 8
4 | 74 | 81 | +7 | 5
5 | 86 | 87 | +1 | 2
6 | 94 | 95 | +1 | 2
7 | 88 | 99 | +11 | 9
8 | 69 | 70 | +1 | 2
9 | 77 | 80 | +3 | 4
10 | 82 | 82 | 0 | -
Hypotheses:
H0: There is no significant difference in the scores between the tests.
HA: There is a significant increase in the scores.
This is a one-tail test of significance.
Significance level: 2.5% (Due to restrictions of critical values table)
Testing statistic: W
W = smallest sum of ranks
Critical value: From the table of critical values above, the critical value at the 2.5% significance level (one-tail) is 5.5, because one student showed no change and the sample is therefore reduced to n = 9.
Decision rule: reject H0 if calculated W ≤ 5.5
Computation:
Compute the change and rank as in the last two columns of the table above. Then compute the sum of the positive ranks and the sum of the negative ranks. The smallest sum of ranks is W = 6 (the sum for the negative changes: the only negative change, student 2's, has rank 6; the positive ranks sum to 39).
Decision: Cannot reject the null hypothesis (H0) as calculated W (6) is greater than the critical value of 5.5.
Note that had student 2's score reduced by 5 rather than 8, then the null hypothesis would have been rejected. This suggests that the test is rather sensitive for small samples. It also means that one large score in the opposite direction can wipe out, in effect, clear evidence of a general (small) movement in the other direction.
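The small-sample example, including the 'reduced by 5 rather than 8' variation in the note above, can be checked with a short Python sketch. The helper name `smallest_rank_sum` is hypothetical. The scores are those in the table, and the rejection rule used is 'W equal to or less than the critical value', as stated in the critical values section.

```python
from collections import defaultdict


def smallest_rank_sum(changes):
    """Return (W, n): the smallest sum of ranks and the number of non-zero changes."""
    nonzero = [c for c in changes if c != 0]
    # Record the rank positions occupied by each absolute size of change.
    positions = defaultdict(list)
    for position, size in enumerate(sorted(abs(c) for c in nonzero), start=1):
        positions[size].append(position)
    # Tied sizes share the average of the positions they occupy.
    rank_of = {size: sum(p) / len(p) for size, p in positions.items()}
    positive_sum = sum(rank_of[abs(c)] for c in nonzero if c > 0)
    negative_sum = sum(rank_of[abs(c)] for c in nonzero if c < 0)
    return min(positive_sum, negative_sum), len(nonzero)


test1 = [85, 92, 83, 74, 86, 94, 88, 69, 77, 82]
test2 = [94, 84, 93, 81, 87, 95, 99, 70, 80, 82]
changes = [second - first for first, second in zip(test1, test2)]

critical_value = 5.5  # n = 9, 0.025 significance level, one-tail test
w, n = smallest_rank_sum(changes)
print(w, n, w <= critical_value)  # 6.0 9 False: cannot reject H0

# Variation from the note: student 2's score falls by 5 rather than 8.
altered = list(changes)
altered[1] = -5
w_altered, _ = smallest_rank_sum(altered)
print(w_altered, w_altered <= critical_value)  # 5.0 True: H0 rejected
```

The only negative change is student 2's, so W is simply that change's rank: 6 in the data as given, and 5 in the variation.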
2. Large samples
The table below shows the times taken by 25 athletes when running 100 metres before and after a week's special training. Is there a significant change in performance?
Athlete | Time before training | Time after training | Change (0.1 seconds) | Rank
A | 11.2 | 11.0 | -2 | 13
B | 12.2 | 12.2 | 0 | IGNORE
C | 11.4 | 11.3 | -1 | 5
D | 10.8 | 10.9 | 1 | 5
E | 11.1 | 11.4 | 3 | 18
F | 10.6 | 10.8 | 2 | 13
G | 10.9 | 10.8 | -1 | 5
H | 11.3 | 11.5 | 2 | 13
I | 11.5 | 11.1 | -4 | 21
J | 11.8 | 11.6 | -2 | 13
K | 11.9 | 11.0 | -9 | 24
L | 11.4 | 11.0 | -4 | 21
M | 12.1 | 12.0 | -1 | 5
N | 12.1 | 11.9 | -2 | 13
O | 11.9 | 12.0 | 1 | 5
P | 11.3 | 11.2 | -1 | 5
Q | 11.3 | 11.1 | -2 | 13
R | 11.0 | 11.1 | 1 | 5
S | 12.1 | 11.8 | -3 | 18
T | 12.3 | 11.6 | -7 | 23
U | 12.1 | 11.7 | -4 | 21
V | 11.2 | 11.0 | -2 | 13
W | 11.5 | 11.6 | 1 | 5
X | 10.2 | 10.1 | -1 | 5
Y | 12.1 | 11.8 | -3 | 18
Hypotheses:
H0: There is no significant change in times.
HA: There is a significant change in times.
This is a two-tail test of significance.
Significance level: 5%
Testing statistic:
z = (W-µW)/σW
where mean of W = µW = (n(n+1))/4
standard deviation of W =σW = √((n(n+1)(2n+1))/24)
Critical value: From tables of critical z values, critical z for 5% significance level, two-tail test is 1.96.
Decision rule: reject H0 if calculated z > 1.96 or z < -1.96
Computation:
Calculate the changes in times (column 4 in the table above) and allocate ranks, accommodating ties (column 5). Athlete B shows no change and is discarded, so n = 24. Then compute the sum of the positive ranks and the sum of the negative ranks. W is the smallest sum of ranks, in this case W = 64, the sum of the positive ranks (the sum of the negative ranks is 236).
z = (W-µW)/σW
where mean of W = µW = (n(n+1))/4 = (24(25))/4 = 150
standard deviation of W = σW = √((n(n+1)(2n+1))/24) = √((24(25)(49))/24) = √(25(49)) = 5(7) = 35
so z = (W-µW)/σW = (64-150)/35 = -86/35 = -2.457
Decision: reject H0 at a 5% significance level as calculated z is less than -1.96. There is a significant change in times run by athletes.
NOTE: Although the smallest sum of ranks was used to calculate z, the same result would have been achieved by using the larger sum (236 in the example above), as (W-µW) would have been the same absolute size, just the sign would have changed. This does not matter with a two-tail test. If it had been a one-tail test using the normal approximation, then the sum of ranks used would be the one that accorded with the direction of the change being tested.
This is not the case with the one-tail, small-sample situation, where the smallest sum of ranks is always taken and the validity of the one-tail test depends on common-sense identification of the correct direction of change.
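The large-sample example can be reproduced with the sketch below, which also checks the point made in the note: using the larger sum of ranks changes only the sign of z. The changes are taken from column 4 of the table (in tenths of a second, athletes A to Y); the variable names are hypothetical.

```python
import math
from collections import defaultdict

# Changes in tenths of a second for athletes A to Y (column 4 of the table).
changes = [-2, 0, -1, 1, 3, 2, -1, 2, -4, -2, -9, -4, -1,
           -2, 1, -1, -2, 1, -3, -7, -4, -2, 1, -1, -3]

nonzero = [c for c in changes if c != 0]  # athlete B (no change) is discarded
n = len(nonzero)

# Average rank for each absolute size of change, so that ties share a rank.
positions = defaultdict(list)
for position, size in enumerate(sorted(abs(c) for c in nonzero), start=1):
    positions[size].append(position)
rank_of = {size: sum(p) / len(p) for size, p in positions.items()}

positive_sum = sum(rank_of[abs(c)] for c in nonzero if c > 0)
negative_sum = sum(rank_of[abs(c)] for c in nonzero if c < 0)

mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)

z_from_smallest = (min(positive_sum, negative_sum) - mean_w) / sd_w
z_from_largest = (max(positive_sum, negative_sum) - mean_w) / sd_w

print(n, positive_sum, negative_sum)  # 24 64.0 236.0
print(round(z_from_smallest, 3), round(z_from_largest, 3))  # -2.457 2.457
```

Both values lie outside ±1.96, so the two-tail decision is the same whichever sum is used.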
Unworked examples
1. The results of two statistics tests taken two months apart are shown in the table below for 20 students. The tests were similar. Is there a significant improvement in test results?
Student ID no. | First test score | Second test score
1 | 23 | 24
2 | 24 | 28
3 | 25 | 23
4 | 30 | 22
5 | 31 | 34
6 | 34 | 32
7 | 35 | 36
8 | 38 | 40
9 | 40 | 42
10 | 41 | 44
11 | 43 | 43
12 | 46 | 50
13 | 47 | 55
14 | 49 | 53
15 | 50 | 58
16 | 53 | 54
17 | 54 | 53
18 | 55 | 60
19 | 57 | 66
20 | 58 | 67
2. Carry out a Wilcoxon test on the unworked question 2 in Section 8.4.3.6.2.5. Is the Wilcoxon test preferable to the Sign test in this case?
A note on the Walsh test
Walsh has developed a very powerful non-parametric test of two related samples. However, it is of limited use because it necessitates at least interval scale data and requires that both related samples are drawn from symmetrical populations. (A symmetrical population has its mean equal to its median.) The test also requires its own table of critical values and has no operational advantages over the Wilcoxon test.