8.4.2.3 Correlation and regression
Correlation and regression procedures are used to measure the relationship between two or more variables, usually when the data is of an interval scale. This is something of a problem for social science given that so much data is not of an interval scale.
However, when interval scale data is available, grouping the data into categories in order to construct readable crosstabulations loses the precision of the original data.
Grouped categories help us to get a picture of interval scale data (such as age data), but for analytic purposes you get a much better idea of the relationship between two interval scale variables (such as age and income) if you compare the original data rather than the grouped categories.
Correlation and regression analyses use the original ungrouped data and are thus more powerful than measures of association applied to crosstabulated data.
Correlation and regression techniques attempt to show to what extent one variable is dependent upon or associated with one or more other variables. Rather than relate cells in a table, regression and correlation techniques operate on individual cases.
Regression analysis also attempts to specify the nature of the relationship between variables rather than simply assess whether there is some relationship or not.
So,
- correlation techniques measure the degree of association between variables,
- regression techniques specify the nature of the relationship.
For example, a simple relationship may be posited between height and weight. A sample of 100 mature males may be selected at random and their heights and weights measured. A pattern may emerge where the taller people tend to weigh more. However, it is unlikely that there are no exceptions to the general tendency; doubtless any such sample will contain short fat men as well as tall thin ones. Nonetheless, a general trend may still be evident that shows weight to be dependent on height.
Correlation and regression procedures presuppose some sort of underlying relationship, then attempt to
a. identify exactly what this relationship is (regression)
b. assess the extent to which the observed data 'fits' the underlying relationship (correlation).
In our example, if there was a perfect relation where for every inch increase in height individuals were two pounds heavier (e.g. someone 65 inches tall weighed 130 pounds, someone 66 inches tall weighed 132 pounds, someone 67 inches tall 134 pounds, etc) then the underlying relationship would be immediately evident, i.e. a straight line (or linear relationship), and there would be perfect correlation, i.e. no deviation from the linear relationship.
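In such an ideal case the sample would generate a correlation coefficient (the measure of association) equal to 1. (Correlation coefficients are designed to have a value between -1 and +1, so that different cases can be compared: +1 indicates a perfect positive relationship, -1 a perfect negative relationship and 0 no linear relationship.)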
The relationship could also be specified as a regression line, which would take the form of a straight line
Y=2X
(where Y = weight and X= height)
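As a quick check of this idealised case, the following minimal Python sketch (Python is our choice for illustration, and the variable names are assumptions rather than part of the original text) confirms that the three example cases lie exactly on the line Y = 2X, so every deviation from the line is zero.

```python
# Heights (inches) and weights (pounds) quoted in the idealised example.
heights = [65, 66, 67]
weights = [130, 132, 134]

# Deviation of each observed weight from the line Y = 2X.
deviations = [w - 2 * h for h, w in zip(heights, weights)]

print(deviations)  # [0, 0, 0]: no deviation, i.e. a perfect linear relationship
```

With real data the deviations would not all be zero, which is why a line of best fit and a measure of how well it fits are needed.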
When attempting to measure relationships between variables it is usual to identify the independent variable and label it X (when there are several independent variables they are labelled X1, X2, X3 and so on).
The dependent variable is labelled Y. The dependent variable is the one that (if there is any relationship at all) depends on the other variable(s). E.g. weight may depend on height (and not vice versa), income may depend on gender (despite legislation!). Height and gender would be independent in these relationships, weight and income dependent.
Regression analysis, in attempting to measure the underlying relationship, and thus providing a basis for measuring the degree of relationship between the dependent and independent variables, must first determine the nature of the underlying relationship. The normal recourse is to assume that the underlying relationship is linear. (Although techniques exist to determine curvilinear relationships, these are not widely used in social science and will not be discussed here).
In order to specify regression equations, it is necessary to have interval scale data. (There are some relatively obscure attempts to create regression equations for ordinal data but these will not be dealt with).
8.4.2.3.1 Bivariate analysis
Consider a simple two variable relationship (bi-variate), between height and weight. A sample of 100 people will probably show some tendency that would suggest height and weight are related. However, under normal conditions the extent of the relationship and the exact specification of the relationship will not be clear.
Regression analysis assumes that there is a relationship and (usually) that it is linear. It then attempts to locate a specific relationship (in the form of a straight line) that best represents the data, a sort of 'average' relationship. (Much in the same way that charts of height and weight, which show 'normal' weights for each height, attempt to provide a profile of the 'average' weight/height relationship).
There are various methods of determining this average or best underlying relationship. These range from the 'subjective' assessment of the researcher to more 'objective' measurements which determine the unique line which minimises variation.
When the data is interval, and variation is measured by the sum of the squared deviations of the observed Y values from the line, the line that minimises this sum is known as the least squares regression line of best fit. This is the primary approach to regression analysis.
Correlation techniques then show the extent to which the data fits the underlying relationship. When the correlation coefficient is near to 1 then the data closely fits the underlying relationship that has been identified. When the correlation coefficient is close to zero then the data does not fit closely to the underlying relationship and there is little or no association observable between the two variables (despite being the best fit to the data).
8.4.2.3.1.1 Worked Example
Table 8.4.2.3.1 shows the number of years in full-time work and the salary (in thousands of pounds) of 10 care workers. The least squares regression line and the Pearson's product moment correlation coefficient will be calculated to show the underlying (linear) relationship between service and salary (given the sample data) and the extent to which the underlying relationship fits the data.
Table 8.4.2.3.1 Years of service and salary (£000s)
Care worker | Years | Salary
A | 1 | 6
B | 2 | 7
C | 3 | 7
D | 4 | 8
E | 5 | 12
F | 6 | 8
G | 7 | 8
H | 8 | 14
I | 9 | 7
J | 10 | 13
We need to identify the dependent and independent variables. If there is any relationship then salary is dependent on years as a care worker (rather than vice versa). Years as a care worker precedes salary.
In general, the regression line takes the form:
Y = a + bX
(where Y refers to the values of the dependent variable, X refers to values of the independent variable, a is the point at which the straight line intersects the Y axis and b is the gradient of the straight line).
In order to specify the line we need to compute a and b.
The following shows how to compute them by hand. In practice, a computer program is likely to be used. There are many options, from SPSS and functions in Excel through to dedicated regression and correlation calculators that you can find on the Internet.
The formula for b is:
b = [n(sum of XY) - (sum of X)(sum of Y)] / [n(sum of X squared) - (sum of X) squared]
This is usually written with the Greek letter sigma (∑) as a short hand for 'sum of'.
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$
Note that ∑X² is not the same as (∑X)². The first involves squaring all the values of X and then adding them up. The second involves adding up all the X values and then squaring the result.
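The difference between the two quantities is easy to see numerically. The short Python sketch below (an illustration only; the variable names are our own) uses the X values from the worked example, the years of service 1 to 10.

```python
# Years of service for the ten cases in the worked example (1 to 10).
years = list(range(1, 11))

sum_of_squares = sum(x ** 2 for x in years)  # square each value, then add up
square_of_sum = sum(years) ** 2              # add up the values, then square

print(sum_of_squares)  # 385
print(square_of_sum)   # 3025
```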
The formula for a is:
a= Mean of Y - b(Mean of X)
The formula for the Pearson's Product Moment Correlation Coefficient (r) is:
r = [n(sum of XY) - (sum of X)(sum of Y)] / √{[n(sum of X squared) - (sum of X) squared] × [n(sum of Y squared) - (sum of Y) squared]}
Again this is usually written using the ∑ notation:
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$
These look difficult but they are not too bad really. What we need to calculate are the following:
Sum of X = ∑X
Sum of X squared = ∑X²
Sum of Y = ∑Y
Sum of Y squared = ∑Y²
Sum of X multiplied by Y = ∑XY
Thus:
Table 8.4.2.3.2 Years of service and salary (£000s): calculation of regression and correlation coefficients
Care worker | Years (X) | Salary (Y) | XY | X² | Y²
A | 1 | 6 | 6 | 1 | 36
B | 2 | 7 | 14 | 4 | 49
C | 3 | 7 | 21 | 9 | 49
D | 4 | 8 | 32 | 16 | 64
E | 5 | 12 | 60 | 25 | 144
F | 6 | 8 | 48 | 36 | 64
G | 7 | 8 | 56 | 49 | 64
H | 8 | 14 | 112 | 64 | 196
I | 9 | 7 | 63 | 81 | 49
J | 10 | 13 | 130 | 100 | 169
Totals (∑) | 55 | 90 | 542 | 385 | 884
n= sample size = 10
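The five sums needed for the formulas can be checked with a few lines of Python. This is a minimal sketch using only the data from the worked example; the variable names are our own.

```python
# Years of service (X) and salary in £000s (Y) for care workers A to J.
years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
salary = [6, 7, 7, 8, 12, 8, 8, 14, 7, 13]

n = len(years)
sum_x = sum(years)
sum_y = sum(salary)
sum_xy = sum(x * y for x, y in zip(years, salary))
sum_x2 = sum(x ** 2 for x in years)
sum_y2 = sum(y ** 2 for y in salary)

print(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 10 55 90 542 385 884
```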
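$$
\begin{aligned}
b &= \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} \\
  &= \frac{(10 \times 542) - (55 \times 90)}{(10 \times 385) - (55 \times 55)} \\
  &= \frac{5420 - 4950}{3850 - 3025} \\
  &= \frac{470}{825} \\
  &= 0.57 \text{ (to two decimal places)}
\end{aligned}
$$

$$
\begin{aligned}
a &= \text{mean of } Y - b(\text{mean of } X) \\
  &= \frac{90}{10} - 0.57 \times \frac{55}{10} \\
  &= 9 - (0.57)(5.5) \\
  &= 9 - 3.135 \\
  &= 5.9 \text{ (to one decimal place)}
\end{aligned}
$$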
The least squares regression line of best fit is, therefore,
Y = 5.9 + .57 X
So, for example, when X=10 (10 years as a care worker) then the predicted Y (salary) would be:
Y = 5.9 + (.57)(10)
Y = 5.9 + 5.7
Y = 11.6 (thousand pounds)
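The hand calculation can be reproduced in Python. The sketch below is one possible implementation of the sums-based formulas given earlier; the function name `least_squares` is hypothetical, not taken from any statistics package. It works with unrounded values of a and b, so the prediction is 11.56 before rounding, compared with 11.6 obtained from the rounded coefficients.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y = a + bX using the sums-based formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # mean of Y minus b times mean of X
    return a, b


# Years of service (X) and salary in £000s (Y) from the worked example.
years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
salary = [6, 7, 7, 8, 12, 8, 8, 14, 7, 13]

a, b = least_squares(years, salary)
predicted = a + b * 10  # predicted salary after 10 years of service

print(round(b, 2), round(a, 1), round(predicted, 1))  # 0.57 5.9 11.6
```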
We can draw a scatter diagram of the original data and add to it the least squares line, thus:
[Figure: scatter diagram of salary (Y, £000s) against years of service (X), with the least squares regression line added]
From the regression formula:
when X=0 Y=5.9
when X=10 Y=11.6
Plot these two points and join them up to show the regression line.
The Pearson's Product Moment Correlation Coefficient (r) is calculated as follows:

$$
\begin{aligned}
r &= \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} \\
  &= \frac{(10 \times 542) - (55 \times 90)}{\sqrt{\left[(10 \times 385) - (55 \times 55)\right]\left[(10 \times 884) - (90 \times 90)\right]}} \\
  &= \frac{5420 - 4950}{\sqrt{(3850 - 3025)(8840 - 8100)}} \\
  &= \frac{470}{\sqrt{825 \times 740}} \\
  &= \frac{470}{\sqrt{610500}} \\
  &= \frac{470}{781.3} \\
  &= 0.6
\end{aligned}
$$
A correlation coefficient of 0.6 shows a reasonably strong relationship between years of service and salary.
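The same coefficient can be computed in Python directly from the formula for r. This is a minimal sketch; the function name `pearson_r` is our own and is not a library call.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Years of service (X) and salary in £000s (Y) from the worked example.
years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
salary = [6, 7, 7, 8, 12, 8, 8, 14, 7, 13]

r = pearson_r(years, salary)
print(round(r, 2))  # 0.6
```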
8.4.2.3.2 Coefficient of determination
r-squared (r²) is known as the coefficient of determination and shows the proportion of the variation in Y that is attributable to variation in X, usually expressed as a percentage.
In the worked example above (Section 8.4.2.3.1.1) r² is 0.6 × 0.6 = 0.36.
Thus, 36% of the variation in salary is attributable to length of time in the profession. Other factors account for the remaining 64% of the variation in salary (based on this small sample).
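The arithmetic can be checked in a couple of lines of Python. This sketch starts from the rounded value r = 0.6 used in the text (the variable names are our own); the two proportions must sum to 1.

```python
r = 0.6  # rounded correlation coefficient from the worked example

r_squared = r ** 2           # proportion of variation in Y attributable to X
unexplained = 1 - r_squared  # proportion attributable to other factors

print(round(r_squared, 2), round(unexplained, 2))  # 0.36 0.64
```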
Table 8.4.2.3.3 shows the per capita income and the public expenditure per student by a sample of 10 states in the USA. Calculate the regression line of expenditure on income (either by hand or using a computer program). To what extent is expenditure correlated with income? If a state has a per capita income of $95000 what would be the best estimate of the likely expenditure on state education? How much of the expenditure on education is accounted for by per capita income across all 10 states?
Table 8.4.2.3.3 Per capita income and public education expenditure ($0000)
State | Per capita income ($0000) | Public education expenditure per student ($0000)
Arkansas | 4 | 4
California | 12 | 9
Kansas | 5 | 5
Michigan | 7 | 8
New York | 8 | 10
Oregon | 6 | 5
Rhode Island | 7 | 6
South Carolina | 5 | 3
South Dakota | 6 | 5
Wyoming | 5 | 6
Data is hypothetical
8.4.2.3.3 Correlation coefficient for ordinal data
Pearson's Product Moment Correlation Coefficient requires interval scale data. However, it is possible to compute a correlation coefficient for ordinal scale data without using crosstabulation.
Both Spearman's Rank Order Correlation Coefficient and Kendall's Tau statistic enable us to do this.
Spearman's Rank Order Correlation Coefficient (rs) operates like Pearson's coefficient but instead of using the original data it ranks the original data in order and then computes the degree of association between the two sets of ranks. Thus, it requires that both variables are at least of an ordinal scale.
For example, a group of ten students were given an 'authoritarianism' test and an IQ test to see if there was any relationship between intelligence quotient and tendency towards anti-authoritarianism. The authoritarian test provides a score from 1 (authoritarian) to 20 (anti-authoritarian). Table 8.4.2.3.4 shows the original scores and the rank order of the students on each scale.
Table 8.4.2.3.4 Authoritarian score and IQ score
Student | Authoritarian score | IQ score | Authoritarian rank order | IQ rank order
A | 1 | 95 | 1 | 3
B | 3 | 99 | 2 | 4
C | 4 | 120 | 3 | 7
D | 5 | 89 | 4 | 2
E | 9 | 87 | 5 | 1
F | 11 | 100 | 6 | 5
G | 12 | 121 | 7 | 8
H | 14 | 130 | 8 | 10
J | 15 | 125 | 9 | 9
K | 18 | 118 | 10 | 6
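The rank orders in the table can be derived from the raw scores with a short Python sketch. The helper `rank_order` is hypothetical and assumes there are no tied scores (true for this data); with ties, tied cases would need to share an averaged rank.

```python
def rank_order(scores):
    """Rank scores from 1 (lowest) upwards; assumes no tied scores."""
    ordered = sorted(scores)
    return [ordered.index(score) + 1 for score in scores]


# Raw scores for students A to K, in the order given in the table.
authoritarian = [1, 3, 4, 5, 9, 11, 12, 14, 15, 18]
iq = [95, 99, 120, 89, 87, 100, 121, 130, 125, 118]

authoritarian_ranks = rank_order(authoritarian)
iq_ranks = rank_order(iq)

print(authoritarian_ranks)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(iq_ranks)             # [3, 4, 7, 2, 1, 5, 8, 10, 9, 6]
```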
Clearly there is not a perfect correlation: IQ and authoritarianism score are not perfectly matched, although there is a tendency for those with the highest IQ to be more anti-authoritarian.
Spearman's rs gives an indication of the extent of this correlation. The formula for calculating Spearman's Rank Order Correlation Coefficient (rs) is as follows:
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
Where d is the difference between ranks for each pair of observations, n is the number of observations and ∑ means 'sum of'.
So to calculate rs you need to work out the differences in the ranks, square them, and add them up. Then substitute the figure in the formula. Thus:
Table 8.4.2.3.5 Authoritarian score and IQ score, calculation of rs
Student | Authoritarian rank order | IQ rank order | d | d²
A | 1 | 3 | 2 | 4
B | 2 | 4 | 2 | 4
C | 3 | 7 | 4 | 16
D | 4 | 2 | 2 | 4
E | 5 | 1 | 4 | 16
F | 6 | 5 | 1 | 1
G | 7 | 8 | 1 | 1
H | 8 | 10 | 2 | 4
J | 9 | 9 | 0 | 0
K | 10 | 6 | 4 | 16
Total | | | | ∑d² = 66
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 66}{10(100 - 1)} = 1 - \frac{396}{990} = 1 - 0.4 = 0.6$$
There is thus a reasonably strong correlation of 0.6 between IQ and the authoritarianism score (where a higher score means more anti-authoritarian) for the sample of students.
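The calculation of rs from the two sets of ranks can be sketched in Python as follows. The function name `spearman_rs` is our own, and this simple formula assumes there are no tied ranks.

```python
def spearman_rs(ranks_a, ranks_b):
    """Spearman's rank order correlation from two lists of untied ranks."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Rank orders for students A to K, as given in the table of ranks.
authoritarian_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
iq_ranks = [3, 4, 7, 2, 1, 5, 8, 10, 9, 6]

sum_d2 = sum((a - b) ** 2 for a, b in zip(authoritarian_ranks, iq_ranks))
rs = spearman_rs(authoritarian_ranks, iq_ranks)

print(sum_d2)        # 66
print(round(rs, 2))  # 0.6
```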
Again, computer programs will do the computation quickly but it is important to know what the computer is doing when it generates statistics and what these statistics tell you and when they are appropriate.