8.1 Introduction to surveys
8.2 Methodological approaches
8.3 Doing survey research
8.4 Statistical analysis
8.4.1 Descriptive statistics
8.4.2 Exploring relationships
8.4.2.1 Association
8.4.2.2 Crosstabulation
8.4.2.3 Correlation and regression (bivariate)
8.4.2.4 Multivariate analysis
8.4.2.4.1 Multiple correlation
8.4.2.4.2 Partial correlation
8.4.2.4.3 Multivariate analysis using crosstabulations
8.4.3 Analysing samples
8.4.4 Report writing
8.5 Summary and conclusion
8.4.2.4 Multivariate analysis
Multivariate analysis extends the bi-variate case to deal with situations in which Y depends on more than one independent variable. In the social world it is unlikely that a simple association between one independent variable and a dependent variable can be established. For example, income may depend on occupation, but that alone is insufficient, as age, gender, geographic location and so on also affect income levels. The multiple linear regression line therefore shows the relationship between a single dependent variable and any number of independent variables.
Thus, whereas the bi-variate linear regression line takes the general form
Y = a + bX
the multivariate linear regression line takes the form
Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn
where there are n independent variables and each has a coefficient (b1, b2, etc.).
The size of the coefficient provides an indicator of the relative importance of the individual independent variables (the Xs) in determining the dependent variable (Y), provided the Xs are measured on comparable (for example, standardised) scales.
For example, if:
Y = income
X1 = occupation
X2 = age
X3 = gender
X4 = educational attainment score
X5 = introversion score
and the regression line takes the following form
Y = 0.3 + 0.34X1 + 0.28X2 + 0.15X3 + 0.2X4 + 0.002X5
then we can see that the most important factors in determining income (Y) are probably occupation, then age, then educational attainment, then gender and finally introversion.
Indeed, introversion has such a small coefficient that it might be excluded and the regression line recalculated using X1 to X4 only.
N.B. The 0.3 at the beginning of the equation is the 'constant' (the a in the general form) and has little use in interpreting the regression equation.
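As a quick numeric sketch of how such an equation would be used to predict Y (the respondent's scores below are entirely hypothetical):

# Constant and coefficients from the illustrative equation above.
a = 0.3
b = [0.34, 0.28, 0.15, 0.2, 0.002]   # b1..b5

# Hypothetical scores for one respondent: occupation, age, gender,
# educational attainment, introversion (units purely illustrative).
x = [3, 40, 1, 12, 55]

y_hat = a + sum(bi * xi for bi, xi in zip(b, x))
print(y_hat)   # predicted income for this respondent

Note how the tiny coefficient on introversion (0.002) contributes almost nothing to the prediction, which is why it is a candidate for exclusion.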
What does this example tell us, then?
It points to the relative importance of the various factors in determining income. However, you must be careful in making assertions on the basis of regression equations. The following must be borne in mind:
1. the equation is (usually) based on a sample and thus the results are prone to sampling variation (this is discussed in more detail below).
2. the equation shows a mathematical relationship between those items you choose as independent variables and the one you choose as dependent. It is therefore not a definitive statement of causal relationships. Why?
a. because you may have left out an important variable (such as geographic region in the example above)
b. your assumption of a deterministic relationship between Y and the selection of Xs may be entirely false.
3. the equation says nothing about the extent of the association between the dependent and independent variables.
Calculating multiple regression equations by hand is a lengthy and time-consuming process; for that reason no worked example is given here. The computation involves much the same procedures as the computation of bi-variate regression lines. A statistical program can be used to compute the coefficients for a multiple regression.
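As a minimal sketch (not the only way to do this), the coefficients can be computed by ordinary least squares in Python with numpy; the data below are invented purely for illustration:

import numpy as np

# Invented data: each row is one respondent, columns are X1..X3.
X = np.array([[3, 40, 1],
              [5, 25, 0],
              [2, 55, 1],
              [4, 30, 0],
              [1, 45, 1],
              [5, 50, 0]], dtype=float)
y = np.array([15.0, 14.0, 13.5, 13.0, 11.0, 18.0])

# Prepend a column of ones so the first fitted value is the constant a.
Z = np.column_stack([np.ones(len(y)), X])
coeffs, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(coeffs)   # [a, b1, b2, b3]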
8.4.2.4.1 Multiple correlation
A multiple correlation coefficient is a measure of the relationship of Y with the combined Xs. It indicates the extent to which the multiple regression equation fits the data. As in the bi-variate case, the multiple correlation coefficient varies between 0 and 1; the nearer to 1, the stronger the degree of association.
The multiple correlation coefficient (usually Pearson's Multiple Product Moment Correlation Coefficient, R) allows us to specify the relationship between Y and the associated Xs.
The normal procedure when calculating a multiple regression line and multiple correlation coefficient is to identify likely factors that have a bearing on Y and (on the basis of available data) proceed in what is called a stepwise fashion.
The procedure works by entering the independent variables into the equation one at a time, starting with the one that has the highest bivariate correlation with the dependent variable. Each subsequent variable is entered on the basis of which remaining variable has the highest correlation with the dependent variable, allowing for the influence of the independent variables already in the equation. (Any variable already entered whose relationship with the dependent variable disappears when other variables are entered is automatically removed from the equation by some computer programs, such as SPSS.) At each step a new regression line and multiple correlation coefficient are generated.
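A minimal sketch of this forward-stepwise logic in Python with numpy; adding, at each step, whichever remaining variable most increases R is equivalent to choosing the one with the strongest correlation with Y once the entered variables are allowed for (the variable names and data structures here are hypothetical):

import numpy as np

def multiple_r(predictors, y):
    """Multiple correlation R: the correlation between y and the fitted
    values from a least-squares regression of y on the predictors."""
    Z = np.column_stack([np.ones(len(y))] + predictors)
    fitted = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(fitted, y)[0, 1]

def forward_stepwise(candidates, y):
    """Enter variables one at a time, at each step adding the candidate
    that yields the largest multiple R with y."""
    entered = []                  # arrays already in the equation
    remaining = dict(candidates)  # name -> 1-D array of values
    while remaining:
        best = max(remaining, key=lambda k: multiple_r(entered + [remaining[k]], y))
        entered.append(remaining.pop(best))
        print(best, "R =", round(multiple_r(entered, y), 3))

# e.g. forward_stepwise({"X1": x1, "X2": x2, "X3": x3}, y) prints one
# line per step: the variable entered and the new multiple R.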
In the example above, suppose the multiple correlation coefficients for each step were as follows.
Y with X1 R = 0.2
Y with X1 and X2 R = 0.5
Y with X1 and X2 and X3 R = 0.9
Y with X1 and X2 and X3 and X4 R = 0.95
Y with X1 and X2 and X3 and X4 and X5 R = 0.952
This would show that X1, X2 and X3 together were closely correlated with Y (R = 0.9). Adding X4 (educational attainment score) increased the correlation coefficient a little but was of marginal importance. Introversion (X5) had virtually no effect on the multiple correlation and could easily be excluded from the analysis.
Note: the value of R will always increase a little whatever variable is included; even a set of random values will lead to a marginal increase in R in its raw form. Usually, computer programs such as SPSS, which compute this stepwise process, will provide an adjusted value of R that takes account of the number of different Xs included. The adjusted score will then sometimes fall as additional items of no consequence are added to the list of variables to be included. In the above example, if the adjusted values were:
Y with X1 Adjusted R = 0.2
Y with X1 and X2 Adjusted R = 0.495
Y with X1 and X2 and X3 Adjusted R = 0.895
Y with X1 and X2 and X3 and X4 Adjusted R = 0.935
Y with X1 and X2 and X3 and X4 and X5 Adjusted R = 0.925
then including X5 is quite clearly a negative step and the multiple relationship would include only X1 to X4 (with X4 being of relatively little importance).
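Note that the adjustment is conventionally applied to R squared rather than to R itself; a sketch of the standard formula, where n is the number of cases and k the number of independent variables entered:

def adjusted_r_squared(r_squared, n, k):
    """Penalise R^2 for the number of predictors k, given n cases:
    1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# e.g. R = 0.9 from three predictors and 50 cases:
print(adjusted_r_squared(0.9 ** 2, n=50, k=3) ** 0.5)   # adjusted R of about 0.89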
8.4.2.4.2 Partial correlation
Partial correlation is a procedure that allows you to measure the effects of one X on Y, while controlling for the effects of all the other independent variables (known as the control variables) in the multivariate analysis.
It operates through statistical manipulation that assumes linear relationships between all the variables (as in the linear (or first-order) multiple regression equation). Once the linear relationships between the independent, the dependent and the control variables are known, it is possible to remove the effect of the control variables. This is done by predicting the values of the independent and dependent variables (separately) from the control variable(s), on the basis of the correlation between the control variable(s) and X and between the control variable(s) and Y; the partial correlation is then the correlation between what remains of X and of Y once those predictions have been removed.
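A minimal sketch of this residual-based logic in Python with numpy, assuming interval-level variables supplied as 1-D arrays:

import numpy as np

def partial_correlation(x, y, controls):
    """Correlation between x and y after removing the parts of each
    that are linearly predictable from the control variables."""
    Z = np.column_stack([np.ones(len(x))] + list(controls))
    # residuals: what is left of x and y once the controls have predicted them
    x_res = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    y_res = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(x_res, y_res)[0, 1]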
Partial correlation is a useful tool in clarifying the relationships between three or more variables. While the multiple correlation coefficient gives an overall picture of the relationship between Y and the combined Xs, the partial correlation coefficients provide the basis for examining the relationship between each X and Y when all the influence of the other factors is removed.
For example, income might be seen to be a function of education, and it would be possible to assess the extent of this relationship (assuming that education can be measured). We could simply undertake a linear bi-variate analysis and show the degree of association between education (X) and income (Y). Such a correlation coefficient might be computed to be 0.45, for example.
This simple linear relationship would probably be unsatisfactory, however, as it ignores other factors, such as age, gender, and so on. A multiple analysis may then be preferable. We could compute a multiple linear regression equation such as:
Y = 0.2 + 0.53X1 + 0.24X2 + 0.1X3
where X1 = educational attainment score, X2 = age, X3 = gender
Suppose the bi-variate correlation matrix is as follows (this is the correlation of each variable with every other variable; thus, in the example, the correlation between education and income is r = 0.45):

            income   education   age     gender
income       1.00
education    0.45      1.00
age          0.30      0.05      1.00
gender       0.15      0.19      0.05     1.00
This indicates that educational attainment is the most important of the various factors affecting income. A multiple correlation coefficient may be calculated. Suppose, for the example above, it equalled 0.6; this would show that the three factors (weighted and combined linearly) correlated moderately well with income, and better than any simple correlation with Y (where, for the example, the best correlation = 0.45). However, the multiple correlation of 0.6 would by no means provide a complete explanation of variations in income.
The partial correlation coefficients show the relationship of any one of the independent variables with Y, whilst controlling for the effect of the others. So, it would be possible to see what effect X1 (educational attainment score) had on Y whilst controlling for age and gender. (This would be the partial correlation of X1 on Y, controlling for X2 and X3).
This procedure is different from just calculating the bi-variate relationship of X1 on Y. What the partial correlation does, in effect, is show the relationship between income and educational attainment irrespective of the influence of age and gender. The computation acts as though you had an enormous sample, took out all the males of a given age and analysed the correlation between income and educational attainment, then did the same for the males of the next age group, and so on; then repeated the whole process for the females in each age group; and finally worked out an average of all these correlations between income and educational attainment across the different age/gender groups. The partial correlation is, in effect, what that average would be. However, because of the statistical predictive procedures used, one does not need such a large sample.
Suppose, in the example, that the partial correlation of X1 (education) on Y, controlling for X2 and X3, was small, say 0.1. This would suggest that there was no direct correlation between income and educational attainment: once the effects of gender and age were removed, there was more or less random variation between X1 and Y.
Partial correlation analysis, then, is useful in clarifying relationships. It can indicate spurious relationships by uncovering preceding or intervening variables.
A spurious relationship is one in which Y appears to be associated with X1 (as indicated by, say, the simple correlation of Y and X1) when in fact X1 merely varies with some other preceding independent variable(s), X2 (X3, etc.), which is the real predictor of Y. When X2 is controlled for, the partial correlation coefficient for X1 on Y (controlling for X2) will be small if such a spurious relationship exists.
This is illustrated in the above example where the partial correlation of X1 on Y drops to 0.1. In such a case, the variation in X1 is reflected in the variation in age and gender (which precede educational attainment). What this would mean, then, is that, in the example, when one controls for age and gender, income is independent of (i.e. not related to) education.
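A hedged illustration with simulated data, reusing the partial_correlation sketch above; the data-generating model is invented purely to mimic a spurious relationship:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x2 = rng.normal(size=n)          # preceding variable (e.g. age)
x1 = x2 + rng.normal(size=n)     # x1 merely tracks x2
y = x2 + rng.normal(size=n)      # y is really driven by x2, not x1

print(np.corrcoef(x1, y)[0, 1])          # sizeable simple correlation (about 0.5)
print(partial_correlation(x1, y, [x2]))  # near zero once x2 is controlled for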
Partial correlation can also be used to suggest causal chains of intervening variables between Y and X1. In other words, a relationship observed between Y and X1 (a simple correlation coefficient) may be spurious in the sense that Y is dependent on some intervening variable(s) that is dependent on X1.
Take, for example, the relationship between the social class (as measured by some socio-economic index based on income and occupation) of parents and of children. A simple correlation might reveal a high correlation coefficient, suggesting (possibly) a substantial degree of transfer of wealth from one generation of a family to the next. However, a second thesis might suggest that the relationship is not that simple (that such a correlation is misleading): that, in fact, the social class of the parents provides a basis for the generation of conditions by which the children attain the same class. This might be through the mediation of education, occupation, etc., which the parents affect and which, in turn, affect the social class of the child (in later life).
For the computation and interpretation of the partial correlation coefficient, the same procedures operate for intervening variables as in the case of preceding variables, only the theoretical model (suggesting a causal chain, rather than the dissolution of a causal link) is different.
So, generally, if the partial correlation between X1 and Y (controlling for X2, X3, etc.) drops below the simple correlation of X1 and Y, then the relationship between X1 and Y is likely to be spurious. The nearer the partial correlation approaches zero, the clearer is the spurious nature of the original observed relationship between X1 and Y. In such a case, X1 should be removed from the relationship and Y considered as a function of X2 (X3, etc.).
Sometimes, however, the partial correlation coefficient of X1 and Y (controlling for X2 etc.) is larger than the simple correlation of X1 and Y.
Suppose that, in the example correlating income with educational attainment, age and gender, the partial correlation of Y on X1 (controlling for X2 and X3) was 0.8. This would suggest that, when the effects of age and gender were removed, there was a strong association between income and educational attainment (stronger than the simple correlation, which was 0.45). In effect, the role of gender and age in the initial regression equation, while necessary as controls, was to conceal the direct effect of X1 on Y. In such a case, controlling for other relevant variables has served to increase one's confidence in the relationship between X and Y (in the example, between education and income).
8.4.2.4.3 Multivariate analysis using crosstabulations
It is possible to analyse the relationship between two variables, taking into account other factors, when the data are crosstabulated (i.e. of a nominal or ordinal scale). This is done by constructing n-way crosstabulations (i.e. crosstabulations that go beyond simple two-way crosstabulations). The approach is to create crosstabulations of X by Y controlling for a third (or any number of) other variables. So, for example, using sample data, one might construct a crosstabulation of voting preference by social class.
For example:
Table 8.4.2.4.3.1 Crosstabulation Vote by Class (Count)

Party           Middle Class   Working Class   Totals
Conservative        125              75           200
Labour               75             125           200
Totals              200             200           400
This is a simple two-way crosstabulation. It shows a relationship between class and voting preference.
One may, however, think that this is too simplistic and want to take other variables into account. For example, gender may play a part in voting preference, so one may want to control for gender. A three-way crosstabulation, controlling for gender, would further break down the initial crosstabulation into the relationship between voting preference and social class for men and for women. That is, create two tables, thus:
Table 8.4.2.4.3.2 Crosstabulation Vote by Class (Count), controlling for gender. Gender = male (n = 200)

Party           Middle Class   Working Class   Totals
Conservative         55              45           100
Labour               45              55           100
Totals              100             100           200
Table 8.4.2.4.3.3 Crosstabulation Vote by Class (Count), controlling for gender. Gender = female (n = 200)

Party           Middle Class   Working Class   Totals
Conservative         75              25           100
Labour               25              75           100
Totals              100             100           200
These two tables could be further broken down using a second control variable, such as age. If we divide age into two groups ('old' being those over 40 and 'young' those between 18 and 40), then we would create four tables of vote by class: one for young males, one for young females, one for old males and one for old females.
For example:
Table 8.4.2.4.3.4 Crosstabulation Vote by Class (Count), controlling for gender and age. Gender = male, age = young (18-40 years) (n = 90)

Party           Middle Class   Working Class   Totals
Conservative         20              20            40
Labour               30              20            50
Totals               50              40            90
Table 8.4.2.4.3.5 Crosstabulation Vote by Class (Count), controlling for gender and age. Gender = male, age = old (over 40 years) (n = 110)

Party           Middle Class   Working Class   Totals
Conservative         35              25            60
Labour               15              35            50
Totals               50              60           110
Table 8.4.2.4.3.6 Crosstabulation Vote by Class (Count), controlling for gender and age. Gender = female, age = young (18-40 years) (n = 110)

Party           Middle Class   Working Class   Totals
Conservative         30              20            50
Labour               20              40            60
Totals               50              60           110
Table 8.4.2.4.3.7 Crosstabulation Vote by Class (Count), controlling for gender and age. Gender = female, age = old (over 40 years) (n = 90)

Party           Middle Class   Working Class   Totals
Conservative         45               5            50
Labour                5              35            40
Totals               50              40            90
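Tables such as 8.4.2.4.3.1 to 8.4.2.4.3.7 can be generated directly from respondent-level data. A minimal sketch in Python with pandas (the data frame and its column names are hypothetical):

import pandas as pd

# Hypothetical respondent-level data (names invented for illustration).
df = pd.DataFrame({
    "vote":   ["Conservative", "Labour", "Labour", "Conservative",
               "Labour", "Conservative", "Conservative", "Labour"],
    "cls":    ["Middle", "Working", "Middle", "Middle",
               "Working", "Working", "Middle", "Working"],
    "gender": ["male", "male", "female", "female",
               "male", "female", "male", "female"],
})

# Two-way crosstabulation of vote by class, with totals.
print(pd.crosstab(df["vote"], df["cls"], margins=True))

# Three-way: the same table, controlling for gender (one sub-table per group).
for sex, sub in df.groupby("gender"):
    print(f"\ngender = {sex} (n = {len(sub)})")
    print(pd.crosstab(sub["vote"], sub["cls"], margins=True))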
The big problem with this way of analysing multivariate relationships is that, as the number of control variables increases, the total number of cases in each table declines. There is, then, a practical limit to the number of controls one can introduce for any given sample size (otherwise one ends up with very small sub-samples which tell you nothing).
What does the above example tell us about specifying the relationship between vote and class, when we control for gender and age?
There was an initial relationship observed between vote and class. When we took account of gender, Tables 8.4.2.4.3.2 and 8.4.2.4.3.3 revealed a stronger relationship between vote and class for women than for men. When this was further broken down using age (Tables 8.4.2.4.3.4 to 8.4.2.4.3.7), it was evident that the strongest relationships existed amongst the older voters, both male and female (the female voters, however, exhibited a stronger relationship than the males in both age categories). This would lead us to suppose that while class and vote are associated, the relationship is mediated by age and gender.