SHS Lite - User Guide: A guide to using the Scottish Household Survey simplified dataset

5. Confidence Intervals and Statistical Significance

5.1 The Representativeness of the Scottish Household Survey

Although the SHS sample is chosen at random, the people who take part in the survey will not necessarily be a representative cross-section of the population. Like all sample surveys, the SHS produces estimates of the corresponding figures for the whole population, and these estimates might differ from the true values in the population for three main reasons.

  • The sample source does not completely cover the population because accommodation in hospitals, prisons, military bases, larger student halls etc. is excluded from the sampling frame. The SHS therefore provides a sample of private households rather than all households. The effect of this on the representativeness of the data is not known.
  • Some people refuse to take part in the survey and some cannot be contacted by interviewers. If these people are systematically different from the people who are interviewed, this represents a potential source of bias in the data. Comparison of the SHS data with other sources suggests that, for the survey as a whole, bias due to non-response is small (see Section 2.4).
  • Samples always have some natural variability because of the random selection of households and people within households. In some areas where the sample is clustered, the selection of sampling points adds to this variability.

Each of these sources of variability becomes much more important when small sub-samples of the population are examined. For example, a sub-sample with only 100 households might have had very different results if the sampling had by chance selected four or five more households with children.

5.2 Confidence Intervals

The likely extent of sampling variability can be quantified by calculating the 'standard error' associated with an estimate produced from a random sample. Statistical sampling theory states that, on average:

  • Only about one sample in three would produce an estimate that differed from the (unknown) true value by more than one standard error.
  • Only about one sample in twenty would produce an estimate that differed from the true value by more than two standard errors.
  • Only about one sample in 400 would produce an estimate that differed from the true value by more than three standard errors.

By convention, the '95% confidence interval' is defined as the estimate plus or minus about twice the standard error because there is only a 5% chance (on average) that a sample would produce an estimate that differs from the true value of that quantity by more than this amount.
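
The three proportions quoted above can be checked with a short sketch. It assumes that estimates are normally distributed around the true value, which is the usual large-sample approximation rather than something stated in this guide; the function name `tail_probability` is illustrative only.

```python
import math

def tail_probability(k: float) -> float:
    """Chance that an estimate lies more than k standard errors from the
    true value, assuming a normal sampling distribution (an assumption)."""
    return 1.0 - math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    p = tail_probability(k)
    print(f"more than {k} standard error(s): {p:.4f} "
          f"(about one sample in {1 / p:.0f})")
```

Under this approximation the chances come out at roughly one in 3, one in 22 and one in 370, which the guide rounds to one in three, one in twenty and one in 400. The 5% figure corresponds to 1.96 standard errors, hence "about twice".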

There is no simple "rule of thumb" for the size of standard errors. The standard error of the estimate of a percentage depends upon several things:

  • The value of the percentage itself.
  • The size of the sample (or sub-sample) from which it was calculated (i.e. the number of sample cases corresponding to 100%).
  • The sampling fraction (i.e. the fraction of the relevant population that is included in the sample).
  • The 'design effect' associated with the way in which the sample was selected (for example, a clustered random sample would be expected to have larger standard errors than a simple random sample of the same size).
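
As a rough sketch of how these four factors are commonly combined in standard sampling theory, the expression below may help. It is not quoted from the SHS documentation, and the symbols are introduced here purely for illustration: p is the estimated proportion, n the size of the sample or sub-sample, f the sampling fraction and deff the design effect.

```latex
\mathrm{SE}(p) \approx \sqrt{\mathit{deff} \times (1 - f) \times \frac{p\,(1 - p)}{n}},
\qquad
\text{95\% confidence limit} \approx 1.96 \times \mathrm{SE}(p)
```

For a simple random sample with a very small sampling fraction, deff = 1 and f is close to 0, so the expression reduces to the familiar square root of p(1 - p)/n.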

The Estimated Sampling Error table shows the 95% confidence limits for a range of estimates calculated for a range of sample sizes. To gauge the potential variability of a survey estimate, read along the row with the value closest to the estimate until you reach the column with the value closest to the size of the sub-sample. This gives a value which, when added to and subtracted from the estimate, gives the range (the 95% confidence interval) within which the true value is likely to lie.

Figure 16 - Local authority by Household type (row percentages displayed)

graphic

Figure 16 can be used to see the effect of smaller sample sizes. The survey estimates that in East Dunbartonshire 13% of households contain one non-pensioner adult (calculated by combining 8.4% single adults and 4.4% single parents). However, only 608 households in East Dunbartonshire were interviewed, so the sampling error table shows that this estimate has a 95% confidence limit of approximately 3 percentage points, suggesting that the true value lies between 10% and 16%. Clearly, the estimate for any single area is less reliable than the estimate for Scotland as a whole.
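
For readers without the table to hand, the East Dunbartonshire figure can be approximated with a minimal sketch. It assumes a simple random sample (no design effect and a negligible sampling fraction), which is not how the published table was necessarily produced; the function name `srs_confidence_limit` is hypothetical.

```python
import math

def srs_confidence_limit(percentage: float, sample_size: int,
                         z: float = 1.96) -> float:
    """95% confidence limit, in percentage points, for an estimated
    percentage. Assumes a simple random sample: no design effect and a
    negligible sampling fraction (both are simplifying assumptions)."""
    p = percentage / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / sample_size)
    return 100.0 * z * standard_error

# Single adults plus single parents in East Dunbartonshire (Figure 16).
estimate = 8.4 + 4.4
limit = srs_confidence_limit(estimate, 608)

print(f"simple-random-sample limit: +/-{limit:.1f} percentage points")
print(f"interval: {estimate - limit:.1f}% to {estimate + limit:.1f}%")
```

The result is a little under 3 percentage points. The published table should take precedence wherever the two disagree, since it may allow for features of the sample design that this sketch ignores.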

5.3 Statistical Significance

Because the survey's estimates may be affected by sampling errors, apparent differences of a few percentage points between sub-samples may not reflect real differences in the population. It might be that the true values in the population are similar but the random selection of households for the survey has, by chance, produced a sample which gives a high estimate for one sub-sample and a low estimate for the other.

A difference between two areas is significant if it is so large that a difference of that size (or greater) is unlikely to have occurred purely by chance. Conventionally, significance is tested at the 5% level, which means that a difference is considered significant if a difference of that size would arise by chance in only about one in 20 samples when there is no real difference in the population. Testing significance involves comparing the difference between the two estimates with the 95% confidence limits for each of the two estimates.

If you were to scroll down in the output page for this example, you would see that the survey estimates the proportion of single adult households to be 8% in East Dunbartonshire (95% confidence limit 2.3%), 9% in Midlothian (2.5%), 13% in the Highlands (2.0%), and 22% in Edinburgh (1.7%). We can say the following:

  • The difference between East Dunbartonshire and Midlothian is not significant because the difference between the two (1%) is smaller than either of the confidence limits. In general, if the difference is smaller than the larger of the two limits, it could have occurred by chance and is not significant.
  • The difference between East Dunbartonshire and Edinburgh is significant because the difference (14%) is greater than the sum of the limits (2.3% + 1.7% = 4%). In general, a difference that is greater than the sum of the limits is significant.
  • If the difference is greater than the larger of the two confidence limits but less than the sum of the two limits, the difference might be significant, although the test is more complex.

Statistical sampling theory suggests that a difference is significant if it is greater than the square root of the sum of the squares of the confidence limits for the two estimates. The difference of 4% between Midlothian and the Highlands is greater than the larger confidence limit (2.5% in Midlothian) but less than the sum of the two limits (2.5% + 2.0% = 4.5%), so it might be significant. In this case, 2.5² = 6.25 and 2.0² = 4, giving a total of 10.25. The square root of this is 3.20. Because the difference of 4% is greater than 3.20%, it is significant. Similar calculations will indicate whether or not other pairs of estimates differ significantly.
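
The comparisons in this section can be sketched in a few lines of code. The estimates and confidence limits are the ones quoted above; the function name `compare_estimates` and the wording of the returned notes are illustrative and not part of any SHS tool.

```python
import math

def compare_estimates(est_a, limit_a, est_b, limit_b):
    """Compare two percentage estimates using their 95% confidence limits.

    Returns (significant, threshold, note). The threshold is the square
    root of the sum of the squared limits; the note records which of the
    two shortcut rules, if either, would have settled the question."""
    difference = abs(est_a - est_b)
    threshold = math.sqrt(limit_a ** 2 + limit_b ** 2)
    if difference < max(limit_a, limit_b):
        note = "smaller than the larger limit"
    elif difference > limit_a + limit_b:
        note = "greater than the sum of the limits"
    else:
        note = "between the shortcuts, full test needed"
    return difference > threshold, threshold, note

# (estimate %, 95% confidence limit %) for single adult households.
areas = {
    "East Dunbartonshire": (8.0, 2.3),
    "Midlothian": (9.0, 2.5),
    "Highlands": (13.0, 2.0),
    "Edinburgh": (22.0, 1.7),
}

pairs = [("East Dunbartonshire", "Midlothian"),
         ("East Dunbartonshire", "Edinburgh"),
         ("Midlothian", "Highlands")]

for a, b in pairs:
    significant, threshold, note = compare_estimates(*areas[a], *areas[b])
    verdict = "significant" if significant else "not significant"
    print(f"{a} vs {b}: {verdict} (threshold {threshold:.2f}; {note})")
```

The square-root rule always agrees with the two shortcuts, because the threshold can never be smaller than the larger limit nor greater than the sum of the two limits. The shortcuts simply save the calculation in clear-cut cases.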

5.4 Statistical Significance and Representativeness

Calculations of confidence limits and statistical significance only take account of sampling variability. The survey's results could also be affected by non-contact/non-response bias. If the characteristics of the people who should have been in the survey but who could not be contacted, or who refused to take part, differ markedly from those of the people who were interviewed, there might be bias in the estimates. If that is the case, the SHS's results will not be representative of the whole population.

Without knowing the true values (for the population as a whole) of some quantities, we cannot be sure about the extent of any such biases in the SHS. However, comparison of SHS results with information from other sources, such as the 2001 Census and other government surveys, suggests that they are broadly representative of the overall Scottish population, and therefore that any non-contact or non-response biases are not large overall. Such biases could, of course, be larger for some sub-groups of the population or in certain Council areas, particularly those that have the highest non-response rates.

As stated in Section 5.1, because it is a survey of private households, the SHS does not cover some sections of the population - for example, it does not collect information about many students in halls of residence (see the SHS Technical Reports for further information).

Estimated Sampling Error Table

Estimated sampling error associated with different proportions for different sample sizes

table

Page updated: Tuesday, May 16, 2006