2 Sample design, selection and allocation
The sampling requirements of the Scottish Household Survey are specified in terms of providing data with a level of precision equivalent to what would be achieved by a simple random sample (SRS) of a particular size. Five requirements were identified in the survey specification and, although the preference was for simple random sampling to be used across the survey, contractors were invited to propose design solutions that would most cost-effectively meet these requirements.
The sampling requirements were to provide the SRS equivalents of:
- 2,500 interviews in Scotland as a whole for each quarter in each year
- 500 interviews in each of the larger local authorities (those which have over 120,000 households) for each year
- 500 interviews in each local authority area (regardless of size) for each two-year period
- 500 interviews in each category of the six-fold urban rural classification in each year, and
- 2,000 interviews in the 15% most deprived areas of Scotland, taken together as a group, in each year.
These needs are met by a combination of:
- disproportionately stratifying the sample to ensure that each local authority and each other geographical area has enough interviews in each survey period
- modelling the impact of different sampling methods and combinations of clustered and unclustered sampling to assess the impact of the survey design on survey precision at different geographical levels and at different points in time
- using a combination of unclustered and clustered sampling to maximise the cost-effectiveness of fieldwork by minimising the impact of weighting and clustering on the survey's effective sample size
- allocating the selected sample appropriately over survey periods to provide the appropriate level of clustering at each period.
The survey has a number of other requirements that have been retained from previous sweeps such as:
- the sample should be fully national in character - while it is not uncommon for national surveys to exclude the area north of the Caledonian Canal or to be restricted to the Scottish mainland and the larger inhabited islands, the Scottish Household Survey includes all parts of Scotland, including small inhabited islands.
- the sample should be capable of producing data which are representative both of Scottish households and the adult (aged 16+) population resident in private households.
Each of these features of the sampling is discussed more fully below.
2.1 Stratification by local authority
In general, stratifying a sample by some known variable should improve the precision of survey estimates because structuring a sample in this way can be no worse than would be achieved by a random allocation. For example, selecting a national random sample should result in a geographical distribution that reflects the distribution of population, but it may not, and some random deviation from the known distribution would be expected. Stratification removes the element of chance by assigning sample to geographical areas and drawing smaller samples within each of the strata. In this way the sample distribution must reflect the known population distribution.
The sampling requirements for the SHS indicate a need for an allocation of the sample between local authorities that does not reflect the distribution of the population, so that each can, after two years, have the SRS equivalent of 500 interviews. For a given sample size, meeting this need requires smaller local authorities to have more interviews than a proportionate allocation would give them and, as a result, larger local authorities have fewer interviews.
Analysis at a national level requires the data to be weighted so disproportionate stratification reduces the precision of survey estimates, making the gross sample equivalent to a smaller simple random sample. This impact needs to be considered in meeting the survey's sampling requirements.
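The loss of precision from weighting can be illustrated with Kish's approximation for the effective sample size, n_eff = (sum of weights)^2 / (sum of squared weights). The report does not state which formula was used in its modelling, so the following is only a sketch under that assumption; the function name and the two-stratum example are hypothetical.

```python
def kish_effective_sample_size(weights):
    """Kish approximation: n_eff = (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Hypothetical example: 500 interviews in a large authority and 500 in a
# small authority sampled at four times the rate, so that the small
# authority's interviews carry a quarter of the weight.
weights = [1.0] * 500 + [0.25] * 500
n_eff = kish_effective_sample_size(weights)
print(round(n_eff))  # 1,000 gross interviews behave like about 735
```

With equal weights the function returns the gross sample size, which is why a proportionate allocation carries no weighting penalty.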
2.1.1 Disproportionate stratification between local authorities
The underlying principle here is that the allocation of interviews by local authority area should be broadly proportionate to the number of households, except where the resulting sub-sample in any particular area would fall below a pre-determined accuracy threshold (the equivalent of a simple random sample of 500 interviews). The allocation was carried out in the following way.
1. The sample design needs to take account of the allocation of interviews to each local authority and for each allocation, the resulting impact on the number of interviews derived from the smallest category of the urban rural classification, the 15% most deprived areas and the number of interviews achieved each year in the five largest local authorities.
2. These estimates need to take account of the impact of clustering and weighting at each level and at each relevant time period to ensure that the sample will, after accounting for weighting and clustering, meet the SRS equivalent sample size requirements specified.
3. Having established that a particular design will meet the sampling objectives at each geographical level and at each time period, the total sample is built from the combinations of individual local authority samples.
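The underlying principle, a proportionate allocation raised to a floor where it would otherwise fall short, can be sketched as follows. This is an illustrative simplification rather than the procedure actually used: the sampling rate of 0.01 is a hypothetical figure, the function name is invented, and the sketch ignores the clustering and weighting effects described in steps 1 to 3. The household counts are taken from Table 2-1.

```python
def allocate_interviews(households, rate, minimum=500):
    """Proportionate allocation, raised to a floor where it falls short."""
    return {la: max(round(n * rate), minimum) for la, n in households.items()}


# Household counts from Table 2-1; the rate is a hypothetical value.
households = {"Dundee City": 67_032, "Orkney Islands": 8_380, "Stirling": 37_321}
allocation = allocate_interviews(households, rate=0.01)
for la, interviews in allocation.items():
    print(la, interviews)
```

Under these assumptions the larger authority keeps its proportionate share while the two smaller ones are lifted to 500.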
2.1.2 Clustered and unclustered sampling within local authorities
In previous years, the SHS has used a combination of clustered and unclustered sampling but the tendering for the new contract allowed some reconsideration of the approach used. Although the survey specification expressed a preference for a wholly unclustered sample, it was necessary to test a number of designs to identify the extent to which a more cost-effective design could be achieved by retaining an element of clustering in areas where this would produce fieldwork efficiencies.
An unclustered sample involves, within each of the primary strata (the local authorities), either selecting a simple random sample of addresses (n) from all possible addresses (N) or selecting a systematic sample by identifying a random starting address and selecting addresses with a fixed interval equal to N/n. Systematic sampling has the advantage that by ordering the addresses by some characteristics (such as deprivation or postcode) it is possible to achieve a further level of implicit stratification as the ordering and fixed interval sampling ensures that the variables used to order the list are represented in proportion to their prevalence in the population.
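A minimal sketch of systematic selection with implicit stratification is given below. The frame, the `decile` field and the function name are hypothetical; the only elements taken from the text are the random start and the fixed interval of N/n.

```python
import random
from collections import Counter


def systematic_sample(frame, n, sort_key=None, rng=random):
    """Select n items at a fixed interval of N/n from a random start."""
    ordered = sorted(frame, key=sort_key) if sort_key else list(frame)
    interval = len(ordered) / n          # N / n
    start = rng.random() * interval      # random start in [0, interval)
    picks = (int(start + i * interval) for i in range(n))
    # min() guards against a floating-point overshoot at the end of the list.
    return [ordered[min(p, len(ordered) - 1)] for p in picks]


# Hypothetical frame: 1,000 addresses, each with a deprivation decile (0-9).
frame = [{"id": i, "decile": i % 10} for i in range(1000)]
sample = systematic_sample(frame, 100, sort_key=lambda a: a["decile"],
                           rng=random.Random(2007))
per_decile = Counter(a["decile"] for a in sample)
print(sorted(per_decile.items()))
```

Because the list is ordered by decile before the interval is applied, each decile contributes in proportion to its share of the frame, which is the implicit stratification described above.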
Clustered sampling is a two-stage process of selecting a sample of geographical units within which a sample of individual addresses is selected. The benefit of clustered sampling is to reduce the mean distance between sampled addresses, improving the efficiency of survey fieldwork by reducing interviewer travel time between addresses and increasing the number of times interviewers can try to make contact at sampled addresses. Although clustering offers an administrative benefit, this comes at a statistical penalty because estimates from a clustered sample tend to be more variable than those from a simple random sample of the same size. The reason for this is that the number of sampled clusters is generally small relative to the total number of potential clusters and, within each cluster, the sampled addresses are likely to be more similar to each other than they are to the addresses sampled in other clusters. This increased variability can be estimated and compared with the level of variability of a simple random sample to give an estimate of the achieved sample's simple random sample equivalence.
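One common way of expressing this penalty is the design effect, which under Kish's approximation is 1 + (m - 1) x rho, where m is the mean number of interviews per cluster and rho is the intra-cluster correlation. The report does not say which formula was used, so the sketch below is an assumption, and the values of m and rho are hypothetical. The final lines simply divide the gross sample by the SRS equivalent shown for Aberdeenshire in Table 2-1 to recover the design effect implied there.

```python
def design_effect(mean_cluster_size, intra_cluster_correlation):
    """Kish approximation to the design effect of a clustered sample."""
    return 1 + (mean_cluster_size - 1) * intra_cluster_correlation


def srs_equivalent(gross_sample, deff):
    """Size of the simple random sample giving the same precision."""
    return gross_sample / deff


# Hypothetical inputs: 10 interviews per cluster, correlation of 0.04.
deff = design_effect(10, 0.04)
print(round(deff, 2), round(srs_equivalent(1000, deff)))

# Design effect implied by the Aberdeenshire row of Table 2-1.
implied_deff = 1344 / 992
print(round(implied_deff, 2))
```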
The extent to which a sample should be clustered therefore requires some comparison of the precision of the sample and the cost of achieving that level of precision - a clustered sample is only more cost-effective than a simple random sample if it can achieve the same level of precision as the equivalent simple random sample at lower cost.
There are two key variables that influence whether a sample should be clustered:
- Population density - the extent to which the population is naturally concentrated in a geographical area or spread across a number of small settlements separated by large distances.
- The sampling interval - the size of the survey sample in relation to the population from which it is selected - because for a given population, a larger sample will result in sampled addresses being closer together reducing the administrative gains from clustering compared with an SRS.
The implication of this is that there is likely to be no efficiency gain from clustering a large sample in an urban area whereas there is greater likelihood of efficiency gains from clustering a small sample in a dispersed or rural area.
This broad approach was used in the 2007 SHS sampling using the Scottish Government's urban rural classification to identify areas where sample should be clustered or unclustered. The general approach was that areas classified as 'large urban areas' or 'other urban areas' would use unclustered sampling while areas in the other four categories (accessible small towns, remote small towns, accessible rural and remote rural) would use clustered sampling.
This was applied within local authorities meaning that the sample was further stratified within each local authority using the urban rural classification and that each local authority potentially contains a combination of clustered and unclustered sampling.
In practice, this general approach is modified in two ways.
- Where more than 80% of households in a local authority fall into the 'urban' or 'non-urban' category, the whole local authority is treated as that category.
- The three island authorities (Eilean Siar, Orkney and Shetland) use wholly unclustered sampling even though their urban rural classification suggests that they should use wholly clustered sampling. In these areas, the sampling interval is between 1 in 6 households and 1 in 8 households, which means that clustered sampling would be no more efficient than unclustered sampling.
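The general rule and its two modifications can be sketched as a simple decision function. The function name, return labels and household counts below are hypothetical; the 80% threshold, the urban categories and the island exception come from the text.

```python
URBAN_CATEGORIES = {"large urban areas", "other urban areas"}
ISLAND_AUTHORITIES = {"Eilean Siar", "Orkney Islands", "Shetland Islands"}


def sampling_approach(authority, households_by_category, threshold=0.80):
    """Return 'unclustered', 'clustered' or 'mixed' for one local authority."""
    if authority in ISLAND_AUTHORITIES:
        return "unclustered"  # sampling interval is already very short
    total = sum(households_by_category.values())
    urban = sum(n for category, n in households_by_category.items()
                if category in URBAN_CATEGORIES)
    urban_share = urban / total
    if urban_share > threshold:
        return "unclustered"
    if 1 - urban_share > threshold:
        return "clustered"
    # Otherwise stratify within the authority: urban areas unclustered,
    # the remaining categories clustered.
    return "mixed"


# Hypothetical household counts for illustration only.
print(sampling_approach("Authority A", {"large urban areas": 90, "remote rural": 10}))
print(sampling_approach("Authority B", {"other urban areas": 50, "accessible rural": 50}))
```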
Table 2-1 shows the expected distribution of sample by local authority at the end of each two-year sampling period. It should be noted that this distribution does not meet the specified requirement of 2,000 interviews each year in the 15% most deprived datazones in Scotland. This was agreed as part of the post-tender negotiations because allowing this requirement to fall slightly below the target allowed an overall reduction in the survey sample of 3,000 interviews over the four-year contract period, achieving a significant reduction in the total survey cost.
Table 2-1: Projected two-year achieved sample size by local authority and SRS equivalent sample over target periods
| Local authority | Total households | % of interviews from unclustered sample | Gross sample in target period | SRS equivalent in target period |
|---|---|---|---|---|
| Large local authorities - minimum of 500 SRS equivalent interviews each year | | | | |
| Edinburgh, City of | 209,502 | 100% | 1,129 | 1,129 |
| Fife | 153,040 | 63% | 914 | 825 |
| Glasgow City | 276,291 | 100% | 1,489 | 1,489 |
| North Lanarkshire | 134,700 | 100% | 726 | 726 |
| South Lanarkshire | 128,238 | 100% | 691 | 691 |
| Other local authorities - minimum of 500 SRS equivalent interviews after two years | | | | |
| Aberdeen City | 98,859 | 100% | 1,065 | 1,065 |
| Aberdeenshire | 92,067 | 0% | 1,344 | 992 |
| Angus | 47,861 | 63% | 572 | 516 |
| Argyll & Bute | 41,864 | 0% | 678 | 500 |
| Clackmannanshire | 20,876 | 55% | 500 | 500 |
| Dumfries & Galloway | 65,487 | 30% | 866 | 706 |
| Dundee City | 67,032 | 100% | 722 | 722 |
| East Ayrshire | 51,345 | 37% | 662 | 553 |
| East Dunbartonshire | 42,763 | 100% | 500 | 500 |
| East Lothian | 38,757 | 25% | 622 | 500 |
| East Renfrewshire | 35,388 | 100% | 500 | 500 |
| Eilean Siar | 11,360 | 100% | 500 | 500 |
| Falkirk | 63,684 | 100% | 686 | 686 |
| Highland | 92,514 | 22% | 1,255 | 997 |
| Inverclyde | 37,883 | 100% | 500 | 500 |
| Midlothian | 33,229 | 66% | 549 | 500 |
| Moray | 36,515 | 25% | 622 | 500 |
| North Ayrshire | 60,027 | 70% | 702 | 647 |
| Orkney Islands | 8,380 | 100% | 500 | 500 |
| Perth & Kinross | 60,866 | 36% | 788 | 656 |
| Renfrewshire | 75,867 | 100% | 818 | 818 |
| Scottish Borders | 48,790 | 28% | 648 | 526 |
| Shetland Islands | 9,287 | 100% | 500 | 500 |
| South Ayrshire | 50,754 | 69% | 595 | 547 |
| Stirling | 37,321 | 54% | 568 | 500 |
| West Dunbartonshire | 41,112 | 100% | 500 | 500 |
| West Lothian | 65,030 | 70% | 761 | 701 |

| Requirement | Target | Target period | Gross sample | SRS equivalent |
|---|---|---|---|---|
| National | 2,500 | Quarterly | 3,552 | 3,052 |
| Smallest category of urban rural classification | 500 | Annually | 1,301 | 650 |
| 15% most deprived datazones | 2,000 | Annually | 2,101 | 1,881 |
2.2 Allocating sample to different time periods
The consideration of clustering is complicated for the SHS because the sampling requirements are expressed in terms of the equivalent of a simple random sample at different points in time. Consideration needs to be given to the structure of the sample at these time points and the extent of clustering in the sample taken into account. For example, although the sample in Glasgow might be selected without clustering, in practice, the two-year sample is allocated to survey years, 'batched' into interviewer allocations and these are then assigned to months of the year - creating clusters of addresses. Thus, each month, the sample in Glasgow is made up of clusters of addresses, with the sample becoming progressively less clustered throughout the year. In practice, then, the sample in Glasgow is only completely unclustered after a full year. Even if the samples in all local authorities were sampled without clustering, the quarterly samples would be clustered and this needs to be considered in terms of the ability of the design to meet the quarterly target of an SRS equivalent of 2,500 interviews.
The way in which sample is grouped into clusters optimises the extent of unclustered sampling in appropriate areas to coincide with reporting requirements.
- Large local authorities identified as requiring separate reporting of results each year - with the exception of Fife, over two years, the sample is derived from wholly unclustered sampling. All of the unclustered sample is first randomly allocated to years, grouped into the most efficient fieldwork batches and then these batches are allocated to months within each year. In Fife, the clustered sample is treated in the same way as in all other local authorities.
- Other local authorities, which require separate reporting after two years - addresses in datazones classified for unclustered sampling are combined, sorted by deprivation indicator and a systematic sample selected. These addresses are batched, batches allocated to survey years and then to months within each year. Within each local authority, all of the datazones classified for clustered sampling are grouped and a sample of datazones selected with probability proportionate to size. Within the sampled datazones, a systematic sample of addresses is selected. Sampled datazones are randomly allocated to survey years and then to months within each year.
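Selection of datazones with probability proportionate to size can be sketched as below. The text does not say which PPS method was used; this sketch assumes systematic PPS on the cumulative size measure, and the function name and datazone sizes are hypothetical.

```python
import random
from itertools import accumulate


def pps_systematic(sizes, n, rng=random):
    """Return indices of n units selected with probability proportional to size."""
    cumulative = list(accumulate(sizes))
    interval = cumulative[-1] / n
    start = rng.random() * interval
    chosen, idx = [], 0
    for i in range(n):
        point = start + i * interval
        # Advance to the unit whose cumulative range contains the point.
        while idx < len(cumulative) - 1 and cumulative[idx] <= point:
            idx += 1
        chosen.append(idx)
    return chosen


# Hypothetical address counts for ten datazones.
sizes = [300, 450, 200, 600, 350, 500, 250, 400, 550, 400]
chosen = pps_systematic(sizes, 4, rng=random.Random(1))
print(chosen)
```

A unit larger than the sampling interval could be selected more than once under this method; in the example every datazone is smaller than the interval of 1,000, so the four selections are distinct.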
This has implications for how much of the SHS sample is unclustered at any point in time. In each quarter, the whole sample is clustered. Each year, only the sample from four of the five large local authorities is unclustered. After two years, the five large local authorities and the unclustered samples from all other local authorities are unclustered, leaving only the sampled datazones as clustered samples.
After one year approximately 34% of the sample should be from areas of unclustered sampling but after two years this will increase to over 70%.
2.2.1 Allocating sample across the calendar year
As the fieldwork for the survey runs throughout the calendar year, it is important to ensure an even distribution of batches over time and to ensure that the allocation of batches is geographically and demographically representative. There are two main reasons for this. First, an uneven distribution would jeopardise the requirement for the sample to be representative of the national population in each quarter. Second, some of the variables measured by the survey are likely to exhibit seasonal patterns - e.g. rates of economic activity, modes of transport.
The procedure for allocating primary sampling units (PSUs) to months of the year is derived from that developed by the Office for National Statistics (ONS) in managing the Family Expenditure Survey (FES) 1 and differs only in the need for the SHS sample to be spread evenly across 24 rather than 12 months.
Batches of addresses are allocated to survey years and within each year, sorted by local authority and by deprivation within each authority. The list of batches is then labelled with a random permutation of the numbers 1 to 12 representing the twelve months covered by the fieldwork. This permutation is generated with certain properties to avoid 'bunching' of interviews within particular quarters:
- the first four months are from different quarters
- every subsequent month is from the same quarter as the one four places before.
The example given by ONS (and used to allocate the 1996 FES) is as follows:
Table 2-2: Procedure for allocating PSUs by month of fieldwork
| Position in list | Month | Quarter |
|---|---|---|
| 1, 13, 25, etc. | 10 | 4 |
| 2, 14, 26, etc. | 8 | 3 |
| 3, 15, 27, etc. | 5 | 2 |
| 4, 16, 28, etc. | 1 | 1 |
| 5, 17, 29, etc. | 11 | 4 |
| 6, 18, 30, etc. | 7 | 3 |
| 7, 19, 31, etc. | 4 | 2 |
| 8, 20, 32, etc. | 2 | 1 |
| 9, 21, 33, etc. | 12 | 4 |
| 10, 22, 34, etc. | 9 | 3 |
| 11, 23, 35, etc. | 6 | 2 |
| 12, 24, 36, etc. | 3 | 1 |
As this sequence can be added automatically to the sampling procedures for the survey, no time is spent manually assigning batches to particular months.
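A sketch of how such a permutation might be generated and applied is shown below. The function names are hypothetical and the construction is one possible approach, not necessarily the ONS procedure; the two properties being checked and the example sequence are those given in the text and in Table 2-2.

```python
import random


def month_permutation(rng=random):
    """Random ordering of months 1-12 that cycles through the four quarters."""
    quarters = [1, 2, 3, 4]
    rng.shuffle(quarters)
    # Months of quarter q are 3q-2, 3q-1 and 3q, taken in random order.
    months = {q: rng.sample(range(3 * q - 2, 3 * q + 1), 3) for q in quarters}
    return [months[quarters[p % 4]][p // 4] for p in range(12)]


def quarter_of(month):
    return (month - 1) // 3 + 1


def has_required_properties(perm):
    qs = [quarter_of(m) for m in perm]
    return (sorted(perm) == list(range(1, 13))
            and len(set(qs[:4])) == 4
            and all(qs[i] == qs[i - 4] for i in range(4, 12)))


def assign_months(n_batches, perm):
    """Label a sorted list of batches by cycling through the permutation."""
    return [perm[i % 12] for i in range(n_batches)]


ONS_EXAMPLE = [10, 8, 5, 1, 11, 7, 4, 2, 12, 9, 6, 3]  # from Table 2-2
print(has_required_properties(ONS_EXAMPLE))
print(assign_months(14, ONS_EXAMPLE))
```

Because the batch list is sorted by local authority and deprivation before labelling, cycling through the permutation spreads neighbouring batches across different quarters.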
2.3 Allocating sample between contractor organisations and questionnaire modules
Once all of the sampled addresses are batched for fieldwork, the batches are randomly assigned to one of the three contractor organisations in proportion to each contractor's fieldwork commitment (40% of interviews by TNS, 35% by Ipsos MORI and 25% by the Scottish Centre for Social Research).
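The proportional random assignment can be sketched as follows. The shares are those stated above; the function name, the batch identifiers and the shuffle-and-cut method are assumptions made for illustration.

```python
import random

SHARES = {
    "TNS": 0.40,
    "Ipsos MORI": 0.35,
    "Scottish Centre for Social Research": 0.25,
}


def allocate_batches(batch_ids, shares=SHARES, rng=random):
    """Shuffle the batches, then cut the list at the cumulative shares."""
    shuffled = list(batch_ids)
    rng.shuffle(shuffled)
    allocation, start, cumulative = {}, 0, 0.0
    for contractor, share in shares.items():
        cumulative += share
        end = round(cumulative * len(shuffled))
        allocation[contractor] = shuffled[start:end]
        start = end
    return allocation


# Hypothetical set of 200 fieldwork batches.
allocation = allocate_batches(range(200), rng=random.Random(0))
print({name: len(batches) for name, batches in allocation.items()})
```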
The contract for 2007-2010 envisaged a requirement for greater modularisation of the SHS than had previously been the case. Modularisation requires the ability to only ask questions of a random sub-sample of respondents and for those sub-samples to be based on time periods or nationally representative sub-samples. One need identified in the survey specification was to create a module to measure participation in culture and sporting activities.
The allocation of sample to batches and these batches to months means that in theory, each month's sample is a random sub-set of the full sample, meeting any need for time-based modules. To meet the need for sub-sampling over the whole survey period, all sampled addresses were randomly assigned to one of 10 sub-samples or interview streams, which could be used as the basis for assigning samples of respondents to particular blocks of questions. For example, the Culture and Sport module is intended to provide representative data on adults' participation, and this is achieved by assigning the module to streams 1 and 6, meaning that 1 in 5 addresses and (assuming no difference in response rates) 1 in 5 interviews will be directed through those questions.
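The streaming mechanism can be sketched as below. The text says only that addresses were randomly assigned to one of 10 streams; the balanced assignment used here (equal numbers per stream), the function name and the address identifiers are assumptions.

```python
import random

CULTURE_AND_SPORT_STREAMS = {1, 6}  # streams named in the text


def assign_streams(address_ids, n_streams=10, rng=random):
    """Randomly assign each address to a stream numbered 1..n_streams."""
    ids = list(address_ids)
    rng.shuffle(ids)
    return {address: i % n_streams + 1 for i, address in enumerate(ids)}


# Hypothetical sample of 1,000 addresses.
streams = assign_streams(range(1000), rng=random.Random(0))
in_module = [a for a, s in streams.items() if s in CULTURE_AND_SPORT_STREAMS]
print(len(in_module) / len(streams))  # share routed through the module
```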
Other smaller blocks of questions are asked of sub-samples at various points in the questionnaire and the published version of the script indicates where and at what point in time streaming is used.
2.4 Sampling from the Postcode Address File
The Small User File of the Postcode Address File (PAF) is now the standard sampling frame for general population surveys. 2 The principal advantages of the PAF are completeness (it is estimated to miss the addresses of only 2% of the adult population and is updated every three months) and lack of bias (those addresses which are missing from the PAF are unlikely to be concentrated among particular types of people). There are, however, a number of issues arising from its use.
2.4.1 Deadwood
The Small User File of the PAF, which forms the basis of the sample of addresses, contains a number of addresses that are not residential (usually small shops and offices) or which have been demolished or are unoccupied. In addition to PAF addresses that are out of scope for any household survey, there are also addresses that are deemed out of scope for this survey. These are mainly second homes or holiday homes. The extent of 'deadwood' in the PAF varies by area, but is usually estimated at around 10% in national samples. This is accounted for by drawing more addresses than the response rate target alone would suggest. Thus, if the response rate target is 70% and deadwood is estimated to be 10%, then for every 100 interviews to be achieved, around 160 addresses are issued to interviewers rather than the 140 or so suggested by a response rate of 70% alone.
In practice, the number of additional addresses selected to allow for deadwood varies by local authority based on the contractors' experience of SHS fieldwork carried out between 1999 and 2005.
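The arithmetic behind the deadwood allowance can be sketched as follows, assuming that the response rate applies to eligible addresses only (that is, after deadwood has been removed). The function name is hypothetical. On these assumptions the exact figures are 159 and 143, which correspond to the rounded figures of 160 and 140 quoted above.

```python
import math


def addresses_to_issue(target_interviews, response_rate, deadwood_rate=0.0):
    """Addresses needed to achieve a target number of interviews."""
    eligible_needed = target_interviews / response_rate
    return math.ceil(eligible_needed / (1 - deadwood_rate))


with_deadwood = addresses_to_issue(100, 0.70, 0.10)
response_only = addresses_to_issue(100, 0.70)
print(with_deadwood, response_only)
```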
2.4.2 Accuracy and completeness
The sample for the survey is drawn for each two-year fieldwork period and so may exclude households in newly-built housing entering the PAF during the period of the survey. However, data suggest that new housing accounts for only around 1% of the housing stock in any year 3 and the impact of this is reduced by the fact that new properties are often entered onto the PAF some time before they are actually completed.
2.4.3 Exclusions
Samples of the general population exclude prisons, hospitals and military bases. While prisons and hospitals do not generally have significant numbers of private households, the same may not be true of military bases. These are classified as Special EDs in the Census and account for just 0.5% of the population. Interviewing on military bases would pose fieldwork problems relating to access and security so they are removed from the PAF before sampling.
Specific accommodation types - The following types of accommodation are excluded from the survey if they are not listed on the Small User file of the PAF:
- nurses' homes
- student halls of residence
- other communal establishments (e.g. hostels for the homeless and old people's homes)
- mobile homes
- sites for travelling people.
Households in these types of accommodation are included in the survey if they are listed on the Small User file of the PAF and the accommodation represents the sole or main residence of the individuals concerned. People living in bed and breakfast accommodation are similarly included if the accommodation is listed on PAF and represents the sole or main residence of those living there.
Students' term-time addresses are taken as their main residence (in order that they are counted by where they spend most of the year). Since halls of residence are generally excluded, however, there will be some under-representation of students.
2.5 Multiple dwellings
There are potential problems associated with the fact that a single entry on the PAF may actually represent multiple dwellings or that a dwelling may contain multiple households. For example, an address listed as 14 Milton Street may consist of a tenement block containing 8 separate flats. Often, the existence of these additional addresses is indicated in the PAF in a field known as the Multiple Occupancy Indicator (MOI). To ensure that such households have an equal chance of inclusion, it is necessary to weight the address when drawing the sample. Thus 14 Milton Street would appear 8 times. In the address listings issued to interviewers, such addresses appear as '14 Milton Street - 3 of 8' etc., with interviewers given clear counting procedures for identifying the relevant selected dwelling.
Where the MOI is correct, this procedure is unproblematic. Sometimes, however, the MOI is incorrect or missing (in about 2% of cases) and the true number of dwellings at an address is only discovered once the survey is in the field.
Where an interviewer finds that the MOI is different from the actual number of dwellings observed (and there is more than one dwelling) he or she contacts the office where the correct details are used to randomly select one of the dwellings.
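The expansion of the sampling frame by the MOI can be sketched as below. The function name and the second address are hypothetical, and treating a missing MOI as a single dwelling is an assumption made for the sketch; the labelling follows the '3 of 8' convention described above.

```python
def expand_by_moi(frame):
    """List each address once per dwelling indicated by its MOI."""
    expanded = []
    for address, moi in frame:
        dwellings = max(moi or 1, 1)  # missing or zero MOI treated as 1
        for k in range(1, dwellings + 1):
            if dwellings == 1:
                expanded.append(address)
            else:
                expanded.append(f"{address} - {k} of {dwellings}")
    return expanded


# '16 Milton Street' is a hypothetical entry with no MOI recorded.
frame = [("14 Milton Street", 8), ("16 Milton Street", None)]
expanded = expand_by_moi(frame)
print(len(expanded), expanded[2])
```

Sampling from the expanded list gives each dwelling, rather than each PAF entry, an equal chance of selection.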
2.6 Respondent selection
As the survey is intended to collect information both about the structure and characteristics of Scottish households and about the people who occupy those households, the interview has a two-part structure. The respondent for the first part of the interview must be a householder - generally the Highest Income Householder or their spouse or partner 4. For the second part of the interview, one adult (aged 16+) member of the household is selected at random by the CAPI script. If this person is not available at the time, the interviewer calls back to complete the interview at a later date. 5