2. Replicating the Original SIMD 2004
The published SIMD 2004 data consist of the ranks of each data zone on each domain and on the SIMD6. Also published are the domain and SIMD scores, though for some indicators the values for a few data zones are suppressed, since they represent very small numbers of individuals and could therefore be viewed as partly identifiable data. The first step in evaluating the SIMD 2004 was to verify that the published rankings are the true result of applying the reported algorithm to the raw deprivation indicators.
2.1. Structure of the SIMD
The 6 domain scores1, each measuring a different aspect of deprivation, are: Current Income; Employment; Health; Education, Skills and Training; Geographic Access and Telecommunications; and Housing. The SIMD 2004 is formed by combining the 6 domain scores, to give a single measure of multiple deprivation.
2.1.1. Current Income Domain
The income domain is the percentage of the population living in households in receipt of means tested benefits. The number of income deprived people in a data zone is defined to be the sum of the numbers of individuals receiving one of the following eight benefits:
- Adults in Income Support households (April 2002)
- Children in Income Support households (April 2002)
- Adults in Income Based Job Seekers Allowance households (August 2001)
- Children in Income Based Job Seekers Allowance households (August 2001)
- Adults in Working Families Tax Credit Households below a low income threshold (April 2002)
- Children in Working Families Tax Credit Households below a low income threshold (April 2002)
- Adults in Disability Tax Credit households below a low income threshold (April 2002)
- Children in Disability Tax Credit households below a low income threshold (April 2002)
This is then expressed as a percentage of the total data zone population.
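The calculation described above can be sketched as follows. This is a minimal illustration only: the function name, the input format and the example counts are hypothetical, and the published figures were not produced with this code.

```python
def income_domain_score(benefit_counts, total_population):
    """Percentage of a data zone's population counted as income deprived.

    benefit_counts: number of individuals in each of the eight benefit
    categories listed above (hypothetical input format).
    total_population: total data zone population.
    """
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    return 100.0 * sum(benefit_counts) / total_population


# Hypothetical data zone: one count per benefit category, in list order.
counts = [60, 35, 12, 8, 20, 25, 3, 2]
score = income_domain_score(counts, total_population=800)
print(f"Income domain score: {score:.1f}%")
```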
2.1.2. Employment Domain
Similar to the income domain, the employment domain is the percentage of the working age population that want to work but are excluded due to unemployment, ill health or disability. The number of employment deprived people in a data zone is defined to be the sum of the numbers of individuals falling into one of the following four categories:
- Unemployment Claimant Count averaged over 12 months of those men aged under 65 and women aged under 60 (2002)
- Incapacity Benefit recipients, men aged under 65 and women aged under 60 (April 2002)
- Severe Disablement Allowance recipients, men aged under 65 and women aged under 60 (April 2002)
- Compulsory New Deal participants - New Deal for the under 25s and New Deal for the 25+ not included in the unemployment claimant count (April 2002)
This is then expressed as a percentage of the total working age population (men up to age 65 years, women up to age 60 years) in each data zone.
2.1.3. Health Domain
The health domain is made up of seven indicators. Five of these can be expressed in relatively simple terms:
- Hospital episodes relating to alcohol use
- the number of alcohol-related discharges in each data zone during a four-year period from 1998-2002, expressed as a percentage of the total data zone population (estimated as 4 times the 2001 data zone population)
- Hospital episodes relating to drug use
- the number of drug-related discharges in each data zone during a four-year period from 1998-2002, expressed as a percentage of the total data zone population (estimated as 4 times the 2001 data zone population)
- Emergency admissions to hospital
- the number of emergency admissions in each data zone during a four-year period from 1998-2002, expressed as a percentage of the total data zone population (estimated as 4 times the 2001 data zone population)
- Proportion of the population being prescribed drugs for anxiety, depression or psychosis
- the sum of the estimated numbers of individuals being prescribed each class of drug in 2002, expressed as a percentage of the total data zone population registered with a GP
- Proportion of live singleton births of low birth weight (<2,500g)
- the number of live singleton births of low birthweight in each data zone over a four-year period between 1998 and 2002, expressed as a percentage of the total number of live singleton births over the same period
To improve the quality of these data when measured at small area level, a shrinkage technique is applied*.
* In short, the value of each indicator in each data zone is replaced by a combination of the indicator value for that data zone and the average indicator value for the local authority in which the data zone lies. The combination of data zone and local authority indicator values is weighted inversely to the within- and between-data zone variances. In general, large data zones, with more stable data (smaller within-data zone variance) are given greater weight than small data zones with less stable data. Therefore, the raw indicator value in each data zone moves towards the local authority average; the degree of movement is greater in smaller data zones with less stable indicator values.
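One common form of this kind of shrinkage estimator is sketched below. The report does not give the variance formulas at this point, so the choices here are assumptions: the within-data zone variance is taken as the binomial sampling variance of the rate, and the between-data zone variance as the variance of the raw rates across the local authority. The function name is hypothetical.

```python
from statistics import pvariance


def shrink_rates(events, populations):
    """Shrink data zone rates towards their local authority (LA) rate.

    events[j] / populations[j] is the raw rate in data zone j; all zones
    passed in are assumed to lie in the same LA. The variance formulas
    are illustrative assumptions, not the documented SIMD method.
    """
    rates = [x / n for x, n in zip(events, populations)]
    la_rate = sum(events) / sum(populations)
    # Between-data zone variance: spread of the raw rates across the LA.
    between = pvariance(rates)
    shrunk = []
    for r, n in zip(rates, populations):
        # Within-data zone variance: binomial sampling variance of the rate.
        within = r * (1.0 - r) / n
        total = within + between
        # Weight on the zone's own rate, i.e. inverse-variance weighting:
        # (1/within) / (1/within + 1/between). A weight of 1 means no shrinkage.
        w = between / total if total > 0 else 1.0
        shrunk.append(w * r + (1.0 - w) * la_rate)
    return shrunk


# Hypothetical LA with three data zones; the smallest has the most extreme rate.
events = [30, 50, 12]
populations = [1000, 1000, 100]
raw = [x / n for x, n in zip(events, populations)]
for r, s in zip(raw, shrink_rates(events, populations)):
    print(f"raw={r:.4f}  shrunk={s:.4f}")
```

In this example the zone with a population of 100 moves furthest towards the LA rate, because its raw rate is the least stable.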
The remaining two health domain indicators are directly-standardised measures of the level of ill-health in each data zone:
- Comparative Mortality Factor (CMF)
- formed from the numbers of male and female deaths occurring in each data zone in 5-year age bands (0-4, 5-9, …, 85-89 and ≥90 years) during a four-year period from 1998-2002, expressed as a percentage of the data zone sex- and age-group population (estimated as 4 times the 2001 population)
- Comparative Illness Factor (CIF)
- formed from the average of the numbers of men and women in each data zone in 5-year age bands (0-4, 5-9, …, 80-84 and ≥85 years) reporting poor general health at the 2001 Census and the numbers reporting a longstanding illness, expressed as a percentage of the data zone sex- and age-group population
The CMF and CIF indicators are age- and sex-standardised by the method used for the SIMD 2003* and represent the level of mortality and ill-health relative to the national average; the value of 100 corresponds to an average data zone, values greater than 100 imply higher than average levels of mortality or ill-health and therefore higher levels of deprivation.
* For each age- and sex-group, the observed indicator values are shrunk towards the local authority average level. The shrunken rates are applied to the national age-sex distribution and summed, then divided by the total number of deaths nationally and scaled by multiplication by 100. Data zones with above average mortality rates or rates of ill-health therefore have CMF or CIF values greater than 100.
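The direct standardisation step can be illustrated with a short sketch. The function name, the three age-sex groups and all figures are hypothetical; the shrunken rates are taken as given inputs.

```python
def comparative_factor(zone_rates, national_population, national_events):
    """Directly standardised factor, scaled so the national average is 100.

    zone_rates: (shrunken) rate in each age-sex group for one data zone.
    national_population: national population in the same age-sex groups.
    national_events: national number of events (e.g. deaths) in each group.
    """
    # Events expected nationally if the zone's rates applied everywhere.
    expected = sum(r * p for r, p in zip(zone_rates, national_population))
    return 100.0 * expected / sum(national_events)


# Hypothetical national figures for three age-sex groups.
national_population = [1_000_000, 2_000_000, 500_000]
national_events = [1_000, 10_000, 25_000]
national_rates = [e / p for e, p in zip(national_events, national_population)]

# A zone with exactly the national rates, and one with rates 20% higher.
average_zone = comparative_factor(national_rates, national_population, national_events)
high_zone = comparative_factor(
    [1.2 * r for r in national_rates], national_population, national_events
)
print(round(average_zone, 1), round(high_zone, 1))
```

A zone whose rates match the national rates in every group scores 100, and a zone with uniformly higher rates scores proportionally more.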
These seven indicators are transformed by ranking and converting to standard Normal distributions, and then combined using maximum likelihood factor analysis, assuming a single latent factor; the weights derived from factor analysis are scaled to have unit sum, and the resulting factor score is defined to be the health domain score.
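The ranking, Normal transformation and weighted combination can be sketched as below. Several details are assumptions: the plotting position (rank - 0.5) / n, the simple tie handling, and the weights, which here are hypothetical inputs standing in for weights that would come from a factor analysis. The sketch does not carry out the factor analysis itself.

```python
from statistics import NormalDist


def normal_scores(values):
    """Rank the values and map the ranks to standard Normal quantiles.

    Uses (rank - 0.5) / n as the plotting position (an assumption); ties
    are broken by position rather than averaged, for brevity.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return scores


def domain_score(indicators, weights):
    """Weighted sum of Normal-scored indicators for each data zone.

    indicators: one list of values per indicator, all of equal length.
    weights: one weight per indicator; rescaled here to sum to 1.
    """
    total = sum(weights)
    scaled = [w / total for w in weights]
    columns = [normal_scores(col) for col in indicators]
    n = len(columns[0])
    return [sum(w * col[i] for w, col in zip(scaled, columns)) for i in range(n)]


# Two hypothetical indicators measured on four data zones.
indicators = [[2.1, 0.4, 1.3, 3.0], [10.0, 12.0, 11.0, 19.0]]
weights = [0.9, 0.6]  # hypothetical stand-ins for factor analysis weights
print(domain_score(indicators, weights))
```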
2.1.4. Education, Skills and Training Domain
The education domain is made up of five indicators:
- Pupil Performance at SQA Stage 4
- the average SQA score of pupils in each data zone in 2001-02, with the denominator including an estimate of the number of eligible pupils not taking any exams
- 16-18 Year-Olds not in Full-Time Education
- estimated from the difference between the 15-17 year-old data zone population in 2001 and the number of 16-18 year-old child benefit claimants (and therefore in full-time education) in 2002, expressed as a percentage of the 15-17 year-old data zone population in 2001
- 17-19 Year-Olds not Successfully Applied to Higher Education (2000-02)
- estimated from the difference between the 16-18 year-old data zone population in 2001 and the number of 17-19 year-olds who have successfully applied to higher education in 2002, expressed as a percentage of the 16-18 year-old data zone population in 2001
- Working Age Adults with no Qualifications
- formed from the numbers of working age men (25-64 years) and women (25-59 years) in 5-year age bands with no qualifications as determined by the 2001 Census, expressed as a percentage of the data zone sex- and age-group population
- Secondary Level Absences
- the estimated number of half-day absences during 2001-02, expressed as a percentage of the estimated total number of potential half-day absences during the same period
The Working Age Adults with no Qualifications indicator is age- and sex-standardised using the same method as the CMF and CIF indicators in the health domain. Pupil Performance at SQA Stage 4 is shrunk by an analogous method to that used for indicators expressed as rates, using the within- and between-data zone variance in SQA scores; since low average SQA scores reflect deprivation, the negative of the shrunken indicator is used. Other indicator variables are expressed as rates and are shrunk using the standard method.
Shrunken indicators are ranked and transformed to standard Normal distributions, then combined with factor analysis to create the education domain score.
2.1.5. Geographic Access and Telecommunications Domain
The access domain is made up of five indicators:
- Average drive time to GP
- Average drive time to supermarket
- Average drive time to petrol station
- Average drive time to primary school
- Average drive time to post office
No shrinkage is applied; the indicators are ranked and transformed to standard Normal distributions, then combined with factor analysis to create the access domain score.
2.1.6. Housing Domain
The housing domain is made up of two indicators, derived from the 2001 Census:
- Persons in households which are overcrowded
- Persons in households without central heating
Each is expressed as a percentage of the total data zone population. The housing domain is defined as the sum of the two indicators.
2.2. Programming Language
The original programs produced by the OCS for the calculation of the SIMD 2004, based on the techniques used by the SDRC for previous multiple deprivation indices, were written using the statistical programming language SAS7. For this project, it was decided that programs would be written using the statistical programming language SPlus 8. One benefit of using a different programming language is that the process of replicating the algorithms that produced the SIMD 2004, rather than merely copying those used previously, would act as a validation of the original programs. A second benefit is that the SPlus language is function-based, so that each element of the algorithm is incorporated into the program as a function, or subroutine; changing one part of the algorithm (e.g. using a different method of shrinkage) is simply a matter of changing one function within the program (e.g. the shrinkage function). SAS is more efficient for working with large datasets; however, it was felt that the advantages of SPlus outweighed those of SAS for this project.
2.3. Differences Between Original And Replicated SIMD 2004
The original domain scores and SIMD were successfully replicated. Figure 2.1 shows the replicated and original (published) SIMD 2004 scores to be virtually indistinguishable. However, the replicated and original SIMD 2004 scores, and more importantly, the data zone ranks, are not identical.
Table 2.1 summarises each domain score and the SIMD as originally produced as well as the differences between the original and replicated scores and ranks in each data zone. In general, the absolute differences between the original and replicated scores are small compared to the scale of each score, though these small absolute differences result in some data zones having different ranks under the replicated SIMD. The exceptions to this are the income and employment domains, where the absolute differences between the replicated and original scores are approximately ±0.05%, though the rankings under the two sets of scores are identical. This is a result of the original income and employment domain scores being reported to the nearest 0.1%, though with the ranks calculated from the exact scores.
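The effect of reporting scores to the nearest 0.1% can be shown with a small example. The exact scores used here are hypothetical.

```python
# Hypothetical exact domain scores (%) for four data zones.
exact = [14.93, 14.91, 11.58, 20.91]

# Scores as they would be published, rounded to the nearest 0.1%.
published = [round(s, 1) for s in exact]  # [14.9, 14.9, 11.6, 20.9]

# Differences between exact and published scores lie within +/-0.05.
differences = [e - p for e, p in zip(exact, published)]
max_gap = max(abs(d) for d in differences)

print(published)
print(f"largest absolute difference: {max_gap:.2f}")
```

The first two zones share a published score of 14.9 but have distinct exact scores, so ranks calculated from the exact scores still separate them.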

Figure 2.1. Original and replicated SIMD 2004 scores.
| | Published data: scores | Differences between replicated and published data: scores | Differences between replicated and published data: ranks |
|---|---|---|---|
| Income Domain | 14.9 (12.1); 11.6 (5.5, 20.9); [0.0, 80.5] | 0.00 (0.03); 0.00 (-0.02, 0.03); [-0.05, 0.05] | 0.0 (0.0); 0.0 (0.0, 0.0); [0.0, 0.0] |
| Employment Domain | 14.0 (9.4); 11.6 (6.6, 19.3); [0.0, 64.7] | 0.00 (0.03); 0.00 (-0.03, 0.02); [-0.05, 0.05] | 0.0 (0.0); 0.0 (0.0, 0.0); [0.0, 0.0] |
| Health Domain | 0.00 (0.79); -0.02 (-0.58, 0.54); [-2.37, 2.88] | 0.0000 (0.0014); 0.0000 (-0.0009, 0.0009); [-0.0056, 0.0051] | 0.0 (3.9); 0.0 (-2.0, 2.0); [-17.0, 16.0] |
| Education Domain | 0.00 (0.87); -0.01 (-0.62, 0.64); [-2.67, 2.75] | 0.0000 (0.0002); 0.0000 (-0.0001, 0.0001); [-0.0033, 0.0032] | 0.0 (0.6); 0.0 (0.0, 0.0); [-8.0, 5.0] |
| Access Domain | 0.00 (0.80); -0.05 (-0.51, 0.43); [-2.75, 3.07] | 0.0000 (0.0000); 0.0000 (0.0000, 0.0000); [0.0000, 0.0007] | 0.0 (0.2); 0.0 (0.0, 0.0); [-2.0, 2.0] |
| Housing Domain | 19.6 (14.8); 15.8 (10.0, 24.6); [0.0, 113.4] | 0.0 (0.0); 0.0 (0.0, 0.0); [0.0, 0.0] | 0.0 (0.2); 0.0 (0.0, 0.0); [-3.0, 3.5] |
| SIMD | 21.7 (16.6); 16.9 (9.1, 29.5); [0.5, 87.6] | 0.0 (0.0); 0.0 (0.0, 0.0); [-0.1, 0.1] | 0.0 (1.1); 0.0 (0.0, 0.0); [-7.0, 7.0] |

Table 2.1. Summaries of published domain and SIMD 2004 scores, and of differences in scores and data zone rankings between replicated and published data. Each cell gives the mean (SD); median (IQR); [range].
The differences between the original and replicated domain scores and SIMD are too small to be due to coding errors in the programs that replicate the SIMD algorithms. They are most likely to be due to a combination of factors, including rounding error and slight differences between SAS and SPlus in their internal calculations (e.g. in the Factor Analysis algorithms, or the conversions to Normal or Exponential distributions). The rounding error explanation is supported by the observation that the greatest disparities between the two sets of scores lie in the health domain, where shrinkage is carried out on a total of 79 variables once those shrunk as part of the calculations for the CMF and CIF indicators are included.
Nevertheless, the fact that these minor differences, brought about by the use of an alternative programming language, result in changes in ranking between data zones indicates the inherent sensitivity of the rankings. The differences observed during the replication process are between data zones with very similar levels of deprivation, which cannot be separated with any degree of confidence. Allowing only for this one source of uncertainty, a handful of data zones may be classified as lying within the most deprived 15% of areas (the threshold used for the allocation of LA funds) or not, depending on the programming language used.
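The sensitivity of the 15% classification to very small score differences can be illustrated as follows. The scores are hypothetical, and the sketch assumes that a higher score means greater deprivation and that the cut-off count is rounded down; neither detail is specified in this section.

```python
def most_deprived(scores, share_percent=15):
    """Indices of the data zones in the most deprived share of areas.

    Assumes a higher score means greater deprivation and rounds the
    number of qualifying zones down (both are illustrative assumptions).
    """
    k = len(scores) * share_percent // 100
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


# 20 hypothetical data zones; zones 2 and 3 straddle the 15% cut-off.
original = [50.0, 45.0, 40.02, 40.01] + [30.0 - i for i in range(16)]

# The same scores with one small difference of the size seen in replication.
replicated = list(original)
replicated[2] -= 0.02

print(sorted(most_deprived(original)))
print(sorted(most_deprived(replicated)))
```

Here a change of 0.02 in a single score moves zone 2 out of the most deprived 15% and zone 3 into it, although the two zones have almost identical levels of deprivation.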
Extending this argument, if we accept that the observed raw indicator data are themselves observations of random variables, reflecting an underlying level of deprivation on a particular domain with some imprecision, then it becomes apparent that there is uncertainty throughout the SIMD algorithm. Consequently, the observed rankings of each data zone on any one domain or on the SIMD cannot be viewed as 'truths', but as estimates of the data zone's ranking on the chosen score. It is therefore appropriate that attempts should be made to estimate the degree of uncertainty in these estimates.
Although it was possible to replicate the original SIMD algorithm, an error was noted in the original calculations. The formula used for the calculation of the within-data zone variance of SQA scores was incorrect. OCS were alerted to this error as soon as it became apparent and are investigating the implications independently of this report.
Correcting this programming error resulted in minor differences from the original education domain and SIMD scores and rankings. The correlations between the SIMD scores and ranks produced with or without this error are 98.9% and 99.6% respectively. As a result of correcting the error, 2 data zones are reclassified as being among the most deprived 15% nationally, while another 2 data zones no longer achieve this classification.
For the remainder of this report, the corrected domain and SIMD scores and ranks, as produced by SPlus, will be referred to as the "Original" values, for the purposes of comparison with values produced under modifications to the original algorithm.