
EVALUATION OF STATISTICAL TECHNIQUES IN THE SCOTTISH INDEX OF MULTIPLE DEPRIVATION


10. Discussion

The Scottish Index of Multiple Deprivation (SIMD) 2004 is the result of a number of years' research into the measurement of deprivation at small area level, and bears close resemblance to other deprivation indices used throughout the United Kingdom. The statistical methods upon which these deprivation measures are based have not previously been subjected to empirical assessment. The broad aim of this project was to evaluate the statistical methods used in the calculation of the SIMD 2004. Particular areas under scrutiny were the uses of Shrinkage and Factor Analysis (FA), the methods of Exponential transformation and the weights used for the combination of domain scores into the final SIMD, and the estimation of uncertainty in the final SIMD scores and ranks.

We have considered a number of modifications to the original algorithm for calculating the SIMD 2004, applied each of them, and compared the results with those from the original method. A naïve method of uncertainty estimation, based on applying each method to simulated replicate indicator datasets, forms one aspect of these evaluations. Methods have also been compared using correlations and summaries of the resultant SIMD rankings across subgroups of data zones defined by Local Authorities (LAs) and the Scottish Executive (SE) 6-level Urban/Rural Indicator.
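The replicate-based approach to uncertainty can be illustrated with a minimal sketch. It assumes, purely for illustration, that the "algorithm" is a single indicator expressed as a rate and that replicates are drawn from a binomial distribution; the function name `simulate_rank_intervals` and the choice of a 95% interval are hypothetical and are not taken from the project itself.

```python
import numpy as np

def simulate_rank_intervals(counts, denominators, n_reps=1000, seed=0):
    """Approximate 95% rank intervals from simulated replicate datasets."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    denominators = np.asarray(denominators)
    observed_rates = counts / denominators

    # Simulate replicate indicator datasets: one row per replicate.
    # Assumption: counts vary binomially around the observed rate.
    sim_counts = rng.binomial(denominators, observed_rates,
                              size=(n_reps, len(counts)))
    sim_rates = sim_counts / denominators

    # Rank within each replicate; rank 1 = highest rate (most deprived).
    # Ties are broken arbitrarily by position, which is adequate for a sketch.
    ranks = (-sim_rates).argsort(axis=1).argsort(axis=1) + 1

    lower, upper = np.percentile(ranks, [2.5, 97.5], axis=0)
    return lower, upper

# Two small and two large data zones (made-up numbers).
lower, upper = simulate_rank_intervals([2, 400, 3, 250], [10, 1000, 10, 1000])
print(lower, upper)
```

In a fuller version, the ranking step would be replaced by a complete run of the index calculation on each replicate dataset.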

In the original algorithm, shrinkage is applied in the calculation of the Health and Education, Skills & Training domains to guard against extreme data in small areas, by modifying indicator values towards the LA average by an amount inversely dependent upon the precision with which the indicator variable is measured. In general, the level of precision for each indicator is greatest in data zones with large denominators and least in data zones with small denominators; thus the smaller data zones tend to be shrunk by the greatest amounts.
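A minimal sketch of this kind of precision-weighted shrinkage is given below. It assumes the common empirical-Bayes form, in which each data zone's value is a weighted average of its own value and the LA mean, and it estimates the between-zone variance crudely from the observed values. The function name and the variance estimate are illustrative assumptions, not the exact procedure of the SIMD 2004.

```python
import numpy as np

def shrink_towards_la_mean(values, std_errors, la_codes):
    """Shrink each data zone's indicator towards its LA mean (illustrative)."""
    values = np.asarray(values, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)  # assumed strictly positive
    la_codes = np.asarray(la_codes)
    shrunk = np.empty_like(values)

    for la in np.unique(la_codes):
        in_la = la_codes == la
        la_mean = values[in_la].mean()
        # Crude between-zone variance within the LA (an assumption of this sketch).
        between_var = values[in_la].var()
        within_var = std_errors[in_la] ** 2
        # Weight on the zone's own value: close to 1 when it is precisely measured,
        # close to 0 when its standard error is large (small denominator).
        weight = between_var / (between_var + within_var)
        shrunk[in_la] = weight * values[in_la] + (1.0 - weight) * la_mean
    return shrunk

values = [0.1, 0.5, 0.3, 0.3]
std_errors = [0.01, 0.2, 0.05, 0.05]   # the second zone is imprecisely measured
print(shrink_towards_la_mean(values, std_errors, ["A", "A", "B", "B"]))
```

In this example the imprecisely measured second zone moves halfway towards its LA mean, whereas the precisely measured first zone barely moves.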

However, shrinkage involves a trade-off between variance and bias. A concern with shrinkage is that indicator estimates are most biased in small data zones; if a small but deprived data zone is present within an otherwise less deprived LA, this bias could result in the data zone being incorrectly ranked by deprivation. The use of LA as the higher-level shrinkage unit, when LAs are the areas at which the resultant indices are reported and at which decisions based on the SIMD 2004 are made, might therefore not be optimal.

To avoid the potential "overshrinkage" of isolated pockets of deprivation in otherwise less deprived LAs, a single (national) higher-level unit can be used, resulting in a small redistribution of the most deprived data zones from large urban areas to more rural areas. However, the results are similar when no shrinkage is used. This leads to the conclusion that the value of shrinkage as a guard against extreme values in small data zones is negated by the fact that so much data is used to construct these indices: the possibility of any one indicator severely affecting the results according to whether or not shrinkage is applied is very small. Also, the subsequent use of FA on those indicator variables that undergo shrinkage will tend to have similar effects to shrinkage, so that the current algorithm effectively shrinks the data twice. It is worth noting that the two domains in which shrinkage is employed contribute only 33% of the weighting towards the construction of the final SIMD score.

Not using shrinkage gives unbiased estimates of indicator variables in each data zone at the expense of less than 5% additional variation. If, as would be desirable, reliable estimates of SIMD score and rank uncertainty were available, small data zones would automatically be recognised as being ranked with less precision than larger data zones.

In the Health domain, several indicators are based on data covering four years. This will reduce variability in these indicators, and is an approach that could be used more widely (e.g. within the Education, Skills & Training domain). However, this would also reduce the responsiveness of the final indices to changes in deprivation levels over time. Some consistency in the decisions about which variables are to be temporally smoothed by taking averages over several years would be desirable.

FA is currently used to combine indicators in three domains, namely Health, Education, Skills & Training and Geographic Access & Telecommunications, for which it is not possible to define the domain score as a simple sum of indicator variables. FA assumes the existence of a single latent variable, to which the ranked and Normally transformed indicators are linearly related in expectation. None of the evaluated methods, using alternative transformations prior to either FA or Principal Components Analysis, was shown to offer any clear benefits.

A more computationally complex method of Generalised Factor Analysis was explored, which retains the conceptual benefit of FA by assuming a single latent factor to which all indicators in a domain are linearly related, but models the distribution of each indicator directly, thereby preserving the degree of separation between data zones with respect to each indicator variable.

The methods used to combine domain scores into the final SIMD are designed to create an index of multiple deprivation, avoiding "cancelling out" should a data zone demonstrate opposing levels of deprivation on different domains. However, the combined effect of ranking followed by transformation to an exponential distribution differs according to the distribution of the underlying domain score. Those domains interpretable as a rate (the income, employment and, to some extent, housing domains) are largely unaffected by these transformations. Those produced by methods including Factor Analysis are, by design, approximately Normal, and an alternative transformation is possible, incorporating the standard Normal cumulative distribution function.
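The rank-then-exponential step can be sketched as follows. The formula and the constant 23 are those documented for the English Indices of Deprivation; that the SIMD 2004 uses exactly the same form is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def exponential_transform(domain_scores, k=23.0):
    """Rank domain scores, then map the scaled ranks onto a 0-100 exponential scale."""
    scores = np.asarray(domain_scores, dtype=float)
    n = len(scores)
    # Ranks 1..n with the highest score (most deprived) ranked n.
    # Ties are broken arbitrarily by position in this sketch.
    ranks = scores.argsort().argsort() + 1
    scaled = ranks / n  # in (0, 1], 1 = most deprived
    # Assumed form: X = -k * ln(1 - R * (1 - exp(-100 / k)))
    return -k * np.log(1.0 - scaled * (1.0 - np.exp(-100.0 / k)))

scores = np.array([0.02, 0.10, 0.35, 0.07, 0.21])
print(exponential_transform(scores))
```

Because only the ranks enter the formula, the transformed values are identical for any set of domain scores with the same ordering, which is why the effect of the transformation depends on the shape of the original distribution.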

Using simple standardisation of each domain score (by subtracting the mean value and dividing by the standard deviation) results in only minor changes to the final SIMD rankings, with large urban areas having slightly fewer highly deprived data zones as a result. The intermediate method, transforming the health, education and access domains in a way that mimics the original method but without ranking data zones, followed by simple standardisation, is even more closely correlated with the original algorithm.

Another feature of the data is the lack of effect of changing the weights used to combine domains to form the SIMD, brought about by the positive correlations between many of the domain scores. Whilst the weights currently used are therefore adequate, greater transparency could be achieved by explicitly separating the processes by which the weights are chosen, to reflect the prevalence and severity of each aspect of deprivation.
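The insensitivity to weights when domains are positively correlated can be illustrated with simulated data. In the sketch below, the domain scores, both sets of weights and all function names are hypothetical; the domains share a common component so that they are positively correlated, and each is standardised by subtracting its mean and dividing by its standard deviation before being combined.

```python
import numpy as np

def standardise(domain_scores):
    """Subtract each domain's mean and divide by its standard deviation."""
    scores = np.asarray(domain_scores, dtype=float)
    return (scores - scores.mean(axis=0)) / scores.std(axis=0)

def combine(domain_scores, weights):
    """Weighted sum of standardised domain scores (weights rescaled to sum to 1)."""
    weights = np.asarray(weights, dtype=float)
    return standardise(domain_scores) @ (weights / weights.sum())

def rank_correlation(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Simulated, positively correlated domain scores for 500 hypothetical data zones.
rng = np.random.default_rng(1)
common = rng.normal(size=(500, 1))
domains = common + 0.5 * rng.normal(size=(500, 4))

index_a = combine(domains, [4, 3, 2, 1])   # illustrative unequal weights
index_b = combine(domains, [1, 1, 1, 1])   # equal weights
print(rank_correlation(index_a, index_b))
```

With domains this strongly correlated, the two sets of rankings agree closely; the agreement would weaken if the domains were less correlated.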

One area in which we had limited success was in the application of methods required for the estimation of uncertainty in the final SIMD ranks. The most direct method to achieve rank uncertainty estimates would be to incorporate the entire SIMD 2004 algorithm within a Markov Chain Monte Carlo (MCMC) estimation procedure. However, the methodology would be prohibitively complex and time-consuming to fit, owing in particular to the large number of individual deprivation indicators that undergo shrinkage and to the repeated ranking of data zones within the current algorithm. Alternative methods for rank uncertainty estimation could be based on simulation from residual distributions based on random effects models used in the shrinkage stages of the algorithm. However, this would introduce uncertainty based on one aspect only of the algorithm, and would not recognise uncertainty in rankings associated with those domains that do not include a shrinkage component.

Furthermore, some difficulties were found in the application of random effects models for multivariate shrinkage, for which too many covariance parameters had to be estimated to fit models with random effects at both the LA and data zone levels.

Nevertheless, we found greater success in the use of MCMC methods for the application of Generalised Factor Analysis. It is therefore feasible that if shrinkage were not to be used, the entire algorithm for the production of deprivation domains and SIMD scores and ranks could be incorporated into this framework. The uncertainty in the final SIMD ranks could then be extracted as a by-product of the model fitting procedure.

What is more, by applying (Generalised) FA to raw (i.e. unshrunk) indicator variables, an element of shrinkage is being carried out, in the sense that extreme values on individual indicators, particularly when associated with small denominators and therefore large within-data zone variance, will tend to result in factor estimates that are less extreme, both as a result of the estimation process, and from their combination with other indicator variables. It might also be argued that a unified approach could be adopted, in which all domains undergo FA of some kind; for "single indicator" domains, such as the Current Income domain, this would in effect involve shrinkage of the observed data towards a single national average value.

The Generalised Factor Analysis procedure as applied in this project was not without difficulties, and additional investigation would be required to solve a number of problems. For example, a natural distribution for the CMF, CIF and Adults without qualifications indicators was not immediately apparent. However, if these indicators were to be age-sex standardised as observed:expected ratios, such as SMR statistics, a Poisson distribution may be appropriate.
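An observed:expected ratio of the kind mentioned above can be sketched using indirect standardisation, in which national stratum-specific rates are applied to each data zone's own age-sex population to obtain an expected count. The function name, the two-stratum layout and all numbers below are illustrative assumptions.

```python
import numpy as np

def observed_expected_ratio(observed, population_by_stratum, national_rates):
    """Indirectly standardised observed:expected ratio for each data zone."""
    observed = np.asarray(observed, dtype=float)
    population = np.asarray(population_by_stratum, dtype=float)  # zones x strata
    rates = np.asarray(national_rates, dtype=float)              # one rate per stratum
    # Expected count: national stratum rates applied to the local population.
    expected = population @ rates
    return observed / expected, expected

# Two data zones and two age-sex strata (made-up numbers).
ratio, expected = observed_expected_ratio(
    observed=[6, 4],
    population_by_stratum=[[100, 50], [200, 0]],
    national_rates=[0.01, 0.1],
)
print(ratio, expected)
```

Under this standardisation, a natural model would treat each observed count as Poisson with mean equal to the expected count multiplied by the data zone's underlying ratio.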

Though not within the original project remit, we have compared the use of SMR-type standardisation of these three indicators to the original algorithm, with the results shown in Appendix E. Ten fewer data zones in large urban areas are determined to lie within the 15% most deprived nationally, with these data zones being distributed amongst the other types of area (Table E.1). The pattern of redistribution with respect to LAs is predictable, with Glasgow City losing the greatest number of highly deprived data zones (Table E.2). Similar patterns are observed using probability-weighted numbers of data zones (Table E.3 and Table E.4), as determined by application of the method to the simulated indicator datasets. The variability of SIMD ranks is similar to the original algorithm (Figure E.5), with less than 2% additional variation over most of the deprivation distribution, except at the least deprived end.
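One plausible reading of "probability-weighted numbers of data zones" is sketched below: for each data zone, the proportion of simulated replicates in which it falls within the most deprived 15% is treated as a probability, and these probabilities are summed within each group of areas. This interpretation, the function name and the example data are assumptions, not details confirmed by the report.

```python
import numpy as np

def probability_weighted_counts(replicate_ranks, groups, fraction=0.15):
    """Sum, by group, each zone's probability of being in the most deprived fraction."""
    ranks = np.asarray(replicate_ranks)  # replicates x zones, rank 1 = most deprived
    groups = np.asarray(groups)
    cutoff = int(round(fraction * ranks.shape[1]))
    # Proportion of replicates in which each zone is within the cutoff.
    prob_in_top = (ranks <= cutoff).mean(axis=0)
    return {g.item(): prob_in_top[groups == g].sum() for g in np.unique(groups)}

# Twenty hypothetical zones and two replicates; zones 0 and 3 swap ranks 1 and 4.
rep1 = np.arange(1, 21)
rep2 = rep1.copy()
rep2[0], rep2[3] = 4, 1
groups = ["urban", "urban"] + ["rural"] * 18
print(probability_weighted_counts([rep1, rep2], groups))
```

The probability-weighted counts sum to the number of data zones within the cutoff, so they can be compared directly with simple counts of the same kind.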


Page updated: Tuesday, October 18, 2005