
EVALUATION OF STATISTICAL TECHNIQUES IN THE SCOTTISH INDEX OF MULTIPLE DEPRIVATION


4. Uncertainty

The uncertainty in the final SIMD rankings, and in summaries of these rankings over subgroups of data zones, was chosen as the basis for evaluating many of the modifications to the original algorithm. To allow an equitable comparison of alternative methods, simulated indicator datasets were generated to mimic the random variation in the indicators, and each modified algorithm was applied to every simulated dataset. For each simulated dataset, each data zone receives a ranking under the associated SIMD score. Over all simulated indicator datasets, there is therefore a distribution of rankings for every data zone.

Since these simulated indicator datasets treat each indicator independently and make no allowance for correlation between indicators, they are likely to overstate the total variance amongst the indicators and therefore overestimate the degree of uncertainty in the final rankings. Alternative methods of uncertainty estimation proposed in the literature [10] are based on random effects models: either estimating the posterior distribution of ranks directly from an MCMC fit, or simulating from the estimated residual distributions of a multilevel model fit. However, the uncertainty estimates produced could not be compared directly between alternative models, since each would be based on different underlying assumptions.

4.1. Simulated Indicators

4.1.1. Number of Simulations

The original algorithm took approximately 30 seconds to calculate the SIMD ranks of each data zone, with the computing facilities available. Given the number of alternative methods to compare, and the timescale of the project, it was decided to generate 1,000 simulated indicator datasets; this is an approximate minimum number of simulations required to obtain reasonable estimates of extreme percentiles of the resultant distributions. Thus it would take just over 8 hours to run the original algorithm on these simulated data. Simplifications of the method would take less time to run, but for more complex methods, even this minimal set of simulations may become prohibitively time-consuming to analyse.

4.1.2. Sampling Distributions

From a naïve point of view, each simulated indicator variable in each data zone should be simulated from a distribution with an expected value equal to the observed value of the indicator in that data zone. However, the majority of indicator variables are Binomial, and a large number of observed event counts are zero, due to small denominators and/or rare events, particularly for those raw indicator variables that are used to construct age-sex standardised indicators. Simulating from a Binomial distribution with an observed proportion of zero would return zero events in every simulated dataset.

Consequently, for a single Binomial indicator variable in a single data zone, where the observed number of events is r, and the denominator is n, the simulated datasets simulate that indicator from the distribution

$$r_s \sim \mathrm{Binomial}\left(n,\ \frac{r+p}{n+1}\right)$$

where p is the observed proportion for the whole population*.

* The mean value chosen for the simulation of r_s, (r+p)/(n+1), is the posterior mean value based on an application of Bayes' Theorem, with a prior distribution for the proportion of events in a data zone taken as Beta(p, 1-p), i.e. a Beta distribution with mean equal to the national proportion of events.
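
The simulation of a single Binomial indicator can be sketched as follows. This is a minimal illustration, not the code used in the evaluation: it assumes that simulated counts are drawn from a Binomial distribution with n trials and success probability equal to the posterior mean (r+p)/(n+1) given in the footnote. The function names and the example figures are hypothetical.

```python
import random


def posterior_mean_proportion(r, n, p):
    """Posterior mean (r + p) / (n + 1) under a Beta(p, 1 - p) prior."""
    return (r + p) / (n + 1)


def simulate_counts(r, n, p, n_sims=1000, seed=0):
    """Draw n_sims simulated event counts for one data zone.

    Assumption: counts are Binomial(n, (r + p) / (n + 1)).
    """
    rng = random.Random(seed)
    prob = posterior_mean_proportion(r, n, p)
    # A Binomial(n, prob) draw is the number of successes in n Bernoulli trials.
    return [sum(rng.random() < prob for _ in range(n)) for _ in range(n_sims)]


# Hypothetical data zone: zero observed events among 40 people,
# with a national proportion of 5%.
draws = simulate_counts(r=0, n=40, p=0.05)
print(posterior_mean_proportion(0, 40, 0.05))  # small, but not zero
print(sum(draws) / len(draws))                 # mean simulated count
```

Because the success probability is pulled slightly towards the national proportion, a data zone with zero observed events can still produce non-zero simulated counts.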

The majority of indicator variables can be simulated in this way. Exceptions to this are:

  • Pupil Performance at SQA Stage 4
    • the indicator variable that contributes to the education domain is the mean SQA score in each data zone, but for the purposes of shrinkage, the variance of the mean SQA score is used, and is also treated as an indicator variable that must be simulated. Writing µ and d² for this mean and variance, simulation from N(µ, d²) often produces negative simulated mean values. Consequently, simulation was performed using

scientific formula

to simulate mean values, and

scientific formula

to simulate variances.

  • Secondary Level Absences
    • the indicator variable is treated as a binomial rate for shrinkage, but the numerator and denominator are in general non-integer. Writing µ = logit(r/n) and d² = V[logit(r/n)], simulation was performed using N(µ, d²), and these simulated values were transformed back to generate simulated rates, r_s.
  • Access Domain Indicators
    • The indicator variables used in the access domain are mean travel times to five key utilities: a GP surgery, a petrol station, a Post Office, a primary school and a supermarket. To obtain a sensible distribution for each of these indicators within data zones, it was assumed that the variance of travel times within a data zone is proportional to the mean travel time, and that the constant of proportionality can be estimated by the ratio of the between-data zone variance to the national mean travel time (both population-weighted). For each indicator variable, simulated travel times were therefore generated by sampling from

scientific formula

where µ is the observed mean travel time in a particular data zone, and µ_pop and d²_pop are the population-weighted mean and variance of travel times nationally.
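
Two of the exceptions above can be illustrated with a short sketch. The first function follows the Secondary Level Absences description: draw from N(µ, d²) on the logit scale and transform back to a rate. How V[logit(r/n)] is estimated is not stated here, so it is passed in as an input. The second function only computes the within-data zone variance implied by the proportionality assumption for the access indicators; the sampling distribution itself is not reproduced. All names and example values are hypothetical.

```python
import math
import random


def logit(x):
    return math.log(x / (1.0 - x))


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_rates(rate, var_logit, n_sims=1000, seed=0):
    """Simulate rates by sampling N(mu, d^2) on the logit scale.

    rate      : observed rate r/n (numerator and denominator may be non-integer)
    var_logit : assumed value of V[logit(r/n)], supplied by the caller
    """
    rng = random.Random(seed)
    mu = logit(rate)
    sd = math.sqrt(var_logit)
    # Back-transforming keeps every simulated rate strictly between 0 and 1.
    return [inv_logit(rng.gauss(mu, sd)) for _ in range(n_sims)]


def within_zone_travel_variance(mu, mu_pop, var_pop):
    """Variance proportional to the zone mean, with constant var_pop / mu_pop."""
    return mu * var_pop / mu_pop


# Hypothetical values for illustration only.
rates = simulate_rates(rate=0.08, var_logit=0.09)
print(min(rates), max(rates))
print(within_zone_travel_variance(mu=10.0, mu_pop=5.0, var_pop=4.0))
```

Working on the logit scale avoids simulated rates outside the interval (0, 1), which direct Normal simulation of the rate could produce.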

4.1.3. Methods for Comparison

Each method is applied to the simulated datasets, producing for each simulation a corresponding set of rankings of the data zones based on the SIMD produced. For each data zone, the distribution of these rankings is of interest. For the original algorithm, these will be reported graphically, showing the variation in SIMD ranks over the simulations in relation to the observed SIMD rank. For each alternative method that can be applied to simulated data, the standard deviation (SD) of SIMD ranks for each data zone will be expressed relative to the SD of SIMD ranks under the original algorithm. These SD ratios will be displayed graphically, in relation to the original SIMD rank, to determine whether alternative methods result in more or less variability of SIMD ranks, and over which sections of the underlying deprivation distribution.
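
The SD ratio described above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code: ranks are held as one list per simulation, the sample standard deviation is used (the text does not say which form of SD was applied), and the function names are hypothetical.

```python
import statistics


def rank_sd_by_zone(rank_sims):
    """Per-zone SD of ranks; rank_sims[s][z] is the rank of zone z in simulation s."""
    n_zones = len(rank_sims[0])
    return [statistics.stdev([sim[z] for sim in rank_sims]) for z in range(n_zones)]


def sd_ratios(alt_rank_sims, orig_rank_sims):
    """SD of ranks under an alternative method relative to the original algorithm."""
    alt = rank_sd_by_zone(alt_rank_sims)
    orig = rank_sd_by_zone(orig_rank_sims)
    # A zone whose rank never varies under the original algorithm has no defined ratio.
    return [a / o if o > 0 else float("nan") for a, o in zip(alt, orig)]


# Toy example: three data zones, two simulations per method.
original = [[1, 2, 3], [3, 2, 1]]
alternative = [[1, 2, 3], [2, 1, 3]]
print(sd_ratios(alternative, original))
```

A ratio below 1 indicates that the alternative method gives less variable ranks for that data zone than the original algorithm, and a ratio above 1 indicates more.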

For all of the alternative methods that can be applied to the simulated datasets, the probability that each data zone lies within the 15% most deprived nationally will be estimated. These will be tabulated over subgroups of data zones. Data zones will be grouped into LAs and subgroups defined by the 6-level classification of data zones according to the SE Urban/Rural Indicator.
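
A minimal sketch of this estimate is given below. It assumes that rank 1 denotes the most deprived data zone, that the probability is estimated as the share of simulations in which a zone's rank falls within the chosen fraction, and that tabulation over a subgroup means averaging these probabilities within it. The handling of the cut-off boundary, the function names and the group labels are illustrative assumptions.

```python
from collections import defaultdict


def prob_most_deprived(rank_sims, fraction=0.15):
    """Estimated probability that each zone lies in the most deprived fraction.

    rank_sims[s][z] is the rank of zone z in simulation s (rank 1 = most deprived).
    """
    n_sims = len(rank_sims)
    n_zones = len(rank_sims[0])
    cutoff = fraction * n_zones
    return [sum(sim[z] <= cutoff for sim in rank_sims) / n_sims
            for z in range(n_zones)]


def mean_by_group(probs, groups):
    """Average the per-zone probabilities within each subgroup (e.g. LA)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prob, group in zip(probs, groups):
        totals[group] += prob
        counts[group] += 1
    return {g: totals[g] / counts[g] for g in totals}


# Toy example: four zones in two hypothetical groups, using the most
# deprived half so that the cut-off is easy to follow by hand.
sims = [[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4], [3, 1, 2, 4]]
probs = prob_most_deprived(sims, fraction=0.5)
print(probs)
print(mean_by_group(probs, ["A", "A", "B", "B"]))
```

The group average can be read as the expected proportion of that group's data zones falling within the most deprived fraction.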

4.2. Model-Based Uncertainty Estimates

The literature on institutional performance indicators and league tables presents methods for the estimation of uncertainty in ranks that depend on fitting random effects models to indicator data. This can be achieved with MCMC methods, using the posterior distributions of the rank of each unit, or by fitting multilevel models and using the estimated distributions of random effects at each level of the model to simulate an approximate posterior distribution for the ranks of each unit.
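
Both approaches end with the same step: each draw of the unit effects is converted to a set of ranks, and the ranks are summarised across draws. The sketch below shows only that step. It assumes the draws are already available (from an MCMC fit or from simulated random effects), ranks the largest effect as 1, and uses crude empirical quantiles; the function names are hypothetical.

```python
import math


def ranks_desc(values):
    """Rank 1 for the largest value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def rank_intervals(draws, lower=0.025, upper=0.975):
    """Summarise the rank distribution of each unit as (lower, median, upper).

    draws[k][u] is the k-th draw of the effect for unit u.
    """
    rank_draws = [ranks_desc(d) for d in draws]
    n_draws = len(rank_draws)
    summary = []
    for u in range(len(draws[0])):
        rs = sorted(rd[u] for rd in rank_draws)
        # Round outwards so the interval is conservative when draws are few.
        lo = rs[math.floor(lower * (n_draws - 1))]
        hi = rs[math.ceil(upper * (n_draws - 1))]
        summary.append((lo, rs[n_draws // 2], hi))
    return summary


# Toy example: three draws for three units.
draws = [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0], [5.0, 4.0, 0.0]]
print(rank_intervals(draws))
```

Wide intervals for many units would indicate that the ranking is only weakly determined by the data.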

Such models exploit the between-indicator correlations to obtain estimates of institutional performance. In principle, similar methods could be applied to deprivation indicators across data zones. However, to reflect the variability in measures of deprivation realistically, uncertainty at each step in the construction of the SIMD should be incorporated. The best methods for modelling this variability are not immediately clear.

Each indicator variable can itself be seen as an observation of a random variable, but could also be thought of as a fixed quantity, measured without error, so that the resultant deprivation domains and SIMD are viewed as conditional on the observed indicator variables. In the current methodology, however, the process of shrinkage implies that the indicators used in the construction of the health and education domain scores are random quantities, and the use of factor analysis implies an additional level of random variation within the health, education and access domains.

Constructing realistic estimates of rank uncertainty would require a consistent approach to all indicator variables, with each indicator variable considered an observation of a random quantity. However, there are approximately a hundred variables that undergo shrinkage and/or factor analysis, and incorporating uncertainty estimates around the current methodology using MCMC methods could be computationally infeasible.

In this report, MCMC methods will be explored to illustrate how they could be used at various stages of the SIMD calculation. The complexity of the current algorithm prohibits the full application of these methods to generate uncertainty estimates for the rank of each data zone on the deprivation domains and the SIMD. However, if the current algorithm were to be simplified, it may become possible to calculate such measures of uncertainty within a reasonable timeframe.


Page updated: Tuesday, October 18, 2005