Synthetic Data for the UK Longitudinal Studies – SYLLS
Adam Dennett, UCL
As we head into a new year, we draw closer to the end of the SYLLS project. Since it started in April 2013, the project has been run as a joint venture between the three Longitudinal Studies Research Support Units (RSUs) and the CALLS-Hub, with the aim of generating synthetic longitudinal data which are not subject to the same access restrictions as the real Census-based longitudinal microdata for England, Wales, Scotland and Northern Ireland.
The project has been split between teams based at CeLSIUS at UCL and the SLS-DSU in Edinburgh / St Andrews. The London team have been tasked with generating the ‘Synthetic Spine’ dataset. This is a partial replication of the full set of individuals contained in the 1991 LSs of England and Wales, Scotland and Northern Ireland who were then also enumerated in the 2001 Census. The replication is partial because we have not attempted to synthesise every variable contained in the LSs for every individual; rather, we have focused on a selection of the variables most frequently requested in previous LS-based research projects (age, sex, ethnicity, health, births, deaths, geography).
To generate the synthetic spine dataset, we have used publicly available data from the 1991 Samples of Anonymised Records (SARs) as our base. The SARs are similar to the LSs in that they are microdata records, and so are well suited to this task. A bespoke microsimulation model has been built by Belinda Wu to generate the synthetic spine from the SARs data. We began with England and Wales: a baseline population for the 1991 synthetic LS was generated by constraining aggregated (local authority) area-level data from the SARs to similar area-level data from the LS using the tried-and-tested iterative proportional fitting technique. Individuals were then sampled from this new dataset to build our synthetic LS population. Once the 1991 baseline population has been created, transitional probabilities are calculated from the LS to age our simulated individuals on 10 years and give them the same characteristics that we would see for those LS members enumerated in both the 1991 and 2001 Censuses.
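For readers unfamiliar with the two steps just described, the following is a minimal Python sketch of the general techniques, not the project's microsimulation model. It rescales a small two-way table until its margins match a set of target totals (iterative proportional fitting), then draws a 2001 state for one individual from a table of transition probabilities. The function names `ipf` and `age_on`, the age-by-sex layout and every number are invented for illustration; none of them come from the SARs or the LSs.

```python
import random


def ipf(seed_table, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D table of counts until its margins match the targets."""
    table = [row[:] for row in seed_table]  # work on a copy
    for _ in range(max_iter):
        # Step 1: scale each row so that it sums to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [cell * target / total for cell in table[i]]
        # Step 2: scale each column so that it sums to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match, so convergence depends on the row totals.
        worst = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if worst < tol:
            break
    return table


def age_on(state, transitions, rng):
    """Draw a later state for one individual from transition probabilities."""
    outcomes = list(transitions[state].keys())
    weights = list(transitions[state].values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


# Invented seed counts (rows: two age bands, columns: two sexes).
seed_counts = [[40.0, 60.0], [30.0, 70.0]]
# Invented target margins standing in for the constraining totals.
age_totals = [55.0, 45.0]
sex_totals = [48.0, 52.0]
fitted = ipf(seed_counts, age_totals, sex_totals)

# Invented probabilities of health status in 2001 given status in 1991.
health_transitions = {
    "good": {"good": 0.8, "poor": 0.2},
    "poor": {"good": 0.3, "poor": 0.7},
}
status_2001 = age_on("good", health_transitions, random.Random(0))
```

The fitted table keeps the association between age and sex found in the seed counts while reproducing the target totals, which is why the technique suits constraining one data source to another.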
The England and Wales LS Synthetic Spine is now complete; we are currently working on finishing a similar dataset for the SLS and will soon be tackling the Northern Ireland LS. Northern Ireland is a slightly different case as the 1991 to 2001 link has not yet been completed, but as the NILS sample is around a quarter of the resident population, the aggregate distributions are likely to be very similar to the distributions for the full Census. We will therefore use the 1991 Census distributions to generate our 1991 baseline and calculate the transitions to 2001 using our microsimulation software as soon as the link project is complete.
While the London team have been beavering away on the synthetic spine, the team based in Scotland have been working feverishly on the other half of the synthetic project. The second half of the project is approaching the generation of synthetic data from a different angle entirely: rather than attempting to create a large, general use dataset, here we are tailoring synthetic data to the individual needs of the user. Very soon, if you formulate a project and submit a request to access data from any of the national LSs, you will be asked if you would like to also receive a bespoke, fully synthetic version of your specific data request to work with as you wish on your own computer – something which is not possible with the real data.
The bespoke data are generated using a new R package called ‘synthpop’ developed by Beata Nowok and Gillian Raab in the Scotland team. Synthpop is a multiple synthesis package which allows user support officers to quickly generate fully synthetic versions of the data requested by the user. The data are generated through a series of models which estimate the values of one variable from the values of all others in the dataset sequentially. One of the benefits of this approach is that the resulting data are statistically equivalent to the real data, despite containing no real values.
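The idea of synthesising one variable after another can be illustrated with a toy Python sketch. This is not the synthpop API (synthpop is an R package and fits much richer models); it is a deliberately simplified stand-in in which each variable is drawn from the values observed alongside the value already synthesised for the variable immediately before it. The records, the variable names and the functions `fit_conditionals` and `synthesise` are all invented for illustration.

```python
import random
from collections import defaultdict


def fit_conditionals(records, order):
    """For each variable, list its observed values given the previous variable."""
    models = []
    for pos, var in enumerate(order):
        table = defaultdict(list)
        for rec in records:
            # The first variable has no predictor, so it is keyed on None.
            key = rec[order[pos - 1]] if pos > 0 else None
            table[key].append(rec[var])
        models.append(table)
    return models


def synthesise(models, order, n, rng):
    """Build n synthetic records by drawing variables in sequence."""
    synthetic = []
    for _ in range(n):
        new = {}
        for pos, var in enumerate(order):
            key = new[order[pos - 1]] if pos > 0 else None
            # Draw from the values seen with the previously drawn value.
            new[var] = rng.choice(models[pos][key])
        synthetic.append(new)
    return synthetic


# Invented example records; these are not LS data.
real = [
    {"sex": "F", "age_group": "16-44", "health": "good"},
    {"sex": "F", "age_group": "45-64", "health": "poor"},
    {"sex": "M", "age_group": "16-44", "health": "good"},
    {"sex": "M", "age_group": "65+", "health": "poor"},
    {"sex": "F", "age_group": "65+", "health": "good"},
]
order = ["sex", "age_group", "health"]
models = fit_conditionals(real, order)
synthetic = synthesise(models, order, 10, random.Random(42))
```

Because each draw depends only on the preceding variable, a synthetic record can combine values in ways that no single real record does, while pairwise relationships between neighbouring variables are carried over from the real data.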
We are now in the process of testing the synthpop package, with the Edinburgh team coming to visit London and the ONS LS virtual microdata lab to train the CeLSIUS user support officers and test the package on different data. A similar visit to Belfast and the NILS-RSU ‘safe-setting’ is scheduled shortly after that.
We are very excited to be launching both synthetic data products on the 6th of March at the UK LS 2011 Census Linkage Launch event, and we hope to be able to provide user access to both the synthetic spine and bespoke synthetic tabulations very shortly afterwards.