Synthetic LS data
The England and Wales Longitudinal Study (ONS LS), Scottish Longitudinal Study (SLS) and Northern Ireland Longitudinal Study (NILS) are incredibly rich micro-datasets that link census data with health and administrative data (births, deaths, marriages, cancer registrations) for individuals and their immediate families across several decades. Although they are unique and valuable resources, the sensitive nature of the information they contain means that access to the microdata is restricted to approved researchers and LS support staff, who can only view and work with the data in safe settings controlled by the national statistical agencies. Consequently, compared with other census data products such as the aggregate statistics or interaction data, the three longitudinal studies are used by a small number of researchers, which limits their potential impact.
Because confidentiality constraints rule out open access to the real microdata, alternative options are needed to allow academics and other users to carry out their research more freely. To address this, the SYLLS project (Synthetic Data Estimation for UK Longitudinal Studies) has been set up. SYLLS is developing techniques to produce synthetic data that mimic the real data and preserve both the relationships between variables and the transitions of individuals over time, but are more freely accessible.
This project, a collaboration between the three UK Longitudinal Study Research Support Units (CeLSIUS, SLS-DSU and NILS-RSU), makes use of two complementary methods for generating synthetic data products:
- Statistical modelling with conditional specification is used to generate bespoke synthetic datasets for individual research projects. Users can ask for synthetic versions of the data they have requested from the Longitudinal Studies, for use outside the secure microdata laboratories. These data will make simple tasks such as refining analysis scripts easier, and we are confident that the synthetic data will be good enough to produce analysis results very close to those that would be obtained from the real data. After developing analyses on the synthetic data, users will have the option of having them repeated and, we hope, confirmed on the actual LS datasets. Routines to generate a synthetic version of a real dataset are implemented in the R package ‘synthpop’. The package and a vignette with a working example are freely available from the R website. For information about requesting a bespoke synthetic tabulation for your research, please speak to the particular Research Support Unit administering your project.
- A simple proportional fitting algorithm has been used to extend the national 2011 Census Microdata Teaching files (derived from the Samples of Anonymised Records) to include longitudinal transitions (currently a ten-year transition back to 2001) derived from the real LS data. The state transitions in the Synthetic Spine cover some of the most commonly used variables in social science research, including health, marital status, social grade and religion, as well as estimates of the numbers of births and deaths. These data are for general use and will allow users who are unfamiliar with longitudinal data to get a feel for the kinds of questions that can be answered by following individuals through time and observing changes in their characteristics. You can now download a synthetic ONS LS (England and Wales), SLS and NILS spine:
- ONS LS synthetic data spine (XLSX 64MB)
- SLS synthetic data spine (XLSX 8.9MB)
- NILS synthetic data spine (XLSX 2.2MB)
- Practical exercise using the ONS LS spine data (link to Adam Dennett’s blog)
- R script used for creating the synthetic spine (ZIP 8KB)
- Journal paper about the methodology (open access)
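For the first method above (bespoke datasets generated by conditional specification), the core idea can be illustrated with a minimal Python sketch. It is not the synthpop API and not the SYLLS implementation: simple frequency tables stand in for the statistical models, and the function name `synthesise`, the two variables and the toy records are all hypothetical. The first variable is drawn from its observed distribution, and the second is then drawn from its observed distribution conditional on the first.

```python
import random
from collections import Counter, defaultdict


def synthesise(records, n, seed=0):
    """Draw n synthetic (a, b) pairs: a from its marginal, then b given a."""
    rng = random.Random(seed)

    # Observed distribution of the first variable.
    marginal = Counter(a for a, _ in records)

    # Observed distribution of the second variable for each value of the first.
    conditional = defaultdict(Counter)
    for a, b in records:
        conditional[a][b] += 1

    synthetic = []
    for _ in range(n):
        a = rng.choices(list(marginal), weights=list(marginal.values()))[0]
        b_counts = conditional[a]
        b = rng.choices(list(b_counts), weights=list(b_counts.values()))[0]
        synthetic.append((a, b))
    return synthetic


# Hypothetical toy "real" records: (marital status, general health).
real = (
    [("single", "good")] * 40
    + [("single", "poor")] * 10
    + [("married", "good")] * 30
    + [("married", "poor")] * 20
)

fake = synthesise(real, 1000)
```

No synthetic row is a copy of a particular real record, yet the relationship between the two variables is approximately preserved. With more variables, each one would be drawn conditional on those already synthesised.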
NOTE: The datasets contain synthetic data based on the ONS Longitudinal Study for England and Wales (ONS LS), the Scottish Longitudinal Study (SLS) and the Northern Ireland Longitudinal Study (NILS) and do not contain real ONS LS, SLS or NILS data. The synthetic datasets have been developed for teaching purposes and for researchers to familiarise themselves with longitudinal data. Real ONS LS microdata are available in safe settings in London, Newport (South Wales) and Titchfield (Hampshire). Real SLS microdata are available in the safe setting in Edinburgh. Real NILS microdata are available in the safe setting in Belfast. The microdata are never released outside these safe settings. Researchers may apply to use ONS LS data by contacting CeLSIUS, based at University College London; SLS data by contacting SLS-DSU, based at Ladywell House, Edinburgh; or NILS data by contacting NILS-RSU, based at Colby House, Belfast. The ONS LS, SLS and NILS synthetic spines are released under an Open Government Licence agreement.
You can read more about the technique for creating synthetic spine datasets in this journal paper. More information can be found in this blog post by Dr Adam Dennett and our GitHub repository.
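As a rough illustration of proportional fitting, the sketch below uses iterative proportional fitting, one common form of the idea. Whether the SYLLS R script works in exactly this way is an assumption, and the function name `ipf`, the seed table and the target totals are hypothetical. A seed table of transitions (rows for the 2001 state, columns for the 2011 state) is rescaled until its row and column totals match the required margins.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D table so its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed table
    for _ in range(max_iter):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [value * target / total for value in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match; stop once the rows also match within tolerance.
        if all(abs(sum(table[i]) - t) < tol for i, t in enumerate(row_targets)):
            break
    return table


# Hypothetical seed: transition pattern between two health states.
seed = [[8.0, 2.0],
        [3.0, 7.0]]
row_targets = [600.0, 400.0]  # assumed 2001 totals
col_targets = [550.0, 450.0]  # assumed 2011 totals

fitted = ipf(seed, row_targets, col_targets)
```

The fitted table keeps the association pattern of the seed while agreeing with both sets of totals, which is the property that lets transition information be attached to a file with known margins.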
Further information on developments from the SYLLS project will be available on this website, and you can also contact us if you would like to find out more.