Generating synthetic microdata to widen access to sensitive data sets: method, software and empirical examples
Nowok, B., Raab, G. & Dibben, C. (2014) BSPS Annual Conference 2014, University of Winchester, 8-10 September 2014. [SLS][ONS LS][NILS][CALLS]
In many contexts, confidentiality constraints severely restrict access to unique and valuable microdata. The UK Longitudinal Studies, which link census data with health and other administrative data for individuals and their immediate families across several decades, are a good example and motivated this study. To allow academics and other users to carry out their research more freely, a synthetic version of a bespoke data set can be generated and provided to users with fewer access restrictions. Synthetic data mimic the real data and preserve the relationships between variables and transitions over time, but they do not include any real individuals. The basic idea of data synthesis is to replace genuine data with values sampled from conditional probability distributions. We develop a sequential algorithm for producing a synthetic data set and implement it in R as a freely available package called ‘synthpop’. Users can choose between different parametric and non-parametric synthesising models. The latter include classification and regression tree (CART) models. To validate the method, we compare statistical inference based on real and synthetic data for research projects using the SLS data (including 2011 Census data). In addition, we compare the relative performance of parametric and non-parametric synthesising models.
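The sequential idea described above can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the ‘synthpop’ implementation: the function name `synthesise`, its parameters, and the toy records are all hypothetical, and a plain empirical conditional distribution stands in for the parametric and CART models mentioned in the abstract. Each variable is synthesised in turn by sampling from the observed values of that variable among real records that match the values already synthesised for the row.

```python
import random
from collections import defaultdict


def synthesise(records, order, n_synthetic, seed=0):
    """Toy sequential synthesis for categorical variables (illustrative only).

    Variables are synthesised one at a time in `order`. Each one is drawn
    from its empirical distribution conditional on the values already
    synthesised for the same row.
    """
    rng = random.Random(seed)
    synthetic = [{} for _ in range(n_synthetic)]
    for j, var in enumerate(order):
        predictors = order[:j]  # variables synthesised before `var`
        # Group the observed values of `var` by the observed predictor values.
        cells = defaultdict(list)
        for rec in records:
            key = tuple(rec[p] for p in predictors)
            cells[key].append(rec[var])
        # Defensive fallback: the marginal distribution of `var`.
        marginal = [rec[var] for rec in records]
        for row in synthetic:
            key = tuple(row[p] for p in predictors)
            row[var] = rng.choice(cells.get(key, marginal))
    return synthetic


# Made-up toy records, not taken from any real study.
real = [
    {"age_group": "16-34", "employed": "yes"},
    {"age_group": "16-34", "employed": "no"},
    {"age_group": "35-64", "employed": "yes"},
    {"age_group": "35-64", "employed": "yes"},
    {"age_group": "65+", "employed": "no"},
]

synthetic = synthesise(real, order=["age_group", "employed"], n_synthetic=5, seed=1)
for row in synthetic:
    print(row)
```

The first variable in `order` has no predictors, so it is drawn from its marginal distribution; later variables are drawn conditionally. Matching exactly on every earlier variable is only workable for a few categorical variables, which is one reason a real synthesiser fits a model (parametric or tree-based) at each step instead.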