Generating synthetic microdata to widen access to sensitive data sets: method, software and empirical examples
Nowok, N., Raab, G. & Dibben, C. (2014) BSPS Annual Conference 2014, Univ of Winchester, 8 - 10 September 2014. [SLS][ONS LS][NILS][CALLS]
Other information:
SLS project page | ONS LS project page
Abstract:
In many contexts, confidentiality constraints severely restrict access to unique and valuable microdata. The UK Longitudinal Studies, which link census and other health and administrative data for individuals and their immediate families across several decades, are a good example and motivated this study. To allow academics and other users to carry out their research more freely, a synthetic version of a bespoke data set can be generated and provided to users with fewer access restrictions. Synthetic data mimic the real data and preserve the relationships between variables and transitions over time, but they do not include any real individuals. The basic idea of data synthesis is to replace genuine data with values sampled from conditional probability distributions. We develop a sequential algorithm for producing a synthetic data set and implement it in R as a freely available package called ‘synthpop’. Users can choose between different parametric and non-parametric synthesising models. The latter include classification and regression tree (CART) models. To validate the method, we compare statistical inference based on real and synthetic data for research projects using the SLS data (including 2011 Census data). In addition, we compare the relative performance of parametric and non-parametric synthesising models.
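The sequential idea described in the abstract can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not the synthpop implementation: the function names `synthesise_sequentially` and `pearson`, the two-variable setup, and the made-up age and income data are all assumptions for illustration. The first variable is resampled from its observed marginal distribution; the second is drawn from a conditional distribution fitted on the real data, here a simple linear regression with normal errors standing in for the parametric option (a CART model would be a non-parametric alternative).

```python
import random
import statistics


def synthesise_sequentially(x_real, y_real, seed=0):
    """Toy two-variable sequential synthesis (hypothetical helper, not synthpop)."""
    rng = random.Random(seed)
    n = len(x_real)

    # Step 1: the first variable has no predictors, so sample it
    # with replacement from its observed marginal distribution.
    x_syn = rng.choices(x_real, k=n)

    # Step 2: fit the conditional model y | x on the REAL data
    # (ordinary least squares, an assumed parametric choice).
    mean_x = statistics.fmean(x_real)
    mean_y = statistics.fmean(y_real)
    sxx = sum((x - mean_x) ** 2 for x in x_real)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_real, y_real))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(x_real, y_real))
    resid_sd = (rss / (n - 2)) ** 0.5

    # Step 3: draw synthetic y from the fitted conditional distribution,
    # evaluated at the already-synthesised x values.
    y_syn = [intercept + slope * x + rng.gauss(0.0, resid_sd) for x in x_syn]
    return x_syn, y_syn


def pearson(a, b):
    """Pearson correlation, used only to compare real and synthetic data."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


# Made-up "real" data purely for demonstration.
data_rng = random.Random(42)
age = [data_rng.uniform(20, 60) for _ in range(500)]
income = [1000 + 50 * a + data_rng.gauss(0, 200) for a in age]

x_syn, y_syn = synthesise_sequentially(age, income, seed=1)
print("real correlation:     ", round(pearson(age, income), 3))
print("synthetic correlation:", round(pearson(x_syn, y_syn), 3))
```

With more than two variables, the same pattern would repeat: each further variable is modelled on the real data conditional on the variables before it, then generated from that model using the synthetic values already drawn. Comparing a summary such as the correlation on real and synthetic data mirrors, in a very reduced form, the validation approach the abstract describes.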
Available online: Link
Output from project: 2013_012 (SLS), 30158 (ONS LS), 079 (NILS)