TY - RPRT
T1 - Regression Modeling and File Matching Using Possibly Erroneous Matching Variables
Y1 - 2016
A1 - Dalzell, N. M.
A1 - Reiter, J. P.
KW - Statistics - Applications
AB - Many analyses require linking records from two databases comprising overlapping sets of individuals. In the absence of unique identifiers, the linkage procedure often involves matching on a set of categorical variables, such as demographics, common to both files. Typically, however, the resulting matches are inexact: some cross-classifications of the matching variables do not generate unique links across files. Further, the matching variables can be subject to reporting errors, which introduce additional uncertainty in analyses. We present a Bayesian file matching methodology designed to estimate regression models and match records simultaneously when categorical matching variables are subject to reporting error. The method relies on a hierarchical model that includes (1) the regression of interest involving variables from the two files given a vector indicating the links, (2) a model for the linking vector given the true values of the matching variables, (3) a measurement error model for reported values of the matching variables given their true values, and (4) a model for the true values of the matching variables. We describe algorithms for sampling from the posterior distribution of the model. We illustrate the methodology using artificial data and data from education records in the state of North Carolina.
PB - ArXiv
UR - http://arxiv.org/abs/1608.06309
ER -
TY - JOUR
T1 - Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography
JF - ArXiv
Y1 - 2015
A1 - Quick, H.
A1 - Holan, S. H.
A1 - Wikle, C. K.
A1 - Reiter, J. P.
AB - Many data stewards collect confidential data that include fine geography. When sharing these data with others, data stewards strive to disseminate data that are informative for a wide range of spatial and non-spatial analyses while simultaneously protecting the confidentiality of data subjects' identities and attributes. Typically, data stewards meet this challenge by coarsening the resolution of the released geography and, as needed, perturbing the confidential attributes. When done with high intensity, these redaction strategies can result in released data with poor analytic quality. We propose an alternative dissemination approach based on fully synthetic data. We generate data using marked point process models that can maintain both the statistical properties and the spatial dependence structure of the confidential data. We illustrate the approach using data consisting of mortality records from Durham, North Carolina.
UR - http://arxiv.org/abs/1407.7795
IS - 1407.7795
ER -