Agencies seeking to disseminate public use microdata, i.e., data on individual records, can replace confidential values with multiple draws from statistical models estimated with the collected data. We present a famework for evaluating disclosure risks inherent in releasing multiply-imputed, synthetic data. The basic idea is to mimic an intruder who computes posterior distributions of confidential values given the released synthetic data and prior knowledge. We illustrate the methodology with artificial fully synthetic data and with partial synthesis of the Survey of Youth in Custody.

%B Journal of Privacy and Confidentiality %V 6 %8 2014 %G eng %U http://repository.cmu.edu/jpc/vol6/iss1/2 %N 1 %0 Journal Article %J Journal of Business and Economic Statistics %D 2014 %T Multiple imputation of missing or faulty values under linear constraints %A Kim, H. J. %A Reiter, J. P. %A Wang, Q. %A Cox, L. H. %A Karr, A. F. %XMany statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.

%B Journal of Business and Economic Statistics %V 32 %P 375-386 %G eng %& 375 %R 10.1080/07350015.2014.885435 %0 Report %D 2013 %T Bayesian multiple imputation for large-scale categorical data with structural zeros %A Manrique-Vallier, D. %A Reiter, J. P. %X Bayesian multiple imputation for large-scale categorical data with structural zeros Manrique-Vallier, D.; Reiter, J. P. We propose an approach for multiple imputation of items missing at random in large-scale surveys with exclusively categorical variables that have structural zeros. Our approach is to use mixtures of multinomial distributions as imputation engines, accounting for structural zeros by conceiving of the observed data as a truncated sample from a hypothetical population without structural zeros. This approach has several appealing features: imputations are generated from coherent, Bayesian joint models that automatically capture complex dependencies and readily scale to large numbers of variables. We outline a Gibbs sampling algorithm for implementing the approach, and we illustrate its potential with a repeated sampling study using public use census microdata from the state of New York, USA. %I Duke University / National Institute of Statistical Sciences (NISS) %G eng %U http://hdl.handle.net/1813/34889 %9 Preprint