TY - JOUR T1 - A framework for sharing confidential research data, applied to investigating differential pay by race in the U. S. government Y1 - Submitted A1 - Barrientos, A. F. A1 - Bolton, A. A1 - Balmat, T. A1 - Reiter, J. P. A1 - Machanavajjhala, A. A1 - Chen, Y. A1 - Kneifel, C. A1 - DeLong, M. A1 - de Figueiredo, J. M. AB - Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. We present a framework for addressing this challenge. The framework uses an integrated system that includes fully synthetic data intended for wide access, coupled with means for approved users to access the confidential data via secure remote access solutions, glued together by verification servers that allow users to assess the quality of their analyses with the synthetic data. We apply this framework to data on the careers of employees of the U. S. federal government, studying differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up in the confidential data and which do not. We find differentials across races; for example, the gap between black and white female federal employees' pay increased over the time period. We present models for generating synthetic careers and differentially private algorithms for verification of regression results. ER - TY - CONF T1 - Differentially private regression diagnostics T2 - IEEE International Conference on Data Mining Y1 - 2017 A1 - Chen, Y. A1 - Machanavajjhala, A. A1 - Reiter, J. P. A1 - Barrientos, A. AB - Many data producers seek to provide users access to confidential data without unduly compromising data subjects' privacy and confidentiality. 
When intense redaction is needed to do so, one general strategy is to require users to do analyses without seeing the confidential data, for example, by releasing fully synthetic data or by allowing users to query remote systems for disclosure-protected outputs of statistical models. With fully synthetic data or redacted outputs, the analyst never really knows how much to trust the resulting findings. In particular, if the user did the same analysis on the confidential data, would regression coefficients of interest be statistically significant or not? We present algorithms for assessing this question that satisfy differential privacy. We describe conditions under which the algorithms should give accurate answers about statistical significance. We illustrate the properties of the methods using artificial and genuine data. JF - IEEE International Conference on Data Mining ER - TY - JOUR T1 - Itemwise conditionally independent nonresponse modeling for multivariate categorical data JF - Biometrika Y1 - 2017 A1 - Sadinle, M. A1 - Reiter, J. P. KW - Identification KW - Missing not at random KW - Non-parametric saturated KW - Partial ignorability KW - Sensitivity analysis AB - With nonignorable missing data, likelihood-based inference should be based on the joint distribution of the study variables and their missingness indicators. These joint models cannot be estimated from the data alone, thus requiring the analyst to impose restrictions that make the models uniquely obtainable from the distribution of the observed data. We present an approach for constructing classes of identifiable nonignorable missing data models. The main idea is to use a sequence of carefully set up identifying assumptions, whereby we specify potentially different missingness mechanisms for different blocks of variables. We show that the procedure results in models with the desirable property of being non-parametric saturated. 
VL - 104 ER - TY - JOUR T1 - Incorporating marginal prior information into latent class models JF - Bayesian Analysis Y1 - 2016 A1 - Schifeling, T. S. A1 - Reiter, J. P. VL - 11 UR - https://projecteuclid.org/euclid.ba/1434649584 ER - TY - JOUR T1 - Accounting for nonignorable unit nonresponse and attrition in panel studies with refreshment samples JF - Journal of Survey Statistics and Methodology Y1 - 2015 A1 - Schifeling, T. A1 - Cheng, C. A1 - Hillygus, D. S. A1 - Reiter, J. P. AB - Panel surveys typically suffer from attrition, which can lead to biased inference when basing analysis only on cases that complete all waves of the panel. Unfortunately, panel data alone cannot inform the extent of the bias from the attrition, so that analysts using the panel data alone must make strong and untestable assumptions about the missing data mechanism. Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during some later wave of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by nonignorable attrition while reducing reliance on strong assumptions about the attrition process. To date, these bias correction methods have not dealt with two key practical issues in panel studies: unit nonresponse in the initial wave of the panel and in the refreshment sample itself. As we illustrate, nonignorable unit nonresponse can significantly compromise the analyst’s ability to use the refreshment samples for attrition bias correction. Thus, it is crucial for analysts to assess how sensitive their inferences—corrected for panel attrition—are to different assumptions about the nature of the unit nonresponse. We present an approach that facilitates such sensitivity analyses, both for suspected nonignorable unit nonresponse in the initial wave and in the refreshment sample.
We illustrate the approach using simulation studies and an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study. VL - 3 UR - http://jssam.oxfordjournals.org/content/3/3/265.abstract IS - 3 ER - TY - JOUR T1 - Multiple imputation for harmonizing longitudinal non-commensurate measures in individual participant data meta-analysis JF - Statistics in Medicine Y1 - 2015 A1 - Siddique, J. A1 - Reiter, J. P. A1 - Brincks, A. A1 - Gibbons, R. A1 - Crespi, C. A1 - Brown, C. H. UR - http://onlinelibrary.wiley.com/doi/10.1002/sim.6562/abstract ER - TY - JOUR T1 - Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence JF - arXiv Y1 - 2015 A1 - Murray, J. S. A1 - Reiter, J. P. AB - We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (i) modeling the means of the normal distributions as component-specific functions of the categorical variables and (ii) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. 
We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. UR - arxiv.org/abs/1410.0438 IS - 1410.0438 ER - TY - JOUR T1 - Simultaneous Edit-Imputation for Continuous Microdata JF - Journal of the American Statistical Association Y1 - 2015 A1 - Kim, H. J. A1 - Cox, L. H. A1 - Karr, A. F. A1 - Reiter, J. P. A1 - Wang, Q. VL - 110 UR - http://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1040881 ER - TY - JOUR T1 - Bayesian estimation of disclosure risks for multiply imputed, synthetic data JF - Journal of Privacy and Confidentiality Y1 - 2014 A1 - Reiter, J. P. A1 - Wang, Q. A1 - Zhang, B. AB -

Agencies seeking to disseminate public use microdata, i.e., data on individual records, can replace confidential values with multiple draws from statistical models estimated with the collected data. We present a framework for evaluating disclosure risks inherent in releasing multiply-imputed, synthetic data. The basic idea is to mimic an intruder who computes posterior distributions of confidential values given the released synthetic data and prior knowledge. We illustrate the methodology with artificial fully synthetic data and with partial synthesis of the Survey of Youth in Custody.

VL - 6 UR - http://repository.cmu.edu/jpc/vol6/iss1/2 IS - 1 ER - TY - JOUR T1 - Multiple imputation of missing or faulty values under linear constraints JF - Journal of Business and Economic Statistics Y1 - 2014 A1 - Kim, H. J. A1 - Reiter, J. P. A1 - Wang, Q. A1 - Cox, L. H. A1 - Karr, A. F. AB -

Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.

VL - 32 ER - TY - RPRT T1 - Bayesian multiple imputation for large-scale categorical data with structural zeros Y1 - 2013 A1 - Manrique-Vallier, D. A1 - Reiter, J. P. AB - We propose an approach for multiple imputation of items missing at random in large-scale surveys with exclusively categorical variables that have structural zeros. Our approach is to use mixtures of multinomial distributions as imputation engines, accounting for structural zeros by conceiving of the observed data as a truncated sample from a hypothetical population without structural zeros. This approach has several appealing features: imputations are generated from coherent, Bayesian joint models that automatically capture complex dependencies and readily scale to large numbers of variables. We outline a Gibbs sampling algorithm for implementing the approach, and we illustrate its potential with a repeated sampling study using public use census microdata from the state of New York, USA. PB - Duke University / National Institute of Statistical Sciences (NISS) UR - http://hdl.handle.net/1813/34889 ER -