TY - JOUR
T1 - Data fusion for correcting measurement errors
Y1 - Submitted
A1 - J. P. Reiter
A1 - T. Schifeling
A1 - M. De Yoreo
AB - Often in surveys, key items are subject to measurement errors. Given just the data, it can be difficult to determine the distribution of this error process, and hence to obtain accurate inferences that involve the error-prone variables. In some settings, however, analysts have access to a data source on different individuals with high quality measurements of the error-prone survey items. We present a data fusion framework for leveraging this information to improve inferences in the error-prone survey. The basic idea is to posit models about the rates at which individuals make errors, coupled with models for the values reported when errors are made. This can avoid the unrealistic assumption of conditional independence typically used in data fusion. We apply the approach on the reported values of educational attainments in the American Community Survey, using the National Survey of College Graduates as the high quality data source. In doing so, we account for the informative sampling design used to select the National Survey of College Graduates. We also present a process for assessing the sensitivity of various analyses to different choices for the measurement error models. Supplemental material is available online.
ER -

TY - JOUR
T1 - An empirical comparison of multiple imputation methods for categorical data
JF - The American Statistician
Y1 - 2017
A1 - F. Li
A1 - O. Akande
A1 - J. P. Reiter
KW - latent
KW - missing
KW - mixture
KW - nonresponse
KW - tree
AB - Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database.
Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online.
VL - 71
UR - http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1277158
IS - 2
ER -

TY - JOUR
T1 - Multiple imputation of missing categorical and continuous outcomes via Bayesian mixture models with local dependence
JF - Journal of the American Statistical Association
Y1 - 2017
A1 - J. S. Murray
A1 - J. P. Reiter
KW - Hierarchical mixture model
KW - Missing data
KW - Nonparametric Bayes
KW - Stick-breaking process
AB - We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables.
We incorporate dependence between the continuous and categorical variables by (i) modeling the means of the normal distributions as component-specific functions of the categorical variables and (ii) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting.
VL - 111
IS - 516
ER -

TY - JOUR
T1 - Stop or continue data collection: A nonignorable missing data approach for continuous variables
JF - Journal of Official Statistics
Y1 - 2017
A1 - T. Paiva
A1 - J. P. Reiter
AB - We present an approach to inform decisions about nonresponse follow-up sampling. The basic idea is (i) to create completed samples by imputing nonrespondents' data under various assumptions about the nonresponse mechanisms, (ii) take hypothetical samples of varying sizes from the completed samples, and (iii) compute and compare measures of accuracy and cost for different proposed sample sizes.
As part of the methodology, we present a new approach for generating imputations for multivariate continuous data with nonignorable unit nonresponse. We fit mixtures of multivariate normal distributions to the respondents' data, and adjust the probabilities of the mixture components to generate nonrespondents' distributions with desired features. We illustrate the approaches using data from the 2007 U.S. Census of Manufactures.
ER -

TY - JOUR
T1 - Bayesian latent pattern mixture models for handling attrition in panel studies with refreshment samples
JF - Annals of Applied Statistics
Y1 - 2016
A1 - Y. Si
A1 - J. P. Reiter
A1 - D. S. Hillygus
VL - 10
UR - http://projecteuclid.org/euclid.aoas/1458909910
ER -

TY - JOUR
T1 - Categorical data fusion using auxiliary information
JF - Annals of Applied Statistics
Y1 - 2016
A1 - B. K. Fosdick
A1 - M. De Yoreo
A1 - J. P. Reiter
KW - Imputation
KW - Integration
KW - Latent Class
KW - Matching
AB - In data fusion analysts seek to combine information from two databases comprised of disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences. We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people's preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion.
VL - 10
UR - http://projecteuclid.org/euclid.aoas/1483606845
ER -

TY - JOUR
T1 - Simultaneous edit-imputation and disclosure limitation for business establishment data
JF - Journal of Applied Statistics
Y1 - 2016
A1 - H. J. Kim
A1 - J. P. Reiter
A1 - A. F. Karr
AB - Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.
ER -

TY - RPRT
T1 - Categorical data fusion using auxiliary information
Y1 - 2015
A1 - Fosdick, B. K.
A1 - Maria DeYoreo
A1 - J. P. Reiter
AB - In data fusion analysts seek to combine information from two databases comprised of disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences.
We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people's preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion.
PB - arXiv
UR - http://arxiv.org/abs/1506.05886
ER -

TY - JOUR
T1 - Nonparametric Bayesian models with focused clustering for mixed ordinal and nominal data
JF - Bayesian Analysis
Y1 - 2015
A1 - M. De Yoreo
A1 - J. P. Reiter
A1 - D. S. Hillygus
AB - Dirichlet process mixtures can be useful models of multivariate categorical data and effective tools for multiple imputation of missing categorical values. In some contexts, however, these models can fit certain variables well at the expense of others in ways beyond the analyst's control. For example, when the data include some variables with non-trivial amounts of missing values, the mixture model may fit the marginal distributions of the nearly and fully complete variables at the expense of the variables with high fractions of missing data. Motivated by this setting, we present a Dirichlet process mixture model for mixed ordinal and nominal data that allows analysts to split variables into two groups: focus variables and remainder variables. The model uses three sets of clusters, one set for ordinal focus variables, one for nominal focus variables, and one for all remainder variables. The model uses a multivariate ordered probit specification for the ordinal variables and independent multinomial kernels for the nominal variables.
The three sets of clusters are linked using an infinite tensor factorization prior, as well as via dependence of the means of the latent continuous focus variables on the remainder variables. This effectively specifies a rich, complex model for the focus variables and a simpler model for remainder variables, yet still potentially captures associations among the variables. In the multiple imputation context, focus variables include key variables with high rates of missing values, and remainder variables include variables without much missing data. Using simulations, we illustrate advantages and limitations of using focused clustering compared to mixture models that do not distinguish variables. We apply the model to handle missing values in an analysis of the 2012 American National Election Study.
ER -

TY - CHAP
T1 - Analytical frameworks for data release: A statistical view
T2 - Confidentiality and Data Access in the Use of Big Data: Theory and Practical Approaches
Y1 - 2014
A1 - A. F. Karr
A1 - J. P. Reiter
JF - Confidentiality and Data Access in the Use of Big Data: Theory and Practical Approaches
PB - Cambridge University Press
CY - New York City, NY
ER -

TY - JOUR
T1 - SynLBD 2.0: Improving the Synthetic Longitudinal Business Database
JF - Statistical Journal of the International Association for Official Statistics
Y1 - 2014
A1 - S. K. Kinney
A1 - J. P. Reiter
A1 - J. Miranda
VL - 30
ER -