Software

Nodes develop software and make it available.

Many statistical organizations collect data that are expected to satisfy linear constraints; for example, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. In a paper published in the Journal of the American Statistical Association, we developed an approach that fully...
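To make the two kinds of constraints mentioned above concrete, here is a minimal Python sketch that checks one record against a balance edit (components sum to a total) and a ratio edit (a ratio lies within expert-specified bounds). It only detects violations; it is not the edit-imputation approach from the paper. The function name, field names, bounds, and tolerance are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class EditResult:
    balance_ok: bool  # components sum to the reported total
    ratio_ok: bool    # ratio lies within the expert-specified bounds


def check_edits(components, total, numerator, denominator,
                lower, upper, tol=1e-9):
    """Check one record against a balance edit and a ratio edit.

    All argument names are illustrative assumptions, not part of any
    released software.
    """
    # Balance edit: component variables should sum to the total variable.
    balance_ok = abs(sum(components) - total) <= tol
    # Ratio edit: lower <= numerator / denominator <= upper.
    # A zero denominator is treated as a failed edit in this sketch.
    ratio_ok = denominator != 0 and lower <= numerator / denominator <= upper
    return EditResult(balance_ok, ratio_ok)


if __name__ == "__main__":
    # A record that satisfies both edits.
    print(check_edits([40.0, 60.0], 100.0, 500.0, 10.0, 10.0, 100.0))
    # A record that violates both edits and would be flagged for review.
    print(check_edits([40.0, 70.0], 100.0, 5000.0, 10.0, 10.0, 100.0))
```

A record that fails either check is a candidate for edit-imputation; deciding which of its values to replace, and with what, is the harder problem the paper addresses.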
Many datasets include a mix of continuous and categorical variables with missing values. In a paper published in the Journal of the American Statistical Association, we developed a joint model for such mixed data that can be used for multiple imputation. The approach uses a nonparametric Bayesian mixture model as the imputation engine. The mixture model comprises one set of mixture components with multivariate normal kernels for the continuous variables, and a separate set of mixture...
This tool set provides functions to fit the nested Dirichlet process mixture of products of multinomial distributions (NDPMPM) model for nested categorical household data in the presence of impossible combinations. It has direct applications in generating synthetic nested household data. This package fits a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for...
These R routines create multiple imputations of categorical data that are missing at random, with or without structural zeros. Imputations are based on Dirichlet process mixtures of multinomial distributions, a nonparametric Bayesian modeling approach that allows for flexible joint modeling. Many datasets comprise exclusively categorical variables that suffer from missing data. When the number of variables is large, it can be challenging to specify models for use in multiple imputation (MI)...
To maintain confidentiality, national statistical agencies traditionally do not include small counts in publicly released tabular data products. They typically delete these small counts or combine them with counts in adjacent table cells to preserve the totals at higher levels of aggregation. In some cases, these suppression procedures result in too much loss of information. To increase data utility and make more data publicly available, we created methods and software to generate synthetic...
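The two traditional procedures described above (deleting small counts, or combining them with adjacent cells) can be sketched in Python as follows. This illustrates the suppression baseline only, not the synthetic-data method. The function names, the threshold of 5, and the rule of merging in left-to-right order are assumptions made for the example; agencies use their own rules.

```python
def suppress_small_counts(counts, threshold):
    """Delete cells whose count falls below the threshold (None = withheld)."""
    return [c if c >= threshold else None for c in counts]


def merge_small_counts(counts, threshold):
    """Combine small cells with adjacent cells so the overall total is kept.

    Cells are accumulated left to right until the running sum reaches the
    threshold; any remainder at the end is folded into the last merged cell.
    """
    merged = []
    carry = 0
    for c in counts:
        carry += c
        if carry >= threshold:
            merged.append(carry)
            carry = 0
    if carry:
        if merged:
            merged[-1] += carry
        else:
            merged.append(carry)
    return merged


if __name__ == "__main__":
    cells = [12, 2, 7, 1, 30]  # hypothetical table row
    print(suppress_small_counts(cells, threshold=5))
    print(merge_small_counts(cells, threshold=5))
```

Deletion loses the suppressed cells entirely, while merging preserves the row total at the cost of coarser categories; both show the information loss that motivates releasing synthetic counts instead.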