Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Secure the Future of the Federal Statistical System? Weinberg, Daniel; Abowd, John M.; Belli, Robert F.; Cressie, Noel; Folch, David C.; Holan, Scott H.; Levenstein, Margaret C.; Olson, Kristen M.; Reiter, Jerome P.; Shapiro, Matthew D.; Smyth, Jolene; Soh, Leen-Kiat; Spencer, Bruce; Spielman, Seth E.; Vilhuber, Lars; Wikle, Christopher The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives. 
This paper began as a May 8, 2015 presentation to the National Academies of Sciences’ Committee on National Statistics by two of the principal investigators of the National Science Foundation-Census Bureau Research Network (NCRN) – John Abowd and the late Steve Fienberg (Carnegie Mellon University). The authors acknowledge the contributions of the other principal investigators of the NCRN who are not co-authors of the paper (William Block, William Eddy, Alan Karr, Charles Manski, Nicholas Nagle, and Rebecca Nugent), the co-principal investigators, and the comments of Patrick Cantwell, Constance Citro, Adam Eck, Brian Harris-Kojetin, and Eloise Parker. We note with sorrow the deaths of Stephen Fienberg and Allan McCutcheon, two of the original NCRN principal investigators. The principal investigators also wish to acknowledge Cheryl Eavey’s sterling grant administration on behalf of the NSF. The conclusions reached in this paper are not the responsibility of the National Science Foundation (NSF), the Census Bureau, or any of the institutions to which the authors belong.

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52650 %9 Preprint %0 Journal Article %J The American Statistician %D 2017 %T An empirical comparison of multiple imputation methods for categorical data %A F. Li %A O. Akande %A J. P. Reiter %K latent %K missing %K mixture %K nonresponse %K tree %X Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. A supplementary material for this article is available online. 
%B The American Statistician %V 71 %8 01/2017 %G eng %U http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1277158 %N 2 %& 162 %R 10.1080/00031305.2016.1277158 %0 Journal Article %J Journal of Survey Statistics and Methodology %D 2017 %T Examining Changes of Interview Length over the Course of the Field Period %A Kirchner, Antje %A Olson, Kristen %X It is well established that interviewers learn behaviors both during training and on the job. How this learning occurs has received surprisingly little empirical attention: Is it driven by the interviewer herself or by the respondents she interviews? There are two competing hypotheses about what happens during field data collection: (1) interviewers learn behaviors from their previous interviews, and thus change their behavior in reaction to the behaviors previously encountered; and (2) interviewers encounter different types of and, especially, less cooperative respondents (i.e., nonresponse propensity affecting the measurement error situation), leading to changes in interview behaviors over the course of the field period. We refer to these hypotheses as the experience and response propensity hypotheses, respectively. This paper examines the relationship between proxy indicators for the experience and response propensity hypotheses on interview length using data and paradata from two telephone surveys. Our results indicate that both interviewer-driven experience and respondent-driven response propensity are associated with the length of interview. While general interviewing experience is nonsignificant, within-study experience decreases interview length significantly, even when accounting for changes in sample composition. Interviewers with higher cooperation rates have significantly shorter interviews in study one; however, this effect is mediated by the number of words spoken by the interviewer. 
We find that older respondents and male respondents have longer interviews despite controlling for the number of words spoken, as do respondents who complete the survey at first contact. Not surprisingly, interviews are significantly longer the more words interviewers and respondents speak. %B Journal of Survey Statistics and Methodology %V 5 %P 84-108 %8 2017 %@ 2325-0984 %G eng %U http://dx.doi.org/10.1093/jssam/smw031 %N 1 %0 Report %D 2017 %T Formal Privacy Models and Title 13 %A Nissim, Kobbi %A Gasser, Urs %A Smith, Adam %A Vadhan, Salil %A O'Brien, David %A Wood, Alexandra %X Formal Privacy Models and Title 13 Nissim, Kobbi; Gasser, Urs; Smith, Adam; Vadhan, Salil; O'Brien, David; Wood, Alexandra A new collaboration between academia and the Census Bureau to further the Bureau’s use of formal privacy models. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52164 %9 Preprint %0 Journal Article %J Journal of Privacy and Confidentiality %D 2017 %T How Will Statistical Agencies Operate When All Data Are Private %A Abowd, John M %X How Will Statistical Agencies Operate When All Data Are Private Abowd, John M The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. 
This is a paradigm-shifting moment for statistical agencies.

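Several records in this list concern formal privacy models. For readers unfamiliar with the machinery those papers take for granted, here is a minimal sketch of the Laplace mechanism, the canonical ε-differential-privacy primitive for counting queries. It is illustrative only: the true count of 42 and ε = 0.5 are invented for the example, and no Census Bureau product uses this exact code.

```python
import math
import random

def laplace_mechanism(true_count: float, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so Laplace noise with scale 1/epsilon
    suffices. The noise is drawn by inverting the Laplace CDF.
    """
    scale = 1.0 / epsilon
    u = rng.random() - 0.5  # Uniform(-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

rng = random.Random(2017)
true_count = 42  # hypothetical tabulation for one census block
noisy_count = laplace_mechanism(true_count, epsilon=0.5, rng=rng)
```

Smaller ε means more noise and stronger protection; the accuracy-for-privacy tradeoff analyzed in the Abowd and Abowd-Schmutte entries below is precisely the choice of ε.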
%B Journal of Privacy and Confidentiality %I Cornell University %V 7 %G eng %U http://repository.cmu.edu/jpc/vol7/iss3/1/ %N 3 %0 Journal Article %J Biometrika %D 2017 %T Itemwise conditionally independent nonresponse modeling for incomplete multivariate data %A M. Sadinle %A J.P. Reiter %K Loglinear model %K Missing not at random %K Missingness mechanism %K Nonignorable %K Nonparametric saturated %K Sensitivity analysis %X We introduce a nonresponse mechanism for multivariate missing data in which each study variable and its nonresponse indicator are conditionally independent given the remaining variables and their nonresponse indicators. This is a nonignorable missingness mechanism, in that nonresponse for any item can depend on values of other items that are themselves missing. We show that, under this itemwise conditionally independent nonresponse assumption, one can define and identify nonparametric saturated classes of joint multivariate models for the study variables and their missingness indicators. We also show how to perform sensitivity analysis to violations of the conditional independence assumptions encoded by this missingness mechanism. Throughout, we illustrate the use of this modeling approach with data analyses. %B Biometrika %V 104 %P 207-220 %8 01/2017 %G eng %U https://doi.org/10.1093/biomet/asw063 %N 1 %& 207 %R 10.1093/biomet/asw063 %0 Journal Article %J Biometrika %D 2017 %T Itemwise conditionally independent nonresponse modeling for multivariate categorical data %A Sadinle, M. %A Reiter, J. P. %K Identification %K Missing not at random %K Non-parametric saturated %K Partial ignorability %K Sensitivity analysis %X With nonignorable missing data, likelihood-based inference should be based on the joint distribution of the study variables and their missingness indicators. 
These joint models cannot be estimated from the data alone, thus requiring the analyst to impose restrictions that make the models uniquely obtainable from the distribution of the observed data. We present an approach for constructing classes of identifiable nonignorable missing data models. The main idea is to use a sequence of carefully set up identifying assumptions, whereby we specify potentially different missingness mechanisms for different blocks of variables. We show that the procedure results in models with the desirable property of being non-parametric saturated. %B Biometrika %V 104 %P 207-220 %8 01/2017 %G eng %0 Report %D 2017 %T Making Confidential Data Part of Reproducible Research %A Lars Vilhuber %A Carl Lagoze %I Labor Dynamics Institute, Cornell University %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/41/ %9 Document %0 Report %D 2017 %T Making Confidential Data Part of Reproducible Research %A Vilhuber, Lars %A Lagoze, Carl %X Making Confidential Data Part of Reproducible Research Vilhuber, Lars; Lagoze, Carl Disclaimer and acknowledgements: While this column mentions the Census Bureau several times, any opinions and conclusions expressed herein are those of the authors and do not necessarily represent the views of the U.S. Census Bureau or the other statistical agencies mentioned herein. %I Cornell University %G eng %U http://hdl.handle.net/1813/52474 %9 Preprint %0 Journal Article %J Chance %D 2017 %T Making Confidential Data Part of Reproducible Research %A Vilhuber, Lars %A Lagoze, Carl %B Chance %8 09/2017 %G eng %U http://chance.amstat.org/2017/09/reproducible-research/ %0 Journal Article %J Journal of Business & Economic Statistics %D 2017 %T Modeling Endogenous Mobility in Earnings Determination %A John M. Abowd %A Kevin L. Mckinney %A Ian M. 
Schmutte %X We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax exogenous mobility by modeling the matched data as an evolving bipartite graph using a Bayesian latent-type framework. Our results suggest that allowing endogenous mobility increases the variation in earnings explained by individual heterogeneity and reduces the proportion due to employer and match effects. To assess external validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The mobility-bias corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. %B Journal of Business & Economic Statistics %P 0-0 %G eng %U http://dx.doi.org/10.1080/07350015.2017.1356727 %R 10.1080/07350015.2017.1356727 %0 Report %D 2017 %T Modeling Endogenous Mobility in Wage Determination %A John M. Abowd %A Kevin L. Mckinney %A Ian M. Schmutte %X We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. 
Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/28/ %0 Journal Article %J Journal of the American Statistical Association %D 2017 %T Multiple imputation of missing categorical and continuous outcomes via Bayesian mixture models with local dependence %A J. S. Murray %A J. P. Reiter %K Hierarchical mixture model %K Missing data %K Nonparametric Bayes %K Stick-breaking process %X We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (i) modeling the means of the normal distributions as component-specific functions of the categorical variables and (ii) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. 
We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. %B Journal of the American Statistical Association %V 111 %P 1466 – 1479 %8 01/2017 %G eng %N 516 %0 Journal Article %D 2017 %T Multi-rubric Models for Ordinal Spatial Data with Application to Online Ratings from Yelp %A Linero, A.R. %A Bradley, J.R. %A Desai, A. %K Bayesian hierarchical model %K Data augmentation %K Nonparametric Bayes %K ordinal data %K recommender systems %K spatial prediction. %X Interest in online rating data has increased in recent years. Such data consists of ordinal ratings of products or local businesses provided by users of a website, such as Yelp or Amazon. One source of heterogeneity in ratings is that users apply different standards when supplying their ratings; even if two users benefit from a product the same amount, they may translate their benefit into ratings in different ways. In this article we propose an ordinal data model, which we refer to as a multi-rubric model, which treats the criteria used to convert a latent utility into a rating as user-specific random effects, with the distribution of these random effects being modeled nonparametrically. We demonstrate that this approach is capable of accounting for this type of variability in addition to usual sources of heterogeneity due to item quality, user biases, interactions between items and users, and the spatial structure of the users and items. 
We apply the model developed here to publicly available data from the website Yelp and demonstrate that it produces interpretable clusterings of users according to their rating behavior, in addition to providing better predictions of ratings and better summaries of overall item quality. %G eng %U https://arxiv.org/abs/1706.03012 %0 Report %D 2017 %T NCRN Meeting Spring 2017 %A Vilhuber, Lars %X NCRN Meeting Spring 2017 Vilhuber, Lars %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52163 %9 Preprint %0 Report %D 2017 %T NCRN Meeting Spring 2017: Formal Privacy Models and Title 13 %A Nissim, Kobbi %A Gasser, Urs %A Smith, Adam %A Vadhan, Salil %A O'Brien, David %A Wood, Alexandra %X NCRN Meeting Spring 2017: Formal Privacy Models and Title 13 Nissim, Kobbi; Gasser, Urs; Smith, Adam; Vadhan, Salil; O'Brien, David; Wood, Alexandra A new collaboration between academia and the Census Bureau to further the Bureau’s use of formal privacy models. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52164 %9 Preprint %0 Report %D 2017 %T NCRN Meeting Spring 2017: Welcome %A Vilhuber, Lars %X NCRN Meeting Spring 2017: Welcome Vilhuber, Lars %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52163 %9 Preprint %0 Report %D 2017 %T NCRN Newsletter: Volume 3 - Issue 3 %A Vilhuber, Lars %A Knight-Ingram, Dory %X NCRN Newsletter: Volume 3 - Issue 3 Vilhuber, Lars; Knight-Ingram, Dory Overview of activities at NSF-Census Research Network nodes from December 2016 through February 2017. NCRN Newsletter Vol. 3, Issue 3: March 10, 2017 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/46686 %9 Preprint %0 Report %D 2017 %T NCRN Newsletter: Volume 3 - Issue 4 %A Vilhuber, Lars %A Knight-Ingram, Dory %X NCRN Newsletter: Volume 3 - Issue 4 Vilhuber, Lars; Knight-Ingram, Dory The NCRN Newsletter is published quarterly by the NCRN Coordinating Office. 
%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52259 %9 Preprint %0 Report %D 2017 %T Presentation: Introduction to Stan for Markov Chain Monte Carlo %A Simpson, Matthew %X Presentation: Introduction to Stan for Markov Chain Monte Carlo Simpson, Matthew An introduction to Stan (http://mc-stan.org/): a probabilistic programming language that implements Hamiltonian Monte Carlo (HMC), variational Bayes, and (penalized) maximum likelihood estimation. Presentation given at the U.S. Census Bureau on April 25, 2017. %I University of Missouri %G eng %U http://hdl.handle.net/1813/52656 %9 Preprint %0 Report %D 2017 %T Proceedings from the 2016 NSF–Sloan Workshop on Practical Privacy %A Vilhuber, Lars %A Schmutte, Ian %X Proceedings from the 2016 NSF–Sloan Workshop on Practical Privacy Vilhuber, Lars; Schmutte, Ian On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau’s flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); the 3. 2020 Decennial Census; and the 4. 2017 Economic Census. The goal of the workshop was to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas. 
%I Cornell University %G eng %U http://hdl.handle.net/1813/46197 %9 Preprint %0 Report %D 2017 %T Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy %A Vilhuber, Lars %A Schmutte, Ian M. %X Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy Vilhuber, Lars; Schmutte, Ian M. These proceedings report on a workshop hosted at the U.S. Census Bureau on May 8, 2017. Our purpose was to gather experts from various backgrounds together to continue discussing the development of formal privacy systems for Census Bureau data products. This workshop was a successor to a previous workshop held in October 2016 (Vilhuber & Schmutte 2017). At our prior workshop, we hosted computer scientists, survey statisticians, and economists, all of whom were experts in data privacy. At that time we discussed the practical implementation of cutting-edge methods for publishing data with formal, provable privacy guarantees, with a focus on applications to Census Bureau data products. The teams developing those applications were just starting out when our first workshop took place, and we spent our time brainstorming solutions to the various problems researchers were encountering, or anticipated encountering. For these cutting-edge formal privacy models, there had been very little effort in the academic literature to apply those methods in real-world settings with large, messy data. We therefore brought together an expanded group of specialists from academia and government who could shed light on technical challenges, subject matter challenges and address how data users might react to changes in data availability and publishing standards. In May 2017, we organized a follow-up workshop, which these proceedings report on. We reviewed progress made in four different areas. The four topics discussed as part of the workshop were 1. the 2020 Decennial Census; 2. the American Community Survey (ACS); 3. the 2017 Economic Census; 4. 
measuring the demand for privacy and for data quality. As in our earlier workshop, our goals were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas. Comments can be provided at https://goo.gl/ZAh3YE %I Cornell University %G eng %U http://hdl.handle.net/1813/52473 %9 Preprint %0 Report %D 2017 %T Proceedings from the Synthetic LBD International Seminar %A Vilhuber, Lars %A Kinney, Saki %A Schmutte, Ian M. %X Proceedings from the Synthetic LBD International Seminar Vilhuber, Lars; Kinney, Saki; Schmutte, Ian M. On May 9, 2017, we hosted a seminar to discuss the conditions necessary to implement the SynLBD approach with interested parties, with the goal of providing a straightforward toolkit to implement the same procedure on other data. The proceedings summarize the discussions during the workshop. %I Cornell University %G eng %U http://hdl.handle.net/1813/52472 %9 Preprint %0 Report %D 2017 %T Recalculating - How Uncertainty in Local Labor Market Definitions Affects Empirical Findings %A Foote, Andrew %A Kutzbach, Mark J. %A Vilhuber, Lars %X Recalculating - How Uncertainty in Local Labor Market Definitions Affects Empirical Findings Foote, Andrew; Kutzbach, Mark J.; Vilhuber, Lars This paper evaluates the use of commuting zones as a local labor market definition. We revisit Tolbert and Sizer (1996) and demonstrate the sensitivity of definitions to two features of the methodology. We show how these features impact empirical estimates using a well-known application of commuting zones. We conclude with advice to researchers using commuting zones on how to demonstrate the robustness of empirical findings to uncertainty in definitions. 
The analysis, conclusions, and opinions expressed herein are those of the author(s) alone and do not necessarily represent the views of the U.S. Census Bureau or the Federal Deposit Insurance Corporation. All results have been reviewed to ensure that no confidential information is disclosed, and no confidential data was used in this paper. This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Much of the work developing this paper occurred while Mark Kutzbach was an employee of the U.S. Census Bureau. %I Cornell University %G eng %U http://hdl.handle.net/1813/52649 %9 Preprint %0 Journal Article %J Journal of the Royal Statistical Society -- Series B. %D 2017 %T Regionalization of Multiscale Spatial Processes using a Criterion for Spatial Aggregation Error %A Bradley, J.R. %A Wikle, C.K. %A Holan, S.H. %K American Community Survey %K empirical orthogonal functions %K MAUP %K Reduced rank %K Spatial basis functions %K Survey data %X The modifiable areal unit problem and the ecological fallacy are known problems that occur when modeling multiscale spatial processes. We investigate how these forms of spatial aggregation error can guide a regionalization over a spatial domain of interest. By "regionalization" we mean a specification of geographies that define the spatial support for areal data. This topic has been studied vigorously by geographers, but has been given less attention by spatial statisticians. Thus, we propose a criterion for spatial aggregation error (CAGE), which we minimize to obtain an optimal regionalization. To define CAGE we draw a connection between spatial aggregation error and a new multiscale representation of the Karhunen-Loeve (K-L) expansion. 
This relationship between CAGE and the multiscale K-L expansion leads to illuminating theoretical developments including: connections between spatial aggregation error, squared prediction error, spatial variance, and a novel extension of Obled-Creutin eigenfunctions. The effectiveness of our approach is demonstrated through an analysis of two datasets, one using the American Community Survey and one related to environmental ocean winds. %B Journal of the Royal Statistical Society -- Series B. %G eng %U https://arxiv.org/abs/1502.01974 %0 Report %D 2017 %T Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods %A John M. Abowd %A Ian M. Schmutte %X We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial. 
%B Labor Dynamics Institute Document %8 04/2017 %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/37/ %0 Report %D 2017 %T Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods %A Abowd, John %A Schmutte, Ian M. %X Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Abowd, John; Schmutte, Ian M. We consider the problem of the public release of statistical information about a population–explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner’s problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner’s problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. 
Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial. A complete archive of the data and programs used in this paper is available via http://doi.org/10.5281/zenodo.345385. %I Cornell University %G eng %U http://hdl.handle.net/1813/39081 %9 Preprint %0 Report %D 2017 %T Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods %A Abowd, John %A Schmutte, Ian M. %X Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Abowd, John; Schmutte, Ian M. We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial. 
%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52612 %9 Preprint %0 Journal Article %J Total Survey Error in Practice %D 2017 %T The role of statistical disclosure limitation in total survey error %A A. F. Karr %K big data issues %K data quality %K data swapping %K decision quality %K risk-utility paradigms %K Statistical Disclosure Limitation %K total survey error %X This chapter presents the thesis that statistical disclosure limitation (SDL) ought to be viewed as an integral component of total survey error (TSE). TSE and SDL will move forward together, integrating multiple criteria: cost, risk, data quality, and decision quality. The chapter explores the value of unifying two key TSE procedures - editing and imputation - with SDL. It also discusses “big data” issues and presents a mathematical formulation that, at least conceptually and at some point in the future, unifies TSE and SDL. Modern approaches to SDL are based explicitly or implicitly on tradeoffs between disclosure risk and data utility. There are three principal classes of SDL methods: reduction/coarsening techniques; perturbative methods; and synthetic data methods. Data swapping is among the most frequently applied SDL methods for categorical data. The chapter sketches how it can be informed by knowledge of TSE. %B Total Survey Error in Practice %P 71 – 94 %G eng %R 10.1002/9781119041702.ch4 %0 Generic %D 2017 %T Sequential Prediction of Respondent Behaviors Leading to Error in Web-based Surveys %A Eck, Adam %A Soh, Leen-Kiat %G eng %0 Report %D 2017 %T Sorting Between and Within Industries: A Testable Model of Assortative Matching %A John M. Abowd %A Francis Kramarz %A Sebastien Perez-Duarte %A Ian M. Schmutte %X We test Shimer's (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework.
We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting: more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated. %I Labor Dynamics Institute %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/40/ %9 Document %0 Journal Article %J Journal of Official Statistics %D 2017 %T Stop or continue data collection: A nonignorable missing data approach for continuous variables %A T. Paiva %A J. P. Reiter %X We present an approach to inform decisions about nonresponse follow-up sampling. The basic idea is (i) to create completed samples by imputing nonrespondents' data under various assumptions about the nonresponse mechanisms, (ii) to take hypothetical samples of varying sizes from the completed samples, and (iii) to compute and compare measures of accuracy and cost for different proposed sample sizes. As part of the methodology, we present a new approach for generating imputations for multivariate continuous data with nonignorable unit nonresponse. We fit mixtures of multivariate normal distributions to the respondents' data, and adjust the probabilities of the mixture components to generate nonrespondents' distributions with desired features. We illustrate the approaches using data from the 2007 U.S. Census of Manufactures. %B Journal of Official Statistics %G eng %0 Report %D 2017 %T Two Perspectives on Commuting: A Comparison of Home to Work Flows Across Job-Linked Survey and Administrative Files %A Green, Andrew %A Kutzbach, Mark J.
%A Vilhuber, Lars %X Commuting flows and workplace employment data have a wide constituency of users including urban and regional planners, social science and transportation researchers, and businesses. The U.S. Census Bureau releases two national data products that give the magnitude and characteristics of home to work flows. The American Community Survey (ACS) tabulates households’ responses on employment, workplace, and commuting behavior. The Longitudinal Employer-Household Dynamics (LEHD) program tabulates administrative records on jobs in the LEHD Origin-Destination Employment Statistics (LODES). Design differences across the datasets lead to divergence in a comparable statistic: county-to-county aggregate commute flows. To understand differences in the public use data, this study compares ACS and LEHD source files, using identifying information and probabilistic matching to join person and job records. In our assessment, we compare commuting statistics for job frames linked on person, employment status, employer, and workplace, and we identify person and job characteristics as well as design features of the data frames that explain aggregate differences. We find a lower rate of within-county commuting and farther commutes in LODES. We attribute these greater distances to differences in workplace reporting and to uncertainty of establishment assignments in LEHD for workers at multi-unit employers. Minor contributing factors include differences in residence location and ACS workplace edits. The results of this analysis and the data infrastructure developed will support further work to understand and enhance commuting statistics in both datasets.
%I Cornell University %G eng %U http://hdl.handle.net/1813/52611 %9 Preprint %0 Report %D 2017 %T Unique Entity Estimation with Application to the Syrian Conflict %A Chen, B. %A Shrivastava, A. %A Steorts, R. C. %K Computer Science - Data Structures and Algorithms %K Computer Science - Databases %K Statistics - Applications %X Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the unique number of entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random sampling based approaches. In addition, we empirically show its superiority over the state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of $191,874 \pm 1772$ documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of challenges and efforts involved in solving a real, noisy, challenging problem where modeling assumptions may not hold.
%B arXiv %G eng %U https://arxiv.org/abs/1710.02690 %0 Journal Article %J Proceedings of the 2017 ACM International Conference on Management of Data %D 2017 %T Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics %A Samuel Haney %A Ashwin Machanavajjhala %A John M. Abowd %A Matthew Graham %A Mark Kutzbach %X National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ε≥ 1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research. 
%B Proceedings of the 2017 ACM International Conference on Management of Data %@ 978-1-4503-4197-4 %G eng %U http://dl.acm.org/citation.cfm?doid=3035918.3035940 %R 10.1145/3035918.3035940 %0 Report %D 2017 %T Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics %A Haney, Samuel %A Machanavajjhala, Ashwin %A Abowd, John M %A Graham, Matthew %A Kutzbach, Mark %A Vilhuber, Lars %X National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau.
An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ϵ≥1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research. %I Cornell University %G eng %U http://hdl.handle.net/1813/49652 %9 Preprint %0 Journal Article %J Stat %D 2017 %T Visualizing uncertainty in areal data estimates with bivariate choropleth maps, map pixelation, and glyph rotation %A Lucchesi, L.R. %A Wikle, C.K. %X In statistics, we quantify uncertainty to help determine the accuracy of estimates, yet this crucial piece of information is rarely included on maps visualizing areal data estimates. We develop and present three approaches to include uncertainty on maps: (1) the bivariate choropleth map repurposed to visualize uncertainty; (2) the pixelation of counties to include values within an estimate's margin of error; and (3) the rotation of a glyph, located at a county's centroid, to represent an estimate's uncertainty. The second method is presented as both a static map and visuanimation. We use American Community Survey estimates and their corresponding margins of error to demonstrate the methods and highlight the importance of visualizing uncertainty in areal data. An extensive online supplement provides the R code necessary to produce the maps presented in this article as well as alternative versions of them.
%B Stat %V 6 %P 292–302 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/sta4.150/abstract %N 1 %0 Report %D 2016 %T 2017 Economic Census: Towards Synthetic Data Sets %A Caldwell, Carol %A Thompson, Katherine Jenny %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52165 %9 Preprint %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2016 %T Assessing disclosure risks for synthetic data with arbitrary intruder knowledge %A McClure, D. %A Reiter , J. P. %K confidentiality %K Disclosure %K risk %K synthetic %X Several statistical agencies release synthetic microdata, i.e., data with all confidential values replaced with draws from statistical models, in order to protect data subjects' confidentiality. While fully synthetic data are safe from record linkage attacks, intruders might be able to use the released synthetic values to estimate confidential values for individuals in the collected data. We demonstrate and investigate this potential risk using two simple but informative scenarios: a single continuous variable possibly with outliers, and a three-way contingency table possibly with small counts in some cells. Beginning with the case that the intruder knows all but one value in the confidential data, we examine the effect on risk of decreasing the number of observations the intruder knows beforehand. We generally find that releasing synthetic data (1) can pose little risk to records in the middle of the distribution, and (2) can pose some risks to extreme outliers, although arguably these risks are mild. We also find that the effect of removing observations from an intruder's background knowledge heavily depends on how well that intruder can fill in those missing observations: the risk remains fairly constant if he/she can fill them in well, and drops quickly if he/she cannot.
%B Statistical Journal of the International Association for Official Statistics %V 32 %P 109-126 %8 02/2016 %G eng %U http://content.iospress.com/download/statistical-journal-of-the-iaos/sji957 %N 1 %& 109 %R 10.3233/SJI-160957 %0 Journal Article %J Statistics and Computing %D 2016 %T A Bayesian nonparametric Markovian model for nonstationary time series %A De Yoreo, M. %A Kottas, A. %K Autoregressive Models %K Bayesian Nonparametrics %K Dirichlet Process Mixtures %K Markov chain Monte Carlo %K Nonstationarity %K Time Series %X Stationary time series models built from parametric distributions are, in general, limited in scope due to the assumptions imposed on the residual distribution and autoregression relationship. We present a modeling approach for univariate time series data, which makes no assumptions of stationarity, and can accommodate complex dynamics and capture nonstandard distributions. The model for the transition density arises from the conditional distribution implied by a Bayesian nonparametric mixture of bivariate normals. This implies a flexible autoregressive form for the conditional transition density, defining a time-homogeneous, nonstationary, Markovian model for real-valued data indexed in discrete-time. To obtain a more computationally tractable algorithm for posterior inference, we utilize a square-root-free Cholesky decomposition of the mixture kernel covariance matrix. Results from simulated data suggest the model is able to recover challenging transition and predictive densities. We also illustrate the model on time intervals between eruptions of the Old Faithful geyser. Extensions to accommodate higher order structure and to develop a state-space model are also discussed. %B Statistics and Computing %8 01/2016 %G eng %& 1 %0 Journal Article %J Journal of the American Statistical Association %D 2016 %T A Bayesian Approach to Graphical Record Linkage and Deduplication %A Rebecca C. Steorts %A Rob Hall %A Stephen E. 
Fienberg %X We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online. %B Journal of the American Statistical Association %V 111 %P 1660-1672 %G eng %U http://dx.doi.org/10.1080/01621459.2015.1105807 %R 10.1080/01621459.2015.1105807 %0 Journal Article %J Journal of the American Statistical Association - T&M. %D 2016 %T Bayesian Hierarchical Models with Conjugate Full-Conditional Distributions for Dependent Data from the Natural Exponential Family %A Bradley, J.R. %A Holan, S.H. %A Wikle, C.K.
%X We introduce a Bayesian approach for analyzing (possibly) high-dimensional dependent data that are distributed according to a member from the natural exponential family of distributions. This problem requires extensive methodological advancements, as jointly modeling high-dimensional dependent data leads to the so-called "big n problem." The computational complexity of the "big n problem" is further exacerbated when allowing for non-Gaussian data models, as is the case here. Thus, we develop new computationally efficient distribution theory for this setting. In particular, we introduce something we call the "conjugate multivariate distribution," which is motivated by the univariate distribution introduced in Diaconis and Ylvisaker (1979). Furthermore, we provide substantial theoretical and methodological development including: results regarding conditional distributions, an asymptotic relationship with the multivariate normal distribution, conjugate prior distributions, and full-conditional distributions for a Gibbs sampler. The results in this manuscript are extremely general, and can be adapted to many different settings. We demonstrate the proposed methodology through simulated examples and analyses based on estimates obtained from the US Census Bureau's American Community Survey (ACS). %B Journal of the American Statistical Association - T&M. %G eng %U https://arxiv.org/abs/1701.07506 %0 Journal Article %J Annals of Applied Statistics %D 2016 %T Bayesian latent pattern mixture models for handling attrition in panel studies with refreshment samples %A Y. Si %A J. P. Reiter %A D. S. Hillygus %B Annals of Applied Statistics %V 10 %P 118–143 %G eng %U http://projecteuclid.org/euclid.aoas/1458909910 %R 10.1214/15-AOAS876 %0 Journal Article %J Bayesian Analysis %D 2016 %T Bayesian Lattice Filters for Time-Varying Autoregression and Time-Frequency Analysis %A Yang, W.H. %A Holan, S.H. %A Wikle, C.K.
%X Modeling nonstationary processes is of paramount importance to many scientific disciplines including environmental science, ecology, and finance, among others. Consequently, flexible methodology that provides accurate estimation across a wide range of processes is a subject of ongoing interest. We propose a novel approach to model-based time-frequency estimation using time-varying autoregressive models. In this context, we take a fully Bayesian approach and allow both the autoregressive coefficients and innovation variance to vary over time. Importantly, our estimation method uses the lattice filter and is cast within the partial autocorrelation domain. The marginal posterior distributions are of standard form and, as a convenient by-product of our estimation method, our approach avoids undesirable matrix inversions. As such, estimation is extremely computationally efficient and stable. To illustrate the effectiveness of our approach, we conduct a comprehensive simulation study that compares our method with other competing methods and find that, in most cases, our approach is superior in terms of average squared error between the estimated and true time-varying spectral density. Lastly, we demonstrate our methodology through three modeling applications; namely, insect communication signals, environmental data (wind components), and macroeconomic data (US gross domestic product (GDP) and consumption). %B Bayesian Analysis %P 977-1003 %G eng %U https://arxiv.org/abs/1408.2757 %0 Report %D 2016 %T Bayesian mixture modeling for multivariate conditional distributions %A Maria DeYoreo %A Jerome P. Reiter %X We present a Bayesian mixture model for estimating the joint distribution of mixed ordinal, nominal, and continuous data conditional on a set of fixed variables. The model uses multivariate normal and categorical mixture kernels for the random variables.
It induces dependence between the random and fixed variables through the means of the multivariate normal mixture kernels and via a truncated local Dirichlet process. The latter encourages observations with similar values of the fixed variables to share mixture components. Using a simulation of data fusion, we illustrate that the model can estimate underlying relationships in the data and the distributions of the missing values more accurately than a mixture model applied to the random and fixed variables jointly. We use the model to analyze consumers' reading behaviors using a quota sample, i.e., a sample where the empirical distribution of some variables is fixed by design and so should not be modeled as random, conducted by the book publisher HarperCollins. %I ArXiv %G eng %U http://arxiv.org/abs/1606.04457 %0 Report %D 2016 %T A Bayesian nonparametric Markovian model for nonstationary time series %A Maria DeYoreo %A Athanasios Kottas %X Stationary time series models built from parametric distributions are, in general, limited in scope due to the assumptions imposed on the residual distribution and autoregression relationship. We present a modeling approach for univariate time series data, which makes no assumptions of stationarity, and can accommodate complex dynamics and capture nonstandard distributions. The model for the transition density arises from the conditional distribution implied by a Bayesian nonparametric mixture of bivariate normals. This implies a flexible autoregressive form for the conditional transition density, defining a time-homogeneous, nonstationary, Markovian model for real-valued data indexed in discrete-time. To obtain a more computationally tractable algorithm for posterior inference, we utilize a square-root-free Cholesky decomposition of the mixture kernel covariance matrix. Results from simulated data suggest the model is able to recover challenging transition and predictive densities. 
We also illustrate the model on time intervals between eruptions of the Old Faithful geyser. Extensions to accommodate higher order structure and to develop a state-space model are also discussed. %I ArXiv %G eng %U http://arxiv.org/abs/1601.04331 %0 Journal Article %J Journal of the American Statistical Association %D 2016 %T A Bayesian Partial Identification Approach to Inferring the Prevalence of Accounting Misconduct %A P. R. Hahn %A J. S. Murray %A I. Manolopoulou %X This article describes the use of flexible Bayesian regression models for estimating a partially identified probability function. Our approach permits efficient sensitivity analysis concerning the posterior impact of priors on the partially identified component of the regression model. The new methodology is illustrated on an important problem where only partially observed data are available—inferring the prevalence of accounting misconduct among publicly traded U.S. businesses. Supplementary materials for this article are available online. %B Journal of the American Statistical Association %V 111 %P 14–26 %G eng %U http://www.tandfonline.com/doi/full/10.1080/01621459.2015.1084307 %N 513 %R 10.1080/01621459.2015.1084307 %0 Journal Article %J Journal of the American Statistical Association %D 2016 %T Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data %A Daniel Manrique-Vallier %A Jerome P. Reiter %X In categorical data, it is typically the case that some combinations of variables are theoretically impossible, such as a three year old child who is married or a man who is pregnant. In practice, however, reported values often include such structural zeros due to, for example, respondent mistakes or data processing errors. To purge data of such errors, many statistical organizations use a process known as edit-imputation. 
The basic idea is first to select reported values to change according to some heuristic or loss function, and second to replace those values with plausible imputations. This two-stage process typically does not fully utilize information in the data when determining locations of errors, nor does it appropriately reflect uncertainty resulting from the edits and imputations. We present an alternative approach to editing and imputation for categorical microdata with structural zeros that addresses these shortcomings. Specifically, we use a Bayesian hierarchical model that couples a stochastic model for the measurement error process with a Dirichlet process mixture of multinomial distributions for the underlying, error free values. The latter model is restricted to have support only on the set of theoretically possible combinations. We illustrate this integrated approach to editing and imputation using simulation studies with data from the 2000 U. S. census, and compare it to a two-stage edit-imputation routine. Supplementary material is available online. %B Journal of the American Statistical Association %8 09/2016 %G eng %U http://dx.doi.org/10.1080/01621459.2016.1231612 %R 10.1080/01621459.2016.1231612 %0 Journal Article %J Journal of the American Statistical Association %D 2016 %T Bayesian Spatial Change of Support for Count-Valued Survey Data with Application to the American Community Survey %A Bradley, J.R. %A Wikle, C.K. %A Holan, S.H. %X We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. 
Specifically, the ACS produces 1-year, 3-year, and 5-year "period-estimates," and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies it is often of interest to data users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on "new" spatial supports in "real-time." This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follows a Gaussian distribution. However, count-valued survey data are naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in "real-time." We demonstrate the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. %B Journal of the American Statistical Association %P 472-487 %G eng %U https://arxiv.org/abs/1405.7227 %0 Journal Article %J Annals of Applied Statistics %D 2016 %T Categorical data fusion using auxiliary information %A B. K. Fosdick %A M. De Yoreo %A J. P. Reiter %K Imputation %K Integration %K Latent Class %K Matching %X In data fusion analysts seek to combine information from two databases comprised of disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences.
We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people's preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion. %B Annals of Applied Statistics %V 10 %P 1907 – 1929 %G eng %U http://projecteuclid.org/euclid.aoas/1483606845 %R 10.1214/16-AOAS925 %0 Journal Article %J Computational Statistics and Data Analysis %D 2016 %T Computation of the Autocovariances for Time Series with Multiple Long-Range Persistencies %A McElroy, T.S. %A Holan, S.H. %X Gegenbauer processes allow for flexible and convenient modeling of time series data with multiple spectral peaks, where the qualitative description of these peaks is via the concept of cyclical long-range dependence. The Gegenbauer class is extensive, including ARFIMA, seasonal ARFIMA, and GARMA processes as special cases. Model estimation is challenging for Gegenbauer processes when multiple zeros and poles occur in the spectral density, because the autocovariance function is laborious to compute. The method of splitting (essentially computing autocovariances by convolving long memory and short memory dynamics) is only tractable when a single long memory pole exists. An additive decomposition of the spectrum into a sum of spectra is proposed, where each summand has a single singularity, so that a computationally efficient splitting method can be applied to each term and then aggregated. This approach differs from handling all the poles in the spectral density at once, via an analysis of truncation error.
The proposed technique allows for fast estimation of time series with multiple long-range dependences, which is illustrated numerically and through several case studies. %B Computational Statistics and Data Analysis %P 44 - 56 %G eng %U http://www.sciencedirect.com/science/article/pii/S0167947316300202 %0 Generic %D 2016 %T Data management and analytic use of paradata: SIPP-EHC audit trails %A Lee, Jinyoung %A Seloske, Ben %A Córdova Cazar, Ana Lucía %A Eck, Adam %A Kirchner, Antje %A Belli, Robert F. %G eng %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2016 %T Differentially private publication of data on wages and job mobility %A Schmutte, Ian M. %K Demand for public statistics %K differential privacy %K job mobility %K matched employer-employee data %K optimal confidentiality protection %K optimal data accuracy %K technology for statistical agencies %X Brazil, like many countries, is reluctant to publish business-level data, because of legitimate concerns about the establishments' confidentiality. A trusted data curator can increase the utility of data, while managing the risk to establishments, either by releasing synthetic data, or by infusing noise into published statistics. This paper evaluates the application of a differentially private mechanism to publish statistics on wages and job mobility computed from Brazilian employer-employee matched data. The publication mechanism can result in both the publication of specific statistics as well as the generation of synthetic data. I find that the tradeoff between the privacy guaranteed to individuals in the data, and the accuracy of published statistics, is potentially much better than the worst-case theoretical accuracy guarantee. However, the synthetic data fare quite poorly in analyses that are outside the set of queries to which they were trained.
Note that this article only explores and characterizes the feasibility of these publication strategies, and will not directly result in the publication of any data. %B Statistical Journal of the International Association for Official Statistics %V 32 %P 81-92 %8 02/2016 %G eng %U http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji962 %N 1 %& 81 %R 10.3233/SJI-160962 %0 Report %D 2016 %T Differentially Private Verification of Regression Model Results %A Reiter, Jerry %X Differentially Private Verification of Regression Model Results Reiter, Jerry %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52167 %9 Preprint %0 Journal Article %J Survey Practice %D 2016 %T Do Interviewers with High Cooperation Rates Behave Differently? Interviewer Cooperation Rates and Interview Behaviors %A Olson, Kristen %A Kirchner, Antje %A Smyth, Jolene D. %X Interviewers are required to be flexible in responding to respondent concerns during recruitment, but standardized during administration of the questionnaire. These skill sets may be at odds. Recent research has shown a U-shaped relationship between interviewer cooperation rates and interviewer variance: the least and the most successful interviewers during recruitment have the largest interviewer variance components. Little is known about why this association occurs. We posit four hypotheses for this association: 1) interviewers with higher cooperation rates are more conscientious interviewers altogether, 2) interviewers with higher cooperation rates continue to use rapport behaviors from the cooperation request throughout an interview, 3) interviewers with higher cooperation rates display more confidence, which translates into different interview behavior, and 4) interviewers with higher cooperation rates continue their flexible interviewing style throughout the interview and deviate more from standardized interviewing.
We use behavior codes from the Work and Leisure Today Survey (n=450, AAPOR RR3=6.3%) to evaluate interviewer behavior. Our results largely support the confidence hypothesis. Interviewers with higher cooperation rates do not show evidence of being “better” interviewers. %B Survey Practice %V 9 %P no pp. %8 2016 %G eng %U http://www.surveypractice.org/index.php/SurveyPractice/article/view/351 %N 2 %0 Report %D 2016 %T Estimating Compensating Wage Differentials with Endogenous Job Mobility %A Kurt Lavetti %A Ian M. Schmutte %X We demonstrate a strategy for using matched employer-employee data to correct endogenous job mobility bias when estimating compensating wage differentials. Applied to fatality rates in the census of formal-sector jobs in Brazil between 2003 and 2010, we show why common approaches to eliminating ability bias can greatly amplify endogenous job mobility bias. By extending the search-theoretic hedonic wage framework, we establish conditions necessary to interpret our estimates as preferences. We present empirical analyses supporting the predictions of the model and identifying conditions, demonstrating that the standard models are misspecified and that our proposed model eliminates latent ability and endogenous mobility biases. %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/29/ %0 Journal Article %J Journal of the Royal Statistical Society - Series A %D 2016 %T Generating Partially Synthetic Geocoded Public Use Data with Decreased Disclosure Risk Using Differential Smoothing %A Quick, H. %A Holan, S.H. %A Wikle, C.K. %X When collecting geocoded confidential data with the intent to disseminate, agencies often resort to altering the geographies prior to making data publicly available due to data privacy obligations.
An alternative to releasing aggregated and/or perturbed data is to release multiply-imputed synthetic data, where sensitive values are replaced with draws from statistical models designed to capture important distributional features in the collected data. One issue that has received relatively little attention, however, is how to handle spatially outlying observations in the collected data, as common spatial models often have a tendency to overfit these observations. The goal of this work is to bring this issue to the forefront and propose a solution, which we refer to as "differential smoothing." After implementing our method on simulated data, highlighting the effectiveness of our approach under various scenarios, we illustrate the framework using data consisting of sale prices of homes in San Francisco. %B Journal of the Royal Statistical Society - Series A %G eng %U https://arxiv.org/abs/1507.05529 %0 Report %D 2016 %T Hours Off the Clock %A Green, Andrew %X To what extent do workers work more hours than they are paid for? The relationship between hours worked and hours paid, and the conditions under which employers can demand more hours “off the clock,” is not well understood. The answer to this question impacts worker welfare, as well as wage and hour regulation. In addition, work off the clock has important implications for the measurement and cyclical movement of productivity and wages. In this paper, I construct a unique administrative dataset of hours paid by employers linked to a survey of workers on their reported hours worked to measure work off the clock. Using cross-sectional variation in local labor markets, I find only a small cyclical component to work off the clock. The results point to labor hoarding rather than efficiency wage theory, indicating work off the clock cannot explain the counter-cyclical movement of productivity.
I find that workers employed by small firms, and in industries with a high rate of wage and hour violations, are associated with larger differences between hours worked and hours paid. These findings suggest the importance of tracking hours of work for enforcement of labor regulations. %I Cornell University %G eng %U http://hdl.handle.net/1813/52610 %9 Preprint %0 Journal Article %J Monthly Labor Review %D 2016 %T How Should We Define Low-Wage Work? An Analysis Using the Current Population Survey %A Fusaro, V. %A Shaefer, H. Luke %X Low-wage work is a central concept in considerable research, yet it lacks an agreed-upon definition. Using data from the Current Population Survey’s Annual Social and Economic Supplement, the analysis presented in this article suggests that defining low-wage work on the basis of alternative hourly wage cutoffs changes the size of the low-wage population, but does not noticeably alter time trends in the rate of change. The analysis also indicates that different definitions capture groups of workers with substantively different demographic, social, and economic characteristics. Although the individuals in any of the categories examined might reasonably be considered low-wage workers, a single definition obscures these distinctions. %B Monthly Labor Review %8 October %G eng %U http://www.bls.gov/opub/mlr/2016/article/pdf/how-should-we-define-low-wage-work.pdf %0 Report %D 2016 %T How Will Statistical Agencies Operate When All Data Are Private? %A Abowd, John M. %X The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods.
And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies. %I Cornell University %G eng %U http://hdl.handle.net/1813/44663 %9 Preprint %0 Journal Article %J Bayesian Analysis %D 2016 %T Incorporating marginal prior information into latent class models %A Schifeling, T. S. %A Reiter, J. P. %B Bayesian Analysis %V 11 %P 499-518 %G eng %U https://projecteuclid.org/euclid.ba/1434649584 %R doi:10.1214/15-BA959 %0 Journal Article %J Journal of Economic and Social Measurement %D 2016 %T Measuring Poverty Using the Supplemental Poverty Measure in the Panel Study of Income Dynamics, 1998 to 2010 %A Kimberlin, S. %A Shaefer, H.L. %A Kim, J. %X The Supplemental Poverty Measure (SPM) was recently introduced by the U.S. Census Bureau as an alternative measure of poverty that addresses many shortcomings of the official poverty measure (OPM) to better reflect the resources households have available to meet their basic needs. The Census SPM is available only in the Current Population Survey (CPS). This paper describes a method for constructing SPM poverty estimates in the Panel Study of Income Dynamics (PSID), for the biennial years 1998 through 2010. A public-use dataset of individual-level SPM status produced in this analysis will be available for download on the PSID website. 
Annual SPM poverty estimates from the PSID are presented for the years 1998, 2000, 2002, 2004, 2006, 2008, and 2010 and compared to SPM estimates for the same years derived from CPS data by the Census Bureau and independent researchers. We find that SPM poverty rates in the PSID are somewhat lower than those found in the CPS, though trends over time and impact of specific SPM components are similar across the two datasets. %B Journal of Economic and Social Measurement %V 41 %G eng %U http://content.iospress.com/articles/journal-of-economic-and-social-measurement/jem425 %N 1 %& 17 %R 10.3233/JEM-160425 %0 Generic %D 2016 %T Mismatches %A Smyth, Jolene %A Olson, Kristen %G eng %0 Report %D 2016 %T Modeling Endogenous Mobility in Earnings Determination %A Abowd, John M. %A McKinney, Kevin L. %A Schmutte, Ian M. %X We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates.
Replication code can be found at DOI: http://doi.org/10.5281/zenodo.376600 and in our GitHub repository endogenous-mobility-replication. %I Cornell University %G eng %U http://hdl.handle.net/1813/40306 %9 Preprint %0 Journal Article %J Journal of the American Statistical Association %D 2016 %T Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence %A Jared S. Murray %A Jerome P. Reiter %X We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (i) modeling the means of the normal distributions as component-specific functions of the categorical variables and (ii) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations.
Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. %B Journal of the American Statistical Association %G eng %U http://dx.doi.org/10.1080/01621459.2016.1174132 %R 10.1080/01621459.2016.1174132 %0 Journal Article %J Stat %D 2016 %T Multivariate Spatio-Temporal Survey Fusion with Application to the American Community Survey and Local Area Unemployment Statistics %A Bradley, J.R. %A Holan, S.H. %A Wikle, C.K %X There are often multiple surveys available that estimate and report related demographic variables of interest that are referenced over space and/or time. Not all surveys produce the same information, and thus, combining these surveys typically leads to higher quality estimates. That is, not every survey has the same level of precision nor do they always provide estimates of the same variables. In addition, various surveys often produce estimates with incomplete spatio-temporal coverage. By combining surveys using a Bayesian approach, we can account for different margins of error and leverage dependencies to produce estimates of every variable considered at every spatial location and every time point. Specifically, our strategy is to use a hierarchical modelling approach, where the first stage of the model incorporates the margin of error associated with each survey. Then, in a lower stage of the hierarchical model, the multivariate spatio-temporal mixed effects model is used to incorporate multivariate spatio-temporal dependencies of the processes of interest. We adopt a fully Bayesian approach for combining surveys; that is, given all of the available surveys, the conditional distributions of the latent processes of interest are used for statistical inference. 
To demonstrate our proposed methodology, we jointly analyze period estimates from the US Census Bureau's American Community Survey, and estimates obtained from the Bureau of Labor Statistics Local Area Unemployment Statistics program. Copyright © 2016 John Wiley & Sons, Ltd. %B Stat %P 224 - 233 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/sta4.120/full %0 Report %D 2016 %T NCRN Meeting Fall 2016 %A Vilhuber, Lars %X Held at the U.S. Census Bureau HQ, Washington DC. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/45885 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Fall 2016: Audit Trails, Parallel Navigation, and the SIPP %A Lee, Jinyoung %X Thanks to Dr. Robert Belli, Ana Lucía Córdova Cazar, and Ben Seloske for the team effort. %I University of Nebraska %G eng %U http://hdl.handle.net/1813/45823 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Fall 2016: Scanner Data and Economic Statistics: A Unified Approach %A Redding, Stephen J. %A Weinstein, David E. %X NCRN Meeting Fall 2016: Scanner Data and Economic Statistics: A Unified Approach Redding, Stephen J.; Weinstein, David E. %I University of Michigan %G eng %U http://hdl.handle.net/1813/45821 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2016 %A Vilhuber, Lars %X Held at U.S. Census Bureau HQ, Washington DC. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/45899 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2016: A 2016 View of 2020 Census Quality, Costs, Benefits %A Spencer, Bruce D. %X Census costs affect data quality and data quality affects census benefits. Although measuring census data quality is difficult enough ex post, census planning requires it to be done well in advance.
The topic of this talk is the prediction of the cost-quality curve, its uncertainty, and its relation to benefits from census data. Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting %I Northwestern University %G eng %U http://hdl.handle.net/1813/43897 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2016: Attitudes Towards Geolocation-Enabled Census Forms %A Brandimarte, Laura %A Chiew, Ernest %A Ventura, Sam %A Acquisti, Alessandro %X Geolocation refers to the automatic identification of the physical locations of Internet users. In an online survey experiment, we studied respondent reactions towards different types of geolocation. After coordinating with US Census Bureau researchers, we designed and administered a replica of a census form to a sample of respondents. We also created slightly different forms by manipulating the type of geolocation implemented. Using the IP address of each respondent, we approximated the geographical coordinates of the respondent and displayed this location on a map on the survey. Across different experimental conditions, we varied the map interface among the three interfaces of the Google Maps API: default road map, Satellite View, and Street View. We also provided either a specific, pinpointed location, or a set of two circles of 1- and 2-mile radius. Snapshots of responses were captured at every instant information was added, altered, or deleted by respondents when completing the survey. We measured willingness to provide information on the typical Census form, as well as privacy concerns associated with geolocation technologies and attitudes towards the use of online geographical maps to identify one’s exact current location.
Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting %I Carnegie-Mellon University %G eng %U http://hdl.handle.net/1813/43889 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2016: Developing job linkages for the Health and Retirement Study %A McCue, Kristin %A Abowd, John %A Levenstein, Margaret %A Patki, Dhiren %A Rodgers, Ann %A Shapiro, Matthew %A Wasi, Nada %X This paper documents work using probabilistic record linkage to create a crosswalk between jobs reported in the Health and Retirement Study (HRS) and the list of workplaces on the Census Bureau’s Business Register. Matching job records provides an opportunity to join variables that occur uniquely in separate datasets, to validate responses, and to develop missing data imputation models. Identifying the respondent’s workplace (“establishment”) is valuable for HRS because it allows researchers to incorporate the effects of particular social, economic, and geospatial work environments in studies of respondent health and retirement behavior. The linkage makes use of name and address standardizing techniques tailored to business data that were recently developed in a collaboration between researchers at Census, Cornell, and the University of Michigan. The matching protocol makes no use of the identity of the HRS respondent and strictly protects the confidentiality of information about the respondent’s employer. The paper first describes the clerical review process used to create a set of human-reviewed candidate pairs, and use of that set to train matching models. It then describes and compares several linking strategies that make use of employer name, address, and phone number.
Finally, it discusses alternative ways of incorporating information on match uncertainty into estimates based on the linked data, and illustrates their use with a preliminary sample of matched HRS jobs. Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting %I University of Michigan %G eng %U http://hdl.handle.net/1813/43895 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2016: Evaluating Data quality in Time Diary Surveys Using Paradata %A Córdova Cazar, Ana Lucía %A Belli, Robert %X Over the past decades, time use researchers have been increasingly interested in analyzing wellbeing in tandem with the use of time (Juster and Stafford, 1985; Krueger et al, 2009). Many methodological issues have arisen in this endeavor, including the concern about the quality of the time use data. Survey researchers have increasingly turned to the analysis of paradata to better understand and model data quality. In particular, it has been argued that paradata may serve as a proxy for the respondents’ cognitive response process, and can be used as an additional tool to assess the impact of data generation on data quality. In this presentation, data quality in the American Time Use Survey (ATUS) will be assessed through the use of paradata and survey responses. Specifically, I will talk about a data quality index I have created, which includes measures of different types of ATUS errors (e.g. low number of reported activities, failures to report an activity), and paradata variables (e.g. response latencies, incompletes). The overall objective of this study is to contribute to data quality assessment in the collection of timeline data from national surveys by providing insights on those interviewing dynamics that most impact data quality.
These insights will help to improve future instruments and training of interviewers, as well as to reduce costs. Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting %I University of Nebraska %G eng %U http://hdl.handle.net/1813/43896 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2016: The ATUS and SIPP-EHC: Recent Developments %A Belli, Robert F. %X One of the main objectives of the NCRN award to the University of Nebraska node is to investigate data quality associated with timeline interviewing as conducted with the American Time Use Survey (ATUS) time diary and the Survey of Income and Program Participation event history calendar (SIPP-EHC). Specifically, our efforts are focused on the relationships between interviewing dynamics as extracted from analyses of paradata with measures of data quality. With the ATUS, our recent efforts have revealed that respondents differ in how they handle difficulty with remembering activities, with some overcoming these difficulties and others succumbing to them. With the SIPP-EHC, we are still in the initial stages of extracting variables from the paradata that are associated with interviewing dynamics. Our work has also involved the development of a CATI time diary in which we are able to analyze audio streams to capture interviewing dynamics. I will conclude this talk by discussing challenges that have yet to be overcome with our work, and our vision of moving forward with the eventual development of self-administered timeline instruments that will be respondent-friendly due to the assistance of intelligent-agent driven virtual interviewers.
Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting %I University of Nebraska %G eng %U http://hdl.handle.net/1813/43893 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2017: 2017 Economic Census: Towards Synthetic Data Sets %A Caldwell, Carol %A Thompson, Katherine Jenny %X NCRN Meeting Spring 2017: 2017 Economic Census: Towards Synthetic Data Sets Caldwell, Carol; Thompson, Katherine Jenny %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52165 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2017: Differentially Private Verification of Regression Model Results %A Reiter, Jerry %X NCRN Meeting Spring 2017: Differentially Private Verification of Regression Model Results Reiter, Jerry %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52167 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2017: Practical Issues in Anonymity %A Clifton, Chris %A Merill, Shawn %A Merill, Keith %X NCRN Meeting Spring 2017: Practical Issues in Anonymity Clifton, Chris; Merill, Shawn; Merill, Keith %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52166 %9 Preprint %0 Report %D 2016 %T NCRN Newsletter: Volume 2 - Issue 4 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X Overview of activities at NSF-Census Research Network nodes from September 2015 through December 2015. NCRN Newsletter Vol. 2, Issue 4: January 28, 2016.

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/42394 %9 Preprint %0 Report %D 2016 %T NCRN Newsletter: Volume 3 - Issue 1 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X Overview of activities at NSF-Census Research Network nodes from January 2016 through May 2016. NCRN Newsletter Vol. 3, Issue 1: June 10, 2016 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/44199 %9 Preprint %0 Report %D 2016 %T NCRN Newsletter: Volume 3 - Issue 2 %A Vilhuber, Lars %A Knight-Ingram, Dory %X Overview of activities at NSF-Census Research Network nodes from June 2016 through December 2016. NCRN Newsletter Vol. 3, Issue 2: December 23, 2016 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/46171 %9 Preprint %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2016 %T Noise infusion as a confidentiality protection measure for graph-based statistics %A Abowd, John M. %A McKinney, Kevin L. %X We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S.
Census Bureau's Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs. %B Statistical Journal of the International Association for Official Statistics %V 32 %P 127-135 %G eng %U http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji958 %N 1 %& 127 %R 10.3233/SJI-160958 %0 Report %D 2016 %T The NSF-Census Research Network in 2016: Taking stock, looking forward %A Vilhuber, Lars %X An overview of the activities of the NSF-Census Research Network as of 2016, given on Saturday, May 21, 2016, at a workshop on spatial and spatio-temporal design and analysis for official statistics, hosted by the Spatio-Temporal Statistics NSF Census Research Network (STSN) at the University of Missouri, and sponsored by the NSF-Census Research Network (NCRN) %I University of Missouri %G eng %U http://hdl.handle.net/1813/46210 %9 Preprint %0 Journal Article %J Journal of Applied Research in Memory and Cognition %D 2016 %T Parallel associations and the structure of autobiographical knowledge %A Belli, R.F. %A T. Al Baghal %K Autobiographical memory; Autobiographical knowledge; Autobiographical periods; Episodic memory; Retrospective reports %X The self-memory system (SMS) model of autobiographical knowledge conceives that memories are structured thematically, organized both hierarchically and temporally. This model has been challenged on several fronts, including the absence of parallel linkages across pathways.
Calendar survey interviewing shows the frequent and varied use of parallel associations in autobiographical recall. Parallel associations in these data are commonplace, and are driven more by respondents’ generative retrieval than by interviewers’ probing. Parallel associations represent a number of autobiographical knowledge themes that are interrelated across life domains. The content of parallel associations is nearly evenly split between general and transitional events, supporting the importance of transitions in autobiographical memory. Associations in respondents’ memories (both parallel and sequential) demonstrate complex interactions with interviewer verbal behaviors during generative retrieval. In addition to discussing the implications of these results for the SMS model, implications are also drawn for transition theory and the basic-systems model. %B Journal of Applied Research in Memory and Cognition %V 5 %P 150–157 %8 03/2016 %G eng %N 2 %R 10.1016/j.jarmac.2016.03.004 %0 Report %D 2016 %T Practical Issues in Anonymity %A Clifton, Chris %A Merill, Shawn %A Merill, Keith %X Practical Issues in Anonymity Clifton, Chris; Merill, Shawn; Merill, Keith %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52166 %9 Preprint %0 Journal Article %J Journal of Privacy and Confidentiality %D 2016 %T Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering %A Murray, J. S. %X Probabilistic record linkage, the task of merging two or more databases in the absence of a unique identifier, is a perennial and challenging problem. It is closely related to the problem of deduplicating a single database, which can be cast as linking a single database against itself. In both cases the number of possible links grows rapidly in the size of the databases under consideration, and in most applications it is necessary to first reduce the number of record pairs that will be compared.
Spurred by practical considerations, a range of methods have been developed for this task. These methods go under a variety of names, including indexing and blocking, and have seen significant development. However, methods for inferring linkage structure that account for indexing, blocking, and additional filtering steps have not seen commensurate development. In this paper we review the implications of indexing, blocking and filtering within the popular Fellegi-Sunter framework, and propose a new model to account for particular forms of indexing and filtering. %B Journal of Privacy and Confidentiality %V 7 %G eng %U http://repository.cmu.edu/jpc/vol7/iss1/2 %N 1 %0 Report %D 2016 %T Regression Modeling and File Matching Using Possibly Erroneous Matching Variables %A Dalzell, N. M. %A Reiter, J. P. %K Statistics - Applications %X Many analyses require linking records from two databases comprising overlapping sets of individuals. In the absence of unique identifiers, the linkage procedure often involves matching on a set of categorical variables, such as demographics, common to both files. Typically, however, the resulting matches are inexact: some cross-classifications of the matching variables do not generate unique links across files. Further, the matching variables can be subject to reporting errors, which introduce additional uncertainty in analyses. We present a Bayesian file matching methodology designed to estimate regression models and match records simultaneously when categorical matching variables are subject to reporting error. The method relies on a hierarchical model that includes (1) the regression of interest involving variables from the two files given a vector indicating the links, (2) a model for the linking vector given the true values of the matching variables, (3) a measurement error model for reported values of the matching variables given their true values, and (4) a model for the true values of the matching variables. 
We describe algorithms for sampling from the posterior distribution of the model. We illustrate the methodology using artificial data and data from education records in the state of North Carolina. %I ArXiv %G eng %U http://arxiv.org/abs/1608.06309 %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2016 %T Releasing synthetic magnitude micro data constrained to fixed marginal totals %A Wei, Lan %A Reiter, Jerome P. %K Confidential %K Disclosure %K establishment %K mixture %K poisson %K risk %X We present approaches to generating synthetic microdata for multivariate data that take on non-negative integer values, such as magnitude data in economic surveys. The basic idea is to estimate a mixture of Poisson distributions to describe the multivariate distribution, and release draws from the posterior predictive distribution of the model. We develop approaches that guarantee the synthetic data sum to marginal totals computed from the original data, as well as approaches that do not enforce this equality. For both cases, we present methods for assessing disclosure risks inherent in releasing synthetic magnitude microdata. We illustrate the methodology using economic data from a survey of manufacturing establishments. %B Statistical Journal of the International Association for Official Statistics %V 32 %P 93-108 %8 02/2016 %G eng %U http://content.iospress.com/download/statistical-journal-of-the-iaos/sji959 %N 1 %& 93 %R 10.3233/SJI-160959 %0 Journal Article %J Journal of Applied Statistics %D 2016 %T Simultaneous edit-imputation and disclosure limitation for business establishment data %A H. J. Kim %A J. P. Reiter %A A. F. Karr %X Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules.
Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks. %B Journal of Applied Statistics %8 12/2016 %G eng %R 10.1080/02664763.2016.1267123 %0 Journal Article %J Demography %D 2016 %T Spatial Variation in the Quality of American Community Survey Estimates %A Folch, David C. %A Arribas-Bel, Daniel %A Koschinsky, Julia %A Spielman, Seth E. %B Demography %V 53 %P 1535–1554 %G eng %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2016 %T Synthetic establishment microdata around the world %A Vilhuber, Lars %A Abowd, John M. %A Reiter, Jerome P. %K Business data %K confidentiality %K differential privacy %K international comparison %K Multiple imputation %K synthetic %X In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. 
In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature. %B Statistical Journal of the International Association for Official Statistics %V 32 %P 65-68 %G eng %U http://content.iospress.com/download/statistical-journal-of-the-iaos/sji964 %N 1 %& 65 %R 10.3233/SJI-160964 %0 Thesis %B Statistics %D 2016 %T Topics on Official Statistics and Statistical Policy %A Zachary Seeskin %X My dissertation studies decision questions for government statistical agencies, both regarding data collection and how to combine data from multiple sources. Informed decisions regarding expenditure on data collection require information about the effects of data quality on data use. For the first topic, I study two important uses of decennial census data in the U.S.: for apportioning the House of Representatives and for allocating federal funds. Estimates of distortions in these two uses are developed for different levels of census accuracy. Then, I thoroughly investigate the sensitivity of findings to the census error distribution and to the choice of how to measure the distortions. The chapter concludes with a proposed framework for partial cost-benefit analysis that charges a share of the cost of the census to allocation programs. Then, I investigate an approximation to make analysis of the effects of census error on allocations feasible when allocations also depend on non-census statistics, as is the case for many formula-based allocations.
The approximation conditions on the realized values of the non-census statistics instead of using the joint distribution over both census and non-census statistics. The research studies how using the approximation affects conclusions. I find that in some simple cases, the approximation always either overstates or equals the true effects of census error. Understatement is possible in other cases, but theory suggests that the largest possible understatements are about one-third the amount of the largest possible overstatements. In simulations with a more complex allocation formula, the approximation tends to overstate the effects of census error with the overstatement increasing with error in non-census statistics but decreasing with error in census statistics. In the final chapter, I evaluate the use of 2008-2010 property tax data from CoreLogic, Inc. (CoreLogic), aggregated from county and township governments from around the country, to improve 2010 American Community Survey (ACS) estimates of property tax amounts for single-family homes. Particularly, I evaluate the potential to use CoreLogic to reduce respondent burden, to study survey response error and to improve adjustments for survey nonresponse. The coverage of the CoreLogic data varies between counties as does the correspondence between ACS and CoreLogic property taxes. This geographic variation implies that different approaches toward using CoreLogic are needed in different areas of the country. Further, large differences between CoreLogic and ACS property taxes in certain counties seem to be due to conceptual differences between what is collected in the two data sources. I examine three counties, Clark County, NV, Philadelphia County, PA and St. Louis County, MO, and compare how estimates would change with different approaches using the CoreLogic data. Mean county property tax estimates are highly sensitive to whether ACS or CoreLogic data are used to construct estimates. 
Using CoreLogic data in imputation modeling for nonresponse adjustment of ACS estimates modestly improves the predictive power of imputation models, although estimates of county property taxes and property taxes by mortgage status are not very sensitive to the imputation method. %B Statistics %I Northwestern University %C Evanston, Illinois %V PHD %P 24 %8 09/2016 %G eng %U http://search.proquest.com/docview/1826016819 %0 Journal Article %J Journal of Official Statistics %D 2016 %T Using Data Mining to Predict the Occurrence of Respondent Retrieval Strategies in Calendar Interviewing: The Quality of Retrospective Reports %A Belli, Robert F. %A Miller, L. Dee %A Baghal, Tarek Al %A Soh, Leen-Kiat %X Determining which verbal behaviors of interviewers and respondents are dependent on one another is a complex problem that can be facilitated via data-mining approaches. Data are derived from the interviews of 153 respondents of the Panel Study of Income Dynamics (PSID) who were interviewed about their life-course histories. Behavioral sequences of interviewer-respondent interactions that were most predictive of respondents spontaneously using parallel, timing, duration, and sequential retrieval strategies in their generation of answers were examined. We also examined which behavioral sequences were predictive of retrospective reporting data quality as shown by correspondence between calendar responses with responses collected in prior waves of the PSID. The verbal behaviors of immediately preceding interviewer and respondent turns of speech were assessed in terms of their co-occurrence with each respondent retrieval strategy. Interviewers’ use of parallel probes is associated with poorer data quality, whereas interviewers’ use of timing and duration probes, especially in tandem, is associated with better data quality. 
Respondents’ use of timing and duration strategies is also associated with better data quality and both strategies are facilitated by interviewer timing probes. Data mining alongside regression techniques is valuable to examine which interviewer-respondent interactions will benefit data quality. %B Journal of Official Statistics %V 32 %P 579-600 %8 2016 %G eng %N 3 %R https://doi.org/10.1515/jos-2016-0030 %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2016 %T Using partially synthetic microdata to protect sensitive cells in business statistics %A Miranda, Javier %A Vilhuber, Lars %K confidentiality protection %K gross job flows %K local labor markets %K Statistical Disclosure Limitation %K Synthetic data %K time-series %X We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions). %B Statistical Journal of the International Association for Official Statistics %V 32 %P 69-80 %8 2016 %G eng %U http://content.iospress.com/download/statistical-journal-of-the-iaos/sji963 %N 1 %& 69 %R 10.3233/SJI-160963 %0 Report %D 2016 %T Why Statistical Agencies Need to Take Privacy-loss Budgets Seriously, and What It Means When They Do %A John M. Abowd %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/32/ %0 Journal Article %J Journal of Survey Statistics and Methodology %D 2015 %T Accounting for nonignorable unit nonresponse and attrition in panel studies with refreshment samples %A Schifeling, T. %A Cheng, C. %A Hillygus, D. S. %A Reiter, J. P. 
%X Panel surveys typically suffer from attrition, which can lead to biased inference when basing analysis only on cases that complete all waves of the panel. Unfortunately, panel data alone cannot inform the extent of the bias from the attrition, so that analysts using the panel data alone must make strong and untestable assumptions about the missing data mechanism. Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during some later wave of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by nonignorable attrition while reducing reliance on strong assumptions about the attrition process. To date, these bias correction methods have not dealt with two key practical issues in panel studies: unit nonresponse in the initial wave of the panel and in the refreshment sample itself. As we illustrate, nonignorable unit nonresponse can significantly compromise the analyst’s ability to use the refreshment samples for attrition bias correction. Thus, it is crucial for analysts to assess how sensitive their inferences, corrected for panel attrition, are to different assumptions about the nature of the unit nonresponse. We present an approach that facilitates such sensitivity analyses, both for suspected nonignorable unit nonresponse in the initial wave and in the refreshment sample. We illustrate the approach using simulation studies and an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study. %B Journal of Survey Statistics and Methodology %V 3 %P 265-295 %G eng %U http://jssam.oxfordjournals.org/content/3/3/265.abstract %N 3 %& 265 %R 10.1093/jssam/smv007 %0 Journal Article %J Statistica Sinica %D 2015 %T Bayesian Analysis of Spatially-Dependent Functional Responses with Spatially-Dependent Multi-Dimensional Functional Predictors %A Yang, W. H. %A Wikle, C.K. %A Holan, S.H. %A Sudduth, K. %A Meyers, D.B.
%B Statistica Sinica %V 25 %G eng %U http://www3.stat.sinica.edu.tw/preprint/SS-13-245w_Preprint.pdf %& 205-223 %R 10.5705/ss.2013.245w %0 Journal Article %J Annals of Applied Statistics %D 2015 %T Bayesian Binomial Mixture Models for Estimating Abundance in Ecological Monitoring Studies %A Wu, G. %A Holan, S.H. %A Nilon, C.H. %A Wikle, C.K. %B Annals of Applied Statistics %V 9 %P 1-26 %G eng %U http://projecteuclid.org/euclid.aoas/1430226082 %R 10.1214/14-AOAS801 %0 Journal Article %J Statistical Methods and Applications %D 2015 %T Bayesian Hierarchical Statistical SIRS Models %A Zhuang, L. %A Cressie, N. %B Statistical Methods and Applications %V 23 %P 601-646 %G eng %R 10.1007/s10260-014-0280-9 %0 Journal Article %J ArXiv %D 2015 %T Bayesian Latent Pattern Mixture Models for Handling Attrition in Panel Studies With Refreshment Samples %A Yajuan Si %A Jerome P. Reiter %A D. Sunshine Hillygus %K Categorical %K Dirichlet pro- cess %K Multiple imputation %K Non-ignorable %K Panel attrition %K Refreshment sample %X Many panel studies collect refreshment samples---new, randomly sampled respondents who complete the questionnaire at the same time as a subsequent wave of the panel. With appropriate modeling, these samples can be leveraged to correct inferences for biases caused by non-ignorable attrition. We present such a model when the panel includes many categorical survey variables. The model relies on a Bayesian latent pattern mixture model, in which an indicator for attrition and the survey variables are modeled jointly via a latent class model. We allow the multinomial probabilities within classes to depend on the attrition indicator, which offers additional flexibility over standard applications of latent class models. We present results of simulation studies that illustrate the benefits of this flexibility. We apply the model to correct attrition bias in an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study. 
%B ArXiv %8 09/2015 %G eng %U http://arxiv.org/abs/1509.02124 %N 1509.02124 %0 Journal Article %J ArXiv %D 2015 %T Bayesian Lattice Filters for Time-Varying Autoregression and Time-Frequency Analysis %A Yang, W. H. %A Holan, S. H. %A Wikle, C.K. %X Modeling nonstationary processes is of paramount importance to many scientific disciplines including environmental science, ecology, and finance, among others. Consequently, flexible methodology that provides accurate estimation across a wide range of processes is a subject of ongoing interest. We propose a novel approach to model-based time-frequency estimation using time-varying autoregressive models. In this context, we take a fully Bayesian approach and allow both the autoregressive coefficients and innovation variance to vary over time. Importantly, our estimation method uses the lattice filter and is cast within the partial autocorrelation domain. The marginal posterior distributions are of standard form and, as a convenient by-product of our estimation method, our approach avoids undesirable matrix inversions. As such, estimation is extremely computationally efficient and stable. To illustrate the effectiveness of our approach, we conduct a comprehensive simulation study that compares our method with other competing methods and find that, in most cases, our approach performs better in terms of average squared error between the estimated and true time-varying spectral density. Lastly, we demonstrate our methodology through three modeling applications; namely, insect communication signals, environmental data (wind components), and macroeconomic data (US gross domestic product (GDP) and consumption). %B ArXiv %G eng %U http://arxiv.org/abs/1408.2757 %N 1408.2757 %0 Journal Article %J Project Euclid %D 2015 %T Bayesian Lattice Filters for Time-Varying Autoregression and Time–Frequency Analysis %A Yang, W. H. %A Holan, Scott H. %A Wikle, Christopher K.
%K locally stationary %K model selection %K nonstationary partial autocorrelation %K piecewise stationary %K sequential estimation %K time-varying spectral density %X Modeling nonstationary processes is of paramount importance to many scientific disciplines including environmental science, ecology, and finance, among others. Consequently, flexible methodology that provides accurate estimation across a wide range of processes is a subject of ongoing interest. We propose a novel approach to model-based time–frequency estimation using time-varying autoregressive models. In this context, we take a fully Bayesian approach and allow both the autoregressive coefficients and innovation variance to vary over time. Importantly, our estimation method uses the lattice filter and is cast within the partial autocorrelation domain. The marginal posterior distributions are of standard form and, as a convenient by-product of our estimation method, our approach avoids undesirable matrix inversions. As such, estimation is extremely computationally efficient and stable. To illustrate the effectiveness of our approach, we conduct a comprehensive simulation study that compares our method with other competing methods and find that, in most cases, our approach performs better in terms of average squared error between the estimated and true time-varying spectral density. Lastly, we demonstrate our methodology through three modeling applications; namely, insect communication signals, environmental data (wind components), and macroeconomic data (US gross domestic product (GDP) and consumption). %B Project Euclid %P 27 %8 10/2015 %G eng %U http://projecteuclid.org/euclid.ba/1445263834 %R 10.1214/15-BA978 %0 Journal Article %J Spatial Statistics %D 2015 %T Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography %A Quick, Harrison %A Holan, Scott H. %A Wikle, Christopher K. %A Reiter, Jerome P.
%B Spatial Statistics %V 14 %P 439--451 %8 08/2015 %G eng %U http://www.sciencedirect.com/science/article/pii/S2211675315000718 %R 10.1016/j.spasta.2015.07.008 %0 Journal Article %J ArXiv %D 2015 %T Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography %A Quick, H. %A Holan, S. H. %A Wikle, C. K. %A Reiter, J. P. %X Many data stewards collect confidential data that include fine geography. When sharing these data with others, data stewards strive to disseminate data that are informative for a wide range of spatial and non-spatial analyses while simultaneously protecting the confidentiality of data subjects' identities and attributes. Typically, data stewards meet this challenge by coarsening the resolution of the released geography and, as needed, perturbing the confidential attributes. When done with high intensity, these redaction strategies can result in released data with poor analytic quality. We propose an alternative dissemination approach based on fully synthetic data. We generate data using marked point process models that can maintain both the statistical properties and the spatial dependence structure of the confidential data. We illustrate the approach using data consisting of mortality records from Durham, North Carolina. %B ArXiv %G eng %U http://arxiv.org/abs/1407.7795 %N 1407.7795 %0 Journal Article %J Journal of Statistical Planning and Inference %D 2015 %T Bayesian Semiparametric Hierarchical Empirical Likelihood Spatial Models %A Porter, A.T. %A Holan, S.H. %A Wikle, C.K. %B Journal of Statistical Planning and Inference %V 165 %P 78-90 %8 10/2015 %G eng %R 10.1016/j.jspi.2015.04.002 %0 Journal Article %J Journal of the American Statistical Association %D 2015 %T Bayesian Spatial Change of Support for Count-Valued Survey Data with Application to the American Community Survey %A Bradley, Jonathan %A Wikle, C.K. %A Holan, S. H. 
%X We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. Specifically, the ACS produces 1-year, 3-year, and 5-year “period-estimates,” and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies it is often of interest to data-users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on “new” spatial supports in “real-time.” This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follows a Gaussian distribution. However, count-valued survey data is naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey-data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in “real-time.” We show the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. %B Journal of the American Statistical Association %8 12/2015 %G eng %U http://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1117471 %R 10.1080/01621459.2015.1117471 %0 Journal Article %J Journal of the American Statistical Association %D 2015 %T Bayesian Spatial Change of Support for Count-Valued Survey Data with Application to the American Community Survey %A Bradley, Jonathan R. 
%A Wikle, Christopher K. %A Holan, Scott H. %K Aggregation %K American Community Survey %K Bayesian hierarchical model %K Givens angle prior %K Markov chain Monte Carlo %K Multiscale model %K Non-Gaussian. %X We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. Specifically, the ACS produces 1-year, 3-year, and 5-year “period-estimates,” and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies it is often of interest to data-users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on “new” spatial supports in “real-time.” This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follows a Gaussian distribution. However, count-valued survey data is naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey-data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in “real-time.” We show the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. 
%B Journal of the American Statistical Association %8 12/2015 %G eng %U http://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1117471 %R 10.1080/01621459.2015.1117471 %0 Journal Article %J ArXiv %D 2015 %T Bayesian Spatial Change of Support for Count–Valued Survey Data %A Bradley, J. R. %A Wikle, C.K. %A Holan, S. H. %X We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. Specifically, the ACS produces 1-year, 3-year, and 5-year "period-estimates," and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies it is often of interest to data users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on "new" spatial supports in "real-time." This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follows a Gaussian distribution. However, count-valued survey data is naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey-data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in "real-time." We demonstrate the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. 
%B ArXiv %G eng %U http://arxiv.org/abs/1405.7227 %N 1405.7227 %0 Report %D 2015 %T Blocking Methods Applied to Casualty Records from the Syrian Conflict %A Sadosky, Peter %A Shrivastava, Anshumali %A Price, Megan %A Steorts, Rebecca %B ArXiv %G eng %U http://arxiv.org/abs/1510.07714 %0 Journal Article %J Statistical Science %D 2015 %T Capturing multivariate spatial dependence: Model, estimate, and then predict %A Cressie, N. %A Burden, S. %A Davis, W. %A Krivitsky, P. %A Mokhtarian, P. %A Seusse, T. %A Zammit-Mangion, A. %B Statistical Science %V 30 %P 170-175 %8 06/2015 %G eng %U http://projecteuclid.org/euclid.ss/1433341474 %N 2 %R 10.1214/15-STS517 %0 Report %D 2015 %T Categorical data fusion using auxiliary information %A Fosdick, B. K. %A Maria DeYoreo %A J. P. Reiter %X In data fusion analysts seek to combine information from two databases comprised of disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences. We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people's preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion. 
%I arXiv %G eng %U http://arxiv.org/abs/1506.05886 %0 Journal Article %J Population and Environment %D 2015 %T Change in Visible Impervious Surface Area in Southeastern Michigan Before and After the “Great Recession:” Spatial Differentiation in Remotely Sensed Land-Cover Dynamics %A Wilson, C. R. %A Brown, D. G. %B Population and Environment %V 36 %P 331-355 %8 03/2015 %G eng %U http://link.springer.com/article/10.1007%2Fs11111-014-0219-y %N 3 %& 331 %R 10.1007/s11111-014-0219-y %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Changing ‘Who’ or ‘Where’: Implications for Data Quality in the American Time Use Survey %A Deal, C.E. %A Kirchner, A. %A Cordova-Cazar, A.L. %A Ellyne, L. %A Belli, R.F. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Bayesian Analysis %D 2015 %T Comment on Article by Ferreira and Gamerman %A Cressie, N. %A Chambers, R. L. %B Bayesian Analysis %V 10 %P 741-748 %8 04/2015 %G eng %U http://projecteuclid.org/euclid.ba/1429880217 %N 3 %R doi:10.1214/15-BA944B %0 Journal Article %J Journal of the American Statistical Association %D 2015 %T Comment on ``Semiparametric Bayesian Density Estimation with Disparate Data Sources: A Meta-Analysis of Global Childhood Undernutrition" by Finncane, M. M., Paciorek, C. J., Stevens, G. A., and Ezzati, M. %A Wikle, C.K. %A Holan, S.H. %B Journal of the American Statistical Association %G eng %0 Journal Article %J Bayesian Analysis %D 2015 %T Comment: Spatial sampling designs depend as much on “how much?” and “why?” as on “where?” %A Cressie, N. %A Chambers, R. L. %X A comment on “Optimal design in geostatistics under preferential sampling” by G. da Silva Ferreira and D. 
Gamerman %B Bayesian Analysis %G eng %0 Journal Article %J Journal of Economic Literature %D 2015 %T Communicating Uncertainty in Official Economic Statistics: An Appraisal Fifty Years after Morgenstern %A Manski, Charles F. %K B22: History of Economic Thought: Macroeconomics %K C82: Methodology for Collecting, Estimating, and Organizing Macroeconomic Data; Data Access %K E23: Macroeconomics: Production %X Federal statistical agencies in the United States and analogous agencies elsewhere commonly report official economic statistics as point estimates, without accompanying measures of error. Users of the statistics may incorrectly view them as error free or may incorrectly conjecture error magnitudes. This paper discusses strategies to mitigate misinterpretation of official statistics by communicating uncertainty to the public. Sampling error can be measured using established statistical principles. The challenge is to satisfactorily measure the various forms of nonsampling error. I find it useful to distinguish transitory statistical uncertainty, permanent statistical uncertainty, and conceptual uncertainty. I illustrate how each arises as the Bureau of Economic Analysis periodically revises GDP estimates, the Census Bureau generates household income statistics from surveys with nonresponse, and the Bureau of Labor Statistics seasonally adjusts employment statistics. I anchor my discussion of communication of uncertainty in the contribution of Oskar Morgenstern (1963a), who argued forcefully for agency publication of error estimates for official economic statistics. (JEL B22, C82, E23) %B Journal of Economic Literature %V 53 %P 631-53 %8 09/2015 %G eng %U http://www.aeaweb.org/articles.php?doi=10.1257/jel.53.3.631 %R 10.1257/jel.53.3.631 %0 Journal Article %J Test %D 2015 %T Comparing and selecting spatial predictors using local criteria %A Bradley, J.R. %A Cressie, N. %A Shi, T. 
%B Test %V 24 %P 1-28 %8 03/2015 %G eng %U http://dx.doi.org/10.1007/s11749-014-0415-1 %N 1 %& 1 %R 10.1007/s11749-014-0415-1 %0 Thesis %B Statistical Science %D 2015 %T A Comparison of Multiple Imputation Methods for Categorical Data (Master's Thesis) %A Akande, O. %B Statistical Science %I Duke University %G eng %9 Masters %0 Report %D 2015 %T Cost-Benefit Analysis for a Quinquennial Census: The 2016 Population Census of South Africa. %A Spencer, Bruce D. %A May, Julian %A Kenyon, Steven %A Seeskin, Zachary H. %K demographic statistics %K fiscal allocations %K loss function %K population estimates %K post-censal estimates %XThe question of whether to carry out a quinquennial census is being faced by national statistical offices in increasingly many countries, including Canada, Nigeria, Ireland, Australia, and South Africa. The authors describe uses, and limitations, of cost-benefit analysis for this decision problem in the case of the 2016 census of South Africa. The government of South Africa needed to decide whether to conduct a 2016 census or to rely on increasingly inaccurate post-censal estimates accounting for births, deaths, and migration since the previous (2011) census. The cost-benefit analysis compared predicted costs of the 2016 census to the benefits from improved allocation of intergovernmental revenue, which was considered by the government to be a critical use of the 2016 census, although not the only important benefit. Without the 2016 census, allocations would be based on population estimates. Accuracy of the post-censal estimates was estimated from the performance of past estimates, and the hypothetical expected reduction in errors in allocation due to the 2016 census was estimated. A loss function was introduced to quantify the improvement in allocation. With this evidence, the government was able to decide not to conduct the 2016 census, but instead to improve data and capacity for producing post-censal estimates.
%B IPR Working Paper Series %I Northwestern University, Institute for Policy Research %G eng %U http://www.ipr.northwestern.edu/publications/papers/2015/ipr-wp-15-06.html %9 Working Paper %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Determining Potential for Breakoff in Time Diary Survey Using Paradata %A Wettlaufer, D. %A Arunachalam, H. %A Atkin, G. %A Eck, A. %A Soh, L.-K. %A Belli, R.F. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 05/2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J ArXiv %D 2015 %T Dirichlet Process Mixture Models for Nested Categorical Data %A Hu, J. %A Reiter, J.P. %A Wang, Q. %X We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same group. It also facilitates simultaneous modeling of variables at both group and unit levels. We develop a version of the model that assigns zero probability to groups and units with physically impossible combinations of variables. We apply the model to estimate multivariate relationships in a subset of the American Community Survey. Using the estimated model, we generate synthetic household data that could be disseminated as redacted public use files with high analytic validity and low disclosure risks. Supplementary materials for this article are available online. 
%B ArXiv %G eng %U http://arxiv.org/pdf/1412.2282v3.pdf %N 1412.2282 %0 Thesis %B Statistical Science %D 2015 %T Dirichlet Process Mixture Models for Nested Categorical Data (Ph.D. Thesis) %A Hu, J. %B Statistical Science %I Duke University %G eng %U http://dukespace.lib.duke.edu/dspace/handle/10161/9933 %9 Ph.D. %0 Conference Paper %B International Conference on Total Survey Error %D 2015 %T Do Interviewers with High Cooperation Rates Behave Differently? Interviewer Cooperation Rates and Interview Behaviors %A Olson, K. %A Smyth, J.D. %A Kirchner, A. %B International Conference on Total Survey Error %C Baltimore, MD %8 09/2015 %G eng %U http://www.niss.org/events/2015-international-total-survey-error-conference %0 Conference Paper %B Joint Statistical Meetings %D 2015 %T Do Interviewers with High Cooperation Rates Behave Differently? Interviewer Cooperation Rates and Interview Behaviors %A Olson, K. %A Smyth, J.D. %A Kirchner, A. %B Joint Statistical Meetings %C Seattle, WA %8 08/2015 %G eng %U http://www.amstat.org/meetings/jsm/2015/program.cfm %0 Thesis %B Economics %D 2015 %T Dynamic Models of Human Capital Accumulation (Ph.D. Thesis) %A Ransom, T. %B Economics %I Duke University %G eng %U http://dukespace.lib.duke.edu/dspace/handle/10161/9929 %9 Ph.D. %0 Report %D 2015 %T Economic Analysis and Statistical Disclosure Limitation %A Abowd, John M. %A Schmutte, Ian M. %X This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. 
Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.
%I Cornell University %G eng %U http://hdl.handle.net/1813/40581 %9 Preprint %0 Journal Article %J Brookings Papers on Economic Activity %D 2015 %T Economic Analysis and Statistical Disclosure Limitation %A Abowd, John M. %A Schmutte, Ian M. %X This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies. %B Brookings Papers on Economic Activity %V Spring 2015 %8 03/2015 %G eng %U http://www.brookings.edu/about/projects/bpea/papers/2015/economic-analysis-statistical-disclosure-limitation %0 Journal Article %J Journal of Survey Statistics and Methodology %D 2015 %T The Effect of CATI Questionnaire Design Features on Response Timing %A Olson, K. %A Smyth, J.D. %B Journal of Survey Statistics and Methodology %V 3 %P 361-396 %G eng %N 3 %R 10.1093/jssam/smv021 %0 Report %D 2015 %T Effects of Census Accuracy on Apportionment of Congress and Allocations of Federal Funds. %A Seeskin, Zachary H. %A Spencer, Bruce D. %X How much accuracy is needed in the 2020 census depends on the cost of attaining accuracy and on the consequences of imperfect accuracy. 
The cost target for the 2020 census of the United States has been specified, and the Census Bureau is developing projections of the accuracy attainable for that cost. It is desirable to have information about the consequences of the accuracy that might be attainable for that cost or for alternative cost levels. To assess the consequences of imperfect census accuracy, Seeskin and Spencer consider alternative profiles of accuracy for states and assess their implications for apportionment of the U.S. House of Representatives and for allocation of federal funds. An error in allocation is defined as the difference between the allocation computed under imperfect data and the allocation computed with perfect data. Estimates of expected sums of absolute values of errors are presented for House apportionment and for federal funds allocations.
%B IPR Working Paper Series %I Northwestern University, Institute for Policy Research %G eng %U http://www.ipr.northwestern.edu/publications/papers/2015/ipr-wp-15-05.html %9 Working Paper %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Effects of interviewer and respondent behavior on data quality: An investigation of question types and interviewer learning %A Kirchner, A. %A Olson, K. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B 6th Conference of the European Survey Research Association %D 2015 %T Effects of interviewer and respondent behavior on data quality: An investigation of question types and interviewer learning %A Kirchner, A. %A Olson, K. %B 6th Conference of the European Survey Research Association %C Reykjavik, Iceland %8 07/2015 %G eng %U http://www.europeansurveyresearch.org/conference %0 Journal Article %J arXiv %D 2015 %T An empirical comparison of multiple imputation methods for categorical data %A Akande, O. %A Li, Fan %A Reiter, J. P. %X Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. 
We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. The results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and mixture model approaches. They also suggest competing advantages for the regression tree and mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. %B arXiv %G eng %U http://arxiv.org/abs/1508.05918 %N 1508.05918 %0 Journal Article %J Bayesian Analysis %D 2015 %T Entity Resolution with Empirically Motivated Priors %A Steorts, Rebecca C. %X Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier HB approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey on income and wealth, showing our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyper-parameters. %B Bayesian Analysis %V 10 %P 849–875 %8 12/2015 %G eng %U http://projecteuclid.org/euclid.ba/1441790411 %N 5 %( http://arxiv.org/abs/1409.0643 %R 10.1214/15-BA965SI %0 Thesis %B Department of Economics %D 2015 %T Essays on Multinational Production and the Propagation of Shocks %A Flaaen, Aaron %K Business Cycle Comovement %K Global Supply Chains %K Multinational Firms %X The increased exposure of the United States to economic shocks originating from abroad is a common concern of those critical of globalization. An understanding of the cross-country transmission of shocks is of central importance for policymakers seeking to limit excess volatility resulting from international linkages. Firms whose ownership spans multiple countries are one under-appreciated mechanism. These multinationals represent an enormous share of the global economy, but a general scarcity of firm-level data has limited our understanding of how they affect both origin and destination countries. One contribution of this dissertation is to expand the data availability on these firms, using innovative data-linking techniques. The first chapter provides some of the first ever causal evidence on the role of trade and multinational production in the transmission of economic shocks and the cross-country synchronization of business cycles. 
This chapter leverages the 2011 Japanese earthquake/tsunami as a natural experiment. It finds that those U.S. firms with large exposure to intermediate inputs from Japan -- typically the affiliates of Japanese multinationals -- experience significant output declines after this shock, roughly one-for-one with declines in imported inputs. Structural estimation of the production function reveals substantial complementarities between imported and domestic inputs. These results suggest that global supply chains are more rigid than previously thought. The second chapter incorporates this low production elasticity of imported inputs into an otherwise standard dynamic stochastic general equilibrium model. The low degree of input substitutability, when applied to the share of trade governed by multinational firms, can generate effects in the aggregate. Value-added co-movement increases by 11 percentage points in the baseline model relative to a model where such features are absent. The model confirms that real linkages -- in addition to financial and policy spillovers -- play an important role in business cycle synchronization. The third chapter describes additional characteristics of multinational firms relative to domestic and exporting firms in the U.S. economy. These firms are larger, more productive, more capital intensive, and pay higher wages than other firms. The relative patterns of trade and output offer valuable guidance for the motives for ownership that spans national boundaries. %B Department of Economics %I University of Michigan %C Ann Arbor, MI %G eng %U http://hdl.handle.net/2027.42/111331 %9 Ph.D. %0 Book Section %B Geometry Driven Statistics %D 2015 %T Evaluation of diagnostics for hierarchical spatial statistical models %A Cressie, N. %A Burden, S. %E I.L. Dryden %E J.T. 
Kent %B Geometry Driven Statistics %7 1 %I Wiley %C Chichester %P 241-256 %@ 978-1118866573 %G eng %U http://niasra.uow.edu.au/content/groups/public/@web/@inf/@math/documents/doc/uow169240.pdf %& 12 %0 Journal Article %J Journal of Poverty %D 2015 %T Expanding the Discourse on Antipoverty Policy: Reconsidering a Negative Income Tax %A Wiederspan, Jessica %A Rhodes, Elizabeth %A Shaefer, H. Luke %K economic well-being %K poverty alleviation %K public policy %K social welfare policy %X This article proposes that advocates for the poor consider the replacement of the current means-tested safety net in the United States with a Negative Income Tax (NIT), a guaranteed income program that lifts families’ incomes above a minimum threshold. The article highlights gaps in service provision that leave millions in poverty, explains how a NIT could help fill those gaps, and compares current expenditures on major means-tested programs to estimated expenditures necessary for a NIT. Finally, it addresses the financial and political concerns that are likely to arise in the event that a NIT proposal gains traction among policy makers. %B Journal of Poverty %V 19 %P 218-238 %8 02/2015 %G eng %U http://dx.doi.org/10.1080/10875549.2014.991889 %R 10.1080/10875549.2014.991889 %0 Journal Article %J Stat %D 2015 %T Figures of merit for simultaneous inference and comparisons in simulation experiments %A Cressie, N. %A Burden, S. %B Stat %V 4 %P 196-211 %8 08/2015 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/sta4.88/epdf %N 1 %& 196 %R 10.1002/sta4.88 %0 Thesis %B Department of Economics %D 2015 %T Four Essays in Unemployment, Wage Dynamics and Subjective Expectations %A Hudomiet, Peter %K measurement error %K subjective expectations %K unemployment %X This dissertation contains four essays on unemployment differences between skill groups, on the effect of non-employment on wages and measurement error, and on subjective expectations of Americans about mortality and the stock market. 
Chapter 1 tests how much of the unemployment rate differences between education groups can be explained by occupational differences in labor adjustment costs. The educational gap in unemployment is substantial. Recent empirical studies found that the largest component of labor adjustment costs is adaptation costs: newly hired workers need a few months to get up to speed and reach full productivity. The chapter evaluates the effect of adaptation costs on unemployment using a calibrated search and matching model. Chapter 2 tests how short periods of non-employment affect survey reports of annual earnings. Non-employment has strong and non-standard effects on response error in earnings. Persons tend to report the permanent component of their earnings accurately, but transitory shocks are underreported. Transitory shocks due to career interruptions are very large, taking up several months of lost earnings, on average, and people only report 60-85 percent of these earnings losses. The resulting measurement error is non-standard: it has a positive mean, it is right-skewed, and the bias correlates with predictors of turnover. Chapter 3 proposes and tests a model, the modal response hypothesis, to explain patterns in mortality expectations of Americans. The model is a mathematical expression of the idea that survey responses of 0%, 50%, or 100% to probability questions indicate a high level of uncertainty about the relevant probability. The chapter shows that subjective survival expectations in 2002 line up very well with realized mortality of the HRS respondents between 2002 and 2010, and that our model performs better than models typically used in the literature on subjective probabilities. Chapter 4 analyzes the impact of the stock market crash of 2008 on households' expectations about the returns on the stock market index: the population average of expectations, the average uncertainty, and the cross-sectional heterogeneity in expectations from March 2008 to February 2009. 
%B Department of Economics %I University of Michigan %C Ann Arbor, MI %G eng %U http://hdl.handle.net/2027.42/113598 %9 Ph.D. %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Grids and Online Panels: A Comparison of Device Type from a Survey Quality Perspective %A Wang, Mengyang %A McCutcheon, Allan L. %A Allen, Laura %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2015 %T The role of occupation-specific adaptation costs in explaining the educational gap in unemployment. %A Hudomiet, Peter %G eng %U https://sites.google.com/site/phudomiet/Hudomiet-JobMarketPaper.pdf?attredirects=0 %9 Mimeo %0 Book Section %B Handbook of Uncertainty Quantification %D 2015 %T Hierarchical models for uncertainty quantification: An overview %A Wikle, C.K. %E Ghanem, R. %E Higdon, D. %E Owhadi, H. %B Handbook of Uncertainty Quantification %I Springer %G eng %0 Book Section %B Handbook of Discrete-Valued Time Series %D 2015 %T Hierarchical Agent-Based Spatio-Temporal Dynamic Models for Discrete Valued Data %A Wikle, C.K. %A Hooten, M.B. %E Davis, R. %E Holan, S. %E Lund, R. %E Ravishanker, N. %B Handbook of Discrete-Valued Time Series %I Chapman and Hall/CRC Press %C Boca Raton, FL. %G eng %U http://www.crcpress.com/product/isbn/9781466577732 %& Hierarchical Agent-Based Spatio-Temporal Dynamic Models for Discrete Valued Data %0 Book Section %B Handbook of Discrete-Valued Time Series %D 2015 %T Hierarchical Dynamic Generalized Linear Mixed Models for Discrete-Valued Spatio-Temporal Data %A Holan, S.H. %A Wikle, C.K. %E Davis, R. %E Holan, S. %E Lund, R. 
%E Ravishanker, N. %B Handbook of Discrete-Valued Time Series %I Chapman and Hall/CRC Press %C Boca Raton, FL %@ ISBN 9781466577732 %G eng %U http://www.crcpress.com/product/isbn/9781466577732 %0 Book Section %B Encyclopedia of Geographical Information Science %D 2015 %T Hierarchical Spatial Models %A Arab, A. %A Hooten, M.B. %A Wikle, C.K. %B Encyclopedia of Geographical Information Science %I Springer %G eng %0 Journal Article %J Geological Society %D 2015 %T Hierarchical, stochastic modeling across spatiotemporal scales of large river ecosystems and somatic growth in fish populations under various climate models: Missouri River sturgeon example %A Wildhaber, M.L. %A Wikle, C.K. %A Moran, E.H. %A Anderson, C.J. %A Franz, K.J. %A Dey, R. %B Geological Society %G eng %0 Journal Article %J Mathematical Geosciences %D 2015 %T Hot enough for you? A spatial exploratory and inferential analysis of North American climate-change projections %A Cressie, N. %A Kang, E.L. %B Mathematical Geosciences %G eng %U http://dx.doi.org/10.1007/s11004-015-9607-9 %R 10.1007/s11004-015-9607-9 %0 Report %D 2015 %T How individuals smooth spending: Evidence from the 2013 government shutdown using account data %A Gelman, Michael %A Kariv, Shachar %A Shapiro, Matthew D. %A Silverman, Dan %A Tadelis, Steven %X Using comprehensive account records, this paper examines how individuals adjusted spending and saving in response to a temporary drop in income due to the 2013 U.S. government shutdown. The shutdown cut paychecks by 40% for affected employees, which was recovered within 2 weeks. Though the shock was short-lived and completely reversed, spending dropped sharply, implying a naïve estimate of the marginal propensity to spend of 0.58. 
This estimate overstates how consumption responded. While many individuals had low liquidity, they used multiple strategies to smooth consumption, including delay of recurring payments such as mortgages and credit card balances. %I National Bureau of Economic Research %G eng %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T I Know What You Did Next: Predicting Respondent’s Next Activity Using Machine Learning %A Arunachalam, H. %A Atkin, G. %A Eck, A. %A Wettlaufer, D. %A Soh, L.-K. %A Belli, R.F. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 May 14-17, 2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2015 %T Introduction to The Survey of Income and Program Participation (SIPP) %A Shaefer, H. Luke %X Goals for the SIPP workshop: provide an introduction to the SIPP and get you up and running on the public-use SIPP files; offer some advanced tools for 2008 Panel SIPP data analysis; give you some experience analyzing SIPP data; introduce you to the SIPP EHC (SIPP Redesign); and introduce you to the SIPP Synthetic Beta (SSB). Presentation made on May 15, 2015 at the Census Bureau, and previously in 2014 at Duke University and the University of Michigan. %I University of Michigan %G eng %U http://hdl.handle.net/1813/40169 %9 Preprint %0 Book Section %B Handbook of Discrete-Valued Time Series %D 2015 %T Long Memory Discrete-Valued Time Series %A Lund, R. %A Holan, S.H. %A Livsey, J. %Y Davis, R. %Y Holan, S. %Y Lund, R. %Y Ravishanker, N. %B Handbook of Discrete-Valued Time Series %I Chapman and Hall %G eng %U http://www.crcpress.com/product/isbn/9781466577732 %& Long Memory Discrete-Valued Time Series %0 Report %D 2015 %T Modeling Endogenous Mobility in Wage Determination %A Abowd, John M. 
%A McKinney, Kevin L. %A Schmutte, Ian M. %X We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. %I Cornell University %G eng %U http://hdl.handle.net/1813/40306 %9 Preprint %0 Report %D 2015 %T Modeling Endogenous Mobility in Wage Determination %A Abowd, John M. %A McKinney, Kevin L. %A Schmutte, Ian M. %X We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax exogenous mobility by modeling the matched data as an evolving bipartite graph using a Bayesian latent-type framework. 
Our results suggest that allowing endogenous mobility increases the variation in earnings explained by individual heterogeneity and reduces the proportion due to employer and match effects. To assess external validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The mobility-bias corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52608 %9 Preprint %0 Report %D 2015 %T Modeling for Dynamic Ordinal Regression Relationships: An Application to Estimating Maturity of Rockfish in California %A DeYoreo, M. %A Kottas, A. %K Statistics - Applications %X We develop a Bayesian nonparametric framework for modeling ordinal regression relationships which evolve in discrete time. The motivating application involves a key problem in fisheries research on estimating dynamically evolving relationships between age, length and maturity, the latter recorded on an ordinal scale. The methodology builds from nonparametric mixture modeling for the joint stochastic mechanism of covariates and latent continuous responses. This approach yields highly flexible inference for ordinal regression functions while at the same time avoiding the computational challenges of parametric models. A novel dependent Dirichlet process prior for time-dependent mixing distributions extends the model to the dynamic setting. The methodology is used for a detailed study of relationships between maturity, age, and length for Chilipepper rockfish, using data collected over 15 years along the coast of California. %I ArXiv %G eng %U http://arxiv.org/abs/1507.01242 %0 Journal Article %J WIRES Computational Statistics %D 2015 %T Modern Perspectives on Statistics for Spatio-Temporal Data %A Wikle, C.K. 
%B WIRES Computational Statistics %V 7 %P 86-98 %G eng %U http://dx.doi.org/10.1002/wics.1341 %N 1 %R 10.1002/wics.1341 %0 Journal Article %J Journal of Official Statistics %D 2015 %T Moving Toward the New World of Censuses and Large-Scale Sample Surveys: Methodological Developments and Practical Implementations %A Fienberg, S. E. %B Journal of Official Statistics %G eng %0 Journal Article %J Statistics in Medicine %D 2015 %T Multiple imputation for harmonizing longitudinal non-commensurate measures in individual participant data meta-analysis %A Siddique, J. %A Reiter, J. P. %A Brincks, A. %A Gibbons, R. %A Crespi, C. %A Brown, C. H. %B Statistics in Medicine %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/sim.6562/abstract %R 10.1002/sim.6562 %0 Journal Article %J arXiv %D 2015 %T Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence %A Murray, J. S. %A Reiter, J. P. %X We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (i) modeling the means of the normal distributions as component-specific functions of the categorical variables and (ii) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). 
The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. %B arXiv %G eng %U arxiv.org/abs/1410.0438 %N 1410.0438 %0 Web Page %D 2015 %T Multiscale Analysis of Survey Data: Recent Developments and Exciting Prospects %A Bradley, J.R. %A Wikle, C.K. %A Holan, S.H. %B Statistics Views %G eng %0 Journal Article %D 2015 %T Multivariate Spatial Covariance Models: A Conditional Approach %A Cressie, N. %A Zammit-Mangion, A. %X Multivariate geostatistics is based on modelling all covariances between all possible combinations of two or more variables at any sets of locations in a continuously indexed domain. Multivariate spatial covariance models need to be built with care, since any covariance matrix that is derived from such a model must be nonnegative-definite. In this article, we develop a conditional approach for spatial-model construction whose validity conditions are easy to check. We start with bivariate spatial covariance models and go on to demonstrate the approach's connection to multivariate models defined by networks of spatial variables. In some circumstances, such as modelling respiratory illness conditional on air pollution, the direction of conditional dependence is clear. When it is not, the two directional models can be compared. More generally, the graph structure of the network reduces the number of possible models to compare. 
Model selection then amounts to finding possible causative links in the network. We demonstrate our conditional approach on surface temperature and pressure data, where the role of the two variables is seen to be asymmetric. %G eng %U https://arxiv.org/abs/1504.01865 %0 Journal Article %J STAT %D 2015 %T Multivariate Spatial Hierarchical Bayesian Empirical Likelihood Methods for Small Area Estimation %A Porter, A.T. %A Holan, S.H. %A Wikle, C.K. %B STAT %V 4 %P 108-116 %8 05/2015 %G eng %U http://dx.doi.org/10.1002/sta4.81 %N 1 %R 10.1002/sta4.81 %0 Journal Article %J ArXiv %D 2015 %T Multivariate Spatio-Temporal Models for High-Dimensional Areal Data with Application to Longitudinal Employer-Household Dynamics %A Bradley, J. R. %A Holan, S. H. %A Wikle, C.K. %X Many data sources report related variables of interest that are also referenced over geographic regions and time; however, there are relatively few general statistical methods that one can readily use that incorporate these multivariate spatio-temporal dependencies. Additionally, many multivariate spatio-temporal areal datasets are extremely high-dimensional, which leads to practical issues when formulating statistical models. For example, we analyze Quarterly Workforce Indicators (QWI) published by the US Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) program. QWIs are available by different variables, regions, and time points, resulting in millions of tabulations. Despite their already expansive coverage, by adopting a fully Bayesian framework, the scope of the QWIs can be extended to provide estimates of missing values along with associated measures of uncertainty. Motivated by the LEHD, and other applications in federal statistics, we introduce the multivariate spatio-temporal mixed effects model (MSTM), which can be used to efficiently model high-dimensional multivariate spatio-temporal areal datasets. 
The proposed MSTM extends the notion of Moran's I basis functions to the multivariate spatio-temporal setting. This extension leads to several methodological contributions including extremely effective dimension reduction, a dynamic linear model for multivariate spatio-temporal areal processes, and the reduction of a high-dimensional parameter space using a novel parameter model. %B ArXiv %G eng %U http://arxiv.org/abs/1503.00982 %N 1503.00982 %0 Journal Article %J Annals of Applied Statistics %D 2015 %T Multivariate Spatio-Temporal Models for High-Dimensional Areal Data with Application to Longitudinal Employer-Household Dynamics %A Bradley, J.R. %A Holan, S.H. %A Wikle, C.K. %X Many data sources report related variables of interest that are also referenced over geographic regions and time; however, there are relatively few general statistical methods that one can readily use that incorporate these multivariate spatio-temporal dependencies. Additionally, many multivariate spatio-temporal areal datasets are extremely high-dimensional, which leads to practical issues when formulating statistical models. For example, we analyze Quarterly Workforce Indicators (QWI) published by the US Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) program. QWIs are available by different variables, regions, and time points, resulting in millions of tabulations. Despite their already expansive coverage, by adopting a fully Bayesian framework, the scope of the QWIs can be extended to provide estimates of missing values along with associated measures of uncertainty. Motivated by the LEHD, and other applications in federal statistics, we introduce the multivariate spatio-temporal mixed effects model (MSTM), which can be used to efficiently model high-dimensional multivariate spatio-temporal areal datasets. The proposed MSTM extends the notion of Moran’s I basis functions to the multivariate spatio-temporal setting.
This extension leads to several methodological contributions including extremely effective dimension reduction, a dynamic linear model for multivariate spatio-temporal areal processes, and the reduction of a high-dimensional parameter space using a novel parameter model. %B Annals of Applied Statistics %V 9 %8 03/2015 %G eng %N 4 %R 10.1214/15-AOAS862 %0 Report %D 2015 %T NCRN Meeting Fall 2016: Dynamic Question Ordering: Obtaining Useful Information While Reducing Burden %A Early, Kirstin %X NCRN Meeting Fall 2016: Dynamic Question Ordering: Obtaining Useful Information While Reducing Burden Early, Kirstin %I Carnegie Mellon University %G eng %U http://hdl.handle.net/1813/45822 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015 %A Vilhuber, Lars %X NCRN Meeting Spring 2015 Vilhuber, Lars May 7 meetings @ U.S. Census Bureau, Washington DC. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/45867 %9 Preprint %0 Generic %D 2015 %T NCRN Meeting Spring 2015: A Vision for the Future of Data Access %A Reiter, J.P. %X NCRN Meeting Spring 2015: A Vision for the Future of Data Access Reiter, J.P. Presentation at the NCRN Meeting Spring 2015

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40181 %9 Preprint %0 Generic %D 2015 %T NCRN Meeting Spring 2015: Broadening data access through synthetic data %A Vilhuber, Lars %X NCRN Meeting Spring 2015: Broadening data access through synthetic data Vilhuber, Lars Presentation at the NCRN Meeting Spring 2015

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40185 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Building and Training the Next Generation of Survey Methodologists and Researchers %A Nugent, Rebecca %X NCRN Meeting Spring 2015: Building and Training the Next Generation of Survey Methodologists and Researchers Nugent, Rebecca Presentation at the NCRN Meetings Spring 2015 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40188 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network %A Abowd, John M. %A Fienberg, Stephen E. %X NCRN Meeting Spring 2015: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network Abowd, John M.; Fienberg, Stephen E. May 8, 2015 CNSTAT Public Seminar %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40186 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Comment on: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network %A Groshen, Erica L. %X NCRN Meeting Spring 2015: Comment on: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network Groshen, Erica L. Public Seminar Presentation by Erica L. 
Groshen at the Spring 2015 NCRN/CNSTAT Meetings %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40187 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Geographic Aspects of Direct and Indirect Estimators for Small Area Estimation %A Nagle, Nicholas %X NCRN Meeting Spring 2015: Geographic Aspects of Direct and Indirect Estimators for Small Area Estimation Nagle, Nicholas Presentation at the NCRN Meeting Spring 2015 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40182 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Geography and Usability of the American Community Survey %A Spielman, Seth %X NCRN Meeting Spring 2015: Geography and Usability of the American Community Survey Spielman, Seth Presentation at the NCRN Meeting Spring 2015 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40183 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Models for Multiscale Spatially-Referenced Count Data %A Holan, Scott %A Bradley, Jonathan R. %A Wikle, Christopher K. %X NCRN Meeting Spring 2015: Models for Multiscale Spatially-Referenced Count Data Holan, Scott; Bradley, Jonathan R.; Wikle, Christopher K. Presentation at the NCRN Meeting Spring 2015 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40176 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Regionalization of Multiscale Spatial Processes Using a Criterion for Spatial Aggregation Error %A Wikle, Christopher K. %A Bradley, Jonathan %A Holan, Scott %X NCRN Meeting Spring 2015: Regionalization of Multiscale Spatial Processes Using a Criterion for Spatial Aggregation Error Wikle, Christopher K.; Bradley, Jonathan; Holan, Scott Develop and implement a statistical criterion to diagnose spatial aggregation error that can facilitate the choice of regionalizations of spatial data. 
Presentation at NCRN Meeting Spring 2015 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40177 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods %A Abowd, John M. %A Schmutte, Ian %X NCRN Meeting Spring 2015: Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Abowd, John M.; Schmutte, Ian Presentation at the NCRN Meeting Spring 2015 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40184 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Survey Informatics: The Future of Survey Methodology and Survey Statistics Training in the Academy? %A McCutcheon, Allan L. %X NCRN Meeting Spring 2015: Survey Informatics: The Future of Survey Methodology and Survey Statistics Training in the Academy? McCutcheon, Allan L. Presentation at the NCRN Meeting Spring 2015

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40309 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Training Undergraduates, Graduate Students, Postdocs, and Federal Agencies: Methodology, Data, and Science for Federal Statistics %A Cressie, Noel %A Holan, Scott H. %A Wikle, Christopher K. %X NCRN Meeting Spring 2015: Training Undergraduates, Graduate Students, Postdocs, and Federal Agencies: Methodology, Data, and Science for Federal Statistics Cressie, Noel; Holan, Scott H.; Wikle, Christopher K. Presentation at the NCRN Spring 2015 Meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40179 %9 Preprint %0 Report %D 2015 %T NCRN Newsletter: Volume 2 - Issue 1 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 2 - Issue 1 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from October 2014 to January 2015. NCRN Newsletter Vol. 2, Issue 1: January 30, 2015. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40193 %9 Preprint %0 Report %D 2015 %T NCRN Newsletter: Volume 2 - Issue 2 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 2 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from January 2015 to May 2015. NCRN Newsletter Vol. 2, Issue 2: May 12, 2015. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40194 %9 Preprint %0 Report %D 2015 %T NCRN Newsletter: Volume 2 - Issue 2 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 2 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from February 2015 to May 2015. NCRN Newsletter Vol. 
2, Issue 2: May 12, 2015. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/44200 %9 Preprint %0 Report %D 2015 %T NCRN Newsletter: Volume 2 - Issue 3 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 2 - Issue 3 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from June 2015 through August 2015. NCRN Newsletter Vol. 2, Issue 3: September 15, 2015.

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/42393 %9 Preprint %0 Report %D 2015 %T Noise Infusion as a Confidentiality Protection Measure for Graph-Based Statistics %A Abowd, John M. %A McKinney, Kevin L. %X Noise Infusion as a Confidentiality Protection Measure for Graph-Based Statistics Abowd, John M.; McKinney, Kevin L. We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs. %I Cornell University %G eng %U http://hdl.handle.net/1813/42338 %9 Preprint %0 Journal Article %J ArXiv %D 2015 %T Nonparametric Bayesian models with focused clustering for mixed ordinal and nominal data %A DeYoreo, Maria %A Reiter, J. P. %A Hillygus, D. S. %X Dirichlet process mixtures can be useful models of multivariate categorical data and effective tools for multiple imputation of missing categorical values. In some contexts, however, these models can fit certain variables well at the expense of others in ways beyond the analyst's control.
For example, when the data include some variables with non-trivial amounts of missing values, the mixture model may fit the marginal distributions of the nearly and fully complete variables at the expense of the variables with high fractions of missing data. Motivated by this setting, we present a Dirichlet process mixture model for mixed ordinal and nominal data that allows analysts to split variables into two groups: focus variables and remainder variables. The model uses three sets of clusters, one set for ordinal focus variables, one for nominal focus variables, and one for all remainder variables. The model uses a multivariate ordered probit specification for the ordinal variables and independent multinomial kernels for the nominal variables. The three sets of clusters are linked using an infinite tensor factorization prior, as well as via dependence of the means of the latent continuous focus variables on the remainder variables. This effectively specifies a rich, complex model for the focus variables and a simpler model for remainder variables, yet still potentially captures associations among the variables. In the multiple imputation context, focus variables include key variables with high rates of missing values, and remainder variables include variables without much missing data. Using simulations, we illustrate advantages and limitations of using focused clustering compared to mixture models that do not distinguish variables. We apply the model to handle missing values in an analysis of the 2012 American National Election Study. %B ArXiV %I arXiv %G eng %U http://arxiv.org/abs/1508.03758 %N 1508.03758 %0 Journal Article %J Bayesian Analysis %D 2015 %T Nonparametric Bayesian models with focused clustering for mixed ordinal and nominal data %A M. De Yoreo %A J. P. Reiter %A D. S. Hillygus %X Dirichlet process mixtures can be useful models of multivariate categorical data and effective tools for multiple imputation of missing categorical values. 
In some contexts, however, these models can fit certain variables well at the expense of others in ways beyond the analyst's control. For example, when the data include some variables with non-trivial amounts of missing values, the mixture model may fit the marginal distributions of the nearly and fully complete variables at the expense of the variables with high fractions of missing data. Motivated by this setting, we present a Dirichlet process mixture model for mixed ordinal and nominal data that allows analysts to split variables into two groups: focus variables and remainder variables. The model uses three sets of clusters, one set for ordinal focus variables, one for nominal focus variables, and one for all remainder variables. The model uses a multivariate ordered probit specification for the ordinal variables and independent multinomial kernels for the nominal variables. The three sets of clusters are linked using an infinite tensor factorization prior, as well as via dependence of the means of the latent continuous focus variables on the remainder variables. This effectively specifies a rich, complex model for the focus variables and a simpler model for remainder variables, yet still potentially captures associations among the variables. In the multiple imputation context, focus variables include key variables with high rates of missing values, and remainder variables include variables without much missing data. Using simulations, we illustrate advantages and limitations of using focused clustering compared to mixture models that do not distinguish variables. We apply the model to handle missing values in an analysis of the 2012 American National Election Study. %B Bayesian Analysis %8 08/2015 %G eng %R 10.1214/16-BA1020 %0 Journal Article %J Multivariate Behavioral Research %D 2015 %T A nonparametric, multiple imputation-based method for the retrospective integration of data sets %A M.M. Carrig %A D. Manrique-Vallier %A K. Ranby %A J.P. Reiter %A R. 
Hoyle %B Multivariate Behavioral Research %V 50 %P 383-397 %G eng %U http://www.tandfonline.com/doi/full/10.1080/00273171.2015.1022641 %N 4 %& 383 %R 10.1080/00273171.2015.1022641 %0 Journal Article %J Criminal Justice Review %D 2015 %T Perceptions, behaviors and satisfaction related to public safety for persons with disabilities in the United States %A Brucker, D. %B Criminal Justice Review %V 1 %G eng %N 18 %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Predicting Breakoff Using Sequential Machine Learning Methods %A Soh, L.-K. %A Eck, A. %A McCutcheon, A.L. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 05/2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2015 %T Presentation: NADDI 2015: Crowdsourcing DDI Development: New Features from the CED2AR Project %A Perry, Benjamin %A Kambhampaty, Venkata %A Brumsted, Kyle %A Vilhuber, Lars %A Block, William %X Presentation: NADDI 2015: Crowdsourcing DDI Development: New Features from the CED2AR Project Perry, Benjamin; Kambhampaty, Venkata; Brumsted, Kyle; Vilhuber, Lars; Block, William Recent years have shown the power of user-sourced information evidenced by the success of Wikipedia and its many emulators. This sort of unstructured discussion is currently not feasible as a part of the otherwise successful metadata repositories. Creating and augmenting metadata is a labor-intensive endeavor. Harnessing collective knowledge from actual data users can supplement officially generated metadata. As part of our Comprehensive Extensible Data Documentation and Access Repository (CED2AR) infrastructure, we demonstrate a prototype of crowdsourced DDI, using DDI-C and supplemental XML. 
The system allows for any number of network connected instances (web or desktop deployments) of the CED2AR DDI editor to concurrently create and modify metadata. The backend transparently handles changes, and the frontend has the ability to separate official edits (by designated curators of the data and the metadata) from crowd-sourced content. We briefly discuss offline edit contributions as well. CED2AR uses DDI-C and supplemental XML together with Git for a very portable and lightweight implementation. This distributed network implementation allows for large scale metadata curation without the need for a hardware intensive computing environment, and can leverage existing cloud services, such as Github or Bitbucket. Ben Perry (Cornell/NCRN) presents joint work with Venkata Kambhampaty, Kyle Brumsted, Lars Vilhuber, & William C. Block at NADDI 2015. %I Cornell University %G eng %U http://hdl.handle.net/1813/40172 %9 Preprint %0 Journal Article %J Journal of Poverty %D 2015 %T Preventive policy strategy for banking the unbanked: Savings accounts for teenagers? %A Friedline, T. %A Despard, M. %A Chowa, G. %K financial assets %K savings %K Survey of Income and Program Participation (SIPP) %K teenagers %K unbanked %K young adults %X Concern over percentages of unbanked and underbanked households in the United States and their lack of connectedness to the financial mainstream has led to policy strategies geared toward reaching these households. Using nationally-representative longitudinal data, a preventive strategy for banking households is tested that asks whether young adults are more likely to be banked and own a diversity of financial assets when they are connected to the financial mainstream as teenagers. Young adults are more likely to own checking accounts, savings accounts, certificates of deposit, and stocks when they had savings accounts as teenagers. Policy implications are discussed.
%B Journal of Poverty %V 20 %P 2-33 %8 07/2015 %G eng %U http://www.tandfonline.com/doi/full/10.1080/10875549.2015.1015068 %N 1 %& 2 %R 10.1080/10875549.2015.1015068 %0 Journal Article %J Science %D 2015 %T Privacy and human behavior in the age of information %A Alessandro Acquisti %A Laura Brandimarte %A George Loewenstein %K confidentiality %K privacy %X This Review summarizes and draws connections between diverse streams of empirical research on privacy behavior. We use three themes to connect insights from social and behavioral sciences: people’s uncertainty about the consequences of privacy-related behaviors and their own preferences over those consequences; the context-dependence of people’s concern, or lack thereof, about privacy; and the degree to which privacy concerns are malleable—manipulable by commercial and governmental interests. Organizing our discussion by these themes, we offer observations concerning the role of public policy in the protection of privacy in the information age. %B Science %V 347 %G eng %U http://www.sciencemag.org/content/347/6221/509 %N 6221 %& 509 %R 10.1126/science.aaa1465 %0 Thesis %B Computer Science %D 2015 %T Probabilistic Hashing Techniques For Big Data %A Anshumali Shrivastava %X We investigate probabilistic hashing techniques for addressing computational and memory challenges in large scale machine learning and data mining systems. In this thesis, we show that the traditional idea of hashing goes far beyond near-neighbor search and there are some striking new possibilities. We show that hashing can improve state of the art large scale learning algorithms, and it goes beyond the conventional notions of pairwise similarities. Despite being a very well studied topic in literature, we found several opportunities for fundamentally improving some of the well-known textbook hashing algorithms.
In particular, we show that the traditional way of computing minwise hashes is unnecessarily expensive and without losing anything we can achieve an order of magnitude speedup. We also found that for cosine similarity search there is a better scheme than SimHash. In the end, we show that the existing locality sensitive hashing framework itself is very restrictive, and we cannot have efficient algorithms for some important measures like inner products which are ubiquitous in machine learning. We propose asymmetric locality sensitive hashing (ALSH), an extended framework, where we show provable and practical efficient algorithms for Maximum Inner Product Search (MIPS). Having such efficient solutions to MIPS directly scales up many popular machine learning algorithms. We believe that this thesis provides significant improvements to some of the heavily used subroutines in big-data systems, which we hope will be adopted. %B Computer Science %I Cornell University %V Ph.D. %G eng %U https://ecommons.cornell.edu/handle/1813/40886 %9 Dissertation %0 Thesis %B Department of Economics %D 2015 %T Ranking Firms Using Revealed Preference and Other Essays About Labor Markets %A Isaac Sorkin %K economics %K labor markets %X This dissertation contains essays on three questions about the labor market. Chapter 1 considers the question: why do some firms pay so much and some so little? Firms account for a substantial portion of earnings inequality. Although the standard explanation is that there are search frictions that support an equilibrium with rents, this chapter finds that compensating differentials for nonpecuniary characteristics are at least as important. To reach this finding, this chapter develops a structural search model and estimates it on U.S. administrative data. The model analyzes the revealed preference information in the labor market: specifically, how workers move between the 1.5 million firms in the data.
With on the order of 1.5 million parameters, standard estimation approaches are infeasible and so the chapter develops a new estimation approach that is feasible on such big data. Chapter 2 considers the question: why do men and women work at different firms? Men work for higher-paying firms than women. The chapter builds on chapter 1 to consider two explanations for why men and women work in different firms. First, men and women might search from different offer distributions. Second, men and women might have different rankings of firms. Estimation finds that the main explanation for why men and women are sorted is that women search from a lower-paying offer distribution than men. Indeed, men and women are estimated to have quite similar rankings of firms. Chapter 3 considers the question: what are the long-run effects of the minimum wage? An empirical consensus suggests that there are small employment effects of minimum wage increases. This chapter argues that these are short-run elasticities. Long-run elasticities, which may differ from short-run elasticities, are more policy relevant. This chapter develops a dynamic industry equilibrium model of labor demand. The model makes two points. First, long-run regressions have been misinterpreted because even if the short- and long-run employment elasticities differ, standard methods would not detect a difference using U.S. variation. Second, the model offers a reconciliation of the small estimated short-run employment effects with the commonly found pass-through of minimum wage increases to product prices. %B Department of Economics %I University of Michigan %C Ann Arbor, MI %G eng %U http://hdl.handle.net/2027.42/116747 %9 Ph.D.
%0 Journal Article %J The Stata Journal %D 2015 %T Record Linkage using STATA: Pre-processing, Linking and Reviewing Utilities %A Wasi, Nada %A Flaaen, Aaron %X In this article, we describe Stata utilities that facilitate probabilistic record linkage—the technique typically used for merging two datasets with no common record identifier. While the preprocessing tools are developed specifically for linking two company databases, the other tools can be used for many different types of linkage. Specifically, the stnd_compname and stnd_address commands parse and standardize company names and addresses to improve the match quality when linking. The reclink2 command is a generalized version of Blasnik's reclink (2010, Statistical Software Components S456876, Department of Economics, Boston College) that allows for many-to-one matching. Finally, clrevmatch is an interactive tool that allows the user to review matched results in an efficient and seamless manner. Rather than exporting results to another file format (for example, Excel), inputting clerical reviews, and importing back into Stata, one can use the clrevmatch tool to conduct all of these steps within Stata. This helps improve the speed and flexibility of matching, which often involves multiple runs. %B The Stata Journal %V 15 %P 1-15 %G eng %U http://www.stata-journal.com/article.html?article=dm0082 %N 3 %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Recording What the Respondent Says: Does Question Format Matter? %A Smyth, J.D. %A Olson, K. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J PlosOne %D 2015 %T Reducing the Margins of Error in the American Community Survey Through Data-Driven Regionalization %A Folch, D. %A Spielman, S. E. 
%B PlosOne %8 02/2015 %G eng %U http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115626 %R 10.1371/journal.pone.0115626 %0 Journal Article %J ArXiv %D 2015 %T Regionalization of Multiscale Spatial Processes using a Criterion for Spatial Aggregation Error %A Bradley, J. R. %A Wikle, C.K. %A Holan, S. H. %X The modifiable areal unit problem and the ecological fallacy are known problems that occur when modeling multiscale spatial processes. We investigate how these forms of spatial aggregation error can guide a regionalization over a spatial domain of interest. By "regionalization" we mean a specification of geographies that define the spatial support for areal data. This topic has been studied vigorously by geographers, but has been given less attention by spatial statisticians. Thus, we propose a criterion for spatial aggregation error (CAGE), which we minimize to obtain an optimal regionalization. To define CAGE we draw a connection between spatial aggregation error and a new multiscale representation of the Karhunen-Loeve (K-L) expansion. This relationship between CAGE and the multiscale K-L expansion leads to illuminating theoretical developments including: connections between spatial aggregation error, squared prediction error, spatial variance, and a novel extension of Obled-Creutin eigenfunctions. The effectiveness of our approach is demonstrated through an analysis of two datasets, one using the American Community Survey and one related to environmental ocean winds. %B ArXiv %G eng %U http://arxiv.org/abs/1502.01974 %N 1502.01974 %0 Journal Article %J Test %D 2015 %T Rejoinder on: Comparing and selecting spatial predictors using local criteria %A Bradley, J.R. %A Cressie, N. %A Shi, T. 
%B Test %V 24 %P 54-60 %8 03/2015 %G eng %U http://dx.doi.org/10.1007/s11749-014-0414-2 %N 1 %R 10.1007/s11749-014-0414-2 %0 Thesis %B Statistics Department %D 2015 %T Relaxations of differential privacy and risk utility evaluations of synthetic data and fidelity measures %A McClure, D. %X Many organizations collect data that would be useful to public researchers, but cannot be shared due to promises of confidentiality to those that participated in the study. This thesis evaluates the risks and utility of several existing release methods, as well as develops new ones with different risk/utility tradeoffs. In Chapter 2, I present a new risk metric, called model-specific probabilistic differential privacy (MPDP), which is a relaxed version of differential privacy that allows the risk of a release to be based on the worst-case among plausible datasets instead of all possible datasets. In addition, I develop a generic algorithm called local sensitivity random sampling (LSRS) that, under certain assumptions, is guaranteed to give releases that meet MPDP for any query with computable local sensitivity. I demonstrate, using several well-known queries, that LSRS releases have much higher utility than the standard differentially private release mechanism, the Laplace Mechanism, at only marginally higher risk. In Chapter 3, using two synthesis models, I empirically characterize the risks of releasing synthetic data under the standard “all but one” assumption on intruder background knowledge, as well as the effect that decreasing the number of observations the intruder knows beforehand has on that risk. I find in these examples that even in the “all but one” case, there is no risk except to extreme outliers, and even then the risk is mild. 
I find that the effect that removing observations from an intruder’s background knowledge has on risk depends heavily on how well that intruder can fill in those missing observations: the risk remains fairly constant if he/she can fill them in well, and the risk drops quickly if he/she cannot. In Chapter 4, I characterize the risk/utility tradeoffs for an augmentation of synthetic data called fidelity measures (see Section 1.2.3). Fidelity measures were proposed in Reiter et al. (2009) to quantify the degree to which the results of an analysis performed on a released synthetic dataset match the results of the same analysis performed on the confidential data. I compare the risk/utility of two different fidelity measures, the confidence interval overlap (Karr et al., 2006) and a new fidelity measure I call the mean predicted probability difference (MPPD). Simultaneously, I compare the risk/utility tradeoffs of two different private release mechanisms, LSRS and a heuristic release method called “safety zones”. I find that the confidence interval overlap can be applied to a wider variety of analyses and is more specific than MPPD, but MPPD is more robust to the influence of individual observations in the confidential data, which means it can be released with less noise than the confidence interval overlap at the same level of risk. I also find that while safety zones are much simpler to compute and generally have good utility (whereas the utility of LSRS depends on the value of ε), they are also much more vulnerable to context-specific attacks that, while not easy for an intruder to implement, are difficult to anticipate. %B Statistics Department %I Duke University %V PhD %G eng %U http://hdl.handle.net/10161/11365 %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T The Role of Device Type and Respondent Characteristics in Internet Panel Survey Breakoff %A Allan L. 
McCutcheon %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Econometrics %D 2015 %T The SAR model for very large datasets: A reduced-rank approach %A Burden, S. %A Cressie, N. %A Steel, D.G. %B Econometrics %V 3 %P 317-338 %G eng %U http://www.mdpi.com/2225-1146/3/2/317 %N 2 %R 10.3390/econometrics3020317 %0 Journal Article %J Political Analysis %D 2015 %T Semi-parametric selection models for potentially non-ignorable attrition in panel studies with refreshment samples %A Y. Si %A J.P. Reiter %A D.S. Hillygus %B Political Analysis %V 23 %P 92-112 %G eng %U http://pan.oxfordjournals.org/cgi/reprint/mpu009?%20ijkey=joX8eSl6gyIlQKP&keytype=ref %& 92 %0 Journal Article %J Journal of the American Statistical Association %D 2015 %T Simultaneous Edit-Imputation for Continuous Microdata %A Kim, H. J. %A Cox, L. H. %A Karr, A. F. %A Reiter, J. P. %A Wang, Q. %B Journal of the American Statistical Association %V 110 %P 987-999 %G eng %U http://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1040881 %R 10.1080/01621459.2015.1040881 %0 Journal Article %J Australian & New Zealand Journal of Statistics %D 2015 %T Small Area Estimation via Multivariate Fay-Herriot Models With Latent Spatial Dependence %A Porter, A.T. %A Wikle, C.K. %A Holan, S.H. %B Australian & New Zealand Journal of Statistics %V 57 %P 15-29 %G eng %U http://arxiv.org/abs/1310.7211 %0 Journal Article %J Stat %D 2015 %T Spatio-temporal change of support with application to American Community Survey multi-year period estimates %A Bradley, Jonathan R. %A Wikle, Christopher K. %A Holan, Scott H. 
%K Bayesian %K change-of-support %K dynamical %K hierarchical models %K mixed-effects model %K Moran's I %K multi-year period estimate %X We present hierarchical Bayesian methodology to perform spatio-temporal change of support (COS) for survey data with Gaussian sampling errors. This methodology is motivated by the American Community Survey (ACS), which is an ongoing survey administered by the US Census Bureau that provides timely information on several key demographic variables. The ACS has published 1-year, 3-year, and 5-year period estimates, and margins of error, for demographic and socio-economic variables recorded over predefined geographies. The spatio-temporal COS methodology considered here provides data users with a way to estimate ACS variables on customized geographies and time periods while accounting for sampling errors. Additionally, 3-year ACS period estimates are to be discontinued, and this methodology can provide predictions of ACS variables for 3-year periods given the available period estimates. The methodology is based on a spatio-temporal mixed-effects model with a low-dimensional spatio-temporal basis function representation, which provides multi-resolution estimates through basis function aggregation in space and time. This methodology includes a novel parameterization that uses a target dynamical process and recently proposed parsimonious Moran's I propagator structures. Our approach is demonstrated through two applications using public-use ACS estimates and is shown to produce good predictions on a hold-out set of 3-year period estimates. Copyright © 2015 John Wiley & Sons, Ltd. %B Stat %V 4 %P 255–270 %8 10/2015 %G eng %U http://dx.doi.org/10.1002/sta4.94 %R 10.1002/sta4.94 %0 Journal Article %J Journal of Official Statistics %D 2015 %T Statistical Disclosure Limitation in the Presence of Edit Rules %A Kim, H.J. %A Karr, A.F. %A Reiter, J.P. 
%B Journal of Official Statistics %V 31 %P 121-138 %G eng %& 121 %0 Journal Article %J Geological Society %D 2015 %T A stochastic bioenergetics model based approach to translating large river flow and temperature into fish population responses: the pallid sturgeon example %A Wildhaber, M.L. %A Dey, R. %A Wikle, C.K. %A Anderson, C.J. %A Moran, E.H. %A Franz, K.J. %B Geological Society %V 408 %G eng %R 10.1144/SP408.10 %0 Journal Article %J ArXiv %D 2015 %T Stop or continue data collection: A nonignorable missing data approach for continuous variables %A T. Paiva %A J.P. Reiter %K Methodology %X We present an approach to inform decisions about nonresponse followup sampling. The basic idea is (i) to create completed samples by imputing nonrespondents' data under various assumptions about the nonresponse mechanisms, (ii) to take hypothetical samples of varying sizes from the completed samples, and (iii) to compute and compare measures of accuracy and cost for different proposed sample sizes. As part of the methodology, we present a new approach for generating imputations for multivariate continuous data with nonignorable unit nonresponse. We fit mixtures of multivariate normal distributions to the respondents' data, and adjust the probabilities of the mixture components to generate nonrespondents' distributions with desired features. We illustrate the approaches using data from the 2007 U. S. Census of Manufactures. %B ArXiv %8 11/2015 %G eng %U http://arxiv.org/abs/1511.02189 %N 1511.02189 %0 Journal Article %J Annals of the Association of American Geographers %D 2015 %T Studying Neighborhoods Using Uncertain Data from the American Community Survey: A Contextual Approach %A Seth E. Spielman %A Alex Singleton %X In 2010 the American Community Survey (ACS) replaced the long form of the decennial census as the sole national source of demographic and economic data for small geographic areas such as census tracts. 
These small area estimates suffer from large margins of error, however, which makes the data difficult to use for many purposes. The value of a large and comprehensive survey like the ACS is that it provides a richly detailed, multivariate, composite picture of small areas. This article argues that one solution to the problem of large margins of error in the ACS is to shift from a variable-based mode of inquiry to one that emphasizes a composite multivariate picture of census tracts. Because the margin of error in a single ACS estimate, like household income, is assumed to be a symmetrically distributed random variable, positive and negative errors are equally likely. Because the variable-specific estimates are largely independent from each other, when looking at a large collection of variables these random errors average to zero. This means that although single variables can be methodologically problematic at the census tract scale, a large collection of such variables provides utility as a contextual descriptor of the place(s) under investigation. This idea is demonstrated by developing a geodemographic typology of all U.S. census tracts. The typology is firmly rooted in the social scientific literature and is organized around a framework of concepts, domains, and measures. The typology is validated using public domain data from the City of Chicago and the U.S. Federal Election Commission. The typology, as well as the data and methods used to create it, is open source and published freely online. %B Annals of the Association of American Geographers %V 105 %P 1003-1025 %G eng %U http://dx.doi.org/10.1080/00045608.2015.1052335 %R 10.1080/00045608.2015.1052335 %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Survey Informatics: The Future of Survey Methodology and Survey Statistics Training in the Academy? %A Allan L. 
McCutcheon %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2015 %T Synthetic Establishment Microdata Around the World %A Vilhuber, Lars %A Abowd, John A. %A Reiter, Jerome P. %X In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature. %I Cornell University %G eng %U http://hdl.handle.net/1813/42340 %9 Preprint %0 Journal Article %J The Russell Sage Foundation Journal of the Social Sciences %D 2015 %T Understanding the Dynamics of $2-a-Day Poverty in the United States %A Shaefer, H. Luke %A Edin, Kathryn %A Talbert, E. %B The Russell Sage Foundation Journal of the Social Sciences %V 1 %G eng %N Severe Deprivation %0 Journal Article %J IEEE Computer %D 2015 %T Understanding the Human Condition through Survey Informatics %A Eck, A. %A Leen-Kiat, S. %A McCutcheon, A. L. %A Smyth, J.D. %A Belli, R.F. 
%B IEEE Computer %V 48 %P 112-116 %G eng %N 11 %R 10.1109/MC.2015.327 %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T The Use of Paradata to Evaluate Interview Complexity and Data Quality (in Calendar and Time Diary Surveys) %A Cordova-Cazar, A.L. %A Belli, R.F. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Using Data Mining to Examine Interviewer-Respondent Interactions in Calendar Interviews %A Belli, R.F. %A Miller, L.D. %A Soh, L.-K. %A T. Al Baghal %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 05/2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Using Machine Learning Techniques to Predict Respondent Type from A Priori Demographic Information %A Atkin, G. %A Arunachalam, H. %A Eck, A. %A Wettlaufer, D. %A Soh, L.-K. %A Belli, R.F. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 May 14-17, 2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2015 %T Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics %A Vilhuber, Lars %A Miranda, Javier %X We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. 
The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions). %I Cornell University %G eng %U http://hdl.handle.net/1813/42339 %9 Preprint %0 Conference Paper %B 2015 Joint Program in Survey Methodology (JPSM) Distinguished Lecture %D 2015 %T Web Surveys, Online Panels, and Paradata: Automating Responsive Design %A Allan L. McCutcheon %B 2015 Joint Program in Survey Methodology (JPSM) Distinguished Lecture %C University of Maryland. College Park, MD %8 04/2015 %G eng %U http://www.jpsm.umd.edu/ %0 Journal Article %J Journal of Sociology and Social Welfare %D 2015 %T Who’s Left Out? Characteristics of Households in Economic Need not Receiving Public Support %A Fusaro, V. %B Journal of Sociology and Social Welfare %V 42 %P 65-85 %G eng %N 3 %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Why Do Interviewers Speed Up? An Examination of Changes in Interviewer Behaviors over the Course of the Survey Field Period %A Olson, K. %A Smyth, J.D. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T Achieving balance: Understanding the relationship between complexity and response quality %A Powell, R.J. %A Kirchner, A. 
%B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Statistics Views %D 2014 %T Agent Based Models: Statistical Challenges and Opportunities %A Wikle, C.K. %B Statistics Views %I Wiley %G eng %U http://www.statisticsviews.com/details/feature/6354691/Agent-Based-Models-Statistical-Challenges-and-Opportunities.html %0 Book Section %B Confidentiality and Data Access in the Use of Big Data: Theory and Practical Approaches %D 2014 %T Analytical frameworks for data release: A statistical view %A A. F. Karr %A J. P. Reiter %B Confidentiality and Data Access in the Use of Big Data: Theory and Practical Approaches %I Cambridge University Press %C New York City, NY %G eng %0 Generic %D 2014 %T An Approach for Identifying and Predicting Economic Recessions in Real-Time Using Time-Frequency Functional Models, Seminar on Bayesian Inference in Econometrics and Statistics (SBIES) %A Holan, S.H. %8 May %G eng %0 Conference Paper %B Joint Statistical Meetings 2014 %D 2014 %T An Approach for Identifying and Predicting Economic Recessions in Real-Time Using Time-Frequency Functional Models %A Holan, S.H. %B Joint Statistical Meetings 2014 %I Joint Statistical Meetings %C Boston, MA %8 August %G eng %U http://www.amstat.org/meetings/jsm/2014/onlineprogram/AbstractDetails.cfm?abstractid=310841 %R 10.1002/asmb.1954 %0 Journal Article %J Annals of Statistics %D 2014 %T Asymptotic Theory of Cepstral Random Fields %A McElroy, T. %A Holan, S. %B Annals of Statistics %I University of Missouri %V 42 %P 64-86 %G eng %U http://arxiv.org/pdf/1112.1977v4.pdf %R 10.1214/13-AOS1180 %0 Book Section %B SAGE Handbook of Applied Memory %D 2014 %T Autobiographical memory dynamics in survey research %A Belli, R. F. %E T. J. Perfect %E D. S. 
Lindsay %B SAGE Handbook of Applied Memory %I Sage %G eng %U http://dx.doi.org/10.4135/9781446294703 %R 10.4135/9781446294703 %0 Generic %D 2014 %T A Bayesian Approach to Estimating Agricultural Yield Based on Multiple Repeated Surveys %A Holan, S.H. %8 March %G eng %0 Conference Paper %B Twelfth World Meeting of ISBA %D 2014 %T Bayesian Dynamic Time-Frequency Estimation %A Holan, S.H. %B Twelfth World Meeting of ISBA %I ISBA %C Cancun, Mexico %8 July %G eng %0 Journal Article %J Journal of Privacy and Confidentiality %D 2014 %T Bayesian estimation of disclosure risks for multiply imputed, synthetic data %A Reiter, J. P. %A Wang, Q. %A Zhang, B. %X Agencies seeking to disseminate public use microdata, i.e., data on individual records, can replace confidential values with multiple draws from statistical models estimated with the collected data. We present a framework for evaluating disclosure risks inherent in releasing multiply-imputed, synthetic data. The basic idea is to mimic an intruder who computes posterior distributions of confidential values given the released synthetic data and prior knowledge. We illustrate the methodology with artificial fully synthetic data and with partial synthesis of the Survey of Youth in Custody.

%B Journal of Privacy and Confidentiality %V 6 %8 2014 %G eng %U http://repository.cmu.edu/jpc/vol6/iss1/2 %N 1 %0 Journal Article %J Journal of Computational and Graphical Statistics %D 2014 %T Bayesian estimation of discrete multivariate latent structure models with structural zeros %A Manrique-Vallier, D. %A Reiter, J.P. %B Journal of Computational and Graphical Statistics %V 23 %P 1061-1079 %G eng %0 Journal Article %J Survey Methodology %D 2014 %T Bayesian multiple imputation for large-scale categorical data with structural zeros %A D. Manrique-Vallier %A J.P. Reiter %B Survey Methodology %V 40 %P 125-134 %8 06/2014 %G eng %U http://www.stat.duke.edu/ jerry/Papers/SurvMeth14.pdf %0 Report %D 2014 %T Bayesian Nonparametric Modeling for Multivariate Ordinal Regression %A DeYoreo, M. %A Kottas, A. %K Statistics - Methodology %X Univariate or multivariate ordinal responses are often assumed to arise from a latent continuous parametric distribution, with covariate effects which enter linearly. We introduce a Bayesian nonparametric modeling approach for univariate and multivariate ordinal regression, which is based on mixture modeling for the joint distribution of latent responses and covariates. The modeling framework enables highly flexible inference for ordinal regression relationships, avoiding assumptions of linearity or additivity in the covariate effects. In standard parametric ordinal regression models, computational challenges arise from identifiability constraints and estimation of parameters requiring nonstandard inferential techniques. A key feature of the nonparametric model is that it achieves inferential flexibility, while avoiding these difficulties. In particular, we establish full support of the nonparametric mixture model under fixed cut-off points that relate through discretization the latent continuous responses with the ordinal responses. 
The practical utility of the modeling approach is illustrated through application to two data sets from econometrics, an example involving regression relationships for ozone concentration, and a multirater agreement problem. %I ArXiv %G eng %U http://arxiv.org/abs/1408.1027 %0 Generic %D 2014 %T Big Data Methodology Applied to Small Area Estimation %A Porter, A.T. %8 January %G eng %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T Call back later: The association of recruitment contact and error in the American Time Use Survey %A Countryman, A. %A Cordova-Cazar, A.L. %A Deal, C.E. %A Belli, R.F. %B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Spatial and Spatio-Temporal Epidemiology %D 2014 %T A CAR model for multiple outcomes on mismatched lattices %A Porter, A.T. %A Oleson, J. %B Spatial and Spatio-Temporal Epidemiology %V 11 %P 79-88 %G eng %U http://www.sciencedirect.com/science/article/pii/S1877584514000604 %& 79 %R 10.1016/j.sste.2014.08.001 %0 Journal Article %J Applied Geography %D 2014 %T Causes and Patterns of Uncertainty in the American Community Survey %A Spielman, S. E. %A Folch, D. %A Nagle, N. %B Applied Geography %V 46 %P 147-157 %G eng %U http://www.sciencedirect.com/science/article/pii/S0143622813002518 %R 10.1016/j.apgeog.2013.11.002 %0 Report %D 2014 %T CED2AR: The Comprehensive Extensible Data Documentation and Access Repository %A Lagoze, Carl %A Vilhuber, Lars %A Williams, Jeremy %A Perry, Benjamin %A Block, William C. %X 
We describe the design, implementation, and deployment of the Comprehensive Extensible Data Documentation and Access Repository (CED2AR). This is a metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata through either a web-based user interface or programmatically through a search API, all the while reusing and linking to existing archive- and provider-generated metadata. CED2AR is distinguished from other metadata repository-based applications by requirements that derive from its social science context. These include the need to cloak confidential data and metadata and to manage complex provenance chains. Presented at the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), Sept 8-12, 2014. %I Cornell University %G eng %U http://hdl.handle.net/1813/44702 %9 Preprint %0 Report %D 2014 %T The Cepstral Model for Multivariate Time Series: The Vector Exponential Model. %A Holan, S.H. %A McElroy, T.S. %A Wu, G. %X Vector autoregressive (VAR) models have become a staple in the analysis of multivariate time series and are formulated in the time domain as difference equations, with an implied covariance structure. In many contexts, it is desirable to work with a stable, or at least stationary, representation. To fit such models, one must impose restrictions on the coefficient matrices to ensure that certain determinants are nonzero, which, except in special cases, may prove burdensome. To circumvent these difficulties, we propose a flexible frequency domain model expressed in terms of the spectral density matrix. Specifically, this paper treats the modeling of covariance stationary vector-valued (i.e., multivariate) time series via an extension of the exponential model for the spectrum of a scalar time series. We discuss the modeling advantages of the vector exponential model and its computational facets, such as how to obtain Wold coefficients from given cepstral coefficients. 
Finally, we demonstrate the utility of our approach through simulation as well as two illustrative data examples focusing on multi-step ahead forecasting and estimation of squared coherence.

%I arXiv %G eng %U http://arxiv.org/abs/1406.0801 %9 preprint %0 Conference Paper %B Joint Statistical Meetings %D 2014 %T Changes in interviewer-related error over the course of the field period: An empirical examination using paradata %A Olson, K. %A Kirchner, A. %B Joint Statistical Meetings %C Boston, MA %G eng %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T Changes in interviewer-related error over the course of the field period: An empirical examination using paradata %A Olson, K. %A Kirchner, A. %B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Transactions in GIS %D 2014 %T The Co-Evolution of Residential Segregation and the Built Environment at the Turn of the 20th Century: a Schelling Model %A Spielman, S. E. %A Harrison, P. %B Transactions in GIS %V 18 %P 25-45 %G eng %U http://onlinelibrary.wiley.com/enhanced/doi/10.1111/tgis.12014/ %R DOI: 10.1111/tgis.12014 %0 Report %D 2014 %T Collaborative Editing of DDI Metadata: The Latest from the CED2AR Project %A Perry, Benjamin %A Kambhampaty, Venkata %A Brumsted, Kyle %A Vilhuber, Lars %A Block, William %X Benjamin Perry's presentation on "Collaborative Editing and Versioning of DDI Metadata: The Latest from Cornell's NCRN CED²AR Software" at the 6th Annual European DDI User Conference in London, 12/02/2014. %I Cornell University %G eng %U http://hdl.handle.net/1813/38200 %9 Preprint %0 Conference Paper %B 39th Annual Conference of the Midwest Association for Public Opinion Research %D 2014 %T Commitment, concealment, and confusion: An empirical assessment of interviewer and respondent behaviors in survey interviews %A Kirchner, A. %A Olson, K. 
%B 39th Annual Conference of the Midwest Association for Public Opinion Research %C Chicago, IL %8 11/2014 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2014 %T Communicating Uncertainty in Official Economic Statistics %A Manski, Charles %X Federal statistical agencies in the United States and analogous agencies elsewhere commonly report official economic statistics as point estimates, without accompanying measures of error. Users of the statistics may incorrectly view them as error-free or may incorrectly conjecture error magnitudes. This paper discusses strategies to mitigate misinterpretation of official statistics by communicating uncertainty to the public. Sampling error can be measured using established statistical principles. The challenge is to satisfactorily measure the various forms of nonsampling error. I find it useful to distinguish transitory statistical uncertainty, permanent statistical uncertainty, and conceptual uncertainty. I illustrate how each arises as the Bureau of Economic Analysis periodically revises GDP estimates, the Census Bureau generates household income statistics from surveys with nonresponse, and the Bureau of Labor Statistics seasonally adjusts employment statistics. %I Northwestern University %G eng %U http://hdl.handle.net/1813/36323 %9 Preprint %0 Report %D 2014 %T Communicating Uncertainty in Official Economic Statistics: An Appraisal Fifty Years after Morgenstern %A Manski, Charles F. %X Federal statistical agencies in the United States and analogous agencies elsewhere commonly report official economic statistics as point estimates, without accompanying measures of error. Users of the statistics may incorrectly view them as error-free or may incorrectly conjecture error magnitudes. 
This paper discusses strategies to mitigate misinterpretation of official statistics by communicating uncertainty to the public. Sampling error can be measured using established statistical principles. The challenge is to satisfactorily measure the various forms of nonsampling error. I find it useful to distinguish transitory statistical uncertainty, permanent statistical uncertainty, and conceptual uncertainty. I illustrate how each arises as the Bureau of Economic Analysis periodically revises GDP estimates, the Census Bureau generates household income statistics from surveys with nonresponse, and the Bureau of Labor Statistics seasonally adjusts employment statistics. I anchor my discussion of communication of uncertainty in the contribution of Morgenstern (1963), who argued forcefully for agency publication of error estimates for official economic statistics.

%I Northwestern University %8 10/2014 %G eng %U http://hdl.handle.net/1813/40830 %9 Preprint %0 Thesis %D 2014 %T Comparing models of Demographic Subpopulations (Master's Thesis) %A Moehl, J. %I University of Tennessee %G eng %U http://trace.tennessee.edu/utk_gradthes/2835/; http://trace.tennessee.edu/cgi/viewcontent.cgi?article=4005&context=utk_gradthes %9 masters %0 Book Section %B Privacy in Statistical Databases %D 2014 %T A Comparison of Blocking Methods for Record Linkage %A Steorts, R. %A Ventura, S. %A Sadinle, M. %A Fienberg, S. E. %A Domingo-Ferrer, J. %B Privacy in Statistical Databases %I Springer %V 8744 %P 253–268 %G eng %U http://link.springer.com/chapter/10.1007/978-3-319-11257-2_20 %R 10.1007/978-3-319-11257-2_20 %0 Journal Article %J ArXiv %D 2014 %T A Comparison of Spatial Predictors when Datasets Could be Very Large %A Bradley, J. R. %A Cressie, N. %A Shi, T. %K Statistics - Methodology %X In this article, we review and compare a number of methods of spatial prediction. To demonstrate the breadth of available choices, we consider both traditional and more-recently-introduced spatial predictors. Specifically, in our exposition we review: traditional stationary kriging, smoothing splines, negative-exponential distance-weighting, Fixed Rank Kriging, modified predictive processes, a stochastic partial differential equation approach, and lattice kriging. This comparison is meant to provide a service to practitioners wishing to decide between spatial predictors. Hence, we provide technical material for the unfamiliar, which includes the definition and motivation for each (deterministic and stochastic) spatial predictor. We use a benchmark dataset of

The work of Seth Spielman and Nicholas Nagle was noted in this article in City Lab, a publication from The Atlantic magazine, available at http://www.citylab.com/design/2014/11/how-to-make-a-better-map-according-to-science/382898/.

%I Citylab %G eng %U http://www.citylab.com/design/2014/11/how-to-make-a-better-map-according-to-science/382898/ %9 Online %0 Journal Article %J Journal of Personality and Social Psychology %D 2014 %T I Cheated, but Only a Little: Partial Confessions to Unethical Behavior %A Peer, E. %A Acquisti, A. %A Shalvi, S. %B Journal of Personality and Social Psychology %V 106 %P 202–217 %G eng %0 Journal Article %J International Journal of Geographic Information Science %D 2014 %T Identifying Regions based on Flexible User Defined Constraints %A Folch, D. %A Spielman, S. E. %B International Journal of Geographic Information Science %V 28 %P 164-184 %G eng %U http://www.tandfonline.com/doi/abs/10.1080/13658816.2013.848986 %R 10.1080/13658816.2013.848986 %0 Journal Article %J Statistics in Medicine %D 2014 %T Imputation of confidential data sets with spatial locations using disease mapping models %A T. Paiva %A A. Chakraborty %A J.P. Reiter %A A.E. Gelfand %B Statistics in Medicine %V 33 %P 1928-1945 %G eng %0 Report %D 2014 %T Interval Estimates for Official Statistics with Survey Nonresponse %A Manski, C. %G eng %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T Interviewer variance and prevalence of verbal behaviors in calendar and conventional interviewing %A Belli, R.F. %A Charoenruk, N. %B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B XVIII International Sociological Association World Congress of Sociology %D 2014 %T Interviewer variance of interviewer and respondent behaviors: A comparison between calendar and conventional interviewing %A Belli, R.F.
%A Charoenruk, N. %B XVIII International Sociological Association World Congress of Sociology %C Yokohama, Japan %G eng %U https://isaconf.confex.com/isaconf/wc2014/webprogram/Paper34278.html %0 Journal Article %J Annals of Applied Statistics %D 2014 %T Longitudinal mixed membership trajectory models for disability survey data %A Manrique-Vallier, D %B Annals of Applied Statistics %V 8 %P 2268-2291 %G eng %& 2268 %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T Making sense of paradata: Challenges faced and lessons learned %A Eck, A. %A Stuart, L. %A Atkin, G. %A Soh, L-K %A McCutcheon, A.L. %A Belli, R.F. %B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B UNL/SRAM/Gallup Symposium %D 2014 %T Making Sense of Paradata: Challenges Faced and Lessons Learned %A Eck, A. %A Stuart, L. %A Atkin, G. %A Soh, L-K %A McCutcheon, A.L. %A Belli, R.F. %B UNL/SRAM/Gallup Symposium %C Omaha, NE %G eng %U http://grc.unl.edu/unlsramgallup-symposium %0 Journal Article %J Journal of Computational and Graphical Statistics %D 2014 %T Multiple imputation by ordered monotone blocks with application to the Anthrax Vaccine Adsorbed Trial %A Li, Fan %A Baccini, Michela %A Mealli, Fabrizia %A Zell, Elizabeth R. %A Frangakis, Constantine E. %A Rubin, Donald B %B Journal of Computational and Graphical Statistics %V 23 %P 877-892 %G eng %U http://www.tandfonline.com/doi/abs/10.1080/10618600.2013.826583 %R 10.1080/10618600.2013.826583 %0 Thesis %B Department of Statistical Sciences %D 2014 %T Multiple Imputation Methods for Nonignorable Nonresponse, Adaptive Survey Design, and Dissemination of Synthetic Geographies (Ph.D. thesis) %A Thais Paiva %B Department of Statistical Sciences %I Duke University %V Ph.D.
%G eng %U http://dukespace.lib.duke.edu/dspace/handle/10161/9406 %9 phd %0 Journal Article %J Journal of Business and Economic Statistics %D 2014 %T Multiple imputation of missing or faulty values under linear constraints %A Kim, H. J. %A Reiter, J. P. %A Wang, Q. %A Cox, L. H. %A Karr, A. F. %X Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.

%B Journal of Business and Economic Statistics %V 32 %P 375-386 %G eng %& 375 %R 10.1080/07350015.2014.885435 %0 Report %D 2014 %T NCRN Meeting Fall 2014 %A Vilhuber, Lars %X Held at the ILR NYC Conference Center. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/45868 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Fall 2014: Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography %A Quick, Harrison %A Holan, Scott %A Wikle, Christopher %A Reiter, Jerry %X Presentation from NCRN Fall 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/37750 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Fall 2014: Change in Visible Impervious Surface Area in Southeastern Michigan Before and After the "Great Recession" %A Wilson, Courtney %A Brown, Daniel G. %X Presentation at Fall 2014 NCRN meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/37446 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Fall 2014: Constrained Smoothed Bayesian Estimation %A Steorts, Rebecca %A Shalizi, Cosma %X Presentation from NCRN Fall 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/37748 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Fall 2014: Decomposing Medical-Care Expenditure Growth %A Dunn, Abe %A Liebman, Eli %A Shapiro, Adam %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/37411 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Fall 2014: Designer Census Geographies %A Spielman, Seth %X Presentation from NCRN Fall 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/37747 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Fall 2014: Geographic linkages between National Center for Health Statistics’ population health surveys and air quality measures %A Parker, Jennifer %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/37412 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Fall 2014: Mixed Effects Modeling for Multivariate-Spatio-Temporal Areal Data %A Bradley, Jonathan %A Holan, Scott %A Wikle, Christopher %X Presentation from NCRN Fall 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/37749 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Fall 2014:
Respondent-Driven Sampling Estimation and the National HIV Behavioral Surveillance System %A Spiller, Michael (Trey) %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/37414 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014 %A Vilhuber, Lars %X Held at the Census Headquarters, Washington, DC. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/45869 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Adaptive Protocols and the DDI 4 Process Model %A Greenfield, Jay %A Kuan, Sophia %X Presentation from NCRN Spring 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36393 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Aiming at a More Cost-Effective Census Via Online Data Collection: Privacy Trade-Offs of Geo-Location %A Brandimarte, Laura %A Acquisti, Alessandro %X Presentation at NCRN Spring 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36397 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Imputation of multivariate continuous data with non-ignorable missingness %A Paiva, Thais %A Reiter, Jerry %X Presentation at Spring 2014 NCRN meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36399 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Integrating PROV with DDI: Mechanisms of Data Discovery within the U.S. Census Bureau %A Block, William %A Brown, Warren %A Williams, Jeremy %A Vilhuber, Lars %A Lagoze, Carl %X Presentation at NCRN Spring 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36392 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Introduction %A Thompson, John %X NCRN Spring 2014 Meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36395 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Metadata Standards & Technology Development for the NSF Survey of Earned Doctorates %A Noonan, Kimberly %A Heus, Pascal %A Mulcahy, Tim %X Presentation from NCRN Spring 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36394 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Research Program and Enterprise Architecture for Adaptive Survey Design at Census %A Miller, Peter %A Mathur, Anup %A Thieme, Michael %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36400 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Summer Working Group for Employer List Linking (SWELL) %A Gathright, Graton %A Kutzbach, Mark %A Mccue, Kristin %A McEntarfer, Erika %A Monti, Holly %A Trageser, Kelly %A Vilhuber, Lars %A Wasi, Nada %A Wignall, Christopher %X Presentation for NCRN Spring 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36396 %9 Preprint %0 Report %D 2014 %T NCRN Meeting Spring 2014: Web Surveys, Online Panels, and Paradata: Automating Adaptive Design %A McCutcheon, Allan %X Presentation at NCRN Spring 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36398 %9 Preprint %0 Report %D 2014 %T NCRN Newsletter: Volume 1 - Issue 2 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X Overview of activities at NSF-Census Research Network nodes from November 2013 to March 2014. NCRN Newsletter Vol. 1, Issue 2: March 20, 2014 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40233 %9 Preprint %0 Report %D 2014 %T NCRN Newsletter: Volume 1 - Issue 3 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X Overview of activities at NSF-Census Research Network nodes from March 2014 to July 2014. NCRN Newsletter Vol. 1, Issue 3: July 23, 2014 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40234 %9 Preprint %0 Report %D 2014 %T NCRN Newsletter: Volume 1 - Issue 4 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X Overview of activities at NSF-Census Research Network nodes from July 2014 to October 2014. NCRN Newsletter Vol.
1, Issue 4: October 15, 2014 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40192 %9 Preprint %0 Report %D 2014 %T A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data %A Schneider, Matthew J. %A Abowd, John M. %X Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between confidentiality protection and inference quality. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The United States Census Bureau collects millions of interrelated time series micro-data that are hierarchical and contain many zeros and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian Generalized Linear Mixed Models (BGLMM) with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the magnitude or number of entities. We find that as the prior distributions of the variance components in the BGLMM become more precise toward zero, confidentiality protection increases and inference quality deteriorates. We evaluate our methodology using a strict privacy measure, empirical differential privacy, and a newly defined risk measure, Probability of Range Identification (PoRI), which directly measures attribute disclosure risk. We illustrate our results with the U.S. Census Bureau’s Quarterly Workforce Indicators. %I Cornell University %G eng %U http://hdl.handle.net/1813/40828 %9 Preprint %0 Generic %D 2014 %T NewsViews: An Automated Pipeline for Creating Custom Geovisualizations for News %A Gao, T. %A Hullman, J. %A Adar, E. %A Hecht, B. %A Diakopoulos, N. %X Interactive visualizations add rich, data-based context to online news articles. Geographic maps are currently the most prevalent form of these visualizations. Unfortunately, designers capable of producing high-quality, customized geovisualizations are scarce. We present NewsViews, a novel automated news visualization system that generates interactive, annotated maps without requiring professional designers. NewsViews’ maps support trend identification and data comparisons relevant to a given news article. The NewsViews system leverages text mining to identify key concepts and locations discussed in articles (as well as potential annotations), an extensive repository of “found” databases, and techniques adapted from cartography to identify and create visually “interesting” thematic maps. In this work, we develop and evaluate key criteria in automatic, annotated, map generation and experimentally validate the key features for successful representations (e.g., relevance to context, variable selection, "interestingness" of representation and annotation quality). %G eng %U http://cond.org/newsviews.html %R 10.1145/2556288.2557228 %0 Journal Article %J The Professional Geographer %D 2014 %T The Past, Present, and Future of Geodemographic Research in the United States and United Kingdom %A Singleton, A. %A Spielman, S. E. %B The Professional Geographer %V 4 %G eng %0 Conference Paper %B Joint Statistical Meetings 2014 %D 2014 %T The Poisson Change of Support Problem with Applications to the American Community Survey %A Bradley, J.R.
%B Joint Statistical Meetings 2014 %G eng %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T Predicting Survey Breakoff in Online Survey Panels %A McCutcheon, A.L. %B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2014 %T Reducing Uncertainty in the American Community Survey through Data-Driven Regionalization %A Spielman, Seth %A Folch, David %X The American Community Survey (ACS) is the largest US survey of households and is the principal source for neighborhood scale information about the US population and economy. The ACS is used to allocate billions in federal spending and is a critical input to social scientific research in the US. However, estimates from the ACS can be highly unreliable. For example, in over 72% of census tracts the estimated number of children under 5 in poverty has a margin of error greater than the estimate. Uncertainty of this magnitude complicates the use of social data in policy making, research, and governance. This article develops a spatial optimization algorithm that is capable of reducing the margins of error in survey data via the creation of new composite geographies, a process called regionalization. Regionalization is a complex combinatorial problem. Here, rather than focusing on the technical aspects of regionalization, we demonstrate how to use a purpose-built open source regionalization algorithm to post-process survey data in order to reduce the margins of error to some user-specified threshold.
%I University of Colorado at Boulder / University of Tennessee %G eng %U http://hdl.handle.net/1813/38121 %9 Preprint %0 Conference Paper %B Paper presented at the annual conference of the Midwest Association for Public Opinion Research %D 2014 %T Remembering where: A look at the American Time Use Survey %A Deal, C. %A Cordova-Cazar, A.L. %A Countryman, A. %A Kirchner, A. %A Belli, R.F. %B Paper presented at the annual conference of the Midwest Association for Public Opinion Research %C Chicago, IL %8 11/2014 %G eng %U http://www.mapor.org/conferences.html %0 Journal Article %J Behavior Research Methods %D 2014 %T Reputation as a Sufficient Condition for Data Quality on Amazon Mechanical Turk %A Peer, E. %A Vosgerau, J. %A Acquisti, A. %B Behavior Research Methods %V 46 %P 1023–1031 %8 December %G eng %0 Book Section %B The Routledge Handbook of Poverty in the United States %D 2014 %T The Rise of Incarceration Among the Poor with Mental Illnesses: How Neoliberal Policies Contribute %A Camp, J. %A Haymes, S. %A Haymes, M. V. d. %A Miller, R.J. %B The Routledge Handbook of Poverty in the United States %I Routledge %G eng %0 Conference Paper %B Midwest Association for Public Opinion Research Annual Conference %D 2014 %T The Role of Device Type in Internet Panel Survey Breakoff %A McCutcheon, A.L. %B Midwest Association for Public Opinion Research Annual Conference %C Chicago, IL %G eng %U http://www.mapor.org/conferences.html %0 Journal Article %J Poverty & Public Policy %D 2014 %T Savings from ages 16 to 35: A test to inform Child Development Account policy %A Friedline, T. %A Nam, I. %B Poverty & Public Policy %V 6 %G eng %U http://onlinelibrary.wiley.com/store/10.1002/pop4.59/asset/pop459.pdf %N 1 %& 46-70 %R 10.1002/pop4.59 %0 Journal Article %J Research Policy %D 2014 %T Seeing the Non-Stars: (Some) Sources of Bias in Past Disambiguation Approaches and a New Public Tool Leveraging Labeled Records %A Ventura, S. %A Nugent, R. %A Fuchs, E. 
%B Research Policy %8 December %G eng %0 Generic %D 2014 %T SIPP: From Conventional Questionnaire to Event History Calendar Interviewing %A Belli, R.F. %8 February %G eng %0 Conference Paper %B AISTATS 2014 Proceedings, JMLR %D 2014 %T SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication %A Steorts, R. %A Hall, R. %A Fienberg, S. E. %B AISTATS 2014 Proceedings, JMLR %I W&CP %V 33 %P 922–930 %G eng %0 Report %D 2014 %T Sorting Between and Within Industries: A Testable Model of Assortative Matching %A Abowd, John M. %A Kramarz, Francis %A Perez-Duarte, Sebastien %A Schmutte, Ian M. %X We test Shimer's (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting: more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated. %I Cornell University %G eng %U http://hdl.handle.net/1813/52607 %9 Preprint %0 Journal Article %J Cartography and Geographic Information Science %D 2014 %T Spatial Collective Intelligence? Accuracy, Credibility in Crowdsourced Data %A Spielman, S. E.
%B Cartography and Geographic Information Science %V 41 %P 115-124 %G eng %U http://go.galegroup.com/ps/i.do?action=interpret&id=GALE|A361943563&v=2.1&u=nysl_sc_cornl&it=r&p=AONE&sw=w&authCount=1 %N 2 %R http://dx.doi.org/10.1080/15230406.2013.874200 %0 Generic %D 2014 %T Spatial Fay-Herriot Models for Small Area Estimation With Functional Covariates %A Holan, S.H. %8 January %G eng %0 Journal Article %J Spatial Statistics %D 2014 %T Spatial Fay-Herriot Models for Small Area Estimation with Functional Covariates %A Porter, A. T. %A Holan, S.H. %A Wikle, C.K. %A Cressie, N. %B Spatial Statistics %V 10 %P 27-42 %G eng %U http://arxiv.org/pdf/1303.6668v3.pdf %( 2013 %0 Conference Paper %B Proceedings of the Workshop on Usable Security (USEC) %D 2014 %T Spiny CACTOS: OSN Users Attitudes and Perceptions Towards Cryptographic Access Control Tools %A Balsa, E. %A Brandimarte, L. %A Acquisti, A. %A Diaz, C. %A Gürses, S. %B Proceedings of the Workshop on Usable Security (USEC) %G eng %U https://www.internetsociety.org/doc/spiny-cactos-osn-users-attitudes-and-perceptions-towards-cryptographic-access-control-tools %0 Conference Paper %B GIScience Workshop on Uncertainty Visualization %D 2014 %T Supporting Planners' Work with Uncertain Demographic Data %A Griffin, A. L. %A Spielman, S. E. %A Jurjevich, J. %A Merrick, M. %A Nagle, N. N. %A Folch, D. C. %B GIScience Workshop on Uncertainty Visualization %V 23 %G eng %U http://cognitivegiscience.psu.edu/uncertainty2014/papers/griffin_demographic.pdf %0 Conference Paper %B Proceedings of IEEE VIS 2014 %D 2014 %T Supporting Planners' work with Uncertain Demographic Data %A Griffin, A. L. %A Spielman, S. E. %A Nagle, N. N. %A Jurjevich, J. %A Merrick, M. %A Folch, D. C.
%B Proceedings of IEEE VIS 2014 %I Proceedings of IEEE VIS 2014 %P 9–14 %G eng %U http://cognitivegiscience.psu.edu/uncertainty2014/papers/griffin_demographic.pdf %0 Conference Paper %B Joint Statistical Meetings 2014 %D 2014 %T Survey Fusion for Data that Exhibit Multivariate, Spatio-Temporal Dependencies %A Bradley, J.R. %B Joint Statistical Meetings 2014 %G eng %0 Conference Paper %B UNL/SRAM/Gallup Symposium %D 2014 %T Survey Informatics: Ideas, Opportunities, and Discussions %A Eck, A. %A Soh, L-K %B UNL/SRAM/Gallup Symposium %C Omaha, NE %G eng %U http://grc.unl.edu/unlsramgallup-symposium %0 Generic %D 2014 %T A Survey of Contemporary Spatial Models for Small Area Estimation %A Porter, A.T. %8 January %G eng %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2014 %T SynLBD 2.0: Improving the Synthetic Longitudinal Business Database %A S. K. Kinney %A J. P. Reiter %A J. Miranda %B Statistical Journal of the International Association for Official Statistics %V 30 %P 129-135 %G eng %0 Journal Article %J Journal of Privacy and Confidentiality %D 2014 %T Top-Coding and Public Use Microdata Samples from the U.S. Census Bureau %A Crimi, N. %A Eddy, W. C. %B Journal of Privacy and Confidentiality %V 6 %P 21–58 %G eng %U http://repository.cmu.edu/jpc/vol6/iss2/2/ %0 Journal Article %J The St. Louis Federal Reserve Bulletin %D 2014 %T Toward healthy balance sheets: Savings accounts as a gateway for young adults’ asset diversification and accumulation %A Friedline, T. %A Johnson, P. %A Hughes, R. %B The St. Louis Federal Reserve Bulletin %G eng %U http://research.stlouisfed.org/publications/review/2014/q4/friedline.pdf %0 Thesis %D 2014 %T Towards an Understanding of Dynamics Between Race, Population Movement, and the Built Environment of American Cities (undergraduate honors thesis) %A Bellman, B. 
%I University of Colorado at Boulder %G eng %9 Undergraduate Honors Thesis %0 Report %D 2014 %T Twitter, Big Data, and Jobs Numbers %A Hudomiet, Peter %B LSA Today %G eng %U http://www.lsa.umich.edu/lsa/ci.twitterbigdataandjobsnumbers_ci.detail %9 online %0 Report %D 2014 %T Uncertain Uncertainty: Spatial Variation in the Quality of American Community Survey Estimates %A Folch, David C. %A Arribas-Bel, Daniel %A Koschinsky, Julia %A Spielman, Seth E. %X The U.S. Census Bureau's American Community Survey (ACS) is the foundation of social science research, much federal resource allocation and the development of public policy and private sector decisions. However, the high uncertainty associated with some of the ACS's most frequently used estimates can jeopardize the accuracy of inferences based on these data. While there is high-level understanding in the research community that problems exist in the data, the sources and implications of these problems have been largely overlooked. Using 2006-2010 ACS median household income at the census tract scale as the test case (where a third of small-area estimates have higher than recommended errors), we explore the patterns in the uncertainty of ACS data. We consider various potential sources of uncertainty in the data, ranging from response level to geographic location to characteristics of the place. We find that there exist systematic patterns in the uncertainty in both the spatial and attribute dimensions. Using a regression framework, we identify the factors that are most frequently correlated with the error at national, regional and metropolitan area scales, and find these correlates are not consistent across the various locations tested. The implication is that data quality varies in different places, making cross-sectional analysis both within and across regions less reliable. We also present general advice for data users and potential solutions to the challenges identified. %I University of Colorado at Boulder / University of Tennessee %G eng %U http://hdl.handle.net/1813/38122 %9 Preprint %0 Book Section %B Online Panel Surveys: An Interdisciplinary Approach %D 2014 %T The Untold Story of Multi-Mode (Online and Mail) Consumer Panels: From Optimal Recruitment to Retention and Attrition %A McCutcheon, Allan L. %A Rao, K. %A Kaminska, O. %E Callegaro, M. %E Baker, R. %E Bethlehem, J. %E Göritz, A. %E Krosnick, J. %E Lavrakas, P. %B Online Panel Surveys: An Interdisciplinary Approach %I Wiley %G eng %R 10.1002/9781118763520.ch5 %0 Journal Article %J Unpublished manuscript, University of Michigan %D 2014 %T An updated method for calculating income and payroll taxes from PSID data using the NBER’s TAXSIM, for PSID survey years 1999 through 2011 %A Kimberlin, Sara %A Kim, Jiyoun %A Shaefer, Luke %X This paper describes a method to calculate income and payroll taxes from Panel Study of Income Dynamics data using the NBER's Internet TAXSIM version 9 (http://users.nber.org/~taxsim/taxsim9/), for PSID survey years 1999, 2001, 2003, 2005, 2007, 2009, and 2011 (tax years n-1). These methods are implemented in two Stata programs, designed to be used with the PSID public-use zipped Main Interview data files: PSID_TAXSIM_1of2.do and PSID_TAXSIM_2of2.do. The main program (2of2) was written by Sara Kimberlin (skimberlin@berkeley.edu) and generates all TAXSIM input variables, runs TAXSIM, adjusts tax estimates using additional information available in PSID data, and calculates total PSID family unit taxes.
A separate program (1of2) was written by Jiyoon (June) Kim (junekim@umich.edu) in collaboration with Luke Shaefer (lshaefer@umich.edu) to calculate mortgage interest for itemized deductions; this program needs to be run first, before the main program. Jonathan Latner contributed code to use the programs with the PSID zipped data. The overall methods build on the strategy for using TAXSIM with PSID data outlined by Butrica & Burkhauser (1997), with some expansions and modifications. Note that the methods described below are designed to prioritize accuracy of income taxes calculated for low-income households, particularly refundable tax credits such as the Earned Income Tax Credit (EITC) and the Additional Child Tax Credit. Income tax liability is generally low for low-income households, and the amount of refundable tax credits is often substantially larger than tax liabilities for this population. Payroll tax can also be substantial for low-income households. Thus the methods below focus on maximizing accuracy of income tax and payroll tax calculations for low-income families, with less attention to tax items that largely impact higher-income households (e.g. the treatment of capital gains). %B Unpublished manuscript, University of Michigan. Accessed May %V 6 %P 2016 %G eng %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T The use of paradata (in time use surveys) to better evaluate data quality %A Cordova-Cazar, A.L. %A Belli, R.F. 
%B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2014 %T Using partially synthetic data to replace suppression in the Business Dynamics Statistics: early results %A Miranda, Javier %A Vilhuber, Lars %X The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells. %I Cornell University %G eng %U http://hdl.handle.net/1813/40852 %9 Preprint %0 Journal Article %J Privacy in Statistical Databases %D 2014 %T Using Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results %A J. Miranda %A L. Vilhuber %X The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. 
This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells. %B Privacy in Statistical Databases %P 232-242 %@ 978-3-319-11256-5 %G eng %U http://dx.doi.org/10.1007/978-3-319-11257-2_18 %R 10.1007/978-3-319-11257-2_18 %0 Report %D 2014 %T Using Social Media to Measure Labor Market Flows %A Antenucci, Dolan %A Cafarella, Michael J %A Levenstein, Margaret C. %A Ré, Christopher %A Shapiro, Matthew %G eng %U http://www-personal.umich.edu/~shapiro/papers/LaborFlowsSocialMedia.pdf %9 Mimeo %0 Conference Paper %B NSF-Census Research Network (NCRN) Spring Meeting %D 2014 %T Web Surveys, Online Panels, and Paradata: Automating Adaptive Design %A McCutcheon, A.L. %B NSF-Census Research Network (NCRN) Spring Meeting %C Washington, DC %G eng %U http://www.ncrn.info/event/ncrn-meeting-spring-2014 %0 Journal Article %J Journal of Survey Statistics and Methodology %D 2014 %T What are You Doing Now? Activity Level Responses and Errors in the American Time Use Survey %A T. Al Baghal %A Belli, R.F. %A Phillips, A.L. %A Ruther, N. %B Journal of Survey Statistics and Methodology %V 2 %G eng %N 4 %& 519-537 %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2014 %T Why data availability is such a hard problem %A A. F. Karr %K Data Archive %K Data availability %K public good %K replicability %K reproducibility %X If data availability were a simple problem, it would already have been resolved. In this paper, I argue that by viewing data availability as a public good, it is possible to both understand the complexities with which it is fraught and identify a path to a solution. 
%B Statistical Journal of the International Association for Official Statistics %V 30 %8 06/2014 %G eng %N 2 %& 101-107 %0 Conference Paper %B Proceedings of the Tenth Symposium on Usable Privacy and Security (SOUPS) %D 2014 %T Would a Privacy Fundamentalist Sell their DNA for $1000... if Nothing Bad Happened Thereafter? A Study of the Westin Categories, Behavioral Intentions, and Consequences %A Woodruff, A. %A Pihur, V. %A Acquisti, A. %A Consolvo, S. %A Schmidt, L. %A Brandimarte, L. %B Proceedings of the Tenth Symposium on Usable Privacy and Security (SOUPS) %I ACM %C New York, NY %G eng %U https://www.usenix.org/conference/soups2014/proceedings/presentation/woodruff %0 Journal Article %J The American Statistician %D 2013 %T Are independent parameter draws necessary for multiple imputation? %A Hu, J. %A Mitra, R. %A Reiter, J.P. %B The American Statistician %V 67 %P 143-149 %G eng %U http://www.tandfonline.com/doi/full/10.1080/00031305.2013.821953 %R 10.1080/00031305.2013.821953 %0 Generic %D 2013 %T A Bayesian Approach to Estimating Agricultural Yield Based on Multiple Repeated Surveys, Institute of Public Policy and the Truman School of Public Affairs %A Holan, S.H. %8 March %G eng %0 Report %D 2013 %T A Bayesian Approach to Graphical Record Linkage and De-duplication %A Steorts, Rebecca C. %A Hall, Rob %A Fienberg, Stephen E. %X We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. 
This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household Income and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online. %B arXiv %G eng %U https://arxiv.org/abs/1312.4645 %0 Generic %D 2013 %T Bayesian inference for the Spatial Random Effects Model %A Cressie, N. %B Department of Statistics, Macquarie University %I Macquarie University %8 July %G eng %0 Conference Paper %B Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013 %D 2013 %T Bayesian learning of joint distributions of objects %A Banerjee, A. %A Murray, J. %A Dunson, D. B. %X There is increasing interest in broad application areas in defining flexible joint models for data having a variety of measurement scales, while also allowing data of complex types, such as functions, images and documents. We consider a general framework for nonparametric Bayes joint modeling through mixture models that incorporate dependence across data types through a joint mixing measure. 
The mixing measure is assigned a novel infinite tensor factorization (ITF) prior that allows flexible dependence in cluster allocation across data types. The ITF prior is formulated as a tensor product of stick-breaking processes. Focusing on a convenient special case corresponding to a Parafac factorization, we provide basic theory justifying the flexibility of the proposed prior and resulting asymptotic properties. Focusing on ITF mixtures of product kernels, we develop a new Gibbs sampling algorithm for routine implementation relying on slice sampling. The methods are compared with alternative joint mixture models based on Dirichlet processes and related approaches through simulations and real data applications.

Also at http://arxiv.org/abs/1303.0449

%B Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013 %G eng %U http://jmlr.csail.mit.edu/proceedings/papers/v31/banerjee13a.html %0 Conference Paper %B The Extreme Science and Engineering Discovery Environment Conference %D 2013 %T Bayesian Modeling in the Era of Big Data: the Role of High-Throughput and High-Performance Computing %A Wu, G. %B The Extreme Science and Engineering Discovery Environment Conference %C San Diego, CA %8 July %G eng %0 Report %D 2013 %T Bayesian multiple imputation for large-scale categorical data with structural zeros %A Manrique-Vallier, D. %A Reiter, J. P. %X We propose an approach for multiple imputation of items missing at random in large-scale surveys with exclusively categorical variables that have structural zeros. Our approach is to use mixtures of multinomial distributions as imputation engines, accounting for structural zeros by conceiving of the observed data as a truncated sample from a hypothetical population without structural zeros. This approach has several appealing features: imputations are generated from coherent, Bayesian joint models that automatically capture complex dependencies and readily scale to large numbers of variables. We outline a Gibbs sampling algorithm for implementing the approach, and we illustrate its potential with a repeated sampling study using public use census microdata from the state of New York, USA. 
%I Duke University / National Institute of Statistical Sciences (NISS) %G eng %U http://hdl.handle.net/1813/34889 %9 Preprint %0 Report %D 2013 %T b-Bit Minwise Hashing in Practice %A Li, Ping %A Shrivastava, Anshumali %A König, Arnd Christian %X Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. %I Cornell University %G eng %U http://hdl.handle.net/1813/37986 %9 Preprint %0 Conference Paper %B Internetware'13 %D 2013 %T b-Bit Minwise Hashing in Practice %A Ping Li %A Anshumali Shrivastava %A König, Arnd Christian %X Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80 and becomes substantially smaller than the data loading time. 
Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce very similar learning results as using fully random permutations. Experiments on datasets of up to 200GB are presented. %B Internetware'13 %8 October %G eng %U http://www.nudt.edu.cn/internetware2013/ %0 Conference Paper %B Neural Information Processing Systems (NIPS) %D 2013 %T Beyond Pairwise: Provably Fast Algorithms for Approximate K-Way Similarity Search %A Anshumali Shrivastava %A Ping Li %B Neural Information Processing Systems (NIPS) %G eng %0 Conference Paper %B Joint Statistical Meetings 2013 %D 2013 %T Binomial Mixture Models for Urban Ecological Monitoring Studies Using American Community Survey Demographic Covariates %A Wu, G. %B Joint Statistical Meetings 2013 %C Montreal, Canada %8 August %G eng %0 Conference Paper %B Transactions in GIS %D 2013 %T The Co-Evolution of Residential Segregation and the Built Environment at the Turn of the 20th Century: A Schelling Model %A S.E. Spielman %A Patrick Harrison %B Transactions in GIS %G eng %R 10.1111/tgis.12014 %0 Book Section %B Improving Surveys with Paradata: Analytic Uses of Process Information %D 2013 %T Collecting paradata for measurement error evaluation %A Olson, K. %A Parkhurst, B. %E Frauke Kreuter %B Improving Surveys with Paradata: Analytic Uses of Process Information %I John Wiley and Sons %C Hoboken, NJ. 
%P 43-72 %G eng %& Collecting paradata for measurement error evaluation %R 10.1002/9781118596869.ch3 %0 Journal Article %J The American Statistician %D 2013 %T Comment: Innovations Associated with Multiple Systems Estimation in Human Rights Settings %A Fienberg, S. E. %B The American Statistician %V 67 %G eng %0 Conference Paper %B International Workshop on Recent Advances in Statistical Inference: Theory and Case Studies %D 2013 %T Comparing and Selecting Predictors Using Local Criteria %A Cressie, N. %B International Workshop on Recent Advances in Statistical Inference: Theory and Case Studies %I International Workshop on Recent Advances in Statistical Inference: Theory and Case Studies %C Padua, Italy %8 March %G eng %0 Conference Paper %B IEEE Security & Privacy %D 2013 %T Complementary Perspectives on Privacy and Security: Economics %A Acquisti, A. %B IEEE Security & Privacy %V 11 %P 93–95 %G eng %R 10.1109/MSP.2013.30 %0 Report %D 2013 %T Credible interval estimates for official statistics with survey nonresponse %A Manski, Charles F. %X Government agencies commonly report official statistics based on survey data as point estimates, without accompanying measures of error. In the absence of agency guidance, users of the statistics can only conjecture the error magnitudes. Agencies could mitigate misinterpretation of official statistics if they were to measure potential errors and report them. Agencies could report sampling error using established statistical principles. It is more challenging to report nonsampling errors because there are many sources of such errors and there has been no consensus about how to measure them. To advance discourse on practical ways to report nonsampling error, this paper considers error due to survey nonresponse. I summarize research deriving interval estimates that make no assumptions about the values of missing data. 
In the absence of assumptions, one can obtain computable bounds on the population parameters that official statistics intend to measure. I also explore the middle ground between interval estimation making no assumptions and traditional point estimation using weights and imputations to implement assumptions that nonresponse is conditionally random. I am grateful to Aanchal Jain for excellent research assistance and to Bruce Spencer for helpful discussions. I have benefitted from the opportunity to present this work in a seminar at the Institute for Social and Economic Research, University of Essex. %I Northwestern University %G eng %U http://hdl.handle.net/1813/34447 %9 Preprint %0 Journal Article %J International Journal of Digital Curation %D 2013 %T Data Management of Confidential Data %A Carl Lagoze %A William C. Block %A Jeremy Williams %A John M. Abowd %A Lars Vilhuber %X Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data. %B International Journal of Digital Curation %V 8 %P 265-278 %G eng %R 10.2218/ijdc.v8i1.259 %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Do ‘Don’t Know’ Responses = Survey Satisficing? 
Evidence from the Gallup Panel Paradata %A Wang, Mengyang %A Ruppanner, Leah %A McCutcheon, Allan L. %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Review of Economics of the Household %D 2013 %T Do single mothers in the United States use the Earned Income Tax Credit to reduce unsecured debt? %A Shaefer, H. Luke %A Song, Xiaoqing %A Williams Shanks, Trina R. %K Earned Income Tax Credit Single Mothers Unsecured Debt %XThe Earned Income Tax Credit (EITC) is a refundable credit for low income workers mainly targeted at families with children. This study uses the Survey of Income and Program Participation’s topical modules on Assets and Liabilities to examine associations between the EITC expansions during the early 1990s and the unsecured debt of the households of single mothers. We use two difference-in-differences comparisons over the study period 1988–1999, first comparing single mothers to single childless women, and then comparing single mothers with two or more children to single mothers with exactly one child. In both cases we find that the EITC expansions are associated with a relative decline in the unsecured debt of affected households of single mothers. While not direct evidence of a causal relationship, this is suggestive evidence that single mothers may have used part of their EITC to limit the growth of their unsecured debt during this period.

%B Review of Economics of the Household %P 659–680 %G eng %9 Journal Article %0 Conference Paper %B Joint Statistical Meetings 2013 %D 2013 %T Ecological Prediction with Nonlinear Multivariate Time-Frequency Functional Data Models %A Wikle, C.K. %B Joint Statistical Meetings 2013 %C Montreal, Canada %8 August %G eng %0 Journal Article %J Journal of Agricultural, Biological, and Environmental Statistics %D 2013 %T Ecological Prediction With Nonlinear Multivariate Time-Frequency Functional Data Models %A Yang, W.H. %A Wikle, C.K. %A Holan, S.H. %A Wildhaber, M.L. %B Journal of Agricultural, Biological, and Environmental Statistics %V 18 %G eng %U http://link.springer.com/article/10.1007/s13253-013-0142-1 %& 450-474 %R 10.1007/s13253-013-0142-1 %0 Journal Article %J Journal of Empirical Legal Studies %D 2013 %T Empirical Analysis of Data Breach Litigation %A Romanosky, A. %A Hoffman, D. %A Acquisti, A. %B Journal of Empirical Legal Studies %V 11 %P 74–104 %G eng %0 Conference Paper %B Metadata and Semantics Research %D 2013 %T Encoding Provenance Metadata for Social Science Datasets %A Lagoze, Carl %A Williams, Jeremy %A Vilhuber, Lars %E Garoufallou, Emmanouel %E Greenberg, Jane %K DDI %K eSocial Science %K Metadata %K Provenance %B Metadata and Semantics Research %S Communications in Computer and Information Science %I Springer International Publishing %V 390 %P 123-134 %@ 978-3-319-03436-2 %G eng %U http://dx.doi.org/10.1007/978-3-319-03437-9_13 %R 10.1007/978-3-319-03437-9_13 %0 Report %D 2013 %T Encoding Provenance of Social Science Data: Integrating PROV with DDI %A Lagoze, Carl %A Block, William C %A Williams, Jeremy %A Abowd, John %A Vilhuber, Lars %X Provenance is a key component of evaluating the integrity and reusability of data for scholarship. 
While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface. Submitted to EDDI13, the 5th Annual European DDI User Conference, December 2013, Paris, France. %I Cornell University %G eng %U http://hdl.handle.net/1813/34443 %9 Preprint %0 Conference Paper %B 5th Annual European DDI User Conference %D 2013 %T Encoding Provenance of Social Science Data: Integrating PROV with DDI %A Carl Lagoze %A William C. Block %A Jeremy Williams %A Lars Vilhuber %K DDI %K eSocial Science %K Metadata %K Provenance %X Provenance is a key component of evaluating the integrity and reusability of data for scholarship. While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface. %B 5th Annual European DDI User Conference %G eng %0 Journal Article %J Statistica Sinica %D 2013 %T On estimation of mean squared errors of benchmarked and empirical Bayes estimators %A Rebecca C. 
Steorts %A Malay Ghosh %B Statistica Sinica %V 23 %P 749–767 %G eng %0 Conference Paper %B 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining %D 2013 %T Exact Sparse Recovery with L0 Projections %A Ping Li %A Cun-Hui Zhang %B 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining %8 August %G eng %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Examining item nonresponse through paradata and respondent characteristics: A multilevel approach %A Cordova-Cazar, A.L. %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Examining response time outliers through paradata in Online Panel Surveys %A Lee, J. %A T. Al Baghal %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Examining the relationship between error and behavior in the American Time Use Survey using audit trail paradata %A Ruther, N. %A T. Al Baghal %A A. Eck %A L. Stuart %A L. Phillips %A R. Belli %A Soh, L-K %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2013 %T Fast Near Neighbor Search in High-Dimensional Binary Data %A Shrivastava, Anshumali %A Li, Ping %X Numerous applications in search, databases, machine learning, and computer vision can benefit from efficient algorithms for near neighbor search. 
This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a very simple and effective strategy for sub-linear time near neighbor search, by creating hash tables directly using the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections. %I Cornell University %G eng %U http://hdl.handle.net/1813/37987 %9 Preprint %0 Conference Paper %B Joint Statistical Meetings 2013 %D 2013 %T Flexible Semiparametric Hierarchical Spatial Models %A Porter, A.T. %B Joint Statistical Meetings 2013 %C Montreal, Canada %8 August %G eng %0 Journal Article %J Ohio State Law Journal %D 2013 %T From Facebook Regrets to Facebook Privacy Nudges %A Wang, Y. %A Leon, P. G. %A Chen, X. %A Komanduri, S. %A Norcie, G. %A Scott, K. %A Acquisti, A. %A Cranor, L. F. %A Sadeh, N. %B Ohio State Law Journal %G eng %0 Journal Article %J Journal of the American Statistical Association %D 2013 %T A Generalized Fellegi-Sunter Framework for Multiple Record Linkage with Application to Homicide Record Systems %A Sadinle, M. %A Fienberg, S. E. %B Journal of the American Statistical Association %V 108 %P 385–397 %G eng %U http://dx.doi.org/10.1080/01621459.2012.757231 %R 10.1080/01621459.2012.757231 %0 Journal Article %J IEEE Security & Privacy %D 2013 %T Gone in 15 Seconds: The Limits of Privacy Transparency and Control %A Acquisti, A. %A Adjerid, I. %A Brandimarte, L. %B IEEE Security & Privacy %V 11 %P 72–74 %G eng %0 Journal Article %J Statist. Sci. %D 2013 %T Handling Attrition in Longitudinal Studies: The Case for Refreshment Samples %A Deng, Yiting %A Hillygus, D. Sunshine %A Reiter, Jerome P. %A Si, Yajuan %A Zheng, Siyu %X Panel studies typically suffer from attrition, which reduces sample size and can result in biased inferences. 
It is impossible to know whether or not the attrition causes bias from the observed panel data alone. Refreshment samples—new, randomly sampled respondents given the questionnaire at the same time as a subsequent wave of the panel—offer information that can be used to diagnose and adjust for bias due to attrition. We review and bolster the case for the use of refreshment samples in panel studies. We include examples of both a fully Bayesian approach for analyzing the concatenated panel and refreshment data, and a multiple imputation approach for analyzing only the original panel. For the latter, we document a positive bias in the usual multiple imputation variance estimator. We present models appropriate for three waves and two refreshment samples, including nonterminal attrition. We illustrate the three-wave analysis using the 2007–2008 Associated Press–Yahoo! News Election Poll. %B Statist. Sci. %V 28 %P 238–256 %8 05/2013 %G eng %U http://dx.doi.org/10.1214/13-STS414 %& 238 %R 10.1214/13-STS414 %0 Journal Article %J Journal of Agricultural, Biological, and Environmental Statistics %D 2013 %T Hierarchical Bayesian Spatio-Temporal Conway-Maxwell Poisson Models with Dynamic Dispersion %A Wu, G. %A Holan, S.H. %A Wikle, C.K. %B Journal of Agricultural, Biological, and Environmental Statistics %C Anchorage, Alaska %V 18 %P 335-356 %G eng %U http://link.springer.com/article/10.1007/s13253-013-0141-2 %R 10.1007/s13253-013-0141-2 %0 Journal Article %J Statistics Views %D 2013 %T Hierarchical Spatio-Temporal Models and Survey Research %A Wikle, C. %A Holan, S. %A Cressie, N. %B Statistics Views %8 May %G eng %U http://www.statisticsviews.com/details/feature/4730991/Hierarchical-Spatio-Temporal-Models-and-Survey-Research.html %( Wiley %0 Journal Article %J Spatial Statistics %D 2013 %T Hierarchical Statistical Modeling of Big Spatial Datasets Using the Exponential Family of Distributions %A Sengupta, A. %A Cressie, N. 
%K EM algorithm %K Empirical Bayes %K Geostatistical process %K Maximum likelihood estimation %K MCMC %K SRE model %B Spatial Statistics %V 4 %P 14-44 %G eng %U http://www.sciencedirect.com/science/article/pii/S2211675313000055 %R 10.1016/j.spasta.2013.02.002 %0 Generic %D 2013 %T How can survey estimates of small areas be improved by leveraging social-media data? %A Cressie, N. %A Holan, S. %A Wikle, C. %B The Survey Statistician %8 July %G eng %U http://isi.cbs.nl/iass/N68.pdf %0 Journal Article %J Annals of the Association of American Geographers %D 2013 %T Identifying Neighborhoods Using High Resolution Population Data %A S.E. Spielman %A J. Logan %B Annals of the Association of American Geographers %V 103 %P 67-84 %G eng %0 Report %D 2013 %T Improving User Access to Metadata for Public and Restricted Use US Federal Statistical Files %A Block, William C. %A Williams, Jeremy %A Vilhuber, Lars %A Lagoze, Carl %A Brown, Warren %A Abowd, John M. %X Presentation at NADDI 2013. This record has also been archived at http://kuscholarworks.ku.edu/dspace/handle/1808/11093. %I Cornell University %G eng %U http://hdl.handle.net/1813/33362 %9 Preprint %0 Conference Paper %B Proceedings of Learning from Authoritative Security Experiment Results (LASER) %D 2013 %T Is it the Typeset or the Type of Statistics? Disfluent Font and Self-Disclosure %A Balebako, R. %A Pe'er, E. %A Brandimarte, L. %A Cranor, L. F. %A Acquisti, A. 
%B Proceedings of Learning from Authoritative Security Experiment Results (LASER) %I USENIX Association %C New York, NY %G eng %U https://www.usenix.org/laser2013/program/balebako %0 Report %D 2013 %T Managing Confidentiality and Provenance across Mixed Private and Publicly-Accessed Data and Metadata %A Vilhuber, Lars %A Abowd, John %A Block, William %A Lagoze, Carl %A Williams, Jeremy %X Social science researchers are increasingly interested in making use of confidential micro-data that contains linkages to the identities of people, corporations, etc. The value of this linking lies in the potential to join these identifiable entities with external data such as genome data, geospatial information, and the like. Leveraging these linkages is an essential aspect of “big data” scholarship. However, the utility of these confidential data for scholarship is compromised by the complex nature of their management and curation. This makes it difficult to fulfill US federal data management mandates and interferes with basic scholarly practices such as validation and reuse of existing results. We describe in this paper our work on the CED2AR prototype, a first step in providing researchers with a tool that spans the confidential/publicly-accessible divide, making it possible for researchers to identify, search, access, and cite those data. The particular points of interest in our work are the cloaking of metadata fields and the expression of provenance chains. For the former, we make use of existing fields in the DDI (Data Documentation Initiative) specification and suggest some minor changes to the specification. 
For the latter problem, we investigate the integration of DDI with recent work by the W3C PROV working group that has developed a generalizable and extensible model for expressing data provenance. %I Cornell University %G eng %U http://hdl.handle.net/1813/34534 %9 Preprint %0 Journal Article %J Public Opinion Quarterly %D 2013 %T Memory, communication, and data quality in calendar interviews %A Belli, R. F., %A Bilgen, I., %A T. Al Baghal %B Public Opinion Quarterly %V 77 %P 194-219 %G eng %0 Thesis %B Social Work %D 2013 %T Mental Disorders and Inequality in the United States: Intersection of race, gender, and disability on employment and income %A Camp, J. %B Social Work %I Wayne State University %V Ph.D. %G eng %0 Journal Article %J Social Psychological and Personality Science %D 2013 %T Misplaced confidences: Privacy and the control paradox %A Laura Brandimarte %A Alessandro Acquisti %A George Loewenstein %B Social Psychological and Personality Science %V 4 %P 340–347 %G eng %R 10.1177/1948550612455931 %0 Report %D 2013 %T NCRN Meeting Spring 2013 %A Vilhuber, Lars %X NCRN Meeting Spring 2013 Vilhuber, Lars Taken place at the NISS Headquarters, Research Triangle Park, NC. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/45870 %9 Preprint %0 Report %D 2013 %T NCRN Newsletter: Volume 1 - Issue 1 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 1 - Issue 1 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from July 2013 to November 2013. NCRN Newsletter Vol. 1, Issue 1: November 17, 2013 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40232 %9 Preprint %0 Journal Article %J Environment and Planning B %D 2013 %T Neighborhood contexts, health, and behavior: understanding the role of scale and residential sorting %A Spielman, S. E. %A Linkletter, C. %A Yoo, E.-H. 
%B Environment and Planning B %V 3 %G eng %0 Conference Paper %B Southern Regional Council on Statistics Summer Research Conference %D 2013 %T Nonlinear Dynamic Spatio-Temporal Statistical Models %A Wikle, C.K. %B Southern Regional Council on Statistics Summer Research Conference %8 June %G eng %0 Journal Article %J Journal of Educational and Behavioral Statistics %D 2013 %T Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys %A Si, Y. %A Reiter, J.P. %B Journal of Educational and Behavioral Statistics %V 38 %P 499-521 %G eng %U http://www.stat.duke.edu/ jerry/Papers/StatinMed14.pdf %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Paradata for Measurement Error Evaluation %A Olson, K. %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Predicting survey breakoff in Internet survey panels %A McCutcheon, A.L. %A T. Al Baghal %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B Biennial conference of the Society for Applied Research in Memory and Cognition %D 2013 %T Predicting the occurrence of respondent retrieval strategies in calendar interviewing: The quality of autobiographical recall in surveys %A Belli, R.F. %A Miller, L.D. %A Soh, L-K %A T. 
Al Baghal %B Biennial conference of the Society for Applied Research in Memory and Cognition %C Rotterdam, Netherlands %G eng %U http://static1.squarespace.com/static/504170d6e4b0b97fe5a59760/t/52457a8be4b0012b7a5f462a/1380285067247/SARMAC_X_PaperJune27.pdf %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Predicting the occurrence of respondent retrieval strategies in calendar interviewing: The quality of retrospective reports %A Belli, R.F. %A Miller, L.D. %A Soh, L-K %A T. Al Baghal %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2013 %T Presentation: Predicting Multiple Responses with Boosting and Trees %A Li, Ping %A Abowd, John %X Presentation: Predicting Multiple Responses with Boosting and Trees Li, Ping; Abowd, John Presentation by Ping Li and John Abowd at FCSM on November 4, 2013 %I Cornell University %G eng %U http://hdl.handle.net/1813/40255 %9 Preprint %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T The process of turning audit trails from a CATI survey into useful data: Interviewer behavior paradata in the American Time Use Survey %A Ruther, N. %A Phipps, P. %A Belli, R.F. %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Generic %D 2013 %T Recent Advances in Spatial Methods for Federal Surveys %A Holan, S.H. %8 September %G eng %0 Report %D 2013 %T Reconsidering the Consequences of Worker Displacements: Survey versus Administrative Measurements %A Flaaen, Aaron %A Shapiro, Matthew %A Isaac Sorkin %X Displaced workers suffer persistent earnings losses. 
This stark finding has been established by following workers in administrative data after mass layoffs under the presumption that these are involuntary job losses owing to economic distress. Using linked survey and administrative data, this paper examines this presumption by matching worker-supplied reasons for separations with what is happening at the firm. The paper documents substantially different earnings dynamics in mass layoffs depending on the reason the worker gives for the separation. Using a new methodology for accounting for the increase in the probability of separation among all types of survey response during in a mass layoff, the paper finds earnings loss estimates that are surprisingly close to those using only administrative data. Finally, the survey-administrative link allows the decomposition of earnings losses due to subsequent nonemployment into non-participation and unemployment. Including the zero earnings of those identified as being unemployed substantially increases the estimate of earnings losses. %I University of Michigan %G eng %U http://www-personal.umich.edu/~shapiro/papers/ReconsideringDisplacements.pdf %9 mimeo %0 Generic %D 2013 %T A Reduced Rank Model for Analyzing Multivariate Spatial Datasets %A Bradley, J.R. %B University of Missouri-Kansas City %I University of Missouri-Kansas City %8 November %G eng %0 Journal Article %J WebDB %D 2013 %T Ringtail: a generalized nowcasting system. %A Antenucci, Dolan %A Li, Erdong %A Liu, Shaobo %A Zhang, Bochun %A Cafarella, Michael J %A Ré, Christopher %X Social media nowcasting—using online user activity to de- scribe real-world phenomena—is an active area of research to supplement more traditional and costly data collection methods such as phone surveys. Given the potential impact of such research, we would expect general-purpose nowcast- ing systems to quickly become a standard tool among non- computer scientists, yet it has largely remained a research topic. 
We believe a major obstacle to widespread adoption is the nowcasting feature selection problem. Typical now- casting systems require the user to choose a handful of social media objects from a pool of billions of potential candidates, which can be a time-consuming and error-prone process. We have built Ringtail, a nowcasting system that helps the user by automatically suggesting high-quality signals. We demonstrate that Ringtail can make nowcasting easier by suggesting relevant features for a range of topics. The user provides just a short topic query (e.g., unemployment) and a small conventional dataset in order for Ringtail to quickly return a usable predictive nowcasting model. %B WebDB %V 6 %P 1358-1361 %G eng %U http://cs.stanford.edu/people/chrismre/papers/Ringtail-VLDB-demo.pdf %& 1358 %0 Journal Article %J WebDB %D 2013 %T Ringtail: Feature Selection for Easier Nowcasting. %A Antenucci, Dolan %A Cafarella, Michael J %A Levenstein, Margaret C. %A Ré, Christopher %A Shapiro, Matthew %X In recent years, social media “nowcasting”—the use of on- line user activity to predict various ongoing real-world social phenomena—has become a popular research topic; yet, this popularity has not led to widespread actual practice. We be- lieve a major obstacle to widespread adoption is the feature selection problem. Typical nowcasting systems require the user to choose a set of relevant social media objects, which is difficult, time-consuming, and can imply a statistical back- ground that users may not have. We propose Ringtail, which helps the user choose rele- vant social media signals. It takes a single user input string (e.g., unemployment) and yields a number of relevant signals the user can use to build a nowcasting model. 
We evaluate Ringtail on six different topics using a corpus of almost 6 billion tweets, showing that features chosen by Ringtail in a wholly-automated way are better or as good as those from a human and substantially better if Ringtail receives some human assistance. In all cases, Ringtail reduces the burden on the user. %B WebDB %P 49-54 %G eng %U http://www.cs.stanford.edu/people/chrismre/papers/webdb_ringtail.pdf %& 49 %0 Journal Article %J Social Service Review %D 2013 %T Rising extreme poverty in the United States and the response of means-tested transfers. %A H. Luke Shaefer %A Edin, K. %X This study documents an increase in the prevalence of extreme poverty among US households with children between 1996 and 2011 and assesses the response of major federal means-tested transfer programs. Extreme poverty is defined using a World Bank metric of global poverty: \$2 or less, per person, per day. Using the 1996–2008 panels of the Survey of Income and Program Participation (SIPP), we estimate that in mid-2011, 1.65 million households with 3.55 million children were living in extreme poverty in a given month, based on cash income, constituting 4.3 percent of all nonelderly households with children. The prevalence of extreme poverty has risen sharply since 1996, particularly among those most affected by the 1996 welfare reform. Adding SNAP benefits to household income reduces the number of extremely poor households with children by 48.0 percent in mid-2011. Adding SNAP, refundable tax credits, and housing subsidies reduces it by 62.8 percent. %B Social Service Review %V 87 %P 250-268 %8 06/2013 %G eng %U http://www.jstor.org/stable/10.1086/671012 %N 2 %& 250 %R 10.1086/671012 %0 Conference Paper %B Proceedings of the Ninth Symposium on Usable Privacy and Security (SOUPS) %D 2013 %T Sleights of Privacy: Framing, Disclosures, and the Limits of Transparency %A Adjerid, I. %A Acquisti, A. %A Loewenstein, G. 
%B Proceedings of the Ninth Symposium on Usable Privacy and Security (SOUPS) %I ACM %C New York, NY %G eng %0 Generic %D 2013 %T Some Historical Remarks on Spatial Statistics, Spatio-Temporal Statistics %A Cressie, N. %B Reading Group, University of Missouri %8 April %G eng %0 Thesis %B Department of Statistical Science %D 2013 %T Some Recent Advances in Non- and Semiparametric Bayesian Modeling with Copulas, Mixtures, and Latent Variables (Ph.D. Thesis) %A Jared S. Murray %X This thesis develops flexible non- and semiparametric Bayesian models for mixed continuous, ordered and unordered categorical data. These methods have a range of possible applications; the applications considered in this thesis are drawn primarily from the social sciences, where multivariate, heterogeneous datasets with complex dependence and missing observations are the norm. The first contribution is an extension of the Gaussian factor model to Gaussian copula factor models, which accommodate continuous and ordinal data with unspecified marginal distributions. I describe how this model is the most natural extension of the Gaussian factor model, preserving its essential dependence structure and the interpretability of factor loadings and the latent variables. I adopt an approximate likelihood for posterior inference and prove that, if the Gaussian copula model is true, the approximate posterior distribution of the copula correlation matrix asymptotically converges to the correct parameter under nearly any marginal distributions. I demonstrate with simulations that this method is both robust and efficient, and illustrate its use in an application from political science. The second contribution is a novel nonparametric hierarchical mixture model for continuous, ordered and unordered categorical data. The model includes a hierarchical prior used to couple component indices of two separate models, which are also linked by local multivariate regressions. 
This structure effectively overcomes the limitations of existing mixture models for mixed data, namely the overly strong local independence assumptions. In the proposed model local independence is replaced by local conditional independence, so that the induced model is able to more readily adapt to structure in the data. I demonstrate the utility of this model as a default engine for multiple imputation of mixed data in a large repeated-sampling study using data from the Survey of Income and Participation. I show that it improves substantially on its most popular competitor, multiple imputation by chained equations (MICE), while enjoying certain theoretical properties that MICE lacks. The third contribution is a latent variable model for density regression. Most existing density regression models are quite flexible but somewhat cumbersome to specify and fit, particularly when the regressors are a combination of continuous and categorical variables. The majority of these methods rely on extensions of infinite discrete mixture models to incorporate covariate dependence in mixture weights, atoms or both. I take a fundamentally different approach, introducing a continuous latent variable which depends on covariates through a parametric regression. In turn, the observed response depends on the latent variable through an unknown function. I demonstrate that a spline prior for the unknown function is quite effective relative to Dirichlet Process mixture models in density estimation settings (i.e., without covariates) even though these Dirichlet process mixtures have better theoretical properties asymptotically. The spline formulation enjoys a number of computational advantages over more flexible priors on functions. Finally, I demonstrate the utility of this model in regression applications using a dataset on U.S. wages from the Census Bureau, where I estimate the return to schooling as a smooth function of the quantile index. 
%B Department of Statistical Science %I Duke University %G eng %U http://dukespace.lib.duke.edu/dspace/handle/10161/8253 %9 Ph.D. %0 Generic %D 2013 %T Spatial Fay-Herriot Models for Small Area Estimation with Functional Covariates %A Porter, A.T. %8 May %G eng %0 Book Section %B Spatio-temporal Design: Advances in Efficient Data Acquisition %D 2013 %T Spatio-temporal Design: Advances in Efficient Data Acquisition %A Holan, S. %A Wikle, C. %E Jorge Mateu %E Werner Muller %K semiparametric dynamic design for non-Gaussian spatio-temporal data %B Spatio-temporal Design: Advances in Efficient Data Acquisition %I Wiley %P 269-284 %@ 9780470974292 %G eng %& Semiparametric Dynamic Design of Monitoring Networks for Non-Gaussian Spatio-Temporal Data %R 10.1002/9781118441862 %0 Generic %D 2013 %T Statistics and the Environment: Overview and Challenges %A Wikle, C.K. %8 May %G eng %0 Generic %D 2013 %T Statistics for Spatio-Temporal Data %A Cressie, N. %B Invited One-Day Short Course at the U.S. Census Bureau %8 April %G eng %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Troubles with time-use: Examining potential indicators of error in the American Time Use Survey %A Phillips, A.L. %A T. Al Baghal %A Belli, R.F. %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J TEST %D 2013 %T Two-stage Bayesian benchmarking as applied to small area estimation %A Rebecca C. 
Steorts %A Malay Ghosh %K small area estimation %B TEST %V 22 %8 2013 %G eng %N 4 %& 670 %0 Thesis %D 2013 %T User Modeling via Machine Learning and Rule-based Reasoning to Understand and Predict Errors in Survey Systems %A Stuart, Leonard Cleve %I University of Nebraska-Lincoln %G eng %U http://digitalcommons.unl.edu/computerscidiss/70/ %9 Masters %0 Journal Article %J Annals of the Association of American Geographers %D 2013 %T Using High Resolution Population Data to Identify Neighborhoods and Determine their Boundaries %A Spielman, S. E. %A Logan, J. %B Annals of the Association of American Geographers %V 103 %P 67-84 %G eng %U http://www.tandfonline.com/doi/abs/10.1080/00045608.2012.685049 %R 10.1080/00045608.2012.685049 %0 Thesis %D 2013 %T Using Satellite Imagery to Evaluate and Analyze Socioeconomic Changes Observed with Census Data %A Wilson, C. R. %G eng %9 Ph.D. %0 Conference Paper %B American Association for Public Opinion Research %D 2013 %T What are you doing now?: Audit trails, Activity level responses and error in the American Time Use Survey %A T. Al Baghal %A Phillips, A.L. %A Ruther, N. %A Belli, R.F. %A Stuart, L. %A Eck, A. %A Soh, L-K %B American Association for Public Opinion Research %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Journal of Legal Studies %D 2013 %T What is Privacy Worth? %A Acquisti, A. %A John, L. %A Loewenstein, G. %B Journal of Legal Studies %V 42 %P 249–274 %G eng %0 Journal Article %J Journal of Privacy and Confidentiality %D 2012 %T Achieving both valid and secure logistic regression analysis on aggregated data from different private sources %A Yuval Nardi %A Robert Hall %A Stephen E. 
Fienberg %B Journal of Privacy and Confidentiality %V 4 %P 189 %G eng %0 Journal Article %J Applied Stochastic Models in Business and Industry %D 2012 %T An Approach for Identifying and Predicting Economic Recessions in Real-Time Using Time-Frequency Functional Models %A Holan, S. %A Yang, W. %A Matteson, D. %A Wikle, C.K. %K Bayesian model averaging %K business cycles %K empirical orthogonal functions %K functional data %K MIDAS %K spectrogram %K stochastic search variable selection %B Applied Stochastic Models in Business and Industry %V 28 %P 485-499 %8 12/2012 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/asmb.1954/full %R 10.1002/asmb.1954 %0 Generic %D 2012 %T Asymptotic Theory of Cepstral Random Fields %A McElroy, T. %A Holan, S. %I University of Missouri %G eng %0 Report %D 2012 %T Asymptotic Theory of Cepstral Random Fields %A McElroy, T.S. %A Holan, S.H. %X Asymptotic Theory of Cepstral Random Fields McElroy, T.S.; Holan, S.H. Random fields play a central role in the analysis of spatially correlated data and, as a result,have a significant impact on a broad array of scientific applications. Given the importance of this topic, there has been a substantial amount of research devoted to this area. However, the cepstral random field model remains largely underdeveloped outside the engineering literature. We provide a comprehensive treatment of the asymptotic theory for two-dimensional random field models. In particular, we provide recursive formulas that connect the spatial cepstral coefficients to an equivalent moving-average random field, which facilitates easy computation of the necessary autocovariance matrix. Additionally, we establish asymptotic consistency results for Bayesian, maximum likelihood, and quasi-maximum likelihood estimation of random field parameters and regression parameters. Further, in both the maximum and quasi-maximum likelihood frameworks, we derive the asymptotic distribution of our estimator. 
The theoretical results are presented generally and are of independent interest,pertaining to a wide class of random field models. The results for the cepstral model facilitate model-building: because the cepstral coefficients are unconstrained in practice, numerical optimization is greatly simplified, and we are always guaranteed a positive definite covariance matrix. We show that inference for individual coefficients is possible, and one can refine models in a disciplined manner. Finally, our results are illustrated through simulation and the analysis of straw yield data in an agricultural field experiment. http://arxiv.org/pdf/1112.1977.pdf %I University of Missouri %G eng %U http://hdl.handle.net/1813/34461 %9 Preprint %0 Journal Article %J Computational Statistics and Data Analysis %D 2012 %T Bayesian Multi-Regime Smooth Transition Regression with Ordered Categorical Variables %A Wang, J. %A Holan, S. %B Computational Statistics and Data Analysis %V 56 %P 4165-4179 %8 December %G eng %U http://dx.doi.org/10.1016/j.csda.2012.04.018 %R 10.1016/j.csda.2012.04.018 %0 Generic %D 2012 %T Bayesian Multiscale Multiple Imputation With Implications to Data Confidentiality %A Holan, S.H. %G eng %0 Conference Paper %B Modern Nonparametric Methods in Machine Learning Workshop %D 2012 %T Bayesian Parametric and Nonparametric Inference for Multiple Record Likage %A Hall, R. %A Steorts, R. %A Fienberg, S. E. %B Modern Nonparametric Methods in Machine Learning Workshop %I NIPS %G eng %U http://www.stat.cmu.edu/NCRN/PUBLIC/files/beka_nips_finalsub4.pdf %0 Conference Paper %B Eighth International Conference on Social Science Methodology %D 2012 %T Calendar interviewing in life course research: Associations between verbal behaviors and data quality %A Belli, R.F. %A Bilgen, I. %A T. 
Al Baghal %B Eighth International Conference on Social Science Methodology %C Sydney Australia %G eng %U https://conference.acspri.org.au/index.php/rc33/2012/paper/view/366 %0 Conference Paper %B Joint Statistical Meetings %D 2012 %T Change of Support in Spatio-Temporal Dynamical Models %A Wikle, C.K. %B Joint Statistical Meetings %C Montreal, Canada %8 August %G eng %0 Generic %D 2012 %T Confidentiality and Privacy Protection in a Non-US Census Context %A Anne-Sophie Charest %I Carnegie Mellon University %8 April %G eng %0 Conference Paper %B Nathan and Beatrice Keyfitz Lecture in Mathematics and the Social Sciences %D 2012 %T Counting the people %A Stephen E. Fienberg %B Nathan and Beatrice Keyfitz Lecture in Mathematics and the Social Sciences %I Fields Institute %C Toronto, Canada %8 May %G eng %0 Thesis %D 2012 %T Creation and Analysis of Differentially-Private Synthesis Datasets %A Anne-Sophie Charest %I Carnegie Mellon University %G eng %9 phd %0 Report %D 2012 %T Data Management of Confidential Data %A Lagoze, Carl %A Block, William C. %A Williams, Jeremy %A Abowd, John M. %A Vilhuber, Lars %X Data Management of Confidential Data Lagoze, Carl; Block, William C.; Williams, Jeremy; Abowd, John M.; Vilhuber, Lars Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. 
We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data. %I Cornell University %G eng %U http://hdl.handle.net/1813/30924 %9 Preprint %0 Journal Article %J Journal of Privacy and Confidentiality %D 2012 %T Differential Privacy for Protecting Multi-dimensional Contingency Table Data: Extensions and Applications %A Yang Xiaolin %A Stephen E. Fienberg %A Alessandro Rinaldo %B Journal of Privacy and Confidentiality %V 4 %P 101-125 %G eng %0 Conference Paper %B Proceedings of the Survey Research Section of the SSC %D 2012 %T Differential Privacy for Synthetic Datasets %A Anne-Sophie Charest %B Proceedings of the Survey Research Section of the SSC %C Guelph, Ontario %G eng %0 Conference Paper %B Conference Presentation Classification Society Annual Meeting, Carnegie Mellon University %D 2012 %T Disambiguating USPTO Inventors with Classification Models Trained on Comparisons of Labeled Inventor Records %A Samuel Ventura %A Rebecca Nugent %A Erich R.H. Fuchs %B Conference Presentation Classification Society Annual Meeting, Carnegie Mellon University %G eng %0 Report %D 2012 %T An Early Prototype of the Comprehensive Extensible Data Documentation and Access Repository (CED2AR) %A Block, William C. %A Williams, Jeremy %A Abowd, John M. %A Vilhuber, Lars %A Lagoze, Carl %X An Early Prototype of the Comprehensive Extensible Data Documentation and Access Repository (CED2AR) Block, William C.; Williams, Jeremy; Abowd, John M.; Vilhuber, Lars; Lagoze, Carl This presentation will demonstrate the latest DDI-related technological developments of Cornell University’s $3 million NSF-Census Research Network (NCRN) award, dedicated to improving the documentation, discoverability, and accessibility of public and restricted data from the federal statistical system in the United States. 
The current internal name for our DDI-based system is the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). CED²AR ingests metadata from heterogeneous sources and supports filtered synchronization between restricted and public metadata holdings. Currently-supported CED²AR “connector workflows” include mechanisms to ingest IPUMS, zero-observation files from the American Community Survey (DDI 2.1), and SIPP Synthetic Beta (DDI 1.2). These disparate metadata sources are all transformed into a DDI 2.5 compliant form and stored in a single repository. In addition, we will demonstrate an extension to DDI 2.5 that allows for the labeling of elements within the schema to indicate confidentiality. This metadata can then be filtered, allowing the creation of derived public use metadata from an original confidential source. This repository is currently searchable online through a prototype application demonstrating the ability to search across previously heterogeneous metadata sources. Presentation at the 4th Annual European DDI User Conference (EDDI12), Norwegian Social Science Data Services, Bergen, Norway, 3 December, 2012 %I Cornell University %G eng %U http://hdl.handle.net/1813/30922 %9 Preprint %0 Conference Paper %B The Oxford Handbook of the Digital Economy %D 2012 %T The Economics of Privacy %A Laura Brandimarte %A Alessandro Acquisti %E Martin Peitz %E Joel Waldfogel %B The Oxford Handbook of the Digital Economy %I Oxford University Press %P 547-570 %@ 9780195397840 %G eng %R 10.1093/oxfordhb/9780195397840.013.0020 %0 Generic %D 2012 %T Efficient Time-Frequency Representations in High-Dimensional Spatial and Spatio-Temporal Models %A Wikle, C.K. 
%8 October %G eng %0 Conference Paper %B Privacy in Statistical Databases %D 2012 %T Empirical Evaluation of Statistical Inference from Differentially-Private Contingency Tables %A Anne-Sophie Charest %E Josep Domingo-Ferrer %E Ilenia Tinnirello %B Privacy in Statistical Databases %I Springer %V 7556 %P 257-272 %@ 978-3-642-33627-0 %G eng %R 10.1007/978-3-642-33627-0_20 %0 Report %D 2012 %T Encoding Provenance Metadata for Social Science Datasets %A Lagoze, Carl %A Williams, Jeremy %A Vilhuber, Lars %X Encoding Provenance Metadata for Social Science Datasets Lagoze, Carl; Williams, Jeremy; Vilhuber, Lars Recording provenance is a key requirement for data-centric scholarship, allowing researchers to evaluate the integrity of source data sets and re- produce, and thereby, validate results. Provenance has become even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. Recent work by the W3C on the PROV model provides the foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We apply that model to complex, but characteristic, provenance examples of social science data, describe scenarios that make scholarly use of those provenance descriptions, and propose a manner for encoding this provenance metadata within the widely-used DDI metadata standard. Submitted to Metadata and Semantics Research (MTSR 2013) conference. %I Cornell University %G eng %U http://hdl.handle.net/1813/55327 %9 Preprint %0 Book Section %B Advances in Neural Information Processing Systems 25 %D 2012 %T Entropy Estimations Using Correlated Symmetric Stable Random Projections %A Ping Li %A Cun-Hui Zhang %E P. Bartlett %E F.C.N. Pereira %E C.J.C. Burges %E L. Bottou %E K.Q. 
Weinberger %B Advances in Neural Information Processing Systems 25 %P 3185–3193 %G eng %U http://books.nips.cc/papers/files/nips25/NIPS2012_1456.pdf %0 Journal Article %J Journal of the American Statistical Association %D 2012 %T Estimating identification disclosure risk using mixed membership models %A Manrique-Vallier, D. %A Reiter, J.P. %B Journal of the American Statistical Association %V 107 %P 1385-1394 %G eng %0 Conference Paper %B 2012 Joint Statistical Meetings %D 2012 %T On Estimation of Mean Squared Errors of Benchmarked and Empirical Bayes Estimators %A Rebecca C. Steorts %A Malay Ghosh %B 2012 Joint Statistical Meetings %C San Diego, CA %8 August %G eng %0 Conference Paper %B Midwest Association for Public Opinion Research 2012 Annual Conference %D 2012 %T Exploring interviewer and respondent interactions: An innovative behavior coding approach %A Walton, L. %A Stange, M. %A Powell, R. %A Belli, R.F. %B Midwest Association for Public Opinion Research 2012 Annual Conference %C Chicago, IL %G eng %U http://www.mapor.org/conferences.html %0 Generic %D 2012 %T Extreme Poverty in the United States, 1996 to 2011 %A Shaefer, H. 
Luke %A Edin, Kathryn %I University of Michigan %8 February 2012 %G eng %U http://www.npc.umich.edu/publications/policy_briefs/brief28/policybrief28.pdf %9 Report %0 Conference Paper %B The 21$^{st}$ ACM International Conference on Information and Knowledge Management (CIKM 2012) %D 2012 %T Fast Multi-task Learning for Query Spelling Correction %A Xu Sun %A Anshumali Shrivastava %A Ping Li %B The 21$^{st}$ ACM International Conference on Information and Knowledge Management (CIKM 2012) %P 285–294 %G eng %U http://dx.doi.org/10.1145/2396761.2396800 %R 10.1145/2396761.2396800 %0 Conference Paper %B The European Conference on Machine Learning (ECML 2012) %D 2012 %T Fast Near Neighbor Search in High-Dimensional Binary Data %A Anshumali Shrivastava %A Ping Li %B The European Conference on Machine Learning (ECML 2012) %G eng %0 Conference Paper %B Joint Statistical Meetings 2012 %D 2012 %T Flexible Spectral Models for Multivariate Time Series %A Holan, S.H. %B Joint Statistical Meetings 2012 %8 August %G eng %0 Report %D 2012 %T A Generalized Fellegi-Sunter Framework for Multiple Record Linkage with Application to Homicide Records Systems %A Mauricio Sadinle %A Stephen E. Fienberg %B arXiv %G eng %U https://arxiv.org/abs/1205.3217 %0 Conference Paper %B Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume) %D 2012 %T GPU-based minwise hashing: GPU-based minwise hashing %A Ping Li %A Anshumali Shrivastava %A Arnd Christian König %B Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume) %P 565-566 %G eng %U http://doi.acm.org/10.1145/2187980.2188129 %R 10.1145/2187980.2188129 %0 Conference Paper %B Red Raider Conference %D 2012 %T Hierarchical General Quadratic Nonlinear Models for Spatio-Temporal Dynamics %A Wikle, C.K. 
%B Red Raider Conference %I Texas Tech University %C Lubbock, TX %8 October %G eng %0 Generic %D 2012 %T Hierarchical Statistical Modeling of Big Spatial Datasets Using the Exponential Family of Distributions %A Sengupta, A. %A Cressie, N. %I The Ohio State University %G eng %0 Generic %D 2012 %T Inference for Count Data using the Spatial Random Effects Model %A Cressie, N. %8 May %G eng %0 Journal Article %J Journal of Official Statistics %D 2012 %T Inferentially valid partially synthetic data: Generating from posterior predictive distributions not necessary %A Reiter, J.P. %A Kinney, S.K. %B Journal of Official Statistics %V 28 %P 583-590 %G eng %0 Conference Paper %B Midwest Association for Public Opinion Research 2012 Annual Conference %D 2012 %T Interviewer variance of interviewer and respondent behaviors: A new frontier in analyzing the interviewer-respondent interaction %A Charoenruk, N. %A Parkhurst, B. %A Ay, M. %A Belli, R. F. %B Midwest Association for Public Opinion Research 2012 Annual Conference %C Chicago, IL %8 November %G eng %U http://www.mapor.org/conferences.html %0 Conference Paper %B American Statistical Association Pittsburgh Chapter Banquet %D 2012 %T Logit-Based Confidence Intervals for Single Capture-Recapture Estimation %A Mauricio Sadinle %B American Statistical Association Pittsburgh Chapter Banquet %C Pittsburgh, PA %8 April %G eng %0 Conference Paper %B 2012 Joint Statistical Meetings %D 2012 %T Maintaining Quality in the Face of Rapid Program Expansion %A Cosma Shalizi %A Rebecca Nugent %B 2012 Joint Statistical Meetings %C San Diego, CA %8 August %G eng %0 Conference Paper %B Conference Presentation Academy of Management Annual Meeting %D 2012 %T Methods Matter: Revamping Inventor Disambiguation Algorithms with Classification Models and Labeled Inventor Records %A Samuel Ventura %A Rebecca Nugent %A Erich R.H. 
Fuchs %B Conference Presentation Academy of Management Annual Meeting %C Boston, MA %8 August %G eng %0 Conference Paper %B Conference Presentation Classification Society Annual Meeting, Carnegie Mellon University %D 2012 %T Multi-File Record Linkage Using a Generalized Fellegi-Sunter Framework %A Mauricio Sadinle %B Conference Presentation Classification Society Annual Meeting, Carnegie Mellon University %G eng %0 Report %D 2012 %T NCRN Meeting Fall 2012 %A Vilhuber, Lars %X Held at the Census Bureau Headquarters, Suitland, MD. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/45884 %9 Preprint %0 Report %D 2012 %T The NSF-Census Research Network: Cornell Node %A Block, William C. %A Lagoze, Carl %A Vilhuber, Lars %A Brown, Warren A. %A Williams, Jeremy %A Arguillas, Florio %X Cornell University has received a $3M NSF-Census Research Network (NCRN) award to improve the documentation and discoverability of both public and restricted data from the federal statistical system. The current internal name for this project is the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). The CED²AR will be based upon leading metadata standards such as the Data Documentation Initiative (DDI) and Statistical Data and Metadata eXchange (SDMX) and be flexibly designed to ingest documentation from a variety of source files. It will permit synchronization between the public and confidential instances of the repository. The scholarly community will be able to use the CED²AR as it would a conventional metadata repository, deprived only of the values of certain confidential information, but not their metadata.
The authorized user, working on the secure Census Bureau network, could use the CED²AR with full information in authorized domains. %I Cornell University %G eng %U http://hdl.handle.net/1813/30925 %9 Preprint %0 Book Section %B Advances in Neural Information Processing Systems 25 %D 2012 %T One Permutation Hashing %A Ping Li %A Art Owen %A Cun-Hui Zhang %E P. Bartlett %E F.C.N. Pereira %E C.J.C. Burges %E L. Bottou %E K.Q. Weinberger %B Advances in Neural Information Processing Systems 25 %P 3122–3130 %G eng %U http://books.nips.cc/papers/files/nips25/NIPS2012_1436.pdf %0 Report %D 2012 %T Presentation: Revisiting the Economics of Privacy: Population Statistics and Privacy as Public Goods %A Abowd, John %X Anonymization and data quality are intimately linked. Although this link has been properly acknowledged in the Computer Science and Statistical Disclosure Limitation literatures, economics offers a framework for formalizing the linkage and analyzing optimal decisions and equilibrium outcomes. The opinions expressed in this presentation are those of the author and not those of the National Science Foundation or the Census Bureau. %I Cornell University %G eng %U http://hdl.handle.net/1813/30937 %9 Preprint %0 Journal Article %J Notices of the AMS %D 2012 %T Privacy in a world of electronic data: Whom should you trust? %A Stephen E. Fienberg %B Notices of the AMS %V 59 %P 479 %G eng %0 Journal Article %J Journal of Privacy and Confidentiality %D 2012 %T Privacy-preserving data sharing in high dimensional regression and classification settings %A Stephen E. Fienberg %A Jiashun Jin %B Journal of Privacy and Confidentiality %V 4 %P 221 %G eng %0 Book Section %B Privacy in Statistical Databases %D 2012 %T A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs %A Abowd, John M.
%A Vilhuber, Lars %A Block, William %E Domingo-Ferrer, Josep %E Tinnirello, Ilenia %K Data Archive %K Data Curation %K Privacy-preserving Datamining %K Statistical Disclosure Limitation %B Privacy in Statistical Databases %S Lecture Notes in Computer Science %I Springer Berlin Heidelberg %V 7556 %P 216-225 %@ 978-3-642-33626-3 %G eng %U http://dx.doi.org/10.1007/978-3-642-33627-0_17 %R 10.1007/978-3-642-33627-0_17 %0 Conference Paper %B Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume) %D 2012 %T Query spelling correction using multi-task learning %A Xu Sun %A Anshumali Shrivastava %A Ping Li %B Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume) %P 613-614 %G eng %U http://doi.acm.org/10.1145/2187980.2188153 %R 10.1145/2187980.2188153 %0 Journal Article %J Applied Stochastic Models in Business and Industry %D 2012 %T Rejoinder: An approach for identifying and predicting economic recessions in real time using time frequency functional models %A Holan, S. %A Yang, W. %A Matteson, D. %A Wikle, C. %B Applied Stochastic Models in Business and Industry %V 28 %P 504-505 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/asmb.1955/full %R 10.1002/asmb.1955 %0 Book Section %B Spatio-temporal Design: Advances in Efficient Data Acquisition %D 2012 %T Semiparametric Dynamic Design of Monitoring Networks for Non-Gaussian Spatio-Temporal Data %A Holan, S. %A Wikle, C.K. 
%E Jorge Mateu %E Werner Muller %B Spatio-temporal Design: Advances in Efficient Data Acquisition %I Wiley %C Chichester, UK %P 269-284 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/9781118441862.ch12/summary %R 10.1002/9781118441862.ch12 %0 Conference Paper %B Conference on Web Privacy Measurement %D 2012 %T Sleight of Privacy %A Idris Adjerid %A Alessandro Acquisti %A Laura Brandimarte %B Conference on Web Privacy Measurement %G eng %0 Thesis %D 2012 %T Smooth Post-Stratification in Multiple Capture Recapture %A Zachary Kurtz %I Carnegie Mellon University %G eng %9 phd %0 Generic %D 2012 %T Spatio-Temporal Statistics at Mizzou, Truman School of Public Affairs %A Wikle, C.K. %8 October %G eng %0 Conference Paper %B Presentation Samuel S. Wilks Lecture %D 2012 %T Statistics in Service to the Nation %A Stephen E. Fienberg %B Presentation Samuel S. Wilks Lecture %C Princeton, NJ %8 April %G eng %0 Conference Paper %B 2012 Joint Statistical Meetings %D 2012 %T Teaching about Big Data: Curricular Issues %A Stephen E. 
Fienberg %B 2012 Joint Statistical Meetings %C San Diego, CA %8 August %G eng %0 Journal Article %J Journal of Machine Learning Research - Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012) %D 2012 %T Testing for Membership to the IFRA and the NBU Classes of Distributions %A Radhendushka Srivastava %A Ping Li %A Debasis Sengupta %B Journal of Machine Learning Research - Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012) %V 22 %P 1099-1107 %G eng %U http://jmlr.csail.mit.edu/proceedings/papers/v22/srivastava12.html %0 Conference Paper %B AutoCarto 2012 %D 2012 %T Thinking inside the box: Mapping the microstructure of urban environment (and why it matters) %A Seth Spielman %A David Folch %A John Logan %A Nicholas Nagle %K cartography %B AutoCarto 2012 %C Columbus, Ohio %G eng %U http://www.cartogis.org/docs/proceedings/2012/Spielman_etal_AutoCarto2012.pdf %0 Conference Paper %B Midwest Association for Public Opinion Research 2012 Annual Conference %D 2012 %T Troubles with time-use: Examining potential indicators of error in the ATUS %A Phillips, A. L., %A T. Al Baghal %A Belli, R. F. %B Midwest Association for Public Opinion Research 2012 Annual Conference %C Chicago, IL %G eng %U http://www.mapor.org/conferences.html %0 Conference Paper %B Privacy in Statistical Databases %D 2012 %T Valid Statistical Inference on Automatically Matched Files %A Robert Hall %A Stephen E. Fienberg %E Josep Domingo-Ferrer %E Ilenia Tinnirello %B Privacy in Statistical Databases %I Springer %P 131–142 %G eng %R 10.1007/978-3-642-33627-0_11 %0 Journal Article %J Children and Youth Services Review %D 2012 %T The welfare reforms of the 1990s and the stratification of material well-being among low-income households with children %A Shaefer, H. 
Luke %A Ybarra, Marci %X We examine the incidence of material hardship experienced by low-income households with children, before and after the major changes to U.S. anti-poverty programs during the 1990s. We use the Survey of Income and Program Participation (SIPP) to examine a series of measures of household material hardship that were collected in the years 1992, 1995, 1998, 2003 and 2005. We stratify our sample to differentiate between the 1) deeply poor (below 50% of poverty), who saw a decline in public assistance over this period; and two groups that saw some forms of public assistance increase: 2) other poor households (50–99% of poverty), and 3) the near poor (100–150% of poverty). We report bivariate trends over the study period, as well as presenting multivariate difference-in-differences estimates. We find suggestive evidence that material hardship—in the form of difficulty meeting essential household expenses, and falling behind on utilities costs—has generally increased among the deeply poor but has remained roughly the same for the middle group (50–99% of poverty), and decreased among the near poor (100–150% of poverty). Multivariate difference-in-differences estimates suggest that these trends have resulted in intensified stratification of the material well-being of low-income households with children.

%B Children and Youth Services Review %V 34 %P 1810-1817 %G eng %9 Journal Article %0 Conference Paper %B Proceedings of the 58th World Statistical Congress %D 2011 %T Approaches to Multiple Record Linkage %A Sadinle, M. %A Hall, R. %A Fienberg, S. E. %B Proceedings of the 58th World Statistical Congress %I International Statistical Institute %C Dublin %P 1064–1071 %G eng %U http://2011.isiproceedings.org/papers/450092.pdf %0 Journal Article %J Journal of Privacy and Confidentiality %D 2011 %T Comment on Gates: Toward a Reconceptualization of Confidentiality Protection in the Context of Linkages with Administrative Records %A Stephen E. Fienberg %B Journal of Privacy and Confidentiality %V 3 %P 65 %G eng %0 Report %D 2011 %T Do Single Mothers in the United States use the Earned Income Tax Credit to Reduce Unsecured Debt? %A Shaefer, H. Luke %A Song, Xiaoqing %A Williams Shanks, Trina R. %X The Earned Income Tax Credit (EITC) is a refundable credit for low-income workers that is mainly targeted at families with children. This study uses the Survey of Income and Program Participation’s (SIPP) topical modules on Assets & Liabilities to examine the effects of EITC expansions during the early 1990s on the unsecured debt of the households of single mothers. We use two difference-in-differences comparisons over the study period 1988 to 1999, first comparing single mothers to single childless women, and then comparing single mothers with two or more children to single mothers with exactly one child. In both cases we find that the EITC expansions are associated with a relative decline in the unsecured debt of affected households of single mothers. This suggests that single mothers may have used part of their EITC to limit the growth of their unsecured debt during this period.
%I University of Michigan %G eng %U http://hdl.handle.net/1813/34516 %9 Preprint %0 Report %D 2011 %T Estimating identification disclosure risk using mixed membership models %A Manrique-Vallier, Daniel %A Reiter, Jerome %X Statistical agencies and other organizations that disseminate data are obligated to protect data subjects' confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence, as part of their assessments of disclosure risks, many data stewards estimate the probabilities that sample uniques on sets of discrete keys are also population uniques on those keys. This is typically done using log-linear modeling on the keys. However, log-linear models can yield biased estimates of cell probabilities for sparse contingency tables with many zero counts, which often occurs in databases with many keys. This bias can result in unreliable estimates of probabilities of uniqueness and, hence, misrepresentations of disclosure risks. We propose an alternative to log-linear models for datasets with sparse keys based on a Bayesian version of grade of membership (GoM) models. We present a Bayesian GoM model for multinomial variables and offer an MCMC algorithm for fitting the model. We evaluate the approach by treating data from a recent US Census Bureau public use microdata sample as a population, taking simple random samples from that population, and benchmarking estimated probabilities of uniqueness against population values. Compared to log-linear models, GoM models provide more accurate estimates of the total number of uniques in the samples.
Additionally, they offer record-level predictions of uniqueness that dominate those based on log-linear models. %I Duke University / National Institute of Statistical Sciences (NISS) %G eng %U http://hdl.handle.net/1813/33184 %9 Preprint %0 Report %D 2011 %T NCRN Meeting Fall 2011 %A Vilhuber, Lars %X Held at the Census Bureau Conference Center. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/46201 %9 Preprint %0 Report %D 2011 %T A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs %A Abowd, John M. %A Vilhuber, Lars %A Block, William %X We develop the core of a method for solving the data archive and curation problem that confronts the custodians of restricted-access research data and the scientific users of such data. Our solution recognizes the dual protections afforded by physical security and access limitation protocols. It is based on extensible tools and can be easily incorporated into existing instructional materials. %I Cornell University %G eng %U http://hdl.handle.net/1813/30923 %9 Preprint %0 Journal Article %J Journal of Official Statistics %D 2011 %T Secure multiparty linear regression based on homomorphic encryption %A Robert Hall %A Stephen E. Fienberg %A Yuval Nardi %B Journal of Official Statistics %V 27 %P 669 %G eng %0 Journal Article %J Journal of Applied Research in Memory and Cognition %D 2016 %T Parallel Associations and the Structure of Autobiographical Knowledge %A Belli, Robert F. %A Al Baghal, Tarek %K Autobiographical knowledge %K Autobiographical memory %K Autobiographical periods %K Episodic memory %K Retrospective reports %X The self-memory system (SMS) model of autobiographical knowledge conceives that memories are structured thematically, organized both hierarchically and temporally.
This model has been challenged on several fronts, including the absence of parallel linkages across pathways. Calendar survey interviewing shows the frequent and varied use of parallel associations in autobiographical recall. Parallel associations in these data are commonplace, and are driven more by respondents’ generative retrieval than by interviewers’ probing. Parallel associations represent a number of autobiographical knowledge themes that are interrelated across life domains. The content of parallel associations is nearly evenly split between general and transitional events, supporting the importance of transitions in autobiographical memory. Associations in respondents’ memories (both parallel and sequential) demonstrate complex interactions with interviewer verbal behaviors during generative retrieval. In addition to discussing the implications of these results to the SMS model, implications are also drawn for transition theory and the basic-systems model. %B Journal of Applied Research in Memory and Cognition %V 5 %P 150-157 %8 6// %@ 2211-3681 %G eng %U http://www.sciencedirect.com/science/article/pii/S2211368116300183 %N 2 %0 Generic %D 0 %T Are Self-Description Scales Better than Agree/Disagree Scales in Mail and Telephone Surveys? %A Timbrook, Jerry %A Smyth, Jolene D. %A Olson, Kristen %G eng %0 Generic %D 0 %T The ATUS and SIPP-EHC: Recent developments %A Belli, R. F. %G eng %0 Generic %D 0 %T Audit trails, parallel navigation, and the SIPP %A Lee, Jinyoung %G eng %0 Journal Article %J Journal of the American Statistical Association %D 0 %T Bayesian estimation of bipartite matchings for record linkage %A Mauricio Sadinle %X The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities.
This is non-trivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal paper by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods merging two datafiles on casualties from the civil war of El Salvador. %B Journal of the American Statistical Association %G eng %0 Journal Article %J Annals of Applied Statistics %D 0 %T Biomass prediction using density dependent diameter distribution models %A Schliep, E.M. %A A.E. Gelfand %A J.S. Clark %A B.J. Tomasek %X Prediction of aboveground biomass, particularly at large spatial scales, is necessary for estimating global-scale carbon sequestration. Since biomass can be measured only by sacrificing trees, total biomass on plots is never observed. Rather, allometric equations are used to convert individual tree diameter to individual biomass, perhaps with noise. The values for all trees on a plot are then summed to obtain a derived total biomass for the plot. 
Then, with derived total biomasses for a collection of plots, regression models, using appropriate environmental covariates, are employed to attempt explanation and prediction. Not surprisingly, when out-of-sample validation is examined, such a model will predict total biomass well for holdout data because it is obtained using exactly the same derived approach. Apart from the somewhat circular nature of the regression approach, it also fails to employ the actual observed plot level response data. At each plot, we observe a random number of trees, each with an associated diameter, producing a sample of diameters. A model based on this random number of tree diameters provides understanding of how environmental regressors explain abundance of individuals, which in turn explains individual diameters. We incorporate density dependence because the distribution of tree diameters over a plot of fixed size depends upon the number of trees on the plot. After fitting this model, we can obtain predictive distributions for individual-level biomass and plot-level total biomass. We show that predictive distributions for plot-level biomass obtained from a density-dependent model for diameters will be much different from predictive distributions using the regression approach. Moreover, they can be more informative for capturing uncertainty than those obtained from modeling derived plot-level biomass directly. We develop a density-dependent diameter distribution model and illustrate with data from the national Forest Inventory and Analysis (FIA) database. We also describe how to scale predictions to larger spatial regions. Our predictions agree (in magnitude) with available wisdom on mean and variation in biomass at the hectare scale. 
%B Annals of Applied Statistics %V 11 %P 340-361 %G eng %U https://projecteuclid.org/euclid.aoas/1491616884 %N 1 %0 Book Section %B Handbook of research methods in health and social sciences %D 0 %T Calendar and time diary methods: The tools to assess well-being in the 21st century %A Córdova Cazar, Ana Lucía %A Belli, Robert F. %E Liamputtong, P %B Handbook of research methods in health and social sciences %I Springer %G eng %0 Generic %D 0 %T Does relation of retrieval pathways to data quality differ by self or proxy response status? %A Lee, Jinyoung %A Belli, Robert F. %G eng %0 Generic %D 0 %T "During the LAST YEAR, Did You...": The Effect of Emphasis in CATI Survey Questions on Data Quality %A Olson, Kristen %A Smyth, Jolene D. %G eng %0 Generic %D 0 %T The Effect of Question Characteristics, Respondents and Interviewers on Question Reading Time and Question Reading Behaviors in CATI Surveys %A Olson, Kristen %A Smyth, Jolene %A Kirchner, Antje %G eng %0 Generic %D 0 %T The Effect of Question Characteristics, Respondents and Interviewers on Question Reading Time and Question Reading Behaviors in CATI Surveys %A Olson, Kristen %G eng %0 Generic %D 0 %T The Effects of Respondent and Question Characteristics on Respondent Behaviors %A Ganshert, Amanda %A Olson, Kristen %A Smyth, Jolene %G eng %0 Journal Article %J The American Statistician %D 0 %T An Empirical Comparison of Multiple Imputation Methods for Categorical Data %A Olanrewaju Akande %A Fan Li %A Jerome Reiter %X Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database.
Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. A supplementary material for this article is available online. %B The American Statistician %P 0-0 %G eng %U http://dx.doi.org/10.1080/00031305.2016.1277158 %R 10.1080/00031305.2016.1277158 %0 Journal Article %J Stat %D 0 %T An ensemble quadratic echo state network for nonlinear spatio-temporal forecasting %A McDermott, P.L. %A Wikle, C.K. %X Spatio-temporal data and processes are prevalent across a wide variety of scientific disciplines. These processes are often characterized by nonlinear time dynamics that include interactions across multiple scales of spatial and temporal variability. The data sets associated with many of these processes are increasing in size due to advances in automated data measurement, management, and numerical simulator output. Nonlinear spatio-temporal models have only recently seen interest in statistics, but there are many classes of such models in the engineering and geophysical sciences.
Traditionally, these models are more heuristic than those that have been presented in the statistics literature, but are often intuitive and quite efficient computationally. We show here that with fairly simple, but important, enhancements, the echo state network (ESN) machine learning approach can be used to generate long-lead forecasts of nonlinear spatio-temporal processes, with reasonable uncertainty quantification, and at only a fraction of the computational expense of traditional parametric nonlinear spatio-temporal models. %B Stat %G eng %U https://arxiv.org/abs/1708.05094 %0 Generic %D 0 %T Evaluating Data quality in Time Diary Surveys Using Paradata %A Córdova Cazar, Ana Lucía %A Belli, Robert F. %G eng %0 Generic %D 0 %T An evaluation study of the use of paradata to enhance data quality in the American Time Use Survey (ATUS) %A Córdova Cazar, Ana Lucía %A Belli, Robert F. %G eng %0 Generic %D 0 %T Event History Calendar Interviewing Dynamics and Data Quality in the Survey of Income and Program Participation %A Lee, Jinyoung %G eng %0 Generic %D 0 %T Going off Script: How Interviewer Behavior Affects Respondent Behaviors in Telephone Surveys %A Kirchner, Antje %A Olson, Kristen %A Smyth, Jolene %G eng %0 Generic %D 0 %T How do Low Versus High Response Scale Ranges Impact the Administration and Answering of Behavioral Frequency Questions in Telephone Surveys? %A Sarwar, Mazen %A Olson, Kristen %A Smyth, Jolene %G eng %0 Generic %D 0 %T How do Mismatches Affect Interviewer/Respondent Interactions in the Question/Answer Process? %A Smyth, Jolene D. %A Olson, Kristen %G eng %0 Generic %D 0 %T Interviewer Influence on Interviewer-Respondent Interaction During Battery Questions %A Cochran, Beth %A Olson, Kristen %A Smyth, Jolene %G eng %0 Generic %D 0 %T Memory Gaps in the American Time Use Survey. Are Respondents Forgetful or is There More to it? %A Kirchner, Antje %A Belli, Robert F. %A Deal, Caitlin E.
%A Córdova-Cazar, Ana Lucia %G eng %0 Generic %D 0 %T Relation of questionnaire navigation patterns and data quality: Keystroke data analysis %A Lee, Jinyoung %G eng %0 Generic %D 0 %T Respondent retrieval strategies inform the structure of autobiographical knowledge %A Belli, R. F. %G eng %0 Generic %D 0 %T Response Scales: Effects on Data Quality for Interviewer Administered Surveys %A Sarwar, Mazen %A Olson, Kristen %A Smyth, Jolene %G eng %0 Generic %D 0 %T Using audit trails to evaluate an event history calendar survey instrument %A Lee, Jinyoung %A Seloske, Ben %A Belli, Robert F. %G eng %0 Generic %D 0 %T Using behavior coding to understand respondent retrieval strategies that inform the structure of autobiographical knowledge %A Belli, R. F. %G eng %0 Generic %D 0 %T Why do Mobile Interviews Take Longer? A Behavior Coding Perspective %A Timbrook, Jerry %A Smyth, Jolene %A Olson, Kristen %G eng %0 Generic %D 0 %T Working with the SIPP-EHC audit trails: Parallel and sequential retrieval %A Lee, Jinyoung %A Seloske, Ben %A Córdova Cazar, Ana Lucía %A Eck, Adam %A Belli, Robert F. %G eng