%0 Journal Article %J Journal of Labor Economics %D 2018 %T Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data %A John M. Abowd %A Kevin L. McKinney %A Nellie Zhao %X Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20%, the middle 60% and the top 20% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment. To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index. We show that, while the difference between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility.
High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there. %B Journal of Labor Economics %G eng %0 Journal Article %J Annals of Economics and Statistics %D 2018 %T Sorting Between and Within Industries: A Testable Model of Assortative Matching %A John M. Abowd %A Francis Kramarz %A Sebastien Perez-Duarte %A Ian M. Schmutte %B Annals of Economics and Statistics %G eng %0 Report %D 2017 %T Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data %A John M. Abowd %A Kevin L. McKinney %A Nellie Zhao %X Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20%, the middle 60% and the top 20% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment. To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index.
We show that, while the difference between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility. High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there. %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/34/ %0 Report %D 2017 %T Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data %A Abowd, John M. %A McKinney, Kevin L. %A Zhao, Nellie %X Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20%, the middle 60% and the top 20% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment.
To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index. We show that, while the difference between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility. High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there. %I Cornell University %G eng %U http://hdl.handle.net/1813/52609 %9 Preprint %0 Report %D 2017 %T Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Secure the Future of the Federal Statistical System? %A Weinberg, Daniel %A Abowd, John M. %A Belli, Robert F. %A Cressie, Noel %A Folch, David C. %A Holan, Scott H. %A Levenstein, Margaret C. %A Olson, Kristen M. %A Reiter, Jerome P. %A Shapiro, Matthew D. %A Smyth, Jolene %A Soh, Leen-Kiat %A Spencer, Bruce %A Spielman, Seth E. %A Vilhuber, Lars %A Wikle, Christopher %X

The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives.
This paper began as a May 8, 2015 presentation to the National Academies of Science’s Committee on National Statistics by two of the principal investigators of the National Science Foundation-Census Bureau Research Network (NCRN) – John Abowd and the late Steve Fienberg (Carnegie Mellon University). The authors acknowledge the contributions of the other principal investigators of the NCRN who are not co-authors of the paper (William Block, William Eddy, Alan Karr, Charles Manski, Nicholas Nagle, and Rebecca Nugent), the co-principal investigators, and the comments of Patrick Cantwell, Constance Citro, Adam Eck, Brian Harris-Kojetin, and Eloise Parker. We note with sorrow the deaths of Stephen Fienberg and Allan McCutcheon, two of the original NCRN principal investigators. The principal investigators also wish to acknowledge Cheryl Eavey’s sterling grant administration on behalf of the NSF. The conclusions reached in this paper are not the responsibility of the National Science Foundation (NSF), the Census Bureau, or any of the institutions to which the authors belong.

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52650 %9 Preprint %0 Journal Article %J The American Statistician %D 2017 %T An empirical comparison of multiple imputation methods for categorical data %A F. Li %A O. Akande %A J. P. Reiter %K latent %K missing %K mixture %K nonresponse %K tree %X Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. A supplementary material for this article is available online. %B The American Statistician %V 71 %8 01/2017 %G eng %U http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1277158 %N 2 %& 162 %R 10.1080/00031305.2016.1277158 %0 Journal Article %J Journal of Privacy and Confidentiality %D 2017 %T How Will Statistical Agencies Operate When All Data Are Private %A Abowd, John M %X

The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.

%B Journal of Privacy and Confidentiality %I Cornell University %V 7 %G eng %U http://repository.cmu.edu/jpc/vol7/iss3/1/ %N 3 %0 Journal Article %J Journal of Business & Economic Statistics %D 2017 %T Modeling Endogenous Mobility in Earnings Determination %A John M. Abowd %A Kevin L. McKinney %A Ian M. Schmutte %X We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax exogenous mobility by modeling the matched data as an evolving bipartite graph using a Bayesian latent-type framework. Our results suggest that allowing endogenous mobility increases the variation in earnings explained by individual heterogeneity and reduces the proportion due to employer and match effects. To assess external validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The mobility-bias corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. %B Journal of Business & Economic Statistics %P 0-0 %G eng %U http://dx.doi.org/10.1080/07350015.2017.1356727 %R 10.1080/07350015.2017.1356727 %0 Report %D 2017 %T Modeling Endogenous Mobility in Wage Determination %A John M. Abowd %A Kevin L. McKinney %A Ian M. Schmutte %X We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau.
First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/28/ %0 Report %D 2017 %T Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods %A John M. Abowd %A Ian M. Schmutte %X We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial. 
%B Labor Dynamics Institute Document %8 04/2017 %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/37/ %0 Report %D 2017 %T Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods %A Abowd, John %A Schmutte, Ian M. %X We consider the problem of the public release of statistical information about a population–explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner’s problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner’s problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing.
Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial. A complete archive of the data and programs used in this paper is available via http://doi.org/10.5281/zenodo.345385. %I Cornell University %G eng %U http://hdl.handle.net/1813/39081 %9 Preprint %0 Report %D 2017 %T Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods %A Abowd, John %A Schmutte, Ian M. %X We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52612 %9 Preprint %0 Report %D 2017 %T Sorting Between and Within Industries: A Testable Model of Assortative Matching %A John M.
Abowd %A Francis Kramarz %A Sebastien Perez-Duarte %A Ian M. Schmutte %X We test Shimer's (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting–more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated. %I Labor Dynamics Institute %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/40/ %9 Document %0 Journal Article %J Proceedings of the 2017 ACM International Conference on Management of Data %D 2017 %T Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics %A Samuel Haney %A Ashwin Machanavajjhala %A John M. Abowd %A Matthew Graham %A Mark Kutzbach %X National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. 
We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ε ≥ 1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research. %B Proceedings of the 2017 ACM International Conference on Management of Data %@ 978-1-4503-4197-4 %G eng %U http://dl.acm.org/citation.cfm?doid=3035918.3035940 %R 10.1145/3035918.3035940 %0 Report %D 2017 %T Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics %A Haney, Samuel %A Machanavajjhala, Ashwin %A Abowd, John M %A Graham, Matthew %A Kutzbach, Mark %A Vilhuber, Lars %X National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries.
These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ϵ≥1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional %I Cornell University %G eng %U http://hdl.handle.net/1813/49652 %9 Preprint %0 Report %D 2016 %T How Will Statistical Agencies Operate When All Data Are Private? %A Abowd, John M. %X How Will Statistical Agencies Operate When All Data Are Private? Abowd, John M. The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. 
And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies. %I Cornell University %G eng %U http://hdl.handle.net/1813/44663 %9 Preprint %0 Report %D 2016 %T Modeling Endogenous Mobility in Earnings Determination %A Abowd, John M. %A McKinney, Kevin L. %A Schmutte, Ian M. %X We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. Replication code can be found at DOI: http://doi.org/10.5281/zenodo.zenodo.376600 and our Github repository endogenous-mobility-replication.
%I Cornell University %G eng %U http://hdl.handle.net/1813/40306 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2016: Attitudes Towards Geolocation-Enabled Census Forms %A Brandimarte, Laura %A Chiew, Ernest %A Ventura, Sam %A Acquisti, Alessandro %X Geolocation refers to the automatic identification of the physical locations of Internet users. In an online survey experiment, we studied respondent reactions towards different types of geolocation. After coordinating with US Census Bureau researchers, we designed and administered a replica of a census form to a sample of respondents. We also created slightly different forms by manipulating the type of geolocation implemented. Using the IP address of each respondent, we approximated the geographical coordinates of the respondent and displayed this location on a map on the survey. Across different experimental conditions, we manipulated the map interface between the three interfaces on the Google Maps API: default road map, Satellite View, and Street View. We also provided either a specific, pinpointed location, or a set of two circles of 1- and 2-miles radius. Snapshots of responses were captured at every instant information was added, altered, or deleted by respondents when completing the survey. We measured willingness to provide information on the typical Census form, as well as privacy concerns associated with geolocation technologies and attitudes towards the use of online geographical maps to identify one’s exact current location.
Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting %I Carnegie-Mellon University %G eng %U http://hdl.handle.net/1813/43889 %9 Preprint %0 Report %D 2016 %T NCRN Meeting Spring 2016: Developing job linkages for the Health and Retirement Study %A McCue, Kristin %A Abowd, John %A Levenstein, Margaret %A Patki, Dhiren %A Rodgers, Ann %A Shapiro, Matthew %A Wasi, Nada %X This paper documents work using probabilistic record linkage to create a crosswalk between jobs reported in the Health and Retirement Study (HRS) and the list of workplaces on Census Bureau’s Business Register. Matching job records provides an opportunity to join variables that occur uniquely in separate datasets, to validate responses, and to develop missing data imputation models. Identifying the respondent’s workplace (“establishment”) is valuable for HRS because it allows researchers to incorporate the effects of particular social, economic, and geospatial work environments in studies of respondent health and retirement behavior. The linkage makes use of name and address standardizing techniques tailored to business data that were recently developed in a collaboration between researchers at Census, Cornell, and the University of Michigan. The matching protocol makes no use of the identity of the HRS respondent and strictly protects the confidentiality of information about the respondent’s employer. The paper first describes the clerical review process used to create a set of human-reviewed candidate pairs, and use of that set to train matching models. It then describes and compares several linking strategies that make use of employer name, address, and phone number.
Finally it discusses alternative ways of incorporating information on match uncertainty into estimates based on the linked data, and illustrates their use with a preliminary sample of matched HRS jobs. Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting %I University of Michigan %G eng %U http://hdl.handle.net/1813/43895 %9 Preprint %0 Report %D 2016 %T NCRN Newsletter: Volume 2 - Issue 4 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X

NCRN Newsletter: Volume 2 - Issue 4 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from September 2015 through December 2015. NCRN Newsletter Vol. 2, Issue 4: January 28, 2016.

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/42394 %9 Preprint %0 Report %D 2016 %T NCRN Newsletter: Volume 3 - Issue 1 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 3 - Issue 1 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from January 2016 through May 2016. NCRN Newsletter Vol. 3, Issue 1: June 10, 2016 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/44199 %9 Preprint %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2016 %T Noise infusion as a confidentiality protection measure for graph-based statistics %A Abowd, John M. %A McKinney, Kevin L. %X We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau's Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs. 
%B Statistical Journal of the International Association for Official Statistics %V 32 %P 127-135 %G eng %U http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji958 %N 1 %& 127 %R 10.3233/SJI-160958 %0 Journal Article %J Journal of Applied Research in Memory and Cognition %D 2016 %T Parallel associations and the structure of autobiographical knowledge %A Belli, R.F. %A T. Al Baghal %K Autobiographical memory; Autobiographical knowledge; Autobiographical periods; Episodic memory; Retrospective reports %X The self-memory system (SMS) model of autobiographical knowledge conceives that memories are structured thematically, organized both hierarchically and temporally. This model has been challenged on several fronts, including the absence of parallel linkages across pathways. Calendar survey interviewing shows the frequent and varied use of parallel associations in autobiographical recall. Parallel associations in these data are commonplace, and are driven more by respondents’ generative retrieval than by interviewers’ probing. Parallel associations represent a number of autobiographical knowledge themes that are interrelated across life domains. The content of parallel associations is nearly evenly split between general and transitional events, supporting the importance of transitions in autobiographical memory. Associations in respondents’ memories (both parallel and sequential) demonstrate complex interactions with interviewer verbal behaviors during generative retrieval. In addition to discussing the implications of these results to the SMS model, implications are also drawn for transition theory and the basic-systems model. %B Journal of Applied Research in Memory and Cognition %V 5 %P 150–157 %8 03/2016 %G eng %N 2 %R 10.1016/j.jarmac.2016.03.004 %0 Journal Article %J Demography %D 2016 %T Spatial Variation in the Quality of American Community Survey Estimates %A Folch, David C. %A Arribas-Bel, Daniel %A Koschinsky, Julia %A Spielman, Seth E. 
%B Demography %V 53 %P 1535–1554 %G eng %0 Journal Article %J Statistical Journal of the International Association for Official Statistics %D 2016 %T Synthetic establishment microdata around the world %A Vilhuber, Lars %A Abowd, John M. %A Reiter, Jerome P. %K Business data %K confidentiality %K differential privacy %K international comparison %K Multiple imputation %K synthetic %X In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature. %B Statistical Journal of the International Association for Official Statistics %V 32 %P 65-68 %G eng %U http://content.iospress.com/download/statistical-journal-of-the-iaos/sji964 %N 1 %& 65 %R 10.3233/SJI-160964 %0 Report %D 2016 %T Why Statistical Agencies Need to Take Privacy-loss Budgets Seriously, and What It Means When They Do %A John M. Abowd %G eng %U http://digitalcommons.ilr.cornell.edu/ldi/32/ %0 Thesis %B Statistical Science %D 2015 %T A Comparison of Multiple Imputation Methods for Categorical Data (Master's Thesis) %A Akande, O. %B Statistical Science %I Duke University %G eng %9 Masters %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Determining Potential for Breakoff in Time Diary Survey Using Paradata %A Wettlaufer, D. %A Arunachalam, H. %A Atkin, G. %A Eck, A. %A Soh, L.-K. %A Belli, R.F. 
%B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 05/2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2015 %T Economic Analysis and Statistical Disclosure Limitation %A Abowd, John M. %A Schmutte, Ian M. %X

Economic Analysis and Statistical Disclosure Limitation Abowd, John M.; Schmutte, Ian M. This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.

%I Cornell University %G eng %U http://hdl.handle.net/1813/40581 %9 Preprint %0 Journal Article %J Brookings Papers on Economic Activity %D 2015 %T Economic Analysis and Statistical Disclosure Limitation %A Abowd, John M. %A Schmutte, Ian M. %X Economic Analysis and Statistical Disclosure Limitation Abowd, John M.; Schmutte, Ian M. This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies. %B Brookings Papers on Economic Activity %V Spring 2015 %8 03/2015 %G eng %U http://www.brookings.edu/about/projects/bpea/papers/2015/economic-analysis-statistical-disclosure-limitation %0 Journal Article %J arXiv %D 2015 %T An empirical comparison of multiple imputation methods for categorical data %A Akande, O. %A Li, Fan %A Reiter, J. P. %X Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. 
Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. The results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and mixture model approaches. They also suggest competing advantages for the regression tree and mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. %B arXiv %G eng %U http://arxiv.org/abs/1508.05918 %N 1508.05918 %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Grids and Online Panels: A Comparison of Device Type from a Survey Quality Perspective %A Wang, Mengyang %A McCutcheon, Allan L. %A Allen, Laura %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Book Section %B Encyclopedia of Geographical Information Science %D 2015 %T Hierarchical Spatial Models %A Arab, A. %A Hooten, M.B. %A Wikle, C.K. %B Encyclopedia of Geographical Information Science %I Springer %G eng %0 Journal Article %J Geological Society %D 2015 %T Hierarchical, stochastic modeling across spatiotemporal scales of large river ecosystems and somatic growth in fish populations under various climate models: Missouri River sturgeon example %A Wildhaber, M.L. %A Wikle, C.K. 
%A Moran, E.H. %A Anderson, C.J. %A Franz, K.J. %A Dey, R. %B Geological Society %G eng %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T I Know What You Did Next: Predicting Respondent’s Next Activity Using Machine Learning %A Arunachalam, H. %A Atkin, G. %A Eck, A. %A Wettlaufer, D. %A Soh, L.-K. %A Belli, R.F. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 May 14-17, 2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2015 %T Modeling Endogenous Mobility in Wage Determination %A Abowd, John M. %A McKinney, Kevin L. %A Schmutte, Ian M. %X Modeling Endogenous Mobility in Wage Determination Abowd, John M.; McKinney, Kevin L.; Schmutte, Ian M. We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. %I Cornell University %G eng %U http://hdl.handle.net/1813/40306 %9 Preprint %0 Report %D 2015 %T Modeling Endogenous Mobility in Wage Determination %A Abowd, John M. %A McKinney, Kevin L. %A Schmutte, Ian M. 
%X Modeling Endogenous Mobility in Wage Determination Abowd, John M.; McKinney, Kevin L.; Schmutte, Ian M. We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax exogenous mobility by modeling the matched data as an evolving bipartite graph using a Bayesian latent-type framework. Our results suggest that allowing endogenous mobility increases the variation in earnings explained by individual heterogeneity and reduces the proportion due to employer and match effects. To assess external validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The mobility-bias corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/52608 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network %A Abowd, John M. %A Fienberg, Stephen E. %X NCRN Meeting Spring 2015: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network Abowd, John M.; Fienberg, Stephen E. May 8, 2015 CNSTAT Public Seminar %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40186 %9 Preprint %0 Report %D 2015 %T NCRN Meeting Spring 2015: Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods %A Abowd, John M. 
%A Schmutte, Ian %X NCRN Meeting Spring 2015: Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Abowd, John M.; Schmutte, Ian Presentation at the NCRN Meeting Spring 2015 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40184 %9 Preprint %0 Report %D 2015 %T NCRN Newsletter: Volume 2 - Issue 1 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 2 - Issue 1 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from October 2014 to January 2015. NCRN Newsletter Vol. 2, Issue 1: January 30, 2015. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40193 %9 Preprint %0 Report %D 2015 %T NCRN Newsletter: Volume 2 - Issue 2 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 2 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from January 2015 to May 2015. NCRN Newsletter Vol. 2, Issue 2: May 12, 2015. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40194 %9 Preprint %0 Report %D 2015 %T NCRN Newsletter: Volume 2 - Issue 2 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 2 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from February 2015 to May 2015. NCRN Newsletter Vol. 2, Issue 2: May 12, 2015. %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/44200 %9 Preprint %0 Report %D 2015 %T NCRN Newsletter: Volume 2 - Issue 3 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X

NCRN Newsletter: Volume 2 - Issue 3 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from June 2015 through August 2015. NCRN Newsletter Vol. 2, Issue 3: September 15, 2015.

%I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/42393 %9 Preprint %0 Report %D 2015 %T Noise Infusion as a Confidentiality Protection Measure for Graph-Based Statistics %A Abowd, John M. %A McKinney, Kevin L. %X Noise Infusion as a Confidentiality Protection Measure for Graph-Based Statistics Abowd, John M.; McKinney, Kevin L. We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs. %I Cornell University %G eng %U http://hdl.handle.net/1813/42338 %9 Preprint %0 Journal Article %J Science %D 2015 %T Privacy and human behavior in the age of information %A Alessandro Acquisti %A Laura Brandimarte %A George Loewenstein %K confidentiality %K privacy %X This Review summarizes and draws connections between diverse streams of empirical research on privacy behavior. 
We use three themes to connect insights from social and behavioral sciences: people’s uncertainty about the consequences of privacy-related behaviors and their own preferences over those consequences; the context-dependence of people’s concern, or lack thereof, about privacy; and the degree to which privacy concerns are malleable (manipulable by commercial and governmental interests). Organizing our discussion by these themes, we offer observations concerning the role of public policy in the protection of privacy in the information age. %B Science %V 347 %G eng %U http://www.sciencemag.org/content/347/6221/509 %N 6221 %& 509 %R 10.1126/science.aaa1465 %0 Journal Article %J Geological Society %D 2015 %T A stochastic bioenergetics model based approach to translating large river flow and temperature into fish population responses: the pallid sturgeon example %A Wildhaber, M.L. %A Dey, R. %A Wikle, C.K. %A Anderson, C.J. %A Moran, E.H. %A Franz, K.J. %B Geological Society %V 408 %G eng %R 10.1144/SP408.10 %0 Report %D 2015 %T Synthetic Establishment Microdata Around the World %A Vilhuber, Lars %A Abowd, John M. %A Reiter, Jerome P. %X Synthetic Establishment Microdata Around the World Vilhuber, Lars; Abowd, John M.; Reiter, Jerome P. In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature. 
%I Cornell University %G eng %U http://hdl.handle.net/1813/42340 %9 Preprint %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Using Data Mining to Examine Interviewer-Respondent Interactions in Calendar Interviews %A Belli, R.F. %A Miller, L.D. %A Soh, L.-K. %A T. Al Baghal %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 05/2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %D 2015 %T Using Machine Learning Techniques to Predict Respondent Type from A Priori Demographic Information %A Atkin, G. %A Arunachalam, H. %A Eck, A. %A Wettlaufer, D. %A Soh, L.-K. %A Belli, R.F. %B 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) %C Hollywood, Florida %8 May 14-17, 2015 %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B UNL/SRAM/Gallup Symposium %D 2014 %T Designing an Intelligent Time Diary Instrument: Visualization, Dynamic Feedback, and Error Prevention and Mitigation %A Atkin, G. %A Arunachalam, H. %A Eck, A. %A Soh, L.-K. %A Belli, R.F. %B UNL/SRAM/Gallup Symposium %C Omaha, NE %G eng %U http://grc.unl.edu/unlsramgallup-symposium %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T Designing an Intelligent Time Diary Instrument: Visualization, Dynamic Feedback, and Error Prevention and Mitigation %A Atkin, G. %A Arunachalam, H. %A Eck, A. %A Soh, L.-K. %A Belli, R. %B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA. %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Journal of Economic Literature %D 2014 %T The Economics of Privacy %A Acquisti, A. %A Taylor, C. 
%B Journal of Economic Literature %G eng %0 Journal Article %J Journal of Personality and Social Psychology %D 2014 %T I Cheated, but only a Little–Partial Confessions to Unethical Behavior %A Peer, E. %A Acquisti, A. %A Shalvi, S. %B Journal of Personality and Social Psychology %V 106 %P 202–217 %G eng %0 Conference Paper %B American Association for Public Opinion Research 2014 Annual Conference %D 2014 %T Making sense of paradata: Challenges faced and lessons learned %A Eck, A. %A Stuart, L. %A Atkin, G. %A Soh, L-K %A McCutcheon, A.L. %A Belli, R.F. %B American Association for Public Opinion Research 2014 Annual Conference %C Anaheim, CA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B UNL/SRAM/Gallup Symposium %D 2014 %T Making Sense of Paradata: Challenges Faced and Lessons Learned %A Eck, A. %A Stuart, L. %A Atkin, G. %A Soh, L-K %A McCutcheon, A.L. %A Belli, R.F. %B UNL/SRAM/Gallup Symposium %C Omaha, NE %G eng %U http://grc.unl.edu/unlsramgallup-symposium %0 Report %D 2014 %T NCRN Meeting Spring 2014: Aiming at a More Cost-Effective Census Via Online Data Collection: Privacy Trade-Offs of Geo-Location %A Brandimarte, Laura %A Acquisti, Alessandro %X NCRN Meeting Spring 2014: Aiming at a More Cost-Effective Census Via Online Data Collection: Privacy Trade-Offs of Geo-Location Brandimarte, Laura; Acquisti, Alessandro presentation at NCRN Spring 2014 meeting %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/36397 %9 Preprint %0 Report %D 2014 %T NCRN Newsletter: Volume 1 - Issue 2 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 1 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from November 2013 to March 2014. NCRN Newsletter Vol. 
1, Issue 2: March 20, 2014 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40233 %9 Preprint %0 Report %D 2014 %T NCRN Newsletter: Volume 1 - Issue 3 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 1 - Issue 3 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from March 2014 to July 2014. NCRN Newsletter Vol. 1, Issue 3: July 23, 2014 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40234 %9 Preprint %0 Report %D 2014 %T NCRN Newsletter: Volume 1 - Issue 4 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X NCRN Newsletter: Volume 1 - Issue 4 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from July 2014 to October 2014. NCRN Newsletter Vol. 1, Issue 4: October 15, 2014 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40192 %9 Preprint %0 Report %D 2014 %T A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data %A Schneider, Matthew J. %A Abowd, John M. %X A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data Schneider, Matthew J.; Abowd, John M. Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between confidentiality protection and inference quality. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The United States Census Bureau collects millions of interrelated time series micro-data that are hierarchical and contain many zeros and suppressions. 
Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian Generalized Linear Mixed Models (BGLMM) with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the magnitudes or number of entities. We find that as the prior distributions of the variance components in the BGLMM become more precise toward zero, confidentiality protection increases and inference quality deteriorates. We evaluate our methodology using a strict privacy measure, empirical differential privacy, and a newly defined risk measure, Probability of Range Identification (PoRI), which directly measures attribute disclosure risk. We illustrate our results with the U.S. Census Bureau’s Quarterly Workforce Indicators. %I Cornell University %G eng %U http://hdl.handle.net/1813/40828 %9 Preprint %0 Generic %D 2014 %T NewsViews: An Automated Pipeline for Creating Custom Geovisualizations for News %A Gao, T. %A Hullman, J. %A Adar, E. %A Hecht, B. %A Diakopoulos, N. %X Interactive visualizations add rich, data-based context to online news articles. Geographic maps are currently the most prevalent form of these visualizations. Unfortunately, designers capable of producing high-quality, customized geovisualizations are scarce. We present NewsViews, a novel automated news visualization system that generates interactive, annotated maps without requiring professional designers. NewsViews’ maps support trend identification and data comparisons relevant to a given news article. 
The NewsViews system leverages text mining to identify key concepts and locations discussed in articles (as well as potential annotations), an extensive repository of “found” databases, and techniques adapted from cartography to identify and create visually “interesting” thematic maps. In this work, we develop and evaluate key criteria in automatic, annotated, map generation and experimentally validate the key features for successful representations (e.g., relevance to context, variable selection, "interestingness" of representation and annotation quality). %G eng %U http://cond.org/newsviews.html %R 10.1145/2556288.2557228 %0 Journal Article %J Behavior Research Methods %D 2014 %T Reputation as a Sufficient Condition for Data Quality on Amazon Mechanical Turk %A Peer, E. %A Vosgerau, J. %A Acquisti, A. %B Behavior Research Methods %V 46 %P 1023–1031 %8 December %G eng %0 Report %D 2014 %T Sorting Between and Within Industries: A Testable Model of Assortative Matching %A Abowd, John M. %A Kramarz, Francis %A Perez-Duarte, Sebastien %A Schmutte, Ian M. %X Sorting Between and Within Industries: A Testable Model of Assortative Matching Abowd, John M.; Kramarz, Francis; Perez-Duarte, Sebastien; Schmutte, Ian M. We test Shimer's (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting–more productive workers are employed in more productive industries. 
The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated. %I Cornell University %G eng %U http://hdl.handle.net/1813/52607 %9 Preprint %0 Conference Paper %B Proceedings of the Workshop on Usable Security (USEC) %D 2014 %T Spiny CACTOS: OSN Users Attitudes and Perceptions Towards Cryptographic Access Control Tools %A Balsa, E. %A Brandimarte, L. %A Acquisti, A. %A Diaz, C. %A Gürses, S. %B Proceedings of the Workshop on Usable Security (USEC) %G eng %U https://www.internetsociety.org/doc/spiny-cactos-osn-users-attitudes-and-perceptions-towards-cryptographic-access-control-tools %0 Report %D 2014 %T Uncertain Uncertainty: Spatial Variation in the Quality of American Community Survey Estimates %A Folch, David C. %A Arribas-Bel, Daniel %A Koschinsky, Julia %A Spielman, Seth E. %X Uncertain Uncertainty: Spatial Variation in the Quality of American Community Survey Estimates Folch, David C.; Arribas-Bel, Daniel; Koschinsky, Julia; Spielman, Seth E. The U.S. Census Bureau's American Community Survey (ACS) is the foundation of social science research, much federal resource allocation and the development of public policy and private sector decisions. However, the high uncertainty associated with some of the ACS's most frequently used estimates can jeopardize the accuracy of inferences based on these data. While there is high-level understanding in the research community that problems exist in the data, the sources and implications of these problems have been largely overlooked. Using 2006-2010 ACS median household income at the census tract scale as the test case (where a third of small-area estimates have higher than recommended errors), we explore the patterns in the uncertainty of ACS data. We consider various potential sources of uncertainty in the data, ranging from response level to geographic location to characteristics of the place. 
We find that there exist systematic patterns in the uncertainty in both the spatial and attribute dimensions. Using a regression framework, we identify the factors that are most frequently correlated with the error at national, regional and metropolitan area scales, and find these correlates are not consistent across the various locations tested. The implication is that data quality varies in different places, making cross-sectional analysis both within and across regions less reliable. We also present general advice for data users and potential solutions to the challenges identified. %I University of Colorado at Boulder / University of Tennessee %G eng %U http://hdl.handle.net/1813/38122 %9 Preprint %0 Report %D 2014 %T Using Social Media to Measure Labor Market Flows %A Antenucci, Dolan %A Cafarella, Michael J %A Levenstein, Margaret C. %A Ré, Christopher %A Shapiro, Matthew %G eng %U http://www-personal.umich.edu/~shapiro/papers/LaborFlowsSocialMedia.pdf %9 Mimeo %0 Journal Article %J Journal of Survey Statistics and Methodology %D 2014 %T What are You Doing Now? Activity Level Responses and Errors in the American Time Use Survey %A T. Al Baghal %A Belli, R.F. %A Phillips, A.L. %A Ruther, N. %B Journal of Survey Statistics and Methodology %V 2 %G eng %N 4 %& 519-537 %0 Conference Paper %B Proceedings of the Tenth Symposium on Usable Privacy and Security (SOUPS) %D 2014 %T Would a Privacy Fundamentalist Sell their DNA for $1000... if Nothing Bad Happened Thereafter? A Study of the Westin Categories, Behavior Intentions, and Consequences %A Woodruff, A. %A Pihur, V. %A Acquisti, A. %A Consolvo, S. %A Schmidt, L. %A Brandimarte, L. %B Proceedings of the Tenth Symposium on Usable Privacy and Security (SOUPS) %I ACM %C New York, NY %G eng %U https://www.usenix.org/conference/soups2014/proceedings/presentation/woodruff %0 Conference Paper %B IEEE Security & Privacy %D 2013 %T Complementary Perspectives on Privacy and Security: Economics %A Acquisti, A. 
%B IEEE Security & Privacy %V 11 %P 93–95 %G eng %R 10.1109/MSP.2013.30 %0 Journal Article %J International Journal of Digital Curation %D 2013 %T Data Management of Confidential Data %A Carl Lagoze %A William C. Block %A Jeremy Williams %A John M. Abowd %A Lars Vilhuber %X Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data. %B International Journal of Digital Curation %V 8 %P 265-278 %G eng %R 10.2218/ijdc.v8i1.259 %0 Journal Article %J Journal of Empirical Legal Studies %D 2013 %T Empirical Analysis of Data Breach Litigation %A Romanosky, A. %A Hoffman, D. %A Acquisti, A. %B Journal of Empirical Legal Studies %V 11 %P 74–104 %G eng %0 Report %D 2013 %T Encoding Provenance of Social Science Data: Integrating PROV with DDI %A Lagoze, Carl %A Block, William C %A Williams, Jeremy %A Abowd, John %A Vilhuber, Lars %X Provenance is a key component of evaluating the integrity and reusability of data for scholarship. 
While recording and providing access provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface. Submitted to EDDI13 5th Annual European DDI User Conference December 2013, Paris, France %I Cornell University %G eng %U http://hdl.handle.net/1813/34443 %9 Preprint %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Examining response time outliers through paradata in Online Panel Surveys %A Lee, J. %A T. Al Baghal %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Examining the relationship between error and behavior in the American Time Use Survey using audit trail paradata %A Ruther, N. %A T. Al Baghal %A A. Eck %A L. Stuart %A L. Phillips %A R. Belli %A Soh, L-K %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Ohio State Law Journal %D 2013 %T From Facebook Regrets to Facebook Privacy Nudges %A Wang, Y. %A Leon, P. G. %A Chen, X. %A Komanduri, S. %A Norcie, G. %A Scott, K. %A Acquisti, A. %A Cranor, L. F. %A Sadeh, N. 
%B Ohio State Law Journal %G eng %0 Journal Article %J IEEE Security & Privacy %D 2013 %T Gone in 15 Seconds: The Limits of Privacy Transparency and Control %A Acquisti, A. %A Adjerid, I. %A Brandimarte, L. %B IEEE Security & Privacy %V 11 %P 72–74 %G eng %0 Report %D 2013 %T Improving User Access to Metadata for Public and Restricted Use US Federal Statistical Files %A Block, William C. %A Williams, Jeremy %A Vilhuber, Lars %A Lagoze, Carl %A Brown, Warren %A Abowd, John M. %X Presentation at NADDI 2013. This record has also been archived at http://kuscholarworks.ku.edu/dspace/handle/1808/11093. %I Cornell University %G eng %U http://hdl.handle.net/1813/33362 %9 Preprint %0 Conference Paper %B Proceedings of Learning from Authoritative Security Experiment Results (LASER) %D 2013 %T Is it the Typeset or the Type of Statistics? Disfluent Font and Self-Disclosure %A Balebako, R. %A Pe'er, E. %A Brandimarte, L. %A Cranor, L. F. %A Acquisti, A. %B Proceedings of Learning from Authoritative Security Experiment Results (LASER) %I USENIX Association %C New York, NY %G eng %U https://www.usenix.org/laser2013/program/balebako %0 Report %D 2013 %T Managing Confidentiality and Provenance across Mixed Private and Publicly-Accessed Data and Metadata %A Vilhuber, Lars %A Abowd, John %A Block, William %A Lagoze, Carl %A Williams, Jeremy %X Social science researchers are increasingly interested in making use of confidential micro-data that contains linkages to the identities of people, corporations, etc. 
The value of this linking lies in the potential to join these identifiable entities with external data such as genome data, geospatial information, and the like. Leveraging these linkages is an essential aspect of “big data” scholarship. However, the utility of these confidential data for scholarship is compromised by the complex nature of their management and curation. This makes it difficult to fulfill US federal data management mandates and interferes with basic scholarly practices such as validation and reuse of existing results. We describe in this paper our work on the CED2AR prototype, a first step in providing researchers with a tool that spans the confidential/publicly-accessible divide, making it possible for researchers to identify, search, access, and cite those data. The particular points of interest in our work are the cloaking of metadata fields and the expression of provenance chains. For the former, we make use of existing fields in the DDI (Data Documentation Initiative) specification and suggest some minor changes to the specification. For the latter problem, we investigate the integration of DDI with recent work by the W3C PROV working group that has developed a generalizable and extensible model for expressing data provenance. %I Cornell University %G eng %U http://hdl.handle.net/1813/34534 %9 Preprint %0 Journal Article %J Public Opinion Quarterly %D 2013 %T Memory, communication, and data quality in calendar interviews %A Belli, R. F. %A Bilgen, I. %A T. 
Al Baghal %B Public Opinion Quarterly %V 77 %P 194-219 %G eng %0 Journal Article %J Social Psychological and Personality Science %D 2013 %T Misplaced confidences: Privacy and the control paradox %A Laura Brandimarte %A Alessandro Acquisti %A George Loewenstein %B Social Psychological and Personality Science %V 4 %P 340–347 %G eng %R 10.1177/1948550612455931 %0 Report %D 2013 %T NCRN Newsletter: Volume 1 - Issue 1 %A Vilhuber, Lars %A Karr, Alan %A Reiter, Jerome %A Abowd, John %A Nunnelly, Jamie %X Overview of activities at NSF-Census Research Network nodes from July 2013 to November 2013. NCRN Newsletter Vol. 1, Issue 1: November 17, 2013 %I NCRN Coordinating Office %G eng %U http://hdl.handle.net/1813/40232 %9 Preprint %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Predicting survey breakoff in Internet survey panels %A McCutcheon, A.L. %A T. Al Baghal %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B Biennial conference of the Society for Applied Research in Memory and Cognition %D 2013 %T Predicting the occurrence of respondent retrieval strategies in calendar interviewing: The quality of autobiographical recall in surveys %A Belli, R.F. %A Miller, L.D. %A Soh, L-K %A T. 
Al Baghal %B Biennial conference of the Society for Applied Research in Memory and Cognition %C Rotterdam, Netherlands %G eng %U http://static1.squarespace.com/static/504170d6e4b0b97fe5a59760/t/52457a8be4b0012b7a5f462a/1380285067247/SARMAC_X_PaperJune27.pdf %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Predicting the occurrence of respondent retrieval strategies in calendar interviewing: The quality of retrospective reports %A Belli, R.F. %A Miller, L.D. %A Soh, L-K %A T. Al Baghal %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Report %D 2013 %T Presentation: Predicting Multiple Responses with Boosting and Trees %A Li, Ping %A Abowd, John %X Presentation by Ping Li and John Abowd at FCSM on November 4, 2013 %I Cornell University %G eng %U http://hdl.handle.net/1813/40255 %9 Preprint %0 Journal Article %J WebDB %D 2013 %T Ringtail: a generalized nowcasting system. %A Antenucci, Dolan %A Li, Erdong %A Liu, Shaobo %A Zhang, Bochun %A Cafarella, Michael J %A Ré, Christopher %X Social media nowcasting, using online user activity to describe real-world phenomena, is an active area of research to supplement more traditional and costly data collection methods such as phone surveys. Given the potential impact of such research, we would expect general-purpose nowcasting systems to quickly become a standard tool among non-computer scientists, yet it has largely remained a research topic. We believe a major obstacle to widespread adoption is the nowcasting feature selection problem. Typical nowcasting systems require the user to choose a handful of social media objects from a pool of billions of potential candidates, which can be a time-consuming and error-prone process. 
We have built Ringtail, a nowcasting system that helps the user by automatically suggesting high-quality signals. We demonstrate that Ringtail can make nowcasting easier by suggesting relevant features for a range of topics. The user provides just a short topic query (e.g., unemployment) and a small conventional dataset in order for Ringtail to quickly return a usable predictive nowcasting model. %B WebDB %V 6 %P 1358-1361 %G eng %U http://cs.stanford.edu/people/chrismre/papers/Ringtail-VLDB-demo.pdf %& 1358 %0 Journal Article %J WebDB %D 2013 %T Ringtail: Feature Selection for Easier Nowcasting. %A Antenucci, Dolan %A Cafarella, Michael J %A Levenstein, Margaret C. %A Ré, Christopher %A Shapiro, Matthew %X In recent years, social media “nowcasting” (the use of online user activity to predict various ongoing real-world social phenomena) has become a popular research topic; yet, this popularity has not led to widespread actual practice. We believe a major obstacle to widespread adoption is the feature selection problem. Typical nowcasting systems require the user to choose a set of relevant social media objects, which is difficult, time-consuming, and can imply a statistical background that users may not have. We propose Ringtail, which helps the user choose relevant social media signals. It takes a single user input string (e.g., unemployment) and yields a number of relevant signals the user can use to build a nowcasting model. We evaluate Ringtail on six different topics using a corpus of almost 6 billion tweets, showing that features chosen by Ringtail in a wholly-automated way are as good as or better than those from a human and substantially better if Ringtail receives some human assistance. In all cases, Ringtail reduces the burden on the user. 
%B WebDB %P 49-54 %G eng %U http://www.cs.stanford.edu/people/chrismre/papers/webdb_ringtail.pdf %& 49 %0 Conference Paper %B Proceedings of the Ninth Symposium on Usable Privacy and Security (SOUPS) %D 2013 %T Sleights of Privacy: Framing, Disclosures, and the Limits of Transparency %A Adjerid, I. %A Acquisti, A. %A Loewenstein, G. %B Proceedings of the Ninth Symposium on Usable Privacy and Security (SOUPS) %I ACM %C New York, NY %G eng %0 Conference Paper %B American Association for Public Opinion Research 2013 Annual Conference %D 2013 %T Troubles with time-use: Examining potential indicators of error in the American Time Use Survey %A Phillips, A.L. %A T. Al Baghal %A Belli, R.F. %B American Association for Public Opinion Research 2013 Annual Conference %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Conference Paper %B American Association for Public Opinion Research %D 2013 %T What are you doing now?: Audit trails, Activity level responses and error in the American Time Use Survey %A T. Al Baghal %A Phillips, A.L. %A Ruther, N. %A Belli, R.F. %A Stuart, L. %A Eck, A. %A Soh, L-K %B American Association for Public Opinion Research %C Boston, MA %G eng %U http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx %0 Journal Article %J Journal of Legal Studies %D 2013 %T What is Privacy Worth? %A Acquisti, A. %A John, L. %A Loewenstein, G. %B Journal of Legal Studies %V 42 %P 249–274 %G eng %0 Conference Paper %B Eighth International Conference on Social Science Methodology %D 2012 %T Calendar interviewing in life course research: Associations between verbal behaviors and data quality %A Belli, R.F. %A Bilgen, I. %A T. Al Baghal %B Eighth International Conference on Social Science Methodology %C Sydney Australia %G eng %U https://conference.acspri.org.au/index.php/rc33/2012/paper/view/366 %0 Report %D 2012 %T Data Management of Confidential Data %A Lagoze, Carl %A Block, William C. 
%A Williams, Jeremy %A Abowd, John M. %A Vilhuber, Lars %X Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data. %I Cornell University %G eng %U http://hdl.handle.net/1813/30924 %9 Preprint %0 Report %D 2012 %T An Early Prototype of the Comprehensive Extensible Data Documentation and Access Repository (CED2AR) %A Block, William C. %A Williams, Jeremy %A Abowd, John M. %A Vilhuber, Lars %A Lagoze, Carl %X This presentation will demonstrate the latest DDI-related technological developments of Cornell University’s $3 million NSF-Census Research Network (NCRN) award, dedicated to improving the documentation, discoverability, and accessibility of public and restricted data from the federal statistical system in the United States. The current internal name for our DDI-based system is the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). 
CED²AR ingests metadata from heterogeneous sources and supports filtered synchronization between restricted and public metadata holdings. Currently-supported CED²AR “connector workflows” include mechanisms to ingest IPUMS, zero-observation files from the American Community Survey (DDI 2.1), and SIPP Synthetic Beta (DDI 1.2). These disparate metadata sources are all transformed into a DDI 2.5 compliant form and stored in a single repository. In addition, we will demonstrate an extension to DDI 2.5 that allows for the labeling of elements within the schema to indicate confidentiality. This metadata can then be filtered, allowing the creation of derived public use metadata from an original confidential source. This repository is currently searchable online through a prototype application demonstrating the ability to search across previously heterogeneous metadata sources. Presentation at the 4th Annual European DDI User Conference (EDDI12), Norwegian Social Science Data Services, Bergen, Norway, 3 December, 2012 %I Cornell University %G eng %U http://hdl.handle.net/1813/30922 %9 Preprint %0 Conference Paper %B The Oxford Handbook of the Digital Economy %D 2012 %T The Economics of Privacy %A Laura Brandimarte %A Alessandro Acquisti %E Martin Peitz %E Joel Waldfogel %B The Oxford Handbook of the Digital Economy %I Oxford University Press %P 547-570 %@ 9780195397840 %G eng %R 10.1093/oxfordhb/9780195397840.013.0020 %0 Conference Paper %B Midwest Association for Public Opinion Research 2012 Annual Conference %D 2012 %T Interviewer variance of interviewer and respondent behaviors: A new frontier in analyzing the interviewer-respondent interaction %A Charoenruk, N. %A Parkhurst, B. %A Ay, M. %A Belli, R. F. %B Midwest Association for Public Opinion Research 2012 Annual Conference %C Chicago, IL %8 November %G eng %U http://www.mapor.org/conferences.html %0 Report %D 2012 %T The NSF-Census Research Network: Cornell Node %A Block, William C. 
%A Lagoze, Carl %A Vilhuber, Lars %A Brown, Warren A. %A Williams, Jeremy %A Arguillas, Florio %X Cornell University has received a $3M NSF-Census Research Network (NCRN) award to improve the documentation and discoverability of both public and restricted data from the federal statistical system. The current internal name for this project is the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). The diagram to the right provides a high level architectural overview of the system to be implemented. The CED²AR will be based upon leading metadata standards such as the Data Documentation Initiative (DDI) and Statistical Data and Metadata eXchange (SDMX) and be flexibly designed to ingest documentation from a variety of source files. It will permit synchronization between the public and confidential instances of the repository. The scholarly community will be able to use the CED²AR as it would a conventional metadata repository, deprived only of the values of certain confidential information, but not their metadata. The authorized user, working on the secure Census Bureau network, could use the CED²AR with full information in authorized domains. %I Cornell University %G eng %U http://hdl.handle.net/1813/30925 %9 Preprint %0 Report %D 2012 %T Presentation: Revisiting the Economics of Privacy: Population Statistics and Privacy as Public Goods %A Abowd, John %X Anonymization and data quality are intimately linked. Although this link has been properly acknowledged in the Computer Science and Statistical Disclosure Limitation literatures, economics offers a framework for formalizing the linkage and analyzing optimal decisions and equilibrium outcomes. 
The opinions expressed in this presentation are those of the author and not those of the National Science Foundation or the Census Bureau. %I Cornell University %G eng %U http://hdl.handle.net/1813/30937 %9 Preprint %0 Book Section %B Privacy in Statistical Databases %D 2012 %T A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs %A Abowd, John M. %A Vilhuber, Lars %A Block, William %E Domingo-Ferrer, Josep %E Tinnirello, Ilenia %K Data Archive %K Data Curation %K Privacy-preserving Datamining %K Statistical Disclosure Limitation %B Privacy in Statistical Databases %S Lecture Notes in Computer Science %I Springer Berlin Heidelberg %V 7556 %P 216-225 %@ 978-3-642-33626-3 %G eng %U http://dx.doi.org/10.1007/978-3-642-33627-0_17 %R 10.1007/978-3-642-33627-0_17 %0 Conference Paper %B Conference on Web Privacy Measurement %D 2012 %T Sleight of Privacy %A Idris Adjerid %A Alessandro Acquisti %A Laura Brandimarte %B Conference on Web Privacy Measurement %G eng %0 Conference Paper %B Midwest Association for Public Opinion Research 2012 Annual Conference %D 2012 %T Troubles with time-use: Examining potential indicators of error in the ATUS %A Phillips, A. L. %A T. Al Baghal %A Belli, R. F. %B Midwest Association for Public Opinion Research 2012 Annual Conference %C Chicago, IL %G eng %U http://www.mapor.org/conferences.html %0 Report %D 2011 %T A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs %A Abowd, John M. %A Vilhuber, Lars %A Block, William %X We develop the core of a method for solving the data archive and curation problem that confronts the custodians of restricted-access research data and the scientific users of such data. Our solution recognizes the dual protections afforded by physical security and access limitation protocols. 
It is based on extensible tools and can be easily incorporated into existing instructional materials. %I Cornell University %G eng %U http://hdl.handle.net/1813/30923 %9 Preprint %0 Journal Article %J The American Statistician %D 0 %T An Empirical Comparison of Multiple Imputation Methods for Categorical Data %A Olanrewaju Akande %A Fan Li %A Jerome Reiter %X Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. A supplementary material for this article is available online. %B The American Statistician %P 0-0 %G eng %U http://dx.doi.org/10.1080/00031305.2016.1277158 %R 10.1080/00031305.2016.1277158