Bayesian Predictive Inference for a Non-probability Sample with Binary Responses from Small Areas

In a world of big data, non-probability samples are fast and easy to collect, and the responses can be binary. Traditional designed surveys, where probability theory plays an important role, require extensive planning and are very expensive. Typically, to reduce cost and save time, large data sets are collected using haphazard methods instead of designed surveys. An issue with big data is that the selection probabilities are unknown and descriptive summaries are generally biased. It is often the case that selection probabilities are related to the covariates and the binary response variable, so that selection is not at random (SNAR); the samples and non-samples are not random samples from the population. The main contribution of our research is a methodology for correcting selection bias in non-probability samples with binary responses and appropriate covariates. To study binary response data and to deal with the sampling bias that comes from the SNAR mechanism in a single area, we propose a non-ignorable selection model that uses a double logistic regression to link the response model with the selection model. When selection is at random (SAR), a single logistic regression model can serve as an ignorable selection model (a link to the selection mechanism is not needed). Both models are fit using full Bayesian methods. We use simulation studies to evaluate the ability of the non-ignorable selection model to adjust for the selection bias from the SNAR mechanism. The results show that when samples are SNAR, the non-ignorable selection model gives unbiased predictions of the population proportion, and when samples are SAR, the non-ignorable selection model performs similarly to the ignorable selection model. We also demonstrate the use of the model with real data from the Third National Health and Nutrition Examination Survey (NHANES III), where a binary version of body mass index is derived as the response with demographic covariates (age, race, sex).
Additional work includes a study on priors and a methodology for situations where individual covariates are typically unknown for the non-sampled population, but other sources of data are integrated into the ensemble. We extend the non-ignorable selection model to incorporate area-level information, which is accommodated through random effects in both the response sub-model and the selection sub-model. Small area estimation has become important in settings where reliable inference cannot be made from a single area alone. Both the non-ignorable and ignorable selection models are applied to simulated data sets and to real data from NHANES III with thirty-five counties. Furthermore, we develop two variations of this model using (a) more robust assumptions, by assigning Dirichlet process priors to the random effects, and (b) a bivariate model to incorporate the correlation between the two sets of random effects.
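
The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the thesis's model or code. It assumes a hypothetical single-area population in which a binary response follows a logistic regression on one covariate, and inclusion in the sample follows a second logistic regression on the covariate and the response. The function names, the coefficient values, and the parameter `gamma` (the assumed link between response and selection) are all illustrative choices.

```python
import math
import random


def expit(z):
    """Inverse logit: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_selection(n_pop=20000, gamma=1.5, seed=0):
    """Simulate a finite population and a non-probability sample drawn from it.

    All coefficients are hypothetical. gamma controls how strongly the
    selection probability depends on the binary response itself.
    Returns (population proportion, naive sample proportion, sample size).
    """
    rng = random.Random(seed)
    population = []
    sample = []
    for _ in range(n_pop):
        x = rng.gauss(0.0, 1.0)  # a single standardized covariate
        # Assumed response model: logistic regression on x
        y = 1 if rng.random() < expit(-0.5 + 1.0 * x) else 0
        # Assumed selection model: logistic regression on x and y
        p_select = expit(-2.0 + 0.5 * x + gamma * y)
        population.append(y)
        if rng.random() < p_select:
            sample.append(y)
    pop_prop = sum(population) / n_pop
    sample_prop = sum(sample) / len(sample)
    return pop_prop, sample_prop, len(sample)


if __name__ == "__main__":
    for label, gamma in [("selection depends on y", 1.5),
                         ("selection depends on x only", 0.0)]:
        pop_prop, sample_prop, n = simulate_selection(gamma=gamma)
        print(f"{label}: population={pop_prop:.3f}, "
              f"naive sample={sample_prop:.3f}, n={n}")
```

With `gamma` greater than zero, units with a positive response are more likely to be selected, so the naive sample proportion overstates the population proportion. With `gamma` set to zero, selection depends only on the covariate. The naive proportion can still differ from the population value in that case, because the covariate also drives the response, but the gap is smaller and could in principle be addressed by a response model that conditions on the covariate.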

  • etd-4271
Defense date
  • 2020
Date created
  • 2020-09-09