14  Data science, conceptually

14.1 Introduction

In this chapter, we’ll focus on data science theory, thinking and philosophy. We’re going to omit any standard treatment of statistical theory and inference, like maximum likelihood optimality and asymptotics, since those are well covered elsewhere. However, it’s worth mentioning that those topics are obviously part of data science theory.

Instead, we’ll focus on meta-inferential and meta-empirical science questions, as well as some of the conceptual and theoretical language that’s worth knowing.

14.2 All models are wrong

14.2.1 Wrong means wrong

“All models are wrong, but some are useful”, or some variant, is a quote from the statistician George Box that is so well known and used that it has a lengthy Wikipedia page. Restricting our attention to probabilistic models, it is interesting to note that this quote, which is near universally agreed upon, has implications that often are not. For example, the quote suggests that there is not, and never has been: an IID sample, normally distributed data (as in having been generated from a normal distribution), a true population stochastic model … In other words, there is no correct probabilistic model, ever, ever, ever (according to the quote).

One way to interpret this is that there are correct probability models, we just haven’t found them yet, or maybe can’t find them via some incompleteness law. If we ever find one, I guess we’d have to change the quote. But, I don’t think the quote is implying the existence of true probabilistic models that we don’t, or can’t, know. I tend to think it is suggesting that randomness doesn’t exist and hence probability models are, like Newtonian mechanics, just models, not truth.

This is a well discussed topic in philosophy. On some meaningful level, the quote is obviously true. Most of the things we’re interested in in public health are clearly caused, in a purely functional way, by antecedent variables, some of which we know and can measure and some of which we can’t. This is obviously true of things like die rolls or casino games. Perhaps the most perfect example is random number generators, where we know the actual deterministic formula generating the numbers, yet they are designed to appear as random as possible.

But does randomness exist in some weird quantum setting, or is it just a useful model? The best answer to this question came from a famous result called Bell’s theorem, which suggests that hidden variables are not the missing ingredient to explain quantum phenomena. This result has since been corroborated via experiments. To explain these phenomena using pure determinism (i.e. hidden variables) requires giving up some important notions in this area, such as locality. How much of a proof of the existence of randomness the theorem and experiments are is debatable (and heavily debated). Bell’s theorem notwithstanding, it still remains in question whether these measurements are truly random in some sense or just well modeled by randomness. But, I’ll stop here, since I don’t understand this stuff at all.

For our purposes, this seems irrelevant. Even if randomness does exist at the scale of the extremely small, our data generally reference systems where we believe that determinism holds. There is no Bell’s theorem of disease. Typically, we think observed and hidden variables explain the relevant majority of an individual or population’s state of health. Moreover, regardless of whether some models are right or all stochastic models really are wrong, many models are extremely useful. It is more important to be able to accurately represent one’s assumptions and how they connect to the experimental setting than it is to argue that one’s assumptions are true on some bizarre abstract level. So, my recommendation is to ignore this line of thinking entirely, and instead focus on being able to articulate your assumptions and describe what it is you’re treating as random, regardless of whether or not it actually is.

14.3 Some models are useful

Do we even care if models are ultimately correct? The quote ends with “some are useful”. How are they useful? The smallest meaningful statement that I can come up with is that a model is useful if it convinces you or someone else of a true statement. This allows for the possibility that a poorly fitting, or obviously incorrect, model can still be useful.

Scott Zeger had a wonderful quote, I recall, which is that models are lenses through which we look at data. I really like this statement, since it mirrors the richness of ways that we can use models. Infrared, satellite and microscopic images all tell very different stories, and none is the true depiction of the world, though all are useful. The idea of models as lenses through which we look at data is further explored in the research on exploratory modeling (Wickham 2006). The idea is to use modeling in the same way that we do exploratory data analysis.

Another useful definition of a model is that it connects our data to a hypothetical or actual population. This requires defining the population of interest. Having a good sense of the target population is an important step in data analysis. Without this step, we are creating estimators without estimands. Defined this way, our model helps us generalize our results beyond our sample. Kass discusses this idea of modeling in (Kass 2011). He espouses a form of “statistical pragmatism”. Notably, he makes a stark delineation between statistical concepts and the “real world”. As an example:

Statistical inferences of all kinds use statistical models, which embody theoretical assumptions. … like scientific models, statistical models exist in an abstract framework; to distinguish this framework from the real world inhabited by data we may call it a “theoretical world.” Random variables, confidence intervals, and posterior probabilities all live in this theoretical world. When we use a statistical model to make a statistical inference we implicitly assert that the variation exhibited by data is captured reasonably well by the statistical model, so that the theoretical world corresponds reasonably well to the real world. Conclusions are drawn by applying a statistical inference technique, which is a theoretical construct, to some real data.

14.4 Discussing randomness in modeling

It is almost always a useful exercise to ask, “What exactly is it that I’m treating as if random in a problem?”. Consider the following two settings:

  1. I create a confidence interval using the binomial distribution for the prevalence of a disease from a cohort study.
  2. In the same study, I fit a regression model on a continuous measure of the disease outcome with an exposure predictor along with confounders.

In 1., it seems natural to say that it’s the sampling that’s random. In other words, there’s a population percentage of people with the disease and each subject’s disease status in my sample is an independent draw governed by that percentage. In 2., a reasonable assumption would be that my errors are an accumulation of unmodeled factors and that this accumulation is well modeled by assuming that the errors are IID.
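To make these two framings concrete, here is a minimal Python sketch using simulated (purely hypothetical) data and assuming numpy and statsmodels are available; it computes a binomial-based interval for a prevalence and fits a linear model whose error term stands in for accumulated, unmodeled factors.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(42)
n = 200

# Setting 1: each subject's disease status treated as an independent
# draw governed by the population prevalence; binomial-based 95% CI.
disease = rng.binomial(1, 0.3, size=n)
lower, upper = proportion_confint(disease.sum(), n, alpha=0.05)
print(f"Prevalence estimate {disease.mean():.2f}, 95% CI ({lower:.2f}, {upper:.2f})")

# Setting 2: errors treated as an IID accumulation of unmodeled factors
# around a linear mean in the exposure and a confounder.
exposure = rng.normal(size=n)
confounder = rng.normal(size=n)
outcome = 1.0 + 0.5 * exposure + 0.8 * confounder + rng.normal(size=n)
X = sm.add_constant(np.column_stack([exposure, confounder]))
print(sm.OLS(outcome, X).fit().params)  # intercept, exposure, confounder
```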

Generally, a few common themes arise when discussing randomness in models, often in some combination. In addition, there is a distinction in how one views randomness from Bayesian versus frequentist perspectives. This is covered well elsewhere. Finally, there are ideas, such as exchangeability, which can offer weaker assumptions and still provide inference.

Here we’ll discuss two major framings of randomness that are discussed in statistics.

14.4.1 Design based randomness

In randomized trials and finite population studies, often the randomness being modeled is the randomness in the design. This has the benefit that the source of randomness is very well understood, provided the design was actually followed. Researchers will also make inferences using design based arguments, even if that design was not utilized. For example, imagine estimating a prevalence from a sample that wasn’t actually random. You could make a statement such as “A 95% confidence interval for the prevalence would be [0.30, 0.70] if the sample were random.”

It should be noted that using randomization for design based inference only makes statements about your specific sample. For example, let \(Y_{i}(j)\) be the outcome one would have seen if subject \(i\) received treatment \(j=0,1\). We then observe \(Y_i(T_i) = Y_i\) where \(T_i\) is the actual treatment. Our estimate of the treatment effect is:

\[ E_{obs} = \frac{\sum_{i : T_i = 1} Y_{i}}{n_1} - \frac{\sum_{i : T_i = 0} Y_{i}}{n_0} \]

Consider calculating a P-value under the null assumption \(Y_i(1) = Y_i(0) = Y_i\) by redoing the randomization. Let \(T_i^{(m)}\), where \(m\) indexes the re-randomization, be a hypergeometric (i.e. preserving the sample sizes \(n_1\) and \(n_0\)) reallocation of the treatment assignments. For each reallocation, compute

\[ E_m = \frac{\sum_{i : T_i^{(m)} = 1} Y_{i}}{n_1} - \frac{\sum_{i : T_i^{(m)} = 0} Y_{i}}{n_0} \]

Then a P-value could be calculated using \[ \frac{1}{M} \sum_{m=1}^M I(|E_m| > |E_{obs}|). \]
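Here is a minimal sketch of this randomization test in Python, using simulated (hypothetical) outcomes and treatment assignments; the re-randomization simply permutes the observed treatment labels, which preserves \(n_1\) and \(n_0\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data purely for illustration.
n = 40
T = rng.permutation(np.repeat([0, 1], n // 2))  # observed treatment assignment
Y = 0.4 * T + rng.normal(size=n)                # observed outcomes

def effect(y, t):
    # Difference in arm means: the estimate E in the notation above.
    return y[t == 1].mean() - y[t == 0].mean()

E_obs = effect(Y, T)

# Re-randomize under the sharp null Y_i(1) = Y_i(0) = Y_i, keeping n_1, n_0 fixed.
M = 10000
E_m = np.array([effect(Y, rng.permutation(T)) for _ in range(M)])

# Counting ties as at least as extreme (>=) is the conventional choice.
p_value = np.mean(np.abs(E_m) >= np.abs(E_obs))
print(f"E_obs = {E_obs:.3f}, randomization P-value = {p_value:.3f}")
```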

Notice, this inference is about the sample we observed, \(Y_i\), and perhaps their counterfactuals. But, without further assumptions, conclusions are relative to the observed sample. Typically, one does make such further assumptions, but note that they rest on less direct knowledge of the source of randomness.

Sampling variation deserves a special mention as a form of design based inference. Very often we make IID (random) sampling assumptions even when they are obviously false. It is important to acknowledge that conclusions then hold only under this assumption.

14.4.2 Accumulated errors

In many settings we assume that we are modeling mechanisms, for example when modeling \(Y=f(x)+\epsilon\). Often, we assume that \(\epsilon\) is comprised of accumulated errors from unmodeled variables. There is a further assumption that these errors accumulate in a way that is well modeled by randomness. An example where this fails is when we omit an important confounder from our model. There, the errors are systematic in a way that is essential for understanding the scientific phenomenon that we are studying. In a later chapter, we’ll discuss causal diagrams and think about which variables we include in and exclude from our models and how unmodeled variables can influence our results.
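As a small illustration of this point, the following sketch (hypothetical simulated data, assuming numpy and statsmodels) fits the same regression with and without a confounder; when the confounder is omitted it is absorbed into the error term in a systematic way and the exposure coefficient is biased.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000

# The confounder influences both the exposure and the outcome.
confounder = rng.normal(size=n)
exposure = 0.7 * confounder + rng.normal(size=n)
outcome = 0.5 * exposure + 1.0 * confounder + rng.normal(size=n)

# Correctly specified model: the remaining error is unstructured noise.
full = sm.OLS(outcome, sm.add_constant(np.column_stack([exposure, confounder]))).fit()

# Confounder omitted: it is pushed into the error, which is now
# systematically related to the exposure, biasing its coefficient.
reduced = sm.OLS(outcome, sm.add_constant(exposure)).fit()

print("Exposure effect, full model:   ", round(full.params[1], 2))     # near 0.5
print("Exposure effect, reduced model:", round(reduced.params[1], 2))  # biased away from 0.5
```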

14.5 Summary

It’s important to emphasize that, rather than treating one way of thinking about modeling as right and others as wrong, one should be able to describe with precision what they are doing in their analysis. Here are some examples of questions one might ask themselves:

  1. What am I trying to accomplish with this model?
  2. Is there a (potentially hypothetical) population being generalized to? What is it, and what are we estimating from it?
  3. How were the data collected and does our model connect to the design well?
  4. What exactly is it that we’re modeling as if random?
  5. Are the modeling assumptions reasonable and is there any data evidence of that?
  6. How robust are our conclusions to our modeling assumptions; have we tried multiple models?

This is perhaps just a subset of the questions one should ask themselves. I wish I could be more prescriptive, with a checklist or something like that. However, how models are used differs a great deal across scientific communities, within a community, and even within an individual statistician over time.

14.6 Example

Here’s a neat example of using sampling assumptions to make an inference that could create some discussion. Assume we sample bunnies in a field on two occasions, tagging each bunny caught on the first occasion. The goal is to figure out the number of bunnies in the field.

                  Caught 2       Not caught 2
Caught 1          \(n_{11}\)     \(n_{12}\)
Not caught 1      \(n_{21}\)     \(n_{22}\)

Note, we do not get to see \(n_{22}\), and knowledge of this number is equivalent to knowing the population size. Let \(N=\sum_{i,j} n_{ij}\) be the total population size and \(\pi_{ij}\) be the probability of landing in cell \((i,j)\) of the table (capture status on the first and second occasions), so that \(E[n_{ij}] = N \pi_{ij}\). Assume capture status is independent across occasions and that each bunny’s cell is an independent draw from a multinomial distribution with probabilities \(\pi_{ij}\). Under the independence assumption the odds ratio is 1. Therefore, we assume \[ \frac{n_{11}n_{22}}{n_{12}n_{21}} \approx 1. \] Setting the left side equal to 1, we can solve for \(n_{22}\) as \[ \hat n_{22} = \frac{n_{12}n_{21}}{n_{11}}. \]
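A minimal numerical sketch of the estimator, with made-up capture counts (plain Python, no libraries):

```python
# Hypothetical capture counts; n_22 is never observed.
n11 = 25   # caught on both occasions
n12 = 60   # caught on occasion 1 only
n21 = 55   # caught on occasion 2 only

# Independence implies the odds ratio n11 * n22 / (n12 * n21) is about 1,
# so solving for the unobserved cell gives the estimate below.
n22_hat = n12 * n21 / n11

# Estimated total number of bunnies in the field.
N_hat = n11 + n12 + n21 + n22_hat
print(f"Estimated n22: {n22_hat:.0f}; estimated population size: {N_hat:.0f}")
```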

Questions:

  1. What are we assuming is random in this estimation process?
  2. How can we make inference using these assumptions?
  3. Is the model likely to be useful?
  4. How could we relax the assumptions?

14.7 Reading