16  Data science as an applied science

16.1 Introduction

The term “data science” typically refers to the set of tools, techniques and thought processes used to conduct scientific inquiry using data. But is data science itself a scientific topic worthy of study in its own right? It is. And, at the practical, theoretical and philosophical levels, it is already extensively studied in statistics, computer science, engineering and philosophy departments.

Even so, at the Data Science Lab at JHU we’ve had lengthy discussions about data science as an inductive, empirical, applied science, like biology or medicine, rather than a deductive discipline, like mathematics or statistical theory, or a set of heuristics, like rules of thumb and agreed-upon best practices. An inductive, empirical, applied science itself needs data science, since it’s empirical and thus depends on data. So maybe there’s an infinite regress of some sort, making this an ultimately doomed endeavor. But, nevertheless, we persist.

There are some fields of data analysis that are already well covered in this way; let’s talk about them first.

16.1.1 Graphics, visualization and EDA

General EDA and visualization are covered in the next chapter. There is a vast experimental literature on how we perceive visual displays of data, which makes this one of the more neatly circumscribed areas of data science as a science, possibly because the general field of visual perception is itself well developed from a variety of angles. Let’s cover a specific example: observer studies in radiology.

16.1.1.1 Observer studies

In diagnostic radiology, new technology, in the form of new imaging machines or new ways of processing images, is being invented constantly. To evaluate these new technologies, a gold standard is a randomized study where the underlying truth is known. Images from the new technology and images from a control technology are then randomized to observers. Several issues arise in observer studies. First, establishing truth can be hard. This is often done with digital or physical phantoms, or in instances where patients are followed up longitudinally until their disease status becomes apparent. A second issue is the cost associated with observers. Often, instead of attending radiologists, the studies are done with residents, fellows, or students who have received specific training to qualify as reasonable proxies. An interesting alternative approach is to use digital observers.

My friends at the JHU Division of Medical Imaging Physics do this quite well. In one process, they first create highly accurate models of the human or non-human animal being studied (so-called phantoms, see here). Next they create accurate models of the imaging system, say X-ray CT or positron emission tomography. Suppose that they want to study two different ways of performing tomographic reconstruction, say a Bayesian algorithm or an EM algorithm. They take images generated from their digital phantom and process them using the two candidate algorithms. Then they use human or mathematical observers to try to diagnose the disease from the processed images. He et al. (2004) gives some examples.
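To make the workflow concrete, here is a minimal sketch, in Python, of the digital observer idea: simulate a toy phantom with and without a lesion, push it through two hypothetical reconstruction pipelines, and score each pipeline with a simple linear mathematical observer via the area under the ROC curve. Everything here — the phantom, the imaging model, the blur levels, and the observer — is an illustrative assumption on my part, not the Division’s actual pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2024)

def phantom(lesion, n=64):
    """Toy 2D digital phantom: uniform background, optional small bright lesion."""
    img = np.ones((n, n))
    if lesion:
        y, x = np.ogrid[:n, :n]
        img[(y - n // 2) ** 2 + (x - n // 2) ** 2 < 9] += 0.5
    return img

def acquire(img, noise_sd=0.8):
    """Crude imaging model: add detector noise (stand-in for a real CT/PET model)."""
    return img + rng.normal(0, noise_sd, img.shape)

def recon_a(raw):
    """Hypothetical reconstruction pipeline A: heavier smoothing."""
    return gaussian_filter(raw, sigma=2.0)

def recon_b(raw):
    """Hypothetical reconstruction pipeline B: lighter smoothing."""
    return gaussian_filter(raw, sigma=1.0)

def observer_score(img, template):
    """Linear (template-matching) mathematical observer statistic."""
    return float(np.sum(img * template))

# The observer's template is the known lesion signal (toy ideal-observer setup).
template = phantom(True) - phantom(False)

for name, recon in [("algorithm A", recon_a), ("algorithm B", recon_b)]:
    truth, scores = [], []
    for _ in range(200):
        lesion = rng.random() < 0.5
        score = observer_score(recon(acquire(phantom(lesion))), template)
        truth.append(lesion)
        scores.append(score)
    print(name, "observer AUC:", round(roc_auc_score(truth, scores), 3))
```

In a real study the phantom and imaging model would be far more faithful, and the mathematical observer would typically be something more sophisticated (for example, a channelized Hotelling observer) rather than this bare template match.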

16.1.2 Psychology

One of the most celebrated examples of data science as a science is the study of how we perceive chance and randomness, and how we make decisions under uncertainty, developed by Kahneman, Tversky, Slovic and many others. A good starting point is the book Kahneman et al. (1982).

One key point is that we all, regardless of training, rely on heuristics and biases to mentally represent uncertainty. An example is the neglect of prior probabilities:

Subjects were shown brief personality descriptions of several individuals, allegedly sampled at random from a group of 100 professionals - engineers and lawyers. The subjects were asked to assess, for each description, the probability that it belonged to an engineer rather than to a lawyer. In one experimental condition, subjects were told that the group from which the descriptions had been drawn consisted of 70 engineers and 30 lawyers. In another condition, subjects were told that the group consisted of 30 engineers and 70 lawyers. The odds that any particular description belongs to an engineer rather than to a lawyer should be higher in the first condition, where there is a majority of engineers, than in the second condition, where there is a majority of lawyers. Specifically, it can be shown by applying Bayes’ rule that the ratio of these odds should be \((.7/.3)^2\), or 5.44, for each description. In a sharp violation of Bayes’ rule, the subjects in the two conditions produced essentially the same probability judgments. Apparently, subjects evaluated the likelihood that a particular description belonged to an engineer rather than to a lawyer by the degree to which this description was representative of the two stereotypes, with little or no regard for the prior probabilities of the categories.
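The 5.44 figure is just Bayes’ rule in odds form. For any fixed description \(D\), the posterior odds of “engineer” equal the prior odds times the likelihood ratio of the description; the likelihood ratio is the same in both experimental conditions, so it cancels when the two posterior odds are compared:

\[
\frac{\text{posterior odds}_{70/30}}{\text{posterior odds}_{30/70}}
= \frac{(0.7/0.3)\, LR(D)}{(0.3/0.7)\, LR(D)}
= \left(\frac{0.7}{0.3}\right)^{2} \approx 5.44,
\]

where \(LR(D) = P(D \mid \text{engineer}) / P(D \mid \text{lawyer})\).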

A major thrust of this work was on sample representativeness (Kahneman and Tversky 1972). They summarize their findings as:

  (i) people expect samples to be highly similar to their parent population, and also to represent the randomness of the sampling process (Tversky & Kahneman, 1971, 1974); (ii) people often rely on representativeness as a heuristic for judgment and prediction (Kahneman & Tversky, 1972, 1973).
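As a quick illustration of why the first expectation is misleading — this simulation is my own addition, not from the original studies — even samples from a perfectly fair coin routinely look “unrepresentative” when they are small:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100,000 samples, each consisting of 10 fair coin flips
heads = rng.binomial(n=10, p=0.5, size=100_000)

# How often does a sample of 10 land far from the "representative" 5 heads?
extreme = np.mean((heads <= 2) | (heads >= 8))
print(f"Proportion of samples with <=2 or >=8 heads: {extreme:.3f}")  # about 0.11
```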

These biases creep into our judgements. Kahneman and Tversky (1972) discuss people’s ability to rank \(P(A)\) and \(P(A\cap B)\), the latter being mathematically guaranteed to be no larger than the former. They observed a strong conjunction bias. For example:

Linda is 31 years old, single, outspoken and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Subjects were asked to rank the probability of “Linda is a bank teller” (T), “Linda is active in the feminist movement” (F), and “Linda is a bank teller and active in the feminist movement” (T&F). They similarly describe Bill as having stereotypical accountant traits and ask subjects to rank “Bill is an accountant”, “Bill plays jazz for a hobby” and “Bill is an accountant who plays jazz for a hobby”. In 90% of the cases, the subjects ranked the intersection in between the two marginal probabilities (e.g., ranking F > T&F > T for Linda), a violation of the conjunction rule.
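The mathematical constraint violated here is simple monotonicity: conjoining T with any other event can only remove outcomes, so \(P(T \cap F) \leq P(T)\) no matter how stereotypically “feminist” Linda sounds. A tiny simulation, with probabilities I have invented purely for illustration, makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

# Invented probabilities, for illustration only
teller = rng.random(n) < 0.05      # T: Linda is a bank teller
feminist = rng.random(n) < 0.90    # F: Linda is active in the feminist movement

print("P(T)   ~", teller.mean())
print("P(T&F) ~", (teller & feminist).mean())  # can never exceed P(T)
```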