Statistics for data science and measurement¶

Babak Moghadas 1 and Brian Caffo 1, 2¶

1 Department of Biostatistics¶

2 Department of Biomedical Engineering¶

Bloomberg School of Public Health¶

Johns Hopkins University¶

About these slides¶

  • The goal is to cover the least that you need to know
  • All slides are created in Jupyter notebooks - free to use and open source
  • GitHub repo
  • pyglide provides the interactivity

Part 3: Validation of ML algorithms¶

  • In this component, we'll discuss validating the output of ML algorithms.
  • Typically, we split the data into:
    • Training data: data used to train the model.
    • Validation data: data used to choose hyperparameters, such as the number of layers in a neural network.
    • Testing data: data used only for the final evaluation of the model.
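As a minimal sketch of such a split (this uses scikit-learn's train_test_split; the 60/20/20 proportions and the simulated data are arbitrary choices for illustration, not a recommendation):

import numpy as np
from sklearn.model_selection import train_test_split

## hypothetical data: 1000 observations, 5 features, binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size = (1000, 5))
y = rng.integers(0, 2, size = 1000)

## hold out 20% as the test set, touched only for final evaluation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
## split the remainder into training (75%) and validation (25%) sets
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size = 0.25, random_state = 0)
print(len(y_train), len(y_val), len(y_test))  ## 600 200 200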

Binary outcomes (two-class classification)¶

Consider the following definitions for the results of a diagnostic test $\hat Y\in \{0,1\}$, where the actual disease state is $Y\in \{0,1\}$. Assume 1 represents having the disease (for $Y$) and testing positive (for $\hat Y$).

Measures conditional on disease status:¶

  1. Sensitivity $P(\hat Y =1 ~|~ Y=1)$, the probability that the prediction is positive given the disease is present.
    • Also called the true positive rate, recall, and hit rate.
    • $P(\hat Y =0 ~|~ Y=1)$, one minus the sensitivity, is the false negative rate or miss rate.
  2. Specificity $P(\hat Y=0 ~|~ Y=0)$, the probability that the prediction is negative given the disease is absent.
    • Also called the true negative rate and selectivity.
    • $P(\hat Y=1~|~ Y=0)$, one minus the specificity, is the false positive rate or fall-out.

Measures conditional on the test result¶

  1. PPV, positive predictive value $P(Y=1 ~|~ \hat Y=1)$.
    • Also called the precision
  2. NPV, negative predictive value $P(Y=0 ~|~ \hat Y =0)$.
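As a quick check on these definitions, here is a sketch computing all four measures from a hypothetical set of true labels and predictions (the labels below are made up for illustration):

import numpy as np

y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])     ## true disease state
yhat = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])  ## prediction

sens = np.mean(yhat[y == 1] == 1)  ## P(yhat = 1 | y = 1)
spec = np.mean(yhat[y == 0] == 0)  ## P(yhat = 0 | y = 0)
ppv = np.mean(y[yhat == 1] == 1)   ## P(y = 1 | yhat = 1)
npv = np.mean(y[yhat == 0] == 0)   ## P(y = 0 | yhat = 0)
print(sens, spec, ppv, npv)        ## 0.75 0.833... 0.75 0.833...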

Other measures¶

  1. DLR+, diagnostic likelihood ratio of a positive prediction $P(\hat Y=1 ~|~ Y=1) / P(\hat Y=1 ~|~ Y= 0)$ is also the sensitivity over one minus the specificity, i.e. the true positive rate divided by the false positive rate.
  2. DLR-, diagnostic likelihood ratio of a negative prediction $P(\hat Y=0 ~|~ Y=1) / P(\hat Y=0 ~|~ Y=0)$ is one minus the sensitivity divided by the specificity or the false negative rate divided by the true negative rate.
  3. The disease prevalence is $P(Y=1)$ (and, generally less discussed, the predicted prevalence is $P(\hat Y =1)$).
  4. The accuracy is $P(\hat Y = Y) = P(\hat Y = 1 ~|~ Y = 1) P(Y = 1) + P(\hat Y = 0 ~|~ Y = 0) P(Y = 0)$, which is the sensitivity times the prevalence plus the specificity times one minus the prevalence.

Role of sampling in estimation¶

  • If you have a cross-sectional sample, then all of these quantities are directly estimable.
  • If the data were sampled by disease status, or augmented to force outcome balance, then estimates must condition on $Y$; the sensitivity, specificity, and DLR+/- remain directly estimable.
    • You can obtain the PPV and NPV using Bayes' rule given a disease prevalence.
  • In the rare setting where the design fixes the prediction result, the NPV and PPV would be directly estimable, and one would have to use the prevalence of a positive prediction and Bayes' rule to obtain the sensitivity and specificity.
  • Accuracy depends on the prevalence, so imbalance in the outcome categories constrains what counts as a reasonable accuracy; for example, always predicting the majority class already achieves accuracy equal to that class's prevalence.

Basic example¶

A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%.

Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the positive predictive value?

  • Mathematically, we want $P(Y=1 ~|~ \hat Y=1)$
  • We have:
    • the sensitivity,
    $$P(\hat Y=1 | Y=1) = .997$$
    • the specificity,
    $$P(\hat Y=0 | Y=0) =0.985$$
    • the prevalence
    $$P(Y=1) =0.001$$

Bayes' rule¶

$$ \begin{align*} P(Y=1 ~|~ \hat Y =1) & = \frac{P(\hat Y =1 ~|~ Y=1)P(Y=1)}{P(\hat Y=1~|~Y=1)P(Y=1) + P(\hat Y=1 ~|~ Y=0)P(Y=0)}\\ & = \frac{P(\hat Y=1~|~Y=1)P(Y=1)}{P(\hat Y=1~|~Y=1)P(Y=1) + \{1 - P(\hat Y=0 ~|~ Y = 0)\}\{1 - P(Y=1)\}} \\ & = \frac{.997\times .001}{.997 \times .001 + .015 \times .999} \approx .062 \end{align*} $$
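The same arithmetic in a couple of lines of Python, as a check:

sens, spec, prev = 0.997, 0.985, 0.001
ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
print(round(ppv, 3))  ## 0.062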

DLR¶

$$ P(Y = 1 ~|~ \hat Y = 1) = \frac{P(\hat Y=1~|~ Y = 1)P(Y=1)}{P(\hat Y=1)} $$

and

$$ P(Y=0 ~|~ \hat Y=1) = \frac{P(\hat Y=1 ~|~ Y=0)P(Y=0)}{P(\hat Y=1)}. $$

Dividing the first expression by the second gives

$$ \frac{P(Y = 1 ~|~ \hat Y=1)}{P(Y=0 ~|~ \hat Y=1)} = \frac{P(\hat Y = 1 ~|~ Y=1)}{P(\hat Y = 1 ~|~ Y=0)}\times \frac{P(Y=1)}{P(Y=0)} $$

DLR continued¶

  • In other words, the post-test odds of disease are the pretest odds of disease times the $DLR_+$.
  • Similarly, $DLR_-$ relates the decrease in the odds of the disease after a negative test result to the odds of disease prior to the test.
  • DLRs are the factors by which you multiply your pretest odds to get your post-test odds.

DLR+ example¶

  • Suppose a subject has a positive HIV test; then $DLR_+ = .997 / (1 - .985) \approx 66$.
  • The result of the positive test is that the odds of disease are now 66 times the pretest odds.
  • Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease

DLR- example¶

  • Suppose instead that a subject has a negative test result.
  • Then $DLR_- = (1 - .997) / .985 \approx .003$
  • Therefore, the post-test odds of disease are now about 0.3% of the pretest odds given the negative test.
  • Or, the hypothesis of disease is supported $.003$ times as much as the hypothesis of no disease given the negative test result; both updates are computed in the sketch below.
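Here is a minimal sketch of both updates, using odds = p / (1 - p) to move between probabilities and odds:

sens, spec, prev = 0.997, 0.985, 0.001
pretest_odds = prev / (1 - prev)
dlr_pos = sens / (1 - spec)  ## about 66
dlr_neg = (1 - sens) / spec  ## about .003

## post-test odds are the pretest odds times the relevant DLR
odds_pos = pretest_odds * dlr_pos
odds_neg = pretest_odds * dlr_neg

## converting back to probabilities
print(odds_pos / (1 + odds_pos))  ## about .062, matching the PPV above
print(odds_neg / (1 + odds_neg))  ## essentially 0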

ROC curves¶

  • The measures covered above require a binary prediction, but ML algorithms typically output a continuous score.
  • Often 50% (estimated more likely than not) is not a reasonable cutoff.
  • ROC curves consider the sensitivity and one minus the specificity, i.e. the true positive and false positive rates, for all possible cutoffs.

In [8]:
import pandas as pd
dat = pd.read_csv("https://raw.githubusercontent.com/bcaffo"\
    "/ds4bme_intro/master/data/oasis.csv")
dat[ ['FLAIR', 'GOLD_Lesions'] ].head()
Out[8]:
      FLAIR  GOLD_Lesions
0  1.143692             0
1  1.652552             0
2  1.036099             0
3  1.037692             0
4  1.580589             0
In [14]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
x0 = dat['FLAIR'][dat['GOLD_Lesions'] == 0]
x1 = dat['FLAIR'][dat['GOLD_Lesions'] == 1]
## fill (formerly shade) shades under the density estimate
sns.kdeplot(x0, fill = True, label = 'Gold Std = 0')
sns.kdeplot(x1, fill = True, label = 'Gold Std = 1')
plt.legend()
plt.show()

The ROC curve¶

  • Consider a given FLAIR threshold, say $c$.
  • The true positive rate given that threshold is $T(c) = P(X \geq c ~|~ Y=1)$
  • The false positive rate is $F(c) = P(X \geq c~|~ Y=0 )$.
  • The function $f \rightarrow T\{F^{-1}(f)\}$ is the ROC curve.
  • The ROC curve is typically displayed as the plot of the points $(f, T\{F^{-1}(f)\})$ for $0 \leq f \leq 1$.
  • An empirical estimate of the ROC curve requires empirical estimates of $T$ and $F$ which can be done non-parametrically or parametrically.

The ROC curve satisfies:¶

  1. Always starts at the point (0, 0).
  2. Always ends at the point (1, 1).
  3. Is monotonically increasing (always moves laterally or upward).
  4. A uniformly better ROC curve lies entirely above a worse ROC curve.
  5. Is never better than the theoretical limit of the ROC curve: the upper-left border of the unit square through the points (0, 0), (0, 1), and (1, 1).
  6. If the prediction value is a random uniform number independent of the gold standard, the ROC curve is the identity line from (0, 0) to (1, 1).
  7. The ROC curve is invariant to strictly increasing monotonic transformations of the predictor; see the sketch after this list.
  8. The ROC curve is the identity line whenever the prediction is independent of the gold standard, provided the prediction is continuous.
  9. The ROC curve for a prediction used as a test for one minus the gold standard is the original ROC curve reflected across the identity line.
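Property 7 is easy to check empirically. The sketch below assumes the FLAIR data loaded earlier and uses scikit-learn's roc_curve; it verifies that exponentiating the predictor (a strictly increasing transform) leaves the empirical ROC curve unchanged:

import numpy as np
from sklearn.metrics import roc_curve

xx = dat['FLAIR']; yy = dat['GOLD_Lesions']
fpr1, tpr1, _ = roc_curve(yy, xx)
fpr2, tpr2, _ = roc_curve(yy, np.exp(xx))  ## strictly increasing transform
print(np.allclose(fpr1, fpr2) and np.allclose(tpr1, tpr2))  ## True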

Non-parametric estimation¶

We can estimate these probabilities using the fraction of times a FLAIR value is above the threshold in each of the two gold standard groups:

$$ \hat T(c) = \frac{1}{|\Gamma_1|} \sum_{i \in \Gamma_1} I(x_i \geq c) $$

$$ \hat F(c) = \frac{1}{|\Gamma_0|} \sum_{i \in \Gamma_0} I(x_i \geq c) $$

where $\Gamma_1 = \{i ~|~ y_i = 1\}$ and $\Gamma_0 = \{i ~|~ y_i = 0\}$.

In [17]:
import numpy as np
x = dat['FLAIR']; y = dat['GOLD_Lesions']
## thresholds: every observed value, plus endpoints below the min and
## above the max so the curve runs all the way from (1, 1) to (0, 0)
c = np.concatenate( [[ np.min(x) - 1], np.sort(np.unique(x)), [np.max(x) + 1]])
## fraction of each gold standard group at or above each threshold
tpr = [np.mean( (x1 >= citer) ) for citer in c]
fpr = [np.mean( (x0 >= citer) ) for citer in c]
plt.plot(fpr, tpr);
plt.plot([0,1], [0,1]);

Binormal estimation¶

  • We could also assume distributional forms for $T$ and $F$.
  • For example, suppose $X ~|~ Y=y \sim N(\mu_y, \sigma_y^2)$.
  • Then, if $\Phi$ is the standard normal distribution function,
$$T(c) = 1 - \Phi\{ (c - \mu_1) / \sigma_1 \}$$

$$F(c) = 1 - \Phi\{ (c - \mu_0) / \sigma_0 \}$$

$$F^{-1}(f) = \mu_0 + \sigma_0 \Phi^{-1}(1-f)$$

  • Thus, the ROC curve is
$$ T\{F^{-1}(f)\} = 1 - \Phi\left\{ \frac{\mu_0 -\mu_1}{\sigma_1} + \frac{\sigma_0}{{\sigma_1}} \Phi^{-1}(1-f) \right\} $$

where $\mu_y$ and $\sigma_y$ can be estimated from the data.

In [20]:
from scipy.stats import norm

## group means and standard deviations for the binormal model
mu0, mu1 = np.mean(x0), np.mean(x1)
s0, s1 = np.std(x0), np.std(x1)
c_seq = np.linspace(0, 3, 1000)

## survival functions at each threshold give the binormal ROC points
fpr_binorm = 1 - norm.cdf(c_seq, mu0, s0)
tpr_binorm = 1 - norm.cdf(c_seq, mu1, s1)

## overlay the empirical ROC, the identity line, and the binormal ROC
plt.plot(fpr, tpr)
plt.plot([0,1], [0,1]);
plt.plot(fpr_binorm, tpr_binorm);

AUC¶

  • The area under the ROC curve is given by
$$ AUC = \int_0^1 T\{F^{-1}(f)\}df. $$
  • The ideal test has AUC = 1.
  • A completely uninformative test has AUC = 0.5.
  • An informatively bad test has AUC < 0.5.
  • The AUC has a nice interpretation. Let $X_1$ and $X_0$ be independent random variables with distribution functions $1-T$ and $1-F$, respectively. Then
$$ AUC = P(X_1 > X_0) $$
  • It can be shown that the Wilcoxon rank sum test is a test of $AUC=0.5$; see the sketch after the next cell.
In [25]:
## Calculating AUC directly as the fraction of pairs where x1 >= x0
auc_hat = np.sum([j >= i for i in x0 for j in x1]) / (len(x0) * len(x1))
print(auc_hat)

## using sklearn (imported after the print so auc isn't shadowed early)
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y, x)
print(auc(fpr, tpr))
0.7588
0.7587999999999999
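To connect with the Wilcoxon point above: scipy's mannwhitneyu returns the Mann-Whitney U statistic (the Wilcoxon rank sum statistic in its U form), and U divided by the number of pairs is exactly this AUC, while its p-value tests $AUC=0.5$. A quick check:

from scipy.stats import mannwhitneyu

## U counts the pairs where x1 beats x0 (ties counted as half)
u, p = mannwhitneyu(x1, x0, alternative = 'two-sided')
print(u / (len(x1) * len(x0)))  ## matches the AUC above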

Binormal AUC¶

$$X_0 - X_1 \sim N(\mu_0 - \mu_1, \sigma_0^2 + \sigma_1^2)$$

We can just directly calculate

$$P(X_1 > X_0) = P(X_0 - X_1 < 0)$$
In [31]:
import scipy.stats as stats
## P(X_0 - X_1 < 0) = P(X_1 > X_0), the binormal AUC
stats.norm.cdf(0, loc = mu0 - mu1, scale = np.sqrt(s0**2 + s1**2))
Out[31]:
0.766401836491996