Consider the following definitions for the results of a diagnostic test $\hat Y\in \{0,1\}$, where the actual disease state is $Y\in \{0,1\}$. Assume 1 represents having/testing for the disease.
A study comparing the efficacy of HIV tests, reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%.
Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the positive predictive value?
and
$$ P(Y=0 ~|~ \hat Y=1) = \frac{P(\hat Y=1 ~|~ Y=0)P(Y=0)}{P(\hat Y=1)}. $$$$ \frac{P(Y = 1 ~|~ \hat Y=1)}{P(Y=0 ~|~ \hat Y=1)} = \frac{P(\hat Y = 1 ~|~ Y=1)}{P(\hat Y = 1 ~|~ Y=0)}\times \frac{P(Y=1)}{P(Y=0)} $$import pandas as pd
dat = pd.read_csv("https://raw.githubusercontent.com/bcaffo"\
"/ds4bme_intro/master/data/oasis.csv")
dat[ ['FLAIR', 'GOLD_Lesions'] ].head()
FLAIR | GOLD_Lesions | |
---|---|---|
0 | 1.143692 | 0 |
1 | 1.652552 | 0 |
2 | 1.036099 | 0 |
3 | 1.037692 | 0 |
4 | 1.580589 | 0 |
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
x0 = dat['FLAIR'][dat['GOLD_Lesions'] == 0]
x1 = dat['FLAIR'][dat['GOLD_Lesions'] == 1]
sns.kdeplot(x0, shade = True, label = 'Gold Std = 0')
sns.kdeplot(x1, shade = True, label = 'Gold Std = 1')
plt.show();
We can estimate these probabilities using the fraction of times a FLAIR values is above the threshold in the two gold standard groups
$$ \hat T(c) = \frac{1}{|\Gamma_1|} \sum_{i \in \Gamma_1} I(x_i \geq c) $$$$ \hat F(c) = \frac{1}{|\Gamma_0|} \sum_{i \in \Gamma_0} I(x_i \geq c) $$where $\Gamma_1 = \{i ~|~ y_i = 1\}$ and $\Gamma_0 = \{i ~|~ y_i = 0\}$.
import numpy as np
x = dat['FLAIR']; y = dat['GOLD_Lesions']
c = np.concatenate( [[ np.min(x) - 1], np.sort(np.unique(x))
, [np.max(x) + 1]])
tpr = [np.mean( (x1 >= citer) ) for citer in c]
fpr = [np.mean( (x0 >= citer) ) for citer in c]
plt.plot(fpr, tpr);
plt.plot([0,1], [0,1]);
$$F(c) = 1 - \Phi\{ (c - \mu_0) / \sigma_0 \}$$ $$F^{-1}(f) = \mu_0 + \sigma_0 \Phi^{-1}(1-f)$$
where $\mu_y$ and $\sigma_y$ can be estimated from the data.
from scipy.stats import norm
mu0, mu1 = np.mean(x0), np.mean(x1)
s0, s1 = np.std(x0), np.std(x1)
c_seq = np.linspace(0, 3, 1000)
fpr_binorm = 1-norm.cdf(c_seq, mu0, s0)
tpr_binorm = 1-norm.cdf(c_seq, mu1, s1)
plt.plot(fpr, tpr)
plt.plot([0,1], [0,1]);
plt.plot(fpr_binorm, tpr_binorm);
## Calculating AUC directly as the faction of instances where X1 > X0
auc = np.sum([j >= i for i in x0 for j in x1]) / (len(x0) * len(x1))
print(auc)
## using sKlearn
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y, x)
print(auc(fpr, tpr))
0.7588 0.7587999999999999
We can just directly calculate
$$P(X_1 > X_0) = P(X_0 - X_1 < 0)$$import scipy.stats as stats
stats.norm.cdf(0, loc = mu0 - mu1, scale = np.sqrt(s0**2 + s1**2))
0.766401836491996