Consider the following definitions for the results of a diagnostic test ˆY∈{0,1}, where the actual disease state is Y∈{0,1}. Assume 1 represents having/testing for the disease.
A study comparing the efficacy of HIV tests, reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%.
Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the positive predictive value?
and
P(Y=0 | ˆY=1)=P(ˆY=1 | Y=0)P(Y=0)P(ˆY=1).P(Y=1 | ˆY=1)P(Y=0 | ˆY=1)=P(ˆY=1 | Y=1)P(ˆY=1 | Y=0)×P(Y=1)P(Y=0)import pandas as pd
dat = pd.read_csv("https://raw.githubusercontent.com/bcaffo"\
"/ds4bme_intro/master/data/oasis.csv")
dat[ ['FLAIR', 'GOLD_Lesions'] ].head()
FLAIR | GOLD_Lesions | |
---|---|---|
0 | 1.143692 | 0 |
1 | 1.652552 | 0 |
2 | 1.036099 | 0 |
3 | 1.037692 | 0 |
4 | 1.580589 | 0 |
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
x0 = dat['FLAIR'][dat['GOLD_Lesions'] == 0]
x1 = dat['FLAIR'][dat['GOLD_Lesions'] == 1]
sns.kdeplot(x0, shade = True, label = 'Gold Std = 0')
sns.kdeplot(x1, shade = True, label = 'Gold Std = 1')
plt.show();
We can estimate these probabilities using the fraction of times a FLAIR values is above the threshold in the two gold standard groups
ˆT(c)=1|Γ1|∑i∈Γ1I(xi≥c)ˆF(c)=1|Γ0|∑i∈Γ0I(xi≥c)where Γ1={i | yi=1} and Γ0={i | yi=0}.
import numpy as np
x = dat['FLAIR']; y = dat['GOLD_Lesions']
c = np.concatenate( [[ np.min(x) - 1], np.sort(np.unique(x))
, [np.max(x) + 1]])
tpr = [np.mean( (x1 >= citer) ) for citer in c]
fpr = [np.mean( (x0 >= citer) ) for citer in c]
plt.plot(fpr, tpr);
plt.plot([0,1], [0,1]);
F(c)=1−Φ{(c−μ0)/σ0} F−1(f)=μ0+σ0Φ−1(1−f)
where μy and σy can be estimated from the data.
from scipy.stats import norm
mu0, mu1 = np.mean(x0), np.mean(x1)
s0, s1 = np.std(x0), np.std(x1)
c_seq = np.linspace(0, 3, 1000)
fpr_binorm = 1-norm.cdf(c_seq, mu0, s0)
tpr_binorm = 1-norm.cdf(c_seq, mu1, s1)
plt.plot(fpr, tpr)
plt.plot([0,1], [0,1]);
plt.plot(fpr_binorm, tpr_binorm);
## Calculating AUC directly as the faction of instances where X1 > X0
auc = np.sum([j >= i for i in x0 for j in x1]) / (len(x0) * len(x1))
print(auc)
## using sKlearn
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y, x)
print(auc(fpr, tpr))
0.7588 0.7587999999999999
import scipy.stats as stats
stats.norm.cdf(0, loc = mu0 - mu1, scale = np.sqrt(s0**2 + s1**2))
0.766401836491996