Exploratory data analysis

Exploratory data analysis#

In this lecture, we’ll talk about EDA (exploratory data analysis) using the python framework seaborn. EDA is the process of using graphs to uncover features in your data often interactively. EDA is hard to quantify, but is touted by most applied data scientists as a crucial component of their craft. EDA is often summarized by the famous sayings

A picture is worth a 1,000 words

Or saying how impactful intrer-ocular content is (i.e. when information hits you right between the eyes).

I’m using Seaborn as the framework. There’s several plotting frameworks in python, but I find that seaborn has the nicest default plotting options. Also, it’s built on top of matplotlib, which is the main plotting library for DS for python.

Let’s start with loading up some libraries.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
## this sets some style parameters
sns.set()

First let’s download the data. Then we’ll read it in and drop some columns that aren’t needed for this analysis.

## Reading it in, keeping only volume
df = pd.read_csv("https://raw.githubusercontent.com/bcaffo/ds4bme_intro/master/data/kirby21.csv")
df = df.drop(['Unnamed: 0', 'rawid', 'min', 'max', 'mean', 'std'],\
             axis = 1)
df.head(4)

	roi	volume	type	level	id	icv	tbv
0	Telencephalon_L	531111	1	1	127	1378295	1268519
1	Telencephalon_R	543404	1	1	127	1378295	1268519
2	Diencephalon_L	9683	1	1	127	1378295	1268519
3	Diencephalon_R	9678	1	1	127	1378295	1268519

Let’s look at the Type 1 Level 1 data and create a variable called comp which is brain composition, defined as the regional volumes over total brain volume. We’ll do this by selecting roi and comp then grouping by roi and taking the mean of the compostions.

## Extract the Type 1 Level 1 data
t1l1 = df.loc[(df.type == 1) & (df.level == 1)]

## create a composition variable
t1l1 = t1l1.assign(comp = t1l1.volume / t1l1.tbv)
t1l1 = t1l1.loc[t1l1.roi != 'CSF']
## get the mean of the composition variable across
## subjects by ROI
summary = t1l1[['roi', 'comp']].groupby('roi', as_index=False).mean()
print(summary)

               roi      comp
 Diencephalon_L  0.007563
 Diencephalon_R  0.007634
  Mesencephalon  0.008647
  Metencephalon  0.124883
 Myelencephalon  0.003785
Telencephalon_L  0.420305
Telencephalon_R  0.427184

OK, let’s try our first plot, a seaborn bar plot.

g = sns.barplot(x='roi', y = 'comp', data = summary)
## this is the matplotlib command for rotating 
## axis tick labels by 90 degrees.
plt.xticks(rotation = 90)

(array([0, 1, 2, 3, 4, 5, 6]),
 [Text(0, 0, 'Diencephalon_L'),
  Text(1, 0, 'Diencephalon_R'),
  Text(2, 0, 'Mesencephalon'),
  Text(3, 0, 'Metencephalon'),
  Text(4, 0, 'Myelencephalon'),
  Text(5, 0, 'Telencephalon_L'),
  Text(6, 0, 'Telencephalon_R')])

_images/796ea4905da793f8817248f7691a3a756bdae4e58bb8c0f791ad680bbd690fc2.png

Unfortunately, seaborn doesn’t have a stakced bar chart. However, pandas does have one built in. To do this, however, we have to create a version of the data with ROIs as the columns. This is done with a pivot statement.

t1l1pivot = t1l1.pivot(index = 'id', columns = 'roi', values = 'volume')
t1l1pivot.head(4)

roi	Diencephalon_L	Diencephalon_R	Mesencephalon	Metencephalon	Myelencephalon	Telencephalon_L	Telencephalon_R
id
127	9683	9678	10268	159402	4973	531111	543404
142	9156	9071	10516	165803	4912	545603	552216
239	8937	9004	9070	124504	4023	483107	490805
346	8828	8933	9788	135090	4428	558849	568830

t1l1pivot.plot(kind='bar', stacked=True, legend= False)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

<matplotlib.legend.Legend at 0x7ba67426d870>

_images/96c2b27873270e91ac9ff275884aeef11005d3843e7c40836f309fa24589a516.png

Let’s do some scatterplots. Let’s look at bilateral symmetry in the telencephalon.

sns.scatterplot(x = 'Telencephalon_L', y = 'Telencephalon_R', data = t1l1pivot)
plt.xticks(rotation = 90)

#plot an identity line from the data min to the data max
x1 = min([t1l1pivot.Telencephalon_L.min(), t1l1pivot.Telencephalon_R.min()])
x2 = max([t1l1pivot.Telencephalon_L.max(), t1l1pivot.Telencephalon_R.max()])
plt.plot([x1, x2], [x1 , x2])

[<matplotlib.lines.Line2D at 0x7ba67426e6b0>]

_images/20b61672cd8dbce52ff5efd1ff310fd216b41cb2ccebe1a5cff80eefcbe9bfc2.png

This plot has the issue that there’s a lot of blank space. This is often addressed via a mean difference plot. This plot shows (X+Y) / 2 versus (X-y). This is basically just rotating the plot above by 45 degrees to get rid of all of the blank space around the diagonal line. Alternatively, you could plot (log(x) + log(y)) / 2 versus log(X) - log(Y). This plots the log of the geometric mean of the two observations versus the log of their ratio. Sometimes people use log base 2 or log base 10.

t1l1pivot = t1l1pivot.assign(Tel_logmean = lambda x: (np.log(x.Telencephalon_L) * .5 +  np.log(x.Telencephalon_R)* .5))
t1l1pivot = t1l1pivot.assign(Tel_logdiff = lambda x: (np.log(x.Telencephalon_R) -  np.log(x.Telencephalon_L)))
sns.scatterplot(x = 'Tel_logmean', y = 'Tel_logdiff', data = t1l1pivot)
plt.axhline(0, color='green')
plt.xticks(rotation = 90)

(array([12.8 , 12.85, 12.9 , 12.95, 13.  , 13.05, 13.1 , 13.15, 13.2 ,
        13.25, 13.3 ]),
 [Text(12.8, 0, '12.80'),
  Text(12.850000000000001, 0, '12.85'),
  Text(12.9, 0, '12.90'),
  Text(12.950000000000001, 0, '12.95'),
  Text(13.0, 0, '13.00'),
  Text(13.05, 0, '13.05'),
  Text(13.100000000000001, 0, '13.10'),
  Text(13.15, 0, '13.15'),
  Text(13.200000000000001, 0, '13.20'),
  Text(13.25, 0, '13.25'),
  Text(13.3, 0, '13.30')])

_images/74b1cb0bfe2fa1c5bd8c55b47b62c69c55b034cb8673e7c93e5eeca911f24021.png

Thus, apparently, the right side is always a little bigger than the left and the scale of the ratio is \(e^{0.02}\) while the scale of the geometric mean is \(e^{13}\). Note, \(\exp(x) \approx 1 + x\) for \(x \approx 0\). So it’s about 2% larger. A note about right versus left in imaging. Often the labels get switched as there are different conventions (is it the right of the subject or the right of the viewer when looking straight at the subject?). Typically, it’s known that some of the areas of subject’s left hemisphere are larger and so it’s probably radiological (right of the viewer) convention here. Here’s a nicely done article about right versus left brain.

Also, in case you don’t believe me, here’s a plot of \(e^x\) versus \(1+x\) for values up to 0.1. This is obtained as the Taylor expasion for \(e^x\) around 0.

x = np.arange(0, .1, .001)
ex = np.exp(x)

sns.lineplot(x = x, y = ex)
plt.plot(x, x + 1)

/home/bcaffo/miniconda3/envs/ds4bio_new/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/home/bcaffo/miniconda3/envs/ds4bio_new/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

[<matplotlib.lines.Line2D at 0x7ba66be77a60>]

_images/54890d3319e12def6c6cbc736e8a92dd02ecd4726aad723e8a8a8ee5fddef943.png