Exploratory data analysis

Open In Colab

Exploratory data analysis#

In this lecture, we’ll talk about EDA (exploratory data analysis) using the python framework seaborn. EDA is the process of using graphs to uncover features in your data often interactively. EDA is hard to quantify, but is touted by most applied data scientists as a crucial component of their craft. EDA is often summarized by the famous sayings

A picture is worth a 1,000 words

Or saying how impactful intrer-ocular content is (i.e. when information hits you right between the eyes).

I’m using Seaborn as the framework. There’s several plotting frameworks in python, but I find that seaborn has the nicest default plotting options. Also, it’s built on top of matplotlib, which is the main plotting library for DS for python.

Let’s start with loading up some libraries.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
## this sets some style parameters
sns.set()

First let’s download the data. Then we’ll read it in and drop some columns that aren’t needed for this analysis.

## Reading it in, keeping only volume
df = pd.read_csv("https://raw.githubusercontent.com/bcaffo/ds4bme_intro/master/data/kirby21.csv")
df = df.drop(['Unnamed: 0', 'rawid', 'min', 'max', 'mean', 'std'],\
             axis = 1)
df.head(4)
roi volume type level id icv tbv
0 Telencephalon_L 531111 1 1 127 1378295 1268519
1 Telencephalon_R 543404 1 1 127 1378295 1268519
2 Diencephalon_L 9683 1 1 127 1378295 1268519
3 Diencephalon_R 9678 1 1 127 1378295 1268519

Let’s look at the Type 1 Level 1 data and create a variable called comp which is brain composition, defined as the regional volumes over total brain volume. We’ll do this by selecting roi and comp then grouping by roi and taking the mean of the compostions.

## Extract the Type 1 Level 1 data
t1l1 = df.loc[(df.type == 1) & (df.level == 1)]
## create a composition variable
t1l1 = t1l1.assign(comp = t1l1.volume / t1l1.tbv)
t1l1 = t1l1.loc[t1l1.roi != 'CSF']
## get the mean of the composition variable across
## subjects by ROI
summary = t1l1[['roi', 'comp']].groupby('roi', as_index=False).mean()
print(summary)
               roi      comp
0   Diencephalon_L  0.007563
1   Diencephalon_R  0.007634
2    Mesencephalon  0.008647
3    Metencephalon  0.124883
4   Myelencephalon  0.003785
5  Telencephalon_L  0.420305
6  Telencephalon_R  0.427184

OK, let’s try our first plot, a seaborn bar plot.

g = sns.barplot(x='roi', y = 'comp', data = summary)
## this is the matplotlib command for rotating 
## axis tick labels by 90 degrees.
plt.xticks(rotation = 90)
(array([0, 1, 2, 3, 4, 5, 6]),
 [Text(0, 0, 'Diencephalon_L'),
  Text(1, 0, 'Diencephalon_R'),
  Text(2, 0, 'Mesencephalon'),
  Text(3, 0, 'Metencephalon'),
  Text(4, 0, 'Myelencephalon'),
  Text(5, 0, 'Telencephalon_L'),
  Text(6, 0, 'Telencephalon_R')])
_images/796ea4905da793f8817248f7691a3a756bdae4e58bb8c0f791ad680bbd690fc2.png

Unfortunately, seaborn doesn’t have a stakced bar chart. However, pandas does have one built in. To do this, however, we have to create a version of the data with ROIs as the columns. This is done with a pivot statement.

t1l1pivot = t1l1.pivot(index = 'id', columns = 'roi', values = 'volume')
t1l1pivot.head(4)
roi Diencephalon_L Diencephalon_R Mesencephalon Metencephalon Myelencephalon Telencephalon_L Telencephalon_R
id
127 9683 9678 10268 159402 4973 531111 543404
142 9156 9071 10516 165803 4912 545603 552216
239 8937 9004 9070 124504 4023 483107 490805
346 8828 8933 9788 135090 4428 558849 568830
t1l1pivot.plot(kind='bar', stacked=True, legend= False)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
<matplotlib.legend.Legend at 0x7ba67426d870>
_images/96c2b27873270e91ac9ff275884aeef11005d3843e7c40836f309fa24589a516.png

Let’s do some scatterplots. Let’s look at bilateral symmetry in the telencephalon.

sns.scatterplot(x = 'Telencephalon_L', y = 'Telencephalon_R', data = t1l1pivot)
plt.xticks(rotation = 90)

#plot an identity line from the data min to the data max
x1 = min([t1l1pivot.Telencephalon_L.min(), t1l1pivot.Telencephalon_R.min()])
x2 = max([t1l1pivot.Telencephalon_L.max(), t1l1pivot.Telencephalon_R.max()])
plt.plot([x1, x2], [x1 , x2])
[<matplotlib.lines.Line2D at 0x7ba67426e6b0>]
_images/20b61672cd8dbce52ff5efd1ff310fd216b41cb2ccebe1a5cff80eefcbe9bfc2.png

This plot has the issue that there’s a lot of blank space. This is often addressed via a mean difference plot. This plot shows (X+Y) / 2 versus (X-y). This is basically just rotating the plot above by 45 degrees to get rid of all of the blank space around the diagonal line. Alternatively, you could plot (log(x) + log(y)) / 2 versus log(X) - log(Y). This plots the log of the geometric mean of the two observations versus the log of their ratio. Sometimes people use log base 2 or log base 10.

t1l1pivot = t1l1pivot.assign(Tel_logmean = lambda x: (np.log(x.Telencephalon_L) * .5 +  np.log(x.Telencephalon_R)* .5))
t1l1pivot = t1l1pivot.assign(Tel_logdiff = lambda x: (np.log(x.Telencephalon_R) -  np.log(x.Telencephalon_L)))
sns.scatterplot(x = 'Tel_logmean', y = 'Tel_logdiff', data = t1l1pivot)
plt.axhline(0, color='green')
plt.xticks(rotation = 90)
(array([12.8 , 12.85, 12.9 , 12.95, 13.  , 13.05, 13.1 , 13.15, 13.2 ,
        13.25, 13.3 ]),
 [Text(12.8, 0, '12.80'),
  Text(12.850000000000001, 0, '12.85'),
  Text(12.9, 0, '12.90'),
  Text(12.950000000000001, 0, '12.95'),
  Text(13.0, 0, '13.00'),
  Text(13.05, 0, '13.05'),
  Text(13.100000000000001, 0, '13.10'),
  Text(13.15, 0, '13.15'),
  Text(13.200000000000001, 0, '13.20'),
  Text(13.25, 0, '13.25'),
  Text(13.3, 0, '13.30')])
_images/74b1cb0bfe2fa1c5bd8c55b47b62c69c55b034cb8673e7c93e5eeca911f24021.png

Thus, apparently, the right side is always a little bigger than the left and the scale of the ratio is $e^{0.02}$ while the scale of the geometric mean is $e^{13}$. Note, $\exp(x) \approx 1 + x$ for $x \approx 0$. So it’s about 2% larger. A note about right versus left in imaging. Often the labels get switched as there are different conventions (is it the right of the subject or the right of the viewer when looking straight at the subject?). Typically, it’s known that some of the areas of subject’s left hemisphere are larger and so it’s probably radiological (right of the viewer) convention here. Here’s a nicely done article about right versus left brain.

Also, in case you don’t believe me, here’s a plot of $e^x$ versus $1+x$ for values up to 0.1. This is obtained as the Taylor expasion for $e^x$ around 0.

x = np.arange(0, .1, .001)
ex = np.exp(x)

sns.lineplot(x = x, y = ex)
plt.plot(x, x + 1)
/home/bcaffo/miniconda3/envs/ds4bio_new/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/home/bcaffo/miniconda3/envs/ds4bio_new/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
[<matplotlib.lines.Line2D at 0x7ba66be77a60>]
_images/54890d3319e12def6c6cbc736e8a92dd02ecd4726aad723e8a8a8ee5fddef943.png