Data cleaning by example

We’re going to cover data cleaning by example. Primarily, you’re going to work in pandas, a library for manipulating tabular data.

Imports and files

The first thing we’ll try is loading some data and plotting it. To do this, we’ll need some packages. Let’s load pandas, a package for data management, along with numpy and matplotlib. The Python command for this is import.

import pandas as pd ## Pandas is our main data cleaning library
import numpy as np ## Numpy is our main numerical library
import matplotlib as mpl ## Matplotlib is our main plotting library

Reading data in with pandas

Let’s now read in an MRICloud dataset using pandas. We want the function read_csv within pandas. Because we imported pandas as pd, the command is pd.read_csv. pandas can also accept URLs, so we just pass the link to the file as the argument. The data we want to read in is in a GitHub repo I created.

## Pandas can read in from a URL
df = pd.read_csv("https://raw.githubusercontent.com/bcaffo/ds4bme_intro/master/data/kirby127a_3_1_ax_283Labels_M2_corrected_stats.csv")
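If the URL is unreachable, the same call works on a local path or any file-like object. Here is a minimal sketch, assuming a tiny in-memory CSV: the names csv_text and sample are made up for illustration, and the two rows are copied from the first rows of this dataset with only a few of its columns.

```python
import io
import pandas as pd

# Two rows copied from this dataset; csv_text and sample are illustrative names
csv_text = """rawid,roi,volume,type,level
kirby127a_3_1_ax.img,Telencephalon_L,531111,1,1
kirby127a_3_1_ax.img,Telencephalon_R,543404,1,1
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (number of rows, number of columns)
```

The header line becomes the column names, and pandas infers that volume, type and level are integers.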

Let’s look at the first 4 rows of our dataframe. The object df is a pandas DataFrame with associated methods. One is head, which shows the first few rows of data.

df.head(4)

	Unnamed: 0	rawid	roi	volume	min	max	mean	std	type	level
0	1	kirby127a_3_1_ax.img	Telencephalon_L	531111	0	374	128.3013	51.8593	1	1
1	2	kirby127a_3_1_ax.img	Telencephalon_R	543404	0	300	135.0683	53.6471	1	1
2	3	kirby127a_3_1_ax.img	Diencephalon_L	9683	15	295	193.5488	32.2733	1	1
3	4	kirby127a_3_1_ax.img	Diencephalon_R	9678	10	335	193.7051	32.7869	1	1
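Besides head, a few other attributes give a quick overview of a dataframe. The sketch below uses a small stand-in frame called demo (a made-up name) whose values are copied from the output above, so it runs without downloading anything.

```python
import pandas as pd

# Stand-in frame; values copied from the head() output above
demo = pd.DataFrame({
    "roi": ["Telencephalon_L", "Telencephalon_R", "Diencephalon_L", "Diencephalon_R"],
    "volume": [531111, 543404, 9683, 9678],
    "type": [1, 1, 1, 1],
    "level": [1, 1, 1, 1],
})

print(demo.shape)          # (rows, columns)
print(list(demo.columns))  # column names
print(demo.dtypes)         # one dtype per column
print(demo.head(2))        # first two rows only
```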

Working with the data

Let’s get rid of the column rawid and the unnamed column, since they’re not useful for today’s lecture. Also, let’s work with only the volume, so we drop min, max, mean and std as well.

df = df.drop(['Unnamed: 0', 'rawid', 'min', 'max', 'mean', 'std'], axis = 1)
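An equivalent way to get the same result is to list the columns you want to keep rather than the ones you want to drop. This is just a sketch on a stand-in frame (demo is a made-up name holding two rows and a subset of the columns), not the code used in the rest of the lecture.

```python
import pandas as pd

# Stand-in frame with a subset of the original columns
demo = pd.DataFrame({
    "Unnamed: 0": [1, 2],
    "rawid": ["kirby127a_3_1_ax.img"] * 2,
    "roi": ["Telencephalon_L", "Telencephalon_R"],
    "volume": [531111, 543404],
    "min": [0, 0],
    "type": [1, 1],
    "level": [1, 1],
})

# Option 1: drop what you don't want (columns= is the same as axis=1)
dropped = demo.drop(columns=["Unnamed: 0", "rawid", "min"])

# Option 2: select what you do want
kept = demo[["roi", "volume", "type", "level"]]

print(dropped.equals(kept))  # both routes give the same frame
```

Selecting by name is often easier to read when you keep only a few columns out of many.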

Now let’s create a column called icv for intra-cranial volume. ICV is defined as the sum of the Type 1 Level 1 structures and cerebrospinal fluid (CSF). For the rest of this lecture, we’re just going to look at this type and level.

## Extract the Type 1 Level 1 data
t1l1 = df.loc[(df['type'] == 1) & (df['level'] == 1)].copy()
## Create a new column based on ICV
t1l1['icv'] = t1l1['volume'].sum()
t1l1

	roi	volume	type	level	icv
0	Telencephalon_L	531111	1	1	1378295
1	Telencephalon_R	543404	1	1	1378295
2	Diencephalon_L	9683	1	1	1378295
3	Diencephalon_R	9678	1	1	1378295
4	Mesencephalon	10268	1	1	1378295
5	Metencephalon	159402	1	1	1378295
6	Myelencephalon	4973	1	1	1378295
7	CSF	109776	1	1	1378295
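To see what the row filter is doing, it can help to look at the boolean mask on its own. In this sketch the frame demo and its third row (made_up_roi, type 2) are invented for illustration so that one row fails the filter. The other two rows come from the table above.

```python
import pandas as pd

demo = pd.DataFrame({
    "roi": ["Telencephalon_L", "CSF", "made_up_roi"],  # last row is hypothetical
    "volume": [531111, 109776, 100],
    "type": [1, 1, 2],
    "level": [1, 1, 1],
})

# Each comparison gives a boolean Series; & combines them element by element.
# The parentheses are needed because & binds more tightly than ==.
mask = (demo["type"] == 1) & (demo["level"] == 1)
print(mask.tolist())

subset = demo.loc[mask].copy()
# Assigning a single number to a column repeats it in every row
subset["icv"] = subset["volume"].sum()
print(subset)
```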

One can access variables as attributes, like df.type, or using brackets, like df['type']. I prefer the latter, since it can accommodate things like spaces, periods or other special characters in the variable name. In addition to defining new variables using brackets, one can use assign. The .copy() command is used because I want a new dataframe, not just a reference to a slice of the original.
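Here is a small sketch of both points. The frame demo is a stand-in with values from the tables above. It keeps the min column from the original data because that name also belongs to a dataframe method, which is one more case where brackets are the safer choice.

```python
import pandas as pd

demo = pd.DataFrame({
    "roi": ["Telencephalon_L", "Telencephalon_R"],
    "volume": [531111, 543404],
    "min": [0, 0],
})

# Attribute and bracket access agree for an ordinary column name
print(demo.volume.equals(demo["volume"]))

# demo.min is the DataFrame method, not the column, so brackets are required
print(callable(demo.min))
print(demo["min"].tolist())

# .copy() gives an independent frame: changing it leaves demo untouched
subset = demo.loc[demo["volume"] > 540000].copy()
subset["volume"] = 0
print(demo["volume"].tolist())
```

Without the .copy(), pandas may warn that you are assigning to a slice of another dataframe, depending on the version and settings.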

Total brain volume (TBV) is defined as the sum of the volume for all rows except CSF.

t1l1 = t1l1.assign(tbv = t1l1.loc[t1l1['roi'] != 'CSF', 'volume'].sum())
t1l1

	roi	volume	type	level	icv	tbv
0	Telencephalon_L	531111	1	1	1378295	1268519
1	Telencephalon_R	543404	1	1	1378295	1268519
2	Diencephalon_L	9683	1	1	1378295	1268519
3	Diencephalon_R	9678	1	1	1378295	1268519
4	Mesencephalon	10268	1	1	1378295	1268519
5	Metencephalon	159402	1	1	1378295	1268519
6	Myelencephalon	4973	1	1	1378295	1268519
7	CSF	109776	1	1	1378295	1268519
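As a sanity check, TBV should equal ICV minus the CSF volume. The sketch below rebuilds the roi and volume columns from the table above in a stand-in frame (t1l1_demo is a made-up name) and confirms the arithmetic.

```python
import pandas as pd

# roi and volume copied from the table above
t1l1_demo = pd.DataFrame({
    "roi": ["Telencephalon_L", "Telencephalon_R", "Diencephalon_L", "Diencephalon_R",
            "Mesencephalon", "Metencephalon", "Myelencephalon", "CSF"],
    "volume": [531111, 543404, 9683, 9678, 10268, 159402, 4973, 109776],
})

icv = t1l1_demo["volume"].sum()
csf = t1l1_demo.loc[t1l1_demo["roi"] == "CSF", "volume"].sum()
tbv = t1l1_demo.loc[t1l1_demo["roi"] != "CSF", "volume"].sum()

print(icv, csf, tbv)
print(icv - csf == tbv)  # TBV is ICV without the CSF row
```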

Let’s look at brain composition, measured here as each region’s volume as a fraction of TBV.

t1l1['comp'] = t1l1['volume'] / t1l1['tbv']
t1l1

	roi	volume	type	level	icv	tbv	comp
0	Telencephalon_L	531111	1	1	1378295	1268519	0.418686
1	Telencephalon_R	543404	1	1	1378295	1268519	0.428377
2	Diencephalon_L	9683	1	1	1378295	1268519	0.007633
3	Diencephalon_R	9678	1	1	1378295	1268519	0.007629
4	Mesencephalon	10268	1	1	1378295	1268519	0.008094
5	Metencephalon	159402	1	1	1378295	1268519	0.125660
6	Myelencephalon	4973	1	1	1378295	1268519	0.003920
7	CSF	109776	1	1	1378295	1268519	0.086539
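One property worth checking: because TBV excludes CSF, the comp values of the non-CSF rows should add up to 1, while the CSF row is expressed relative to TBV rather than being part of it. A quick sketch, again on a stand-in frame (t1l1_demo is a made-up name) built from the table above:

```python
import pandas as pd

t1l1_demo = pd.DataFrame({
    "roi": ["Telencephalon_L", "Telencephalon_R", "Diencephalon_L", "Diencephalon_R",
            "Mesencephalon", "Metencephalon", "Myelencephalon", "CSF"],
    "volume": [531111, 543404, 9683, 9678, 10268, 159402, 4973, 109776],
})

not_csf = t1l1_demo["roi"] != "CSF"
tbv = t1l1_demo.loc[not_csf, "volume"].sum()

# Dividing a column by a scalar is applied row by row
t1l1_demo["comp"] = t1l1_demo["volume"] / tbv

# Sum over the brain structures only; should be 1 up to floating point error
brain_total = t1l1_demo.loc[not_csf, "comp"].sum()
print(brain_total)
```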

Plotting

Pandas has built-in methods for plotting. Later on, we’ll try different plotting packages.

t1l1.plot.bar(x='roi',y='comp');

Figure: bar chart of comp (fraction of TBV) for each roi.
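The pandas plotting methods draw with matplotlib and return a matplotlib Axes object, so you can adjust labels afterwards. A minimal sketch, assuming matplotlib is installed: the frame demo holds three comp values copied from the table above, the label and title text are my own choices, and the Agg backend line is only there so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Three rows copied from the table above
demo = pd.DataFrame({
    "roi": ["Telencephalon_L", "Telencephalon_R", "CSF"],
    "comp": [0.418686, 0.428377, 0.086539],
})

# plot.bar returns the matplotlib Axes it drew on
ax = demo.plot.bar(x="roi", y="comp", legend=False)
ax.set_ylabel("fraction of TBV")  # illustrative label text
ax.set_title("Brain composition")  # illustrative title text
ax.figure.tight_layout()
```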

In Colab, you have to install any packages it doesn’t already have every time you reconnect the runtime. I’ve commented this out here, since plotly is already installed locally for me. To install in Colab, put a ! in front of the Unix command. In this case, we’re using the Python package management system pip to install plotly, an interactive graphing environment.

#!pip install plotly

We can create an interactive plot with plotly. This is a professionally developed package that makes interactive plotting very easy. It also renders nicely within Colab or Jupyter notebooks. For plotly graphics, I would suggest assigning the graph to a variable and then calling that variable to show the plot. This way you can modify the plot later if you’d like.

import plotly.express as px
myplot = px.bar(t1l1, x='roi', y='volume')
myplot.show()