R tidyverse quick example
R has undergone something of a revolution in the past few years. There has been a coordinated effort to develop a collection of packages focused on the major use cases of data science. RStudio, the developer of the well-known R/Python IDE, has been a major leader in this effort. However, there have also been too many additional open source contributors to even try to mention a meaningful subset. The result is a really pleasant and easy-to-learn data science environment. In this course, we’ll focus on this so-called “tidy” data analysis. Let’s run through a quick analysis.
Let’s start with the definition of a tidy dataset. To quote:
Note
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
Every column is a variable.
Every row is an observation.
Every cell is a single value.
The idea is to spend your time getting the data into a tidy format. This will make everything that follows so much easier, from summaries to plots to analyses. The tidyverse set of packages and functions (of course) focuses on tidy data, and nicely adds a common syntax and set of conventions. You can install the tidyverse with the R command install.packages("tidyverse"), which only needs to be done once. You can add the tidyverse to a conda environment with conda install -c r r-tidyverse.
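To make the tidy data definition quoted above concrete, here is a minimal sketch that reshapes a small messy table into tidy form. The table, its column names (roi, scan_1, scan_2) and its values are made up for illustration and are not part of the course data; the sketch assumes the tidyr and tibble packages are installed.

```r
library(tidyr)

## Hypothetical messy table: the scan number is hidden in the column names,
## so each row holds two observations instead of one
messy <- tibble::tibble(
  roi    = c("region_a", "region_b"),
  scan_1 = c(10, 20),
  scan_2 = c(11, 19)
)

## Tidy version: one row per (roi, scan) observation, one column per variable
tidy <- pivot_longer(
  messy,
  cols      = c(scan_1, scan_2),
  names_to  = "scan",
  values_to = "volume"
)
tidy
```

After reshaping, every column (roi, scan, volume) is a variable and every row is a single measurement, which is the layout that the rest of the tidyverse functions expect.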
library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
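Make note of the conflict for filter, which is the name of both the dplyr function that filters the rows of a data frame and the stats function used in signal processing. The :: operator looks up a function in a specific package, as in dplyr::filter() or stats::filter(). You can also use it if you want to quickly use a function without loading its package.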
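As a small illustration of the :: operator and the filter conflict, the sketch below calls both versions of filter explicitly. The vector x and the data frame df are made-up examples; the code assumes dplyr is installed but deliberately does not load it with library().

```r
## A made-up numeric series, used only for illustration
x <- c(1, 2, 3, 4, 5)

## stats::filter is the signal processing version: here, a 3-point moving average
ma <- stats::filter(x, rep(1 / 3, 3))

## dplyr::filter is the data frame version: keep the rows where x is larger than 3
df  <- data.frame(x = x)
big <- dplyr::filter(df, x > 3)

ma
big
```

Writing the package name explicitly removes any ambiguity about which filter is called, regardless of the order in which packages were loaded.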
Let’s read in the data that we worked with previously. Note that data is already the name of a function in R, so it’s best not to use it as the name of a dataset. Better to just use dat or something like that.
dat = read_csv("https://raw.githubusercontent.com/bcaffo/ds4bme_intro/master/data/kirby127a_3_1_ax_283Labels_M2_corrected_stats.csv", show_col_types = FALSE)
head(dat)
New names:
• `` -> `...1`
...1 | rawid | roi | volume | min | max | mean | std | type | level |
---|---|---|---|---|---|---|---|---|---|
<dbl> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
1 | kirby127a_3_1_ax.img | Telencephalon_L | 531111 | 0 | 374 | 128.3013 | 51.8593 | 1 | 1 |
2 | kirby127a_3_1_ax.img | Telencephalon_R | 543404 | 0 | 300 | 135.0683 | 53.6471 | 1 | 1 |
3 | kirby127a_3_1_ax.img | Diencephalon_L | 9683 | 15 | 295 | 193.5488 | 32.2733 | 1 | 1 |
4 | kirby127a_3_1_ax.img | Diencephalon_R | 9678 | 10 | 335 | 193.7051 | 32.7869 | 1 | 1 |
5 | kirby127a_3_1_ax.img | Mesencephalon | 10268 | 55 | 307 | 230.8583 | 29.2249 | 1 | 1 |
6 | kirby127a_3_1_ax.img | Metencephalon | 159402 | 2 | 299 | 138.5200 | 52.2241 | 1 | 1 |
I don’t need the ...1 or rawid columns, so let’s get rid of those. The pipe operator %>% is really useful here. Think of it as funneling the output of the previous statement into the next one. So, below, the first statement, dat, just returns the dataset itself, which then gets passed to select. Of note, newer versions of R (4.1 and later) have a built-in pipe operator, |>, whereas the one we’re using comes from a package called magrittr.
The negative signs in front of the variables mean to remove them.
dat = dat %>% select(-"...1", -rawid)
dat %>% head
roi | volume | min | max | mean | std | type | level |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
Telencephalon_L | 531111 | 0 | 374 | 128.3013 | 51.8593 | 1 | 1 |
Telencephalon_R | 543404 | 0 | 300 | 135.0683 | 53.6471 | 1 | 1 |
Diencephalon_L | 9683 | 15 | 295 | 193.5488 | 32.2733 | 1 | 1 |
Diencephalon_R | 9678 | 10 | 335 | 193.7051 | 32.7869 | 1 | 1 |
Mesencephalon | 10268 | 55 | 307 | 230.8583 | 29.2249 | 1 | 1 |
Metencephalon | 159402 | 2 | 299 | 138.5200 | 52.2241 | 1 | 1 |
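To see what the pipe is doing, the following sketch compares a nested call, the magrittr pipe and the built-in pipe on a small made-up tibble. The tibble toy and its columns are hypothetical, and the |> line assumes R 4.1 or later.

```r
library(dplyr)

## Hypothetical toy data, not the course dataset
toy <- tibble::tibble(
  id    = 1:3,
  label = c("a", "b", "c"),
  value = c(10, 20, 30)
)

## Three ways of writing the same operation: drop the id and label columns
nested <- select(toy, -id, -label)      # ordinary function call
piped  <- toy %>% select(-id, -label)   # magrittr pipe, loaded with dplyr
native <- toy |> select(-id, -label)    # built-in pipe, needs R >= 4.1

## Check that all three give the same result
identical(nested, piped)
identical(nested, native)
```

The piped versions read left to right in the order the steps happen, which becomes more valuable as more steps are chained together.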
Let’s get the Type 1 Level 1 data.
t1l1 = dat %>% filter(type == 1, level == 1)
t1l1
roi | volume | min | max | mean | std | type | level |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
Telencephalon_L | 531111 | 0 | 374 | 128.3013 | 51.8593 | 1 | 1 |
Telencephalon_R | 543404 | 0 | 300 | 135.0683 | 53.6471 | 1 | 1 |
Diencephalon_L | 9683 | 15 | 295 | 193.5488 | 32.2733 | 1 | 1 |
Diencephalon_R | 9678 | 10 | 335 | 193.7051 | 32.7869 | 1 | 1 |
Mesencephalon | 10268 | 55 | 307 | 230.8583 | 29.2249 | 1 | 1 |
Metencephalon | 159402 | 2 | 299 | 138.5200 | 52.2241 | 1 | 1 |
Myelencephalon | 4973 | 12 | 286 | 199.8497 | 36.6501 | 1 | 1 |
CSF | 109776 | 0 | 258 | 33.0193 | 26.3262 | 1 | 1 |
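In the filter call above, conditions separated by commas must all hold for a row to be kept. A minimal sketch on made-up data (the tibble toy is hypothetical, not the course dataset) shows that this matches combining the conditions with &.

```r
library(dplyr)

## Hypothetical toy data with the same type and level columns
toy <- tibble::tibble(
  roi   = c("A", "B", "C", "D"),
  type  = c(1, 1, 2, 1),
  level = c(1, 2, 1, 1)
)

## Comma-separated conditions are combined with a logical AND
with_comma <- toy %>% filter(type == 1, level == 1)
with_and   <- toy %>% filter(type == 1 & level == 1)

with_comma
identical(with_comma, with_and)
```

Only the rows that satisfy both conditions survive, here the rows for A and D.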
## Set the base plot
g = ggplot(data = t1l1, aes(x = roi, y = volume, fill = roi))
## Add the bar graphs
g = g + geom_col()
## My fonts weren't rendering correctly, so changing to a different one
g = g + theme(text=element_text(family="Consolas"))
## The x axis labels are long and overlap if you don't rotate them;
## hjust = 1 right-aligns the rotated labels so they end at their tick marks
g = g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Show the plot
g