R tidyverse quick example


R has undergone something of a revolution in the past few years. There has been a coordinated effort to develop a collection of code focusing on the major use cases for data science. RStudio, the developer of the well-known R/Python IDE, has been a major leader in this effort. However, there have also been too many additional open source contributors to even try to mention a meaningful subset. The result is a really pleasant and easy-to-learn data science environment. In this course, we’ll focus on this so-called “tidy” data analysis. Let’s run through a quick analysis.

Let’s start with the definition of a tidy dataset. To quote:

Note

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Every column is a variable.

  2. Every row is an observation.

  3. Every cell is a single value.

The idea is to spend your time up front getting the data into a tidy format. This will make everything that follows so much easier, from summaries to plots to analyses. The set of tidyverse packages and functions (of course) focuses on tidy data, but nicely adds a common syntax and set of conventions. You can install the tidyverse with the R command install.packages("tidyverse"), which only needs to be done once. You can add the tidyverse to a conda environment with conda install -c r r-tidyverse.
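To make the three rules concrete, here is a minimal sketch of reshaping a "messy" table into a tidy one with tidyr's pivot_longer. The toy table and its column names (subject, left, right, side, volume) are made up for illustration and are not part of the course data.

```r
library(tidyr)
library(tibble)

# Hypothetical messy table: one row per subject, one column per brain side,
# so the variable "side" is hidden in the column names
wide <- tibble(
  subject = c("s1", "s2"),
  left    = c(10, 12),
  right   = c(11, 13)
)

# Tidy version: every row is one (subject, side) observation,
# every column is one variable
long <- pivot_longer(
  wide,
  cols      = c(left, right),
  names_to  = "side",
  values_to = "volume"
)
long
```

In the tidy version, side is an ordinary column, so it can be used directly for filtering, grouping and plotting.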

library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.3      readr     2.1.4
 forcats   1.0.0      stringr   1.5.0
 ggplot2   3.4.4      tibble    3.2.1
 lubridate 1.9.3      tidyr     1.3.0
 purrr     1.0.2     
── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

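Make note of the conflict on filter: dplyr::filter(), which filters the rows of a data frame, masks stats::filter(), which is used in signal processing. The :: operator looks up a function in a specific package, so stats::filter() still reaches the original. You can also use :: to quickly call a function from an installed package without loading the package.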
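As a small sketch of how :: resolves the conflict, the two filter functions can be called side by side with explicit package prefixes. The vector x and the data frame toy are hypothetical examples.

```r
# Fully qualified calls: no library() call is needed,
# but both packages must be installed
x <- c(1, 2, 3, 4, 5)

# stats::filter(): linear filtering of a series
# (here a centered 3-point moving average)
smoothed <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter(): keep the rows of a data frame that satisfy a condition
toy <- data.frame(value = x)
kept <- dplyr::filter(toy, value > 3)

smoothed
kept
```

The moving average has no value at the two ends of the series, so those entries come back as NA.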

Let’s read in the data that we worked with previously. Note that data is already the name of a function in R, so it’s best not to use it as the name of a dataset. Better to just use dat or something like that.

dat = read_csv("https://raw.githubusercontent.com/bcaffo/ds4bme_intro/master/data/kirby127a_3_1_ax_283Labels_M2_corrected_stats.csv", show_col_types = FALSE)
head(dat)
New names:
 `` -> `...1`
A tibble: 6 × 10
 ...1  rawid                 roi              volume    min    max      mean      std   type  level
<dbl>  <chr>                 <chr>             <dbl>  <dbl>  <dbl>     <dbl>    <dbl>  <dbl>  <dbl>
    1  kirby127a_3_1_ax.img  Telencephalon_L  531111      0    374  128.3013  51.8593      1      1
    2  kirby127a_3_1_ax.img  Telencephalon_R  543404      0    300  135.0683  53.6471      1      1
    3  kirby127a_3_1_ax.img  Diencephalon_L     9683     15    295  193.5488  32.2733      1      1
    4  kirby127a_3_1_ax.img  Diencephalon_R     9678     10    335  193.7051  32.7869      1      1
    5  kirby127a_3_1_ax.img  Mesencephalon     10268     55    307  230.8583  29.2249      1      1
    6  kirby127a_3_1_ax.img  Metencephalon    159402      2    299  138.5200  52.2241      1      1

I don’t need the ...1 or rawid columns, so let’s get rid of those. Note the pipe operator %>% is really useful. Think of it as funneling the output from the previous statement to the next. So, below, the first statement, dat, just returns the dataset itself, which then gets passed to select. Of note, newer versions of R (4.1 and later) have a built-in pipe operator, |>, whereas the one we’re using is part of a package called magrittr.

The negative signs in front of the variables mean to remove them.
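A minimal sketch of what the pipe does, assuming R 4.1 or later for the built-in pipe. The tibble toy and its columns are hypothetical. All three calls below drop the same two columns.

```r
library(dplyr)

# Hypothetical toy table with two columns we want to drop
toy <- tibble(
  id     = 1:3,
  rawid  = c("a", "b", "c"),
  volume = c(10, 20, 30)
)

# magrittr pipe: toy becomes the first argument of select()
with_magrittr <- toy %>% select(-id, -rawid)

# Built-in pipe (requires R >= 4.1): same idea, no extra package needed
with_native <- toy |> select(-id, -rawid)

# Plain function call, for comparison
without_pipe <- select(toy, -id, -rawid)

identical(with_magrittr, with_native)
```

The benefit of either pipe shows up with longer chains, where each step reads left to right instead of being nested inside the next call.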

dat = dat %>% select(-"...1", -rawid)
dat %>% head
A tibble: 6 × 8
roi              volume    min    max      mean      std   type  level
<chr>             <dbl>  <dbl>  <dbl>     <dbl>    <dbl>  <dbl>  <dbl>
Telencephalon_L  531111      0    374  128.3013  51.8593      1      1
Telencephalon_R  543404      0    300  135.0683  53.6471      1      1
Diencephalon_L     9683     15    295  193.5488  32.2733      1      1
Diencephalon_R     9678     10    335  193.7051  32.7869      1      1
Mesencephalon     10268     55    307  230.8583  29.2249      1      1
Metencephalon    159402      2    299  138.5200  52.2241      1      1

Let’s get the Type 1 Level 1 data.

t1l1 = dat %>% filter(type == 1, level == 1)
t1l1
A tibble: 8 × 8
roi              volume    min    max      mean      std   type  level
<chr>             <dbl>  <dbl>  <dbl>     <dbl>    <dbl>  <dbl>  <dbl>
Telencephalon_L  531111      0    374  128.3013  51.8593      1      1
Telencephalon_R  543404      0    300  135.0683  53.6471      1      1
Diencephalon_L     9683     15    295  193.5488  32.2733      1      1
Diencephalon_R     9678     10    335  193.7051  32.7869      1      1
Mesencephalon     10268     55    307  230.8583  29.2249      1      1
Metencephalon    159402      2    299  138.5200  52.2241      1      1
Myelencephalon     4973     12    286  199.8497  36.6501      1      1
CSF              109776      0    258   33.0193  26.3262      1      1

## Set the base plot
g = ggplot(data = t1l1, aes(x = roi, y = volume, fill = roi))
## Add the bar graphs
g = g + geom_col()
## My fonts weren't rendering correctly, so changing to a different one
g = g + theme(text=element_text(family="Consolas"))
## The x axis labels are long and overlap if you don't rotate them
g = g + theme(axis.text.x = element_text(angle = 45))
## Show the plot
g
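
The same kind of plot can also be written as a single chained expression rather than by repeatedly reassigning g. This is only a sketch on a hypothetical toy data frame standing in for t1l1. The hjust = 1 setting is an addition not used above, and the font change is left out because it depends on the fonts installed on your machine.

```r
library(ggplot2)

# Hypothetical toy data standing in for t1l1
toy <- data.frame(
  roi    = c("Region_A", "Region_B", "Region_C"),
  volume = c(100, 250, 175)
)

# Layers and theme tweaks are added with +, just as in the step-by-step version
p <- ggplot(toy, aes(x = roi, y = volume, fill = roi)) +
  geom_col() +
  # hjust = 1 right-aligns the rotated labels so they end under their bars
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Show the plot
p
```

Building the plot step by step, as above, makes it easy to comment on each layer. The chained form is more compact once you know what you want.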