5  Pipelines

Pipelining software encodes the tasks needed to build a project as a directed acyclic graph. There are many pipeline and workflow tools out there, so you probably want to pick the one most commonly used in your area. For example, snakemake seems to be very common in computational biology, while nipype is used a lot in neuroimaging.

We’ll cover make here, which is less commonly used for data science but is ubiquitous in software development. A makefile simply says what tasks need to be done to construct a project; make then runs only those tasks whose outputs are out of date. Imagine something like the following: I have R code that creates a figure, and that figure is required by a LaTeX file.

project.pdf : rplots.pdf project.tex
        pdflatex project.tex

rplots.pdf : gen_rplots.R
        R --no-save < gen_rplots.R

clean :
        rm rplots.pdf
        rm project.pdf
        rm project.log

Typing make at the command line will build project.pdf. I can type make clean to remove my build files, or make rplots.pdf to build just that file. Make uses timestamps to rebuild only what is out of date: if I change only project.tex, it won’t regenerate rplots.pdf. (One gotcha: the command lines inside a makefile rule must be indented with a tab character, not spaces.)
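The timestamp logic make applies to each rule can be sketched in plain shell using the `-nt` ("newer than") file test. This is an illustrative simplification, not what make literally executes; real make also handles missing targets, chained dependencies, and more:

```shell
# Sketch of make's staleness check for the rplots.pdf rule.
# A target is rebuilt when any prerequisite is newer than it.
tmp=$(mktemp -d)
cd "$tmp"

touch gen_rplots.R      # prerequisite
sleep 1
touch rplots.pdf        # target, created after the prerequisite

if [ gen_rplots.R -nt rplots.pdf ]; then
  echo "rebuild rplots.pdf"
else
  echo "rplots.pdf is up to date"
fi
# prints: rplots.pdf is up to date
```

Editing gen_rplots.R (so its timestamp becomes newer than rplots.pdf) would flip the test and trigger a rebuild, which is exactly the behavior described above.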

Why use something like make when we have rmarkdown, quarto, and so on? The main reason is that make is completely general: any command can serve as a build step, and any type of file can be an output. This generality is why build utilities like make are so widely used for pipelining.
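To illustrate that generality, here is a hypothetical pair of rules mixing a shell script and an R-based report in one makefile. The file names download.sh and report.Rmd are made up for illustration; the point is that each recipe can be any command at all:

```
data.csv : download.sh
        bash download.sh > data.csv

report.html : report.Rmd data.csv
        Rscript -e 'rmarkdown::render("report.Rmd")'
```

Typing make report.html would fetch the data (if download.sh has changed) and then render the report, each step running only when its prerequisites are newer than its target.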