Scalable and reproducible workflows with Pachyderm

Data scientists must manage analyses that involve multiple stages, large datasets and many different tools, all while keeping their results reproducible. Among the many tools available for parallel computation, Pachyderm is an open-source workflow engine and distributed data processing tool that fulfils these needs by adding a data pipelining and data versioning layer on top of projects from the container ecosystem. In this workshop you will learn how to:
  • create a simple local Kubernetes infrastructure,
  • install and interact with Pachyderm and
  • implement a scalable and reproducible workflow using containers (a rough pipeline sketch follows below).
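
Pachyderm pipelines are declared through a small specification that names a containerized transform and the versioned input repository it reads from; committing new data to that repository re-runs the pipeline, and the output is itself versioned. As a rough illustration only, not part of the workshop materials, the sketch below writes such a specification from Python. The repository name "texts", the pipeline name "wordcount", the container image and the embedded command are placeholder assumptions, and details may vary between Pachyderm releases.

# Sketch of a Pachyderm pipeline specification, written out as JSON from
# Python. All names (repo, pipeline, image, command) are placeholders
# invented for illustration; the workshop exercises define their own.
import json

pipeline_spec = {
    # Name under which Pachyderm registers the pipeline.
    "pipeline": {"name": "wordcount"},
    # The containerized step: Pachyderm runs this image with the input repo
    # mounted at /pfs/<repo> and the output directory at /pfs/out.
    "transform": {
        "image": "python:3.11-slim",
        "cmd": ["python3", "-c",
                "import pathlib, collections;"
                "counts = collections.Counter();"
                "[counts.update(p.read_text().split())"
                " for p in pathlib.Path('/pfs/texts').rglob('*') if p.is_file()];"
                "pathlib.Path('/pfs/out/counts.txt').write_text("
                "'\\n'.join(f'{w} {n}' for w, n in counts.most_common()))"],
    },
    # Versioned input: each commit to the 'texts' repo re-triggers the
    # pipeline. Glob "/" treats the whole repo as a single datum; a pattern
    # such as "/*" would instead split top-level files into parallel datums.
    "input": {"pfs": {"repo": "texts", "glob": "/"}},
}

# Write the spec to disk; it could then be submitted to a running Pachyderm
# deployment with: pachctl create pipeline -f wordcount.json
with open("wordcount.json", "w") as fh:
    json.dump(pipeline_spec, fh, indent=2)

Because both the specification and the input data are versioned, re-running the pipeline on the same commits yields the same output, which is what makes the resulting workflow reproducible as well as scalable.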

Instructions: https://github.com/jonandernovella/gridka-pachyderm

Relevant paper: https://doi.org/10.1093/bioinformatics/bty699