Scalable and reproducible workflows with Pachyderm
Data scientists must manage analyses that span multiple stages, large datasets and a great number of tools, all while keeping results reproducible. Among the variety of tools available for parallel computation, Pachyderm is an open-source workflow engine and distributed data processing tool that fulfils these needs by adding a data pipelining and data versioning layer on top of projects from the container ecosystem. In this workshop you will learn how to:
- create a simple local Kubernetes infrastructure,
- install and interact with Pachyderm and
- implement a scalable and reproducible workflow using containers.
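The containerized workflow in the last step is typically declared as a pipeline specification that Pachyderm turns into jobs on Kubernetes. A minimal sketch of such a spec is shown below (the repo, image and script names are illustrative placeholders, not taken from the workshop material):

```json
{
  "pipeline": { "name": "edges" },
  "input": {
    "pfs": { "repo": "images", "glob": "/*" }
  },
  "transform": {
    "image": "example/edge-detector:1.0",
    "cmd": ["python3", "/edges.py"]
  }
}
```

Submitted with `pachctl create pipeline -f edges.json`, the spec tells Pachyderm to run the given container over each datum matched by the glob pattern, and a new job is triggered automatically whenever data is committed to the input repo.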
Instructions: https://github.com/jonandernovella/gridka-pachyderm
Relevant paper: https://doi.org/10.1093/bioinformatics/bty699