PipeBook.yml: Reimagining Notebooks as Resilient Data Pipelines

June 30, 2022

Overview

The modern data notebook has its roots in academic tools for mathematical research. Because of that, notebooks are fantastic for open-ended exploration, but an awkward match for production data pipelines. In particular, they don’t:

  • Explicitly declare and track dependencies
  • Enforce organizational quality and reproducibility standards
  • Enable easy testing, validation, and alerting

PipeBooks are a simple but radical re-imagining of notebooks as “tools for iteratively constructing resilient data pipelines.” The key is a novel data format called FRIDAAY that allows us to:

  • Express arbitrary data transformations
  • As a series of idempotent Data Actions
  • Via a single, easy-to-parse YAML file
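
To make "a series of idempotent Data Actions" concrete, here is a minimal Python sketch. It does not show the FRIDAAY schema, which this excerpt doesn't spell out. The action names (`drop_nulls`, `add_constant`), the `pipeline` list (standing in for what a parsed YAML file might look like), and the `run` helper are all hypothetical, made up purely for illustration. The only point is the property itself: running the pipeline on its own output changes nothing.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]

# Hypothetical actions; these names are NOT taken from FRIDAAY.
def drop_nulls(table: Table, column: str) -> Table:
    """Keep only rows where `column` is present and not None."""
    return [row for row in table if row.get(column) is not None]

def add_constant(table: Table, column: str, value: Any) -> Table:
    """Set `column` to `value` on every row (overwrites, so re-running is safe)."""
    return [{**row, column: value} for row in table]

ACTIONS: Dict[str, Callable[..., Table]] = {
    "drop_nulls": drop_nulls,
    "add_constant": add_constant,
}

# Stand-in for a parsed YAML document: an ordered list of steps (assumed shape).
pipeline = [
    {"action": "drop_nulls", "column": "price"},
    {"action": "add_constant", "column": "currency", "value": "USD"},
]

def run(steps: List[Dict[str, Any]], table: Table) -> Table:
    """Apply each step in order, returning a new table each time."""
    for step in steps:
        params = {key: val for key, val in step.items() if key != "action"}
        table = ACTIONS[step["action"]](table, **params)
    return table

rows = [{"price": 10}, {"price": None}]
once = run(pipeline, rows)
twice = run(pipeline, once)
assert once == twice  # idempotent: a second pass is a no-op
```

Each action returns a new table rather than mutating its input, which is one simple way to keep re-runs safe; a real implementation might make very different choices.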