PipeBook.yml: Reimagining Notebooks as Resilient Data Pipelines
June 30, 2022
See Also
- The Data Config by Benn Stancil (Medium)
- https://github.com/TheSwanFactory/pipebook (App)
- PipeBook: UX Design Brief (Blog)
- https://github.com/TheSwanFactory/fridaay (Framework)
- Data on Rails: Solving the Data App Imperative (YouTube)
Overview
The modern data notebook has its roots in academic tools for mathematical research. As a result, notebooks are fantastic for open-ended exploration but an awkward match for production data pipelines. In particular, they don’t:
- Explicitly declare and track dependencies
- Enforce organizational quality and reproducibility standards
- Enable easy testing, validation, and alerting
PipeBooks are a simple but radical re-imagining of notebooks as “tools for iteratively constructing resilient data pipelines.” The key is a novel data format called FRIDAAY that allows us to:
- Express arbitrary data transformations
- As a series of idempotent Data Actions
- Via a single, easy-to-parse YAML file
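To make "a series of idempotent Data Actions" concrete, here is a minimal Python sketch. It is not FRIDAAY's actual schema or API: the `do` key, the action names (`filter`, `add_constant`), and the `run_pipeline` helper are all hypothetical, chosen only for illustration. The `PIPELINE` list stands in for what a parsed YAML file might look like.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]


# Hypothetical actions (not FRIDAAY's real vocabulary).
# Each returns a new table and never mutates its input.
def filter_rows(table: Table, column: str, equals: object) -> Table:
    """Keep only the rows whose `column` equals the given value."""
    return [row for row in table if row[column] == equals]


def add_constant(table: Table, column: str, value: object) -> Table:
    """Set `column` to a fixed value on every row."""
    return [{**row, column: value} for row in table]


# Registry mapping the (assumed) action names to their implementations.
ACTIONS: Dict[str, Callable[..., Table]] = {
    "filter": filter_rows,
    "add_constant": add_constant,
}

# Stand-in for the parsed contents of a single YAML file:
# an ordered list of steps, each naming an action and its parameters.
PIPELINE = [
    {"do": "filter", "column": "status", "equals": "active"},
    {"do": "add_constant", "column": "source", "value": "crm"},
]


def run_pipeline(steps: List[dict], table: Table) -> Table:
    """Apply each declared step in order and return the final table."""
    for step in steps:
        params = {key: val for key, val in step.items() if key != "do"}
        table = ACTIONS[step["do"]](table, **params)
    return table


rows = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "inactive"},
]

once = run_pipeline(PIPELINE, rows)
twice = run_pipeline(PIPELINE, once)

# Idempotence: re-running the pipeline on its own output changes nothing.
print(once == twice)
```

Because every step is plain data, the whole pipeline can be inspected, diffed, and validated before anything runs, which is hard to do with imperative notebook cells.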