PipeBook.yml: Reimagining Notebooks as Resilient Data Pipelines
June 30, 2022
- The Data Config by Benn Stancil (Medium)
- https://github.com/TheSwanFactory/pipebook (App)
- PipeBook: UX Design Brief (Blog)
- https://github.com/TheSwanFactory/fridaay (Framework)
- Data on Rails: Solving the Data App Imperative (YouTube)
The modern data notebook has its roots in academic tools for mathematical research. Because of that, notebooks are fantastic for open-ended exploration, but an awkward match for production data pipelines. In particular, they don’t:
- Explicitly declare and track dependencies
- Enforce organizational quality and reproducibility standards
- Enable easy testing, validation, and alerting
PipeBooks are a simple but radical re-imagining of notebooks as “tools for iteratively constructing resilient data pipelines.” The key is a novel data format called FRIDAAY that allows us to:
- Express arbitrary data transformations
- As a series of idempotent Data Actions
- Via a single, easy-to-parse YAML file
Using a universal “programming format” instead of multiple incompatible languages enables a whole new data ecosystem consisting of, e.g.:
- Innovative, purpose-built user interfaces
- Meta-pipelines for creating, analyzing, and refactoring other PipeBooks
- Cross-platform optimization and orchestration services
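As a small illustration of the meta-pipeline idea: once a PipeBook has been parsed into a plain mapping, another program can inspect it like any other data. The sketch below is a hypothetical helper, not part of FRIDAAY. It assumes that an action may name its input with a `from` key and that `$$` means "the previous frame", as in the example later in this post.

```python
def frame_dependencies(pipebook: dict) -> dict:
    """Map each action name to the frame it reads from (None for sources)."""
    deps = {}
    previous = None
    for name, action in pipebook.items():
        if action.get("do") == "init":
            continue  # assumed: the init action configures the book, produces no frame
        source = action.get("from")
        if source == "$$":  # assumed shorthand for the previously defined frame
            source = previous
        deps[name] = source
        previous = name
    return deps


# A PipeBook as a YAML parser might return it (structure is illustrative).
book = {
    "fridaay": {"version": 0.1, "do": "init"},
    "test_data": {"do": "sql.load"},
    "recent_pets": {"do": "sql.select", "from": "$$"},
}

print(frame_dependencies(book))
```

A tool built this way never executes the pipeline; it only reads the file, which is what makes analysis and refactoring of other PipeBooks plausible.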
Extensible Data Actions
Each Data Action is a YAML dictionary entry used as a recipe for constructing a new data frame from one or more existing frames (or inline data). Groups of actions are defined in a special Python module known as a Data Action Definition (DAD). Importantly, the actual implementation of a DAD can depend on the underlying platform. Version 0.1 uses Pandas (originally DuckDB), and we plan to support Spark in v0.2 and provide a platform SDK in v0.3. However, the semantics of each PipeBook (though obviously not the performance) are independent of the platform it runs on.
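To make the platform point concrete, here is a minimal sketch in plain Python. It is not the FRIDAAY implementation: the names `select_rows`, `select_columns`, and `PLATFORMS` are hypothetical, and the two toy "platforms" (row-oriented and column-oriented frames) stand in for real backends such as Pandas or Spark. The same select recipe is implemented twice and yields the same logical result on both.

```python
def select_rows(frame, cols, predicate):
    """Row-oriented platform: a frame is a list of dicts."""
    return [{c: row[c] for c in cols} for row in frame if predicate(row)]


def select_columns(frame, cols, predicate):
    """Column-oriented platform: a frame is a dict of equal-length lists."""
    names = list(frame)
    length = len(next(iter(frame.values()), []))
    keep = [
        i for i in range(length)
        if predicate({name: frame[name][i] for name in names})
    ]
    return {c: [frame[c][i] for i in keep] for c in cols}


# Hypothetical registry: one action, two platform-specific implementations.
PLATFORMS = {"rows": select_rows, "columns": select_columns}


def is_older_than_five(row):
    return row["Age"] > 5


rows = [{"Name": "Qhuinn", "Age": 7}, {"Name": "Frolic", "Age": 2}]
columns = {"Name": ["Qhuinn", "Frolic"], "Age": [7, 2]}

by_rows = PLATFORMS["rows"](rows, ["Name"], is_older_than_five)
by_columns = PLATFORMS["columns"](columns, ["Name"], is_older_than_five)
print(by_rows, by_columns)
```

Both functions build a new frame and leave their input untouched, so running the action again gives the same result, which is the sense in which an action can be idempotent.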
PipeBench itself defines only a single action, init, which is used to import the other DADs. For example, most notebooks would use the dad-sql module, which supports common table operations. The first action is named ‘fridaay’ to make the file easy to distinguish from other YAML files, and should specify the version.
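The role of the init action can be sketched as follows. This is an assumption-laden illustration rather than the actual loader: `read_init` and `resolve` are hypothetical names, and the sketch only maps aliases to module names without importing anything.

```python
def read_init(pipebook: dict) -> dict:
    """Check the leading 'fridaay' action and return its alias -> module map."""
    first_name = next(iter(pipebook), None)
    if first_name != "fridaay":
        raise ValueError("the first action must be named 'fridaay'")
    init = pipebook[first_name]
    if init.get("do") != "init":
        raise ValueError("the 'fridaay' action must use 'do: init'")
    if "version" not in init:
        raise ValueError("the 'fridaay' action should specify a version")
    return dict(init.get("imports", {}))


def resolve(do: str, imports: dict) -> tuple:
    """Split 'alias.action' and look the alias up in the imports map."""
    alias, _, action = do.partition(".")
    return imports[alias], action


book = {
    "fridaay": {"version": 0.1, "do": "init", "imports": {"sql": "dad-sql"}},
    "test_data": {"do": "sql.load"},
}

imports = read_init(book)
print(resolve("sql.load", imports))
```

Under these assumptions, an action such as `sql.load` is looked up in whichever module the `sql` alias was bound to at init time.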
The PipeBook below creates a (temporary) table from inline CSV data (with headers, as is the default), then saves only the non-human rows with recent dates to the table ‘recent_pets’.
fridaay:
  version: 0.1
  do: init
  imports:
    sql: dad-sql
  set: # global constants (COMMENT)
    NAME: pipebook_demo1
    SAPIENT: Human
test_data:
  doc: Sample data for test purposes
  do: sql.load
  csv.inline:
    - ['Name','Age','Weight', 'Type', 'Timestamp']
    - ['Ernie', 54, 170.5, 'Human Tech Nerd', date'2020-03-20']
    - ['Qhuinn', 7, 36.3, 'English Cocker Spaniel', date'2022-06-27']
    - ['Frolic', 2, 76.2, 'Chocolate Labrador', date'2022-06-27']
recent_pets:
  do: sql.select
  from: $$ # last frame
  cols:
    Name: STRING Personal Name
    Age: INTEGER.year Age
    Weight: DOUBLE.pound Current Weight
  where.all:
    - ['Type', NOT LIKE, $SAPIENT] # dereference constant
    - ['Timestamp','>', date'2022-01-01']
  save: table
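For readers who think in code, the following plain-Python sketch walks through what this PipeBook appears to express. It is not how dad-sql is implemented. It assumes that the date literals become `datetime.date` values, that `NOT LIKE` means "does not contain the constant", and that `where.all` requires every condition to hold.

```python
from datetime import date

SAPIENT = "Human"  # the global constant from the 'set' block

header = ["Name", "Age", "Weight", "Type", "Timestamp"]
data = [
    ["Ernie", 54, 170.5, "Human Tech Nerd", date(2020, 3, 20)],
    ["Qhuinn", 7, 36.3, "English Cocker Spaniel", date(2022, 6, 27)],
    ["Frolic", 2, 76.2, "Chocolate Labrador", date(2022, 6, 27)],
]

# test_data: load the inline CSV, treating the first row as the header.
test_data = [dict(zip(header, row)) for row in data]

# recent_pets: keep three columns from rows that satisfy every condition.
recent_pets = [
    {col: row[col] for col in ("Name", "Age", "Weight")}
    for row in test_data
    if SAPIENT not in row["Type"]            # assumed meaning of NOT LIKE
    and row["Timestamp"] > date(2022, 1, 1)  # only recent rows
]

for pet in recent_pets:
    print(pet)
```

Under these assumptions only Qhuinn and Frolic remain; Ernie's row fails both conditions.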
We are also working on a simple PipeBench debugger in Toga (originally WxPython) for macOS, where the main view is the YAML tree of Data Actions and inspectors let you view the status and details of each table.