PipeBook.yml: Reimagining Notebooks as Resilient Data Pipelines
June 30, 2022
See Also
- The Data Config by Benn Stancil (Medium)
- https://github.com/TheSwanFactory/pipebook (App)
- PipeBook: UX Design Brief (Blog)
- https://github.com/TheSwanFactory/fridaay (Framework)
- Data on Rails: Solving the Data App Imperative (YouTube)
Overview
The modern data notebook has its roots in academic tools for mathematical research. Because of that, notebooks are fantastic for open-ended exploration, but an awkward match for production data pipelines. In particular, they don’t:
- Explicitly declare and track dependencies
- Enforce organizational quality and reproducibility standards
- Enable easy testing, validation, and alerting
PipeBooks are a simple but radical re-imagining of notebooks as “tools for iteratively constructing resilient data pipelines.” The key is a novel data format called FRIDAAY that allows us to:
- Express arbitrary data transformations
- As a series of idempotent Data Actions
- Via a single, easy-to-parse YAML file
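To make "idempotent Data Actions" concrete, here is a minimal Python sketch. It is not FRIDAAY's implementation: the `run_pipeline` helper, the list-of-dicts frame type, and the two toy actions are all assumptions made for illustration. Each action reads existing frames and writes exactly one frame under its own name, so re-running the pipeline leaves the result unchanged.

```python
from typing import Callable, Dict, List, Optional

# Assumption: a "frame" is modeled as a plain list of row dicts.
Frame = List[dict]
Action = Callable[[Dict[str, Frame]], Frame]


def run_pipeline(actions: Dict[str, Action],
                 frames: Optional[Dict[str, Frame]] = None) -> Dict[str, Frame]:
    """Run each named action in order and store its result under its name."""
    frames = dict(frames or {})  # copy, so the caller's frames are not mutated
    for name, action in actions.items():
        # An action only reads earlier frames and returns a new one.
        frames[name] = action(frames)
    return frames


# Two hypothetical actions: one supplies inline data, one derives from it.
actions = {
    "raw": lambda f: [{"n": 1}, {"n": 2}, {"n": 3}],
    "evens": lambda f: [row for row in f["raw"] if row["n"] % 2 == 0],
}

once = run_pipeline(actions)
twice = run_pipeline(actions, once)  # re-run on top of the earlier results
```

Because every action overwrites only its own named output, `once` and `twice` hold the same frames, which is the property that makes a pipeline safe to retry.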
Using a universal “programming format” instead of multiple incompatible languages enables a whole new data ecosystem consisting of, e.g.:
- Innovative, purpose-built user interfaces
- Meta-pipelines for creating, analyzing, and refactoring other PipeBooks
- Cross-platform optimization and orchestration services
Extensible Data Actions
Each Data Action is a YAML dictionary entry used as a recipe for constructing a new data frame from one or more existing frames (or inline data). Groups of actions are defined in a special Python module known as Data Action Definitions (DAD). Importantly, the actual implementation of a DAD can depend on the underlying platform. Version 0.1 uses DuckDB Pandas, but we plan to support Spark in v0.2 and provide a platform SDK in v0.3. The semantics of each PipeBook (though obviously not the performance) are independent of which platform it is running on.
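The idea that one action can have several platform-specific implementations with identical semantics can be sketched as follows. This is an illustration only: the `DAD` registry, the `select_younger` action, and the two backends (plain Python and the standard-library `sqlite3` module) are hypothetical stand-ins, not the actual DAD module layout or the platforms named above.

```python
import sqlite3

ROWS = [("Ernie", 54), ("Qhuinn", 7), ("Frolic", 2)]


def select_python(rows, max_age):
    """In-memory backend: filter the rows with a list comprehension."""
    return [name for name, age in rows if age < max_age]


def select_sqlite(rows, max_age):
    """SQL backend: the same action, delegated to an in-memory SQLite table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE pets (name TEXT, age INTEGER)")
    con.executemany("INSERT INTO pets VALUES (?, ?)", rows)
    cur = con.execute(
        "SELECT name FROM pets WHERE age < ? ORDER BY rowid", (max_age,)
    )
    result = [row[0] for row in cur.fetchall()]
    con.close()
    return result


# Hypothetical registry: one action name, one implementation per platform.
DAD = {"select_younger": {"python": select_python, "sqlite": select_sqlite}}


def run(action, platform, *args):
    """Look up the platform-specific implementation and call it."""
    return DAD[action][platform](*args)


in_memory = run("select_younger", "python", ROWS, 10)
via_sql = run("select_younger", "sqlite", ROWS, 10)
```

Both calls return the same names in the same order. Only the execution strategy differs, which is the separation between semantics and performance described above.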
PipeBench itself defines only a single action, init, which is used to import the other DADs. For example, most notebooks would use the dad-sql module, which supports common table operations. The first action is named ‘fridaay’ to make it easy to distinguish a PipeBook from other YAML files, and it should specify the version.
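As a rough sketch of how a tool might read that first action, the Python below validates the header of an already-parsed PipeBook and resolves a `do:` value such as `sql.load` against the import aliases. The function names `read_header` and `resolve`, the error messages, and the `"builtin"` label for un-prefixed actions are all assumptions, not part of FRIDAAY.

```python
def read_header(book: dict) -> dict:
    """Check that the first action is 'fridaay' and return version and imports."""
    first_name = next(iter(book))  # dicts preserve insertion order
    if first_name != "fridaay":
        raise ValueError("first action must be named 'fridaay'")
    header = book[first_name]
    if header.get("do") != "init":
        raise ValueError("the 'fridaay' action must use 'do: init'")
    if "version" not in header:
        raise ValueError("the 'fridaay' action must specify a version")
    return {
        "version": header["version"],
        "imports": dict(header.get("imports", {})),
    }


def resolve(do: str, imports: dict) -> tuple:
    """Split 'alias.action' into (module name, action name)."""
    alias, _, action = do.partition(".")
    if not action:
        # No prefix, e.g. 'init'. Assumption: label such actions 'builtin'.
        return ("builtin", alias)
    return (imports[alias], action)


# A PipeBook as it might look after YAML parsing (only the header is shown).
book = {
    "fridaay": {"version": 0.1, "do": "init", "imports": {"sql": "dad-sql"}},
}

header = read_header(book)
target = resolve("sql.load", header["imports"])
```

Here `target` is the pair `("dad-sql", "load")`, so the alias declared under `imports` decides which module supplies each action.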
PipeBook Example
The PipeBook below creates a (temporary) table from inline CSV data (with headers, as is the default), then selects only the non-human rows with recent dates and saves them to the table ‘recent_pets’.
fridaay:
  version: 0.1
  do: init
  imports:
    sql: dad-sql
  set: # global constants (COMMENT)
    NAME: pipebook_demo1
    SAPIENT: Human
test_data:
  doc: Sample data for test purposes
  do: sql.load
  csv.inline:
    - ['Name','Age','Weight', 'Type', 'Timestamp']
    - ['Ernie', 54, 170.5, 'Human Tech Nerd', date'2020-03-20']
    - ['Qhuinn', 7, 36.3, 'English Cocker Spaniel', date'2022-06-27']
    - ['Frolic', 2, 76.2, 'Chocolate Labrador', date'2022-06-27']
recent_pets:
  do: sql.select
  from: $$ # last frame
  cols:
    Name: STRING Personal Name
    Age: INTEGER.year Age
    Weight: DOUBLE.pound Current Weight
  where.all:
    - ['Type', NOT LIKE, $SAPIENT] # dereference constant
    - ['Timestamp','>', date'2022-01-01']
  save: table
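To show what the example above is meant to compute, here is a small standard-library Python sketch of the two actions. It is not the dad-sql implementation. The `load` and `select` helpers and the `OPS` table are hypothetical, the `date'…'` literals are written as `datetime.date` values, and `NOT LIKE` is assumed to mean "does not contain the substring", so that 'Human Tech Nerd' counts as Human (standard SQL `LIKE` would need wildcards for that).

```python
from datetime import date

SAPIENT = "Human"  # the global constant from the 'set' block

# Inline data from the example; the first row is the header.
inline = [
    ["Name", "Age", "Weight", "Type", "Timestamp"],
    ["Ernie", 54, 170.5, "Human Tech Nerd", date(2020, 3, 20)],
    ["Qhuinn", 7, 36.3, "English Cocker Spaniel", date(2022, 6, 27)],
    ["Frolic", 2, 76.2, "Chocolate Labrador", date(2022, 6, 27)],
]


def load(rows):
    """Turn header-plus-rows data into a list of row dicts."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Assumed operator semantics; NOT LIKE is a plain substring test here.
OPS = {
    ">": lambda value, arg: value > arg,
    "NOT LIKE": lambda value, arg: arg not in value,
}


def select(frame, cols, where_all):
    """Keep rows that satisfy every condition, then project the columns."""
    kept = [
        row for row in frame
        if all(OPS[op](row[col], arg) for col, op, arg in where_all)
    ]
    return [{col: row[col] for col in cols} for row in kept]


test_data = load(inline)
recent_pets = select(
    test_data,
    cols=["Name", "Age", "Weight"],
    where_all=[
        ("Type", "NOT LIKE", SAPIENT),
        ("Timestamp", ">", date(2022, 1, 1)),
    ],
)
```

Under these assumptions `recent_pets` contains the rows for Qhuinn and Frolic, each with only the Name, Age, and Weight columns.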
We are also working on a simple PipeBench debugger in WxPython Toga for macOS where the main view is the YAML tree of Data Actions, but inspectors allow you to view the status and details of each table.