PipeBook.yml: Reimagining Notebooks as Resilient Data Pipelines

June 30, 2022

Overview

The modern data notebook has its roots in academic tools for mathematical research. Because of that, notebooks are fantastic for open-ended exploration, but an awkward match for production data pipelines. In particular, they don’t:

Explicitly declare and track dependencies
Enforce organizational quality and reproducibility standards
Enable easy testing, validation, and alerting

PipeBooks are a simple but radical re-imagining of notebooks as “tools for iteratively constructing resilient data pipelines.” The key is a novel data format called FRIDAAY that allows us to:

Express arbitrary data transformations
As a series of idempotent Data Actions
Via a single, easy-to-parse YAML file

Using a universal “programming format” instead of multiple incompatible languages enables a whole new data ecosystem consisting of, e.g.:

Innovative, purpose-built user interfaces
Meta-pipelines for creating, analyzing, and refactoring other PipeBooks
Cross-platform optimization and orchestration services

Extensible Data Actions

Each Data Action is a YAML dictionary entry used as a recipe for constructing a new data frame from one or more existing frames (or inline data). Groups of actions are defined in special Python modules known as Data Action Definitions (DADs). Importantly, the actual implementation of a DAD can depend on the underlying platform. Version 0.1 uses ~~DuckDB~~ Pandas, but we plan to support Spark in v0.2 and provide a platform SDK in v0.3. The semantics of each PipeBook (though obviously not the performance) are nonetheless independent of the platform it runs on.
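
To make the idea of a DAD more concrete, here is a minimal sketch of what such a module might look like. Everything in it is an assumption for illustration: the `DataActionDefinitions` class, the `action` decorator, and the use of plain lists of dicts in place of a Pandas data frame are hypothetical and are not the actual PipeBook API.

```python
from typing import Callable, Dict, List

# Stand-in for a platform data frame (v0.1 would use a Pandas DataFrame).
Frame = List[dict]
# An action builds a new frame from its YAML entry plus the existing frames.
Action = Callable[[dict, Dict[str, Frame]], Frame]


class DataActionDefinitions:
    """Hypothetical registry mapping action names to implementations."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.actions: Dict[str, Action] = {}

    def action(self, action_name: str) -> Callable[[Action], Action]:
        """Decorator that registers a function under an action name."""
        def register(fn: Action) -> Action:
            self.actions[action_name] = fn
            return fn
        return register


dad_sql = DataActionDefinitions("dad-sql")


@dad_sql.action("load")
def load(entry: dict, frames: Dict[str, Frame]) -> Frame:
    """Build a frame from inline CSV rows; the first row is the header."""
    header, *rows = entry["csv.inline"]
    return [dict(zip(header, row)) for row in rows]


# Running the same action twice on the same entry yields the same frame,
# which is the property we mean by "idempotent".
entry = {"csv.inline": [["Name", "Age"], ["Qhuinn", 7], ["Frolic", 2]]}
first = dad_sql.actions["load"](entry, {})
second = dad_sql.actions["load"](entry, {})
```

Because the registry only stores callables, a Spark-backed DAD could in principle register different functions under the same action names without any change to the YAML.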

PipeBench itself defines only a single action, init, which is used to import the other DADs. For example, most notebooks would use the dad-sql module, which supports common table operations. The first action is named ‘fridaay’ to make PipeBooks easy to distinguish from other YAML files, and should specify the version.
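
As a rough illustration of how init could bind import aliases so that a later `do: sql.load` finds its implementation, consider the sketch below. The `INSTALLED_DADS` table, `init`, and `resolve` are hypothetical names; a real runner would presumably import actual Python modules rather than look them up in a dictionary.

```python
from typing import Callable, Dict

Namespace = Dict[str, Callable]

# Stand-in for DAD modules available on this machine (assumption).
INSTALLED_DADS: Dict[str, Namespace] = {
    "dad-sql": {
        # Toy implementation: return the inline rows without the header.
        "load": lambda entry: entry["csv.inline"][1:],
    },
}


def init(entry: dict) -> Dict[str, Namespace]:
    """Bind each alias listed under `imports` to its DAD."""
    namespaces: Dict[str, Namespace] = {}
    for alias, module_name in entry.get("imports", {}).items():
        if module_name not in INSTALLED_DADS:
            raise ImportError(f"unknown DAD: {module_name}")
        namespaces[alias] = INSTALLED_DADS[module_name]
    return namespaces


def resolve(do: str, namespaces: Dict[str, Namespace]) -> Callable:
    """Turn an 'alias.action' string into the callable that implements it."""
    alias, _, action = do.partition(".")
    try:
        return namespaces[alias][action]
    except KeyError:
        raise LookupError(f"no such action: {do}") from None


# The 'fridaay' entry as it would look after YAML parsing.
fridaay = {"version": 0.1, "do": "init", "imports": {"sql": "dad-sql"}}
namespaces = init(fridaay)
load_action = resolve("sql.load", namespaces)
rows = load_action({"csv.inline": [["Name"], ["Qhuinn"], ["Frolic"]]})
```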

PipeBook Example

The PipeBook below creates a (temporary) table from inline CSV data (with headers, as is the default), then saves only the non-human rows with recent dates to the table ‘recent_pets’.

fridaay:
  version: 0.1
  do: init
  imports:
    sql: dad-sql
  set: # global constants (COMMENT)
    NAME: pipebook_demo1
    SAPIENT: Human
    
test_data:
  doc: Sample data for test purposes
  do: sql.load
  csv.inline:
  - ['Name','Age','Weight', 'Type', 'Timestamp']
  - ['Ernie', 54, 170.5, 'Human Tech Nerd', date'2020-03-20']
  - ['Qhuinn', 7, 36.3, 'English Cocker Spaniel', date'2022-06-27']
  - ['Frolic', 2, 76.2, 'Chocolate Labrador', date'2022-06-27']

recent_pets:
  do: sql.select
  from: $$ # last frame
  cols:
    Name: STRING Personal Name
    Age: INTEGER.year Age
    Weight: DOUBLE.pound Current Weight
  where.all:
    - ['Type', NOT LIKE, $SAPIENT] # dereference constant
    - ['Timestamp','>', date'2022-01-01']
  save: table
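
To show how the recent_pets action could be evaluated, here is a sketch in plain Python that works on the example as it would look after YAML parsing. The helpers `scalar`, `like`, and `select` are hypothetical, frames are lists of dicts rather than Pandas objects, and the treatment of `$NAME` references and `date'...'` literals is our reading of the example, not a specification. Note the assumption that NOT LIKE follows standard SQL semantics, where a pattern without wildcards matches only the exact string.

```python
import re
from datetime import date

CONSTANTS = {"NAME": "pipebook_demo1", "SAPIENT": "Human"}


def scalar(value, constants):
    """Interpret $CONSTANT references and date'YYYY-MM-DD' literals."""
    if isinstance(value, str):
        if value.startswith("$") and value[1:] in constants:
            return constants[value[1:]]
        match = re.fullmatch(r"date'(\d{4}-\d{2}-\d{2})'", value)
        if match:
            return date.fromisoformat(match.group(1))
    return value


def like(text, pattern):
    """SQL LIKE: % matches any run of characters, _ matches exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


OPERATORS = {
    ">": lambda a, b: a > b,
    "NOT LIKE": lambda a, b: not like(a, b),
}


def select(entry, source, constants):
    """Keep rows passing every where.all clause, projected onto cols."""
    clauses = [
        (col, op, scalar(val, constants)) for col, op, val in entry["where.all"]
    ]
    return [
        {col: row[col] for col in entry["cols"]}
        for row in source
        if all(OPERATORS[op](row[col], val) for col, op, val in clauses)
    ]


inline = [
    ["Name", "Age", "Weight", "Type", "Timestamp"],
    ["Ernie", 54, 170.5, "Human Tech Nerd", "date'2020-03-20'"],
    ["Qhuinn", 7, 36.3, "English Cocker Spaniel", "date'2022-06-27'"],
    ["Frolic", 2, 76.2, "Chocolate Labrador", "date'2022-06-27'"],
]
header, *rows = inline
test_data = [
    {key: scalar(cell, CONSTANTS) for key, cell in zip(header, row)}
    for row in rows
]

recent_pets_entry = {
    "cols": {
        "Name": "STRING Personal Name",
        "Age": "INTEGER.year Age",
        "Weight": "DOUBLE.pound Current Weight",
    },
    "where.all": [
        ["Type", "NOT LIKE", "$SAPIENT"],
        ["Timestamp", ">", "date'2022-01-01'"],
    ],
}
# `from: $$` means "the last frame", which here is test_data.
recent_pets = select(recent_pets_entry, test_data, CONSTANTS)
```

Under these assumptions, the Ernie row passes the NOT LIKE clause (since 'Human Tech Nerd' is not exactly 'Human') and is excluded only by the date clause. A pattern such as 'Human%' would be needed for the type filter to exclude it on its own.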

We are also working on a simple PipeBench debugger in ~~WxPython~~ Toga for macOS, where the main view is the YAML tree of Data Actions and inspectors let you view the status and details of each table.

