PipeBook.yml: Reimagining Notebooks as Resilient Data Pipelines

June 30, 2022

Overview

The modern data notebook has its roots in academic tools for mathematical research. Because of that, notebooks are fantastic for open-ended exploration, but an awkward match for production data pipelines. In particular, they don’t:

  • Explicitly declare and track dependencies
  • Enforce organizational quality and reproducibility standards
  • Enable easy testing, validation, and alerting

PipeBooks are a simple but radical re-imagining of notebooks as “tools for iteratively constructing resilient data pipelines.” The key is a novel data format called FRIDAAY that allows us to:

  • Express arbitrary data transformations
  • As a series of idempotent Data Actions
  • Via a single, easy-to-parse YAML file

Using a universal “programming format” instead of multiple incompatible languages enables a whole new data ecosystem consisting of, e.g.:

  • Innovative, purpose-built user interfaces
  • Meta-pipelines for creating, analyzing, and refactoring other PipeBooks
  • Cross-platform optimization and orchestration services

Extensible Data Actions

Each Data Action is a YAML dictionary entry used as a recipe for constructing a new data frame from one or more existing frames (or inline data). Groups of actions are defined in a special Python module known as a Data Action Definition (DAD). Importantly, the actual implementation of a DAD can depend on the underlying platform. Version 0.1 uses DuckDB and Pandas, but we plan to support Spark in v0.2 and provide a platform SDK in v0.3; however, the semantics of each PipeBook (though obviously not the performance) are independent of which platform it runs on.
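
To make this concrete, here is a minimal sketch of what one entry point in a Pandas-backed DAD might look like. The function name, the frames registry, and the action-spec keys are illustrative assumptions, not the actual PipeBench API.

import pandas as pd

def sql_select(frames: dict, action: dict) -> pd.DataFrame:
    """Hypothetical 'sql.select' handler: derive a new frame from an existing one."""
    source = frames[action["from"]]                  # look up the input frame by name
    cols = list(action.get("cols", source.columns))  # keep only the declared columns
    return source[cols].copy()                       # return a fresh, idempotent result

Swapping the Pandas body for a Spark or DuckDB implementation would change the performance of the action, not its meaning.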

PipeBench itself defines only a single action, init, which is used to import the other DADs. For example, most notebooks would use the dad-sql module, which supports common table operations. The first action is named ‘fridaay’ to make it easy to distinguish PipeBooks from other YAML files, and it should specify the version.
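
Because a PipeBook is a single, easy-to-parse YAML file, a runner can simply walk the top-level entries in order and dispatch each do key to the matching DAD handler. The sketch below is an assumption about how such a runner could work, not the actual PipeBench implementation; run_pipebook and the dads registry are hypothetical names.

import yaml

def run_pipebook(path: str, dads: dict) -> dict:
    """Walk a PipeBook's actions in order, dispatching each 'do' verb to a DAD handler."""
    with open(path) as f:
        book = yaml.safe_load(f)
    frames = {}                                   # action name -> frame produced so far
    for name, action in book.items():
        if action["do"] == "init":                # 'init' only declares imports and constants
            continue
        module, _, verb = action["do"].partition(".")
        handler = dads[module][verb]              # e.g. dads["sql"]["select"]
        frames[name] = handler(frames, action)    # each action yields exactly one named frame
    return frames

Here dads would map the ‘sql’ prefix declared under imports to the handlers exported by dad-sql.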

PipeBook Example

The PipeBook below creates a (temporary) table from inline CSV data (with a header row, as is the default), then selects only the non-human rows with recent dates and saves the result as the table ‘recent_pets’.

fridaay:
  version: 0.1
  do: init
  imports:
    sql: dad-sql
  set: # global constants
    NAME: pipebook_demo1
    SAPIENT: Human

test_data:
  doc: Sample data for test purposes
  do: sql.load
  csv.inline:
  - ['Name', 'Age', 'Weight', 'Type', 'Timestamp']
  - ['Ernie', 54, 170.5, 'Human Tech Nerd', date'2020-03-20']
  - ['Qhuinn', 7, 36.3, 'English Cocker Spaniel', date'2022-06-27']
  - ['Frolic', 2, 76.2, 'Chocolate Labrador', date'2022-06-27']

recent_pets:
  do: sql.select
  from: $$ # last frame
  cols:
    Name: STRING Personal Name
    Age: INTEGER.year Age
    Weight: DOUBLE.pound Current Weight
  where.all:
  - ['Type', NOT LIKE, $SAPIENT] # dereference constant
  - ['Timestamp', '>', date'2022-01-01']
  save: table
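
Read as plain Pandas, the pipeline above corresponds roughly to the snippet below. Treating NOT LIKE as a substring match and the date comparison shown are assumptions about how dad-sql translates the where.all clauses, not documented semantics.

import pandas as pd

# 'test_data' mirrors the inline CSV; column types follow the sample values.
test_data = pd.DataFrame(
    [
        ["Ernie", 54, 170.5, "Human Tech Nerd", "2020-03-20"],
        ["Qhuinn", 7, 36.3, "English Cocker Spaniel", "2022-06-27"],
        ["Frolic", 2, 76.2, "Chocolate Labrador", "2022-06-27"],
    ],
    columns=["Name", "Age", "Weight", "Type", "Timestamp"],
)
test_data["Timestamp"] = pd.to_datetime(test_data["Timestamp"])

# 'recent_pets': keep non-human rows with recent dates, project the declared columns.
recent_pets = test_data[
    ~test_data["Type"].str.contains("Human")                   # Type NOT LIKE $SAPIENT
    & (test_data["Timestamp"] > pd.Timestamp("2022-01-01"))    # Timestamp > 2022-01-01
][["Name", "Age", "Weight"]]
print(recent_pets)

Only Qhuinn and Frolic survive the filter; Ernie is excluded both as a Human and by his 2020 timestamp.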

We are also working on a simple PipeBench debugger in Toga for macOS, where the main view is the YAML tree of Data Actions and inspectors let you view the status and details of each table.

