Beyond (Data) Contracts: A Response to Benn Stancil

September 23, 2022

This essay by Benn Stancil provoked me so deeply that my intended “comment” evolved into a full-fledged blog post:

Fine, let’s talk about data contracts

Benn’s “rant” feels profound on so many levels, especially if I can assume he’s captured the zeitgeist of our industry as accurately as he usually does.

My first observation is that he seems to (wisely!) invert Postel’s Law for data: be strict in what you accept, and generous in what you emit. The profound truth here is that we cannot control other people. We can only fail honestly and gracefully if we are not getting what we need to succeed.

I can’t help but wonder how much of the energy around “data contracts” is the desire to avoid facing exactly that reality.

Next, the corollary to this is something I literally wrote last night in an internal planning document: “transparency is more important than compliance”. The context is that I don’t want employees worried about “appearing” to reach nominal goals. I want them to be ruthlessly honest with us about the true risks to delivering genuine impact.

Third, the profound implication of this is that we must shift power from centralized hierarchies to decentralized networks. We have to stop chasing Xanadu, the mythical demo of reliable hyperlinks, and embrace the chaotic generativity of the World Wide Web. That is the only kind of system that ever truly scales.

Finally, Benn is right that it is foolish to replace a technical problem with a human problem. But I fear you can never avoid the human problem, only squish it somewhere else. The challenge is finding the “right” human problem to solve, so the rest of the system can support that as efficiently as possible.

I think Benn is calling for pipelines to “fail quickly” when it is better for consumers to get explicitly old data versus implicitly wrong data. But that implies non-fatal errors must be communicated transparently yet efficiently throughout the stack.
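To make that concrete, here is a minimal sketch of what “strict in what you accept” plus “explicitly old rather than implicitly wrong” might look like in a single pipeline step. Everything in it is my own illustration, not anything from Benn’s essay: the `Snapshot` type, the `REQUIRED` schema, and the `refresh` function are all hypothetical names. The idea is simply that a bad batch never replaces good data; instead, the consumer gets the last good snapshot, clearly marked as stale, along with the reasons why.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Snapshot:
    """What a consumer receives: the data plus an honest account of its state."""
    rows: list
    as_of: datetime                      # when these rows were last known to be valid
    stale: bool = False                  # True if we are re-serving an older snapshot
    warnings: list = field(default_factory=list)  # non-fatal problems, passed downstream


# Hypothetical schema for illustration: column name -> required Python type.
REQUIRED = {"order_id": int, "amount": float}


def validate(rows: list) -> list:
    """Strict acceptance: return every problem found; an empty list means the batch is OK."""
    problems = []
    for i, row in enumerate(rows):
        for name, typ in REQUIRED.items():
            if name not in row:
                problems.append(f"row {i}: missing {name!r}")
            elif not isinstance(row[name], typ):
                problems.append(
                    f"row {i}: {name!r} is {type(row[name]).__name__}, expected {typ.__name__}"
                )
    return problems


def refresh(new_rows: list, last_good: Optional[Snapshot], now: Optional[datetime] = None) -> Snapshot:
    """Accept a new batch only if it validates; otherwise serve explicitly old data."""
    now = now or datetime.now(timezone.utc)
    problems = validate(new_rows)
    if not problems:
        return Snapshot(rows=new_rows, as_of=now)
    if last_good is None:
        # Nothing trustworthy to fall back on, so fail loudly instead of guessing.
        raise ValueError("no valid data to serve: " + "; ".join(problems))
    # Non-fatal: keep the old rows and old timestamp, but say so and say why.
    return Snapshot(rows=last_good.rows, as_of=last_good.as_of, stale=True, warnings=problems)
```

The design choice worth noticing is that the warnings travel with the data rather than living in a log somewhere. Whether that scales “throughout the stack” is exactly the open question.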

This is literally impossible (per Masnick), but I believe it is THE human problem that must be addressed, even if we can never solve it! Once we embrace that ugly truth, we can devote all of our effort to doing the best we can technically, while giving each other grace to recognize our human limits.

That’s a contract I’m willing to sign up for. How about you?

PipeBook: UX Design Brief

July 9, 2022

A key design goal of PipeBook is to break away from the single-browser-window user experience of traditional data notebooks, to take full advantage of the large screens on today’s laptops and desktops.

The PipeBook Multi-Window User Experience (Live prototype, annotated)

PipeBook.yml: Reimagining Notebooks as Resilient Data Pipelines

June 30, 2022

Overview

The modern data notebook has its roots in academic tools for mathematical research. Because of that, notebooks are fantastic for open-ended exploration, but an awkward match for production data pipelines. In particular, they don’t:

  • Explicitly declare and track dependencies
  • Enforce organizational quality and reproducibility standards
  • Enable easy testing, validation, and alerting

PipeBooks are a simple but radical re-imagining of notebooks as “tools for iteratively constructing resilient data pipelines.” The key is a novel data format called FRIDAAY that allows us to:

  • Express arbitrary data transformations
  • As a series of idempotent Data Actions
  • Via a single, easy-to-parse YAML file
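For readers unfamiliar with the term, here is a rough sketch of what “a series of idempotent actions” means in general. It is not the FRIDAAY format, whose actual schema is not shown in this excerpt; the action keys (`op`, `source`, `target`, `column`, `min`) and the `run` function are hypothetical, and a plain Python list of dicts stands in for what a parsed YAML file might contain. The sketch assumes each action reads one named table and writes one named table, so that running the whole series a second time changes nothing.

```python
def apply_action(state: dict, action: dict) -> dict:
    """Apply one declarative action to a dict of named tables and return a new dict."""
    source = state[action["source"]]
    if action["op"] == "filter":
        # Keep rows whose column value is at least the given minimum.
        result = [row for row in source if row[action["column"]] >= action["min"]]
    elif action["op"] == "dedupe":
        # Keep the first row seen for each distinct value of the column.
        seen, result = set(), []
        for row in source:
            key = row[action["column"]]
            if key not in seen:
                seen.add(key)
                result.append(row)
    else:
        raise ValueError(f"unknown op: {action['op']!r}")
    # Overwrite the target by name instead of appending to it, so reruns are harmless.
    return {**state, action["target"]: result}


def run(state: dict, actions: list) -> dict:
    """Apply each action in order, threading the state through."""
    for action in actions:
        state = apply_action(state, action)
    return state


# Hypothetical example data and actions.
state = {"orders": [{"id": 1, "amount": 5}, {"id": 1, "amount": 5}, {"id": 2, "amount": 50}]}
actions = [
    {"op": "dedupe", "source": "orders", "target": "unique_orders", "column": "id"},
    {"op": "filter", "source": "unique_orders", "target": "big_orders", "column": "amount", "min": 10},
]

once = run(state, actions)
twice = run(once, actions)
assert once == twice  # rerunning the same actions leaves the result unchanged
```

Because every action overwrites its target from a named source, a failed or repeated run can simply be started again from the top.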

Where Am I?

You are currently browsing the Modern Data Stack category at iHack, therefore iBlog.