Benn’s “rant” feels profound on so many levels, especially if I can assume he’s captured the zeitgeist of our industry as accurately as he usually does.
My first observation is that he seems to (wisely!) invert Postel’s Law for data: be strict in what you accept, and generous in what you emit. The profound truth here is that we cannot control other people. We can only honestly and gracefully fail if we are not getting what we need to succeed.
I can’t help but wonder how much of the energy around “data contracts” is the desire to avoid facing exactly that reality.
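That inverted-Postel idea can be sketched in a few lines of Python. This is purely my own illustration (the schema, column names, and function names are hypothetical, not anything from Benn’s post): reject malformed input outright, and enrich output with provenance the consumer didn’t even ask for.

```python
import datetime

# Hypothetical schema: which columns we demand, and their required types.
REQUIRED_COLUMNS = {"order_id": int, "amount": float}

def accept_strictly(row: dict) -> dict:
    """Be strict in what you accept: reject anything off-schema outright."""
    for col, typ in REQUIRED_COLUMNS.items():
        if col not in row:
            raise ValueError(f"missing required column: {col}")
        if not isinstance(row[col], typ):
            raise TypeError(
                f"{col} must be {typ.__name__}, got {type(row[col]).__name__}"
            )
    return row

def emit_generously(row: dict) -> dict:
    """Be generous in what you emit: attach provenance and a timestamp."""
    return {
        **row,
        "_emitted_at": datetime.datetime.utcnow().isoformat(),
        "_source": "orders_pipeline",  # illustrative source label
    }
```

The point of the sketch is the asymmetry: the accept side fails loudly the moment it isn’t getting what it needs, while the emit side hands downstream consumers more context than they strictly requested.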
Next, the corollary to this is something I literally wrote last night in an internal planning document: “transparency is more important than compliance”. The context is that I don’t want employees worried about “appearing” to reach nominal goals. I want them to be ruthlessly honest with us about the true risks to delivering genuine impact.
Third, the profound implication of this is that we must shift power from centralized hierarchies to decentralized networks. We have to stop chasing Xanadu — the mythical demo of reliable hyperlinks — and embrace the chaotic generativity of the World Wide Web. That is the only kind of system that ever truly scales.
Finally, Benn is right that it is foolish to replace a technical problem with a human problem. But I fear you can never avoid the human problem, only squish it somewhere else. The challenge is finding the “right” human problem to solve, so the rest of the system can support that as efficiently as possible.
I think Benn is calling for pipelines to “fail quickly” when it is better for consumers to get explicitly old data versus implicitly wrong data. But that implies non-fatal errors must be communicated transparently yet efficiently throughout the stack.
This is literally impossible (per Masnick), but I believe it is THE human problem that must be addressed — even if we can never solve it! Once we embrace that ugly truth, we can devote all of our effort to doing the best we can technically, while giving each other grace to recognize our human limits.
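Here is a minimal sketch of what “fail quickly” might look like in a pipeline stage. All names are hypothetical (this is not any real framework’s API): when fresh input fails validation, the stage either serves an explicitly stale snapshot or raises, rather than silently emitting implicitly wrong data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Result:
    rows: List
    as_of: str   # timestamp of the data actually being served
    stale: bool  # True when we fell back to an older snapshot

class FailFastStage:
    """Illustrative pipeline stage: explicitly old beats implicitly wrong."""

    def __init__(self, validate: Callable[[List], bool]):
        self.validate = validate
        self._last_good: Optional[Result] = None

    def run(self, rows: List, as_of: str) -> Result:
        if not self.validate(rows):
            if self._last_good is not None:
                # Serve the last known-good snapshot, with staleness
                # surfaced transparently to consumers.
                return Result(self._last_good.rows, self._last_good.as_of, stale=True)
            # No prior snapshot exists: fail quickly and loudly.
            raise RuntimeError(f"input failed validation at {as_of}; no prior snapshot")
        self._last_good = Result(rows, as_of, stale=False)
        return self._last_good
```

The `stale` flag is the non-fatal error being “communicated transparently yet efficiently” down the stack: consumers know exactly what they are getting, and how old it is.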
That’s a contract I’m willing to sign up for. How about you?
A key design goal of PipeBook is to break away from the single-browser-window user experience of traditional data notebooks, to take full advantage of the large screens on today’s laptops and desktops.
The modern data notebook has its roots in academic tools for mathematical research. Because of that, notebooks are fantastic for open-ended exploration, but an awkward match for production data pipelines. In particular, they don’t:
Explicitly declare and track dependencies
Enforce organizational quality and reproducibility standards
Enable easy testing, validation, and alerting
PipeBooks are a simple but radical re-imagining of notebooks as “tools for iteratively constructing resilient data pipelines.” The key is a novel data format called FRIDAAY that allows us to:
TL;DR: Businesses may start by developing a technical solution, but only succeed by integrating around a human problem. The same is true of the Modern Data Stack.
Lightdash is a super-cool Open Source business intelligence tool built on top of dbt (which I think of as node for SQL). While it is distributed as open source, the usual way to deploy it locally is by simply running a Docker container.
If you want to actually build Lightdash directly from source yourself, you need to follow the instructions under CONTRIBUTING. However, what was written there (as of November 11, 2021) did not quite work for me, so here are my workarounds.
I will also file this as a GitHub issue, and they are super-responsive so hopefully this page will be obsolete soon!
The bane of my IT existence is a business user who says, “Please get me the latest version of <random Excel file I have never seen before, named using idiosyncratic or ambiguous words>. Oh, and I need it tomorrow or else we won’t {make our numbers | pass our audit | satisfy the board}.”
I call this “zombie data” because it:
Lacks any self-awareness
Doesn’t remember where it came from
Has no relationship to its current context
Infects everyone it touches with that same mindlessness
Can I evangelize a corporate data platform by just emailing out reports with sufficiently smart URLs?
Rationale
I don’t have the power to pull others onto a new platform. But I can push useful data to others in a way that inspires them to participate more directly with the platform.
Proposal
Replace friendly Salesforce Reports and powerful NetSuite Saved Searches with a unified interface for viewing, editing, sharing, and managing:
versioned reports
personalized alerts
variant analyses
that are delivered via self-contained emails that also onboard people into greater use of the platform
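As a thought experiment, a “sufficiently smart URL” might look something like the sketch below. Everything here is hypothetical — the hostname, path, parameters, and signing scheme are my own illustration, not an actual Quilt (or any platform) API: a deep link that pins the report and its version, and carries a signed recipient token so the platform can recognize and onboard the reader on click.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET = b"demo-signing-key"  # illustrative; real deployments need key management

def smart_url(report_id: str, version: int, recipient: str) -> str:
    """Build a hypothetical deep link: report + pinned version + signed recipient."""
    params = {"report": report_id, "v": version, "to": recipient}
    # Sort for a canonical query string, then sign it so the link can both
    # authenticate the recipient and personalize the landing page.
    query = urlencode(sorted(params.items()))
    sig = hmac.new(SECRET, query.encode(), hashlib.sha256).hexdigest()[:16]
    return f"https://data.example.com/r?{query}&sig={sig}"
```

The design point is that the email itself stays self-contained, while every click is an onboarding opportunity: the URL already knows who you are, which report you wanted, and exactly which version of it you saw.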
Definitions
Friendly
Browseable
Drag and Drop
Live previews
Powerful
Complex formulas
Scalable notifications
Easy joins and relabeling
Motivation
The main value of Quilt to my business is as a point of leverage to shift the culture of communication from “zombie data” in tables to “smart reports” in a repository.