What is multimodal AI?
ChatGPT Prompt
What would multimodal data management look like?
Write the skit where three thought leaders accidentally invent the term and surprisingly converge on what it means.
1. Dramatis Personae
- Ben Lorica — Editor-at-large, professional pattern recognizer
- Matei Zaharia — Distributed systems builder, allergic to hand-waving
- Chip Huyen — ML systems thinker, intolerant of fuzzy language
2. Setting
A neglected corner of an AI infrastructure conference.
One wobbling café table. Three bad coffees.
A whiteboard that still says “Agents???” from the previous panel.
3. The Skit
3.1 Opening: The Shared Irritation
Ben (sipping coffee, wincing)
“Every panel today had the same energy.
Very confident.
Very unhelpful.”
Chip
“Let me guess.
Someone said, ‘Just improve your data quality.’”
Ben
“Twice.
With a straight face.”
Matei
“That phrase has now lost all information content.
It’s like saying ‘be better at engineering.’”
(They sit. Silence. The whiteboard stares at them.)
3.2 The Symptom Dump
Chip
“Here’s what people actually email me about.
They have text, images, audio, logs, traces—
and none of it agrees with itself.”
Matei
“Because it lives in five systems.
Owned by three teams.
Versioned by accident.”
Ben
“And every new model requires a bespoke archaeological dig.
Someone’s job becomes ‘find the right video from six months ago.’”
Chip
“Which they then preprocess differently.
So now you have five truths.
All named final_v3_really_final.”
3.3 The Whiteboard Incident
Matei (standing, marker in hand)
“Let me show you why this is broken.”
He draws boxes:
- Object storage: Images / Video
- A data lake: Tables
- A vector database: Embeddings
- A sad arrow pointing to: Notebook Hell
Matei
“This is not an architecture.
This is a crime scene.”
Ben
“What’s missing is the thing we had for structured data.
But for… all of this.”
Chip
“A system that knows these are the same thing.
The same event.
The same customer.
The same moment in time.”
3.4 The Accidental Phrase
Ben (half-thinking aloud)
“So it’s not just unstructured data management.
It’s not just vector search.
It’s managing data across—”
Chip
“—modalities.”
(Beat.)
Matei
“…as data.”
(Another beat.)
Ben
“So.
Multimodal data management?”
(They all freeze.)
3.5 The Recognition
Chip
“I hate how clear that is.”
Matei
“I hate that it doesn’t oversell anything.”
Ben
“It’s deeply unsexy.
Which means it might survive.”
(They look at the whiteboard. The phrase is written, accidentally underlined.)
3.6 Sanity Check (The Unspoken Agreement)
Chip
“Okay, sanity test.
This would include lineage, right?”
Matei
“And versioning.
Across raw data and derived artifacts.”
Ben
“And governance.
Because lawyers will absolutely be involved.”
Chip
“And retrieval that isn’t modality-specific.”
Matei
“And recomposition at inference time.
Without rewriting pipelines.”
(They nod. No one objects.)
3.7 The Inevitable Consequences
Chip
“Someone’s going to put this on a slide.”
Ben
“Someone else will claim they already do it.”
Matei
“They won’t.”
Ben
“Analysts will sanitize it.”
Chip
“And engineers will quietly build it anyway.”
3.8 Exit Line
Matei (putting the marker down)
“Did we just invent a category?”
Ben
“No.
We noticed one.”
Chip
“Which is worse.
Now it’s real.”
(Lights fade. The whiteboard remains.)
4. Appendix: What They Actually Meant (Technical)
4.1 Definition
Multimodal data management refers to systems and practices that treat text, images, audio, video, time-series, and structured data as first-class, aligned, governable, and versioned assets, independent of any single model or pipeline.
4.2 Core Capabilities
- Unified ingestion across modalities
- Cross-modal identity and temporal alignment
- Versioning of raw and derived artifacts
- End-to-end data lineage
- Modality-agnostic retrieval and filtering
- Policy-aware access and data governance
4.3 What It Is Not
- Not just a vector database
- Not merely a labeling or annotation tool
- Not MLOps rebranded
- Not model-centric
- Not solved by “better prompts”
4.4 Why It Emerges Now
- Multimodal foundation models
- Retrieval-augmented generation
- AI agents that dynamically assemble context
- Regulatory pressure around provenance, consent, and auditability
5. Final Note
Categories are rarely invented on purpose.
They appear when multiple competent people realize they are all compensating for the same missing abstraction.
Or, as someone once almost said:
“There is nothing new under the sun—only newly named.”

Leave a comment