Mahdi Karabiben (@mahdiqb)

☕ #13: The Semantic Layer Renaissance, The Death of the “Shopping List” Architecture, and The Big/Small Metadata Question
A recipe for an AI-ready semantic layer, the new dynamic of the data stack's tools and categories, and a counter-thesis to distributed metadata.

Fresh Data Espresso is out ☕️

- Why the semantic layer needs a "hard reset" for AI (from metrics to context)
- Why the MDS "shopping list" data architecture is not coming back
- DuckLake's counter-thesis to distributed metadata

dataespresso.substack.com/p/13-the-sem...

15.12.2025 22:18 👍 0 🔁 0 💬 0 📌 0

Building a Semantic Layer for the AI Era: Beyond SQL Generation
A guide to capturing the “What, Why, and Who” for Agent functionality

I wrote a deep dive for Data Engineer Things on how to build a semantic layer that moves beyond "SQL generation" to one that AI can actually reason with.

You can read the full article here: blog.dataengineerthings.org/semantic-lay...

4/4

27.11.2025 13:47 👍 0 🔁 0 💬 0 📌 0

This shifts the definition of a Semantic Layer:

❌ From: "Here is the math to calculate Churn." (SQL Generation)

✅ To: "Here is what Churn means, why we track it, and who owns it." (Reasoning)

We need to go from defining metric logic for BI to building knowledge graphs for AI. 3/4
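
A minimal sketch of that shift, assuming a hypothetical `MetricContext` record (the class, field names, table, and owner below are illustrative and not taken from the article): the SQL is still there, but it sits next to the meaning, the purpose, and the owner, with implicit human context written down as explicit caveats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricContext:
    """One semantic-layer entry: the math plus the what, why, and who."""
    name: str
    sql: str        # the "math" (where a SQL-generation layer stops)
    meaning: str    # what the metric means
    purpose: str    # why the business tracks it
    owner: str      # who owns it
    caveats: tuple[str, ...] = ()  # implicit human context, made explicit


def to_agent_context(metric: MetricContext) -> str:
    """Render the entry as plain text an agent can read before acting."""
    lines = [
        f"Metric: {metric.name}",
        f"Meaning: {metric.meaning}",
        f"Why we track it: {metric.purpose}",
        f"Owner: {metric.owner}",
        f"Calculation: {metric.sql}",
    ]
    lines += [f"Caveat: {caveat}" for caveat in metric.caveats]
    return "\n".join(lines)


# Hypothetical example entry; the table and column names are assumptions.
churn = MetricContext(
    name="churn_rate",
    sql="SELECT AVG(CASE WHEN churned THEN 1.0 ELSE 0.0 END) FROM customers",
    meaning="Share of customers who cancelled during the period.",
    purpose="Early signal of retention problems.",
    owner="retention-team",
    caveats=("Ignore test accounts.",),
)

print(to_agent_context(churn))
```

The point of the design is that the agent receives the caveats and ownership alongside the calculation, instead of a bare SQL snippet.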

27.11.2025 13:47 👍 1 🔁 0 💬 1 📌 0

But the reality has shifted: We are moving from Analytics (humans looking at charts) to Agents (AI taking actions).

Humans have implicit context ("ignore test accounts," "ask the VP"), while AI is context-blind.

If you give an Agent raw SQL snippets, it will hallucinate. 2/4

27.11.2025 13:47 👍 1 🔁 0 💬 1 📌 0

In 2017, we redefined "Data Engineering." Today, we're at a similar inflection point for the Semantic Layer.

In 2022 (MDS R.I.P.), we treated it as a fancy SQL generator: Thousands of YAML lines to calculate metrics for dashboards.

Deep down, we all knew it was overkill. 1/4

27.11.2025 13:47 👍 1 🔁 0 💬 1 📌 0

From Data Trust to Decision Trust: The Case for Unified Data + AI Observability
Data observability was built to catch errors for humans. Unified observability is built to control risk for autonomous AI.

We've solved observability for dashboards. Now we need to solve it for agents.

Had a great time writing for the Metadata Weekly newsletter about the shift from Data Trust to Decision Trust:

metadataweekly.substack.com/cp/179547368

22.11.2025 01:21 👍 0 🔁 0 💬 0 📌 0

On a broader note, treating skills like writing concise emails or running effective meetings as things you just "pick up" is a massive institutional blind spot for universities & companies, and it costs us all daily. 2/2

14.10.2025 21:04 👍 0 🔁 0 💬 0 📌 0

Smart Brevity by Jim VandeHei, Mike Allen, and Roy Schwartz should be mandatory reading before you're allowed to send your 1st email/Slack message.

And it's more vital than ever now, since we upgraded from "human-generated fluff" to "AI-supercharged fluff". 1/2

14.10.2025 21:04 👍 0 🔁 0 💬 1 📌 0

The dbt Labs-Fivetran merger is such a Logan Roy move by Fivetran - they now own all three data transformation tools that emerged from the MDS: dbt, SQLMesh, and SDF (via dbt Labs).
The post-MDS world is full of surprises.

13.10.2025 17:26 👍 0 🔁 0 💬 0 📌 0

Data Modelling for Data Products by Mahdi Karabiben | Modern Data 101 Community
Learn how to design business-aligned data models and scalable data products with the right metrics, frameworks, and governance from day one.

In the course, I introduce a practical framework (+ tools & principles) to help you design scalable data models and ship impactful data products that deliver business value (and not just vanity metrics 🤷🏼‍♂️). 2/2
Watch it here:
www.moderndata101.com/masterclass/...

11.10.2025 20:30 👍 0 🔁 0 💬 0 📌 0

Excited to share that my masterclass, "Data Modeling for Data Products," is now available on-demand via ModernData101!

If you're a Data PM, Analytics Engineer, or Data Engineer focused on building valuable (& scalable) data products, this session is for you! 1/2

11.10.2025 20:30 👍 0 🔁 0 💬 1 📌 0

Peak Paris is going to a (fantastic) contemporary dance show at a department store - (Babel at Le Bon Marché👌🏼)

07.09.2025 20:40 👍 0 🔁 0 💬 0 📌 0

Embedding User-Defined Indexes in Apache Parquet Files - Apache DataFusion Blog

Very interesting article by the @apachedatafusion.bsky.social team on user-defined/custom indexes in Parquet - really surprised that other Parquet readers/writers don't leverage this, given that file pruning remains limited with "vanilla" Parquet in many scenarios.
datafusion.apache.org/blog/2025/07...

16.08.2025 16:24 👍 2 🔁 0 💬 0 📌 0

Data teams have a reputation for building cool things that aren't useful. I break down a simple two-step path to fix this: 1) Find low-hanging fruit for quick wins. 2) Dive deep into the business to find real problems, like enriching product analytics with raw event data. 4/4

04.08.2025 07:51 👍 0 🔁 0 💬 0 📌 0

Spotify Wrapped is great, but why is it such a rare example of a personal data product? We have mountains of data siloed in our apps. I explore how AI could be the "last-mile" enabler to connect these APIs and create a coherent narrative from the data of our own lives. 3/4

04.08.2025 07:51 👍 0 🔁 0 💬 1 📌 0

Data modeling is cool again, and that's good, but we need to adapt it to today's world. My new article proposes a "Go Wide, then Go Deep" strategy to adapt modeling for a world of data products. More in the newsletter. 2/4

04.08.2025 07:51 👍 0 🔁 0 💬 1 📌 0

Espresso #12: Data modeling for data products, a Spotify Wrapped for everything, and building things that matter
A modern playbook for data modeling in a product-driven world, how AI can power a supercharged Spotify Wrapped, and a two-step formula for building valuable data products.

Data Espresso #12 is out ☕
This edition covers:
- A modern playbook for data modeling in a product-driven world
- Why we need a "Spotify Wrapped for everything"
- A two-step formula for building data products that actually matter
dataespresso.substack.com/p/espresso-1...
1/4

04.08.2025 07:51 👍 0 🔁 0 💬 1 📌 0

Data Modeling for Data Products: A Practical Guide
A modern playbook for data modeling in a product-driven world.

In my latest article, I present a set of strategies, techniques, and frameworks for adapting data modeling to the world of data products - from distributed ownership to metric trees and entity-centric modeling. 2/2

blog.det.life/data-modelin...

30.06.2025 21:22 👍 0 🔁 0 💬 0 📌 0

Data modeling is back (and it's good news!), but we can't just copy-paste the old playbook. Instead, there's a big opportunity to adapt existing data modeling methodologies to today's world: data products, limitless compute, and a big need for speed. 1/2

30.06.2025 21:22 👍 1 🔁 0 💬 1 📌 0

How AI is Finally Democratizing the Data Platform’s Last-Mile Layer
Why the ‘Last Mile’ of the data experience — polished data platform capabilities — is no longer just for Big Tech.

This means data teams of all sizes can finally start building those experiences, making their data platforms more intuitive and powerful. AI is democratizing this capability, and the "last mile" is looking a lot more accessible. I dive into this topic in my latest blog post: shorturl.at/f9mDi
5/5

31.05.2025 23:12 👍 0 🔁 0 💬 0 📌 0

That simple experiment genuinely shifted my perspective. Building experience layers (the UIs and streamlined workflows on top of data components) isn't just for companies with massive engineering resources anymore. AI is truly acting as a "last-mile enabler" here. 4/5

31.05.2025 23:12 👍 1 🔁 0 💬 1 📌 0

Last weekend, using Replit's AI assistant, I tried building a basic dbt run timeline visualizer. The result? A functional (albeit basic) UI in under an hour. Just me, no big platform team. (Even Claude isn't a big fan of D3.js though 😅)
3/5

31.05.2025 23:12 👍 0 🔁 0 💬 1 📌 0

Historically, building this layer was a luxury few beyond Big Tech could afford - mainly because data teams are constantly overwhelmed and in firefighting mode. But AI is changing this, fast.
2/5

31.05.2025 23:12 👍 0 🔁 0 💬 1 📌 0

Visualizing Data Timeliness at Airbnb by Chris Williams, Ken Chen, Krist Wongsuphasawat, and Sylvia Tomiyama

Today's data platforms are powerful, but the actual experience is still clunky (tool hopping, siloed metadata, etc.). The missing piece is the last mile/experience layer – a polished UI/UX layer connecting all the backend systems. A great example is Airbnb's data timeliness UI: shorturl.at/3n5w9
1/5

31.05.2025 23:12 👍 1 🔁 0 💬 1 📌 0

Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data
Explore how Discord supercharged dbt with a tailored solution designed for performance, developer productivity, and data quality.

This is a great post by the Discord data team about how they augmented several dbt features like materializations and macros.

The "meta" attribute continues to be severely underused by data teams, but they neatly leverage it here for custom versioning.
discord.com/blog/overclo...

29.04.2025 23:02 👍 0 🔁 0 💬 0 📌 0

Constraint-heavy environments (niche tech, missing tools) force a different engineering discipline (+ a recent talk by Jane Street as an example): Relying less on abstractions demands more creativity & deeper understanding. It can be frustrating, but it's a valuable learning ground. 4/4

16.04.2025 17:00 👍 0 🔁 0 💬 0 📌 0

Revisiting design decisions often yields more than just fixing tech debt (+ a recent article by GumGum as an example): Tool capabilities & pricing evolve; patterns that made sense before might be suboptimal now. Regularly analyzing bottlenecks vs current features pays off. 3/4

16.04.2025 17:00 👍 0 🔁 0 💬 1 📌 0

Iceberg: What its 'standard' status means for Hudi/Delta users (integrations, support drift?). When does the ecosystem pull justify a switch/adoption? If new to table formats, I offer some tips to assess real value vs hype before adopting. More in the newsletter. 2/4

16.04.2025 17:00 👍 0 🔁 0 💬 1 📌 0

Espresso #10: A new ice(berg) age, revisiting old designs, and thriving on constraints
Hello data friends,

Data Espresso #10 is out ☕️
This edition covers:
- If/when you should migrate to Apache Iceberg
- The benefits of revisiting design decisions (and why you should do it often)
- How constraint-heavy environments can foster engineering ingenuity
open.substack.com/pub/dataespr...
1/4

16.04.2025 17:00 👍 1 🔁 0 💬 1 📌 0

Trace - Introduction to Metric Trees
A proper introduction to metric trees with a real world example

Now that the semantic layer is back in the data space's spotlight, it's definitely worth revisiting adjacent concepts like metric trees. If you're unfamiliar with the term, Trace has a fantastic (and brief) intro that's worth your time: www.hellotrace.io/blog/introdu...
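
For readers new to the term, here is a rough sketch of the idea, not Trace's implementation: a metric tree breaks a top-level metric into the inputs that drive it. The `MetricNode` class, the revenue decomposition, and the numbers below are all made up for illustration.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class MetricNode:
    """A node in a metric tree: either a raw input or a combination of children."""
    name: str
    op: str = "input"    # "input", "sum", or "product"
    value: float = 0.0   # only used by input nodes
    children: list["MetricNode"] = field(default_factory=list)

    def evaluate(self) -> float:
        """Compute this metric from its inputs, bottom-up."""
        if self.op == "input":
            return self.value
        child_values = [child.evaluate() for child in self.children]
        if self.op == "sum":
            return sum(child_values)
        if self.op == "product":
            return prod(child_values)
        raise ValueError(f"unknown op: {self.op}")


# Illustrative decomposition: revenue = customers * average revenue per customer,
# where customers = new customers + retained customers.
customers = MetricNode("customers", op="sum", children=[
    MetricNode("new_customers", value=200.0),
    MetricNode("retained_customers", value=800.0),
])
revenue = MetricNode("revenue", op="product", children=[
    customers,
    MetricNode("avg_revenue_per_customer", value=50.0),
])

print(revenue.evaluate())  # 50000.0
```

Because every parent is computed from its children, a change in the top-level number can be traced down to the input that moved.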

09.02.2025 15:52 👍 1 🔁 0 💬 0 📌 0
