20 May 2026
Delta, Iceberg, DuckLake - an experiment in open-source data sharing
A note before anything else: this is an account of the open-source path. Databricks, Snowflake, and the other managed offerings around these formats almost certainly just work, because that is what you are paying them for. What follows is what happens when you try to assemble the same picture from the parts that are free.
There is a particular kind of conversation that happens inside a data platform team somewhere around year two or three of running a production warehouse. The team has built something that works. Queries return, dashboards refresh, customers are not complaining. And yet someone, usually in a planning meeting, says the sentence that begins the next chapter: we should be able to just hand them the data. Not a dashboard. Not an export. The data itself, in a form a consumer can connect to with their own tools, on their own schedule, without our infrastructure in between.
The sentence is innocent. What it sets in motion is not.
This post is about what happens when someone takes that sentence seriously and looks at the three open table formats that promise, in different ways, to make it true: Delta, Iceberg, and DuckLake. I prototyped all three.
The promise of the open table formats is that the layer disappears. The storage is the share. A consumer with the right connector reads the same files you wrote, governed by the same metadata, with no service in between. PowerBI plugs into an endpoint. The analyst opens a notebook. The data is just there.
This is a beautiful idea. It is also, in 2026, mostly an idea.
Delta and the silence of the maintainers
I started with Delta because it promised the shortest path from where I was to where I wanted to be. Delta Sharing is, on paper, exactly the protocol the brochure describes: a recipient holds a token, points a client at a URL, receives signed URLs to Parquet files, and reads them. PowerBI was supposed to be a one-click consumer. Same as Tableau. The recipient does not need our infrastructure. They need the spec.
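To make that flow concrete, here is a minimal consumer-side sketch, assuming the query response is the newline-delimited JSON that the Delta Sharing protocol describes. The endpoint, share, schema, and table names and the sample response are hypothetical, the empty request body is an assumption, and nothing is sent over the network; this is an illustration, not the code from the prototype.

```python
import json
from urllib.request import Request


def build_query_request(endpoint: str, token: str, share: str, schema: str, table: str) -> Request:
    """Build (but do not send) the POST that asks the server for a table's files."""
    url = f"{endpoint.rstrip('/')}/shares/{share}/schemas/{schema}/tables/{table}/query"
    return Request(
        url,
        data=b"{}",  # assumption: an empty JSON body, i.e. no predicate or limit hints
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # the recipient's bearer token
            "Content-Type": "application/json",
        },
    )


def signed_urls(ndjson_body: str) -> list[str]:
    """Collect the pre-signed Parquet URLs from a newline-delimited JSON response."""
    urls = []
    for line in ndjson_body.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Each line wraps one object; only the "file" lines carry a URL.
        if "file" in record:
            urls.append(record["file"]["url"])
    return urls


# Hypothetical response body: one protocol line, one metadata line, one file line.
SAMPLE_RESPONSE = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1}}),
    json.dumps({"metaData": {"id": "hypothetical-table-id", "format": {"provider": "parquet"}}}),
    json.dumps({"file": {
        "url": "https://storage.example/part-0.parquet?sig=hypothetical",
        "id": "0",
        "partitionValues": {},
        "size": 1024,
    }}),
])

if __name__ == "__main__":
    req = build_query_request("https://sharing.example/api", "hypothetical-token", "sales", "public", "orders")
    print(req.get_method(), req.full_url)
    print(signed_urls(SAMPLE_RESPONSE))
```

Everything a recipient needs fits in those two functions, which is the appeal: a token, a URL, and a Parquet reader.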
I implemented the producer side end-to-end. In the course of doing so I opened two issues against the open-source delta-sharing repository. One of them I traced to what looked unmistakably like a copy-paste error — a one-line fix, the PR essentially writing itself. Both issues have been open for a year. Neither has been acknowledged.
PowerBI, for its part, never connected. Not once, not in any configuration I tried. The protocol exists. The reference implementation exists. The vendor connectors, the part that would make any of this matter for a customer who is not also a data engineer, do not work.
There is more, on the writer side. Delta as I used it required:
- custom bucket IDs for partitioning, since the natural partitioning I wanted was not expressible directly
- a locking mechanism around concurrent online refreshes, because Delta on object storage does not give you atomic swaps for free
- an incremental load path that builds an Arrow frame and overwrites, because the natural Polars path did not survive contact with the schema layer
- Arrow as the schema source of truth, not Polars, because the round-trip otherwise loses fidelity
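The first item on that list is the easiest to show. What follows is a minimal sketch of a hash-derived bucket ID, not the scheme from the prototype: the function names, the choice of CRC32, and the default of 16 buckets are all assumptions made for illustration.

```python
import zlib


def bucket_id(key: str, n_buckets: int = 16) -> int:
    """Map a key to a stable bucket number in the range [0, n_buckets)."""
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    # CRC32 gives the same value in every process, unlike the built-in
    # hash() on strings, which is randomized per interpreter run.
    return zlib.crc32(key.encode("utf-8")) % n_buckets


def add_bucket_column(rows: list[dict], key_field: str, n_buckets: int = 16) -> list[dict]:
    """Return copies of the rows with an extra 'bucket_id' column to partition on."""
    return [
        {**row, "bucket_id": bucket_id(str(row[key_field]), n_buckets)}
        for row in rows
    ]


if __name__ == "__main__":
    rows = [{"customer": "acme"}, {"customer": "globex"}, {"customer": "acme"}]
    for row in add_bucket_column(rows, "customer", n_buckets=4):
        print(row)
```

The writer then partitions on `bucket_id` instead of the natural key, which keeps the partition count bounded no matter how many distinct keys arrive.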
Each of these is solvable. Each of them is a thing I built. None of them were what I came for.
The verdict on Delta is not that it is bad. The verdict is that I wrote more code to feed it than I would have written to live without it, and the thing it was supposed to give us in exchange, frictionless sharing, never arrived.
Iceberg and the second service
Iceberg was the second prototype. The ETL was visibly cleaner. The features I had been building by hand in Delta - atomic swaps, branching, native partitioning - are in Iceberg as primitives. There is a healthy open-source ecosystem around it. Lakekeeper and similar projects have made the catalog layer a real choice rather than a vendor lock-in.
But Iceberg has a catalog. The catalog is a service. The service has to run, be backed up, be authenticated against, be reachable from every consumer, and be versioned in lockstep with the readers that talk to it.
For the ETL this is a tax I would happily pay. The implementation was concise, and the primitives matched the problem.
For sharing, it is the same problem as Delta in a different shape. A consumer needs to reach the catalog and the blob storage and have a client that understands the current Iceberg spec the way the catalog presents it. The “just connect to an endpoint” promise becomes “connect to two endpoints and authenticate against both and hope the connector version aligns.” PowerBI as a first-class Iceberg consumer is still arriving, not arrived.
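Written down from the consumer's side, the two endpoints look like this. It is a minimal sketch assuming a REST catalog and S3-compatible storage read through PyIceberg; every URL, credential, and table name is hypothetical, and some REST catalogs can vend storage credentials themselves, in which case the `s3.*` entries may not be needed.

```python
def consumer_properties(
    catalog_uri: str,
    catalog_token: str,
    storage_endpoint: str,
    access_key: str,
    secret_key: str,
) -> dict:
    """Collect what a consumer must know: one set of settings per endpoint."""
    return {
        # Endpoint 1: the catalog service.
        "type": "rest",
        "uri": catalog_uri,
        "token": catalog_token,
        # Endpoint 2: the blob storage that holds the data files.
        "s3.endpoint": storage_endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


def endpoints_to_reach(properties: dict) -> list[str]:
    """List the network endpoints the consumer's firewall has to allow."""
    return [properties["uri"], properties["s3.endpoint"]]


def read_table(properties: dict, identifier: str):
    """Load a table through the catalog and read it into an Arrow table."""
    # Imported lazily so the sketch can be inspected without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("shared", **properties)  # "shared" is an arbitrary local name
    return catalog.load_table(identifier).scan().to_arrow()


if __name__ == "__main__":
    props = consumer_properties(
        "https://catalog.example/api", "hypothetical-token",
        "https://blob.example", "hypothetical-key", "hypothetical-secret",
    )
    print(endpoints_to_reach(props))
```

Neither function is complicated. The cost is that both endpoints, both credentials, and a compatible client version have to exist on the consumer's machine at the same time.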
So Iceberg solved the ETL elegantly and reproduced the sharing problem one level down.
DuckLake and certificates
Note: I used DuckLake v1.0.0, which was released only weeks before my tests and this post.
DuckLake was the simplest implementation by a wide margin. No separate catalog service: the metadata lives in Postgres, and I already ran a Postgres per environment for other reasons, so the marginal infrastructure was approximately zero. The writer side was almost embarrassingly short.
The sharing story, on paper, was the cleanest of the three. DuckDB has an ODBC driver. PowerBI speaks ODBC. The DuckDB engine reads DuckLake directly. The connector is the engine.
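For a sense of how little there is to it, here is a minimal sketch of attaching a DuckLake whose metadata sits in Postgres, using the DuckDB Python client. The connection string, data path, and alias are hypothetical, and storage credentials are left out entirely.

```python
def sql_quote(value: str) -> str:
    """Escape single quotes so the value can sit inside a SQL string literal."""
    return value.replace("'", "''")


def attach_statements(pg_dsn: str, data_path: str, alias: str = "lake") -> list[str]:
    """Build the statements that attach a Postgres-backed DuckLake under an alias."""
    return [
        "INSTALL ducklake",
        "INSTALL postgres",
        # Metadata lives in Postgres; DATA_PATH points at the Parquet files.
        f"ATTACH 'ducklake:postgres:{sql_quote(pg_dsn)}' AS {alias} "
        f"(DATA_PATH '{sql_quote(data_path)}')",
        f"USE {alias}",
    ]


def open_lake(pg_dsn: str, data_path: str):
    """Open an in-memory DuckDB connection with the lake attached."""
    # Imported lazily so the statement builder works without DuckDB installed.
    import duckdb

    con = duckdb.connect()
    for statement in attach_statements(pg_dsn, data_path):
        con.execute(statement)
    return con


if __name__ == "__main__":
    for statement in attach_statements("dbname=lake host=pg.internal", "az://container/lake/"):
        print(statement + ";")
```

Note the two `INSTALL` lines: they assume the machine is allowed to download extensions, an assumption that does not always hold on a consumer's machine.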
What actually happened:
- Most corporate firewalls forbid the DuckDB driver from auto-installing the extensions it needs at connect time.
- The Azure extension links its own OpenSSL with hardcoded certificate paths. On Ubuntu (including under WSL2, and on some other distros) the system cert bundle lives somewhere else. Symlinking the bundle into the expected location works for some operations and not others.
- `read_parquet` over the curl transport works. The internal `CHECKPOINT` does not. `CHECKPOINT` goes through the Azure C++ SDK directly, bypassing curl, and the SSL handshake fails on the path the curl transport had already smoothed over.
- `abfss://` URLs break `CHECKPOINT` regardless. Switching to `az://` works around it.
- PowerBI cannot host an in-memory DuckDB in read-only mode. A persistent `.duckdb` file has to exist somewhere on disk just to hold the connection.
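Two of these have workarounds that fit in a few lines. The sketch below assumes the extension binaries were copied onto the machine by hand and match the DuckDB version and platform in use; all paths and the holder file name are hypothetical. It covers the firewall and the persistent-file issues, and it switches the Azure extension to its curl transport, which per the list above helps `read_parquet` but not `CHECKPOINT`.

```python
from pathlib import Path


def offline_setup_statements(extension_files: list[str], extension_dir: str) -> list[str]:
    """Build statements that install extensions from local files, with no downloads."""
    statements = [
        f"SET extension_directory = '{extension_dir}'",
        # Stop DuckDB from trying to fetch extensions at connect time.
        "SET autoinstall_known_extensions = false",
    ]
    names = []
    for path in extension_files:
        # 'azure.duckdb_extension' -> 'azure'
        name = Path(path).name.split(".")[0]
        names.append(name)
        statements.append(f"INSTALL '{path}'")
        statements.append(f"LOAD {name}")
    if "azure" in names:
        # Only valid once the azure extension is loaded.
        statements.append("SET azure_transport_option_type = 'curl'")
    return statements


def open_holder(db_file: str, statements: list[str]):
    """Open a persistent database file and apply the setup statements to it."""
    # Imported lazily so the statement builder works without DuckDB installed.
    import duckdb

    con = duckdb.connect(db_file)  # creates the file if it does not exist yet
    for statement in statements:
        con.execute(statement)
    return con


if __name__ == "__main__":
    files = ["/opt/ext/azure.duckdb_extension", "/opt/ext/ducklake.duckdb_extension"]
    for statement in offline_setup_statements(files, "/opt/duckdb_ext"):
        print(statement + ";")
```

None of this is hard, but all of it has to happen on the consumer's machine, which is exactly where I have the least control.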
Each of these is a real ticket. None of them is DuckLake’s fault in the sense that anyone designed them. They are the friction of an ecosystem that has not yet finished assembling itself in the places where it has to meet enterprise IT.
DuckLake is the most promising of the three for the ETL story. For sharing, it is the youngest, and youth shows.
The trade-off, written out
| | Delta | Iceberg | DuckLake |
|---|---|---|---|
| ETL implementation effort | high - manual partitioning, custom locking, Arrow plumbing | moderate - primitives match the problem | low - closest to “just write” |
| Required services beyond storage | none | catalog | Postgres (already had it) |
| Sharing primitive | Delta Sharing protocol | catalog + blob | DuckDB driver + blob |
| BI tool reality (PowerBI) | did not connect | partial, version-sensitive | connects but fragile in restricted environments |
| Open-source responsiveness | issues unacknowledged for a year | active, healthy | active, young |
| What it solves cleanly | nothing of ours | ETL semantics | ETL ergonomics |
| What it does not solve | sharing | sharing | sharing |
The columns differ. The last row does not.
What the experiment was actually about
The pattern is the same in all three cases. The producer side is a solved problem, or close to it. The friction is on the consumer side, and the friction has two sources: the connector ecosystem in BI tools, which moves slowly, and the corporate environment that the consumer actually sits inside, which does not allow the runtime conveniences the producer-side abstractions assume.
ClickHouse does not have this problem because ClickHouse is a service. A service is, among other things, a contract about what the consumer is allowed to assume. The open table formats traded that contract for files-in-a-bucket, which is more open in principle and less reachable in practice.
A note on disagreement
If your consumers are other data teams with their own engineers and their own clusters, none of what I have written applies. The open table formats are excellent in that world and the friction I described is somebody else’s problem to solve, often pleasantly. If your consumers are analysts in BI tools sitting behind corporate firewalls, my account is probably your account, and the calendar matters: the same survey done eighteen months from now may read very differently.
I would be glad to be told I was wrong about any of this.