What the Big Fuss Over Table Formats and Metadata Catalogs Is All About

The big data community gained clarity on the future of data lakehouses this week as a result of Snowflake’s open sourcing of its new Polaris metadata catalog and Databricks’ acquisition of Tabular. The moves cemented Apache Iceberg as the winner of the battle over open table formats, a big win for customers and open data, even as they exposed a new competitive front: the metadata catalog.

The news Monday and Tuesday was as hot as the weather in San Francisco this week, and left some longtime big data watchers gasping for breath. To recap:

On Monday, Snowflake announced that it was open sourcing Polaris, a new metadata catalog based on Apache Iceberg. The move will enable Snowflake customers to use their choice of query engine to process data stored in Iceberg, including Spark, Flink, Presto, Trino, and soon Dremio.

Snowflake followed that up on Tuesday by announcing that, after a year and a half in tech preview, support for Iceberg was generally available. The moves, while expected, completed a dramatic about-face for Snowflake, from proud supporter of proprietary storage formats and query engines to champion of openness and customer choice.

Source: Snowflake

Later Tuesday, Databricks came out of left field with its own groundbreaking news: the acquisition of Tabular, the company founded by the creators of Iceberg.

The move, made in the middle of Snowflake’s Data Cloud Summit at the Moscone Center in San Francisco (and a week before Databricks’ own Data + AI Summit at the same venue), was a de facto admission by Databricks that Iceberg had won the table format war. Its own open table format, called Delta Lake, was trailing Iceberg in terms of support and adoption in the community.

Databricks clearly hoped the move would slow some of the momentum Snowflake was building around Iceberg. Databricks couldn’t afford to let its archrival become the more devout defender of open data, open source, and customer choice by basing its lakehouse strategy on the winning horse, Iceberg, while Databricks’ own horse, Delta, lost ground. By going to the source and acquiring the technical team that built Iceberg for a cool $1 billion to $2 billion (per the Wall Street Journal), Databricks made a big statement, even if it refuses to say so explicitly: Iceberg has won the battle over open table formats.

The moves by Databricks and Snowflake are important because they showcase the tectonic shifts playing out in the big data space. Open table formats like Apache Iceberg, Delta, and Apache Hudi have become critical elements of the big data stack because they allow multiple compute engines to access the same data (usually Parquet files) without fear of corruption from unmanaged concurrent interactions. In addition to ACID transactions, table formats provide “time travel” and rollback capabilities that are important for production use cases. While Hudi, which was developed at Uber to improve its Hadoop-based data lake, was the first open table format, it hasn’t gained the same traction as Delta or Iceberg.
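To make the ACID and time-travel claims concrete, here is a minimal PySpark sketch against an Iceberg table. The catalog name (lake), the table (db.events), the timestamp, and the snapshot ID are all illustrative assumptions, and the session is presumed to be configured with Iceberg’s runtime jars and SQL extensions:

```python
# A minimal sketch of Iceberg's ACID and time-travel features in PySpark.
# Assumes a SparkSession already configured with the Iceberg runtime, its
# SQL extensions, and a catalog named "lake" holding a table db.events
# (all hypothetical names for illustration).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# Every commit produces an immutable snapshot, so concurrent engines
# never see a half-written table.
spark.sql("INSERT INTO lake.db.events VALUES (1, 'click')")

# Inspect the table's commit history via its snapshots metadata table.
spark.sql(
    "SELECT committed_at, snapshot_id FROM lake.db.events.snapshots"
).show()

# Time travel: read the table as of an earlier point in time.
spark.sql(
    "SELECT * FROM lake.db.events TIMESTAMP AS OF '2024-06-01 00:00:00'"
).show()

# Rollback: restore the table to a known-good snapshot after a bad write.
spark.sql("CALL lake.system.rollback_to_snapshot('db.events', 1234567890)")
```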

Open table formats are a critical piece of the data lakehouse, the Databricks-named data architecture that melds the flexibility and scalability of data lakes built atop object stores (or HDFS) with the accuracy and reliability of traditional data warehouses built atop analytical databases like Teradata and others. It’s a continuation of the decomposition of the database into separate components.

But table formats aren’t the only element of the lakehouse. Another critical piece is the metadata catalog, which acts as the glue that connects the various compute engines to the data residing in the table format (in fact, AWS calls its metadata catalog Glue). Metadata catalogs also are important for data governance and security, since they control the level of access that processing engines (and therefore users) get to the underlying data.
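As a rough illustration of that glue role, the sketch below points a Spark session at an Iceberg catalog. The catalog name (lake), the AWS Glue backing, and the warehouse path are assumptions for illustration; swapping the catalog-impl property is, broadly speaking, all it takes to move between catalog implementations without touching the data files:

```python
# A sketch of the "glue" role a metadata catalog plays: the engine is
# pointed at a catalog, and the catalog resolves table names to the
# underlying Iceberg metadata and Parquet files. The catalog name,
# warehouse path, and table are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("catalog-as-glue")
    # Register an Iceberg catalog with Spark under the name "lake".
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    # Here backed by AWS Glue; swapping catalog-impl (Hive, JDBC, REST,
    # Nessie, ...) changes the catalog without moving any data.
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# The engine asks the catalog where "db.events" lives, then reads it.
spark.sql("SELECT count(*) FROM lake.db.events").show()
```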

Table formats and metadata catalogs, when combined with management of the tables (structure design, compaction, partitioning, cleanup), are what give you a lakehouse. All of the data lakehouse offerings, including those from Databricks, Snowflake, Tabular, Starburst, Dremio, and Onehouse (among others), include a metadata catalog and table management atop a table format. Open query engines are the final piece, sitting on top of these lakehouse stacks.
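Those table-management chores are typically run as scheduled maintenance jobs. Here is a hedged sketch using Iceberg’s stock Spark procedures, reusing the hypothetical lake catalog and db.events table from the sketches above:

```python
# A sketch of routine Iceberg table maintenance via its built-in Spark
# procedures (requires the Iceberg SQL extensions; the "lake" catalog
# and db.events table are assumptions carried over from above).

# Compaction: rewrite many small data files into fewer large ones.
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.events')")

# Cleanup: expire old snapshots so unreferenced files can be reclaimed.
spark.sql("CALL lake.system.expire_snapshots(table => 'db.events')")

# Remove files no snapshot references (e.g. left by failed writes).
spark.sql("CALL lake.system.remove_orphan_files(table => 'db.events')")
```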

In recent years, open table formats and metadata catalogs have threatened to create new lock-in points for lakehouse vendors and their customers. Companies have grown concerned about picking the “wrong” open table format, relegating them to piping data among different silos to reach their preferred query engine on their preferred platform, thereby defeating the promise of having a single lakehouse where all data resides. Incompatibility among metadata catalogs also threatened to create new silos when it came to data access and governance.

Recently, the Iceberg community worked to establish an open standard for how compute engines talk to the metadata catalog. It wrote a REST-based interface with the hope that metadata catalog vendors would adopt it. Some already have, notably Project Nessie, a metadata catalog developed by the folks at Dremio.
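To give a flavor of that REST interface, the sketch below hits a few of the read-side endpoints defined by the Iceberg REST catalog spec using a plain HTTP client. The host, bearer token, and namespace are invented for illustration, and real deployments may add a server-supplied prefix segment to the paths:

```python
# A hedged sketch of talking to an Iceberg REST catalog over plain HTTP.
# Because the interface is standardized, any engine (or any HTTP client)
# can use the same endpoints regardless of who hosts the catalog. The
# base URL, token, and "analytics" namespace below are made up.
import requests

CATALOG = "https://catalog.example.com/api/catalog"
HEADERS = {"Authorization": "Bearer <token>"}

# Engines typically start by fetching catalog defaults and overrides.
config = requests.get(f"{CATALOG}/v1/config", headers=HEADERS).json()
print(config)

# List namespaces, then the Iceberg tables within one of them.
namespaces = requests.get(f"{CATALOG}/v1/namespaces", headers=HEADERS).json()
tables = requests.get(
    f"{CATALOG}/v1/namespaces/analytics/tables", headers=HEADERS
).json()
print(namespaces, tables)
```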

Snowflake developed its new metadata catalog, Polaris, to support this new REST interface, which is building momentum in the community. The company will be donating the project to open source within 90 days, and says it will most likely choose the Apache Software Foundation. Snowflake hopes that, by open sourcing Polaris and giving it to the community, it will become the de facto standard metadata catalog for Iceberg, effectively ending the metadata catalog’s run as another potential lock-in point.

Now the ball is in Databricks’ court. By acquiring Tabular, it has effectively conceded that Iceberg has won the table format war. The company will keep investing in both formats in the short run, but in the long run, it won’t matter to customers which one they choose, Databricks tells Datanami.

Now Databricks is under pressure to do something with Unity Catalog, the metadata catalog that it developed for use with Delta Lake. It is currently not open source, which raises the potential for lock-in. With the Data + AI Summit next week, look for Databricks to provide more clarity on what will become of Unity Catalog.

Databricks trolled Snowflake down the street from its Data Cloud Summit this week

At the end of the day, these moves are great for customers. Customers demanded data platforms that are open, that don’t lock them in, that allow them to move data in and out as they please, and that allow them to use whatever compute engine they want, when they want. And the amazing thing is, the industry gave them what they wanted.

The open platform dream may have been born nearly 20 years ago, at the start of the Hadoop era, but the technology just wasn’t good enough to deliver on the promise. With the advent of open table formats, open metadata catalogs, and open compute engines, not to mention effectively infinite storage paired with unlimited on-demand compute in the cloud, the fulfillment of the dream of an open data platform is finally within reach.

With the AI revolution promising to spawn even bigger big data and more meaningful use cases that generate trillions of dollars in value, the timing couldn’t have been much better.

Related Items:

Databricks Nabs Iceberg-Maker Tabular to Spawn Table Uniformity

Snowflake Embraces Open Data with Polaris Catalog

How Open Will Snowflake Go at Data Cloud Summit?

