It’s Go Time for Open Data Lakehouses

If you’re a supporter of open data, it’s hard not to feel good about last week’s news around Apache Iceberg. Customers demanded an open storage format, and the two leading providers, Snowflake and Databricks, are delivering it, in a big way.

To recap: Databricks surprised the big data community last Tuesday by throwing its weight behind Apache Iceberg with the announcement of its intent to acquire Tabular, which was founded by former Netflix engineers who created Iceberg.

That announcement came a day after Snowflake unveiled Polaris, a new metadata catalog designed to work with Iceberg, thereby enabling customers to use open query engines with their data. The move furthered Snowflake’s transition from a proudly proprietary cloud data warehouse into an open data platform for analytics and AI.

Members of the open data ecosystem responded with applause. Among the biggest supporters is Dremio, which develops an open source query engine of the same name, is the main backer of the open metadata catalog Project Nessie, and manages an Iceberg-based lakehouse for customers.

“I think it’s a statement that, in table formats, Iceberg won. I think it’s the realization of that,” said James Rowland-Jones (JRJ), Dremio’s vice president of product management. “It’s also the realization that table format bifurcation, when you are not winning, is not helpful to your business.”

Databricks’ table format, Delta Lake, was the most widely used format when Dremio surveyed customers on their lakehouse technologies in late 2023. But while Delta led in total deployments, Iceberg led in deployments planned over the next three years, said Read Maloney, Dremio’s chief marketing officer.

“Who’s driving these changes? It’s customers. Customers are sick of being locked-in, and the only way to do that is to ensure that you’re not only in an open table format, but then you have an open catalog,” Maloney told Datanami in an interview at Snowflake’s Data Cloud Summit in San Francisco last week.

“So now customers own their own storage, they own their own data, they own their own metadata, and then all the vendors in the ecosystem build around that. And the customer now has the ability to say ‘I want that vendor for this, I want that vendor for this,’ and they all work within the common ecosystem,” he says. “The more there’s commonality in the specification around the catalogs, it makes it way easier for everyone to get involved in the ecosystem.”

“We’re listening to customers,” Ron Ortluff, the head of data lake and Iceberg at Snowflake, told Datanami in an interview last week. “That’s kind of the guiding principle.”

The pending launch of Polaris, which Snowflake plans to donate to the open source community within 90 days, means that Snowflake customers soon will be able to query their Iceberg data using any query engine that supports Iceberg’s REST-based API. That list includes Apache Spark, Apache Flink, Presto, Trino, and (soon) Dremio. And of course, they will also be able to query Iceberg data using Snowflake’s fast proprietary SQL engine.
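
For readers wondering what it means for an engine to “support Iceberg’s REST-based API,” the sketch below shows roughly how Apache Spark can be pointed at an Iceberg REST catalog. It is a generic illustration of the Iceberg REST catalog interface rather than Polaris-specific setup; the catalog name, endpoint URL, table name, and library version are placeholders, and authentication settings are omitted.

```python
# A minimal sketch of attaching Apache Spark to an Iceberg REST catalog.
# The endpoint, catalog name, and table are hypothetical; credentials are omitted.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-rest-example")
    # Pull in the Iceberg Spark runtime (version shown is illustrative).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "lakehouse" backed by a REST catalog service.
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://catalog.example.com/api/catalog")
    .getOrCreate()
)

# Any engine that speaks the Iceberg REST spec can read the same tables.
spark.sql("SELECT * FROM lakehouse.analytics.orders LIMIT 10").show()
```

Because the catalog, not the engine, holds the table metadata, swapping Spark for Flink, Trino, or another REST-capable engine becomes a configuration change rather than a data migration.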


The momentum behind open data is a sign of the continued decoupling of compute stacks, said Siva Padisetty, CTO of New Relic, which develops an observability platform.

“After storage and compute became decoupled, all of the layers from storage through analytics began to be similarly unbundled, a process currently taking place with tables,” Padisetty said via email. “Overall, the focus here remains on data stack optimization and how organizations assemble the appropriate storage, table format, and compute engines to process their data use cases in the fastest possible manner.”

The key, Padisetty says, “is maintaining vendor unlock, speed, and agility across compute and storage while solving business use cases in the most cost-effective manner with the gravity of data without multiple copies.”

The value of a centralized data platform that can handle huge data volumes while maintaining performance and security across multiple use cases, such as IT telemetry, data lakes, and SQL analytics, is paramount, he said.

“Enterprises get the value add of open-source technology while maintaining centralized data,” Padisetty continued. “The centralization of the use cases is going to happen, and companies should be positioning themselves to address that.”

The folks at Starburst, the commercial outfit behind the open source Trino query engine, are also watching the Iceberg developments closely. Iceberg was originally developed in part to let Netflix use Presto, the engine from which Trino was forked, so Iceberg’s growth is a clear positive for the company.

“The benefit to the market and customers is that this competition actually creates openness,” said Justin Borgman, the CEO and chairman of Starburst, which also offers an Iceberg-based lakehouse service. “Starburst is one such beneficiary and can now be considered a strong third option in the Databricks vs. Snowflake debate.”

Borgman is closely watching what comes next, particularly around the metadata catalog. Just as the battle over open table formats ended up being a new source of data silo-ization (ironic, since the formats were created to foster open data), metadata catalogs are a potential source of lock-in, as they broker the connections between processing engines and the data.

“With Tabular, Databricks’s Unity catalog has the potential to capture a lot more market share, including organizations using either Delta Lake or Iceberg,” Borgman told Datanami via email. “Snowflake’s open-sourcing of Polaris is a way to compete against Databricks by highlighting that while the market is rapidly moving to open storage formats like Iceberg, catalogs like Unity are a new source of lock-in. One could speculate that this will pressure Databricks to eventually open source Unity, but it is too early to know for sure.”

Taken as a whole, however, the news of the past week is very good for customers and supporters of open data. Momentum for open data platforms is building, and it couldn’t come at a better time.

“The Iceberg ecosystem has been growing quickly. I think it’s going to grow even faster on the back of both of these announcements,” Maloney said. “If you’re in the Iceberg community, this is go time in terms of entering the next era.”

Related Items:

What the Big Fuss Over Table Formats and Metadata Catalogs Is All About

Databricks Nabs Iceberg-Maker Tabular to Spawn Table Uniformity

Snowflake Embraces Open Data with Polaris Catalog

