Databricks today rolled out a new open table format in Delta Lake 3.0 that it says will eliminate the risk of picking the wrong format. Dubbed Universal Format, or UniForm, the new table format can read and write data in all three popular open table formats: Delta Table, Apache Iceberg, and Apache Hudi.
Open table formats help customers by providing a standard and consistent way to access big data sets. Following the chaos of the Hadoop era and the overreliance on the Apache Hive metastore, the calm and dependable data that organizations experience under any one of the three open table formats has to be seen as a major improvement in big data management.
Whether it’s Databricks’ own Delta Table, the Apache Iceberg project that came out of Netflix and Apple, or the Apache Hudi project that emerged from Uber’s big data team, the table formats deliver similar capabilities. Above all, they give organizations the assurance that data won’t be corrupted and can be relied upon during transactions when multiple users and data processing engines access the same data–something that Hadoop users figured out the hard way (or made the problem of the downstream application programmer).
The positive impact of open table formats has been growing over the past few years. While Hudi was arguably first on the market, Iceberg has been building momentum over the past 18 months thanks to support from data platform vendors like Snowflake, AWS, and Cloudera. Databricks, which developed its own Delta Table format, responded to the growing demand for open table formats a year ago by contributing the remainder of the Delta Table format to open source at the 2022 Data + AI Summit.
But what may seem like a good old-fashioned battle for technological supremacy played out in the open market actually has a darker side, according to Databricks CEO and co-founder Ali Ghodsi.
“Right now, I have to pick. Which color do I pick? If I pick the wrong color, I might get fired,” Ghodsi said during a press conference at the 2023 Data + AI Summit in San Francisco.
Just as consumers were caught in the middle of the videocassette wars of the 1980s, which pitted JVC’s open VHS standard against Sony’s technologically superior but proprietary Betamax format, the current open table format wars that pit Delta Table against Iceberg and Hudi threaten the well-being of customers trying to make their way in the data lakehouse, Ghodsi said.
In other words, nobody wants to get stuck with the big data-equivalent of dozens of Beta tapes (even if they were technically superior).
“There’s all this talk about format wars, and it’s actually really unfortunate,” the 2019 Datanami Person to Watch continued. “We democratized data. We got it out of these data warehouses. We made it cheaper. But you have to pick which flavor you want. And once you pick your favorite flavor, if you pick blue or red or green, you’re stuck with that color forever. It’s unfortunate.”
Some vendors want this war to happen, according to Ghodsi. While he didn’t name names, he said the war helps competing vendors “because it’s in their interests that people don’t use these open-source formats.”
So Databricks decided to do something about it. Instead of requiring customers to use its Delta Lake format when storing data in its Delta Lake platform, to the detriment of Hudi and Iceberg, Databricks now lets customers adopt the universal format, or UniForm, and expose their data to processing engines as Delta Lake, Iceberg, or Hudi.
Ghodsi explains how UniForm works:
“Universal format means we are generating metadata for all three projects–Delta, Hudi, Iceberg– inside Delta,” he says. “Metadata is very cheap. The expensive part is all the big data, and that’s only stored one time in a format called Parquet.”
While the metadata accounts for a small portion of the total data payload–less than 1%, and users can turn it off if they want–it’s still very important, Ghodsi says.
“If you get the metadata wrong, you can’t actually access this stuff well,” he says. “So the metadata is different in each of them. But the metadata is actually quite small. And since all three projects are open source, we just went and understood exactly how to do it in each of them.
“And now inside Databricks, when we create data, we create the metadata for all three,” he continues. “So anyone who thinks they are talking to an Iceberg data set, the metadata for Iceberg is right there, and all the data is in Parquet, and it works.”
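Ghodsi's description can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Databricks' implementation: the function names, the dictionary layout, and the file names and sizes are all hypothetical, and real Delta, Iceberg, and Hudi metadata is far richer than a list of file names. The point it shows is that the data files are stored once, while each format gets its own small metadata record pointing at those same files.

```python
# Toy model only: names and structures here are hypothetical, not a real API.
FORMATS = ("delta", "iceberg", "hudi")


def write_table(file_sizes):
    """Store the data files once and build one metadata record per format.

    file_sizes maps a data file name to its size in bytes.
    """
    data_files = dict(file_sizes)  # the "big data", kept in a single copy
    metadata = {
        fmt: {"format": fmt, "files": sorted(data_files)}
        for fmt in FORMATS
    }
    return data_files, metadata


def files_for(metadata, fmt):
    """Return the data files a reader of the given format would see."""
    return metadata[fmt]["files"]


data_files, metadata = write_table(
    {"part-0000.parquet": 512_000_000, "part-0001.parquet": 498_000_000}
)
views = {fmt: files_for(metadata, fmt) for fmt in FORMATS}
```

In this sketch a reader that asks for the "iceberg" view and one that asks for the "delta" view both end up with the same two Parquet files, which is the property Ghodsi describes.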
Like Delta Table, the UniForm format is open source, which means other organizations and even vendors can adopt it too. Only time will tell whether UniForm is something Databricks’ competitors will adopt. In any event, Ghodsi is determined this will benefit Databricks customers.
“We unified and removed the format wars and we democratized data, so we’re very excited about that,” he says. “I think it’s going to matter for a lot of enterprises…Now you can just pick Delta, and it supports all the colors. You get any of the flavors you like.” (Sadly, your Betamax tapes are still useless.)
Delta Lake 3.0 features a pair of other enhancements, including Delta Kernel and Liquid Clustering.
Databricks says the new Delta Kernel will address “connector fragmentation” by ensuring that data connectors that bring data into Delta Lake are built against a standard specification that doesn’t change. That will help to reduce the need to continually adapt the connectors to address each new version or protocol change used in Delta.
“With one stable API to code against,” Databricks says, “developers in the Delta ecosystem are able to seamlessly keep their connectors up-to-date with the latest Delta innovation, without the burden of having to rework connectors.”
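The general idea of coding a connector against one stable interface can be sketched in a few lines of Python. This is not the Delta Kernel API; the `TableReader` interface, the two reader classes, and the log layouts below are invented for illustration. The sketch assumes only what Databricks states: the connector depends on a fixed interface, and changes in the underlying protocol are absorbed behind it.

```python
from typing import Iterable, Protocol


class TableReader(Protocol):
    """Hypothetical stable interface that a connector codes against."""

    def list_data_files(self) -> Iterable[str]:
        ...


class ReaderForLogV1:
    """Reads an invented 'version 1' log: a flat list of files."""

    def __init__(self, log: dict) -> None:
        self._log = log

    def list_data_files(self) -> Iterable[str]:
        return list(self._log["files"])


class ReaderForLogV2:
    """Reads an invented 'version 2' log: added files minus removed files."""

    def __init__(self, log: dict) -> None:
        self._log = log

    def list_data_files(self) -> Iterable[str]:
        removed = set(self._log["removed"])
        return [f for f in self._log["added"] if f not in removed]


def connector_scan(reader: TableReader) -> list:
    """Connector code: it only knows the interface, not the log version."""
    return sorted(reader.list_data_files())


old_log = {"files": ["b.parquet", "a.parquet"]}
new_log = {
    "added": ["a.parquet", "b.parquet", "c.parquet"],
    "removed": ["c.parquet"],
}
scan_v1 = connector_scan(ReaderForLogV1(old_log))
scan_v2 = connector_scan(ReaderForLogV2(new_log))
```

Here `connector_scan` is never rewritten when the log layout changes; only a new reader class is added behind the interface.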
Databricks says the new Liquid Clustering enhancement will help data architects ensure the highest performance of their growing big data systems. It does this by forgoing the traditional Hive-style partitioning that uses a fixed data layout in favor of a flexible data layout format. While Hive-style partitioning may improve read performance, it does so at the cost of greater complexity of data management.
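To see why a Hive-style layout counts as fixed, consider how partition directories are named. The short Python sketch below builds `column=value` directory paths, which is the usual Hive convention; the rows, column names, and helper function are made up for illustration, and the sketch says nothing about how Liquid Clustering itself is implemented.

```python
def hive_partition_path(row: dict, partition_cols: list) -> str:
    """Build a Hive-style directory path such as 'country=US/'."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols) + "/"


# Hypothetical sample rows.
rows = [
    {"country": "US", "date": "2023-06-28", "amount": 10},
    {"country": "US", "date": "2023-06-29", "amount": 7},
    {"country": "DE", "date": "2023-06-28", "amount": 3},
]

# The directory layout is fully determined by the chosen partition columns.
layout_by_country = sorted({hive_partition_path(r, ["country"]) for r in rows})
layout_by_date = sorted({hive_partition_path(r, ["date"]) for r in rows})
```

Because the partition columns are baked into every directory name, switching from `country` to `date` produces an entirely different set of paths, so existing files would have to be rewritten into the new directories. That is the management cost the fixed layout carries.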
Delta Lake 3.0 will be available in the second half of 2023, the company says.
The post Databricks Puts Unified Data Format on the Table with Delta Lake 3.0 appeared first on Datanami.