Apache Hudi Is Not What You Think It Is

Vinoth Chandar, the creator of Apache Hudi, never set out to develop a table format, let alone be thrust into a three-way war with Apache Iceberg and Delta Lake for table format supremacy. So when Databricks recently pledged to essentially merge the Iceberg and Delta specs, it didn’t hurt Hudi’s prospects at all, Chandar says. It turns out we’ve all been thinking about Hudi the wrong way the whole time.

“We never were in that table format war, if you will. That’s not how we think about it,” Chandar tells Datanami in an interview ahead of today’s news that his Apache Hudi startup, Onehouse, has raised $35 million in a Series B round. “We have a specialized table format, if you will, but that’s one component of our platform.”

Hudi went into production at Uber Technologies eight years ago to solve a pesky data engineering problem with its Hadoop infrastructure. The ride-sharing company had developed real-time data pipelines for fast-moving data, but they were expensive to run. It also had batch data pipelines, which were reliable but slow. The primary goal of Hudi, which Chandar had started developing years earlier, was to build a framework that combined the benefits of both, giving Uber fast data pipelines that were also affordable.

“We always talked about Hudi as an incremental data processing framework or a lakehouse platform,” Chandar says. “It started as an incremental data processing framework and evolved due to the community into this open lakehouse platform.”

Hadoop Upserts, Deletes, Incrementals

Uber wanted to use Hadoop more like a traditional database, as opposed to a bunch of append-only files sitting in HDFS. In addition to a table format, it needed support for upserts and deletes. It needed support for incremental processing on batch workloads. All of those features came together in 2016 with the very first release of Hudi, which stands for Hadoop Upserts, Deletes, and Incrementals.
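For readers who have never touched Hudi, a minimal sketch of what that looks like in practice may help. The example below uses PySpark with the Hudi Spark bundle assumed to be on the classpath; the table name, path, and columns are hypothetical, and the options shown are the standard Hudi datasource configs for keyed upserts.

```python
# Minimal sketch of a Hudi upsert via PySpark (assumes the Hudi Spark bundle
# is available; table name, path, and columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

hudi_options = {
    "hoodie.table.name": "rides",                           # hypothetical table
    "hoodie.datasource.write.recordkey.field": "ride_id",   # key used to match records for upserts
    "hoodie.datasource.write.precombine.field": "ts",       # latest record wins when keys collide
    "hoodie.datasource.write.partitionpath.field": "city",  # partition column
    "hoodie.datasource.write.operation": "upsert",          # also: insert, bulk_insert, delete
}

updates = spark.createDataFrame(
    [("r1", "2016-01-01 00:00:05", "sf", 11.50)],
    ["ride_id", "ts", "city", "fare"],
)

# Upsert: rows with an existing record key are updated in place, new keys are
# inserted -- no hand-rolled merge job over raw append-only Parquet files.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/rides"))
```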

“The features that we built, we needed on the first rollout,” Chandar says. “We needed to build upserts, we needed to build indexes [on the write path], we needed to build incremental streams, we needed to build table management, all in our 0.3 version.”
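The “incremental streams” piece is the part that most distinguishes Hudi from a plain table format. As a rough illustration, continuing the hypothetical table above, a downstream job can ask Hudi for only the records committed after a given instant rather than rescanning the whole table:

```python
# Sketch of an incremental pull against the hypothetical table above:
# read only records committed after a given instant, not the full table.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    # Commit instant to start from, typically checkpointed by the previous run.
    "hoodie.datasource.read.begin.instanttime": "20160101000000",
}

changes = (spark.read.format("hudi")   # same SparkSession as the sketch above
    .options(**incremental_options)
    .load("/tmp/hudi/rides"))

changes.show()  # only rows written after that commit
```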

Over time, Hudi evolved into what we now call a lakehouse platform. But even with that 0.3 release, many of the core table management tasks that we associate with lakehouse platform providers, such as partitioning, compaction, and cleanup, were already built into Hudi.

Despite the broad set of capabilities Hudi offered, the big data market saw it as one thing: an open table format. And when Databricks launched Delta Lake back in 2017, a year after Hudi went into production, and Apache Iceberg came out of Netflix, also in 2017, the market saw those projects as natural competitors to Hudi.

But Chandar never really bought into it.

“This table format war was invented by people who I think felt that was their edge,” Chandar says. “Even today, if you look at Hudi users…they frame it as Hudi is better for streaming ingest. That’s a little bit of a loaded statement, because sometimes it kind of overlaps with the Kafka world. But what that really means is Hudi, from day one, has always been focused on incremental data workloads.”

A Future Shared with ‘Deltaberg’

The big data community was rocked by a pair of announcements earlier this month at the annual user conferences for Snowflake and Databricks, which took place in back-to-back weeks in San Francisco.

Vinoth Chandar, creator of Apache Hudi and the CEO and founder of Onehouse

First, Snowflake announced Polaris, a metadata catalog that would use Apache Iceberg’s REST API. In addition to enabling Snowflake customers to use their choice of data processing engine on data residing in Iceberg tables, Snowflake also committed to giving Polaris to the open source community, likely the Apache Software Foundation. This move not only solidified Snowflake’s bona fides as a backer of open data and open compute, but the strong support for Iceberg also potentially boxed in Databricks, which was committed to Delta and its associated metadata catalog, Unity Catalog.

But Databricks, sensing the market momentum behind Iceberg, reacted by acquiring Tabular, the commercial outfit founded by the creators of Iceberg, Ryan Blue and Dan Weeks. At its conference following the Tabular acquisition, which cost Databricks between $1 billion and $2 billion, Databricks pledged to support interoperability between Iceberg and Delta Lake, and to eventually merge the two specifications into a unified format (Deltaberg?), thereby eliminating any concern that companies today would pick the “wrong” horse for storing their big data.

As Snowflake and Databricks slugged it out in a battle of words, dollars, and pledges of openness, Chandar never wavered in his belief that the future of Hudi was strong, and getting stronger. While some were quick to write off Hudi as the third-place finisher, that’s far from the case, according to Chandar, who says the newfound commitment to interoperability and openness in the industry actually benefits Hudi and Hudi users.

“This general trend towards interoperability and compatibility helps everyone,” he says.

Open Lakehouse Lifts All Boats

The open table formats are essentially metadata that provide a log of changes to data stored in Parquet or ORC files, with Parquet being, by far, the most popular option. There is a clear benefit to enabling all open engines to read that Parquet data, Chandar says. But the story is a little more nuanced on the write side of that I/O ledger.

“On the other side, for example, when you manage and write your data, you should be able to do differentiated kind of things based on the workload,” Chandar says. “There, the choice really matters.”

Writing huge amounts of data in a reliable manner is what Hudi was originally designed to do at Uber. Hudi has specific features, like indexes on the write path and support for concurrency control, to speed data ingestion while maintaining data integrity.
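To give a sense of what that write-side differentiation looks like, these are the kinds of writer settings Hudi exposes. The option names below come from Hudi’s configuration reference; the specific values are illustrative rather than recommendations.

```python
# Illustrative Hudi writer-side settings (values are examples, not recommendations).
write_tuning = {
    # Index used on the write path to locate existing records,
    # so an upsert doesn't have to scan every file in the table.
    "hoodie.index.type": "BLOOM",
    # Optimistic concurrency control lets multiple writers target the same
    # table; a lock provider arbitrates conflicting commits.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
}
```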

“If you want near real-time continuous data ingestion or ETL pipelines to populate your data lakehouse, we need to be able to do table management without blocking the writers,” he says. “You really cannot imagine, for example, TikTok, who’s ingesting some 15 gigabytes per second, or Uber stopping their data pipelines to do management and bringing it online.”
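Concretely, on a merge-on-read table the expensive maintenance work can be deferred and run asynchronously so that ingestion keeps flowing. A rough sketch of the relevant options (names from Hudi’s config reference, values illustrative):

```python
# Sketch: let table services run in the background instead of blocking writers.
table_service_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # log new data now, compact later
    "hoodie.compact.inline": "false",   # don't pause the writer to compact
    "hoodie.clean.async": "true",       # reclaim old file versions asynchronously
}
```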

Onehouse has backed OneTable (now Apache XTable), an open source project that provides read and write compatibility among Hudi, Iceberg, and Delta. And while Databricks’ UniForm project essentially duplicates the work of XTable, the folks at Onehouse have worked with Databricks to ensure that Hudi is fully supported with UniForm, as well as Unity Catalog, which Databricks CTO and Apache Spark creator Matei Zaharia open sourced live on stage two weeks ago.

“Hudi is not going anywhere,” Chandar says. “We’re beyond the point where there’s one standard. These things are really fun to talk about, to say ‘He won, he lost,’ and all of that. But end of the day, there are massive amounts of pipelines pumping data into all three formats today.”

Clearly, the folks at Craft Ventures, who led today’s $35 million Series B, think there’s a future in Hudi and Onehouse. “One day, every organization will be able to take advantage of truly open data platforms, and Onehouse is at the center of this transformation,” said Michael Robinson, partner at Craft Ventures.

“We can’t and we won’t turn our backs on our community,” Chandar continues. “Even with the marketing headwinds around this, we will do our best to continue educating the market and making these things easier.”

Related Items:

Databricks Nabs Iceberg-Maker Tabular to Spawn Table Uniformity

What the Big Fuss Over Table Formats and Metadata Catalogs Is All About

Onehouse Breaks Data Catalog Lock-In with More Openness

 

The post Apache Hudi Is Not What You Think It Is appeared first on Datanami.
