Big data lakehouses are spreading, thanks to their ability to combine the data stability and correctness of a traditional warehouse with the flexibility and scalability of a data lake. One of the technologists key to the success of the data lakehouse is Vinoth Chandar, the creator of the Apache Hudi open table format and a 2024 BigDATAwire Person to Watch.
Chandar led the development of Apache Hudi while at Uber to address high-speed data ingest issues with the company’s Hadoop cluster. While it bears similarities to other open table formats, like Apache Iceberg and Delta Lake, Hudi also retains unique data streaming capabilities.
As the CEO of Onehouse, Chandar oversees the development of a cloud-based lakehouse offering, as well as the development of XTable, which provides interoperability among Hudi and other open table formats. BigDATAwire recently caught up with Chandar to discuss his contributions to big data, distributed systems development, and Onehouse.
BigDATAwire: You’ve been involved in the development of distributed systems at Oracle, LinkedIn, Uber, Confluent, and now Onehouse. In your opinion, are distributed systems getting easier to develop and run?
Vinoth Chandar: Building any distributed system is always challenging. From the early days at LinkedIn building the more basic blocks like key-value storage, pub-sub systems or even just shard management, we have come a long way. A lot of those CAP theorem debates have subsided, and the cloud storage/compute infrastructure of today abstracts away many of the complexities of consistency, durability, and scalability that developers previously managed manually or wrote specialized code to handle. A good chunk of this simplification is because of the rise of cloud storage systems such as Amazon S3 that have brought the “shared storage” model to the forefront. With shared storage being such an abundant and inexpensive resource, the complexities around distributed data systems have come down a fair bit. For example, Apache Hudi provides a full suite of database functionality on top of cloud storage, and is far easier to implement and manage than the shared-nothing distributed key-value store my team built at LinkedIn back in the day.
Further, the use of theorems like PACELC to understand how distributed systems behave shows how much focus is now placed on performance at scale, given the exponential growth in compute services and data volumes. While conventional wisdom says performance is just one factor, it can be a pretty costly mistake to pick the wrong tool for your data scale. At Onehouse, we are spending a vast amount of time helping customers who have such ballooning cloud data warehouse costs or have chosen a slow data lake storage format for their more modern workloads.
BDW: Tell us about your startup, Onehouse. What does the company do better than any other company? Why should a data lake owner look into using Onehouse?
Chandar: The problem we’re trying to solve for our customers is to eliminate the cost, complexity, and lock-in imposed by today’s leading data platforms. For example, a user may choose Snowflake or BigQuery as the best-of-breed solution for their BI and reporting use case. Unfortunately, their data is locked into Snowflake and they can’t reuse it to support other use cases such as machine learning, data science, generative AI, or real-time analytics. So they then have to deploy a second platform such as a plain old data lake, and these additional platforms come with high costs and complexity. We believe the industry needs a better approach: a fast, cost-efficient, and truly open data platform that can manage all of an organization’s data centrally, supporting all of their use cases and query engines from one platform. That’s what we’re setting out to build.
If you look at the team here at Onehouse, one thing that immediately stands out is that we have been behind some of the biggest innovations in data lakes and now data lakehouses from day one. As far as what we are building at Onehouse, it is really unique in that it provides all of the openness one should be able to expect from a data lakehouse in terms of the types of data you can ingest, but also what engines you can integrate with downstream, so you can always apply the right tool for your given use case. We like to call this model the “Universal Data Lakehouse.”
Because we’ve been at this for a while, we’ve been able to develop a lot of best practices around pretty technical challenges such as indexing, automatic compaction, intelligent clustering and so on, that are all critical for data ingestion and pipelines at large. By automating those with our fully managed service, we are seeing customers cut cloud data infrastructure costs by 50% or more and accelerate ETL and ingestion pipelines and query performance by 10x to 100x, while freeing up data engineers to deliver on projects with more business-facing impact. The technology we are built on is powering data lakehouses growing at petabytes per day, so we are doing all of this at massive scale.
BDW: How do you view the current battle for table formats? Does there need to be one standard, or do you think Apache Hudi, Apache Iceberg, or Delta Lake will eventually win out?
Chandar: I think the current debate on table formats is misplaced. My personal view is that all three leading formats – Hudi, Iceberg, and Delta Lake – are here to stay. They all have their particular areas of strengths. For example, Hudi has clear advantages for streaming use cases and large-scale incremental processing, hence why organizations like Walmart and Uber are using it at scale. We may in fact see the rise of additional formats over time, as you can marry different data file organizations and table metadata and index structures to create probably half a dozen more table formats specialized to different workloads.
In fact, “table metadata format” is probably a clearer articulation of what we are referring to, as the actual data is just stored in columnar file formats like Parquet or ORC, across all three projects. The value users derive by switching from older data lakes to the data lakehouse model comes not from mere format standardization, but from solving some hard database problems like indexing, concurrency control, and change capture on top of a table format. So, if you believe the world will have multiple databases, then you also have good reason to believe there cannot and won’t be a standard table format.
So I believe that the right debate to be having is how to provide interoperability between all of the formats from a single copy of data. How can I avoid having to duplicate my data across formats, for example once in Iceberg for Snowflake support and once in Delta Lake for Databricks integration? Instead, we need to solve the problem of storing and managing the data just once, then enabling access to the data through the best format for the job at hand.
That’s exactly the problem we were solving with the XTable project we announced in early 2023. XTable, formerly Onetable, provides omnidirectional interoperability between these metadata formats, eliminating any engine-specific lock-in imposed by the choice of table format. XTable was open sourced late last year, and has seen tremendous community support, including from the likes of Microsoft Azure and Google Cloud. It has since transformed into Apache XTable, which is currently incubating at the Apache Software Foundation with more industry participation as well.
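The idea Chandar describes, one copy of the data files with several metadata formats pointing at it, can be illustrated with a toy sketch. This is not XTable's API or how any of the three formats actually store metadata; the `DataFile`, `TableMetadata`, and `translate` names and the file paths are hypothetical, chosen only to show that translating metadata does not require copying data.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataFile:
    """A columnar data file (hypothetical path, e.g. a Parquet file)."""
    path: str
    record_count: int


@dataclass(frozen=True)
class TableMetadata:
    """Toy stand-in for a table metadata format: a name plus file references."""
    format_name: str
    files: Tuple[DataFile, ...]


def translate(source: TableMetadata, target_format: str) -> TableMetadata:
    """Build metadata for another format that references the same data files.

    Only the metadata is rewritten; the data files are neither copied nor moved.
    """
    return TableMetadata(format_name=target_format, files=source.files)


# One physical copy of the data (paths are made up for illustration).
data_files = (
    DataFile("s3://example-bucket/trips/part-0001.parquet", 1000),
    DataFile("s3://example-bucket/trips/part-0002.parquet", 2500),
)

hudi_view = TableMetadata("hudi", data_files)
iceberg_view = translate(hudi_view, "iceberg")
delta_view = translate(hudi_view, "delta")

# Every view resolves to the same set of underlying files.
views = (hudi_view, iceberg_view, delta_view)
shared_paths = {f.path for view in views for f in view.files}
total_records = sum(f.record_count for f in hudi_view.files)
```

In this sketch, adding a third or fourth format costs only another small metadata object, which is the contrast with writing the table once per format that the answer above argues against.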
BDW: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?
Chandar: I really love to travel and take long road trips, with my wife and children. With Onehouse taking off, I haven’t had as much time for this recently. I’d really like to visit Europe and Australia someday. My weekend hobby is caring for my large freshwater aquarium at home with some pretty cool fish.
You can read more about the 2024 BigDATAwire People to Watch here.
The post Meet Vinoth Chandar, a 2024 Person to Watch appeared first on BigDATAwire.