We live in a world of big data and big compute. But what about big query engines? One of the startups developing software to keep up with big data and big compute is Voltron Data, which is headed by Josh Patterson.
Patterson co-founded Voltron Data in 2021 with pandas creator Wes McKinney (a 2018 Person to Watch) to develop next-generation data processing technology for the Python data ecosystem. About a year ago, the company released Theseus, which it claims runs many times faster than Spark while costing many times less.
We recently caught up with Patterson, who is the CEO of Voltron Data and also one of our 2024 BigDATAwire People to Watch, to talk about his work at Voltron Data and the Python data ecosystem.
BigDATAwire: Voltron Data states that its Theseus product is for “petabyte-scale ETL.” Why have we not been able to move beyond ETL after all these years?
Josh Patterson: A single system can’t handle all tasks today, especially as analytics and ML become more complex; there are specialized systems optimized for specific workloads. We see this in the rise of GPUs for AI. Given this continual evolution and complexity, ETL evolves into a crucial service for managing these divergent systems, and it’s now the bottleneck.
When AI/ML training adopted hardware accelerators like GPUs, it improved AI system performance by 100,000x. However, data preprocessing is still on CPUs, and performance has only grown 10x in the last decade. Organizations at the forefront of AI are constrained by data processing because they cannot afford to build out big data CPU clusters fast enough. The performance divergence between GPUs and CPUs is getting exponentially worse. Only Theseus, Voltron Data’s accelerator-native data analytics engine, is achieving a 60x performance increase with 50x cost savings leveraging the same accelerators used in AI. Until we find one singular way to draw intelligence from data, we’ll always have ETL, which will continually need to get faster and more efficient.
BDW: How did your experience working on RAPIDS at Nvidia help prepare you for Voltron Data?
JP: My time at NVIDIA, where I launched RAPIDS (an open source suite of data processing and ML libraries designed to enable data science workflows on GPUs), was like working at a massive startup. It moved faster than most enterprises, focused on cutting-edge technology, pioneered new use cases and tapped into previously non-existent industries. We were relentlessly innovating.
With RAPIDS, we constantly thought of ways to accelerate adoption and maturity. Leveraging the open standards ecosystem, such as Apache Arrow, allowed us to accelerate our development and truly focus on innovation instead of redoing things that already existed – a philosophy that continues at Voltron Data today.
BDW: What role do you see Voltron Data filling in the Python data ecosystem in the years to come?
JP: With projects like Ibis, PyArrow, and ADBC, we expect the open standards we build, promote, and maintain will underpin the Python data ecosystem. In addition, standards like Arrow and Substrait exist to support a multitude of languages beyond the Python ecosystem.
Bridging these language divides so enterprises can scale out and integrate their myriad of data ecosystems is central to Voltron Data’s mission to bring a new way to design and build data systems.
BDW: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?
JP: Most people don’t know that I come from a long line of builders. Early in my career, I was a licensed general contractor and still enjoy building things around the house or with my family.
To read the rest of the 2024 People to Watch interviews, click here.
The post Meet Josh Patterson, a 2024 BigDATAwire Person to Watch appeared first on BigDATAwire.