The Spark-to-Ray Migration That Will Save Amazon $100M+ Per Year

When Ray first emerged from the UC Berkeley RISELab back in 2017, it was positioned as a possible replacement for Apache Spark. But as Anyscale, the commercial outfit behind Ray, scaled up its own operations, the “Ray will replace Spark” mantra was played down a bit. However, that’s exactly what the folks at a little online bookstore known as Amazon have done with one particularly onerous–and enormous–Spark workload.

Amazon–the parent of Amazon Web Services–recently published a blog post that shares intricate details about how it has started the migration of one of its largest business intelligence workloads from Apache Spark to Ray. The big problem stems from complexities the Amazon Business Data Technologies (BDT) team encountered as they scaled one Spark workload: compacting newly arrived data into the data lakehouse.

Amazon adopted Spark for this particular workload back in 2019, a year after it completed a three-year project to move away from Oracle. Amazon famously was a big consumer of the Oracle database for both transactional and analytical workloads, and just as famously was desperate to get off it. But as the scale of the data processing grew, other problems cropped up.

A Giant Lakehouse

Amazon completed its Oracle migration in 2018, having moved the bulk of its 7,500 Oracle database instances for OLTP workloads to Amazon Aurora, the AWS-hosted MySQL and Postgres service, while some of the retail giant’s most write-intensive workloads went to Amazon DynamoDB, the company’s serverless NoSQL database.

However, there remained what Amazon CTO Werner Vogels said was the largest Oracle data warehouse on the planet, at 50 PB. The retailer's BDT team replaced the Oracle data warehouse with what effectively was its own internal data lakehouse platform, according to the Amazon bloggers, including Amazon Principal Engineer Patrick Ames; Jules Damji, former lead developer advocate at Anyscale; and Zhe Zhang, head of open source engineering at Anyscale.

Amazon built a lakehouse to replace its Oracle data warehouse (Image courtesy Amazon)

The Amazon lakehouse was constructed primarily atop AWS-developed infrastructure, including Amazon S3, Amazon Redshift, Amazon RDS, and Apache Hive via Amazon EMR. The BDT team built a "table subscription service" that let any number of analysts and other data consumers subscribe to data catalog tables stored in S3, and then query the data using their choice of framework, including open source engines like Spark, Flink, and Hive, but also Amazon Athena (serverless Presto and Trino) and AWS Glue (think of it as an early version of Apache Iceberg, Apache Hudi, or Delta Lake).
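
To make the subscription idea concrete, here is a minimal, hypothetical sketch of what such a service's interface could look like: a catalog of S3-backed tables plus engine-agnostic consumers. The class, method, and field names are illustrative only; Amazon has not published its internal API.

```python
# Hypothetical sketch of a table subscription service; names and fields are
# illustrative, not Amazon's internal API.
from dataclasses import dataclass, field


@dataclass
class TableSubscriptionService:
    catalog: dict = field(default_factory=dict)        # table name -> S3 data file keys
    subscribers: list = field(default_factory=list)    # (table, consumer, engine) tuples

    def subscribe(self, table: str, consumer: str, engine: str) -> None:
        """Register a consumer that will scan the table's S3 files with its own engine."""
        self.subscribers.append((table, consumer, engine))

    def files_for(self, table: str) -> list:
        """Return the S3 keys the subscriber's chosen engine (Spark, Flink, Hive, ...) should read."""
        return self.catalog.get(table, [])
```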

But the BDT team soon faced another issue: unbounded streams of S3 files, including inserts, updates, and deletes, all of which needed to be compacted before they could be used for business-critical analysis–which is something that the table format providers have also been forced to deal with.

The Compaction Problem

“It was the responsibility of each subscriber’s chosen compute framework to dynamically apply, or ‘merge,’ all of these changes at read time to yield the correct current table state,” the Amazon bloggers write. “Unfortunately, these change-data-capture (CDC) logs of records to insert, update, and delete had grown too large to merge in their entirety at read time on their largest clusters.”
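
To see why that read-time merge becomes so expensive, consider a deliberately simplified sketch in Python using pandas: every subscriber has to replay the full CDC history against the base snapshot before it can trust a single row. The column names (pk, op, seq) and full-row change records are assumptions for illustration, not Amazon's schema.

```python
# Deliberately simplified read-time merge: replay the whole CDC log on top of
# the base snapshot. Column names (pk, op, seq) are illustrative only.
import pandas as pd


def merge_at_read(base: pd.DataFrame, cdc: pd.DataFrame) -> pd.DataFrame:
    """Return the current table state; the cost grows with the full CDC history."""
    cdc = cdc.sort_values("seq")
    latest = cdc.drop_duplicates("pk", keep="last")           # last change per key wins
    untouched = base[~base["pk"].isin(latest["pk"])]          # rows with no pending changes
    survivors = latest[latest["op"] != "delete"].drop(columns=["op", "seq"])
    return pd.concat([untouched, survivors], ignore_index=True)
```

When the CDC log holds millions of tiny files or a handful of enormous ones, as the BDT team describes, every subscriber pays that replay cost on every read, which is exactly what compaction is meant to avoid.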

The BDT team was also running into “unwieldy problems,” such as having millions of very small files to merge, or a few massive files. “New subscriptions to their largest tables would take days or weeks to complete a merge, or they would just fail,” they write.

The initial tool they turned to for compacting these unbounded streams of S3 files was Apache Spark running on Amazon EMR (which used to stand for Elastic MapReduce but doesn’t stand for anything anymore).

Amazon faced issues with unbounded streams of S3 files, which prompted the development of a Spark compactor (Image courtesy Amazon)

The Amazon engineers constructed a pipeline whereby Spark would run the merge once, “and then write back a read-optimized version of the table for other subscribers to use,” they write in the blog. This helped to minimize the number of records merged at read time, thereby helping to get a handle on the problem.
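
In PySpark terms, that "merge once, write back" pattern could look roughly like the following. The table paths, column names, and use of Parquet here are assumptions for illustration; the blog post does not publish the actual pipeline code.

```python
# Hedged sketch of the "merge once, write back" compaction pattern in PySpark.
# Paths, column names, and formats are assumptions, not Amazon's pipeline.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdc-compactor-sketch").getOrCreate()

base = spark.read.parquet("s3://lake/table/base/")       # last read-optimized snapshot
deltas = spark.read.parquet("s3://lake/table/cdc/")      # accumulated inserts, updates, deletes

# Keep only the latest change per primary key, then drop the deletes.
latest = (
    deltas.withColumn(
        "rn", F.row_number().over(Window.partitionBy("pk").orderBy(F.col("seq").desc()))
    )
    .filter("rn = 1")
    .drop("rn")
)

compacted = (
    base.join(latest.select("pk"), on="pk", how="left_anti")      # rows untouched by CDC
    .unionByName(latest.filter("op != 'delete'").drop("op", "seq"))
)

# Write back a read-optimized copy so subscribers no longer merge at read time.
compacted.write.mode("overwrite").parquet("s3://lake/table/compacted/")
```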

However, it wasn’t long before the Spark compactor started showing signs of stress, the engineers write. No longer a mere 50PB, the Amazon data lakehouse had grown beyond the exabyte barrier, or 1,000 PB, and the Spark compactor was “starting to show some signs of its age.”

The Spark-based system simply was no longer able to keep up with the sheer volume of the workload, and it started to miss SLAs. Engineers resorted to manually tuning the Spark jobs, which was difficult because of “Apache Spark successfully (and unfortunately in this case) abstracting away most of the low-level data processing details,” the Amazon engineers write.

After considering a plan to build their own custom compaction system outside of Spark, the BDT team turned to another technology they had just read about: Ray.

Enter the Ray

Ray emerged from the RISELab back in 2017 as a promising new distributed computing framework. Developed by UC Berkeley graduate students Robert Nishihara and Philipp Moritz and their advisors Ion Stoica and Michael Jordan, Ray provided a novel mechanism for running arbitrary computer programs in an n-tier manner. Big data analytics and machine learning workloads were certainly under Ray’s gaze, but thanks to Ray’s general-purpose flexibility, it wasn’t restricted to that.

Amazon engineers constructed their own Ray clusters to manage the compaction (Image courtesy Amazon)

“What we’re trying to do is to make it as easy to program the cloud, to program clusters, as it is to program on your laptop, so you can write your application on your laptop, and run it on any scale,” Nishihara, a 2020 Datanami Person to Watch, told us back in 2019. “You would have the same code running in the data center and you wouldn’t have to think so much about system infrastructure and distributed systems. That’s what Ray is trying to enable.”

The folks on Amazon’s BDT team were certainly intrigued by Ray’s potential for scaling machine learning applications, which present some of the biggest, gnarliest distributed computing problems on the planet. But they also saw that it could be useful for solving their compaction problem.

The Amazon bloggers listed off the positive Ray attributes:

“Ray’s intuitive API for tasks and actors, horizontally-scalable distributed object store, support for zero-copy intranode object sharing, efficient locality-aware scheduler, and autoscaling clusters offered to solve many of the key limitations they were facing with both Apache Spark and their in-house table management framework,” they write.
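
For readers who have not used Ray, a minimal, generic sketch of the primitives named in that quote (tasks, actors, and the shared object store) looks like this; it is standard Ray usage, not Amazon's compactor code.

```python
# Generic Ray primitives: remote tasks, a stateful actor, and results passed
# through the distributed object store. Not Amazon's compactor code.
import ray

ray.init()


@ray.remote
def merge_partition(files):
    """A task: runs anywhere on the cluster; its result lands in the object store."""
    return sorted(files)  # stand-in for real merge work


@ray.remote
class Coordinator:
    """An actor: a long-lived, stateful worker tracking compaction progress."""

    def __init__(self):
        self.completed = 0

    def record(self, result):
        self.completed += 1
        return self.completed


refs = [merge_partition.remote([f"file-{i}"]) for i in range(4)]   # tasks run in parallel
coordinator = Coordinator.remote()
for result in ray.get(refs):                                       # results fetched from the object store
    ray.get(coordinator.record.remote(result))
```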

Ray for the Win

Amazon adopted Ray for a proof of concept (POC) for their compaction issue in 2020, and they liked what they saw. They found that, with proper tuning, Ray could compact datasets 12 times larger than Spark could handle, improve cost efficiency by 91% compared to Spark, and process 13 times more data per hour, the Amazon bloggers write.

“There were many factors that contributed to these results,” they write, “including Ray’s ability to reduce task orchestration and garbage collection overhead, leverage zero-copy intranode object exchange during locality-aware shuffles, and better utilize cluster resources through fine-grained autoscaling. However, the most important factor was the flexibility of Ray’s programming model, which let them hand-craft a distributed application specifically optimized to run compaction as efficiently as possible.”
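
What "hand-crafting" a compactor on those primitives might look like, in heavily simplified form: one Ray task per bucket of primary keys, with Apache Arrow tables flowing through the object store. The bucketing scheme, schema, and last-write-wins rule below are assumptions for illustration only, not Amazon's implementation.

```python
# Heavily simplified, hand-rolled compaction on Ray: one task per key bucket,
# Arrow tables exchanged through the object store. Schema and bucketing are
# illustrative assumptions, not Amazon's implementation.
import pyarrow as pa
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
def compact_bucket(base: pa.Table, deltas: pa.Table) -> pa.Table:
    """Merge one bucket's CDC deltas into its base rows (last write wins)."""
    merged = {}
    for table in (base, deltas):            # deltas are applied after the base rows
        for row in table.to_pylist():
            merged[row["pk"]] = row
    live_rows = [r for r in merged.values() if not r.get("deleted", False)]
    return pa.Table.from_pylist(live_rows)


# The driver fans out one task per bucket and gathers the compacted pieces in parallel.
base_buckets = [pa.Table.from_pylist([{"pk": i, "value": "old", "deleted": False}]) for i in range(8)]
delta_buckets = [pa.Table.from_pylist([{"pk": i, "value": "new", "deleted": False}]) for i in range(8)]
compacted = ray.get([compact_bucket.remote(b, d) for b, d in zip(base_buckets, delta_buckets)])
```

Because each bucket is an independent task, the job can scale out across however many nodes the autoscaler provides, which is the flexibility the Amazon engineers are pointing to.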

Amazon continued its work with Ray in 2021. That year, the Amazon team presented their work at the Ray Summit and contributed their Ray compactor to the Ray DeltaCAT project, with the goal of supporting “other open catalogs like Apache Iceberg, Apache Hudi, and Delta Lake,” the bloggers write.

Amazon proceeded cautiously, and by 2022 had adopted Ray for a new service that analyzed the data quality of tables in the product data catalog. They chipped away at errors that Ray was generating and worked to integrate the Ray workload into EC2. By the end of the year, the migration from the Spark compactor to Ray began in earnest, the engineers write. In 2023, they had Ray shadowing the Spark compactor, and enabled administrators to switch back and forth between them as needed.

By 2024, the migration of Amazon’s full exabyte data lakehouse from the Spark compactor to the new Ray-based compactor was in full swing. Ray compacted “over 1.5EiB [exbibyte] of input Apache Parquet data from Amazon S3, which translates to merging and slicing up over 4EiB of corresponding in-memory Apache Arrow data,” they write. “Processing this volume of data required over 10,000 years of Amazon EC2 vCPU computing time on Ray clusters containing up to 26,846 vCPUs and 210TiB of RAM each.”

Amazon continues to use Ray to compact more than 20PiB per day of S3 data, across 1,600 Ray jobs per day, the Amazon bloggers write. “The average Ray compaction job now reads over 10TiB [tebibyte] of input Amazon S3 data, merges new table updates, and writes the result back to Amazon S3 in under 7 minutes including cluster setup and teardown.”

This means big savings for Amazon. The company estimates that if it were a standard AWS customer (as opposed to being the owner of all those data centers), it would save 220,000 years of EC2 vCPU computing time, corresponding to a savings of about $120 million per year.
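
A quick back-of-the-envelope check shows those two figures hang together: 220,000 vCPU-years priced at roughly $120 million implies an effective rate of about six cents per vCPU-hour, which is in the neighborhood of typical on-demand EC2 per-vCPU pricing.

```python
# Back-of-the-envelope check on the quoted savings figures.
vcpu_years_saved = 220_000
vcpu_hours_saved = vcpu_years_saved * 8_760      # 365 days * 24 hours, ~1.93 billion vCPU-hours
annual_savings_usd = 120_000_000

implied_rate = annual_savings_usd / vcpu_hours_saved
print(f"{vcpu_hours_saved:,} vCPU-hours -> ${implied_rate:.3f} per vCPU-hour")
# -> 1,927,200,000 vCPU-hours -> $0.062 per vCPU-hour
```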

It hasn’t all been unicorns and puppy dog tails, however. Ray’s accuracy at first-time compaction (99.15%) is not as high as Spark’s (99.91%), and sizing the Ray clusters hasn’t been easy, the bloggers write. But the future is bright for using Ray for this particular workload, as the BDT engineers are now looking to utilize more of Ray’s features to improve this workload and save the company even more of your hard-earned dollars.

“Amazon’s results with compaction specifically…indicate that Ray has the potential to be both a world-class data processing framework and a world-class framework for distributed ML,” the Amazon bloggers write. “And if you, like BDT, find that you have any critical data processing jobs that are onerously expensive and the source of significant operational pain, then you may want to seriously consider converting them over to purpose-built equivalents on Ray.”

Related Items:

Meet Ray, the Real-Time Machine-Learning Replacement for Spark

Why Every Python Developer Will Love Ray

AWS, Others Seen Moving Off Oracle Databases

 

The post The Spark-to-Ray Migration That Will Save Amazon $100M+ Per Year appeared first on Datanami.
