Apache Pinot Uncorks Real-Time Data for Ad-Tech Firm

A year ago, customers of Sovrn’s affiliate marketing business could expect to wait 24 to 48 hours before getting access to new data describing online consumer behavior. That didn’t cut it, so the company looked to Apache Pinot, a real-time OLAP database, to speed things up. Since Sovrn adopted StarTree’s hosted Pinot setup last year, that data loading delay has been eliminated and customers have near real-time access to the data they need.

Sovrn is a Colorado-based online advertising company reaching nearly 500 million people across 3.5 billion pageviews per day. It’s best known as an ad-tech business, but in addition to running online ad auctions, the 10-year-old company also manages an affiliate marketing business and sells other products to help publishers understand their audiences.

The company relies on a variety of technologies to deliver data in a timely fashion for its various businesses, including its affiliate marketing business, which rewards publishers and bloggers when customers click a partner’s link and complete an ecommerce transaction. Databases like Snowflake, Apache Cassandra, AWS’s Redshift and RDS play a role, and so do Apache Flink and Apache Kafka systems, Databricks’ lakehouse, and others.

Getting timely data is important for all of Sovrn’s customers, but it’s especially important for the affiliate marketing business. As news events and trends come and go (say, the Instant Pot craze, the release of an iPhone, or the death of a queen), the window of opportunity for publishers to move their chips to where the links are going can be a few days or even just hours.

Ryan Chichirico, VP of Engineering for Sovrn (Image courtesy StarTree)

In the first half of 2022, Sovrn was using an Amazon Redshift database from AWS to power customer queries for the affiliate marketing business. Users would log into Sovrn and begin interacting with the Redshift-powered dashboard to see what content was generating the most clicks.

While the database worked fine once the data was loaded, Sovrn struggled to get the large amounts of real-time data off the Apache Kafka data bus and loaded into Redshift in a timely fashion.

“Our data pipelines that work within Kafka and Redshift had 24- to 48-hour delays in processing,” says Ryan Chichirico, Sovrn’s vice president of engineering. “And if anything blipped within that process, mainly because of how the data pipeline worked, we would have longer delays than 48 hours.”

Dying on the Vine

The biggest challenge was the volume of real-time data Sovrn was trying to cram into Amazon Redshift. With hundreds of millions of pageview events, millions of click events, and hundreds of thousands of revenue events to load per day, it was just too much data for Redshift to handle quickly.

Merging the real-time data with historical records was another challenge for Sovrn. Organizations often build separate systems, including a traditional data warehouse like Redshift for the historical data and a second system built for real-time data. Getting the two systems synched up usually requires various Rube Goldberg workarounds and contraptions, and it’s never pretty.

Sovrn considered other databases to address its real-time challenge. The company had experience with Apache Cassandra and ScyllaDB, a C++ clone of Cassandra. It also considered DynamoDB, the fast NoSQL database from AWS.

It also looked at Pinot, a column-oriented database developed at LinkedIn, as a possible solution. Pinot, which was created alongside Kafka at LinkedIn a decade ago, was designed with a fast index that logs data as soon as it’s ingested from Kafka, thereby enabling users to query data much more quickly than with other approaches. Pinot also provides the capability to access some historical data (though not as much as a full data warehouse).

“Pinot was something that was coming to the market and there was a lot of interest, piquing our interest around it,” Chichirico says. “So as we started to explore the capabilities of it with the nearline and offline tables and the ability to batch upload hundreds of thousands of records versus sequential loading, like Cassandra forced you to do, gave us the confidence during the bake off that it would perform.”

Sovrn ran the bakeoff in mid-2022, putting Pinot up against Snowflake and Cassandra/ScyllaDB at the end of the Kafka firehose. “The latency we were getting out of Pinot was blazing compared to these other products,” Chichirico says. “Sub-second latency.”

Sovrn was sold on Pinot’s capability and started its implementation near the end of the year. The company elected to go with StarTree’s hosted implementation of Pinot on AWS instead of running it itself. StarTree was founded by Pinot’s creators, who are still very close to the open source project, which impressed Chichirico.

Pinot in the Glass

After Pinot was open sourced in 2015, two of Pinot’s creators, Kishore Gopalakrishna and Xiang Fu, co-founded StarTree in 2018 to bring the product to market as a service. The company came out of stealth in 2021.

Sovrn’s Pinot-powered dashboard gives customers real-time insights (Image courtesy StarTree)

It’s been smooth sailing for Sovrn since implementing Pinot in late 2022. StarTree has conducted live upgrades, minimizing downtime. The two companies share a Slack channel that lets Sovrn’s engineers reach the vendor when they have issues, Chichirico says.

“They’ve been phenomenal,” he tells Datanami. “Anything we escalate to them, they have a support team working on it. I don’t think we’ve run into any real snags with any features we need, but they’re certainly open to giving us access to beta features and asking us to try things out and taking our feedback pretty seriously.”

Databricks still has a role to play in Sovrn’s affiliate marketing business. Sovrn keeps only 30 days’ worth of data in Pinot, but relies on Spark batch jobs running in Databricks’ cloud to build historical data tables. But when they’re querying Sovrn’s system, customers don’t need to know where the underlying data actually resides.

“We’ve revamped our data pipelines so now we can stream that information in from Kafka as it’s happening, but we can also process it behind the scenes from Databricks to backfill information in, so you get that offline and real-time view of how your products have been performing,” the Sovrn VP says. “You don’t need to worry about whether it’s a real-time click or an offline click… You just make a query out to the click table for the information you’re seeking.”
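The idea of one logical click table backed by both an offline and a real-time store can be illustrated with a small sketch. It is a simplified Python model, not Pinot’s code or Sovrn’s schema: the `Click` record, the `ts_hour` column, and the `query_clicks` function are hypothetical names, and the time boundary is simply assumed to be the latest timestamp in the backfilled data (Pinot computes its own boundary differently in detail). Rows at or before the boundary are read from the offline data and rows after it from the real-time data, so events present in both are counted once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    page: str      # hypothetical dimension: the page that produced the click
    ts_hour: int   # hypothetical time column, in hours since an arbitrary epoch


def query_clicks(offline, realtime, boundary):
    """Count clicks per page across both stores without double counting.

    Rows with ts_hour <= boundary are taken from the offline (backfilled)
    data; rows with ts_hour > boundary are taken from the real-time data.
    """
    counts = {}
    for row in offline:
        if row.ts_hour <= boundary:
            counts[row.page] = counts.get(row.page, 0) + 1
    for row in realtime:
        if row.ts_hour > boundary:
            counts[row.page] = counts.get(row.page, 0) + 1
    return counts


# Backfilled history covers hours 1-2; the stream covers hours 2-4,
# so the hour-2 click on page "b" exists in both stores.
offline_rows = [Click("a", 1), Click("b", 2)]
realtime_rows = [Click("b", 2), Click("a", 3), Click("a", 4)]

# Assumption: use the newest offline timestamp as the boundary.
boundary = max(row.ts_hour for row in offline_rows)
result = query_clicks(offline_rows, realtime_rows, boundary)
```

In this toy run the overlapping hour-2 click is served only from the offline side, which is why the caller never has to know which store a given click came from.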

But the big news is that Sovrn’s affiliate marketing customers now have access to much fresher data than before. The 24- to 48-hour lag between when consumers do something on the Internet and when their activity is logged into Pinot has all but been eliminated. Once the data is loaded into the Pinot database, the average query is completed in about 2.5 seconds, versus 6.2 seconds using Redshift.

That empowers customers to make decisions, Chichirico says. “Like, Wow, this page is monetizing really well. We can now distribute that page to our social channels. We can focus on this content more on our YouTube channels. We can make TikTok shorts about it,” he says. “Whatever it’s going to take to get more traffic driven to this page for this click and link generation to go buy that product.”

The post Apache Pinot Uncorks Real-Time Data for Ad-Tech Firm appeared first on Datanami.