LakeFS Nabs $20M to Build ‘Git for Big Data’

Many organizations would like to get a better handle on their unstructured data in pursuit of AI initiatives. One promising startup pursuing that goal is lakeFS, which develops a version control system for big data, and which today announced it has raised $20 million to drive growth.

Just as Git provides version control to help developers manage application code, lakeFS brings version control to big data, including branching, merging, and committing data. It works with a variety of structured and unstructured data formats residing in S3-compatible object storage and file systems, and is being targeted at AI teams who are struggling to manage unstructured data for AI and machine learning projects.

“Data constantly changes, and you need to be able to look at the history of the data,” said lakeFS CEO and Co-founder Einat Orr. “LakeFS provides a manageability layer that is critical for enterprises to succeed with AI and ML initiatives.”

Before lakeFS, Orr was the CTO at an Israeli startup called SimilarWeb, a digital data and Web analytics firm that is now publicly traded. Orr was in charge of managing the R&D team that developed SimilarWeb’s data analytics application. The company used all the latest DevOps tools and techniques, just like many other tech firms.

“You worked with agile, with Git. You used testing platforms. You had your DevOps environment set up and you could work very quickly,” Orr explained to BigDATAwire. “When it comes to the data side, it was very hard to implement engineering best practices. The iterative work was very, very slow. The cost of error was very high. And this is the problem that we came to solve.”

Einat Orr is the CEO and Co-founder of lakeFS

In 2020, Orr and her SimilarWeb colleague Oz Katz left to co-found lakeFS, which was originally called Treeverse. The idea was to bring DevOps best practices and tech to data, specifically around the implementation of testing. As the company’s open source and enterprise tools were adopted, they saw that enterprises were primarily interested in using it in AI and ML environments, so the company shifted its focus there.

“When we released the project in 2020, that was our goal,” said Orr, who has a PhD in mathematics from Tel Aviv University. “And over time, we saw that the adoption is mainly in environments where models are researched and then trained, so the use case of AI and ML is where data version control really provides value.”

The version control in lakeFS functions essentially like an audit trail. When one person or application makes a change to the data, it’s tracked by lakeFS. Users can clone the original data set and branch it to use for additional use cases, like an analytics project. If the changes were made in error, they can be rolled back to the original.
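The branch-and-rollback behavior described above can be sketched in a few lines. This is an illustrative toy model only, not the lakeFS API; real deployments use the lakectl CLI or the lakeFS SDK against an object store, and the class and method names here are hypothetical.

```python
# Toy model of data version control: branch, commit, and roll back snapshots.
# NOT the lakeFS API; names here are illustrative assumptions.

class DataRepo:
    """Tracks snapshots of a dataset so changes can be branched and rolled back."""

    def __init__(self, data):
        self.branches = {"main": [dict(data)]}  # branch name -> list of snapshots

    def branch(self, name, source="main"):
        # A new branch starts from the source branch's latest snapshot.
        self.branches[name] = [dict(self.branches[source][-1])]

    def commit(self, branch, changes):
        # Record a new snapshot; earlier snapshots remain as the audit trail.
        snapshot = dict(self.branches[branch][-1])
        snapshot.update(changes)
        self.branches[branch].append(snapshot)

    def revert(self, branch):
        # Roll back an erroneous change by dropping the latest snapshot.
        if len(self.branches[branch]) > 1:
            self.branches[branch].pop()

    def head(self, branch):
        return self.branches[branch][-1]


repo = DataRepo({"labels.csv": "v1"})
repo.branch("analytics-experiment")                    # clone for a side project
repo.commit("analytics-experiment", {"labels.csv": "v2"})
repo.revert("analytics-experiment")                    # the change was a mistake
assert repo.head("analytics-experiment") == repo.head("main")
```

The point of the sketch is that every change appends a snapshot rather than overwriting data, which is what makes both the audit trail and the rollback possible.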

There are three main reasons organizations need version control for data, Orr said. Either the data is very large, on the order of petabytes and billions of files; there are so many sources of data that they can't be tracked manually; or the team of people accessing the data is so large that versioning is needed to keep people from stepping on each other's toes.

Data practitioners are the main users of lakeFS, which could be data engineers, data analysts, or data scientists. LakeFS can be deployed as part of an effort to create data products, or pre-built repositories of data, Orr said. “When you have data version control, you can easily create a data product and work on it,” Orr said. “Several people can work on this data product. You can control the inputs of the data.”

Testing is still part and parcel of the lakeFS experience. Engineers can develop a test to determine if the data is kosher and follows the organizations’ best practices. If the data passes the test, more users can be granted access to it as a data product. It functions similarly to a CI/CD (continuous integration/continuous deployment) pipeline in the DevOps world, Orr said.
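The CI/CD analogy above can be made concrete with a small data-quality gate: a check runs against a candidate dataset, and only data that passes is "promoted" for wider access. This is an illustrative sketch under assumed rules; the function names and validation checks are hypothetical, not lakeFS APIs.

```python
# Illustrative data-quality gate, in the spirit of a CI/CD pipeline for data.
# Validation rules and function names here are assumptions for the example.

def validate(rows):
    """Reject rows that break simple best-practice rules before promotion."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("id") is None:
            problems.append(f"row {i}: missing id")
        if not (0.0 <= row.get("score", -1) <= 1.0):
            problems.append(f"row {i}: score out of range")
    return problems

def promote_if_clean(rows, published):
    # Mimics merging a tested branch into the shared data product.
    problems = validate(rows)
    if problems:
        return problems          # block the merge; surface the failures
    published.extend(rows)       # promotion: data is now broadly accessible
    return []

published = []
good = [{"id": 1, "score": 0.9}]
bad = [{"id": None, "score": 2.5}]
assert promote_if_clean(good, published) == []
assert len(promote_if_clean(bad, published)) == 2
assert published == good         # only validated data was promoted
```

As in code CI/CD, the gate turns "is this data kosher?" from a manual judgment into an automated, repeatable check.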

LakeFS enables customers to manage distributed, disparate data in a logical way. Instead of copying all of your data and loading it into a single repository, lakeFS creates a logical repository out of the object storage buckets, where users can access the data from a single mount point. LakeFS creates additional data structures on the storage repository where the users’ data is stored; nothing is saved externally.
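The "logical repository" idea above can be sketched as a thin layer of pointers: the repository maps a single namespace onto objects that stay in their original buckets, so nothing is copied. This is a simplified illustration of the concept, not how lakeFS is implemented internally; the bucket names and keys are hypothetical.

```python
# Illustrative sketch: a logical repository spanning several storage buckets.
# Only lightweight pointers (metadata) are kept; the bytes stay where they are.
# Bucket names and keys below are hypothetical.

buckets = {
    "s3://raw-logs": {"2024/day1.json": b"log bytes"},
    "s3://images":   {"cat/001.png": b"image bytes"},
}

# The logical repo maps one namespace onto objects in many buckets.
logical_repo = {
    "logs/day1.json": ("s3://raw-logs", "2024/day1.json"),
    "img/cat.png":    ("s3://images", "cat/001.png"),
}

def read(path):
    # One mount point: the caller never needs to know which bucket holds the data.
    bucket, key = logical_repo[path]
    return buckets[bucket][key]

assert read("img/cat.png") == b"image bytes"
```

Because only the pointer table changes when data is branched or reorganized, the approach avoids duplicating large datasets.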

The software itself is open source and supports any POSIX-compliant data source running on Linux and Unix, including object stores and file systems; support for Windows is coming. Anyone can use lakeFS to bring version control to data stored in a single repository. Databases running on block storage and SANs are not supported.

The company also sells an enterprise version that adds support for multiple object stores, on-prem data stores, role-based access control (RBAC), and creating mount points. The enterprise version also supports the versioning of Apache Iceberg tables and Snowflake environments.

The company has racked up several impressive customer wins over its short lifetime. Volvo, Toyota, Microsoft, Arm, Bosch, and NASA are using lakeFS as part of their data management infrastructure. One of the early users of lakeFS is the defense contractor Lockheed Martin, which uses lakeFS to help manage data as part of its AI factory. Orr explained the value of lakeFS in this deployment:

“So any user in Lockheed Martin, when dealing with the data, would be creating a lakeFS repository, putting in all the data that is relevant for their research or their model,” she said. “And then the team within that repository would be able to collaborate very easily by working on branches and merging good results, being able to reproduce any point in time within the development of the model.”


The Department of Energy is using lakeFS as part of Project Alexandra, an effort to build data interconnections and provide stewards for a long-term view of data stored by the department and the National Nuclear Security Administration (NNSA). The DOE has published a video on its use of lakeFS and other big data software.

When the generative AI wave hit in late 2022, it spurred heavy investments in data infrastructure. Suddenly, unstructured data had a lot more value in an AI setting, but the technologies for managing that data were not keeping up with the rest of the stack. LakeFS was ready to pick up the GenAI ball and run with it, providing version control for the unwieldy unstructured data repositories that are so critical for organizations’ AI projects.

The $20 million investment from Maor Investments adds to a previous $23 million in funding. This round is intended to help drive growth for lakeFS, both on the R&D side as well as the go-to-market side, Orr said.

“LakeFS solves one of the most critical and oft-overlooked challenges in modern data infrastructure,” said Ido Hart, Partner at Maor Investments.

“As AI data becomes larger, messier and more mission-critical, lakeFS delivers the control layer needed to build, iterate and ship with confidence,” he said. “Built for the scale and complexity of modern enterprises, lakeFS is not just a smart solution, it’s a foundational layer for reproducibility, collaboration and trust in the AI era. We believe lakeFS will become indispensable to the modern AI stack, and we’re proud to back their bold vision.”

The dream of bringing order to messy multi-modal data is not the exclusive domain of Orr and Katz. Orr said she and her co-founder have the scars of working through the days of Hadoop. The creation of lakeFS is one of the results of applying the knowledge gained from those hard lessons.

“One of the things that I love about this is that it doesn’t replace anything, but it enhances everything within the environment that we’re in with version control capabilities,” Orr said. “Suddenly the storage is managed properly and clearly, and the orchestration can work with the versions. The data and the code could be orchestrated together with their versions. Everything falls into place just by putting this data version control system in. It just makes everything better.”

Related Items:

Tapping into the Unstructured Data Goldmine for Enterprise in 2025

Peering Into the Unstructured Data Abyss

Unstructured Data Growth Wearing Holes in IT Budgets

The post LakeFS Nabs $20M to Build ‘Git for Big Data’ appeared first on BigDATAwire.
