How Much Docker Should a Data Scientist Know?

The best answers are obviously “some”, “depends”, or begin with “Well…”. Let’s take a deeper dive, attempt to understand where and how Docker is being shoehorned (read: shoved) into a data scientist’s daily work, and look at how open source Buildpacks can help data scientists.

The Culprit ― Titles

Before we dive into the specifics of containerization, it’s crucial to understand the different roles often found among data specialists and what they typically entail. These include data scientist, data analyst, and data engineer.

A data analyst typically focuses on exploring and analyzing existing data to extract insights and communicate findings. Their work often involves data cleaning, visualization, statistical analysis, and reporting. Tools often include SQL, Excel, BI tools (Tableau, Power BI), and sometimes Python/R for scripting and basic modeling.

The data scientist builds models and algorithms, often using advanced statistical techniques and machine learning. They are involved in the entire process from data collection and cleaning to model building, evaluation, and sometimes deployment. Their toolkit is extensive, including Python, R, various ML frameworks (TensorFlow, PyTorch, scikit-learn), SQL, and increasingly, cloud platforms.

The data engineer is a newer role. This persona designs, builds, and maintains the infrastructure and systems that allow data scientists and analysts to access and use data effectively. This involves building data pipelines, managing databases, working with distributed systems (like Spark), and ensuring data quality and availability. Their skills lean heavily towards software engineering, databases, and distributed systems.


What do these titles mean? Are data engineers just DevOps folks in data science garb?

While there’s definitely a significant overlap and data engineers often utilize many DevOps principles and tools, it’s not entirely accurate to say they are just DevOps folks. Data engineers have a deep understanding of data structures, data storage and retrieval, as well as data processing frameworks that go beyond typical IT operations. However, as data infrastructure has moved to the cloud and embraced principles like Infrastructure as Code and CI/CD, the skills required for data engineering have converged considerably with DevOps.

Lateral Shifts: The Rise of MLOps

This convergence is perhaps most evident in the emergence of MLOps.

MLOps can be seen as the intersection of machine learning (ML), DevOps, and data engineering. It’s about applying DevOps principles and practices to the machine learning lifecycle.

MLOps is about putting data science artifacts into production. These can be models, pipelines, inference endpoints, and more. The goal is to reliably and efficiently deploy, monitor, and maintain machine learning models in production environments.

In addition to typical DevOps tooling, MLOps has a specific focus and calls for a few additional tools; it is like a new vertical industry in which DevOps tools are applied. While MLOps leverages core DevOps concepts like CI/CD, monitoring, and automation, it also introduces tools and practices specific to machine learning, such as model registries, feature stores, and tools for tracking experiments and model versions. This represents a specialization within the broader DevOps landscape, tailored to the unique challenges of deploying and managing ML models.
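To make the experiment-tracking piece concrete, here is a minimal sketch using MLflow, one popular open source tracker (an example choice, not a prescribed tool); the dataset, model, and hyperparameter are placeholders.

    # A minimal experiment-tracking sketch with MLflow; the dataset,
    # model, and hyperparameter below are placeholders.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    with mlflow.start_run():
        mlflow.log_param("n_estimators", 100)      # record a hyperparameter
        model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        mlflow.log_metric("accuracy", acc)         # record an evaluation metric
        mlflow.sklearn.log_model(model, "model")   # version the trained artifact

Each run’s parameters, metrics, and model artifact land in a registry that both the data scientist and the MLOps tooling can query; this is the kind of ML-specific capability that sits on top of, rather than replaces, standard DevOps practice.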

Enter Kubernetes

Over the past few years, Kubernetes has become an integral part of cloud-native computing and the gold standard for container orchestration at scale. It provides a robust and scalable way to manage containerized applications.


Kubernetes is a mainstay of the DevOps world. It offers significant benefits in terms of scalability, resilience, and portability, making it a popular choice for modernizing infrastructure. This adoption, driven by the engineering and operations side, inevitably impacts other roles that interact with deployed applications.

This forces knowledge of containers, Docker, and a whole lot of other tooling onto data scientists. As ML models are increasingly deployed as microservices within containerized environments managed by Kubernetes, data scientists need to understand the basics of how their models will run in production. That journey usually starts with containers, and Docker is the most prevalent containerization tool.

How does learning a new DevOps tool compare to learning, say, Microsoft Excel? It’s a vastly different beast. Learning Excel is about mastering a user interface and a set of functions for data manipulation and analysis within a structured environment. Learning a DevOps tool like Docker, or understanding Kubernetes, involves grasping concepts related to operating systems, networking, distributed systems, and deployment workflows. It’s a significant step into the world of infrastructure and software engineering practices.

Let’s look at the stages of an ML pipeline and where containers fit in:

  • Data Preparation (collection, cleaning/pre-processing, feature engineering): These steps can often be containerized to ensure consistent environments and dependencies.
  • Model Training (model selection, architecture, hyperparameter tuning): Training jobs can be run in containers, making it easier to manage dependencies and scale training across different machines.
  • CI/CD: Containers are fundamental to CI/CD pipelines for ML, allowing for automated building, testing, and deployment of models and related code.
  • Model Registry (storage): While the registry itself might not be containerized by the data scientist, the process of pushing and pulling model artifacts often integrates with containerized workflows.
  • Model Serving: This is a primary use case for containers. Models are typically served within containers (e.g., using Flask, FastAPI, or specific serving frameworks) for scalability and isolation; a minimal sketch follows this list.
  • Observability (usage load, model drift, security): Monitoring and logging tools often integrate with containerized applications to provide insights into their performance and behavior.
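To make the serving stage concrete, here is a minimal sketch of a prediction endpoint built with FastAPI; the model file name (model.joblib) and the flat feature vector are illustrative assumptions, not a prescribed layout.

    # A minimal model-serving sketch; model.joblib and the feature
    # layout are assumptions for illustration.
    import joblib
    import numpy as np
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    model = joblib.load("model.joblib")  # assumed pre-trained artifact

    class PredictRequest(BaseModel):
        features: list[float]

    @app.post("/predict")
    def predict(req: PredictRequest):
        # Wrap the single example in a batch dimension for scikit-learn-style models
        prediction = model.predict(np.array([req.features]))
        return {"prediction": prediction.tolist()}

Run it locally with uvicorn app:app; packaged into a container image, the same service can be deployed, scaled, and monitored by Kubernetes like any other microservice.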

A Whole Sea of Non-Containerized Workloads


Despite the push towards containerization, it’s important to acknowledge that there exists a whole sea of non-containerized workloads in data science. Not every task or tool immediately benefits from or requires containerization.

These could be individual tools or whole platforms, typically running locally but sometimes in production as well.

Some concrete examples of non-containerized workloads in a data science pipeline are:

  • Initial data exploration and ad-hoc analysis: Often done locally in a Jupyter notebook or IDE without the need for containerization.
  • Using desktop-based statistical software: Tools like SPSS or SAS, while powerful, are not typically run in containers for interactive analysis.
  • Working with large datasets on a shared cluster without container orchestration: Some organizations may still rely on traditional cluster computing where jobs are submitted and run without explicit containerization by the end user.
  • Simple scripts for data extraction or reporting that run on a schedule: For straightforward tasks without complex dependencies, a simple script executed by a scheduler might suffice without container overhead.
  • Older legacy systems or tools: Not all existing data infrastructure is container-native.

The Problem

The result of such a spread of non-containerized options being available, and convenient, is that data scientists tend to gravitate towards them. Containers represent a cognitive overload: another technology they have to study, another mastery they need to pursue.

That being said, containers can improve several things for data science teams. Inconsistencies between environments, which can be a large source of toil, can be ironed out. Containers can prevent dependency conflicts between different environments, whether local, staging, or production. And reproducible, portable builds of both code and served models are a feature data scientists would love to have.

Not all data teams can afford to have large, competent, and economical operations teams at their beck and call. It is the Iron Triangle (good, fast, cheap: pick two) all over again.

Cloud Native Buildpacks: A Clean Solution To A Messy Problem

Data scientists frequently utilize diverse toolchains involving languages like Python or R along with a myriad of libraries, leading to complex dependency-management challenges. Operationalizing these artifacts often requires deftness and container acrobatics in the form of manually stitching together and maintaining intricate Dockerfiles.

Buildpacks really change the game here. They assemble the necessary build-time and run-time dependencies and create OCI-compliant images without explicit Dockerfile instructions. This automation reduces the operational burden on data scientists and frees up cognitive bandwidth, allowing them to concentrate on analytical tasks.
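As a hedged sketch of the workflow, consider a project directory holding nothing but an application file and a requirements.txt, with no Dockerfile anywhere; the file names and the Paketo builder below are illustrative choices, not the only supported setup.

    # app.py: a toy Flask inference stub used only to illustrate the
    # Buildpacks workflow; names and builder below are assumptions.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.get("/health")
    def health():
        return jsonify(status="ok")

    # With a requirements.txt listing "flask" next to this file, one command:
    #   pack build my-model-service --builder paketobuildpacks/builder-jammy-base
    # detects the Python app, resolves its dependencies, and emits an
    # OCI-compliant image, with no Dockerfile to write or maintain.

The resulting image runs anywhere Docker or Kubernetes does, so the data scientist gets the benefits of containers without the container acrobatics.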

Cloud Native Buildpacks is a CNCF incubating project. The open source tool is maintained by a community spread across several organizations and finds tremendous use in the MLOps space. Check out the adopters list (ADOPTERS.md) and get started from the GitHub repo.

About the author: Ram Iyengar, developer advocate for Cloud Foundry Foundation (part of Linux Foundation), is an engineer by practice and an educator at heart. Along his journey as a developer, Ram transitioned into technology evangelism and hasn’t looked back. He enjoys helping engineering teams around the world discover new and creative ways to work.

Related Items:

Is Kubernetes Really Necessary for Data Science?

Kubernetes Best Practices: Blueprint for Building Successful Applications on Kubernetes

Is Kubernetes Overhyped?

