A great MLOps project should start with a good Python Package
MLOps practitioners (rightfully) point out that running notebooks in production is a bad software practice, but what are the alternatives? A simple script is not enough to capture the complexity of AI/ML projects, and rewriting a whole project in another programming language is both costly and time-consuming.
To solve this problem, the most efficient approach is to create a Python package that bundles the project sources and assets into a code archive. However, building such a package can be a complex endeavor for newcomers. The Python ecosystem is vibrant, but also fragmented. Moreover, machine learning projects are more complex to develop than most other software applications.
In this article, I present the implementation of a Python package on GitHub designed to support MLOps initiatives. The goal of this package is to make the coding workflow of data scientists and ML engineers as flexible, robust, and productive as possible. First, I start by motivating the use of Python packages. Then, I provide some tools and tips you can include in your MLOps project. Finally, I explain the follow-up steps required to take this package to the next level and make it work in your environment.
Link to the repository - https://github.com/fmind/mlops-python-package
Motivations
Building Python packages is a common practice in our industry. A Python package allows developers to collaborate with others, version the source code, and share code archives on a package index such as PyPI (pypi.org). Another benefit is that a Python package can be used both as a library (i.e., imported from another code base) and as an application (i.e., executed from the command line). Python developers are also used to leveraging packages developed by others, such as Flask, Pandas, or TensorFlow, just to name a few.
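To make the library/application duality concrete, here is a minimal sketch using only the standard library. The names `accuracy` and `main` and the command-line flags are hypothetical and are not taken from the repository: the same module exposes a function that other code can import, plus an entry point that parses command-line arguments.

```python
import argparse
from typing import Optional, Sequence


def accuracy(targets: Sequence[int], predictions: Sequence[int]) -> float:
    """Library function: share of predictions equal to their target."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one target is required")
    hits = sum(t == p for t, p in zip(targets, predictions))
    return hits / len(targets)


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Application entry point: parse arguments, print the score, return an exit code."""
    parser = argparse.ArgumentParser(description="Compute accuracy (hypothetical example).")
    parser.add_argument("--targets", type=int, nargs="+", required=True)
    parser.add_argument("--predictions", type=int, nargs="+", required=True)
    args = parser.parse_args(argv)  # argv=None falls back to the real command line
    print(accuracy(args.targets, args.predictions))
    return 0
```

Another code base can call `accuracy(...)` directly, while a packaged project would typically expose `main` through a `__main__.py` module or a console script entry in `pyproject.toml`, so the same code also runs from the command line.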
But despite all these benefits, Python packages are complex to build and structure properly. On one hand, there are a lot of tools to combine and it can be difficult for developers to choose the best components without hands-on experience. On the other hand, machine learning is one of the most complex types of projects, as data dimensions, randomness, and entangled workflows make everything more difficult.
Over my career, I have spent hours researching the best set of tools and tricks to make this experience as smooth as possible for data scientists, ML engineers, and myself. I hope this initiative will help you in your MLOps journey and let you build great AI/ML solutions for your use cases.
Tools & Tips
The GitHub repository provides both the implementation and the design decisions for developing with the MLOps Python package. You can get all the main information in the README.md file. Before jumping to the tools and tips, I'd like to highlight the "methodology" for selecting the elements in this package:
- Keep It Simple Stupid (KISS): data scientists and ML engineers are dealing with complex tasks (e.g., maths, software development, business requirements, ...). The package should get out of their way as much as possible and be simple to read and follow.
- Leverage good software practices: our nascent MLOps industry can benefit from years of software experience. We can leverage design patterns and the Python ecosystem to make our development environment as powerful as possible.
- The constant trade-off of simplicity vs power: creating an empty shell or a technical show-off is easy. The real struggle in creating such a package is to bring in the best possible practices while keeping it accessible to the majority of end users.
The MLOps Python Package includes more than 30 tools. My favorite ones are:
- Mypy: check that your code types are valid during development.
- OmegaConf: parse and merge YAML files to load configurations.
- Pydantic: better definition and validation of Python classes (check out Tagged Union, this is a great way to initialize your program!).
- Invoke: define development tasks in a saner syntax than Makefile.
- Poetry: manage your Python package (metadata, dependencies, ...).
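To give an idea of why a tagged union is a convenient way to initialize a program, here is a rough sketch of the pattern. Pydantic expresses this declaratively with a discriminator field; the version below deliberately uses only the standard library to show the underlying idea, and the class names, the `KIND` field, and `parse_job` are hypothetical examples rather than the repository's code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Type, Union


@dataclass(frozen=True)
class TrainingJob:
    epochs: int = 10
    KIND: str = "TrainingJob"  # tag identifying this variant


@dataclass(frozen=True)
class InferenceJob:
    batch_size: int = 32
    KIND: str = "InferenceJob"  # tag identifying this variant


# The union of all variants; a type checker such as Mypy can reason about it.
Job = Union[TrainingJob, InferenceJob]

# Registry mapping each tag value to the class it selects.
JOB_KINDS: Dict[str, Type[Any]] = {
    "TrainingJob": TrainingJob,
    "InferenceJob": InferenceJob,
}


def parse_job(config: Dict[str, Any]) -> Job:
    """Pick the job class from the KIND tag and build it from the config."""
    kind = config.get("KIND")
    if kind not in JOB_KINDS:
        raise ValueError(f"unknown job KIND: {kind!r}")
    return JOB_KINDS[kind](**config)


job = parse_job({"KIND": "InferenceJob", "batch_size": 64})
print(job)
```

A single configuration entry decides which job the program runs, and unknown tags fail early with an explicit error. With Pydantic, the field types would be validated as well, which this simplified version does not do.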
The MLOps Python Package also includes more than 20 tips and tricks. The most important ones are:
- SOLID Principles: define software interfaces to make your code more modular and reusable.
- Soft Coding: change your program behavior through config files instead of code changes.
- Data Catalog: separate the data you want to access from how you access it.
- Test Fixture: create contextual objects to support Test-Driven Development (TDD) with Pytest.
- DataFrame Typing: define dataframe schemas to communicate their fields and validate them with Pandera.
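Several of these tips work well together. The sketch below is an illustration under my own assumptions, not the repository's implementation: it combines a software interface (`Reader`), soft coding (the catalog is built from a configuration dictionary, which could come from a YAML file), and a data catalog (callers ask for a dataset by name without knowing how it is stored). All names and the configuration layout are hypothetical, and only the standard library is used.

```python
import abc
import csv
import json
from typing import Any, Dict, List, Type

Row = Dict[str, Any]


class Reader(abc.ABC):
    """Interface: callers ask for rows without knowing where they come from."""

    def __init__(self, path: str) -> None:
        self.path = path

    @abc.abstractmethod
    def read(self) -> List[Row]:
        """Return the dataset as a list of rows."""


class CSVReader(Reader):
    def read(self) -> List[Row]:
        with open(self.path, newline="", encoding="utf-8") as file:
            return list(csv.DictReader(file))  # values are kept as strings


class JSONLinesReader(Reader):
    def read(self) -> List[Row]:
        with open(self.path, encoding="utf-8") as file:
            return [json.loads(line) for line in file if line.strip()]


# Registry of supported formats; adding a format means adding one class here.
READERS: Dict[str, Type[Reader]] = {"csv": CSVReader, "jsonl": JSONLinesReader}


def load_catalog(config: Dict[str, Dict[str, str]]) -> Dict[str, Reader]:
    """Build the catalog from configuration, so sources change without code edits."""
    return {
        name: READERS[entry["format"]](entry["path"])
        for name, entry in config.items()
    }
```

With a configuration such as `{"inputs": {"format": "csv", "path": "data/inputs.csv"}}`, the training code only calls `catalog["inputs"].read()`. Moving the data to another format or location then becomes a configuration change rather than a code change.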
PS: I know several people who complain that Python is a bad programming language. On the contrary, I think Python can be a great programming language with a bit of discipline and the right tooling!
Integrations
Having an MLOps Python Package is just a small part of your MLOps journey. While most MLOps projects start with a Python package, this artifact should be integrated with the rest of your infrastructure: Compute Engine (e.g., Kubernetes, Databricks), Experiment Tracking (e.g., MLflow, Neptune), and Task Orchestration (e.g., Airflow, Kubeflow).
After all my research, I haven't found a one-size-fits-all infrastructure that can address everybody's use cases. On one hand, cloud providers offer end-to-end solutions which are specific to their platform. On the other hand, Kubernetes-based solutions have a steep learning curve and are too heavyweight for data scientists. Another big issue is the lack of common protocols to easily integrate all these MLOps systems, as explained in my other article, "We need POSIX for MLOps", and more generally in this talk from Rich Hickey, "The Language of the System".
Thus, I created this package as a common denominator for MLOps initiatives, devoid of any infrastructure dependencies. It is your task, dear reader, to extend its capabilities based on your requirements and environments to suit your end-user needs.
Conclusions
Using Python packages for MLOps is a good practice, but creating the best package requires a lot of hammock-driven development. I hope you will find value in this package and have the best success with your MLOps project.
To reflect on this task, I like to compare programming to martial arts. The goal is not to beat your opponent with brute force tactics but to inflict the most deadly strikes in the swiftest manner. It is a balance of power, moderation, and respect for your opponent. Similarly, I think creating good software is a form of art, and a never-ending quest to surpass yourself.
As a final note, I always love to discuss development practices with my peers. Feel free to drop me a message on the MLOps Community Slack, create an issue on GitHub, or contribute directly to the repository. This is the power of open source after all :)