Deep Dive: How to structure your code for Machine Learning Development
Deep Dive in Data Science Fundamentals
In general, I believe Machine Learning deserves the rigor of any software engineering field. Training code should be reusable, modular, scalable, testable, maintainable, and documented. Today I want to show you my template for developing quality machine learning code. We will look at:
What does coding mean?
The code structure:
Setting up the virtual environment
The code skeleton
Implementing the training pipeline
Saving the model binary
Improving the code readability:
Packaging the project
This Deep Dive is part of the Data Science Fundamentals series.
What does coding mean?
I often see Data Scientists or Machine Learning Engineers developing in Jupyter notebooks, copy-pasting their code from one place to another, and that gives me nightmares! When running ML experiments, Jupyter is prone to human error because cells can be run in different orders. You should be able to capture all the configurations of an experiment to ensure reproducibility. Jupyter can be used to call a training package or an API and manually orchestrate experiments, but fully developing in Jupyter is a risky practice. When training a model, you should make sure that the data passes through the exact same feature processing pipelines at training time as at serving time. This means the exact same classes and methods, as well as identical package versions and hardware (GPU vs. CPU). Personally, I prototype in Jupyter but develop in PyCharm or VS Code.
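To make "capturing all the configurations of an experiment" a bit more concrete, here is a minimal sketch using only the standard library. The ExperimentConfig class, its fields, and the snapshot() helper are hypothetical names chosen for illustration, not part of the template itself.

```python
import hashlib
import json
import platform
import sys
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for everything that defines one training run."""
    model_name: str
    learning_rate: float
    random_seed: int
    data_snapshot: str  # e.g. a date or a dataset version tag


def snapshot(config: ExperimentConfig) -> dict:
    """Bundle the config with runtime details that affect reproducibility."""
    record = {
        "config": asdict(config),
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }
    # A stable hash of the config makes it easy to tell two runs apart.
    payload = json.dumps(record["config"], sort_keys=True).encode("utf-8")
    record["config_hash"] = hashlib.sha256(payload).hexdigest()
    return record


config = ExperimentConfig("baseline", 0.1, 42, "2023-01-01")
print(json.dumps(snapshot(config), indent=2))
```

Storing a record like this next to each trained model is one way to know later exactly which settings produced it, which is hard to guarantee when cells are run by hand.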
When writing code, focus on the following aspects:
Reusability: It is the capacity to reuse code in another context or project without the need for significant modifications. Code reusability can be achieved in several ways, such as through the use of libraries, frameworks, modules, and object-oriented programming techniques. In addition, good documentation and clear code organization can also facilitate code reuse by making it easier for other developers to understand and use the code.
Modularity: It is the practice of breaking down a software system into smaller, independent modules or components that can be developed, tested, and maintained separately.
Scalability: It refers to the ability of a software development codebase to accommodate the growth and evolution of a software system over time. In other words, it refers to the ability of the codebase to adapt to changing requirements, features, and functionalities while maintaining its overall structure, quality, and performance. To achieve codebase scalability, it is important to establish clear coding standards and practices from the outset, such as the use of version control, code review, and continuous integration and deployment. In addition, it is important to prioritize code maintainability and readability, as well as the use of well-documented code and clear naming conventions.
Testability: It refers to the ease with which software code can be tested to ensure that it meets the requirements and specifications of the software system. It can be achieved by designing code with testing in mind rather than treating testing as an afterthought. This can involve writing code that is modular, well-organized, and easy to understand and maintain, as well as using tools and techniques that support automated testing and continuous integration.
Maintainability: It refers to the ease with which software code can be modified, updated, and extended over time.
Documentation: It provides a means for developers, users, and other stakeholders to understand how the software system works, what its features are, and how to interact with it.
In Machine Learning, like any engineering domain, no line of code should be written until a proper design is established. Having a design means that we were able to translate a business problem into a machine learning solution (if ML is indeed the right solution to the problem!). For simplicity, let’s assume we want to build a mobile application where a user needs machine learning predictions displayed on the screen. Personalized product recommendations, for example. The process could work as follows:
The mobile application requests personalized predictions from the backend server.
The backend server fetches predictions from a database.
We figured that daily batch predictions were the most appropriate setup for now, and the predictions get updated daily by the machine learning service.
Before we can understand how to develop our model, we need to understand how we will deploy it. Let’s assume that, for our purposes, the inference application will be packaged as a Docker image. The image can be stored in a container registry such as AWS ECR (Amazon Elastic Container Registry) or Docker Hub. We can have an orchestration system such as Airflow that spins up the inference service, pulls the image from the registry, and runs the inference application.
Now that we know what we need to build and how it will be deployed, it is becoming much clearer how we need to structure our codebase. We will build two applications: an inference and a training application. To minimize potential human errors, it is imperative that the modules used at training time are the same as the ones used at inference time. Let’s look at the following class diagram:
The application layer: This is the part of the code that captures the applications’ logic. Think about those modules as “buttons” that start the inference or training processes. We are going to have a run() function for each of those applications that will serve as handles for the Docker image to start those processes.
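As a rough illustration of such a handle, the training application could expose something like the sketch below. Only the run() name comes from the design; the --run-date argument and the printed message are assumptions made for the example.

```python
# applications/training.py (sketch)
import argparse


def run(argv=None) -> int:
    """Hypothetical handle that a Docker image could invoke to start training."""
    parser = argparse.ArgumentParser(description="Start the training process")
    # --run-date is an illustrative argument, not part of the original design.
    parser.add_argument("--run-date", default="latest")
    args = parser.parse_args(argv)
    # The real training logic would be delegated to the machine learning layer.
    print(f"Starting training for run date: {args.run_date}")
    return 0


# Calling the handle directly, as a container entry point might do:
exit_code = run(["--run-date", "2023-01-01"])
```

Keeping run() this thin means the application layer only decides when to start a process, while the actual logic stays in modules that both applications can share.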
The data layer: This is the abstraction layer that moves data in and out of the applications. I am calling it the "data" layer, but I am including anything that needs to go to or come from the outside world, like the data, the model binaries, the data transformer, the training metadata… In this batch use case, we are just going to need a function that brings the data into the applications, get_data(), and another that puts predictions back into the database. The DataConnector moves data around. The ObjectConnector is the actor responsible for transferring model binaries and data transformation pipelines using put_object().
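Here is one possible shape for the two connectors, kept deliberately simple. The class names, get_data(), and put_object() come from the design above; put_data(), get_object(), the in-memory table store, and the pickle-on-local-disk storage are assumptions standing in for a real database and object store.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, List


class DataConnector:
    """Moves tabular data in and out of the applications (in-memory stand-in)."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[dict]] = {}

    def get_data(self, table: str) -> List[dict]:
        # Return a copy so callers cannot mutate the stored rows.
        return list(self._tables.get(table, []))

    def put_data(self, table: str, rows: List[dict]) -> None:  # hypothetical name
        self._tables[table] = list(rows)


class ObjectConnector:
    """Transfers Python objects such as model binaries (local-disk stand-in)."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def put_object(self, name: str, obj: Any) -> None:
        with open(self._root / name, "wb") as f:
            pickle.dump(obj, f)

    def get_object(self, name: str) -> Any:  # hypothetical name
        with open(self._root / name, "rb") as f:
            return pickle.load(f)


# Round trip: store a fake "model binary" and read it back.
with tempfile.TemporaryDirectory() as folder:
    objects = ObjectConnector(folder)
    objects.put_object("model.pkl", {"weights": [0.1, 0.2]})
    restored = objects.get_object("model.pkl")
```

Because the applications only talk to these two classes, swapping the stand-ins for a real database client or cloud storage should not require touching the training or inference logic.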
The machine learning layer: This is the module where all the different components of machine learning will live. The three components of model training are:
Learning the parameters of the model: the Model will take care of that with the fit() method. For inference, we use its predict() method.
Learning the feature transformations: We may need to normalize features, perform Box-Cox transformations, one-hot encode, etc. The DataProcessor will take care of that with its fit() method.
Learning the hyperparameters of the model and data pipeline: the CrossValidator will handle this task.
The TrainingPipeline will handle the logic between the different components.
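To show how these components could fit together, here is a toy sketch. The class names follow the design, but the internals are assumptions: the DataProcessor only centers one feature, the Model is a one-feature least-squares fit, the TrainingPipeline.run() method and the predict() helper are hypothetical, and the CrossValidator is left out for brevity.

```python
from statistics import mean


class DataProcessor:
    """Learns a feature transformation; here, centering on the training mean."""

    def fit(self, X):
        self.mean_ = mean(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class Model:
    """Learns model parameters; here, a one-feature least-squares line."""

    def fit(self, X, y):
        x_bar, y_bar = mean(X), mean(y)
        variance = sum((x - x_bar) ** 2 for x in X)
        covariance = sum((x - x_bar) * (t - y_bar) for x, t in zip(X, y))
        self.slope_ = covariance / variance if variance else 0.0
        self.intercept_ = y_bar - self.slope_ * x_bar
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


class TrainingPipeline:
    """Coordinates the components so training always follows the same steps."""

    def __init__(self, processor, model):
        self.processor = processor
        self.model = model

    def run(self, X, y):  # hypothetical method name
        features = self.processor.fit(X).transform(X)
        self.model.fit(features, y)
        return self.processor, self.model


def predict(processor, model, X):
    """Inference reuses the fitted objects, so features are processed identically."""
    return model.predict(processor.transform(X))


processor, model = TrainingPipeline(DataProcessor(), Model()).run(
    [1, 2, 3, 4], [2, 4, 6, 8]
)
prediction = predict(processor, model, [5])
```

The point of the sketch is the last function: inference goes through the same fitted DataProcessor and Model objects as training, which is what keeps the two applications consistent.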
The code structure
Now that we have a class diagram, we need to map it into actual code. Let’s call the project machine_learning_service. There are many ways to do it, but we will organize the project as follows:
machine_learning_service/
├── docs/
├── src/
└── tests/
The docs folder: for the documents
The src folder: for the source code (the actual codebase)
The tests folder: for the unit tests
Setting up the virtual environment
Because we will need to Dockerize this project at some point, it is important to control the Python version and packages we use locally. For that, we will create a virtual environment called env with venv. Within the project folder, we run
python -m venv ./env
to create it, and then we activate it. Now we should see the following folder structure:
machine_learning_service/
├── docs/
├── env/
├── src/
└── tests/
Let’s make sure the right Python version is running:
which python
> ~/.../machine_learning_service/env/bin/python
so we use the Python binaries of the virtual environment. Let’s make sure it is Python 3:
python -V
> Python 3.9.7
Ok, we are good to go!
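The same checks can also be done from inside Python, which can be handy at the top of a script. This is a small sketch using the standard sys module; the helper names are hypothetical.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter runs inside an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_python3() -> bool:
    return sys.version_info.major == 3


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
print("Python 3:", is_python3())
```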
The code skeleton
Within the source folder, let’s create the different modules we have in the class diagram:
machine_learning_service/
├── docs/
├── env/
├── src/
│   ├── applications/
│   │   ├── training.py
│   │   └── inference.py
│   ├── data_layer/
│   │   ├── data_connector.py
│   │   └── object_connector.py
│   └── machine_learning/
│       ├── training_pipeline.py
│       ├── model.py
│       ├── data_processor.py
│       └── cross_validator.py
└── tests/
For now, let’s have empty classes.
# model.py
class Model:
    def fit(self, X, y):
        # To be implemented by a concrete model; should return self
        raise NotImplementedError

    def predict(self, X):
        # To be implemented by a concrete model
        raise NotImplementedError

    def fit_predict(self, X, y):
        return self.fit(X, y).predict(X)
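# data_processor.py
class DataProcessor:
    def fit(self, data):
        # To be implemented by a concrete processor; should return self
        raise NotImplementedError

    def transform(self, data):
        # To be implemented by a concrete processor
        raise NotImplementedError

    def fit_transform(self, data):
        return self.fit(data).transform(data)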
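To see how such a skeleton is meant to be filled in later, here is a small sketch with a concrete subclass. MeanModel is a hypothetical baseline invented for this example. Note that NotImplementedError is the exception class to raise in placeholders; NotImplemented is a special constant intended for binary-operator methods and cannot be raised.

```python
class Model:
    def fit(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError

    def fit_predict(self, X, y):
        # Works for any subclass whose fit() returns self.
        return self.fit(X, y).predict(X)


class MeanModel(Model):
    """Hypothetical baseline that always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


predictions = MeanModel().fit_predict([1, 2, 3], [2, 4, 6])
print(predictions)
```

Returning self from fit() is what allows the chained call in fit_predict(), so every concrete model should keep that convention.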