Deep Dive: How to structure your code for Machine Learning Development

Deep Dive in Data Science Fundamentals

Apr 08, 2023

In general, I believe Machine Learning deserves the rigor of any software engineering field. Training code should be reusable, modular, scalable, testable, maintainable, and documented. Today, I want to show you my template for developing quality machine learning code. We will look at:

What does coding mean?
Designing:
- System design
- Deployment process
- Class diagram
The code structure:
- Folder structure
- Setting up the virtual environment
- The code skeleton
- The applications
- Implementing the training pipeline
- Saving the model binary
Improving the code readability:
- Docstrings
- Type hinting
Packaging the project

This Deep Dive is part of the Data Science Fundamentals series.

What does coding mean?

I often see Data Scientists or Machine Learning Engineers developing in Jupyter notebooks, copy-pasting their code from one place to another, and that gives me nightmares! When running ML experiments, Jupyter is prone to human error because cells can be run in any order. You should be able to capture all the configurations of an experiment to ensure reproducibility. Jupyter can be used to call a training package or an API and manually orchestrate experiments, but fully developing in Jupyter is a risky practice. When training a model, you should make sure that the data passes through the exact same feature processing pipelines as it will at serving time. This means the same classes and methods, as well as identical package versions and hardware (GPU vs. CPU). Personally, I prototype in Jupyter but develop in PyCharm or VS Code.
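One way to capture the configuration of an experiment is to keep it in a single immutable object and store it next to the results. The following is a minimal sketch only: the `ExperimentConfig` class, its field names, and the `snapshot()` and `config_id()` helpers are hypothetical and not part of the template described in this post.

```python
import hashlib
import json
import platform
import sys
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    # Hypothetical fields; replace them with whatever defines your experiment.
    model_name: str
    learning_rate: float
    n_estimators: int
    random_seed: int


def snapshot(config: ExperimentConfig) -> dict:
    """Bundle the hyperparameters with basic environment information."""
    return {
        "config": asdict(config),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


def config_id(config: ExperimentConfig) -> str:
    """Deterministic short hash of the hyperparameters, usable as a run id."""
    payload = json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


config = ExperimentConfig("gradient_boosting", 0.1, 200, 42)
record = snapshot(config)
print(config_id(config))
print(json.dumps(record, indent=2))
```

Because the hash only depends on the hyperparameters, two runs with the same configuration get the same identifier, which makes accidental duplicates and silent changes easier to spot.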

When writing code, focus on the following aspects:

Reusability: It is the capacity to reuse code in another context or project without the need for significant modifications. Code reusability can be achieved in several ways, such as through the use of libraries, frameworks, modules, and object-oriented programming techniques. In addition, good documentation and clear code organization can also facilitate code reuse by making it easier for other developers to understand and use the code.
Modularity: It is the practice of breaking down a software system into smaller, independent modules or components that can be developed, tested, and maintained separately.
Scalability: It refers to the ability of a software development codebase to accommodate the growth and evolution of a software system over time. In other words, it refers to the ability of the codebase to adapt to changing requirements, features, and functionalities while maintaining its overall structure, quality, and performance. To achieve codebase scalability, it is important to establish clear coding standards and practices from the outset, such as the use of version control, code review, and continuous integration and deployment. In addition, it is important to prioritize code maintainability and readability, as well as the use of well-documented code and clear naming conventions.
Testability: It refers to the ease with which software code can be tested to ensure that it meets the requirements and specifications of the software system. It can be achieved by designing code with testing in mind rather than treating testing as an afterthought. This can involve writing code that is modular, well-organized, and easy to understand and maintain, as well as using tools and techniques that support automated testing and continuous integration.
Maintainability: It refers to the ease with which software code can be modified, updated, and extended over time.
Documentation: It provides a means for developers, users, and other stakeholders to understand how the software system works, what its features are, and how to interact with it.

Designing

System design

In Machine Learning, as in any engineering domain, no line of code should be written until a proper design is established. Having a design means that we were able to translate a business problem into a machine learning solution (if ML is indeed the right solution to the problem!). For simplicity, let’s assume we want to build a mobile application where a user needs machine learning predictions displayed on the screen, for example personalized product recommendations. The process could work as follows:

The mobile application requests personalized predictions from the backend server.
The backend server fetches predictions from a database.
We figured that daily batch predictions were the most appropriate setup for now, and the predictions get updated daily by the machine learning service.
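To make the batch flow concrete, here is a small sketch in which an in-memory SQLite database stands in for the real prediction store. The table layout and the `write_daily_batch()` and `fetch_recommendations()` functions are assumptions made for illustration; the actual database and schema are not specified here.

```python
import sqlite3
from datetime import date

# An in-memory SQLite database plays the role of the prediction store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions ("
    "user_id INTEGER, product_id INTEGER, score REAL, batch_date TEXT)"
)


def write_daily_batch(conn, rows, batch_date):
    """ML service side: replace the stored predictions with today's batch."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM predictions")
        conn.executemany(
            "INSERT INTO predictions VALUES (?, ?, ?, ?)",
            [(u, p, s, batch_date.isoformat()) for u, p, s in rows],
        )


def fetch_recommendations(conn, user_id, top_k=3):
    """Backend side: read the precomputed predictions for one user."""
    cursor = conn.execute(
        "SELECT product_id FROM predictions WHERE user_id = ? "
        "ORDER BY score DESC LIMIT ?",
        (user_id, top_k),
    )
    return [row[0] for row in cursor.fetchall()]


write_daily_batch(
    conn, [(1, 10, 0.9), (1, 11, 0.4), (2, 10, 0.7)], date(2023, 4, 8)
)
print(fetch_recommendations(conn, user_id=1))  # [10, 11]
```

The backend never calls the model directly: it only reads what the last batch run wrote, which is what makes the daily batch setup simple to operate.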

Deployment process

Before we can understand how to develop our model, we need to understand how we will deploy it. Let’s assume that, for our purposes, the inference application will be packaged as a Docker image. The image can be pushed to a container registry such as AWS ECR (Amazon Elastic Container Registry) or Docker Hub. We can have an orchestration system such as Airflow that spins up the inference service, pulls the image from the registry, and runs the inference application.

Class diagram

Now that we know what we need to build and how it will be deployed, it is becoming much clearer how we need to structure our codebase. We will build two applications: an inference and a training application. To minimize potential human errors, it is imperative that the modules used at training time are the same as the ones used at inference time. Let’s look at the following class diagram:

The application layer: This is the part of the code that captures the applications’ logic. Think about those modules as “buttons” that start the inference or training processes. Each of those applications will have a run() function that serves as a handle for the Docker image to start the process.
The data layer: This is the abstraction layer that moves data in and out of the applications. I am calling it the "data" layer, but it includes anything that is exchanged with the outside world: the data, the model binaries, the data transformers, the training metadata… In this batch use case, we just need one function that brings the data into the applications, get_data(), and another that puts predictions back into the database, put_data(). The DataConnector moves data around. The ObjectConnector is the actor responsible for transferring model binaries and data transformation pipelines using get_object() and put_object().
The machine learning layer: This is the module where all the different components of machine learning will live. The three components of model training are:
- Learning the parameters of the model: the Model will take care of that with the fit() method. For inferring, we use the predict() method.
- Learning the feature transformations: We may need to normalize features, perform Box-Cox transformations, one-hot encode, etc. The DataProcessor will take care of that with the fit() and transform() methods.
- Learning the hyperparameters of the model and data pipeline: the CrossValidator will handle this task with its fit() function.
The TrainingPipeline will handle the logic between the different components.
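As an illustration of the data layer, the sketch below declares the two connectors as abstract interfaces and adds in-memory stand-ins. Only the method names get_data(), put_data(), get_object(), and put_object() come from the class diagram; the argument lists and the two `InMemory...` classes are assumptions made for this example.

```python
import pickle
from abc import ABC, abstractmethod


class DataConnector(ABC):
    """Moves data in and out of the applications."""

    @abstractmethod
    def get_data(self):
        ...

    @abstractmethod
    def put_data(self, data):
        ...


class ObjectConnector(ABC):
    """Transfers model binaries and data transformation pipelines."""

    @abstractmethod
    def get_object(self, name):
        ...

    @abstractmethod
    def put_object(self, obj, name):
        ...


class InMemoryDataConnector(DataConnector):
    """Hypothetical stand-in for a database-backed connector."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.predictions = []

    def get_data(self):
        return list(self.rows)

    def put_data(self, data):
        self.predictions = list(data)


class InMemoryObjectConnector(ObjectConnector):
    """Hypothetical stand-in for an object store; keeps pickled bytes in a dict."""

    def __init__(self):
        self._store = {}

    def get_object(self, name):
        return pickle.loads(self._store[name])

    def put_object(self, obj, name):
        self._store[name] = pickle.dumps(obj)


store = InMemoryObjectConnector()
store.put_object({"weights": [0.1, 0.2]}, "model")
restored = store.get_object("model")
print(restored)
```

Keeping the applications dependent on the abstract interfaces only means the storage backend can be swapped, for example for tests, without touching the training or inference logic.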

The code structure

Folder structure

Now that we have a class diagram, we need to map it into actual code. Let’s call the project machine_learning_service. There are many ways to do it but we will organize the project as follows:

machine_learning_service/
├── docs/
├── src/
├── tests/

The docs folder: for the documents
The src folder: for the source code (the actual codebase)
The tests folder: for the unit tests

Setting up the virtual environment

Because we will need to Dockerize this project at some point, it is important to control the Python version and packages we use locally. For that, we will create a virtual environment called env with venv. Within the project folder, we run

python -m venv ./env

to create it and

source ./env/bin/activate

to activate it. Now we should see the following folder structure

machine_learning_service/
├── docs/
├── env/
├── src/
├── tests/

Let’s make sure the right Python version is running

which python 
> ~/.../machine_learning_service/env/bin/python

so we use the Python binaries of the virtual environment. Let’s make sure it is Python 3

python -V
> Python 3.9.7

Ok, we are good to go!
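The same checks can be run from within Python, which is handy at the top of a script. This is a small sketch using only the standard library; the `in_virtual_environment()` helper name is made up for the example.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter runs from a venv-style virtual environment."""
    # Inside a venv, sys.prefix points to the environment while
    # sys.base_prefix still points to the base Python installation.
    return sys.prefix != sys.base_prefix


print(sys.executable)            # path of the interpreter in use
print(in_virtual_environment())  # expected to be True once env is activated
print(sys.version_info >= (3,))  # confirms we are running Python 3
```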

The code skeleton

Within the source folder, let’s create the different modules we have in the class diagram:

machine_learning_service/
├── docs/
├── env/
├── src/
│   ├── applications/
│   │   ├── training.py
│   │   └── inference.py
│   ├── data_layer/
│   │   ├── data_connector.py
│   │   └── object_connector.py
│   ├── machine_learning/
│   │   ├── training_pipeline.py
│   │   ├── model.py
│   │   ├── data_processor.py
│   │   └── cross_validator.py
├── tests/
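If you prefer to script this step, the tree above can be generated with pathlib. The sketch below is one possible way to do it; the `SKELETON` mapping and the `create_skeleton()` function are hypothetical helpers, and the env folder is left out because venv creates it.

```python
import tempfile
from pathlib import Path

# Folder -> files, mirroring the tree above.
SKELETON = {
    "docs": [],
    "src/applications": ["training.py", "inference.py"],
    "src/data_layer": ["data_connector.py", "object_connector.py"],
    "src/machine_learning": [
        "training_pipeline.py",
        "model.py",
        "data_processor.py",
        "cross_validator.py",
    ],
    "tests": [],
}


def create_skeleton(root: Path) -> list:
    """Create the folders and empty module files; return the files created."""
    created = []
    for folder, files in SKELETON.items():
        directory = root / folder
        directory.mkdir(parents=True, exist_ok=True)
        for name in files:
            path = directory / name
            path.touch(exist_ok=True)  # does not overwrite existing content
            created.append(path)
    return created


# Demonstrated in a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    files = create_skeleton(Path(tmp) / "machine_learning_service")
    print(len(files))  # 8
```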

For now, let’s have empty classes.

The Model class

# model.py
class Model:

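    def fit(self, X, y):
        # Concrete models learn their parameters here and return self,
        # so that calls can be chained as in fit_predict().
        raise NotImplementedError

    def predict(self, X):
        # Concrete models return predictions for X here.
        raise NotImplementedError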

    def fit_predict(self, X, y):
        return self.fit(X, y).predict(X)
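To see how a concrete model would plug into this interface, here is a toy subclass. `MeanModel` is a hypothetical baseline invented for this example, and the base class is repeated so that the snippet runs on its own.

```python
class Model:
    def fit(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError

    def fit_predict(self, X, y):
        return self.fit(X, y).predict(X)


class MeanModel(Model):
    """Hypothetical baseline: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self  # returning self is what makes fit_predict() chainable

    def predict(self, X):
        return [self.mean_ for _ in X]


predictions = MeanModel().fit_predict([[1], [2], [3]], [2.0, 4.0, 6.0])
print(predictions)  # [4.0, 4.0, 4.0]
```

Note that fit_predict() only works if every fit() implementation returns self.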

The DataProcessor class

# data_processor.py
class DataProcessor:

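    def fit(self, data):
        # Concrete processors learn the transformation here and return self,
        # so that calls can be chained as in fit_transform().
        raise NotImplementedError

    def transform(self, data):
        # Concrete processors return the transformed data here.
        raise NotImplementedError

    def fit_transform(self, data):
        return self.fit(data).transform(data)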
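The value of separating fit() from transform() is that the statistics learned at training time can be reused unchanged at inference time. The sketch below illustrates this with a hypothetical `MinMaxProcessor` working on a plain list of numbers; the base class is repeated so that the snippet runs on its own.

```python
class DataProcessor:
    def fit(self, data):
        raise NotImplementedError

    def transform(self, data):
        raise NotImplementedError

    def fit_transform(self, data):
        return self.fit(data).transform(data)


class MinMaxProcessor(DataProcessor):
    """Hypothetical processor: rescales values using the training min and max."""

    def fit(self, data):
        self.min_ = min(data)
        self.max_ = max(data)
        return self

    def transform(self, data):
        # Fall back to 1.0 to avoid dividing by zero on constant data.
        span = (self.max_ - self.min_) or 1.0
        return [(x - self.min_) / span for x in data]


processor = MinMaxProcessor()
train = processor.fit_transform([0.0, 5.0, 10.0])
serve = processor.transform([2.5, 20.0])  # reuses the training min and max
print(train)  # [0.0, 0.5, 1.0]
print(serve)  # [0.25, 2.0]
```

The serving value 20.0 maps to 2.0 because it lies outside the range seen during training; the processor does not refit on serving data.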
