The architectural system components
The optimization space
The optimization strategies
The experimental design
The optimization methods
What do we want to build?
Let’s assume we want to build an enterprise solution to automate the development of Machine Learning models. The user should be able to select a training dataset and obtain a trained model ready to be deployed. We will assume that the system is designed only for large datasets and that we only allow tabular data. The user should be able to impose specific constraints on the model architecture to control the level of experimentation. We must also consider that multiple users should be able to use the system at the same time. We need to provide validation metrics so the user can make an educated decision about a possible deployment. Finally, the experiments should be reproducible and versioned, which implies capturing the metadata of the training runs alongside the training data.
Let’s summarize the requirements:
We need to be able to select a training dataset, and the dataset should be tabular
The user can impose constraints on the model architecture
Multiple users should be able to use the tools at the same time
The result should be a trained model with a version and a way to reproduce the training experiment
We should capture validation metrics alongside other metrics like latency and memory utilization
We assume supervised learning only
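To make these requirements concrete, here is a minimal sketch of what a user's training request could look like. The field names (dataset_uri, architecture_constraints, and so on) are illustrative assumptions for this post, not a fixed API:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrainingRequest:
    """Illustrative payload a user could submit to start a training run."""

    user_id: str          # multiple users can submit requests concurrently
    dataset_uri: str      # pointer to a tabular training dataset
    target_column: str    # supervised learning only, so a target is required
    feature_columns: List[str] = field(default_factory=list)
    # User-defined constraints on the model architecture, e.g. {"max_depth": 8}.
    architecture_constraints: Dict[str, object] = field(default_factory=dict)
    random_seed: int = 42  # captured so the run can be reproduced
```

Capturing the request as a structured, versioned object is what later makes a run reproducible: the same request applied to the same data should yield the same model.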
The architectural system components
Let’s translate those business requirements into technical ones. To build such a system, we will need to consider the following components:
Frontend client: we need a way to allow the user to input parameters to set up the model training and start the process. We also need a way for the user to visualize the results of a specific run along with its related metrics. We could also provide a way to compare training runs for a better model selection process. This frontend can allow for an effortless restart of the runs while ensuring reproducibility.
A backend server: this is where the logic behind the frontend is implemented. It connects to a Run Metadata database that captures the different run parameters and metrics. This database should contain all the information necessary to restart identical training runs. MLflow is an excellent example of a training run management system (a minimal logging sketch follows this list of components).
A message queue for training requests: because we may have multiple users submitting training requests simultaneously, we need to buffer those requests. We also need this component if we want to keep some control over the level of machine utilization: if we have a cap on the number of training servers we can use simultaneously, it is better to buffer requests until enough machines are available for the next request (a small enqueueing sketch also follows this list).
An orchestration scheduler: a machine learning training process takes a long time, and it is better to capture the different steps using an orchestration system. The orchestration system can plan the various stages and restart one in case of failure. Airflow and Kubeflow are examples of such systems; Kubeflow only makes sense for large organizations running on Kubernetes. The scheduler will monitor the message queue and trigger a training pipeline once a user request is received.
A training pipeline: when the scheduler receives a user request, it triggers a training pipeline. The different steps are captured in a directed acyclic graph (DAG) and are handled by the orchestration workers (see the Airflow DAG sketch after the module descriptions below).
The Data pull module: we need to establish the logic to pull the correct data from the feature store. The specific features and row samples need to be specified: the user could define the features, and the row samples could be chosen based on a date range or other variables related to the specific learning task. Once the data is pulled, it must be validated to ensure that it follows the requirements for the particular training run and is consistent with the feature metadata (expected ranges, expected categories, expected statistics, etc.).
The Data processing module: once the data is ready, we need, at the very least, to carve out a validation set for model performance evaluation. This module could include additional logic for some predefined feature transformation.
The Model selection module: this is where most of the processing time will be spent. That module handles the model selection process, including choosing the ML model, the hyperparameters, and the model architecture, and performing feature selection. The result of this module is a trained, optimal model.
The model validation module: after training the model, we need to capture the different validation metrics that will help the user make an educated decision about the resulting model. Beyond ML metrics, we must capture information about hardware utilization, such as memory and CPU usage. Once we have those metrics, we need to send the resulting metadata to the Run Metadata database.
The model push module: the resulting model needs to be pushed to a model registry along with its version number.
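To illustrate the Run Metadata database, here is a minimal sketch of how the backend could log a run with MLflow's Python API. The experiment name, parameters, and metric values are placeholders chosen for the example:

```python
import mlflow

# Group all runs of the system under one experiment.
mlflow.set_experiment("automl-training-runs")

with mlflow.start_run(run_name="user-42-request-001"):
    # Parameters needed to reproduce an identical training run.
    mlflow.log_param("dataset_uri", "s3://datasets/churn.parquet")
    mlflow.log_param("target_column", "churned")
    mlflow.log_param("random_seed", 42)
    mlflow.log_param("architecture_constraints", {"max_depth": 8})

    # Validation and resource metrics captured by the model validation module.
    mlflow.log_metric("validation_auc", 0.91)        # placeholder value
    mlflow.log_metric("inference_latency_ms", 12.5)  # placeholder value
    mlflow.log_metric("peak_memory_mb", 850.0)       # placeholder value
```

The MLflow tracking UI then provides the run comparison described in the frontend component, but any metadata store holding the same information would do.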
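For the message queue, the specific broker is an implementation detail (RabbitMQ, SQS, Kafka, and Redis are all common choices). As one hedged example, here is what buffering requests could look like with Redis used as a simple FIFO queue; the queue name and connection details are assumptions:

```python
import json
from typing import Optional

import redis  # assumes a reachable Redis instance; any message broker would work

client = redis.Redis(host="localhost", port=6379)
QUEUE_KEY = "training-requests"


def enqueue_request(request: dict) -> None:
    """Called by the backend when a user submits a training run."""
    client.lpush(QUEUE_KEY, json.dumps(request))


def dequeue_request() -> Optional[dict]:
    """Called by the scheduler when enough training machines are available."""
    raw = client.rpop(QUEUE_KEY)
    return json.loads(raw) if raw is not None else None
```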
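Putting the pipeline modules together, here is a hedged sketch of the training pipeline expressed as an Airflow (2.x) DAG. The task callables are empty placeholders standing in for the modules described above, and the DAG id and arguments are illustrative assumptions rather than a reference implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables standing in for the actual module implementations.
def pull_data(**context): ...
def process_data(**context): ...
def select_model(**context): ...
def validate_model(**context): ...
def push_model(**context): ...


with DAG(
    dag_id="automl_training_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule=None,   # triggered by the scheduler when a request is dequeued
    catchup=False,
) as dag:
    data_pull = PythonOperator(task_id="data_pull", python_callable=pull_data)
    data_processing = PythonOperator(task_id="data_processing", python_callable=process_data)
    model_selection = PythonOperator(task_id="model_selection", python_callable=select_model)
    model_validation = PythonOperator(task_id="model_validation", python_callable=validate_model)
    model_push = PythonOperator(task_id="model_push", python_callable=push_model)

    # The DAG captures the ordering of the pipeline modules, so a failed step
    # can be retried without rerunning the whole pipeline.
    data_pull >> data_processing >> model_selection >> model_validation >> model_push
```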
In the rest of this post, we will focus on the Model Selection module.
The optimization space
Model selection is the component that involves the core ML algorithmic work. When we talk about “model selection”, we mean searching for the optimal model for a specific training dataset. If we have features X and a target Y, we would like to learn the optimal transformation F from the data.
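One standard way to make this precise (my own hedged formulation; the notation in the remainder of the post may differ) is to frame model selection as searching a space of candidate models for the transformation that minimizes the expected loss between the target and the prediction:

```latex
F^{*} = \underset{F \in \mathcal{F}}{\arg\min} \; \mathbb{E}_{(X, Y)}\big[\, L\big(Y, F(X)\big) \,\big]
```

Here L is a task-appropriate loss (for example, squared error for regression or cross-entropy for classification), and the search space covers the choice of algorithm, hyperparameters, architecture, and features, restricted by any user-defined constraints.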