From Experiments to Deployment: MLflow 101
Feb 24, 2024

Afaque Umer

Picture this: You’ve got a brand new business idea, and the data you need is right at your fingertips. You’re all pumped up to dive into creating that fantastic machine-learning model 🤖. But, let’s be real, this journey is no cakewalk! You’ll be experimenting like crazy, dealing with data preprocessing, picking algorithms, and tweaking hyperparameters till you’re dizzy 😵‍💫. As the project gets trickier, it’s like trying to catch smoke — you lose track of all those wild experiments and brilliant ideas you had along the way. And trust me, remembering all that is harder than herding cats 😹

But wait, there’s more! Once you’ve got that model, you gotta deploy it like a champ! And with ever-changing data and customer needs, you’ll be retraining your model more times than you change your socks! It’s like a never-ending roller coaster, and you need a rock-solid solution to keep it all together 🔗. Enter MLOps! It’s the secret sauce that brings order to the chaos ⚡

Alright, folks, now that we’ve got the Why behind us, let’s dive into the What and the juicy How in this blog.

Let’s take a look at the pipeline that we are gonna build by the end of this blog 👆

Hold on tight, ’cause this ain’t gonna be a quick read! We’re crafting an end-to-end MLOps solution, and to keep it real, I had to split it into three sections.

Section 1: We’ll lay down the foundations and theories 📜

Section 2: Now that’s where the action is! We’re building a spam filter and tracking all those crazy experiments with MLflow 🥼🧪

Section 3: We’ll focus on the real deal — deploying and monitoring our champ model, making it production-ready 🚀

Let’s rock and roll with MLOps!

MLOps represents a collection of methodologies and industry best practices aimed at assisting data scientists in simplifying and automating the entire model training, deployment, and management lifecycle within a large-scale production environment.

It is gradually emerging as a distinct and standalone approach for managing the entire machine-learning lifecycle. The essential stages in the MLOps process include the following:

Data Gathering

Data Analysis

Data Transformation/Preparation

Model Training & Development

Model Validation

Model Serving

Model Monitoring

Model Re-training

How We’re Gonna Implement It? While several options are available, such as Neptune, Comet, and Kubeflow, we will stick with MLflow. So, let’s get acquainted with MLflow and dive into its principles.

MLflow is like the Swiss army knife of machine learning — it’s super versatile and open-source, helping you manage your entire ML journey like a boss. It plays nice with all the big-shot ML libraries (TensorFlow, PyTorch, Scikit-learn, spaCy, Fastai, Statsmodels, etc.), but you can also use it with any other library, algorithm, or deployment tool you prefer. Plus, it’s designed to be super customizable — you can easily add new workflows, libraries, and tools using custom plugins.

MLflow follows a modular and API-based design philosophy, breaking its functionality into four distinct parts: MLflow Tracking (logging parameters, metrics, and artifacts for each run), MLflow Projects (packaging code so runs are reproducible), MLflow Models (a standard format for packaging models so they can be served with different tools), and MLflow Registry (a central store for versioning models and managing their stage transitions).

Now, let’s check out each of these parts one by one!

That’s a wrap for our basic understanding of MLflow’s offerings. For more in-depth details, refer to its official documentation here 👉📄. Now, armed with this knowledge, let’s dive into Section 2. We’ll kick things off by creating a simple spam filter app, and then we’ll go full-on experiment mode, tracking different experiments with unique runs!

Alright, folks, get ready for an exciting journey! Before we dive into the lab and get our hands dirty with experiments, let’s lay out our plan of attack so we know what we’re building. First up, we’re gonna rock a spam classifier using the random forest classifier (I know Multinomial NB works better for doc classification, but hey, we wanna play around with random forest’s hyperparams). We’ll intentionally make it not-so-good at first, just for the thrill of it. Then, we’ll unleash our creativity and track various runs, tweaking hyperparams and experimenting with cool stuff like Bag of Words and Tfidf. And guess what? We will use MLflow UI like a boss for all that sweet tracking action and prep ourselves for the next section. So buckle up, 'cause we’re gonna have a blast! 🧪💥

For this task, we will use the SMS Spam Collection Dataset available on Kaggle. This dataset contains 5,574 SMS messages in English, tagged as ham (legitimate) or spam. However, the dataset is imbalanced, with around 4,825 ham labels. To soften that skew and keep things concise, I decided to drop some ham samples, reducing them to around 3,000, and saved the resulting CSV for further use in our model and text preprocessing. Feel free to choose your own approach based on your needs — this was just for brevity. Here’s a code snippet showing how I achieved this.
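A minimal sketch of that downsampling step, assuming the raw Kaggle file is spam.csv with its usual v1/v2 columns; the filenames spam.csv and spam_reduced.csv are placeholders I chose for illustration.

```python
import pandas as pd

# Load the Kaggle SMS Spam Collection CSV (the public file uses v1 = label, v2 = message)
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "message"]

# Keep every spam message, but downsample ham to roughly 3,000 rows to soften the imbalance
ham = df[df["label"] == "ham"].sample(n=3000, random_state=42)
spam = df[df["label"] == "spam"]

balanced = pd.concat([ham, spam]).sample(frac=1, random_state=42).reset_index(drop=True)
balanced.to_csv("spam_reduced.csv", index=False)
print(balanced["label"].value_counts())
```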

Now that we have the data ready to roll, let’s swiftly build a basic classifier. I won’t bore you with the old cliché that computers can’t grasp text, hence the need to vectorize it into a numerical representation before feeding it to ML/DL algorithms. If you need a refresher or have any doubts, don’t fret, I’ve got you covered in one of my previous blogs. You know that already, right? 🤗

levelup.gitconnected.com

Alright, let’s get down to business! We’ll load the data and preprocess the messages to remove stop words, punctuation, and more. We’ll even stem or lemmatize them for good measure. Then comes the exciting part: vectorizing the data to get some amazing features to work with. Next up, we’ll split the data for training and testing, fit it into the random forest classifier, and make those juicy predictions on the test set. Finally, it’s evaluation time to see how our model performs! Let’s walk the talk ⚡
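Here’s a rough sketch of what those steps could look like, assuming NLTK for stop words, stemming, and lemmatization, and scikit-learn for vectorizing, splitting, and the random forest. The function names (preprocess, vectorize, train_and_evaluate) are my own placeholders, and the commented-out lines mark the alternatives we’ll toggle between in the experiments.

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# import nltk; nltk.download("stopwords"); nltk.download("wordnet")  # one-time downloads

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))


def preprocess(text: str) -> str:
    """Lowercase, strip punctuation/digits, and drop stop words."""
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    tokens = [t for t in tokens if t not in stop_words]
    # tokens = [stemmer.stem(t) for t in tokens]          # option: stemming
    # tokens = [lemmatizer.lemmatize(t) for t in tokens]  # option: lemmatization
    return " ".join(tokens)


def vectorize(corpus, method="bow"):
    """Turn cleaned messages into features: Bag of Words or TF-IDF."""
    vectorizer = CountVectorizer() if method == "bow" else TfidfVectorizer()
    return vectorizer.fit_transform(corpus), vectorizer


def train_and_evaluate(X, y, n_estimators=10, max_depth=2):
    """Split, fit a random forest, and score it on the held-out test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=42
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, preds),
        "precision": precision_score(y_test, preds, pos_label="spam"),
        "recall": recall_score(y_test, preds, pos_label="spam"),
        "f1": f1_score(y_test, preds, pos_label="spam"),
    }
    return model, metrics
```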

In this code, I’ve provided several options for experiments as comments, such as preprocessing with or without stop words, lemmatizing, stemming, etc. Similarly, for vectorizing, you can choose between Bag of Words, TF-IDF, or embeddings. Now, let’s get to the fun part! We’ll train our first model by calling these functions sequentially and passing hyperparameters.
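Calling those helpers in sequence might look like this; the deliberately tiny n_estimators and max_depth are exactly what make this first model so underwhelming.

```python
import pandas as pd

df = pd.read_csv("spam_reduced.csv")
df["clean"] = df["message"].apply(preprocess)

# Raw-ish tokens + Bag of Words, with an intentionally weak random forest
X, vectorizer = vectorize(df["clean"], method="bow")
model, metrics = train_and_evaluate(X, df["label"], n_estimators=5, max_depth=2)
print(metrics)  # expect a painfully low precision and F1 with these settings
```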

Yeah, I totally agree, this model is pretty much useless. The precision is nearly zero, which drags the F1 score close to 0 as well. Since we have a slight class imbalance, the F1 score matters more than accuracy, because it balances precision and recall (it’s their harmonic mean) — that’s its magic! So, here we have it — our very first terrible, nonsensical, and useless model. But hey, no worries, it’s all part of the learning journey 🪜.

Now, let’s fire up MLflow and get ready to experiment with different options and hyperparameters. Once we fine-tune things, it will all start to make sense. We’ll be able to visualize and analyze our progress like pros!

First things first, let’s get MLflow up and running. To keep things neat, it’s recommended to set up a virtual environment. You can simply install MLflow using pip 👉pip install mlflow

Once it’s installed, fire up the MLflow UI by running 👉mlflow ui in the terminal (make sure it’s within the virtual environment where you installed MLflow). This will launch the MLflow server on your local browser hosted at http://localhost:5000. You will see a page similar to 👇

Since we haven’t recorded anything yet, there won’t be much to check on the UI. MLflow offers several tracking options, like local, local with a database, on a server, or even on the cloud. For this project, we’ll stick to everything local for now. Once we get the hang of the local setup, passing the tracking server URI and configuring a few parameters can be done later — the underlying principles remain the same.

Now, let’s dive into the fun part — storing metrics, parameters, and even models, visualizations, or any other objects, also known as artifacts.

MLflow’s tracking functionality can be seen as an evolution of (or replacement for) traditional logging in the context of machine learning development. With traditional logging, you would typically use custom string formatting to record information such as hyperparameters, metrics, and other relevant details during model training and evaluation. That approach quickly becomes tedious and error-prone, especially when dealing with a large number of experiments or complex machine-learning pipelines. MLflow automates the process of recording and organizing this information, making it easier to manage and compare experiments and leading to more efficient and reproducible machine-learning workflows.

MLflow tracking is centered around three main functions: log_param for logging parameters, log_metric for logging metrics, and log_artifact for logging artifacts (e.g., model files or visualizations). These functions facilitate organized and standardized tracking of experiment-related data during the machine learning development process.

When logging a single parameter or metric, you pass a key and its value (log_param / log_metric); when logging several at once, you pass a dictionary of key-value pairs (log_params / log_metrics). Here’s a code snippet to illustrate the process.
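A minimal sketch — the parameter names, metric values, and the confusion_matrix.png artifact are purely illustrative:

```python
import mlflow

with mlflow.start_run():
    # single key-value pairs
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("f1", 0.87)

    # several at once, passed as dictionaries
    mlflow.log_params({"max_depth": 10, "vectorizer": "tfidf"})
    mlflow.log_metrics({"accuracy": 0.95, "precision": 0.91})

    # any file (plots, models, CSVs, ...) can be stored as an artifact
    mlflow.log_artifact("confusion_matrix.png")
```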

An experiment acts as a container representing a group of related machine learning runs, providing a logical grouping for runs with a shared objective. Each experiment has a unique experiment ID, and you can assign a user-friendly name for easy identification.

On the other hand, a run corresponds to the execution of your machine-learning code within an experiment. You can have multiple runs with different configurations within a single experiment, and each run is assigned a unique run ID. The tracking information, which includes parameters, metrics, and artifacts, is stored in a backend store, such as a local file system, database (e.g., SQLite or MySQL), or remote cloud storage (e.g., AWS S3 or Azure Blob Storage).

MLflow offers a unified API to log and track these experiment details, regardless of the backend store in use. This streamlined approach allows for effortless retrieval and comparison of experiment results, enhancing the transparency and manageability of the machine learning development process.

To begin, you can create an experiment using either mlflow.create_experiment() or the simpler mlflow.set_experiment("your_exp_name"). If an experiment with that name already exists, it will be used; otherwise, a new one will be created to log runs under.

Next, call mlflow.start_run() to initialize the current active run and start logging. After logging the necessary information, close the run using mlflow.end_run().

Here’s a basic snippet illustrating the process:
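A basic sketch, with a hypothetical experiment name and made-up values:

```python
import mlflow

mlflow.set_experiment("SpamFilter-Demo")  # reuses the experiment if it already exists

# explicit start/end
mlflow.start_run(run_name="baseline")
mlflow.log_param("n_estimators", 5)
mlflow.log_metric("f1", 0.12)
mlflow.end_run()

# or let a context manager close the run for you
with mlflow.start_run(run_name="baseline_tfidf"):
    mlflow.log_param("vectorizer", "tfidf")
    mlflow.log_metric("f1", 0.25)
```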

Instead of executing scripts via the shell and providing parameters there, we’ll opt for a user-friendly approach. Let’s build a basic UI that allows users to input either the experiment name or specific hyperparameter values. When the train button is clicked, it will invoke the train function with the specified inputs. Additionally, we’ll explore how to query experiments and runs once we have a substantial number of runs saved.

With this interactive UI, users can effortlessly experiment with different configurations and track their runs for more streamlined machine-learning development. I won’t delve into the specifics of Streamlit since the code is straightforward. I’ve made minor adjustments to the earlier train function for MLflow logging, as well as implemented custom theme settings. Before running an experiment, users are prompted to choose between entering a new experiment name (which logs runs in that experiment) or selecting an existing experiment from the dropdown menu, generated using mlflow.search_experiments(). Additionally, users can easily fine-tune hyperparameters as needed. Here is the code for the application 👇
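Below is a trimmed-down sketch of what such a trainer app could look like. It reuses the preprocess/vectorize/train_and_evaluate helpers from earlier (imported here from a placeholder train_utils module) and logs parameters, metrics, the model, and the fitted vectorizer to MLflow.

```python
import joblib
import mlflow
import mlflow.sklearn
import pandas as pd
import streamlit as st

# helpers from the earlier sketch; the module name is a placeholder
from train_utils import preprocess, vectorize, train_and_evaluate

st.title("🧪 Spam Filter Trainer")

# Pick an existing experiment or type a new one
existing = [exp.name for exp in mlflow.search_experiments()]
mode = st.radio("Experiment", ["New", "Existing"])
exp_name = (
    st.text_input("New experiment name", "RawToken")
    if mode == "New"
    else st.selectbox("Pick an experiment", existing)
)

# Hyperparameters and preprocessing choices
n_estimators = st.slider("n_estimators", 5, 300, 50)
max_depth = st.slider("max_depth", 2, 50, 5)
method = st.selectbox("Vectorizer", ["bow", "tfidf"])

if st.button("Train 🚀"):
    mlflow.set_experiment(exp_name)
    df = pd.read_csv("spam_reduced.csv")
    df["clean"] = df["message"].apply(preprocess)
    X, vectorizer = vectorize(df["clean"], method=method)

    with mlflow.start_run():
        model, metrics = train_and_evaluate(
            X, df["label"], n_estimators=n_estimators, max_depth=max_depth
        )
        mlflow.log_params(
            {"n_estimators": n_estimators, "max_depth": max_depth, "vectorizer": method}
        )
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(model, "model")

        # keep the fitted vectorizer next to the model so inference can reuse it
        joblib.dump(vectorizer, "vectorizer.pkl")
        mlflow.log_artifact("vectorizer.pkl")

        st.success(f"Run logged! F1 = {metrics['f1']:.3f}")
```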

and here is what the app will look like 🚀

Now that our app is ready, let’s proceed to the experiments. For the first experiment, I’ll use the words in their raw form without stemming or lemmatizing, focusing only on stop words and punctuation removal, and applying Bag of Words (BOW) to the text data for text representation. Then, in successive runs, I will fine-tune a few hyperparameters. We’ll name this experiment RawToken.

After logging a few runs, we can launch the MLflow UI from the Streamlit app, and it will look something like this

Alright, now we’ve got the RawToken experiment listed under Experiments and a bunch of runs under the Run column, all associated with this experiment. You can pick one, a couple, or all runs and hit the compare button to check out their results side by side. Once inside the compare section, you can select the metrics or parameters you want to compare or visualize.

There’s more to explore than you might expect, and you’ll figure out the best approach once you know what you’re looking for and why!

Alright, we’ve completed one experiment, but it didn’t turn out as expected, and that’s okay! Now, we need to get some results with at least some F1 score to avoid any potential embarrassment. We knew this would happen since we used raw tokens and kept the number of trees and depth quite low. So, let’s dive into a couple of new experiments, one with stemming and the other with lemmatization. Within these experiments, we’ll take shots at different hyperparameters coupled with different text representation techniques.

I won’t go full pro mode here because our purpose is different, and just a friendly reminder that I haven’t implemented Git integration. Tracking experiments with Git could be ideal, but it will require some changes in the code, which I’ve already commented out. MLflow can keep track of Git as well, but adding it would result in a bunch of extra screenshots, and I know you’re a wizard at Git, so I’ll leave it up to you!

Now, let’s manually comment out and uncomment some code to add these two new experiments and record a few runs within them. After going through everything I just said, here are the experiments and their results. Let’s see how it goes! 🚀🔥

Alright, now that we’re done with our experiments, our runs might look a bit messy and chaotic, just like real-life use cases. Can you imagine doing all of this manually? It would be a nightmare, and we’d probably run out of sticky notes or need an endless supply of painkillers! But thanks to MLflow, it’s got us covered and takes care of all the mess from our wild experiments, leaving us with a clean and organized solution. Let’s appreciate the magic of MLflow! 🧙‍♀️✨

Alright, let’s say we’re done with a few experiments, and now we need to load a model from a specific experiment and run. The objective is to retrieve the run_id and load the artifacts (the model and vectorizer) associated with that run id. One way to achieve this is by searching for experiments, getting their ids, then searching for runs within those experiments. You can filter the results based on metrics like accuracy and select the run id you need. After that, you can load the artifacts using MLflow functions.
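That query could look something like this, assuming the metric was logged under the key "f1" as in the earlier sketches; mlflow.search_runs returns a pandas DataFrame, so sorting and filtering is easy:

```python
import mlflow

# Gather the ids of every experiment we've logged to
experiment_ids = [exp.experiment_id for exp in mlflow.search_experiments()]

# Pull all runs across those experiments, best F1 first
runs = mlflow.search_runs(experiment_ids=experiment_ids, order_by=["metrics.f1 DESC"])

best = runs.iloc[0]
best_run_id = best["run_id"]
print(best_run_id, best["metrics.f1"])
```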

An easier option is to use the MLflow UI directly: sort the runs by your metric in descending order, grab the run id from the topmost result, and proceed the same way.

Another straightforward and standard method is deploying models in production, which we’ll cover in the last section of the blog.

My intention behind the first approach was to familiarize you with the experiment query, as sometimes you might require a custom dashboard or plots instead of MLflow’s built-in features. By querying MLflow programmatically, you can effortlessly create custom visualizations to suit your specific needs. It’s all about exploring different options to make your MLflow journey even more efficient and effective!

Now that we have obtained the run_id, we can load the model and perform predictions through various APIs. MLflow utilizes a specific format called flavors for different libraries. You can also create your own custom flavor, but that’s a separate topic to explore. In any case, when you click on any model in MLflow, it will display instructions on how to load it.

Let’s load one of our models to perform a quick prediction and see how it works in action!
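A sketch of that, assuming the model was logged under the artifact path "model" and the vectorizer as "vectorizer.pkl", as in the trainer sketch above; preprocess is the same cleaning helper used at training time.

```python
import joblib
import mlflow.sklearn
from mlflow.artifacts import download_artifacts

from train_utils import preprocess  # placeholder module from the earlier sketch

run_id = best_run_id  # taken from the query above, or copied from the MLflow UI

# Load the sklearn model through the runs:/ URI
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")

# Fetch the fitted vectorizer that was logged alongside it
vectorizer = joblib.load(download_artifacts(run_id=run_id, artifact_path="vectorizer.pkl"))

msg = "Congratulations! You have won a free prize, call now!!"
features = vectorizer.transform([preprocess(msg)])
print(model.predict(features))  # hopefully -> ['spam']
```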

Whoa!! That was smooth! Loading a model from any of the 15 different runs was a breeze. All we had to do was provide the run ID, and there was no need to remember complex paths or anything of that sort. But wait, is that all? How do we serve the models or deploy them? Let’s dive into that in the next section and explore the world of model deployment and serving.

Welcome to the final section! Let’s jump right in without wasting any time. Once we’ve decided on the model we want to use, all that’s left to do is select it and register it with a unique model name. In earlier versions of MLflow, registering a model required a database, but not anymore. Now, it’s much simpler, and I’ll have to write a little less about that.

The key point here is to keep the model name simple and unique. This name will be crucial for future tasks like retraining or updating models. Whenever we have a new model resulting from successful experiments with good metrics, we register it with the same name. MLflow automatically logs the model with a new version and keeps updating it.

In this section, let’s register three models based on the test accuracy chart: one at the bottom, one in the middle, and the last one at the top. We’ll name the model spamfilter.
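The registration can be done with a few clicks in the UI, but if you prefer code, a sketch like this achieves the same thing (run_id being the run whose model you want to register):

```python
import mlflow

# Registering the same name again later creates version 2, version 3, and so on
result = mlflow.register_model(f"runs:/{run_id}/model", "spamfilter")
print(result.name, result.version)
```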

Once we register models from different runs under the same model name, MLflow will add versions to it like this 👇

So, is it the end of the road once we have registered the model? The answer is no! Registering the model is just one step in the machine learning lifecycle, and it’s from here that MLOps, or more specifically, the CI/CD pipeline, comes into play.

Once we have registered the models in MLflow, the next steps typically involve: ⚠️ Theory Ahead ⚠️

Staging and Validation 🟨

Deployment 🟩

Monitoring and Maintenance ⛑️

Retraining ⚙️

Model Versioning 🔢

Feedback Loop and Improvement

All righty then! No more chit-chat and theory jargon!! We’re done with that, and boredom is so not invited to this party. It’s time to unleash the code ⚡ Let’s get our hands dirty and have some real fun! 🚀💻 Since I’m working solo here, I’m not bound by a quality or testing team’s constraints 😉. So I’ll skip the yellow stage (Staging, meant for validation) and take the leap directly to the green stage. This approach would be risky in a real-world scenario, but in my experimental world, I’m willing to take the chance.

So with just a few clicks, I’ll set the stage of my version 3 model to production, and let’s explore how we can query the production model.

Likewise, we can execute a query, and by filtering on the condition current_stage == ‘Production’, we can retrieve the model. Just like we did in the last section, we can use the model.run_id to proceed. It’s all about leveraging what we’ve learned! 💡
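Here’s a sketch of that query using MlflowClient: it lists every registered version of spamfilter and keeps the one whose current_stage is Production.

```python
import mlflow.sklearn
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Every registered version of our model
versions = client.search_model_versions("name='spamfilter'")

# Keep the one currently serving in Production
prod = next(v for v in versions if v.current_stage == "Production")
print(prod.version, prod.current_stage, prod.run_id)

# prod.run_id can be used exactly as in the previous section
model = mlflow.sklearn.load_model(f"runs:/{prod.run_id}/model")
```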

Alternatively, you can also load a production model using the following snippet.
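For example, the models:/<name>/<stage> URI resolves to whichever version is currently in that stage:

```python
import mlflow.pyfunc

# Loads whichever version of "spamfilter" currently sits in the Production stage
prod_model = mlflow.pyfunc.load_model("models:/spamfilter/Production")
```

The same URI also works with mlflow.sklearn.load_model if you want the native scikit-learn object back.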

Now that our production model is deployed, the next step is to serve it through an API. MLflow provides a default REST API for making predictions using the logged model, but it has limited customization options. To have more control and flexibility, we can use web frameworks like FastAPI or Flask to build custom endpoints.

For demonstration purposes, I’ll use Streamlit again to showcase some information about the production models. Additionally, we’ll explore how a new model from an experiment can potentially replace the previous one if it performs better. Here’s the code for the user application, named user_app.py 👇
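Here’s a stripped-down sketch of what user_app.py could look like; it assumes the production model and vectorizer were logged as in the trainer sketch, and reuses the same preprocess helper from the placeholder module.

```python
import joblib
import mlflow.sklearn
import streamlit as st
from mlflow.artifacts import download_artifacts
from mlflow.tracking import MlflowClient

from train_utils import preprocess  # same cleaning used at training time (placeholder module)

st.title("📨 Spam Filter")


@st.cache_resource
def load_production_model(name: str = "spamfilter"):
    """Find the version currently in Production and load its model and vectorizer."""
    client = MlflowClient()
    prod = next(
        v for v in client.search_model_versions(f"name='{name}'")
        if v.current_stage == "Production"
    )
    model = mlflow.sklearn.load_model(f"runs:/{prod.run_id}/model")
    vec_path = download_artifacts(run_id=prod.run_id, artifact_path="vectorizer.pkl")
    return prod, model, joblib.load(vec_path)


prod, model, vectorizer = load_production_model()
st.caption(f"Serving spamfilter version {prod.version} (run {prod.run_id})")

msg = st.text_area("Paste an SMS message")
if st.button("Classify"):
    features = vectorizer.transform([preprocess(msg)])
    st.write("🚫 Spam" if model.predict(features)[0] == "spam" else "✅ Ham")
```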

The app UI will look something like this 😎

Wow, we’ve successfully deployed our first app! But hold on, the journey doesn’t end here. Now that the app is being served to users, they will interact with it using different data, resulting in diverse predictions. These predictions are recorded through various means such as feedback, ratings, and more. However, as time passes, the model might lose its effectiveness, and that’s when it’s time for retraining.

Retraining involves going back to the initial stage, possibly with new data or algorithms, to improve the model’s performance.

After retraining, we put the new models to the test against the production model, and if they show significant improvement, they’re queued up in the Staging 🟨 area for validation and quality checks.

Once they get the green light, they’re moved to the Production 🟩 stage, replacing the current model in use. The previous production model is then archived ⬛.

Note: We have the flexibility to deploy multiple models simultaneously in production. This means we can offer different models with varying qualities and functionalities, tailored to meet specific subscriptions or requirements. It’s all about customizing the user experience to perfection!

Now, let’s move this latest run’s model to the Production stage and refresh our app 🔄️

The app now reflects the latest changes, and this is exactly how models are served in the real world. This is the essence of CI/CD — Continuous Integration and Continuous Deployment. This is MLOps. We’ve nailed it from start to finish! 🎉

And that’s a wrap for this extensive blog! But remember, this is just a tiny step in the vast world of MLOps. The journey ahead involves hosting our app on the cloud, collaborating with others, and serving models through APIs. While I used Streamlit solely in this blog, you have the freedom to explore other options like FastAPI or Flask for building endpoints. You can even combine Streamlit with FastAPI for decoupling and coupling with your preferred pipeline. If you need a refresher, I’ve got you covered with one of my previous blogs that shows how to do just that!

medium.com

Hey, hey, hey! We’ve reached the finish line, folks! Here’s the GitHub Repo for this whole project 👇

github.com

I hope this blog brought some smiles and knowledge your way. If you had a good time reading it and found it helpful, don’t forget to follow yours truly, Afaque Umer, for more thrilling articles.

Stay tuned for more exciting adventures in the world of Machine Learning and Data Science. I’ll make sure to break down those fancy-sounding terms into easy-peasy concepts.

All right, all right, all right! It’s time to say goodbye 👋
