Mastering ML model logs in Microsoft Fabric

Paul PETON
Jul 19, 2024


Microsoft Fabric is Microsoft's new data & AI platform, enabling end-to-end work with data (from collection to visualization) through different experiences, each corresponding to a persona in the data world: Data Engineer, Data Analyst and also Data Scientist.

Data Scientists have the option of launching a notebook-based environment running on a Spark cluster. Data can of course be stored in OneLake, the central storage system on which Fabric is based.

Those familiar with environments such as Azure Machine Learning or Azure Databricks will recognize the open-source project MLFlow, a tool that has become a standard for tracking training hyperparameters and evaluation metrics, and for storing all the artifacts associated with a Machine Learning model: requirements file, saved model binary, input and output definitions, etc.

Microsoft has therefore also integrated MLFlow into the Fabric platform. And to simplify its use, it is not even necessary to start an MLFlow session to trigger log tracking! This is the autologging mechanism documented here.

However, it may be useful to define training sessions and the elements (parameters and metrics) that you want MLFlow to record.

In this article, we’ll look at these two ways of working with MLFlow integrated into Microsoft Fabric.

Implicit creation of an experiment

To illustrate the various features, we’ll use code generated by ChatGPT, using the following prompt:

Write me Python code that illustrates a Scikit-learn classification using a dataset from this package.
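The generated code looks something like the following sketch (the exact dataset and model returned by ChatGPT will vary; the Iris dataset and logistic regression shown here are assumptions):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a dataset shipped with Scikit-learn and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple classifier
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")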

We can now execute the notebook containing this code.

The .fit() method of the Scikit-learn library triggers the registration of a run within an experiment whose name is, by default, that of the notebook (here “Notebook-1”).

These two elements are clickable links that will enable us to navigate to other elements now stored in the workspace used for this notebook.

The various artifacts are linked to the run, and in particular the graphics generated in the notebook.

Let’s now modify the value of a training hyperparameter in order to compare two runs:

model = LogisticRegression(max_iter=200, penalty=None)

A good practice is to modify the names of the runs so as to know what they correspond to.
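One way to do this in code, rather than in the interface, is to open the run explicitly with a chosen name before training (a minimal sketch reusing the training data from the earlier snippet; the run name is arbitrary):

import mlflow
from sklearn.linear_model import LogisticRegression

# Autologging attaches its logs to the active run, which we name explicitly
with mlflow.start_run(run_name="logreg-penalty-none"):
    model = LogisticRegression(max_iter=200, penalty=None)
    model.fit(X_train, y_train)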

A manual action will be required to save the model, from the run view.

Save run as an ML model

When you first save the model, you need to assign it a folder and then a name.

The model is then saved as version 1.
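Since Fabric exposes the model registry through the standard MLFlow client, the registered versions can also be listed programmatically (a sketch assuming the model was saved under the name used later in this article, “orders-outliers-model”):

from mlflow.tracking import MlflowClient

client = MlflowClient()
# List all versions registered under this model name
for mv in client.search_model_versions("name = 'orders-outliers-model'"):
    print(mv.name, mv.version)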

The interface then triggers a wizard to generate the code required for inference (i.e. using the model to generate predictions).

The code generated is based on the synapseml library developed by Microsoft.
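The generated snippet follows this pattern (reproduced here as a sketch; the input columns and model name are placeholders to adapt to your own model):

from synapse.ml.predict import MLFlowTransformer

# Wrap the registered model as a Spark transformer
model = MLFlowTransformer(
    inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    outputCol="predictions",
    modelName="orders-outliers-model",
    modelVersion=1
)

# df is a Spark DataFrame containing the data to score
df_predictions = model.transform(df)
display(df_predictions)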

We now have three elements in the workspace:

  • notebook
  • experiment (all runs)
  • ML model

Explicit MLFlow syntax

Let’s now explore the MLFlow commands that will enable us to:

- view the server URI

- choose which runs to track

- choose the hyperparameters and metrics collected

- save a new version of the model

The implicitly executed code was as follows:

import mlflow

mlflow.autolog(
    log_input_examples=False,
    log_model_signatures=True,
    log_models=True,
    disable=True,
    exclusive=True,
    disable_for_unsupported_versions=True,
    silent=True
)

We start by importing the MLFlow library, which is pre-installed on the Spark cluster.

The most important parameter is undoubtedly “disable”, which disables automatic logging at notebook level (we’ll see later how to do this at workspace level).

We can now define the experiment name using the set_experiment() instruction and view the tracking URI using the get_tracking_uri() syntax.
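For example (the experiment name here is an arbitrary choice):

import mlflow

mlflow.set_experiment("orders-outliers-experiment")
print(mlflow.get_tracking_uri())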

Here’s the classification code, adapted to explicitly use MLFlow commands. The following prompt was used to generate this code:

Add the elements needed to track hyperparameters and metrics with MLFlow
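The adapted code follows this pattern (a sketch under the same assumptions as the earlier snippet; the parameter and metric names are illustrative):

import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

with mlflow.start_run(run_name="explicit-logging"):
    max_iter = 200
    model = LogisticRegression(max_iter=max_iter)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Explicitly record the chosen hyperparameter and metric
    mlflow.log_param("max_iter", max_iter)
    mlflow.log_metric("accuracy", accuracy)

    # Logging the model here is what removes the need for the manual save step
    mlflow.sklearn.log_model(model, "model")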

Links to the experiment and run still appear in the notebook cell output.

This means that the model has already been saved, so there’s no need to perform the manual step you saw earlier.

Finally, it is possible (and recommended) to improve the model log by specifying examples of inputs and outputs, known as the model signature.

# Record model with signature
from mlflow.models.signature import infer_signature

signature = infer_signature(X_test, y_pred)

model_name = "orders-outliers-model"

mlflow.sklearn.log_model(
    model, model_name,
    signature=signature,
    registered_model_name=model_name
)

To maintain consistency with the model training, either the pair X_train, y_train or the pair X_test, y_pred can be used.

Workspace parameter: automatic log

Once you’ve decided on one of these options, you can play with the “automatic log” parameter, available at workspace granularity.

If you’re planning to maximize the portability of your notebooks (running them on Azure Machine Learning or Azure Databricks, for example), it’s probably a good idea to write MLFlow commands explicitly.
