diff --git a/docs/book/README.md b/docs/book/README.md index d5b9e44..118b599 100644 --- a/docs/book/README.md +++ b/docs/book/README.md @@ -16,7 +16,9 @@ The course starts on **October 16, 2023**. \ * **Newsletter**. [Sign up](https://www.evidentlyai.com/ml-observability-course) to receive weekly updates with the course materials. * **Discord community**. Join the [community](https://discord.gg/PyAJuUD5mB) to ask questions and chat with others. * **Course platform**. [Register](https://evidentlyai.thinkific.com/courses/ml-observability-course) if you want to submit assignments and receive the certificate. This is optional. -* **Code examples**. Will be published in this GitHub [repository](https://github.com/evidentlyai/ml_observability_course) throughout the course. +* **Code examples**. Will be published in this GitHub [repository](https://github.com/evidentlyai/ml_observability_course) throughout the course. +* **Enjoying the course?** [Star](https://github.com/evidentlyai/evidently) Evidently on GitHub to contribute back! This helps us create free, open-source tools and content for the community. + The course starts on **October 16, 2023**. The videos and course notes for the new modules will be released during the course cohort. diff --git a/docs/book/SUMMARY.md b/docs/book/SUMMARY.md index 8e65097..f41e3b6 100644 --- a/docs/book/SUMMARY.md +++ b/docs/book/SUMMARY.md @@ -10,7 +10,14 @@ * [1.3. ML monitoring metrics. What exactly can you monitor?](ml-observability-course/module-1-introduction/ml-monitoring-metrics.md) * [1.4. Key considerations for ML monitoring setup](ml-observability-course/module-1-introduction/ml-monitoring-setup.md) * [1.5. ML monitoring architectures](ml-observability-course/module-1-introduction/ml-monitoring-architectures.md) -* [Module 2: ML monitoring metrics](ml-observability-course/module-2-ml-monitoring-metrics.md) +* [Module 2: ML monitoring metrics](ml-observability-course/module-2-ml-monitoring-metrics/readme.md) + * [2.1. How to evaluate ML model quality](ml-observability-course/module-2-ml-monitoring-metrics/evaluate-ml-model-quality.md) + * [2.2. Overview of ML quality metrics. Classification, regression, ranking](ml-observability-course/module-2-ml-monitoring-metrics/ml-quality-metrics-classification-regression-ranking.md) + * [2.3. Evaluating ML model quality CODE PRACTICE](ml-observability-course/module-2-ml-monitoring-metrics/ml-model-quality-code-practice.md) + * [2.4. Data quality in machine learning](ml-observability-course/module-2-ml-monitoring-metrics/data-quality-in-ml.md) + * [2.5. Data quality in ML CODE PRACTICE](ml-observability-course/module-2-ml-monitoring-metrics/data-quality-code-practice.md) + * [2.6. Data and prediction drift in ML](ml-observability-course/module-2-ml-monitoring-metrics/data-prediction-drift-in-ml.md) + * [2.8. 
Data and prediction drift in ML CODE PRACTICE](ml-observability-course/module-2-ml-monitoring-metrics/data-prediction-drift-code-practice.md) * [Module 3: ML monitoring for unstructured data](ml-observability-course/module-3-ml-monitoring-for-unstructured-data.md) * [Module 4: Designing effective ML monitoring](ml-observability-course/module-4-designing-effective-ml-monitoring.md) * [Module 5: ML pipelines validation and testing](ml-observability-course/module-5-ml-pipelines-validation-and-testing.md) diff --git a/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-architectures.md b/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-architectures.md index 60b9c03..6592e8e 100644 --- a/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-architectures.md +++ b/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-architectures.md @@ -36,4 +36,10 @@ When it comes to visualizing the results of monitoring, you also have options. Each ML monitoring architecture has its pros and cons. When choosing between them, consider existing tools, the scale of ML deployments, and available team resources for systems support. Be pragmatic: you can start with a simpler architecture and expand later. -For a deeper dive into the ML monitoring architectures with specific code examples, head to [Module 5](ml-observability-course/module-5-ml-pipelines-validation-and-testing.md) and [Module 6](ml-observability-course/module-6-deploying-an-ml-monitoring-dashboard.md). +For a deeper dive into the ML monitoring architectures with specific code examples, head to [Module 5](../module-5-ml-pipelines-validation-and-testing.md) and [Module 6](../module-6-deploying-an-ml-monitoring-dashboard.md). + +## Enjoyed the content? + +Star Evidently on GitHub to contribute back! This helps us create free, open-source tools and content for the community. + +⭐️ [Star](https://github.com/evidentlyai/evidently) on GitHub! diff --git a/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-metrics.md b/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-metrics.md index 0e01943..39d31fb 100644 --- a/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-metrics.md +++ b/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-metrics.md @@ -27,4 +27,4 @@ The ultimate measure of the model quality is its impact on the business. Dependi ![](<../../../images/2023109\_course\_module1\_fin\_images.034.png>) -For a deeper dive into **ML model quality and relevance** and **data quality and integrity** metrics, head to [Module 2](ml-observability-course/module-2-ml-monitoring-metrics.md). +For a deeper dive into **ML model quality and relevance** and **data quality and integrity** metrics, head to [Module 2](../module-2-ml-monitoring-metrics/readme.md). diff --git a/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-setup.md b/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-setup.md index f397aca..609a4ba 100644 --- a/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-setup.md +++ b/docs/book/ml-observability-course/module-1-introduction/ml-monitoring-setup.md @@ -58,4 +58,4 @@ While designing an ML monitoring system, tailor your approach to fit your specif * Use reference datasets to simplify the monitoring process but make sure they are carefully curated. * Define custom metrics that fit your problem statement and data properties. 
-For a deeper dive into the ML monitoring setup, head to [Module 4](ml-observability-course/module-4-designing-effective-ml-monitoring.md). +For a deeper dive into the ML monitoring setup, head to [Module 4](../module-4-designing-effective-ml-monitoring.md). diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics.md deleted file mode 100644 index 250ca73..0000000 --- a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -description: Model quality, data quality, data drift for structured data. ---- - -# Module 2. ML monitoring metrics - -Course content coming soon! diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-prediction-drift-code-practice.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-prediction-drift-code-practice.md new file mode 100644 index 0000000..a2ea5b4 --- /dev/null +++ b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-prediction-drift-code-practice.md @@ -0,0 +1,24 @@ +# 2.8. Data and prediction drift in ML [CODE PRACTICE] + +{% embed url="https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14" %} + +**Video 8**. [Data and prediction drift in ML [CODE PRACTICE]](https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14), by Emeli Dral + +In this video, we walk you through the code example of detecting data drift and creating a custom method for drift detection using the open-source [Evidently](https://github.com/evidentlyai/evidently) Python library. + +**Want to go straight to code?** Here is the [example notebook](https://github.com/evidentlyai/ml_observability_course/blob/main/module2/data_drift_deep_dive.ipynb) to follow along. + +**Outline**:\ +[00:00](https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14&t=0s) Create a working environment and import libraries\ +[01:33](https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14&t=93s) Overview of the data drift options\ +[04:25](https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14&t=265s) Evaluating share of drifted features\ +[06:40](https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14&t=400s) Detecting column drift\ +[11:47](https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14&t=707s) Set different drift detection method per feature type\ +[12:57](https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14&t=777s) Set individual different drift detection methods per feature\ +[15:34](https://www.youtube.com/watch?v=oO1K4CaWxt0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=14&t=934s) Custom drift detection method + +## Enjoyed the content? + +Star Evidently on GitHub to contribute back! This helps us create free, open-source tools and content for the community. + +⭐️ [Star](https://github.com/evidentlyai/evidently) on GitHub! 
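+
+Prefer a quick preview before opening the notebook? Here is a minimal sketch of the kind of checks covered in the video, assuming the Evidently API used during the course (`Report` with `DataDriftPreset` and `ColumnDriftMetric`). Parameter names such as `stattest` and `stattest_threshold` may differ in other library versions, and the synthetic data below is only for illustration.
+
+```python
+import numpy as np
+import pandas as pd
+
+from evidently.report import Report
+from evidently.metric_preset import DataDriftPreset
+from evidently.metrics import ColumnDriftMetric
+
+# Illustrative reference and current batches (synthetic, not the course dataset)
+rng = np.random.default_rng(0)
+reference = pd.DataFrame({
+    "feature_a": rng.normal(0, 1, 1000),
+    "feature_b": rng.choice(["red", "green", "blue"], 1000),
+})
+current = pd.DataFrame({
+    "feature_a": rng.normal(0.4, 1, 1000),  # mean shift to make drift visible
+    "feature_b": rng.choice(["red", "green", "blue"], 1000),
+})
+
+# Dataset-level drift report with the default drift detection methods
+report = Report(metrics=[DataDriftPreset()])
+report.run(reference_data=reference, current_data=current)
+report.save_html("data_drift_report.html")
+
+# Per-column check with an explicitly chosen method and threshold
+column_report = Report(metrics=[
+    ColumnDriftMetric(column_name="feature_a",
+                      stattest="wasserstein", stattest_threshold=0.1),
+])
+column_report.run(reference_data=reference, current_data=current)
+column_report.save_html("column_drift_report.html")
+```
+
+The video goes further: it shows how to set different drift detection methods per feature type, per individual feature, and how to plug in a fully custom detection method.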
diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-prediction-drift-in-ml.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-prediction-drift-in-ml.md new file mode 100644 index 0000000..ab996d6 --- /dev/null +++ b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-prediction-drift-in-ml.md @@ -0,0 +1,84 @@ +# 2.6. Data and prediction drift in ML + +{% embed url="https://www.youtube.com/watch?v=bMYcB_5gP4I&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=12" %} + +**Video 6**. [Data and prediction drift in ML](https://www.youtube.com/watch?v=bMYcB_5gP4I&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=12), by Emeli Dral + +## What is data drift, and why evaluate it? + +When ground truth is unavailable or delayed, we cannot calculate ML model quality metrics directly. Instead, we can use proxy metrics like feature and prediction drift. + +**Prediction drift** shows changes in the distribution of **model outputs** over time. Without target values, this is the best proxy of the model behavior. Detected changes in the model outputs may be an early signal of changes in the model environment, data quality bugs, pipeline errors, etc. + +![](<../../../images/2023109\_course\_module2.058.png>) + +**Feature drift** demonstrates changes in the distribution of **input features** over time. When we train the model, we assume that if the input data remains reasonably similar, we can expect similar model quality. Thus, data distribution drift can be an early warning about model quality decay, important changes in the model environment or user behavior, unannounced changes to the modeled process, etc. + +![](<../../../images/2023109\_course\_module2.060.png>) + +Prediction and feature drift can serve as early warning signs for model quality issues. They can also help pinpoint a root cause when the model decay is already observed. + +![](<../../../images/2023109\_course\_module2.065.png>) + +Some key considerations about data drift to keep in mind: +* **Prediction drift is usually more important than feature drift**. If you monitor one thing, look at the outputs. +* **Data drift in ML is a heuristic**. There is no “objective” drift; it varies based on the specific use case and data. +* **Not all distribution drift leads to model performance decay**. Consider the use case, the meaning of specific features, their importance, etc. +* **You don’t always need to monitor data drift**. It is useful for business-critical models with delayed feedback. But often you can wait. +* **Data drift helps with debugging**. Even if you do not alert on feature drift, it might help troubleshoot the decay. +* **Drift detection might be valuable even if you have the labels**. Feature drift might appear before you observe the model quality drop. + +{% hint style="info" %} +**Further reading:** [How to break a model in 20 days. A tutorial on production model analytics](https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production). +{% endhint %} + +## How to detect data drift? + +To detect distribution drift, you need to pick: +* **Drift detection method**: statistical tests, distance metrics, rules, etc. +* **Drift detection threshold**: e.g., confidence levels for statistical tests or numeric threshold for distance metrics. +* **Reference dataset**: what an exemplary distribution is. +* **Alert conditions**: e.g., based on feature importance and the share of the drifting features. 
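+
+To make these choices concrete, here is a minimal, library-agnostic sketch of a single-column drift check: the method is the Wasserstein distance from SciPy, the threshold on the scaled distance is purely illustrative, the reference sample stands in for the “exemplary” distribution, and the alert condition is a simple message. This is not the Evidently implementation, just the idea.
+
+```python
+import numpy as np
+from scipy.stats import wasserstein_distance
+
+def check_column_drift(reference: np.ndarray, current: np.ndarray,
+                       threshold: float = 0.1) -> bool:
+    """Drift check for one numerical column: distance metric + numeric threshold."""
+    # Scale by the reference standard deviation so the distance is unit-free
+    scale = reference.std() or 1.0
+    distance = wasserstein_distance(reference, current) / scale
+    return distance >= threshold  # threshold is illustrative, tune per feature
+
+rng = np.random.default_rng(42)
+reference = rng.normal(loc=0.0, scale=1.0, size=1_000)  # "exemplary" distribution
+current = rng.normal(loc=0.5, scale=1.0, size=1_000)    # shifted production batch
+
+if check_column_drift(reference, current):
+    print("Alert: drift detected in this column")
+```
+
+Swapping in a statistical test instead of a distance metric changes only the method and the meaning of the threshold (a significance level instead of a distance).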
+ +## Data drift detection methods + +There are three commonly used approaches to drift detection: +* **Statistical tests**, e.g., Kolmogorov-Smirnov or Chi-squared test. You can use parametric or non-parametric tests to compare distributions. Generally, parametric tests are more sensitive. Using statistical tests for drift detection is best for smaller datasets and samples. The resulting drift “score” is measured by p-value (a “confidence” of drift detection). +* **Distance-based metrics**, e.g., Wasserstein distance or Jensen Shannon Divergence. This group of metrics works well for larger datasets. The drift “score” is measured as distance, divergence, or level of similarity. +* **Rule-based checks** are custom rules for detecting drift based on heuristics and domain knowledge. These are great when you expect specific changes, e.g., new categories added to the dataset. + +Here is how the defaults are implemented in the Evidently open-source library. + +**For small datasets (<=1000)**, you can use Kolmogorov-Smirnov test for numerical features, Chi-squared test for categorical features, and proportion difference test for independent samples based on Z-score for binary categorical features. + +![](<../../../images/2023109\_course\_module2.070.png>) + +**For large datasets (>1000)**, you might use Wasserstein Distance for numerical features and Jensen-Shannon divergence for categorical features. + +![](<../../../images/2023109\_course\_module2.071.png>) + +## Univariate vs. multivariate drift + +The **univariate drift** detection approach looks at drift in each feature individually. It returns drift/no drift for each feature and can be easily interpretable. + +The **multivariate drift** detection approach looks at the complete dataset (e.g., using PCA and certain methods like domain classifier). It returns drift/no drift for the dataset and may be useful for systems with many features. + +You can still use the univariate approach to detect drift in a dataset by: +* Tracking the share (%) of drifting features to get a dataset drift decision. +* Tracking distribution drift only in the top model features. +* Combining both solutions. + +## Tips for calculating drift + +Here are some tips to keep in mind when calculating data drift: +* **Data quality is a must**. Calculate data quality metrics first and then monitor for drift. Otherwise, you might detect “data drift” that is caused by data quality issues. +* **Mind the feature set**. The approach to drift analysis varies based on the type and importance of features. +* **Mind the segments**. Consider segment-based drift monitoring when you have clearly defined segments in your data. For example, in manufacturing, you might have different suppliers of raw materials and need to monitor distribution drift separately for each of them. + +## Summing up + +We discussed the key concepts of data drift and how to measure it. When calculating data drift, consider drift detection method and thresholds, properties of reference data, and alert conditions. + +Further reading: [How to break a model in 20 days. A tutorial on production model analytics](https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production). + +Up next: deep dive into data drift detection [OPTIONAL] and practice on how to detect data drift using Python and [Evidently](https://github.com/evidentlyai/evidently) library. 
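+
+Before moving on, here is a rough sketch of the univariate approach rolled up into a dataset-level decision. It follows the “small dataset” defaults described above (Kolmogorov-Smirnov for numerical columns, Chi-squared for categorical ones) and flags dataset drift when the share of drifting columns crosses a threshold. The 0.05 significance level and 50% drift share are illustrative defaults, and this is a simplification of what the Evidently presets do under the hood.
+
+```python
+import pandas as pd
+from scipy.stats import ks_2samp, chi2_contingency
+
+def column_drift_pvalue(ref: pd.Series, cur: pd.Series) -> float:
+    """Return a drift p-value for one column using simple per-type defaults."""
+    if pd.api.types.is_numeric_dtype(ref):
+        return ks_2samp(ref.dropna(), cur.dropna()).pvalue  # numerical: Kolmogorov-Smirnov
+    table = pd.crosstab(
+        pd.concat([ref, cur], ignore_index=True),
+        ["reference"] * len(ref) + ["current"] * len(cur),
+    )
+    _, p_value, _, _ = chi2_contingency(table)              # categorical: Chi-squared
+    return p_value
+
+def dataset_drift(reference: pd.DataFrame, current: pd.DataFrame,
+                  alpha: float = 0.05, drift_share: float = 0.5) -> bool:
+    """Flag dataset drift when the share of drifting columns exceeds drift_share."""
+    drifted = [
+        col for col in reference.columns  # assumes both batches share the same schema
+        if column_drift_pvalue(reference[col], current[col]) < alpha
+    ]
+    print(f"Drifting columns: {drifted}")
+    return len(drifted) / len(reference.columns) >= drift_share
+```
+
+In practice, you would also weight this decision by feature importance or restrict it to the top model features, as discussed above.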
diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-quality-code-practice.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-quality-code-practice.md new file mode 100644 index 0000000..86f8863 --- /dev/null +++ b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-quality-code-practice.md @@ -0,0 +1,22 @@ +# 2.5. Data quality in ML [CODE PRACTICE] + +{% embed url="https://www.youtube.com/watch?v=_HKGrW2mVdo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=11" %} + +**Video 5**. [Data quality in ML [CODE PRACTICE]](https://www.youtube.com/watch?v=_HKGrW2mVdo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=11), by Emeli Dral + +In this video, we walk you through the code example of data quality evaluation using [Evidently](https://github.com/evidentlyai/evidently) Reports and Test Suites. + +**Want to go straight to code?** Here is the [example notebook](https://github.com/evidentlyai/ml_observability_course/blob/main/module2/data_quality.ipynb) to follow along. + +Here is a quick refresher on the Evidently components we will use: +* **Reports** compute and visualize 100+ metrics in data quality, drift, and model performance. You can use in-built report presets to make visuals appear with just a couple of lines of code. +* **Test Suites** perform structured data and ML model quality checks. They verify conditions and show which of them pass or fail. You can start with default test conditions or design your testing framework. + +**Outline**:\ +[00:00](https://www.youtube.com/watch?v=_HKGrW2mVdo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=11&t=0s) Create a working environment and import libraries\ +[01:30](https://www.youtube.com/watch?v=_HKGrW2mVdo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=11&t=90s) Prepare reference and current dataset\ +[05:20](https://www.youtube.com/watch?v=_HKGrW2mVdo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=11&t=320s) Run data quality Test Suite and visualize the results\ +[09:30](https://www.youtube.com/watch?v=_HKGrW2mVdo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=11&t=570s) Customize the Test Suite by specifying individual tests and test conditions\ +[13:20](https://www.youtube.com/watch?v=_HKGrW2mVdo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=11&t=800s) Build and customize data quality Report + +That’s it! We evaluated data quality using Evidently Reports and Test Suites and demonstrated how to add custom metrics, tests, and test conditions to the analysis. diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-quality-in-ml.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-quality-in-ml.md new file mode 100644 index 0000000..1fdd9df --- /dev/null +++ b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/data-quality-in-ml.md @@ -0,0 +1,61 @@ +# 2.4. Data quality in machine learning + +{% embed url="https://www.youtube.com/watch?v=IRbmQGqzVZo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=10" %} + +**Video 4**. [Data quality in machine learning](https://www.youtube.com/watch?v=IRbmQGqzVZo&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=10), by Emeli Dral + +## What can go wrong with the input data? + +If you have a complex ML system, there are many things that can go wrong with the data. The golden rule is: garbage in, garbage out. We need to make sure that the data we feed our model with is fine. + +Some common data processing issues are: +* **Wrong source**. E.g., a pipeline points to an older version of the table. 
+* **Lost access**. E.g., permissions are not updated. +* **Bad SQL. Or not SQL**. E.g., a query breaks when a user comes from a different time zone and makes an action “tomorrow." +* **Infrastructure update**. E.g., change in computation based on a dependent library. +* **Broken feature code**. E.g., feature computation breaks at a corner case like a 100% discount. + +Issues can also arise if the data schema changes or data is lost at the source (e.g., broken in-app logging or frozen sensor values). If you have several models interacting with each other, broken upstream models can affect downstream models. + +![](<../../../images/2023109\_course\_module2.041.png>) + +## Data quality metrics and analysis + +**Data profiling** is a good starting point for monitoring data quality metrics. Based on the data type, you can come up with basic descriptive statistics for your dataset. For example, for numerical features, you can calculate: +* Min and Max values +* Quantiles +* Unique values +* Most common values +* Share of missing values, etc. + +Then, you can visualize and compare statistics and data distributions of the current data batch and reference data to ensure data stability. + +![](<../../../images/2023109\_course\_module2.047.png>) + +When it comes to monitoring data quality, you must define the conditions for alerting. + +**If you do not have reference data, you can set up thresholds manually based on domain knowledge**. “General ML data quality” can include such characteristics as: +* no/low share of missing values +* no duplicate columns/rows +* no constant (or almost constant!) features +* no highly correlated features +* no target leaks (high correlation between feature and target) +* no range violations (based on the feature context, e.g., negative age or sales). + +Since setting up these conditions manually can be tedious, it often helps to have a reference dataset. + +**If you have reference data, you can compare it with the current data and autogenerate test conditions based on the reference**. For example, based on the training or past batch, you can monitor for: +* expected data schema and column types +* expected data completeness (e.g., 90% non-empty) +* expected batch size (e.g., number of rows) +* expected patterns for specific columns, such as: + * non-unique (features) or unique (IDs) + * specific data distribution types (e.g., normality) + * expected ranges based on observed values + * descriptive statistics: averages, median, quantiles, min-max (point estimation or statistical tests with a confidence interval). + +## Summing up + +Monitoring data quality is critical to ensuring that ML models function reliably in production. Depending on the availability of reference data, you can manually set up thresholds based on domain knowledge or automatically generate test conditions based on the reference. + +Up next: hands-on practice on how to evaluate and test data quality using Python and [Evidently](https://github.com/evidentlyai/evidently) library. diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/evaluate-ml-model-quality.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/evaluate-ml-model-quality.md new file mode 100644 index 0000000..207a40b --- /dev/null +++ b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/evaluate-ml-model-quality.md @@ -0,0 +1,57 @@ +# 2.1. 
How to evaluate ML model quality
+
+{% embed url="https://www.youtube.com/watch?v=7Y819MAQTDg&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=7" %}
+
+**Video 1**. [How to evaluate ML model quality](https://www.youtube.com/watch?v=7Y819MAQTDg&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=7), by Emeli Dral
+
+## Challenges of standard ML monitoring
+
+When it comes to standard ML monitoring, we usually start by measuring ML model performance metrics:
+* **Model quality and error metrics** show how the ML model performs in production. For example, you can track precision, recall, and log-loss for classification models or MAE for regression models.
+* **Business or product metrics** help evaluate the ML model’s impact on business performance. You might want to track such metrics as purchases, clicks, views, etc.
+
+**However, standard ML monitoring is not always enough**. Some challenges can complicate the ML performance assessment:
+* **Feedback or ground truth is delayed**. When ground truth is not immediately available, calculating quality metrics can be technically impossible.
+* **Past performance does not guarantee future results**, especially when the environment is unstable.
+* **Many segments with different quality**. Aggregated metrics might not provide insights for diverse user/object groups. In this case, we need to monitor quality metrics for each segment separately.
+* **The target function is volatile**. A volatile target function can lead to fluctuating performance metrics, making it difficult to differentiate between local quality drops and major performance issues.
+
+![](<../../../images/2023109\_course\_module2.005.png>)
+
+## Early monitoring metrics
+
+You can adopt **early monitoring** together with standard monitoring metrics to tackle these challenges.
+
+Early monitoring focuses on metrics derived from consistently available data: input data and ML model output data. For example, you can track:
+* **Data quality** to detect issues with data quality and integrity.
+* **Data drift** to monitor changes in the input feature distributions.
+* **Output drift** to observe shifts in model predictions.
+
+![](<../../../images/2023109\_course\_module2.006.png>)
+
+## Module 2 structure
+
+This module includes both theoretical parts and code practice for each of the evaluation types. Here is the module structure:
+
+**Model quality**
+* Theory: ML model quality metrics for regression, classification, and ranking problems.
+* Practice: building a sample report in Python showcasing quality metrics.
+
+**Data quality**
+* Theory: data quality metrics.
+* Practice: creating a sample report in Python on data quality.
+
+**Data and prediction drift**
+* Theory: an overview of the data drift metrics.
+* [OPTIONAL] Theory: a deeper dive into data drift detection methods and strategies.
+* Practice: building a sample report in Python to detect data and prediction drift for various data types.
+
+![](<../../../images/2023109\_course\_module2.007.png>)
+
+## Summing up
+
+Tracking ML quality metrics in production is crucial to ensure that ML models perform reliably in real-world scenarios. However, standard ML performance metrics like model quality and error are not always enough.
+
+Adopting early monitoring and measuring data quality, data drift, and prediction drift provides insights into potential issues when standard performance metrics cannot be calculated.
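+
+To make the idea of early monitoring more concrete, here is a small sketch of such proxy checks in plain pandas and NumPy. The column names (`age`, `prediction`) and the thresholds are hypothetical, and the course code practice uses Evidently rather than hand-rolled checks like these.
+
+```python
+import pandas as pd
+
+def early_monitoring_checks(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
+    """Proxy checks that need only inputs and predictions, no ground truth."""
+    checks = {}
+    # Data quality: overall share of missing values in the current batch
+    checks["missing_share"] = float(current.isna().mean().mean())
+    # Data quality: range violation for a hypothetical "age" feature
+    checks["negative_age_rows"] = int((current["age"] < 0).sum())
+    # Output drift proxy: shift of the mean predicted score vs. the reference batch
+    ref_mean, ref_std = reference["prediction"].mean(), reference["prediction"].std()
+    checks["prediction_mean_shift"] = float(
+        abs(current["prediction"].mean() - ref_mean) / (ref_std or 1.0)
+    )
+    return checks
+
+# Illustrative alerting logic:
+# results = early_monitoring_checks(reference_df, current_df)
+# alert = results["missing_share"] > 0.1 or results["prediction_mean_shift"] > 3
+```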
+ +Through this module, learners will gain a theoretical understanding and hands-on experience in evaluating and interpreting model quality, data quality, and data drift metrics. diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/ml-model-quality-code-practice.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/ml-model-quality-code-practice.md new file mode 100644 index 0000000..85ecd9c --- /dev/null +++ b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/ml-model-quality-code-practice.md @@ -0,0 +1,19 @@ +# 2.3. Evaluating ML model quality [CODE PRACTICE] + +{% embed url="https://www.youtube.com/watch?v=QWLw_lJ29k0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=9" %} + +**Video 3**. [Evaluating ML model quality [CODE PRACTICE]](https://www.youtube.com/watch?v=QWLw_lJ29k0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=9), by Emeli Dral + +In this video, we walk you through the code example of ML model quality evaluation using Python and the open-source [Evidently](https://github.com/evidentlyai/evidently) library. + +**Want to go straight to code?** Here is the [example notebook](https://github.com/evidentlyai/ml_observability_course/blob/main/module2/ml_model_quality.ipynb) to follow along. + +**Outline**:\ +[00:00](https://www.youtube.com/watch?v=QWLw_lJ29k0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=9&t=0s) Create a working environment and import libraries \ +[02:45](https://www.youtube.com/watch?v=QWLw_lJ29k0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=9&t=165s) Prepare datasets for classification and regression models\ +[08:25](https://www.youtube.com/watch?v=QWLw_lJ29k0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=9&t=505s) Build and customize classification quality report\ +[14:50](https://www.youtube.com/watch?v=QWLw_lJ29k0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=9&t=890s) Save and share the report\ +[16:05](https://www.youtube.com/watch?v=QWLw_lJ29k0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=9&t=965s) Display the report in JSON format and as a Python dictionary\ +[18:15](https://www.youtube.com/watch?v=QWLw_lJ29k0&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=9&t=1095s) Build and customize regression quality report + +That’s it! We built an ML model quality report for classification and regression problems and learned how to display it in HTML and JSON formats and as a Python dictionary. diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/ml-quality-metrics-classification-regression-ranking.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/ml-quality-metrics-classification-regression-ranking.md new file mode 100644 index 0000000..1956e71 --- /dev/null +++ b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/ml-quality-metrics-classification-regression-ranking.md @@ -0,0 +1,103 @@ +# 2.2. Overview of ML quality metrics. Classification, regression, ranking + +{% embed url="https://www.youtube.com/watch?v=4_LOXDWxCbw&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=8" %} + +**Video 2**. [Overview of ML quality metrics](https://www.youtube.com/watch?v=4_LOXDWxCbw&list=PL9omX6impEuOpTezeRF-M04BW3VfnPBRF&index=8), by Emeli Dral + +## ML model quality in production + +ML model quality degrades over time. This happens because things change, and the model’s environment evolves. + +You need **monitoring** to be able to maintain the ML model's relevance by detecting issues on time. You can also collect additional data and build visualizations for **debugging**. 
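+
+When labels for a batch of production data are available, standard quality metrics are straightforward to compute. Here is a minimal sketch using scikit-learn on hypothetical arrays (the code practice later in this module builds the same kind of metrics into an Evidently report):
+
+```python
+import numpy as np
+from sklearn.metrics import (accuracy_score, precision_score, recall_score,
+                             f1_score, mean_absolute_error, mean_squared_error)
+
+# Classification: true labels vs. predicted classes for one batch (hypothetical data)
+y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
+y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
+print("accuracy:", accuracy_score(y_true, y_pred))
+print("precision:", precision_score(y_true, y_pred))
+print("recall:", recall_score(y_true, y_pred))
+print("f1:", f1_score(y_true, y_pred))
+
+# Regression: actual values vs. predictions (hypothetical data)
+actual = np.array([10.0, 12.5, 9.0, 14.0])
+predicted = np.array([11.0, 12.0, 10.5, 13.0])
+print("MAE:", mean_absolute_error(actual, predicted))
+print("RMSE:", np.sqrt(mean_squared_error(actual, predicted)))
+```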
+ +But there is a caveat: to calculate classification, regression, and ranking quality metrics, **you need labels**. If you can, consider labeling at least part of the data to be able to compute them. + +![](<../../../images/2023109\_course\_module2.009.png>) + +## Classification quality metrics + +A classification problem in ML is a task of assigning predefined categories or classes (labels) to new input data. Here are some commonly used metrics to measure the quality of the classification model: +* [**Accuracy**](https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall) is the overall share of correct predictions. It is well-interpretable and arguably the most popular metric for classification problems. However, be cautious when using this metric with imbalanced datasets. +* [**Precision**](https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall) measures correctness when predicting the target class. +* [**Recall**](https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall) shows the ability to find all the objects of the target class. Precision and recall are usually used together. Both work well for unbalanced datasets. +* **F1-score** is the harmonic mean of precision and recall. +* [**ROC-AUC**](https://www.evidentlyai.com/classification-metrics/explain-roc-curve) works for probabilistic classification and evaluates the model's ability to rank correctly. +* **Logarithmic loss** demonstrates how close the prediction probability is to the actual value. It is a good metric for probabilistic problem statement. + +![](<../../../images/2023109\_course\_module2.012.png>) + +Methods to help visualize and understand classification quality metrics include: +* [**Confusion matrix**](https://www.evidentlyai.com/classification-metrics/confusion-matrix) shows the number of correct predictions – true positives (TP) and true negatives (TN) – and the number of errors – false positives (FP) and false negatives (FN). You can calculate precision, recall, and F1-score based on these values. +* **Precision-recall table** helps calculate metrics like precision, recall, and F1-score for different thresholds in probabilistic classification. +* **Class separation quality** helps visualize correct and incorrect predictions for each class. +* **Error analysis**. You can also map predicted probabilities or model errors alongside feature values and explore if a specific type of misclassification is connected to the particular feature values. + +![](<../../../images/2023109\_course\_module2.016.png>) + +{% hint style="info" %} +**Further reading:** [What is your model hiding? A tutorial on evaluating ML models](https://www.evidentlyai.com/blog/tutorial-2-model-evaluation-hr-attrition). +{% endhint %} + +## Regression quality metrics + +Regression models provide numerical output which is compared against actual values to estimate ML model quality. Some standard regression quality metrics include: +* **Mean Error (ME)** is an average of all errors. It is easy to calculate, but remember that positive and negative errors can overcompensate each other. +* **Mean Absolute Error (MAE)** is an average of all absolute errors. +* **Root Mean Squared Error (RMSE)** is a square root of the mean of squared errors. It penalizes larger errors. +* **Mean Absolute Percentage Error (MAPE)** averages all absolute errors in %. Works well for datasets with objects of different scale (i.e., tens, thousands, or millions). 
+* **Symmetric MAPE** provides different penalty for over- or underestimation. + +![](<../../../images/2023109\_course\_module2.020.png>) + +Some of the methods to analyze and visualize regression model quality are: +* **Predicted vs. Actual** value plots and Error over time plots help derive patterns in model predictions and behavior (e.g., Does the model tend to have bigger errors during weekends or hours of peak demand?). +* **Error analysis**. It is often important to distinguish between **underestimation** and **overestimation** during error analysis. Since errors might have different business costs, this can help optimize model performance for business metrics based on the use case. + +You can also map extreme errors alongside feature values and explore if a specific type of error is connected to the particular feature values. + +![](<../../../images/2023109\_course\_module2.025.png>) + +## Ranking quality metrics + +Ranking focuses on the relative order of items rather than their absolute values. Popular examples of ranking problems are search engines and recommender systems. + +We need to estimate the order of objects to measure quality in ranking tasks. Some commonly used ranking quality metrics are: +* **Cumulative gain** helps estimate the cumulative value of recommendations and does not take into account the position of a result in the list. +* **Discounted Cumulative Gain (DCG)** gives a penalty when a relevant result is further in the list. +* **Normalized DCG (NDCG)** normalizes the evaluation irrespective of the list length. +* **Precision @k** is a share of the relevant objects in top-K results. +* **Recall @k** is a coverage of all relevant objects in top-K results. +* **Lift @k** reflects an improvement over random ranking. + +![](<../../../images/2023109\_course\_module2.028.png>) + +If you work on a recommender system, you might want to consider additional – “beyond accuracy” – metrics that reflect RecSys behavior. Some examples are: +* Serendipity +* Novelty +* Diversity +* Coverage +* Popularity bias + +You can also use other custom metrics based on your problem statement and business context, for example, by weighting the metrics by specific segments. + +## Considerations for production ML monitoring + +When you define the model quality metrics to monitor the ML model performance in production, there are some important considerations to keep in mind: + +**Pick the right metrics that align with your use case and business goals**: +* **The usuals apply**. E.g., reuse metrics from the model development phase and do not use accuracy for a problem with highly imbalanced classes. +* **Consider a proxy business metric to evaluate impact**. E.g., consider tracking an estimated loss/gain based on known error costs, the share of predictions with an error larger than X, etc. +* **Not all evaluation metrics are useful for dynamic production monitoring**. E.g., ROC AUC reflects quality across all thresholds, but a production model has a specific one. +* **Consider custom metrics or heuristics**. E.g., the average position of the first relevant object in the recommendation block. + +**It’s not just a choice of metric**. There are other parameters you might need to define: +* **Aggregation window**. It is crucial to calculate metrics in the right windows. Depending on the use case, you might want to monitor precision, for example, every minute, hourly, daily, or over a sliding 7-day window as a key performance indicator. +* **Segments**. 
You can track model quality separately for different locations, devices, customer subscription types, etc. + +## Summing up + +We discussed the importance of monitoring ML model performance in production and introduced commonly used quality metrics for classification, regression, and ranking problems. + +Further reading: [What is your model hiding? A tutorial on evaluating ML models](https://www.evidentlyai.com/blog/tutorial-2-model-evaluation-hr-attrition). + +In the next part of this module, we will dive into practice and build a model quality report using the open-source [Evidently](https://github.com/evidentlyai/evidently) Python library. diff --git a/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/readme.md b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/readme.md new file mode 100644 index 0000000..e022b23 --- /dev/null +++ b/docs/book/ml-observability-course/module-2-ml-monitoring-metrics/readme.md @@ -0,0 +1,13 @@ +--- +description: Model quality, data quality, data drift for structured data. +--- + +# Module 2: ML monitoring metrics: model quality, data quality, data drift + +This module will cover different aspects of the production ML model performance. We will explain some popular metrics and tests and how to apply them: +* what it means to have a “good” ML model; +* evaluating ML model quality; +* tracking data quality in production; +* data and prediction drift as proxy metrics. + +This module includes both theoretical parts and code practice for each evaluation type. At the end of this module, you will understand the contents of ML observability: metrics and checks you can run and how to interpret them. diff --git a/docs/images/2023109_course_module2.005.png b/docs/images/2023109_course_module2.005.png new file mode 100644 index 0000000..d193652 Binary files /dev/null and b/docs/images/2023109_course_module2.005.png differ diff --git a/docs/images/2023109_course_module2.006.png b/docs/images/2023109_course_module2.006.png new file mode 100644 index 0000000..b6ac49b Binary files /dev/null and b/docs/images/2023109_course_module2.006.png differ diff --git a/docs/images/2023109_course_module2.007.png b/docs/images/2023109_course_module2.007.png new file mode 100644 index 0000000..090d036 Binary files /dev/null and b/docs/images/2023109_course_module2.007.png differ diff --git a/docs/images/2023109_course_module2.009.png b/docs/images/2023109_course_module2.009.png new file mode 100644 index 0000000..0c771a1 Binary files /dev/null and b/docs/images/2023109_course_module2.009.png differ diff --git a/docs/images/2023109_course_module2.012.png b/docs/images/2023109_course_module2.012.png new file mode 100644 index 0000000..fdddf66 Binary files /dev/null and b/docs/images/2023109_course_module2.012.png differ diff --git a/docs/images/2023109_course_module2.016.png b/docs/images/2023109_course_module2.016.png new file mode 100644 index 0000000..05cadba Binary files /dev/null and b/docs/images/2023109_course_module2.016.png differ diff --git a/docs/images/2023109_course_module2.020.png b/docs/images/2023109_course_module2.020.png new file mode 100644 index 0000000..a6a57ad Binary files /dev/null and b/docs/images/2023109_course_module2.020.png differ diff --git a/docs/images/2023109_course_module2.025.png b/docs/images/2023109_course_module2.025.png new file mode 100644 index 0000000..48ba4da Binary files /dev/null and b/docs/images/2023109_course_module2.025.png differ diff --git a/docs/images/2023109_course_module2.028.png 
b/docs/images/2023109_course_module2.028.png new file mode 100644 index 0000000..6986f8f Binary files /dev/null and b/docs/images/2023109_course_module2.028.png differ diff --git a/docs/images/2023109_course_module2.041.png b/docs/images/2023109_course_module2.041.png new file mode 100644 index 0000000..0afed0a Binary files /dev/null and b/docs/images/2023109_course_module2.041.png differ diff --git a/docs/images/2023109_course_module2.047.png b/docs/images/2023109_course_module2.047.png new file mode 100644 index 0000000..55033cd Binary files /dev/null and b/docs/images/2023109_course_module2.047.png differ diff --git a/docs/images/2023109_course_module2.058.png b/docs/images/2023109_course_module2.058.png new file mode 100644 index 0000000..ab9d337 Binary files /dev/null and b/docs/images/2023109_course_module2.058.png differ diff --git a/docs/images/2023109_course_module2.060.png b/docs/images/2023109_course_module2.060.png new file mode 100644 index 0000000..1800100 Binary files /dev/null and b/docs/images/2023109_course_module2.060.png differ diff --git a/docs/images/2023109_course_module2.065.png b/docs/images/2023109_course_module2.065.png new file mode 100644 index 0000000..4c0eefb Binary files /dev/null and b/docs/images/2023109_course_module2.065.png differ diff --git a/docs/images/2023109_course_module2.070.png b/docs/images/2023109_course_module2.070.png new file mode 100644 index 0000000..9fc8fc0 Binary files /dev/null and b/docs/images/2023109_course_module2.070.png differ diff --git a/docs/images/2023109_course_module2.071.png b/docs/images/2023109_course_module2.071.png new file mode 100644 index 0000000..c2e3ed5 Binary files /dev/null and b/docs/images/2023109_course_module2.071.png differ