Added course notes for Module 4 #13

Merged: 72 commits, Nov 3, 2023

Commits
bbc362d
Added compressed images for module 4
dmaliugina Nov 3, 2023
f371164
Added compressed images for module 4
dmaliugina Nov 3, 2023
25e5202
Create readme.md
dmaliugina Nov 3, 2023
9d991bf
Delete docs/book/ml-observability-course/docs/book/ml-observability-c…
dmaliugina Nov 3, 2023
14a2018
Create readme.md
dmaliugina Nov 3, 2023
9f34930
Create logging-ml-monitoring.md
dmaliugina Nov 3, 2023
6fa742c
Create how-to-prioritize-monitoring-metrics.md
dmaliugina Nov 3, 2023
85960b5
Create when-to-retrain-ml-models.md
dmaliugina Nov 3, 2023
1289eb7
Create how-to-choose-reference-dataset-ml-monitoring.md
dmaliugina Nov 3, 2023
6baadf5
Create custom-metrics-ml-monitoring.md
dmaliugina Nov 3, 2023
9b074de
Create custom-metrics-evidently-code-practice.md
dmaliugina Nov 3, 2023
fbabe70
Create choosing-ml-monitoring-deployment-architecture.md
dmaliugina Nov 3, 2023
f2c25e8
Update SUMMARY.md
dmaliugina Nov 3, 2023
77740cd
Delete docs/book/ml-observability-course/module-4-designing-effective…
dmaliugina Nov 3, 2023
d03d548
Delete docs/images/2023110_course_module4_fin.001-min.png
dmaliugina Nov 3, 2023
ecae49c
Delete docs/images/2023110_course_module4_fin.002-min.png
dmaliugina Nov 3, 2023
e15fc93
Delete docs/images/2023110_course_module4_fin.003-min.png
dmaliugina Nov 3, 2023
a5e0f07
Delete docs/images/2023110_course_module4_fin.041-min.png
dmaliugina Nov 3, 2023
484baa7
Delete docs/images/2023110_course_module4_fin.040-min.png
dmaliugina Nov 3, 2023
b80b077
Delete docs/images/2023110_course_module4_fin.039-min.png
dmaliugina Nov 3, 2023
8858c24
Delete docs/images/2023110_course_module4_fin.035-min.png
dmaliugina Nov 3, 2023
a65c8ac
Delete docs/images/2023110_course_module4_fin.032-min.png
dmaliugina Nov 3, 2023
d7b8c17
Delete docs/images/2023110_course_module4_fin.031-min.png
dmaliugina Nov 3, 2023
571052a
Delete docs/images/2023110_course_module4_fin.030-min.png
dmaliugina Nov 3, 2023
3a3eb3c
Delete docs/images/2023110_course_module4_fin.027-min.png
dmaliugina Nov 3, 2023
8776a1e
Delete docs/images/2023110_course_module4_fin.026-min.png
dmaliugina Nov 3, 2023
0650041
Delete docs/images/2023110_course_module4_fin.024-min.png
dmaliugina Nov 3, 2023
18d9a1a
Delete docs/images/2023110_course_module4_fin.022-min.png
dmaliugina Nov 3, 2023
f0a1c00
Delete docs/images/2023110_course_module4_fin.017-min.png
dmaliugina Nov 3, 2023
ba9c770
Delete docs/images/2023110_course_module4_fin.016-min.png
dmaliugina Nov 3, 2023
fb3d430
Delete docs/images/2023110_course_module4_fin.015-min.png
dmaliugina Nov 3, 2023
667fb37
Delete docs/images/2023110_course_module4_fin.014-min.png
dmaliugina Nov 3, 2023
902e36e
Delete docs/images/2023110_course_module4_fin.013-min.png
dmaliugina Nov 3, 2023
cc15dbe
Delete docs/images/2023110_course_module4_fin.011-min.png
dmaliugina Nov 3, 2023
f2864bb
Delete docs/images/2023110_course_module4_fin.010-min.png
dmaliugina Nov 3, 2023
e362f2c
Delete docs/images/2023110_course_module4_fin.007-min.png
dmaliugina Nov 3, 2023
3d6672a
Delete docs/images/2023110_course_module4_fin.006-min.png
dmaliugina Nov 3, 2023
2224549
Delete docs/images/2023110_course_module4_fin.085-min.png
dmaliugina Nov 3, 2023
839a3ba
Delete docs/images/2023110_course_module4_fin.084-min.png
dmaliugina Nov 3, 2023
5144154
Delete docs/images/2023110_course_module4_fin.083-min.png
dmaliugina Nov 3, 2023
6960f03
Delete docs/images/2023110_course_module4_fin.082-min.png
dmaliugina Nov 3, 2023
6a1b181
Delete docs/images/2023110_course_module4_fin.081-min.png
dmaliugina Nov 3, 2023
97bf683
Delete docs/images/2023110_course_module4_fin.077-min.png
dmaliugina Nov 3, 2023
5e5f013
Delete docs/images/2023110_course_module4_fin.076-min.png
dmaliugina Nov 3, 2023
2c9950c
Delete docs/images/2023110_course_module4_fin.075-min.png
dmaliugina Nov 3, 2023
a826dee
Delete docs/images/2023110_course_module4_fin.072-min.png
dmaliugina Nov 3, 2023
718b8c8
Delete docs/images/2023110_course_module4_fin.070-min.png
dmaliugina Nov 3, 2023
47b30f7
Delete docs/images/2023110_course_module4_fin.067-min.png
dmaliugina Nov 3, 2023
ff09390
Delete docs/images/2023110_course_module4_fin.066-min.png
dmaliugina Nov 3, 2023
f89620c
Delete docs/images/2023110_course_module4_fin.063-min.png
dmaliugina Nov 3, 2023
98afb95
Delete docs/images/2023110_course_module4_fin.062-min.png
dmaliugina Nov 3, 2023
4fc054d
Delete docs/images/2023110_course_module4_fin.061-min.png
dmaliugina Nov 3, 2023
7796d58
Delete docs/images/2023110_course_module4_fin.053-min.png
dmaliugina Nov 3, 2023
6328754
Delete docs/images/2023110_course_module4_fin.052-min.png
dmaliugina Nov 3, 2023
c5708e0
Delete docs/images/2023110_course_module4_fin.051-min.png
dmaliugina Nov 3, 2023
7233c56
Delete docs/images/2023110_course_module4_fin.049-min.png
dmaliugina Nov 3, 2023
ae07303
Delete docs/images/2023110_course_module4_fin.047-min.png
dmaliugina Nov 3, 2023
8b3e41d
Delete docs/images/2023110_course_module4_fin.044-min.png
dmaliugina Nov 3, 2023
be5f68c
Delete docs/images/2023110_course_module4_fin.043-min.png
dmaliugina Nov 3, 2023
da4990c
Delete docs/images/2023110_course_module4_fin.104-min.png
dmaliugina Nov 3, 2023
426d030
Delete docs/images/2023110_course_module4_fin.101-min.png
dmaliugina Nov 3, 2023
78015c0
Delete docs/images/2023110_course_module4_fin.100-min.png
dmaliugina Nov 3, 2023
db967de
Delete docs/images/2023110_course_module4_fin.099-min.png
dmaliugina Nov 3, 2023
d1dce7e
Delete docs/images/2023110_course_module4_fin.098-min.png
dmaliugina Nov 3, 2023
624a957
Delete docs/images/2023110_course_module4_fin.097-min.png
dmaliugina Nov 3, 2023
dab6ae1
Delete docs/images/2023110_course_module4_fin.096-min.png
dmaliugina Nov 3, 2023
ea0dbba
Delete docs/images/2023110_course_module4_fin.095-min.png
dmaliugina Nov 3, 2023
a742a60
Delete docs/images/2023110_course_module4_fin.094-min.png
dmaliugina Nov 3, 2023
77ec17d
Delete docs/images/2023110_course_module4_fin.093-min.png
dmaliugina Nov 3, 2023
89ad077
Delete docs/images/2023110_course_module4_fin.091-min.png
dmaliugina Nov 3, 2023
40c1cf5
Update README.md
dmaliugina Nov 3, 2023
3d7ce44
Update README.md
dmaliugina Nov 3, 2023
docs/book/README.md (5 changes: 1 addition & 4 deletions)
@@ -9,9 +9,6 @@ description: Open-source ML observabilty course.

Welcome to the Open-source ML observability course!

The course starts on **October 16, 2023**. \
[Sign up](https://www.evidentlyai.com/ml-observability-course) to save your seat and receive weekly course updates.

# How to participate?
* **Join the course**. [Sign up](https://www.evidentlyai.com/ml-observability-course) to receive weekly updates with course materials and information about office hours.
* **Course platform [OPTIONAL]**. If you want to receive a course certificate, you should **also** [register](https://evidentlyai.thinkific.com/courses/ml-observability-course) on the platform and complete all the assignments before **December 1, 2023**.
@@ -47,7 +44,7 @@ ML observability course is organized into six modules. You can follow the comple
{% endcontent-ref %}

{% content-ref url="ml-observability-course/module-4-designing-effective-ml-monitoring.md" %}
[Module 4. Designing effective ML monitoring](ml-observability-course/module-4-designing-effective-ml-monitoring.md).
[Module 4. Designing effective ML monitoring](ml-observability-course/module-4-designing-effective-ml-monitoring/readme.md).
{% endcontent-ref %}

{% content-ref url="ml-observability-course/module-5-ml-pipelines-validation-and-testing.md" %}
docs/book/SUMMARY.md (9 changes: 8 additions & 1 deletion)
@@ -26,6 +26,13 @@
* [3.4. Monitoring embeddings drift](ml-observability-course/module-3-ml-monitoring-for-unstructured-data/monitoring-embeddings-drift.md)
* [3.5. Monitoring text data [CODE PRACTICE]](ml-observability-course/module-3-ml-monitoring-for-unstructured-data/monitoring-text-data-code-practice.md)
* [3.6. Monitoring multimodal datasets](ml-observability-course/module-3-ml-monitoring-for-unstructured-data/monitoring-multimodal-datasets.md)
* [Module 4: Designing effective ML monitoring](ml-observability-course/module-4-designing-effective-ml-monitoring.md)
* [Module 4: Designing effective ML monitoring](ml-observability-course/module-4-designing-effective-ml-monitoring/readme.md)
* [4.1. Logging for ML monitoring](ml-observability-course/module-4-designing-effective-ml-monitoring/logging-ml-monitoring.md)
* [4.2. How to prioritize ML monitoring metrics](ml-observability-course/module-4-designing-effective-ml-monitoring/how-to-prioritize-monitoring-metrics.md)
* [4.3. When to retrain machine learning models](ml-observability-course/module-4-designing-effective-ml-monitoring/when-to-retrain-ml-models.md)
* [4.4. How to choose a reference dataset in ML monitoring](ml-observability-course/module-4-designing-effective-ml-monitoring/how-to-choose-reference-dataset-ml-monitoring.md)
* [4.5. Custom metrics in ML monitoring](ml-observability-course/module-4-designing-effective-ml-monitoring/custom-metrics-ml-monitoring.md)
* [4.6. Implementing custom metrics in Evidently [OPTIONAL]](ml-observability-course/module-4-designing-effective-ml-monitoring/custom-metrics-evidently-code-practice.md)
* [4.7. How to choose the ML monitoring deployment architecture](ml-observability-course/module-4-designing-effective-ml-monitoring/choosing-ml-monitoring-deployment-architecture.md)
* [Module 5: ML pipelines validation and testing](ml-observability-course/module-5-ml-pipelines-validation-and-testing.md)
* [Module 6: Deploying an ML monitoring dashboard](ml-observability-course/module-6-deploying-an-ml-monitoring-dashboard.md)

This file was deleted.

New file (99 lines added): ml-observability-course/module-4-designing-effective-ml-monitoring/choosing-ml-monitoring-deployment-architecture.md
# 4.7. How to choose the ML monitoring deployment architecture

{% embed url="https://youtu.be/Q1NUCDZFRbU?si=26GhKBdhFAIzxBgi" %}

**Video 7**. [How to choose the ML monitoring deployment architecture](https://youtu.be/Q1NUCDZFRbU?si=26GhKBdhFAIzxBgi), by Emeli Dral

There are several possible backends for an ML monitoring architecture. Below, we compare ad-hoc reporting, batch monitoring, and near real-time (streaming) monitoring, as well as a combined setup.

![](<../../../images/2023110\_course\_module4\_fin.086-min.png>)

## Ad-hoc reporting

**Ad-hoc reporting** is a viable option when you have recently deployed a machine learning system and do not yet have other monitoring systems in place.
* It has **low engineering overhead**: you can use familiar tools like Jupyter notebooks, Python scripts, or R scripts.
* It is **suitable for initial exploration** of data and model quality and for shaping expectations about model performance, but it is not a long-term monitoring solution (see the sketch below).
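
As a minimal illustration of the ad-hoc flow, here is a notebook-style sketch using the Evidently Python library (assuming the `Report` API used in this course; the file paths and data layout are hypothetical):

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

# Reference data (e.g., a validation set) and a recent production batch.
# File paths and column layout are placeholders for illustration.
reference = pd.read_csv("data/reference.csv")
current = pd.read_csv("data/production_last_week.csv")

# Build an ad-hoc report with standard presets.
report = Report(metrics=[DataQualityPreset(), DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# In a notebook, you can display the report inline; here we save it to share.
report.save_html("adhoc_model_check.html")
```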

![](<../../../images/2023110\_course\_module4\_fin.087-min.png>)

## Batch monitoring

**Batch ML monitoring** is a reliable and stable approach. It is suitable for both machine learning pipelines and services.

To implement batch monitoring, you need a workflow orchestration tool like Airflow or Kubeflow, and tools for calculating metrics and tests, such as Evidently.
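
For illustration, here is a minimal sketch of a daily batch monitoring job defined as an Airflow DAG (Airflow 2.x API). The task, file paths, and storage layout are hypothetical; the point is that the monitoring logic runs as a regular scheduled job next to your other pipelines:

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset


def run_monitoring_checks():
    # Hypothetical storage layout: reference data plus the latest scored batch.
    reference = pd.read_parquet("/data/reference.parquet")
    current = pd.read_parquet("/data/predictions/latest.parquet")

    report = Report(metrics=[DataQualityPreset(), DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)

    # Persist results so a dashboard or alerting job can pick them up later.
    report.save_html("/monitoring/reports/daily_report.html")


with DAG(
    dag_id="ml_monitoring_daily",
    start_date=datetime(2023, 11, 1),
    schedule_interval="@daily",  # runs once a day, after the scoring pipeline
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_monitoring_checks",
        python_callable=run_monitoring_checks,
    )
```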

**Pros**:
* Works well both for ML models implemented as batch pipelines and for ML services.
* It is fairly simple to run monitoring jobs, especially if you already have a workflow orchestrator in place.
* You can use the same tools you use to run model training jobs during the experimental and validation phases of a machine learning lifecycle.
* You can combine immediate monitoring (e.g., data quality checks) and metrics dependent on ground truth (trigger-based calculations).

**Cons**:
* It is not real-time. Metrics are computed with a delay because monitoring runs as scheduled or triggered jobs rather than at serving time.
* It might be complex if you don't have an existing orchestrator; setting one up can be resource-intensive.

![](<../../../images/2023110\_course\_module4\_fin.088-min.png>)

## Near real-time (streaming) monitoring

**Near real-time ML monitoring** architecture is suitable when you serve models as APIs and want to detect issues close to real-time. In this case, you push data from the machine learning service to the monitoring system.

You will need a storage solution suited to time series data, such as Prometheus or ClickHouse, and tools like Grafana or Evidently for dashboarding and alerting.
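
As a minimal sketch of the serving-time part, the ML service can update lightweight counters and gauges on every prediction request and expose them for the monitoring system to collect (shown here with the `prometheus_client` library and a scrape-based setup; a push gateway works similarly, and the metric names and checks are illustrative assumptions):

```python
import pandas as pd
from prometheus_client import Counter, Gauge, start_http_server

# Metrics exposed by the ML service; names are illustrative.
PREDICTIONS_TOTAL = Counter("ml_predictions_total", "Number of predictions served")
MISSING_INPUTS_TOTAL = Counter("ml_missing_inputs_total", "Requests with missing feature values")
LAST_PREDICTION = Gauge("ml_last_prediction_value", "Most recent model output")


def predict(model, features: pd.DataFrame) -> float:
    """Wrap the model call with simple serving-time checks."""
    if features.isna().any().any():
        MISSING_INPUTS_TOTAL.inc()

    prediction = float(model.predict(features)[0])

    PREDICTIONS_TOTAL.inc()
    LAST_PREDICTION.set(prediction)
    return prediction


# Expose metrics on port 8000 so Prometheus (or another collector) can scrape them.
start_http_server(8000)
```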

**Pros**:
* Works for models deployed as an ML service as opposed to batch jobs.
* Suitable for scenarios when you need an immediate reaction to issues like missing data or outliers.

**Cons**:
* High operational costs. Make sure you have the resources to maintain an additional monitoring service.
* Potentially double effort. You will often still need to deal with delayed ground truth feedback and run batch monitoring jobs to calculate these metrics.

![](<../../../images/2023110\_course\_module4\_fin.089-min.png>)

**Custom monitoring backend**. You can also combine near real-time and batch monitoring.

For example, you can combine:
* **Real-time checks**. You can send the data available at serving time directly from the ML service to an ML monitoring system to run input and model output checks and to generate alerts.
* **Monitoring jobs**. For delayed ground truth or more complex checks, you can run monitoring jobs over prediction logs on a trigger or a schedule.
* **Dashboarding tool**. You can log all results to the same metric storage and get a single dashboard with panels for batch and real-time checks (see the sketch after this list).
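
Here is a minimal sketch of the "single metric storage" idea: both the real-time checks and the batch monitoring jobs append rows to one table, which the dashboarding tool then reads as a single data source. The table schema and SQLite storage are illustrative assumptions:

```python
import sqlite3
from datetime import datetime, timezone

# One shared table for all monitoring results; the schema is illustrative.
conn = sqlite3.connect("monitoring_metrics.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS ml_metrics (
           ts TEXT, source TEXT, metric TEXT, value REAL
       )"""
)


def log_metric(source: str, metric: str, value: float) -> None:
    """Append one metric value; `source` tells real-time checks and batch jobs apart."""
    conn.execute(
        "INSERT INTO ml_metrics (ts, source, metric, value) VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), source, metric, value),
    )
    conn.commit()


# Called from the ML service at serving time ...
log_metric("realtime", "share_missing_inputs", 0.02)
# ... and from a scheduled monitoring job once labels arrive.
log_metric("batch_job", "precision_7d", 0.81)
```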

![](<../../../images/2023110\_course\_module4\_fin.090-min.png>)

## A case for batch ML monitoring

Let’s walk through the possible logic of choosing an ML monitoring architecture.

First, let’s contrast it with **traditional software health monitoring**. You can typically implement additional service endpoints for metrics. Then, you can use tools like Prometheus to pull the metrics from these endpoints and store high-frequency time series data. You can add alerting and dashboarding tools that rely on these metrics as a data source.

![](<../../../images/2023110\_course\_module4\_fin.092-min.png>)

However, integrating ML metrics into this same setup isn't as simple. Here is why:
* **Complex metrics**. Software metrics are usually more straightforward in terms of computation. You can run simple aggregations over data points like response times and memory usage. Some ML-related metrics (like the number of rows or missing values) are similar. But others, like model quality or statistical tests, involve more complex calculations.
* **Delayed feedback**. Model quality metrics like precision, recall, or accuracy typically depend on delayed data. You cannot compute them at serving time and must wait for the labels. Once you calculate them, you must “backfill” the time series for the past period, since the moment you compute a metric is not the moment it refers to (see the sketch after this list).
* **Reference dataset**. For checks like data and prediction drift, you must also pass a batch of data you are comparing against. This does not easily fit into traditional software architecture.
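
To make the delayed-feedback point concrete, here is a minimal sketch of a "backfill" job: once labels arrive, it computes daily precision and stamps the values with the prediction dates, not the date the job runs. The column names and prediction log layout are assumptions for illustration:

```python
import pandas as pd
from sklearn.metrics import precision_score

# Prediction log written at serving time; labels are joined in later when they arrive.
log = pd.read_parquet("/monitoring/prediction_log.parquet")      # request_id, timestamp, prediction
labels = pd.read_parquet("/monitoring/ground_truth.parquet")     # request_id, label

labeled = log.merge(labels, on="request_id", how="inner")
labeled["prediction_date"] = pd.to_datetime(labeled["timestamp"]).dt.date

# Compute precision per prediction day and backfill the metric time series.
daily_precision = (
    labeled.groupby("prediction_date")
    .apply(lambda day: precision_score(day["label"], day["prediction"]))
    .rename("precision")
    .reset_index()
)

# Each row is stamped with the day the predictions were made,
# even though the metric is only computed now.
daily_precision.to_parquet("/monitoring/metrics/precision_backfill.parquet")
```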

ML model monitoring may require additional components:
* **Metric calculation pipelines**. If you run metric computation as jobs, you can use an appropriate backend that is not limited to SQL-like queries and can handle complex evaluations such as statistical drift detection and behavioral tests.
* **Several different pipelines**. You can split metrics into separate pipelines: some run on a schedule (for metrics you can compute immediately), while others are triggered by events like receiving new labeled data.
* **Passing the reference data**. You can implement pipelines that query the reference data, load it, and compare it against the current data batch.

**Example**: you can cover the whole model lifecycle with batch checks and monitoring jobs.

![](<../../../images/2023110\_course\_module4\_fin.102-min.png>)

You can still combine this approach with a traditional software monitoring architecture. Once you implement a separate metric computation backend for ML metrics, you can store the results in the metric storage and use it as a data source for your dashboarding system to visualize ML-related metrics.

![](<../../../images/2023110\_course\_module4\_fin.103-min.png>)

You can add a few ML-related metrics to an existing dashboard or create a separate ML monitoring dashboard.

## Summing up

We discussed the differences between ML monitoring architectures. Here are some takeaways:
* Choose the ML monitoring architecture that matches your available resources, risk mitigation needs, and the complexity of your machine learning model.
* Even if you deploy a model as a service, consider batch ML monitoring. It is a more lightweight option, especially if you already have a workflow orchestrator in place, and it can handle complex evaluation scenarios.

## Enjoyed the content?

Star Evidently on GitHub to contribute back! This helps us create free, open-source tools and content for the community.
⭐️ [Star](https://github.com/evidentlyai/evidently) on GitHub!
New file (18 lines added): ml-observability-course/module-4-designing-effective-ml-monitoring/custom-metrics-evidently-code-practice.md
# 4.6. Implementing custom metrics in Evidently [OPTIONAL]

{% embed url="https://youtu.be/uEyoP-sPhyc?si=7hwr4LaJIeBZ-YLD" %}

**Video 6**. [Implementing custom metrics in Evidently [OPTIONAL, CODE PRACTICE]](https://youtu.be/uEyoP-sPhyc?si=7hwr4LaJIeBZ-YLD), by Emeli Dral

This is an optional code practice video. It is useful if you already have experience with the Evidently Python library and are familiar with the existing Metrics and Tests. If you are new, check out the next module for an end-to-end example!

**Want to go straight to code?** Here is the [example notebook](https://github.com/evidentlyai/ml_observability_course/blob/main/module4/custom_metric_practice.ipynb) to follow along.

**Outline:**\
[00:00](https://www.youtube.com/watch?v=uEyoP-sPhyc&t=0s) Introduction \
[00:37](https://www.youtube.com/watch?v=uEyoP-sPhyc&t=37s) Imports \
[01:54](https://www.youtube.com/watch?v=uEyoP-sPhyc&t=114s) Understanding the structure of Metrics and Tests \
[05:11](https://www.youtube.com/watch?v=uEyoP-sPhyc&t=311s) Create a dummy custom metric \
[12:17](https://www.youtube.com/watch?v=uEyoP-sPhyc&t=737s) Apply a dummy metric on toy data \
[14:00](https://www.youtube.com/watch?v=uEyoP-sPhyc&t=840s) Create a more complicated metric: Mean by Category \
[26:25](https://www.youtube.com/watch?v=uEyoP-sPhyc&t=1585s) Apply a new metric on toy data
New file (56 lines added): ml-observability-course/module-4-designing-effective-ml-monitoring/custom-metrics-ml-monitoring.md
# 4.5. Custom metrics in ML monitoring

{% embed url="https://youtu.be/PrFuzKLM66I?si=68EF7tepIyXxyMig" %}

**Video 5**. [Custom metrics in ML monitoring](https://youtu.be/PrFuzKLM66I?si=68EF7tepIyXxyMig), by Emeli Dral

## Types of custom metrics

While there is no strict division between “standard” and “custom” metrics, there is broad consensus on some of them: for example, classification model quality is commonly evaluated with metrics like precision and recall, which are fairly “standard.”

However, you often need to implement “custom” metrics to reflect specific aspects of model performance. They typically refer to business objectives or domain requirements and help capture the impact of an ML model within its operational context.

Here are some examples.

**Business and product KPIs (or proxies)**. These metrics are aligned with key performance indicators that reflect the business goals and product performance.

**Examples include**:
* Manufacturing optimization: raw materials saved.
* Chatbots: number of successful chat completions.
* Fraud detection: number of detected fraud cases over $50,000.
* Recommender systems: share of recommendation blocks without clicks.

We recommend **consulting with business stakeholders** even before building the model. They may suggest valuable KPIs, heuristics, and metrics that can be monitored already during the experimentation phase.

When direct measurement of a KPI is not possible, consider **approximating the model impact**. For example, you can assign an average “cost” to specific types of model errors based on domain knowledge.
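
For example, here is a minimal sketch of such a proxy metric: a "cost of errors" that weights false negatives and false positives differently. The cost values are made-up placeholders that would come from domain knowledge:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed costs: a missed fraud case is far more expensive than a false alarm.
COST_FALSE_NEGATIVE = 500.0   # e.g., average loss per missed case, in dollars
COST_FALSE_POSITIVE = 10.0    # e.g., cost of a manual review


def estimated_error_cost(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Approximate business impact of model errors on a labeled batch."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE


# Toy example: one missed case and one false alarm.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 1])
print(estimated_error_cost(y_true, y_pred))  # 510.0
```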

![](<../../../images/2023110\_course\_module4\_fin.078-min.png>)

**Domain-specific ML metrics**. These are metrics that are commonly used in specific domains and industries.

**Examples include**:
* Churn prediction in telecommunications: lift metrics.
* Recommender systems: serendipity or novelty metrics.
* Healthcare: fairness metrics.
* Speech recognition: word error rate.
* Medical imaging: Jaccard index (see the sketch after this list).
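
Many of these domain metrics are simple to compute once predictions and ground truth are available. As an illustration, here is a sketch of the Jaccard index (intersection over union) for binary segmentation masks, using plain NumPy:

```python
import numpy as np


def jaccard_index(mask_true: np.ndarray, mask_pred: np.ndarray) -> float:
    """Intersection over union for two binary masks of the same shape."""
    mask_true = mask_true.astype(bool)
    mask_pred = mask_pred.astype(bool)
    intersection = np.logical_and(mask_true, mask_pred).sum()
    union = np.logical_or(mask_true, mask_pred).sum()
    return float(intersection / union) if union else 1.0  # both masks empty -> perfect match


# Toy example: two 2x3 masks that overlap on two pixels out of four marked.
truth = np.array([[1, 1, 0], [0, 1, 0]])
pred = np.array([[1, 0, 0], [0, 1, 1]])
print(jaccard_index(truth, pred))  # 2 / 4 = 0.5
```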

![](<../../../images/2023110\_course\_module4\_fin.079-min.png>)

**Weighted or aggregated metrics**. Sometimes, you can design custom metrics as a “weighted” variation of other metrics. For example, you can adjust them to account for the importance of certain features or classes in your data.

**Examples include**:
* Data drift weighted by feature importance (see the sketch after this list).
* Measuring specific recommender system biases, for example, based on product popularity, price, or product group.
* In unbalanced classification problems, you can weight precision and recall by class or by specific important user groups, for example, based on the estimated user lifetime value (LTV).
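
As a sketch of the first example, here is one way to weight per-feature drift by feature importance: run a two-sample Kolmogorov–Smirnov test per numerical feature and average the drift flags using the model's feature importances as weights. The choice of drift test and the 0.05 threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def weighted_drift_share(
    reference: pd.DataFrame,
    current: pd.DataFrame,
    importances: dict[str, float],  # e.g., taken from model.feature_importances_
    p_value_threshold: float = 0.05,
) -> float:
    """Share of drifting features weighted by importance (0 = no drift, 1 = all drifted)."""
    weights = np.array([importances[col] for col in importances])
    drift_flags = np.array([
        float(ks_2samp(reference[col], current[col]).pvalue < p_value_threshold)
        for col in importances
    ])
    return float(np.average(drift_flags, weights=weights))


# Toy example with two features: the important one drifts, the minor one does not.
rng = np.random.default_rng(42)
ref = pd.DataFrame({"income": rng.normal(0, 1, 1000), "age": rng.normal(0, 1, 1000)})
cur = pd.DataFrame({"income": rng.normal(1, 1, 1000), "age": rng.normal(0, 1, 1000)})
print(weighted_drift_share(ref, cur, {"income": 0.8, "age": 0.2}))  # typically ~0.8
```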

![](<../../../images/2023110\_course\_module4\_fin.080-min.png>)

## Summing up

There is no need to invent “custom” metrics just for the sake of it. However, you might want to implement them to:
* better reflect important model qualities,
* estimate the business impact of the model,
* add metrics useful for product and business stakeholders and accepted within the domain.

Up next: optional code practice to create and implement a custom quality metric in the Evidently Python library.