# DVC Machine Learning Pipeline
This repository demonstrates how to set up and manage a Machine Learning (ML) pipeline using DVC, a version control system for machine learning projects. The goal of the project is to create an automated, reproducible pipeline that handles data ingestion, preprocessing, feature engineering, model training, and evaluation. Additionally, DVC will help track data, models, and experiments to ensure that the process is repeatable and shareable.
This setup also integrates dvclive, a library for experiment tracking, which logs the metrics from each run and stores them in a versioned format.
## Project Setup
The pipeline is split into modules under `src/`:

- `data_ingestion.py` – responsible for downloading, cleaning, and preparing the raw data for further processing.
- `data_preprocessing.py` – handles tasks like feature extraction, data scaling, and other preprocessing activities.
- `feature_engineering.py` – creates new features or transforms existing ones to improve model performance.
- `model_building.py` – defines and trains the ML model (e.g., classification or regression).
- `model_evaluation.py` – evaluates the trained model using performance metrics like accuracy, precision, and recall.

Each module can be run independently, allowing for flexibility and easier debugging.
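As a concrete illustration, a minimal sketch of what `data_ingestion.py` might look like. The function names, the `--test_size` flag, and the deterministic split are assumptions for demonstration, not the repository's actual implementation:

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the command-line flags a data_ingestion.py script might accept."""
    parser = argparse.ArgumentParser(description="Ingest and split raw data")
    parser.add_argument("--test_size", type=float, default=0.2)
    return parser.parse_args(argv)


def train_test_split_rows(rows, test_size, seed=42):
    """Shuffle rows deterministically and split them into train/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_size)
    return rows[n_test:], rows[:n_test]
```

The real script would then write the two splits under `data/raw/`, which is exactly what DVC tracks as the stage's output.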
- **Add data, model, and reports to `.gitignore`**

We do not want to track large files such as datasets, trained models, or generated reports in Git, as they are usually too large and not ideal for versioning. Create a `.gitignore` file and add the following directories:
```
data/
model/
reports/
```
This will ensure that Git ignores these files and only tracks the source code and configuration files.
- **Add, commit, and push to GitHub**

Once you’ve added all the files to the repository, commit and push them to your GitHub repository:
```
git add .
git commit -m "Initial commit with pipeline structure"
git push origin main
```

## Setting Up DVC Pipeline (Without Parameters)
- **Create `dvc.yaml` and add stages**

The `dvc.yaml` file defines the stages of the pipeline. Each stage represents a step in the data processing and modeling flow. The stages are defined in YAML format, specifying the commands to run, their dependencies (i.e., the files the stage needs), and their outputs (i.e., the files the stage generates).
Example:
```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py
    deps:
      - src/data_ingestion.py
    outs:
      - data/raw
  data_preprocessing:
    cmd: python src/data_preprocessing.py
    deps:
      - src/data_preprocessing.py
      - data/raw
    outs:
      - data/processed
  feature_engineering:
    cmd: python src/feature_engineering.py
    deps:
      - src/feature_engineering.py
      - data/processed
    outs:
      - data/features
  model_building:
    cmd: python src/model_building.py
    deps:
      - src/model_building.py
      - data/features
    outs:
      - model/model.pkl
  model_evaluation:
    cmd: python src/model_evaluation.py
    deps:
      - src/model_evaluation.py
      - model/model.pkl
    outs:
      - reports/evaluation.csv
```

Note that each stage also lists the previous stage's output as a dependency. This is how DVC infers the execution order of the DAG and knows to re-run only the stages whose inputs actually changed.
Each stage defines:
- `cmd`: the command to run the script.
- `deps`: dependencies of the stage (e.g., scripts, upstream data, and configuration files).
- `outs`: the output files generated by the stage (e.g., processed data, the trained model).
- **Initialize DVC and test the pipeline**

To initialize DVC and run the pipeline, execute the following commands:
```
dvc init
dvc repro
```
The dvc repro command reproduces the pipeline by running each stage in the correct order based on the dependencies. You can visualize the pipeline DAG (Directed Acyclic Graph) to see the flow of data and the relationships between stages:
```
dvc dag
```
- **Add, commit, and push to GitHub**

After setting up the DVC pipeline, commit the changes to both Git and DVC:
```
git add dvc.yaml dvc.lock
git commit -m "Set up initial DVC pipeline"
git push origin main
```

(`dvc repro` generates a `dvc.lock` file recording the exact versions of inputs and outputs; commit it alongside `dvc.yaml`.)
## Setting Up DVC Pipeline (With Parameters)
- **Add `params.yaml`**

The `params.yaml` file stores configuration parameters that can be shared across different stages of the pipeline. These parameters can be adjusted to tune the pipeline, such as specifying the test size for data splitting or setting model hyperparameters.
Example:
```yaml
data_ingestion:
  test_size: 0.2
model_building:
  max_depth: 5
  n_estimators: 100
```
- **Update the DVC pipeline to use parameters**

Update the `dvc.yaml` file to reference `params.yaml`, so that the stages pick up their values dynamically.
Example modification for the data_ingestion stage:
```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py --test_size ${data_ingestion.test_size}
    deps:
      - src/data_ingestion.py
      - params.yaml
    outs:
      - data/raw
```
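Listing `params.yaml` as a plain dependency re-runs the stage whenever *any* parameter in the file changes. A finer-grained alternative supported by DVC is a `params` section, which tracks only the named keys and makes them visible to `dvc params diff` and `dvc exp show`. A sketch, using the same stage:

```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py --test_size ${data_ingestion.test_size}
    deps:
      - src/data_ingestion.py
    params:
      - data_ingestion.test_size
    outs:
      - data/raw
```

With this form, changing `model_building.max_depth` would not invalidate the ingestion stage.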
- **Run the pipeline with parameters**

Now that the pipeline is configured with parameters, run it again to test the setup:
```
dvc repro
```
This will execute the stages, utilizing the parameters defined in params.yaml.
- **Add, commit, and push to GitHub**

Once you’ve confirmed the pipeline is working with parameters, commit the changes to Git and DVC:
```
git add params.yaml dvc.yaml
git commit -m "Added parameters to DVC pipeline"
git push origin main
```
## Experiment Tracking with DVC and dvclive

- **Install dvclive**

Install the dvclive library to track experiments and log performance metrics during model training. This allows you to monitor the evolution of experiments over time.
```
pip install dvclive
```
- **Add a dvclive code block**

In the `model_building.py` script, integrate dvclive to log metrics such as accuracy, loss, or other performance measures. The snippet below uses the `Live` API (dvclive 2.x and later; earlier versions exposed `dvclive.init()` and `dvclive.log()` instead), and assumes `model`, `X_test`, and `y_test` already exist in the training code:

```python
from dvclive import Live

with Live() as live:
    accuracy = model.score(X_test, y_test)  # e.g., a scikit-learn estimator
    live.log_metric("accuracy", accuracy)
    live.next_step()  # advance the step counter when logging over multiple iterations
```
- **Run experiments**

To track experiments, run the pipeline again:
```
dvc exp run
```
Each time you run the pipeline, a new experiment is logged in the dvclive directory, storing the metrics for that run.
- **View experiment results**

You can view the results of the experiments using the following command:
```
dvc exp show
```
Alternatively, use the DVC extension in Visual Studio Code for a graphical interface to manage experiments.
- **Manage experiments**

You can manage experiments with the following commands.

Remove an experiment:

```
dvc exp remove <experiment-name>
```

Apply a previous experiment, restoring its results to your workspace:

```
dvc exp apply <experiment-name>
```
- **Change parameters and re-run**

You can modify the parameters in `params.yaml` to test different configurations and track them as separate experiments. This allows you to fine-tune the model or the preprocessing steps.
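Rather than editing `params.yaml` by hand, `dvc exp run` can override parameters for a single run with `--set-param` (`-S`); each run is recorded as its own experiment. The values below are illustrative:

```shell
# Override a single parameter for one experiment
dvc exp run --set-param model_building.max_depth=10

# Override several parameters at once
dvc exp run -S model_building.n_estimators=200 -S data_ingestion.test_size=0.3
```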
- **Add, commit, and push to GitHub**

After tracking and running experiments, commit the changes to Git and DVC:

```
git add dvc.lock dvc.yaml
git commit -m "Track new experiment results"
git push origin main
```
## Conclusion
This project demonstrates a full ML pipeline using DVC and dvclive for reproducibility and experiment tracking. The steps outlined here ensure that data, models, and experiments are all versioned and easily reproducible, making it simpler to experiment with different configurations, compare results, and share the project with others.
By using DVC, you gain several key benefits for ML projects:
- **Reproducibility**: all stages of the pipeline are defined and versioned, making it easy to reproduce the results.
- **Experiment tracking**: dvclive allows for logging and comparing metrics from different experiments.
- **Collaboration**: DVC makes it easier to collaborate with others, sharing code, data, and models in a versioned and controlled manner.

### Additional Notes

- Large files like datasets and trained models are excluded from Git tracking by adding them to `.gitignore`. DVC handles these files efficiently by versioning them separately.
- You can scale and modify the pipeline by adding more stages or adjusting parameters.
- For team collaborations, DVC makes it simple to synchronize data and models using remote storage (e.g., S3, GCS).

This setup ensures that the pipeline is robust, reproducible, and efficient for both individual use and team collaboration.
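Configuring a remote is a one-time setup. A sketch using a hypothetical S3 bucket (the remote name and bucket path are placeholders):

```shell
# Register a default remote for DVC-tracked artifacts
dvc remote add -d storage s3://my-bucket/dvc-store

# Upload the data and models referenced by dvc.lock
dvc push

# Teammates fetch the same artifacts after cloning the Git repo
dvc pull
```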