diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/README.md b/self-paced-labs/vertex-ai/train-deploy-tf-model/README.md
new file mode 100644
index 0000000000..94888f9a8b
--- /dev/null
+++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/README.md
@@ -0,0 +1,119 @@
+# Vertex AI: Qwik Start
+
+In this lab, you will use [BigQuery](https://cloud.google.com/bigquery) for data processing and exploratory data analysis and the [Vertex AI](https://cloud.google.com/vertex-ai) platform to train and deploy a custom TensorFlow Regressor model to predict customer lifetime value (CLV). The goal of the lab is to introduce Vertex AI through a high-value, real-world use case: predictive CLV. You will start with a local BigQuery and TensorFlow workflow that you may already be familiar with and progress toward training and deploying your model in the cloud with Vertex AI, as well as retrieving predictions and explanations from your model.
+
+![Vertex AI](./images/vertex-ai-overview.png "Vertex AI Overview")
+
+Vertex AI is Google Cloud's next-generation, unified platform for machine learning development and the successor to AI Platform, announced at Google I/O in May 2021. By developing machine learning solutions on Vertex AI, you can leverage the latest pre-built ML components and AutoML to significantly enhance development productivity, scale your workflows and data-driven decision making, and accelerate time to value.
+
+## Learning objectives
+
+* Train a TensorFlow model locally in a hosted [**Vertex Notebook**](https://cloud.google.com/vertex-ai/docs/general/notebooks?hl=sv).
+* Create a [**managed Tabular dataset**](https://cloud.google.com/vertex-ai/docs/training/using-managed-datasets?hl=sv) artifact for experiment tracking.
+* Containerize your training code with [**Cloud Build**](https://cloud.google.com/build) and push it to [**Google Cloud Artifact Registry**](https://cloud.google.com/artifact-registry).
+* Run a [**Vertex AI custom training job**](https://cloud.google.com/vertex-ai/docs/training/custom-training) with your custom model container.
+* Use [**Vertex TensorBoard**](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview) to visualize model performance.
+* Deploy your trained model to a [**Vertex Online Prediction Endpoint**](https://cloud.google.com/vertex-ai/docs/predictions/getting-predictions) for serving predictions.
+* Request an online prediction and explanation and see the response.
+
+## Setup
+
+### 1. Enable the Cloud services used in the lab environment
+
+#### 1.1 Launch [Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell)
+
+#### 1.2 Set your Project ID
+
+Confirm that you see your desired project ID returned by the command below:
+```
+gcloud config get-value project
+```
+
+If you do not see your desired project ID, set it as follows:
+```
+PROJECT_ID=[YOUR PROJECT ID]
+gcloud config set project $PROJECT_ID
+```
+
+#### 1.3 Use `gcloud` to enable the services
+
+```
+gcloud services enable \
+  compute.googleapis.com \
+  iam.googleapis.com \
+  iamcredentials.googleapis.com \
+  monitoring.googleapis.com \
+  logging.googleapis.com \
+  notebooks.googleapis.com \
+  aiplatform.googleapis.com \
+  bigquery.googleapis.com \
+  artifactregistry.googleapis.com \
+  cloudbuild.googleapis.com \
+  container.googleapis.com
+```
+
+### 2. Create a Vertex AI custom service account for Vertex TensorBoard experiment tracking
+
+#### 2.1. Create the custom service account
+```
+SERVICE_ACCOUNT_ID=vertex-custom-training-sa
+gcloud iam service-accounts create $SERVICE_ACCOUNT_ID \
+  --description="A custom service account for Vertex custom training with TensorBoard" \
+  --display-name="Vertex AI Custom Training"
+```
+
+#### 2.2. Grant it access to Cloud Storage for writing and retrieving TensorBoard logs
+```
+PROJECT_ID=$(gcloud config get-value core/project)
+gcloud projects add-iam-policy-binding $PROJECT_ID \
+  --member=serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com \
+  --role="roles/storage.admin"
+```
+
+#### 2.3. Grant it access to your BigQuery data source to read data into your TensorFlow model
+```
+gcloud projects add-iam-policy-binding $PROJECT_ID \
+  --member=serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com \
+  --role="roles/bigquery.admin"
+```
+
+#### 2.4. Grant it access to Vertex AI for running model training, deployment, and explanation jobs
+```
+gcloud projects add-iam-policy-binding $PROJECT_ID \
+  --member=serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com \
+  --role="roles/aiplatform.user"
+```
+
+### 3. Create a Vertex Notebooks instance
+
+A **Vertex Notebooks** instance is used as the primary lab environment.
+
+To provision the instance, follow the [Create a new notebook instance](https://cloud.google.com/vertex-ai/docs/general/notebooks) setup guide. Use the *TensorFlow Enterprise 2.3* no-GPU image. Leave all other settings at their default values.
+
+After the instance is created, you can connect to the [JupyterLab](https://jupyter.org/) IDE by clicking the *OPEN JUPYTERLAB* link in the [Vertex AI Notebooks Console](https://console.cloud.google.com/vertex-ai/notebooks/instances).
+
+
+### 4. Clone the lab repository
+
+In your **JupyterLab** instance, open a terminal and clone this repository into the `home` folder.
+```
+cd
+git clone https://github.com/GoogleCloudPlatform/training-data-analyst.git
+```
+
+### 5. Install the lab dependencies
+
+Run the following in the **JupyterLab** terminal to change into the `training-data-analyst/self-paced-labs/vertex-ai/train-deploy-tf-model` folder and install the lab dependencies from `requirements.txt`:
+
+```bash
+cd training-data-analyst/self-paced-labs/vertex-ai/train-deploy-tf-model
+pip install -U -r requirements.txt
+```
+
+### 6. Navigate to the lab notebook
+
+In your **JupyterLab** instance, navigate to __training-data-analyst__ > __self-paced-labs__ > __vertex-ai__ > __train-deploy-tf-model__ and open __lab_exercise.ipynb__.
+
+Work through the notebook cells in order to complete the lab.
+
+Happy coding!
\ No newline at end of file diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/images/clv-rfm.svg b/self-paced-labs/vertex-ai/train-deploy-tf-model/images/clv-rfm.svg new file mode 100644 index 0000000000..d86723d0c1 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/images/clv-rfm.svg @@ -0,0 +1 @@ +Timeline1Layer 1HistoricalNowTimeUnknown \ No newline at end of file diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/images/vertex-ai-overview.png b/self-paced-labs/vertex-ai/train-deploy-tf-model/images/vertex-ai-overview.png new file mode 100644 index 0000000000..277f151656 Binary files /dev/null and b/self-paced-labs/vertex-ai/train-deploy-tf-model/images/vertex-ai-overview.png differ diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/lab_exercise.ipynb b/self-paced-labs/vertex-ai/train-deploy-tf-model/lab_exercise.ipynb new file mode 100644 index 0000000000..285c908581 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/lab_exercise.ipynb @@ -0,0 +1,1062 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "81e68768", + "metadata": {}, + "source": [ + "# Vertex AI: Qwik Start" + ] + }, + { + "cell_type": "markdown", + "id": "8f3be9d1", + "metadata": {}, + "source": [ + "## Learning objectives\n", + "\n", + "* Train a TensorFlow model locally in a hosted [**Vertex Notebook**](https://cloud.google.com/vertex-ai/docs/general/notebooks?hl=sv).\n" + ] + }, + { + "cell_type": "markdown", + "id": "c7a746be", + "metadata": {}, + "source": [ + "## Introduction: customer lifetime value (CLV) prediction with BigQuery and TensorFlow on Vertex AI" + ] + }, + { + "cell_type": "markdown", + "id": "76bf82e0", + "metadata": {}, + "source": [ + "In this lab, you use [BigQuery](https://cloud.google.com/bigquery) for data processing and exploratory data analysis and the [Vertex AI](https://cloud.google.com/vertex-ai) platform to train and deploy a custom TensorFlow Regressor model to predict customer lifetime value (CLV). The goal of the lab is to introduce to Vertex AI through a high value real world use case - predictive CLV. You start with a local BigQuery and TensorFlow workflow that you may already be familiar with and progress toward training and deploying your model in the cloud with Vertex AI.\n", + "\n", + "![Vertex AI](./images/vertex-ai-overview.png \"Vertex AI Overview\")\n", + "\n", + "Vertex AI is Google Cloud's next generation, unified platform for machine learning development and the successor to AI Platform announced at Google I/O in May 2021. By developing machine learning solutions on Vertex AI, you can leverage the latest ML pre-built components and AutoML to significantly enhance development productivity, the ability to scale your workflow and decision making with your data, and accelerate time to value." + ] + }, + { + "cell_type": "markdown", + "id": "4fe3b8c6", + "metadata": {}, + "source": [ + "### Predictive CLV: how much monetary value existing customers will bring to the business in the future\n", + "\n", + "Predictive CLV is a high impact ML business use case. CLV is a customer's past value plus their predicted future value. 
The goal of predictive CLV is to predict how much monetary value a user will bring to the business in a defined future time range based on historical transactions.\n",
+    "\n",
+    "By knowing CLV, you can develop positive ROI strategies and make decisions about how much money to invest in acquiring new customers and retaining existing ones to grow revenue and profit.\n",
+    "\n",
+    "Once your ML model is successful, you can use its results to identify the customers most likely to spend money and encourage them to respond to your offers and discounts more frequently. These customers, with higher lifetime value, are your main marketing target for increasing revenue.\n",
+    "\n",
+    "By using the machine learning approach to predicting your customers' value that you will use in this lab, you can prioritize your next actions, such as the following:\n",
+    "\n",
+    "* Decide which customers to target with advertising to increase revenue.\n",
+    "* Identify which customer segments are most profitable and plan how to move customers from one segment to another.\n",
+    "\n",
+    "Your task is to predict the future value for existing customers based on their known transaction history.\n",
+    "\n",
+    "![CLV](./images/clv-rfm.svg \"Customer Lifetime Value\") \n",
+    "Source: [Cloud Architecture Center - Predicting Customer Lifetime Value with AI Platform: training the models](https://cloud.google.com/architecture/clv-prediction-with-offline-training-train)\n",
+    "\n",
+    "There is a strong positive correlation between each customer's recency, frequency, and amount of money spent on each purchase and their CLV. Consequently, you leverage these features in your ML model. For this lab, they are defined as:\n",
+    "\n",
+    "* **Recency**: The time between the last purchase and today, represented by the distance between the rightmost circle and the vertical dotted line that's labeled \"Now\".\n",
+    "* **Frequency**: The time between purchases, represented by the distance between the circles on a single line.\n",
+    "* **Monetary**: The amount of money spent on each purchase, represented by the size of the circle. This amount could be the average order value or the quantity of products that the customer ordered."
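+    ,
+    "\n",
+    "To make these definitions concrete, here is a minimal, hypothetical sketch of how simple recency, frequency, and monetary signals could be computed for a single customer with pandas. It is not part of the lab's data pipeline, and the `order_date` and `revenue_gbp` column names and values are purely illustrative:\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Toy transaction history for one customer (illustrative columns and values).\n",
+    "tx = pd.DataFrame({\n",
+    "    \"order_date\": pd.to_datetime([\"2011-01-05\", \"2011-02-10\", \"2011-03-02\"]),\n",
+    "    \"revenue_gbp\": [20.0, 35.0, 15.0],\n",
+    "})\n",
+    "now = pd.Timestamp(\"2011-03-31\")\n",
+    "\n",
+    "recency_days = (now - tx[\"order_date\"].max()).days  # days since the last purchase\n",
+    "frequency = len(tx)                                  # number of purchases (a simple frequency signal)\n",
+    "monetary = tx[\"revenue_gbp\"].mean()                 # average purchase revenue\n",
+    "print(recency_days, frequency, monetary)\n",
+    "```"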
+ ] + }, + { + "cell_type": "markdown", + "id": "d46a1982", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "markdown", + "id": "dc29eb23", + "metadata": {}, + "source": [ + "### Define constants" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd4c2e53", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Add installed library dependencies to Python PATH variable.\n", + "PATH=%env PATH\n", + "%env PATH={PATH}:/home/jupyter/.local/bin" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "93ead7a0", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Retrieve and set PROJECT_ID and REGION environment variables.\n", + "PROJECT_ID = !(gcloud config get-value core/project)\n", + "PROJECT_ID = PROJECT_ID[0]\n", + "# Replace the value below with your assigned lab region.\n", + "REGION = 'us-central1'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d6d4df6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Create a globally unique Google Cloud Storage bucket for artifact storage.\n", + "GCS_BUCKET = f\"{PROJECT_ID}-bucket\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "883ab23c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!gsutil mb -l $REGION gs://$GCS_BUCKET" + ] + }, + { + "cell_type": "markdown", + "id": "8018cc87", + "metadata": {}, + "source": [ + "### Import libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "412ffc51", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import datetime\n", + "import numpy as np\n", + "import pandas as pd\n", + "import tensorflow as tf\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from google.cloud import aiplatform" + ] + }, + { + "cell_type": "markdown", + "id": "aecf21cb", + "metadata": {}, + "source": [ + "### Initialize the Vertex Python SDK client" + ] + }, + { + "cell_type": "markdown", + "id": "a301853d", + "metadata": {}, + "source": [ + "Import the Vertex SDK for Python into your Python environment and initialize it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae6029df", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f\"gs://{GCS_BUCKET}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cf880707", + "metadata": {}, + "source": [ + "## Download and process the lab data into BigQuery" + ] + }, + { + "cell_type": "markdown", + "id": "742ceefd", + "metadata": {}, + "source": [ + "### Dataset\n", + "\n", + "In this lab, you use the publicly available [Online Retail data set](https://archive.ics.uci.edu/ml/datasets/online+retail) from the UCI Machine Learning Repository. This dataset contains 541,909 transnational customer transactions occuring between (YYYY-MM-DD) 2010-12-01 and 2011-12-09 for a UK-based and registered non-store retailer. The company primarily sells unique all-occasion gifts. Many of the company's customers are wholesalers.\n", + "\n", + "**Citation** \n", + "Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. 
Irvine, CA: University of California, School of Information and Computer Science.\n", + "\n", + "This lab is also inspired by the Google Cloud Architect Guide Series [Predicting Customer Lifetime Value with AI Platform: introduction](https://cloud.google.com/architecture/clv-prediction-with-offline-training-intro)." + ] + }, + { + "cell_type": "markdown", + "id": "9c7d9d01", + "metadata": {}, + "source": [ + "### Data ingestion" + ] + }, + { + "cell_type": "markdown", + "id": "df4efbb9", + "metadata": {}, + "source": [ + "Execute the command below to ingest the lab data from the UCI Machine Learning repository into `Cloud Storage` and then upload to `BigQuery` for data processing. The data ingestion and processing scripts are available under the `utils` folder in the lab directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7720d05e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# BigQuery constants. Please leave these unchanged.\n", + "BQ_DATASET_NAME=\"online_retail\"\n", + "BQ_RAW_TABLE_NAME=\"online_retail_clv_raw\"\n", + "BQ_CLEAN_TABLE_NAME=\"online_retail_clv_clean\"\n", + "BQ_ML_TABLE_NAME=\"online_retail_clv_ml\"\n", + "BQ_URI=f\"bq://{PROJECT_ID}.{BQ_DATASET_NAME}.{BQ_ML_TABLE_NAME}\"" + ] + }, + { + "cell_type": "markdown", + "id": "557df7b2", + "metadata": {}, + "source": [ + "**Note**: This Python script will take about 2-3 min to download and process the lab data file. Follow along with logging output in the cell below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a42e87bc", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!python utils/data_download.py \\\n", + " --PROJECT_ID={PROJECT_ID} \\\n", + " --GCS_BUCKET={GCS_BUCKET} \\\n", + " --BQ_RAW_TABLE_NAME={BQ_RAW_TABLE_NAME} \\\n", + " --BQ_CLEAN_TABLE_NAME={BQ_CLEAN_TABLE_NAME} \\\n", + " --BQ_ML_TABLE_NAME={BQ_ML_TABLE_NAME} \\\n", + " --URL=\"https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online Retail.xlsx\"" + ] + }, + { + "cell_type": "markdown", + "id": "6ca57a9f", + "metadata": {}, + "source": [ + "### Data processing" + ] + }, + { + "cell_type": "markdown", + "id": "c7293fc2", + "metadata": {}, + "source": [ + "As is the case with many real-world datasets, the lab dataset required some cleanup for you to utilize this historical customer transaction data for predictive CLV.\n", + "\n", + "The following changes were applied:\n", + "\n", + "* Keep only records that have a Customer ID.\n", + "* Aggregate transactions by day from Invoices.\n", + "* Keep only records that have positive order quantities and monetary values.\n", + "* Aggregate transactions by Customer ID and compute recency, frequency, monetary features as well as the prediction target.\n", + "\n", + "**Features**:\n", + "- `customer_country` (CATEGORICAL): customer purchase country.\n", + "- `n_purchases` (NUMERIC): number of purchases made in feature window. (frequency)\n", + "- `avg_purchase_size` (NUMERIC): average unit purchase count in feature window. (monetary)\n", + "- `avg_purchase_revenue` (NUMERIC): average GBP purchase amount in in feature window. (monetary)\n", + "- `customer_age` (NUMERIC): days from first purchase in feature window.\n", + "- `days_since_last_purchase` (NUMERIC): days from the most recent purchase in the feature window. 
(recency) \n", + "\n", + "**Target**: \n", + "- `target_monetary_value_3M` (NUMERIC): customer revenue from the entire study window including feature and prediction windows.\n", + "\n", + "Note: This lab demonstrates a simple way to use a DNN predict customer 3-month ahead CLV monetary value based solely on the available dataset historical transaction history. Additional factors to consider in practice when using CLV to inform interventions include customer acquisition costs, profit margins, and discount rates to arrive at the present value of future customer cash flows. One of a DNN's benefits over traditional probabilistic modeling approaches is their ability to incorporate additional categorical and unstructured features; this is a great feature engineering opportunity to explore beyond this lab which just explores the RFM numeric features." + ] + }, + { + "cell_type": "markdown", + "id": "402abff6", + "metadata": {}, + "source": [ + "## Exploratory data analysis (EDA) in BigQuery" + ] + }, + { + "cell_type": "markdown", + "id": "f4fa4d6c", + "metadata": {}, + "source": [ + "Below you use BigQuery from this notebook to do exploratory data analysis to get to know this dataset and identify opportunities for data cleanup and feature engineering." + ] + }, + { + "cell_type": "markdown", + "id": "91c50cbe", + "metadata": {}, + "source": [ + "### Recency: how recently have customers purchased?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50110392", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%bigquery recency\n", + "\n", + "SELECT \n", + " days_since_last_purchase\n", + "FROM \n", + " `online_retail.online_retail_clv_ml`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75edeba1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "recency.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89bc69b4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "recency.hist(bins=100);" + ] + }, + { + "cell_type": "markdown", + "id": "e857fb43", + "metadata": {}, + "source": [ + "From the chart, there are clearly a few different customer groups here such as loyal customers that have made purchases in the last few days as well as inactive customers that have not purchased in 250+ days. Using CLV predictions and insights, you can strategize on marketing and promotional interventions to improve customer purchase recency and re-active dormant customers." + ] + }, + { + "cell_type": "markdown", + "id": "1d4d8860", + "metadata": {}, + "source": [ + "### Frequency: how often are customers purchasing?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "34402015", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%bigquery frequency\n", + "\n", + "SELECT\n", + " n_purchases\n", + "FROM\n", + " `online_retail.online_retail_clv_ml`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc1fd5c2", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "frequency.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9cbeac7e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "frequency.hist(bins=100);" + ] + }, + { + "cell_type": "markdown", + "id": "00c933f5", + "metadata": {}, + "source": [ + "From the chart and quantiles, you can see that half of the customers have less than or equal to only 2 purchases. 
You can also tell, from the fact that the mean number of purchases is larger than the median and the maximum is 81 purchases, that there are customers, likely wholesalers, who have made significantly more purchases. This should already have you thinking about feature engineering opportunities such as bucketizing purchases and removing or clipping outlier customers. You can also explore alternative modeling strategies for CLV on new customers who have made only one purchase, as the approach demonstrated in this lab performs better on customers with a longer transaction history."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00c0c043",
+   "metadata": {},
+   "source": [
+    "### Monetary: how much are customers spending?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8b8d00ea",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "%%bigquery monetary\n",
+    "\n",
+    "SELECT\n",
+    "  target_monetary_value_3M\n",
+    "FROM\n",
+    "  `online_retail.online_retail_clv_ml`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "636a5010",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "monetary.describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "08b651c5",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "monetary['target_monetary_value_3M'].plot(kind='box', title=\"Target Monetary Value 3M: wide range, long right tail distribution\", grid=True);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7bc60b98",
+   "metadata": {},
+   "source": [
+    "From the chart and summary statistics, you can see that customer monetary value spans a wide range, from 2.90 to 268,478 GBP. Looking at the quantiles, it is clear there are a few outlier customers whose monetary value is greater than 3 standard deviations from the mean. With this small dataset, it is recommended that you remove these outlier customers and treat them separately, change your model's loss function to be more resistant to outliers, log-transform the target feature, or clip outlier values to a maximum threshold. You should also revisit your CLV business requirements to see whether binning customer monetary value and reframing this as an ML classification problem would better suit your needs."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02e553fd",
+   "metadata": {},
+   "source": [
+    "### Establish a simple model performance baseline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "08221502",
+   "metadata": {},
+   "source": [
+    "To evaluate the custom TensorFlow DNN Regressor model you will build in the next steps, it is an ML best practice to first establish a simple performance baseline. Below is a simple SQL baseline that scales each customer's average purchase revenue by their purchase rate, extrapolated over the prediction window, and computes standard regression metrics."
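+    ,
+    "\n",
+    "For reference, the standard regression metrics reported by the query are, for baseline predictions $\\hat{y}_i$, actual values $y_i$, and $n$ customers:\n",
+    "\n",
+    "$$\\mathrm{MAE} = \\frac{1}{n}\\sum_{i=1}^{n}\\left|\\hat{y}_i - y_i\\right|, \\qquad \\mathrm{MSE} = \\frac{1}{n}\\sum_{i=1}^{n}\\left(\\hat{y}_i - y_i\\right)^2, \\qquad \\mathrm{RMSE} = \\sqrt{\\mathrm{MSE}}$$\n",
+    "\n",
+    "MAE penalizes all errors linearly, while MSE and RMSE weight large errors much more heavily, which is why a few outlier customers can dominate them."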
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf088864", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%bigquery\n", + "\n", + "WITH\n", + " day_intervals AS (\n", + " SELECT\n", + " customer_id,\n", + " DATE_DIFF(DATE('2011-12-01'), DATE('2011-09-01'), DAY) AS target_days,\n", + " DATE_DIFF(DATE('2011-09-01'), MIN(order_date), DAY) AS feature_days,\n", + " FROM\n", + " `online_retail.online_retail_clv_clean`\n", + " GROUP BY\n", + " customer_id\n", + " ),\n", + " \n", + " predicted_clv AS (\n", + " SELECT\n", + " customer_id,\n", + " AVG(avg_purchase_revenue) * (COUNT(n_purchases) * (1 + SAFE_DIVIDE(COUNT(target_days),COUNT(feature_days)))) AS predicted_monetary_value_3M,\n", + " SUM(target_monetary_value_3M) AS target_monetary_value_3M\n", + " FROM\n", + " `online_retail.online_retail_clv_ml`\n", + " LEFT JOIN day_intervals USING(customer_id)\n", + " GROUP BY\n", + " customer_id\n", + " )\n", + "\n", + "# Calculate overall baseline regression metrics.\n", + "SELECT\n", + " ROUND(AVG(ABS(predicted_monetary_value_3M - target_monetary_value_3M)), 2) AS MAE,\n", + " ROUND(AVG(POW(predicted_monetary_value_3M - target_monetary_value_3M, 2)), 2) AS MSE,\n", + " ROUND(SQRT(AVG(POW(predicted_monetary_value_3M - target_monetary_value_3M, 2))), 2) AS RMSE\n", + "FROM\n", + " predicted_clv" + ] + }, + { + "cell_type": "markdown", + "id": "956ac010", + "metadata": {}, + "source": [ + "These baseline results provide further support for the strong impact of outliers. The extremely high MSE comes from the exponential penalty applied to missed predictions and the magnitude of error on a few predictions.\n", + "\n", + "Next, you should look to plot the baseline results to get a sense of opportunity areas for you ML model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e14ff67", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%bigquery baseline\n", + "\n", + "WITH\n", + " day_intervals AS (\n", + " SELECT\n", + " customer_id,\n", + " DATE_DIFF(DATE('2011-12-01'), DATE('2011-09-01'), DAY) AS target_days,\n", + " DATE_DIFF(DATE('2011-09-01'), MIN(order_date), DAY) AS feature_days,\n", + " FROM\n", + " `online_retail.online_retail_clv_clean`\n", + " GROUP BY\n", + " customer_id\n", + " ),\n", + " \n", + " predicted_clv AS (\n", + " SELECT\n", + " customer_id,\n", + " AVG(avg_purchase_revenue) * (COUNT(n_purchases) * (1 + SAFE_DIVIDE(COUNT(target_days),COUNT(feature_days)))) AS predicted_monetary_value_3M,\n", + " SUM(target_monetary_value_3M) AS target_monetary_value_3M\n", + " FROM\n", + " `online_retail.online_retail_clv_ml`\n", + " INNER JOIN day_intervals USING(customer_id)\n", + " GROUP BY\n", + " customer_id\n", + " )\n", + "\n", + "SELECT\n", + " *\n", + "FROM\n", + " predicted_clv" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "afda73aa", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "baseline.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a543c10", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ax = baseline.plot(kind='scatter',\n", + " x='predicted_monetary_value_3M', \n", + " y='target_monetary_value_3M',\n", + " title='Actual vs. 
Predicted customer 3-month monetary value',\n", + " figsize=(5,5),\n", + " grid=True)\n", + "\n", + "lims = [\n", + " np.min([ax.get_xlim(), ax.get_ylim()]), # min of both axes\n", + " np.max([ax.get_xlim(), ax.get_ylim()]), # max of both axes\n", + "]\n", + "\n", + "# now plot both limits against eachother\n", + "ax.plot(lims, lims, 'k-', alpha=0.5, zorder=0)\n", + "ax.set_aspect('equal')\n", + "ax.set_xlim(lims)\n", + "ax.set_ylim(lims);" + ] + }, + { + "cell_type": "markdown", + "id": "0d53ad3a", + "metadata": {}, + "source": [ + "## Train a TensorFlow model locally" + ] + }, + { + "cell_type": "markdown", + "id": "b3658b32", + "metadata": {}, + "source": [ + "Now that you have a simple baseline to benchmark your performance against, train a TensorFlow Regressor to predict CLV." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c45e2feb", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%bigquery\n", + "\n", + "SELECT data_split, COUNT(*)\n", + "FROM `online_retail.online_retail_clv_ml`\n", + "GROUP BY data_split" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7e2994a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%bigquery clv\n", + "\n", + "SELECT *\n", + "FROM `online_retail.online_retail_clv_ml`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80339852", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "clv_train = clv.loc[clv.data_split == 'TRAIN', :]\n", + "clv_dev = clv.loc[clv.data_split == 'VALIDATE', :]\n", + "clv_test = clv.loc[clv.data_split == 'TEST', :]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a15e9683", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Model training constants.\n", + "# Virtual epochs design pattern:\n", + "# https://medium.com/google-cloud/ml-design-pattern-3-virtual-epochs-f842296de730\n", + "N_TRAIN_EXAMPLES = 2638\n", + "STOP_POINT = 20.0\n", + "TOTAL_TRAIN_EXAMPLES = int(STOP_POINT * N_TRAIN_EXAMPLES)\n", + "BATCH_SIZE = 32\n", + "N_CHECKPOINTS = 10\n", + "STEPS_PER_EPOCH = (TOTAL_TRAIN_EXAMPLES // (BATCH_SIZE*N_CHECKPOINTS))\n", + "\n", + "NUMERIC_FEATURES = [\n", + " \"n_purchases\",\n", + " \"avg_purchase_size\",\n", + " \"avg_purchase_revenue\",\n", + " \"customer_age\",\n", + " \"days_since_last_purchase\",\n", + "]\n", + "\n", + "LABEL = \"target_monetary_value_3M\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "627cc31a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def df_dataset(df):\n", + " \"\"\"Transform Pandas Dataframe to TensorFlow Dataset.\"\"\"\n", + " return tf.data.Dataset.from_tensor_slices((df[NUMERIC_FEATURES].to_dict('list'), df[LABEL].values))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b0744b6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "trainds = df_dataset(clv_train).prefetch(1).batch(BATCH_SIZE).repeat()\n", + "devds = df_dataset(clv_dev).prefetch(1).batch(BATCH_SIZE)\n", + "testds = df_dataset(clv_test).prefetch(1).batch(BATCH_SIZE)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a9459079", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from keras.metrics import RootMeanSquaredError\n", + "\n", + "def rmse(y_true, y_pred):\n", + " \"\"\"Custom RMSE regression metric.\"\"\"\n", + " return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))\n", + "\n", + "\n", + "def build_model():\n", + " 
\"\"\"Build and compile a TensorFlow Keras Regressor.\"\"\"\n", + "\n", + " # Define input feature tensors and input layers.\n", + " input_layers = {\n", + " feature: tf.keras.layers.Input(name=feature, shape=(1,), dtype=tf.float32) \n", + " for feature in NUMERIC_FEATURES\n", + " }\n", + "\n", + " inputs = tf.keras.layers.concatenate([\n", + " tf.keras.layers.Normalization(axis=-1)(input_layers[feature]) \n", + " for feature in NUMERIC_FEATURES\n", + " ])\n", + "\n", + " d1 = tf.keras.layers.Dense(256, activation=tf.nn.relu, name='d1')(inputs)\n", + " d2 = tf.keras.layers.Dropout(0.2, name='d2')(d1)\n", + "\n", + " # Note: the single neuron output for regression.\n", + " output = tf.keras.layers.Dense(1, name='output')(d2)\n", + "\n", + " model = tf.keras.Model(input_layers, output, name='online-retail-clv')\n", + "\n", + " optimizer = tf.keras.optimizers.Adam(0.001)\n", + "\n", + " # Note: MAE loss is more resistant to outliers than MSE.\n", + " model.compile(loss=tf.keras.losses.MAE,\n", + " optimizer=optimizer,\n", + " metrics=['mae', 'mse', RootMeanSquaredError()])\n", + "\n", + " return model\n", + "\n", + "model = build_model()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8601ff5f", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "tf.keras.utils.plot_model(model, show_shapes=False, rankdir=\"LR\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "354206ee", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "tensorboard_callback = tf.keras.callbacks.TensorBoard(\n", + " log_dir='./local-training/tensorboard',\n", + " histogram_freq=1)\n", + "\n", + "earlystopping_callback = tf.keras.callbacks.EarlyStopping(patience=1)\n", + "\n", + "checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(\n", + " filepath='./local-training/checkpoints/my_model_weights.weights.h5',\n", + " save_weights_only=True,\n", + " monitor='val_loss',\n", + " mode='min')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "730181fb", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "history = model.fit(trainds,\n", + " validation_data=devds,\n", + " steps_per_epoch=STEPS_PER_EPOCH,\n", + " epochs=N_CHECKPOINTS,\n", + " callbacks=[tensorboard_callback,\n", + " earlystopping_callback,\n", + " checkpoint_callback])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2594d084", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "LOSS_COLS = [\"loss\", \"val_loss\"]\n", + "\n", + "pd.DataFrame(history.history)[LOSS_COLS].plot();" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b71775db", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "train_pred = model.predict(df_dataset(clv_train).prefetch(1).batch(BATCH_SIZE))\n", + "dev_pred = model.predict(devds)\n", + "test_pred = model.predict(testds)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b6eceb1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "train_results = pd.DataFrame({'actual': clv_train['target_monetary_value_3M'].to_numpy(), 'predicted': np.squeeze(train_pred)}, columns=['actual', 'predicted'])\n", + "dev_results = pd.DataFrame({'actual': clv_dev['target_monetary_value_3M'].to_numpy(), 'predicted': np.squeeze(dev_pred)}, columns=['actual', 'predicted'])\n", + "test_results = pd.DataFrame({'actual': clv_test['target_monetary_value_3M'].to_numpy(), 'predicted': np.squeeze(test_pred)}, columns=['actual', 
'predicted'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4659dd09", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Model prediction calibration plots.\n", + "fig, (train_ax, dev_ax, test_ax) = plt.subplots(1, 3, figsize=(15,15))\n", + "\n", + "train_results.plot(kind='scatter',\n", + " x='predicted',\n", + " y='actual',\n", + " title='Train: act vs. pred customer 3M monetary value',\n", + " grid=True,\n", + " ax=train_ax)\n", + "\n", + "train_lims = [\n", + " np.min([train_ax.get_xlim(), train_ax.get_ylim()]), # min of both axes\n", + " np.max([train_ax.get_xlim(), train_ax.get_ylim()]), # max of both axes\n", + "]\n", + "\n", + "train_ax.plot(train_lims, train_lims, 'k-', alpha=0.5, zorder=0)\n", + "train_ax.set_aspect('equal')\n", + "train_ax.set_xlim(train_lims)\n", + "train_ax.set_ylim(train_lims)\n", + "\n", + "dev_results.plot(kind='scatter',\n", + " x='predicted',\n", + " y='actual',\n", + " title='Dev: act vs. pred customer 3M monetary value',\n", + " grid=True,\n", + " ax=dev_ax)\n", + "\n", + "dev_lims = [\n", + " np.min([dev_ax.get_xlim(), dev_ax.get_ylim()]), # min of both axes\n", + " np.max([dev_ax.get_xlim(), dev_ax.get_ylim()]), # max of both axes\n", + "]\n", + "\n", + "dev_ax.plot(dev_lims, dev_lims, 'k-', alpha=0.5, zorder=0)\n", + "dev_ax.set_aspect('equal')\n", + "dev_ax.set_xlim(dev_lims)\n", + "dev_ax.set_ylim(dev_lims)\n", + "\n", + "test_results.plot(kind='scatter',\n", + " x='predicted',\n", + " y='actual',\n", + " title='Test: act vs. pred customer 3M monetary value',\n", + " grid=True,\n", + " ax=test_ax)\n", + "\n", + "test_lims = [\n", + " np.min([test_ax.get_xlim(), test_ax.get_ylim()]), # min of both axes\n", + " np.max([test_ax.get_xlim(), test_ax.get_ylim()]), # max of both axes\n", + "]\n", + "\n", + "test_ax.plot(test_lims, test_lims, 'k-', alpha=0.5, zorder=0)\n", + "test_ax.set_aspect('equal')\n", + "test_ax.set_xlim(test_lims)\n", + "test_ax.set_ylim(test_lims);" + ] + }, + { + "cell_type": "markdown", + "id": "2a5f1582", + "metadata": {}, + "source": [ + "You have trained a model better than your baseline. As indicated in the charts above, there is still additional feature engineering and data cleaning opportunities to improve your model's performance on customers with CLV. Some options include handling these customers as a separate prediction task, applying a log transformation to your target, clipping their value or dropping these customers all together to improve model performance.\n" + ] + }, + { + "cell_type": "markdown", + "id": "2fc312cf", + "metadata": {}, + "source": [ + "## Next steps" + ] + }, + { + "cell_type": "markdown", + "id": "30ab0ae3", + "metadata": {}, + "source": [ + "Congratulations! 
In this lab, you walked through a machine learning experimentation workflow using Google Cloud's BigQuery for data storage and analysis and Vertex AI machine learning services to train and deploy a TensorFlow model to predict customer lifetime value" + ] + }, + { + "cell_type": "markdown", + "id": "0749f152", + "metadata": {}, + "source": [ + "## License" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d2cfd56", + "metadata": {}, + "outputs": [], + "source": [ + "# Copyright 2021 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + } + ], + "metadata": { + "environment": { + "kernel": "conda-base-py", + "name": "workbench-notebooks.m125", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125" + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel) (Local)", + "language": "python", + "name": "conda-base-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/lab_exercise_long.ipynb b/self-paced-labs/vertex-ai/train-deploy-tf-model/lab_exercise_long.ipynb new file mode 100644 index 0000000000..c31ce960b6 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/lab_exercise_long.ipynb @@ -0,0 +1,1931 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "81e68768", + "metadata": {}, + "source": [ + "# Vertex AI: Qwik Start" + ] + }, + { + "cell_type": "markdown", + "id": "8f3be9d1", + "metadata": {}, + "source": [ + "## Learning objectives\n", + "\n", + "* Train a TensorFlow model locally in a hosted [**Vertex Notebook**](https://cloud.google.com/vertex-ai/docs/general/notebooks?hl=sv).\n", + "* Create a [**managed Tabular dataset**](https://cloud.google.com/vertex-ai/docs/training/using-managed-datasets?hl=sv) artifact for experiment tracking.\n", + "* Containerize your training code with [**Cloud Build**](https://cloud.google.com/build) and push it to [**Google Cloud Artifact Registry**](https://cloud.google.com/artifact-registry).\n", + "* Run a [**Vertex AI custom training job**](https://cloud.google.com/vertex-ai/docs/training/custom-training) with your custom model container.\n", + "* Use [**Vertex TensorBoard**](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview) to visualize model performance.\n", + "* Deploy your trained model to a [**Vertex Online Prediction Endpoint**](https://cloud.google.com/vertex-ai/docs/predictions/getting-predictions) for serving predictions.\n", + "* Request an online prediction and explanation and see the response." 
+ ] + }, + { + "cell_type": "markdown", + "id": "c7a746be", + "metadata": {}, + "source": [ + "## Introduction: customer lifetime value (CLV) prediction with BigQuery and TensorFlow on Vertex AI" + ] + }, + { + "cell_type": "markdown", + "id": "76bf82e0", + "metadata": {}, + "source": [ + "In this lab, you use [BigQuery](https://cloud.google.com/bigquery) for data processing and exploratory data analysis and the [Vertex AI](https://cloud.google.com/vertex-ai) platform to train and deploy a custom TensorFlow Regressor model to predict customer lifetime value (CLV). The goal of the lab is to introduce to Vertex AI through a high value real world use case - predictive CLV. You start with a local BigQuery and TensorFlow workflow that you may already be familiar with and progress toward training and deploying your model in the cloud with Vertex AI.\n", + "\n", + "![Vertex AI](./images/vertex-ai-overview.png \"Vertex AI Overview\")\n", + "\n", + "Vertex AI is Google Cloud's next generation, unified platform for machine learning development and the successor to AI Platform announced at Google I/O in May 2021. By developing machine learning solutions on Vertex AI, you can leverage the latest ML pre-built components and AutoML to significantly enhance development productivity, the ability to scale your workflow and decision making with your data, and accelerate time to value." + ] + }, + { + "cell_type": "markdown", + "id": "4fe3b8c6", + "metadata": {}, + "source": [ + "### Predictive CLV: how much monetary value existing customers will bring to the business in the future\n", + "\n", + "Predictive CLV is a high impact ML business use case. CLV is a customer's past value plus their predicted future value. The goal of predictive CLV is to predict how much monetary value a user will bring to the business in a defined future time range based on historical transactions.\n", + "\n", + "By knowing CLV, you can develop positive ROI strategies and make decisions about how much money to invest in acquiring new customers and retaining existing ones to grow revenue and profit.\n", + "\n", + "Once your ML model is a success, you can use the results to identify customers more likely to spend money than the others, and make them respond to your offers and discounts with a greater frequency. These customers, with higher lifetime value, are your main marketing target to increase revenue.\n", + "\n", + "By using the machine learning approach to predict your customers' value you will use in this lab, you can prioritize your next actions, such as the following:\n", + "\n", + "* Decide which customers to target with advertising to increase revenue.\n", + "* Identify which customer segments are most profitable and plan how to move customers from one segment to another.\n", + "\n", + "Your task is to predict the future value for existing customers based on their known transaction history. \n", + "\n", + "![CLV](./images/clv-rfm.svg \"Customer Lifetime Value\") \n", + "Source: [Cloud Architecture Center - Predicting Customer Lifetime Value with AI Platform: training the models](https://cloud.google.com/architecture/clv-prediction-with-offline-training-train)\n", + "\n", + "There is a strong positive correlation between the recency, frequency, and amount of money spent on each purchase each customer makes and their CLV. Consequently, you leverage these features to in your ML model. 
For this lab, they are defined as:\n", + "\n", + "* **Recency**: The time between the last purchase and today, represented by the distance between the rightmost circle and the vertical dotted line that's labeled \"Now\".\n", + "* **Frequency**: The time between purchases, represented by the distance between the circles on a single line.\n", + "* **Monetary**: The amount of money spent on each purchase, represented by the size of the circle. This amount could be the average order value or the quantity of products that the customer ordered." + ] + }, + { + "cell_type": "markdown", + "id": "d46a1982", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "markdown", + "id": "dc29eb23", + "metadata": {}, + "source": [ + "### Define constants" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd4c2e53", + "metadata": {}, + "outputs": [], + "source": [ + "# Add installed library dependencies to Python PATH variable.\n", + "PATH=%env PATH\n", + "%env PATH={PATH}:/home/jupyter/.local/bin" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "93ead7a0", + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve and set PROJECT_ID and REGION environment variables.\n", + "PROJECT_ID = !(gcloud config get-value core/project)\n", + "PROJECT_ID = PROJECT_ID[0]\n", + "REGION = 'us-central1' # Replace the region with the region mentioned in your lab manual.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d6d4df6", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a globally unique Google Cloud Storage bucket for artifact storage.\n", + "GCS_BUCKET = f\"{PROJECT_ID}-bucket\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "883ab23c", + "metadata": {}, + "outputs": [], + "source": [ + "!gsutil mb -l $REGION gs://$GCS_BUCKET" + ] + }, + { + "cell_type": "markdown", + "id": "8018cc87", + "metadata": {}, + "source": [ + "### Import libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "412ffc51", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import datetime\n", + "import numpy as np\n", + "import pandas as pd\n", + "import tensorflow as tf\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from google.cloud import aiplatform" + ] + }, + { + "cell_type": "markdown", + "id": "aecf21cb", + "metadata": {}, + "source": [ + "### Initialize the Vertex Python SDK client" + ] + }, + { + "cell_type": "markdown", + "id": "a301853d", + "metadata": {}, + "source": [ + "Import the Vertex SDK for Python into your Python environment and initialize it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae6029df", + "metadata": {}, + "outputs": [], + "source": [ + "aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f\"gs://{GCS_BUCKET}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cf880707", + "metadata": {}, + "source": [ + "## Download and process the lab data into BigQuery" + ] + }, + { + "cell_type": "markdown", + "id": "742ceefd", + "metadata": {}, + "source": [ + "### Dataset\n", + "\n", + "In this lab, you use the publicly available [Online Retail data set](https://archive.ics.uci.edu/ml/datasets/online+retail) from the UCI Machine Learning Repository. This dataset contains 541,909 transnational customer transactions occuring between (YYYY-MM-DD) 2010-12-01 and 2011-12-09 for a UK-based and registered non-store retailer. The company primarily sells unique all-occasion gifts. 
Many of the company's customers are wholesalers.\n", + "\n", + "**Citation** \n", + "Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.\n", + "\n", + "This lab is also inspired by the Google Cloud Architect Guide Series [Predicting Customer Lifetime Value with AI Platform: introduction](https://cloud.google.com/architecture/clv-prediction-with-offline-training-intro)." + ] + }, + { + "cell_type": "markdown", + "id": "9c7d9d01", + "metadata": {}, + "source": [ + "### Data ingestion" + ] + }, + { + "cell_type": "markdown", + "id": "df4efbb9", + "metadata": {}, + "source": [ + "Execute the command below to ingest the lab data from the UCI Machine Learning repository into `Cloud Storage` and then upload to `BigQuery` for data processing. The data ingestion and processing scripts are available under the `utils` folder in the lab directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7720d05e", + "metadata": {}, + "outputs": [], + "source": [ + "# BigQuery constants. Please leave these unchanged.\n", + "BQ_DATASET_NAME=\"online_retail\"\n", + "BQ_RAW_TABLE_NAME=\"online_retail_clv_raw\"\n", + "BQ_CLEAN_TABLE_NAME=\"online_retail_clv_clean\"\n", + "BQ_ML_TABLE_NAME=\"online_retail_clv_ml\"\n", + "BQ_URI=f\"bq://{PROJECT_ID}.{BQ_DATASET_NAME}.{BQ_ML_TABLE_NAME}\"" + ] + }, + { + "cell_type": "markdown", + "id": "557df7b2", + "metadata": {}, + "source": [ + "**Note**: This Python script will take about 2-3 min to download and process the lab data file. Follow along with logging output in the cell below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a42e87bc", + "metadata": {}, + "outputs": [], + "source": [ + "!python utils/data_download.py \\\n", + " --PROJECT_ID={PROJECT_ID} \\\n", + " --GCS_BUCKET={GCS_BUCKET} \\\n", + " --BQ_RAW_TABLE_NAME={BQ_RAW_TABLE_NAME} \\\n", + " --BQ_CLEAN_TABLE_NAME={BQ_CLEAN_TABLE_NAME} \\\n", + " --BQ_ML_TABLE_NAME={BQ_ML_TABLE_NAME} \\\n", + " --URL=\"https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online Retail.xlsx\"" + ] + }, + { + "cell_type": "markdown", + "id": "6ca57a9f", + "metadata": {}, + "source": [ + "### Data processing" + ] + }, + { + "cell_type": "markdown", + "id": "c7293fc2", + "metadata": {}, + "source": [ + "As is the case with many real-world datasets, the lab dataset required some cleanup for you to utilize this historical customer transaction data for predictive CLV.\n", + "\n", + "The following changes were applied:\n", + "\n", + "* Keep only records that have a Customer ID.\n", + "* Aggregate transactions by day from Invoices.\n", + "* Keep only records that have positive order quantities and monetary values.\n", + "* Aggregate transactions by Customer ID and compute recency, frequency, monetary features as well as the prediction target.\n", + "\n", + "**Features**:\n", + "- `customer_country` (CATEGORICAL): customer purchase country.\n", + "- `n_purchases` (NUMERIC): number of purchases made in feature window. (frequency)\n", + "- `avg_purchase_size` (NUMERIC): average unit purchase count in feature window. (monetary)\n", + "- `avg_purchase_revenue` (NUMERIC): average GBP purchase amount in in feature window. (monetary)\n", + "- `customer_age` (NUMERIC): days from first purchase in feature window.\n", + "- `days_since_last_purchase` (NUMERIC): days from the most recent purchase in the feature window. 
(recency) \n", + "\n", + "**Target**: \n", + "- `target_monetary_value_3M` (NUMERIC): customer revenue from the entire study window including feature and prediction windows.\n", + "\n", + "Note: This lab demonstrates a simple way to use a DNN predict customer 3-month ahead CLV monetary value based solely on the available dataset historical transaction history. Additional factors to consider in practice when using CLV to inform interventions include customer acquisition costs, profit margins, and discount rates to arrive at the present value of future customer cash flows. One of a DNN's benefits over traditional probabilistic modeling approaches is their ability to incorporate additional categorical and unstructured features; this is a great feature engineering opportunity to explore beyond this lab which just explores the RFM numeric features." + ] + }, + { + "cell_type": "markdown", + "id": "402abff6", + "metadata": {}, + "source": [ + "## Exploratory data analysis (EDA) in BigQuery" + ] + }, + { + "cell_type": "markdown", + "id": "f4fa4d6c", + "metadata": {}, + "source": [ + "Below you use BigQuery from this notebook to do exploratory data analysis to get to know this dataset and identify opportunities for data cleanup and feature engineering." + ] + }, + { + "cell_type": "markdown", + "id": "91c50cbe", + "metadata": {}, + "source": [ + "### Recency: how recently have customers purchased?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50110392", + "metadata": {}, + "outputs": [], + "source": [ + "%%bigquery recency\n", + "\n", + "SELECT \n", + " days_since_last_purchase\n", + "FROM \n", + " `online_retail.online_retail_clv_ml`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75edeba1", + "metadata": {}, + "outputs": [], + "source": [ + "recency.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89bc69b4", + "metadata": {}, + "outputs": [], + "source": [ + "recency.hist(bins=100);" + ] + }, + { + "cell_type": "markdown", + "id": "e857fb43", + "metadata": {}, + "source": [ + "From the chart, there are clearly a few different customer groups here such as loyal customers that have made purchases in the last few days as well as inactive customers that have not purchased in 250+ days. Using CLV predictions and insights, you can strategize on marketing and promotional interventions to improve customer purchase recency and re-active dormant customers." + ] + }, + { + "cell_type": "markdown", + "id": "1d4d8860", + "metadata": {}, + "source": [ + "### Frequency: how often are customers purchasing?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "34402015", + "metadata": {}, + "outputs": [], + "source": [ + "%%bigquery frequency\n", + "\n", + "SELECT\n", + " n_purchases\n", + "FROM\n", + " `online_retail.online_retail_clv_ml`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc1fd5c2", + "metadata": {}, + "outputs": [], + "source": [ + "frequency.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9cbeac7e", + "metadata": {}, + "outputs": [], + "source": [ + "frequency.hist(bins=100);" + ] + }, + { + "cell_type": "markdown", + "id": "00c933f5", + "metadata": {}, + "source": [ + "From the chart and quantiles, you can see that half of the customers have less than or equal to only 2 purchases. 
You can also tell from the average purchases > median purchases and max purchases of 81 that there are customers, likely wholesalers, who have made significantly more purchases. This should have you already thinking about feature engineering opportunities such as bucketizing purchases and removing or clipping outlier customers. You can also explore alternative modeling strategies for CLV on new customers who have only made 1 purchase as the approach demonstrated in this lab will perform better on customers with more relationship transactional history. " + ] + }, + { + "cell_type": "markdown", + "id": "00c0c043", + "metadata": {}, + "source": [ + "### Monetary: how much are customers spending?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b8d00ea", + "metadata": {}, + "outputs": [], + "source": [ + "%%bigquery monetary\n", + "\n", + "SELECT\n", + " target_monetary_value_3M\n", + "FROM\n", + "`online_retail.online_retail_clv_ml`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "636a5010", + "metadata": {}, + "outputs": [], + "source": [ + "monetary.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08b651c5", + "metadata": {}, + "outputs": [], + "source": [ + "monetary['target_monetary_value_3M'].plot(kind='box', title=\"Target Monetary Value 3M: wide range, long right tail distribution\", grid=True);" + ] + }, + { + "cell_type": "markdown", + "id": "7bc60b98", + "metadata": {}, + "source": [ + "From the chart and summary statistics, you can see there is a wide range in customer monetary value ranging from 2.90 to 268,478 GBP. Looking at the quantiles, it is clear there are a few outlier customers whose monetary value is greater than 3 standard deviations from the mean. With this small dataset, it is recommended to remove these outlier customer values to treat separately, change your model's loss function to be more resistant to outliers, log the target feature, or clip their values to a maximum threshold. You should also be revisiting your CLV business requirements to see if binning customer monetary value and reframing this as a ML classification problem would suit your needs." + ] + }, + { + "cell_type": "markdown", + "id": "02e553fd", + "metadata": {}, + "source": [ + "### Establish a simple model performance baseline" + ] + }, + { + "cell_type": "markdown", + "id": "08221502", + "metadata": {}, + "source": [ + "In order to evaluate the performance of your custom TensorFlow DNN Regressor model you will build in the next steps, it is a ML best practice to establish a simple performance baseline. Below is a simple SQL baseline that multiplies a customer's average purchase spent compounded by their daily purchase rate and computes standard regression metrics." 
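+    ,
+    "\n",
+    "As a quick sanity check, the same metrics can also be recomputed in Python. The minimal sketch below assumes the `baseline` dataframe that is loaded by the `%%bigquery baseline` cell further down, which contains the `predicted_monetary_value_3M` and `target_monetary_value_3M` columns:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "# Recompute MAE and RMSE from the per-customer baseline predictions.\n",
+    "err = baseline[\"predicted_monetary_value_3M\"] - baseline[\"target_monetary_value_3M\"]\n",
+    "mae = np.mean(np.abs(err))\n",
+    "rmse = np.sqrt(np.mean(err ** 2))\n",
+    "print(f\"Baseline MAE: {mae:,.2f} GBP, RMSE: {rmse:,.2f} GBP\")\n",
+    "```"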
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf088864", + "metadata": {}, + "outputs": [], + "source": [ + "%%bigquery\n", + "\n", + "WITH\n", + " day_intervals AS (\n", + " SELECT\n", + " customer_id,\n", + " DATE_DIFF(DATE('2011-12-01'), DATE('2011-09-01'), DAY) AS target_days,\n", + " DATE_DIFF(DATE('2011-09-01'), MIN(order_date), DAY) AS feature_days,\n", + " FROM\n", + " `online_retail.online_retail_clv_clean`\n", + " GROUP BY\n", + " customer_id\n", + " ),\n", + " \n", + " predicted_clv AS (\n", + " SELECT\n", + " customer_id,\n", + " AVG(avg_purchase_revenue) * (COUNT(n_purchases) * (1 + SAFE_DIVIDE(COUNT(target_days),COUNT(feature_days)))) AS predicted_monetary_value_3M,\n", + " SUM(target_monetary_value_3M) AS target_monetary_value_3M\n", + " FROM\n", + " `online_retail.online_retail_clv_ml`\n", + " LEFT JOIN day_intervals USING(customer_id)\n", + " GROUP BY\n", + " customer_id\n", + " )\n", + "\n", + "# Calculate overall baseline regression metrics.\n", + "SELECT\n", + " ROUND(AVG(ABS(predicted_monetary_value_3M - target_monetary_value_3M)), 2) AS MAE,\n", + " ROUND(AVG(POW(predicted_monetary_value_3M - target_monetary_value_3M, 2)), 2) AS MSE,\n", + " ROUND(SQRT(AVG(POW(predicted_monetary_value_3M - target_monetary_value_3M, 2))), 2) AS RMSE\n", + "FROM\n", + " predicted_clv" + ] + }, + { + "cell_type": "markdown", + "id": "956ac010", + "metadata": {}, + "source": [ + "These baseline results provide further support for the strong impact of outliers. The extremely high MSE comes from the exponential penalty applied to missed predictions and the magnitude of error on a few predictions.\n", + "\n", + "Next, you should look to plot the baseline results to get a sense of opportunity areas for you ML model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e14ff67", + "metadata": {}, + "outputs": [], + "source": [ + "%%bigquery baseline\n", + "\n", + "WITH\n", + " day_intervals AS (\n", + " SELECT\n", + " customer_id,\n", + " DATE_DIFF(DATE('2011-12-01'), DATE('2011-09-01'), DAY) AS target_days,\n", + " DATE_DIFF(DATE('2011-09-01'), MIN(order_date), DAY) AS feature_days,\n", + " FROM\n", + " `online_retail.online_retail_clv_clean`\n", + " GROUP BY\n", + " customer_id\n", + " ),\n", + " \n", + " predicted_clv AS (\n", + " SELECT\n", + " customer_id,\n", + " AVG(avg_purchase_revenue) * (COUNT(n_purchases) * (1 + SAFE_DIVIDE(COUNT(target_days),COUNT(feature_days)))) AS predicted_monetary_value_3M,\n", + " SUM(target_monetary_value_3M) AS target_monetary_value_3M\n", + " FROM\n", + " `online_retail.online_retail_clv_ml`\n", + " INNER JOIN day_intervals USING(customer_id)\n", + " GROUP BY\n", + " customer_id\n", + " )\n", + "\n", + "SELECT\n", + " *\n", + "FROM\n", + " predicted_clv" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "afda73aa", + "metadata": {}, + "outputs": [], + "source": [ + "baseline.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a543c10", + "metadata": {}, + "outputs": [], + "source": [ + "ax = baseline.plot(kind='scatter',\n", + " x='predicted_monetary_value_3M', \n", + " y='target_monetary_value_3M',\n", + " title='Actual vs. 
Predicted customer 3-month monetary value',\n", + " figsize=(5,5),\n", + " grid=True)\n", + "\n", + "lims = [\n", + " np.min([ax.get_xlim(), ax.get_ylim()]), # min of both axes\n", + " np.max([ax.get_xlim(), ax.get_ylim()]), # max of both axes\n", + "]\n", + "\n", + "# now plot both limits against eachother\n", + "ax.plot(lims, lims, 'k-', alpha=0.5, zorder=0)\n", + "ax.set_aspect('equal')\n", + "ax.set_xlim(lims)\n", + "ax.set_ylim(lims);" + ] + }, + { + "cell_type": "markdown", + "id": "0d53ad3a", + "metadata": {}, + "source": [ + "## Train a TensorFlow model locally" + ] + }, + { + "cell_type": "markdown", + "id": "b3658b32", + "metadata": {}, + "source": [ + "Now that you have a simple baseline to benchmark your performance against, train a TensorFlow Regressor to predict CLV." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c45e2feb", + "metadata": {}, + "outputs": [], + "source": [ + "%%bigquery\n", + "\n", + "SELECT data_split, COUNT(*)\n", + "FROM `online_retail.online_retail_clv_ml`\n", + "GROUP BY data_split" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7e2994a", + "metadata": {}, + "outputs": [], + "source": [ + "%%bigquery clv\n", + "\n", + "SELECT *\n", + "FROM `online_retail.online_retail_clv_ml`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80339852", + "metadata": {}, + "outputs": [], + "source": [ + "clv_train = clv.loc[clv.data_split == 'TRAIN', :]\n", + "clv_dev = clv.loc[clv.data_split == 'VALIDATE', :]\n", + "clv_test = clv.loc[clv.data_split == 'TEST', :]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a15e9683", + "metadata": {}, + "outputs": [], + "source": [ + "# Model training constants.\n", + "# Virtual epochs design pattern:\n", + "# https://medium.com/google-cloud/ml-design-pattern-3-virtual-epochs-f842296de730\n", + "N_TRAIN_EXAMPLES = 2638\n", + "STOP_POINT = 20.0\n", + "TOTAL_TRAIN_EXAMPLES = int(STOP_POINT * N_TRAIN_EXAMPLES)\n", + "BATCH_SIZE = 32\n", + "N_CHECKPOINTS = 10\n", + "STEPS_PER_EPOCH = (TOTAL_TRAIN_EXAMPLES // (BATCH_SIZE*N_CHECKPOINTS))\n", + "\n", + "NUMERIC_FEATURES = [\n", + " \"n_purchases\",\n", + " \"avg_purchase_size\",\n", + " \"avg_purchase_revenue\",\n", + " \"customer_age\",\n", + " \"days_since_last_purchase\",\n", + "]\n", + "\n", + "LABEL = \"target_monetary_value_3M\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "627cc31a", + "metadata": {}, + "outputs": [], + "source": [ + "def df_dataset(df):\n", + " \"\"\"Transform Pandas Dataframe to TensorFlow Dataset.\"\"\"\n", + " return tf.data.Dataset.from_tensor_slices((df[NUMERIC_FEATURES].to_dict('list'), df[LABEL].values))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b0744b6", + "metadata": {}, + "outputs": [], + "source": [ + "trainds = df_dataset(clv_train).prefetch(1).batch(BATCH_SIZE).repeat()\n", + "devds = df_dataset(clv_dev).prefetch(1).batch(BATCH_SIZE)\n", + "testds = df_dataset(clv_test).prefetch(1).batch(BATCH_SIZE)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a9459079", + "metadata": {}, + "outputs": [], + "source": [ + "def rmse(y_true, y_pred):\n", + " \"\"\"Custom RMSE regression metric.\"\"\"\n", + " return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))\n", + "\n", + "\n", + "def build_model():\n", + " \"\"\"Build and compile a TensorFlow Keras Regressor.\"\"\"\n", + " # Define input feature tensors and input layers.\n", + " feature_columns = [\n", + " 
tf.feature_column.numeric_column(key=feature)\n", + " for feature in NUMERIC_FEATURES\n", + " ]\n", + " \n", + " input_layers = {\n", + " feature.key: tf.keras.layers.Input(name=feature.key, shape=(), dtype=tf.float32)\n", + " for feature in feature_columns\n", + " }\n", + " \n", + " # Keras Functional API: https://keras.io/guides/functional_api\n", + " inputs = tf.keras.layers.DenseFeatures(feature_columns, name='inputs')(input_layers)\n", + " d1 = tf.keras.layers.Dense(256, activation=tf.nn.relu, name='d1')(inputs)\n", + " d2 = tf.keras.layers.Dropout(0.2, name='d2')(d1) \n", + " # Note: the single neuron output for regression.\n", + " output = tf.keras.layers.Dense(1, name='output')(d2)\n", + " \n", + " model = tf.keras.Model(input_layers, output, name='online-retail-clv')\n", + " \n", + " optimizer = tf.keras.optimizers.Adam(0.001) \n", + " \n", + " # Note: MAE loss is more resistant to outliers than MSE.\n", + " model.compile(loss=tf.keras.losses.MAE,\n", + " optimizer=optimizer,\n", + " metrics=[['mae', 'mse', rmse]])\n", + " \n", + " return model\n", + "\n", + "model = build_model()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8601ff5f", + "metadata": {}, + "outputs": [], + "source": [ + "tf.keras.utils.plot_model(model, show_shapes=False, rankdir=\"LR\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "354206ee", + "metadata": {}, + "outputs": [], + "source": [ + "tensorboard_callback = tf.keras.callbacks.TensorBoard(\n", + " log_dir='./local-training/tensorboard',\n", + " histogram_freq=1)\n", + "\n", + "earlystopping_callback = tf.keras.callbacks.EarlyStopping(patience=1)\n", + "\n", + "checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(\n", + " filepath='./local-training/checkpoints',\n", + " save_weights_only=True,\n", + " monitor='val_loss',\n", + " mode='min')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "730181fb", + "metadata": {}, + "outputs": [], + "source": [ + "history = model.fit(trainds,\n", + " validation_data=devds,\n", + " steps_per_epoch=STEPS_PER_EPOCH,\n", + " epochs=N_CHECKPOINTS,\n", + " callbacks=[[tensorboard_callback,\n", + " earlystopping_callback,\n", + " checkpoint_callback]])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2594d084", + "metadata": {}, + "outputs": [], + "source": [ + "LOSS_COLS = [\"loss\", \"val_loss\"]\n", + "\n", + "pd.DataFrame(history.history)[LOSS_COLS].plot();" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b71775db", + "metadata": {}, + "outputs": [], + "source": [ + "train_pred = model.predict(df_dataset(clv_train).prefetch(1).batch(BATCH_SIZE))\n", + "dev_pred = model.predict(devds)\n", + "test_pred = model.predict(testds)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b6eceb1", + "metadata": {}, + "outputs": [], + "source": [ + "train_results = pd.DataFrame({'actual': clv_train['target_monetary_value_3M'].to_numpy(), 'predicted': np.squeeze(train_pred)}, columns=['actual', 'predicted'])\n", + "dev_results = pd.DataFrame({'actual': clv_dev['target_monetary_value_3M'].to_numpy(), 'predicted': np.squeeze(dev_pred)}, columns=['actual', 'predicted'])\n", + "test_results = pd.DataFrame({'actual': clv_test['target_monetary_value_3M'].to_numpy(), 'predicted': np.squeeze(test_pred)}, columns=['actual', 'predicted'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4659dd09", + "metadata": {}, + "outputs": [], + "source": [ + "# Model prediction 
calibration plots.\n", + "fig, (train_ax, dev_ax, test_ax) = plt.subplots(1, 3, figsize=(15,15))\n", + "\n", + "train_results.plot(kind='scatter',\n", + " x='predicted',\n", + " y='actual',\n", + " title='Train: act vs. pred customer 3M monetary value',\n", + " grid=True,\n", + " ax=train_ax)\n", + "\n", + "train_lims = [\n", + " np.min([train_ax.get_xlim(), train_ax.get_ylim()]), # min of both axes\n", + " np.max([train_ax.get_xlim(), train_ax.get_ylim()]), # max of both axes\n", + "]\n", + "\n", + "train_ax.plot(train_lims, train_lims, 'k-', alpha=0.5, zorder=0)\n", + "train_ax.set_aspect('equal')\n", + "train_ax.set_xlim(train_lims)\n", + "train_ax.set_ylim(train_lims)\n", + "\n", + "dev_results.plot(kind='scatter',\n", + " x='predicted',\n", + " y='actual',\n", + " title='Dev: act vs. pred customer 3M monetary value',\n", + " grid=True,\n", + " ax=dev_ax)\n", + "\n", + "dev_lims = [\n", + " np.min([dev_ax.get_xlim(), dev_ax.get_ylim()]), # min of both axes\n", + " np.max([dev_ax.get_xlim(), dev_ax.get_ylim()]), # max of both axes\n", + "]\n", + "\n", + "dev_ax.plot(dev_lims, dev_lims, 'k-', alpha=0.5, zorder=0)\n", + "dev_ax.set_aspect('equal')\n", + "dev_ax.set_xlim(dev_lims)\n", + "dev_ax.set_ylim(dev_lims)\n", + "\n", + "test_results.plot(kind='scatter',\n", + " x='predicted',\n", + " y='actual',\n", + " title='Test: act vs. pred customer 3M monetary value',\n", + " grid=True,\n", + " ax=test_ax)\n", + "\n", + "test_lims = [\n", + " np.min([test_ax.get_xlim(), test_ax.get_ylim()]), # min of both axes\n", + " np.max([test_ax.get_xlim(), test_ax.get_ylim()]), # max of both axes\n", + "]\n", + "\n", + "test_ax.plot(test_lims, test_lims, 'k-', alpha=0.5, zorder=0)\n", + "test_ax.set_aspect('equal')\n", + "test_ax.set_xlim(test_lims)\n", + "test_ax.set_ylim(test_lims);" + ] + }, + { + "cell_type": "markdown", + "id": "2a5f1582", + "metadata": {}, + "source": [ + "You have trained a model better than your baseline. As indicated in the charts above, there is still additional feature engineering and data cleaning opportunities to improve your model's performance on customers with CLV. Some options include handling these customers as a separate prediction task, applying a log transformation to your target, clipping their value or dropping these customers all together to improve model performance.\n", + "\n", + "Now, you work through taking this local TensorFlow workflow to the cloud with Vertex AI." + ] + }, + { + "cell_type": "markdown", + "id": "24bb7c43", + "metadata": {}, + "source": [ + "## Create a managed Tabular dataset from your BigQuery data source" + ] + }, + { + "cell_type": "markdown", + "id": "f8383baa", + "metadata": {}, + "source": [ + "[**Vertex AI managed datasets**](https://cloud.google.com/vertex-ai/docs/datasets/prepare-tabular) can be used to train AutoML models or custom-trained models.\n", + "\n", + "You create a [**Tabular regression dataset**](https://cloud.google.com/vertex-ai/docs/datasets/bp-tabular) for managing the sharing and metadata for this lab's dataset stored in BigQuery. Managed datasets enable you to create a clear link between your data and custom-trained models, and provide descriptive statistics and automatic or manual splitting into train, test, and validation sets. \n", + "\n", + "In this lab, the data processing step already created a manual `data_split` column in your BQ ML table using [BigQuery's hashing functions](https://towardsdatascience.com/ml-design-pattern-5-repeatable-sampling-c0ccb2889f39) for repeatable sampling." 
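+    "\n",
+    "Note: managed dataset display names are not required to be unique, so re-running the creation cell below makes a new dataset each time. If you want to avoid duplicates, you could first check for an existing one; a quick sketch (assuming the SDK's `list()` filter syntax):\n",
+    "\n",
+    "```python\n",
+    "matches = aiplatform.TabularDataset.list(filter='display_name=\"online-retail-clv\"')\n",
+    "if matches:\n",
+    "    print(\"Reusing existing managed dataset:\", matches[0].resource_name)\n",
+    "```"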
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "964c1eb3", + "metadata": {}, + "outputs": [], + "source": [ + "tabular_dataset = aiplatform.TabularDataset.create(display_name=\"online-retail-clv\", bq_source=f\"{BQ_URI}\")" + ] + }, + { + "cell_type": "markdown", + "id": "420b6fd9", + "metadata": {}, + "source": [ + "## Vertex AI custom ML model training workflow" + ] + }, + { + "cell_type": "markdown", + "id": "c3806a39", + "metadata": {}, + "source": [ + "Before you submit a custom training job, hyperparameter tuning job, or training pipeline to Vertex AI, you need to create a Python training application or a custom container that defines the training code and dependencies you want to run on Vertex AI.\n", + "\n", + "There are two ways you can train a custom model on Vertex AI:\n", + "\n", + "**1. Use a Google Cloud prebuilt container**: If you use a Vertex AI prebuilt container, you write a Python `task.py` script or a Python package that defines your custom model training code and is installed into the container image. See [Creating a Python training application for a pre-built container](https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container) for more details on how to structure your Python code. Choose this option if a prebuilt container already contains the model training libraries you need, such as `tensorflow` or `xgboost`, and you simply want to get training and prediction running quickly. You can also specify additional Python dependencies to install through the `requirements` argument of `CustomTrainingJob`.\n", + "\n", + "**2. Use your own custom container image**: If you want to use your own custom container, you write your Python training scripts and a Dockerfile that specifies your ML model code, its dependencies, and how to execute it. You build your custom container with Cloud Build, whose build steps are specified in `cloudbuild.yaml`, and publish your container to Artifact Registry. Choose this option if you want to package your ML model code and its dependencies together in a container and build toward running it as part of a portable and scalable [Vertex Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) workflow. " + ] + }, + { + "cell_type": "markdown", + "id": "2e42f26a", + "metadata": {}, + "source": [ + "### Containerize your model training code" + ] + }, + { + "cell_type": "markdown", + "id": "6b99d903", + "metadata": {}, + "source": [ + "In the next 5 steps, you proceed with **2. Use your own custom container image**. \n", + "\n", + "You build your custom model container on top of a [Google Cloud Deep Learning container](https://cloud.google.com/vertex-ai/docs/general/deep-learning) that contains tested and optimized versions of model code dependencies such as `tensorflow` and the `google-cloud-bigquery` SDK. This also gives you flexibility and enables you to manage and share your model container image with others for reuse and reproducibility across environments, while also enabling you to incorporate additional packages for your ML application. 
Lastly, by packaging your ML model code together with dependencies you also have a MLOps onboarding path to Vertex Pipelines.\n", + "\n", + "You walk through creating the following project structure for your ML mode code:\n", + "\n", + "```\n", + "|--/online-retail-clv-3M\n", + " |--/trainer\n", + " |--__init__.py\n", + " |--model.py\n", + " |--task.py\n", + " |--Dockerfile\n", + " |--cloudbuild.yaml\n", + " |--requirements.txt\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "2db0ba26", + "metadata": {}, + "source": [ + "#### 1. Write a `model.py` training script" + ] + }, + { + "cell_type": "markdown", + "id": "cb5a08e3", + "metadata": {}, + "source": [ + "First, you take tidy up your local TensorFlow model training code from above into a training script.\n", + "\n", + "The biggest change is you utilize the [TensorFlow IO](https://www.tensorflow.org/io/tutorials/bigquery) library to performantly read from BigQuery directly into your TensorFlow model graph during training. This improves your training performance rather than performing the intermediate step of reading from BigQuery into a Pandas Dataframe done for expediency above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0cae846", + "metadata": {}, + "outputs": [], + "source": [ + "# this is the name of your model subdirectory you will write your model code to. It is already created in your lab directory.\n", + "MODEL_NAME=\"online-retail-clv-3M\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dbe19974", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile {MODEL_NAME}/trainer/model.py\n", + "import os\n", + "import logging\n", + "import tempfile\n", + "import tensorflow as tf\n", + "from explainable_ai_sdk.metadata.tf.v2 import SavedModelMetadataBuilder\n", + "from tensorflow.python.framework import dtypes\n", + "from tensorflow_io.bigquery import BigQueryClient\n", + "from tensorflow_io.bigquery import BigQueryReadSession\n", + "\n", + "\n", + "# Model feature constants.\n", + "NUMERIC_FEATURES = [\n", + " \"n_purchases\",\n", + " \"avg_purchase_size\",\n", + " \"avg_purchase_revenue\",\n", + " \"customer_age\",\n", + " \"days_since_last_purchase\",\n", + "]\n", + "\n", + "CATEGORICAL_FEATURES = [\n", + " \"customer_country\"\n", + "]\n", + "\n", + "LABEL = \"target_monetary_value_3M\"\n", + "\n", + "\n", + "def caip_uri_to_fields(uri):\n", + " \"\"\"Helper function to parse BQ URI.\"\"\"\n", + " # Remove bq:// prefix.\n", + " uri = uri[5:]\n", + " project, dataset, table = uri.split('.')\n", + " return project, dataset, table\n", + "\n", + "\n", + "def features_and_labels(row_data):\n", + " \"\"\"Helper feature and label mapping function for tf.data.\"\"\"\n", + " label = row_data.pop(LABEL)\n", + " features = row_data\n", + " return features, label\n", + "\n", + "\n", + "def read_bigquery(project, dataset, table):\n", + " \"\"\"TensorFlow IO BigQuery Reader.\"\"\"\n", + " tensorflow_io_bigquery_client = BigQueryClient()\n", + " read_session = tensorflow_io_bigquery_client.read_session(\n", + " parent=\"projects/\" + project,\n", + " project_id=project, \n", + " dataset_id=dataset,\n", + " table_id=table,\n", + " # Pass list of features and label to be selected from BQ.\n", + " selected_fields=NUMERIC_FEATURES + [LABEL],\n", + " # Provide output TensorFlow data types for features and label.\n", + " output_types=[dtypes.int64, dtypes.float64, dtypes.float64, dtypes.int64, dtypes.int64] + [dtypes.float64],\n", + " requested_streams=2)\n", + " dataset = 
read_session.parallel_read_rows()\n", + " transformed_ds = dataset.map(features_and_labels)\n", + " return transformed_ds\n", + "\n", + "\n", + "def rmse(y_true, y_pred):\n", + " \"\"\"Custom RMSE regression metric.\"\"\"\n", + " return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))\n", + "\n", + "\n", + "def build_model(hparams):\n", + " \"\"\"Build and compile a TensorFlow Keras DNN Regressor.\"\"\"\n", + "\n", + " feature_columns = [\n", + " tf.feature_column.numeric_column(key=feature)\n", + " for feature in NUMERIC_FEATURES\n", + " ]\n", + " \n", + " input_layers = {\n", + " feature.key: tf.keras.layers.Input(name=feature.key, shape=(), dtype=tf.float32)\n", + " for feature in feature_columns\n", + " }\n", + " # Keras Functional API: https://keras.io/guides/functional_api\n", + " inputs = tf.keras.layers.DenseFeatures(feature_columns, name='inputs')(input_layers)\n", + " d1 = tf.keras.layers.Dense(256, activation=tf.nn.relu, name='d1')(inputs)\n", + " d2 = tf.keras.layers.Dropout(hparams['dropout'], name='d2')(d1) \n", + " # Note: a single neuron scalar output for regression.\n", + " output = tf.keras.layers.Dense(1, name='output')(d2)\n", + " \n", + " model = tf.keras.Model(input_layers, output, name='online-retail-clv')\n", + " \n", + " optimizer = tf.keras.optimizers.Adam(hparams['learning-rate']) \n", + " \n", + " # Note: MAE loss is more resistant to outliers than MSE.\n", + " model.compile(loss=tf.keras.losses.MAE,\n", + " optimizer=optimizer,\n", + " metrics=[['mae', 'mse', rmse]])\n", + " \n", + " return model\n", + "\n", + "\n", + "def train_evaluate_explain_model(hparams):\n", + " \"\"\"Train, evaluate, explain TensorFlow Keras DNN Regressor.\n", + " Args:\n", + " hparams(dict): A dictionary containing model training arguments.\n", + " Returns:\n", + " history(tf.keras.callbacks.History): Keras callback that records training event history.\n", + " \"\"\"\n", + " training_ds = read_bigquery(*caip_uri_to_fields(hparams['training-data-uri'])).prefetch(1).shuffle(hparams['batch-size']*10).batch(hparams['batch-size']).repeat()\n", + " eval_ds = read_bigquery(*caip_uri_to_fields(hparams['validation-data-uri'])).prefetch(1).shuffle(hparams['batch-size']*10).batch(hparams['batch-size'])\n", + " test_ds = read_bigquery(*caip_uri_to_fields(hparams['test-data-uri'])).prefetch(1).shuffle(hparams['batch-size']*10).batch(hparams['batch-size'])\n", + " \n", + " model = build_model(hparams)\n", + " logging.info(model.summary())\n", + " \n", + " tensorboard_callback = tf.keras.callbacks.TensorBoard(\n", + " log_dir=hparams['tensorboard-dir'],\n", + " histogram_freq=1)\n", + " \n", + " # Reduce overfitting and shorten training times.\n", + " earlystopping_callback = tf.keras.callbacks.EarlyStopping(patience=2)\n", + " \n", + " # Ensure your training job's resilience to VM restarts.\n", + " checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(\n", + " filepath= hparams['checkpoint-dir'],\n", + " save_weights_only=True,\n", + " monitor='val_loss',\n", + " mode='min')\n", + " \n", + " # Virtual epochs design pattern:\n", + " # https://medium.com/google-cloud/ml-design-pattern-3-virtual-epochs-f842296de730\n", + " TOTAL_TRAIN_EXAMPLES = int(hparams['stop-point'] * hparams['n-train-examples'])\n", + " STEPS_PER_EPOCH = (TOTAL_TRAIN_EXAMPLES // (hparams['batch-size']*hparams['n-checkpoints'])) \n", + " \n", + " history = model.fit(training_ds,\n", + " validation_data=eval_ds,\n", + " steps_per_epoch=STEPS_PER_EPOCH,\n", + " epochs=hparams['n-checkpoints'],\n", + " 
callbacks=[[tensorboard_callback,\n", + " earlystopping_callback,\n", + " checkpoint_callback]])\n", + " \n", + " logging.info(model.evaluate(test_ds))\n", + " \n", + " # Create a temp directory to save intermediate TF SavedModel prior to Explainable metadata creation.\n", + " tmpdir = tempfile.mkdtemp()\n", + " \n", + " # Export Keras model in TensorFlow SavedModel format.\n", + " model.save(tmpdir)\n", + " \n", + " # Annotate and save TensorFlow SavedModel with Explainable metadata to GCS.\n", + " builder = SavedModelMetadataBuilder(tmpdir)\n", + " builder.save_model_with_metadata(hparams['model-dir'])\n", + " \n", + " return history" + ] + }, + { + "cell_type": "markdown", + "id": "c10121ec", + "metadata": {}, + "source": [ + "#### 2. Write a `task.py` file as an entrypoint to your custom ML model container" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6d4d6add", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile {MODEL_NAME}/trainer/task.py\n", + "import os\n", + "import argparse\n", + "\n", + "from trainer import model\n", + "\n", + "if __name__ == '__main__':\n", + " parser = argparse.ArgumentParser()\n", + " # Vertex custom container training args. These are set by Vertex AI during training but can also be overwritten.\n", + " parser.add_argument('--model-dir', dest='model-dir',\n", + " default=os.environ['AIP_MODEL_DIR'], type=str, help='Model dir.')\n", + " parser.add_argument('--checkpoint-dir', dest='checkpoint-dir',\n", + " default=os.environ['AIP_CHECKPOINT_DIR'], type=str, help='Checkpoint dir set during Vertex AI training.') \n", + " parser.add_argument('--tensorboard-dir', dest='tensorboard-dir',\n", + " default=os.environ['AIP_TENSORBOARD_LOG_DIR'], type=str, help='Tensorboard dir set during Vertex AI training.') \n", + " parser.add_argument('--data-format', dest='data-format',\n", + " default=os.environ['AIP_DATA_FORMAT'], type=str, help=\"Tabular data format set during Vertex AI training. 
E.g.'csv', 'bigquery'\")\n", + " parser.add_argument('--training-data-uri', dest='training-data-uri',\n", + " default=os.environ['AIP_TRAINING_DATA_URI'], type=str, help='Training data GCS or BQ URI set during Vertex AI training.')\n", + " parser.add_argument('--validation-data-uri', dest='validation-data-uri',\n", + " default=os.environ['AIP_VALIDATION_DATA_URI'], type=str, help='Validation data GCS or BQ URI set during Vertex AI training.')\n", + " parser.add_argument('--test-data-uri', dest='test-data-uri',\n", + " default=os.environ['AIP_TEST_DATA_URI'], type=str, help='Test data GCS or BQ URI set during Vertex AI training.')\n", + " # Model training args.\n", + " parser.add_argument('--learning-rate', dest='learning-rate', default=0.001, type=float, help='Learning rate for optimizer.')\n", + " parser.add_argument('--dropout', dest='dropout', default=0.2, type=float, help='Float percentage of DNN nodes [0,1] to drop for regularization.') \n", + " parser.add_argument('--batch-size', dest='batch-size', default=16, type=int, help='Number of examples during each training iteration.') \n", + " parser.add_argument('--n-train-examples', dest='n-train-examples', default=2638, type=int, help='Number of examples to train on.')\n", + " parser.add_argument('--stop-point', dest='stop-point', default=10, type=int, help='Number of passes through the dataset during training to achieve convergence.')\n", + " parser.add_argument('--n-checkpoints', dest='n-checkpoints', default=10, type=int, help='Number of model checkpoints to save during training.')\n", + " \n", + " args = parser.parse_args()\n", + " hparams = args.__dict__\n", + "\n", + " model.train_evaluate_explain_model(hparams)" + ] + }, + { + "cell_type": "markdown", + "id": "18058766", + "metadata": {}, + "source": [ + "#### 3. Write a `Dockerfile` for your custom ML model container" + ] + }, + { + "cell_type": "markdown", + "id": "987cc52a", + "metadata": {}, + "source": [ + "Third, you write a `Dockerfile` that contains your model code as well as specifies your model code's dependencies.\n", + "\n", + "Notice the base image below is a [Google Cloud Deep Learning container](https://cloud.google.com/vertex-ai/docs/general/deep-learning) that contains tested and optimized versions of model code dependencies such as `tensorflow` and the `google-cloud-bigquery` SDK." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28ea8f68", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile {MODEL_NAME}/Dockerfile\n", + "# Specifies base image and tag.\n", + "# https://cloud.google.com/vertex-ai/docs/general/deep-learning\n", + "# https://cloud.google.com/deep-learning-containers/docs/choosing-container\n", + "FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3\n", + "\n", + "# Sets the container working directory.\n", + "WORKDIR /root\n", + "\n", + "# Copies the requirements.txt into the container to reduce network calls.\n", + "COPY requirements.txt .\n", + "# Installs additional packages.\n", + "RUN pip3 install -U -r requirements.txt\n", + "\n", + "# Copies the trainer code to the docker image.\n", + "COPY . /trainer\n", + "\n", + "# Sets the container working directory.\n", + "WORKDIR /trainer\n", + "\n", + "# Sets up the entry point to invoke the trainer.\n", + "ENTRYPOINT [\"python\", \"-m\", \"trainer.task\"]" + ] + }, + { + "cell_type": "markdown", + "id": "f2db8aea", + "metadata": {}, + "source": [ + "### 4. 
Write a `requirements.txt` file to specify additional ML code dependencies" + ] + }, + { + "cell_type": "markdown", + "id": "f13b99fb", + "metadata": {}, + "source": [ + "These are additional dependencies for your model code outside the deep learning containers needed for prediction explainability and the BigQuery TensorFlow IO reader." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "06998a4e", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile {MODEL_NAME}/requirements.txt\n", + "explainable-ai-sdk==1.3.0\n", + "tensorflow-io==0.15.0\n", + "pyarrow" + ] + }, + { + "cell_type": "markdown", + "id": "5214db92", + "metadata": {}, + "source": [ + "#### 5. Use Cloud Build to build and submit your container to Google Cloud Artifact Registry" + ] + }, + { + "cell_type": "markdown", + "id": "25ff06d2", + "metadata": {}, + "source": [ + "Next, you use [Cloud Build](https://cloud.google.com/build) to build and upload your custom TensorFlow model container to [Google Cloud Artifact Registry](https://cloud.google.com/artifact-registry).\n", + "\n", + "Cloud Build brings reusability and automation to your ML experimentation by enabling you to reliably build, test, and deploy your ML model code as part of a CI/CD workflow. Artifact Registry provides a centralized repository for you to store, manage, and secure your ML container images. This allows you to securely share your ML work with others and reproduce experiment results.\n", + "\n", + "**Note**: The initial build and submit step will take about 20 minutes but Cloud Build is able to take advantage of caching for subsequent builds." + ] + }, + { + "cell_type": "markdown", + "id": "65a8c7f1", + "metadata": {}, + "source": [ + "#### Create Artifact Repository for custom container images" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8984969", + "metadata": {}, + "outputs": [], + "source": [ + "ARTIFACT_REPOSITORY=\"online-retail-clv\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff4c1484", + "metadata": {}, + "outputs": [], + "source": [ + "# Create an Artifact Repository using the gcloud CLI.\n", + "!gcloud artifacts repositories create $ARTIFACT_REPOSITORY \\\n", + "--repository-format=docker \\\n", + "--location=$REGION \\\n", + "--description=\"Artifact registry for ML custom training images for predictive CLV\"" + ] + }, + { + "cell_type": "markdown", + "id": "b8703d94", + "metadata": {}, + "source": [ + "#### Create `cloudbuild.yaml` instructions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "efe17ff9", + "metadata": {}, + "outputs": [], + "source": [ + "IMAGE_NAME=\"dnn-regressor\"\n", + "IMAGE_TAG=\"latest\"\n", + "IMAGE_URI=f\"{REGION}-docker.pkg.dev/{PROJECT_ID}/{ARTIFACT_REPOSITORY}/{IMAGE_NAME}:{IMAGE_TAG}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c834b5a9", + "metadata": {}, + "outputs": [], + "source": [ + "cloudbuild_yaml = f\"\"\"steps:\n", + "- name: 'gcr.io/cloud-builders/docker'\n", + " args: [ 'build', '-t', '{IMAGE_URI}', '.' 
]\n", + "images: \n", + "- '{IMAGE_URI}'\"\"\"\n", + "\n", + "with open(f\"{MODEL_NAME}/cloudbuild.yaml\", \"w\") as fp:\n", + " fp.write(cloudbuild_yaml)" + ] + }, + { + "cell_type": "markdown", + "id": "b590f66b", + "metadata": {}, + "source": [ + "#### Build and submit your container image to your Artifact Repository" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9361461", + "metadata": {}, + "outputs": [], + "source": [ + "!gcloud builds submit --timeout=20m --config {MODEL_NAME}/cloudbuild.yaml {MODEL_NAME}" + ] + }, + { + "cell_type": "markdown", + "id": "4efcc053", + "metadata": {}, + "source": [ + "Now that your custom container is built and stored in your Artifact Registry, its time to train your model in the cloud with Vertex AI." + ] + }, + { + "cell_type": "markdown", + "id": "ea2cdc6f", + "metadata": {}, + "source": [ + "## Run a custom training job on Vertex AI" + ] + }, + { + "cell_type": "markdown", + "id": "c77ba8b0", + "metadata": {}, + "source": [ + "### 1. Create a Vertex Tensorboard instance for tracking your model experiments" + ] + }, + { + "cell_type": "markdown", + "id": "f82f8bbb", + "metadata": {}, + "source": [ + "[**Vertex TensorBoard**](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview) is Google Cloud's managed version of open-source [**TensorBoard**](https://www.tensorflow.org/tensorboard) for ML experimental visualization. With Vertex TensorBoard you can track, visualize, and compare ML experiments and share them with your team. In addition to the powerful visualizations from open source TensorBoard, Vertex TensorBoard provides:\n", + "\n", + "* A persistent, shareable link to your experiment's dashboard.\n", + "* A searchable list of all experiments in a project.\n", + "* Integrations with Vertex AI services for model training evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec1755a1", + "metadata": {}, + "outputs": [], + "source": [ + "!gcloud beta ai tensorboards create \\\n", + "--display-name=$MODEL_NAME --region=$REGION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aeac53ba", + "metadata": {}, + "outputs": [], + "source": [ + "TENSORBOARD_RESOURCE_NAME= !(gcloud beta ai tensorboards list --region=$REGION --format=\"value(name)\")\n", + "TENSORBOARD_RESOURCE_NAME= TENSORBOARD_RESOURCE_NAME[1]\n", + "TENSORBOARD_RESOURCE_NAME" + ] + }, + { + "cell_type": "markdown", + "id": "9ad5abad", + "metadata": {}, + "source": [ + "### 2. Run your custom container training job" + ] + }, + { + "cell_type": "markdown", + "id": "a92fe321", + "metadata": {}, + "source": [ + "Use the `CustomTrainingJob` class to define the job, which takes the following parameters specific to custom container training:\n", + "\n", + "* `display_name`: You user-defined name of this training pipeline.\n", + "* `container_uri`: The URI of your custom training container image.\n", + "* `model_serving_container_image_uri`: The URI of a container that can serve predictions for your model. 
You use a Vertex prebuilt container.\n", + "\n", + "Use the `run()` function to start training, which takes the following parameters:\n", + "\n", + "* `replica_count`: The number of worker replicas.\n", + "* `model_display_name`: The display name of the Model if the script produces a managed Model.\n", + "* `machine_type`: The type of machine to use for training.\n", + "* `bigquery_destination`: The BigQuery URI where your created Tabular dataset gets written to.\n", + "* `predefined_split_column_name`: Since this lab leveraged BigQuery for data processing and splitting, this column is specified to indicate data splits.\n", + "\n", + "The run function creates a training pipeline that trains and creates a Vertex `Model` object. After the training pipeline completes, the `run()` function returns the `Model` object.\n", + "\n", + "Note: This `CustomContainerTrainingJob` will take about 20 minutes to provision resources and train your model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e88b63a8", + "metadata": {}, + "outputs": [], + "source": [ + "# command line args for trainer.task defined above. Review the 'help' argument for a description.\n", + "# You will set the model training args below. Vertex AI will set the environment variables for training URIs.\n", + "CMD_ARGS= [\n", + " \"--learning-rate=\" + str(0.001),\n", + " \"--batch-size=\" + str(16),\n", + " \"--n-train-examples=\" + str(2638),\n", + " \"--stop-point=\" + str(10),\n", + " \"--n-checkpoints=\" + str(10),\n", + " \"--dropout=\" + str(0.2), \n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be63e362", + "metadata": {}, + "outputs": [], + "source": [ + "# By setting BASE_OUTPUT_DIR, Vertex AI will set the environment variables AIP_MODEL_DIR, AIP_CHECKPOINT_DIR, AIP_TENSORBOARD_LOG_DIR\n", + "# during training for your ML training code to write to.\n", + "TIMESTAMP=datetime.datetime.now().strftime('%Y%m%d%H%M%S')\n", + "BASE_OUTPUT_DIR= f\"gs://{GCS_BUCKET}/vertex-custom-training-{MODEL_NAME}-{TIMESTAMP}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0301c683", + "metadata": {}, + "outputs": [], + "source": [ + "job = aiplatform.CustomContainerTrainingJob(\n", + " display_name=\"online-retail-clv-3M-dnn-regressor\",\n", + " container_uri=IMAGE_URI,\n", + " # https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers\n", + " # gcr.io/cloud-aiplatform/prediction/tf2-cpu.2-3:latest\n", + " model_serving_container_image_uri=\"us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-3:latest\",\n", + ")\n", + "\n", + "model = job.run(\n", + " dataset=tabular_dataset,\n", + " model_display_name=MODEL_NAME,\n", + " # GCS custom job output dir.\n", + " base_output_dir=BASE_OUTPUT_DIR,\n", + " # the BQ Tabular dataset splits will be written out to their own BQ dataset for reproducibility.\n", + " bigquery_destination=f\"bq://{PROJECT_ID}\",\n", + " # this corresponds to the BigQuery data split column.\n", + " predefined_split_column_name=\"data_split\",\n", + " # the model training command line arguments defined in trainer.task.\n", + " args=CMD_ARGS,\n", + " # Custom job WorkerPool arguments.\n", + " replica_count=1,\n", + " machine_type=\"e2-standard-4\",\n", + " # Provide your Tensorboard resource name to write Tensorboard logs during training.\n", + " tensorboard=TENSORBOARD_RESOURCE_NAME,\n", + " # Provide your Vertex custom training service account created during lab setup.\n", + " 
service_account=f\"vertex-custom-training-sa@{PROJECT_ID}.iam.gserviceaccount.com\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "932c4086", + "metadata": {}, + "source": [ + "### 3. Inspect model training performance with Vertex TensorBoard" + ] + }, + { + "cell_type": "markdown", + "id": "daa6b127", + "metadata": {}, + "source": [ + "You can view your model's logs on the Vertex AI [**Experiments tab**](https://console.cloud.google.com/vertex-ai/experiments) in the Cloud Console. Click the **Open Tensorboard** link. You will be asked to authenticate with your Qwiklabs Google account before a Vertex Tensorboard page opens in a browser tab. Once your model begins training, you will see your training evaluation metrics written to this dashboard that you can inspect during the training run as well as after the job completes.\n", + "\n", + "Note: Tensorboard provides a valuable debugging tool for inspecting your model's performance both during and after model training. This lab's model trains in less than a minute and sometimes completes before the logs finish appearing in Tensorboard. If that's the case, refresh the window when the training job completes to see your model's performance evaluation." + ] + }, + { + "cell_type": "markdown", + "id": "28cfdf8e", + "metadata": {}, + "source": [ + "## Serve your model with Vertex AI Prediction: online model predictions and explanations" + ] + }, + { + "cell_type": "markdown", + "id": "0d343de7", + "metadata": {}, + "source": [ + "You have a trained model in GCS now, lets transition to serving your model with Vertex AI Prediction for online model predictions and explanations." + ] + }, + { + "cell_type": "markdown", + "id": "ce14ddf3", + "metadata": {}, + "source": [ + "### 1. Build the Explanation Metadata and Parameters" + ] + }, + { + "cell_type": "markdown", + "id": "02719fa3", + "metadata": {}, + "source": [ + "[**Vertex Explainable AI**](https://cloud.google.com/vertex-ai/docs/explainable-ai) integrates feature attributions into Vertex AI. Vertex Explainable AI helps you understand your model's outputs for classification and regression tasks. Vertex AI tells you how much each feature in the data contributed to the predicted result. You can then use this information to verify that the model is behaving as expected, identify and mitigate biases in your models, and get ideas for ways to improve your model and your training data.\n", + "\n", + "You retrieve these feature attributions to gain insight into your model's CLV predictions." 
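+    "\n",
+    "Once you retrieve an explanation later in this section, a useful sanity check is that the per-feature attributions plus the model's baseline output approximately reconstruct the predicted value. A sketch of that check (an assumption about the attribution fields returned by `endpoint.explain()` further below):\n",
+    "\n",
+    "```python\n",
+    "attribution = explanations.explanations[0].attributions[0]\n",
+    "reconstructed = attribution.baseline_output_value + sum(dict(attribution.feature_attributions).values())\n",
+    "print(round(reconstructed, 2), \"vs. predicted\", round(attribution.instance_output_value, 2))\n",
+    "```"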
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba8decb7", + "metadata": {}, + "outputs": [], + "source": [ + "DEPLOYED_MODEL_DIR = os.path.join(BASE_OUTPUT_DIR, 'model')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48faadfe", + "metadata": {}, + "outputs": [], + "source": [ + "loaded = tf.keras.models.load_model(DEPLOYED_MODEL_DIR)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f10451af", + "metadata": {}, + "outputs": [], + "source": [ + "serving_input = list(\n", + " loaded.signatures[\"serving_default\"].structured_input_signature[1].keys())[0]\n", + "\n", + "serving_output = list(loaded.signatures[\"serving_default\"].structured_outputs.keys())[0]\n", + "\n", + "feature_names = [\n", + " \"n_purchases\",\n", + " \"avg_purchase_size\",\n", + " \"avg_purchase_revenue\",\n", + " \"customer_age\",\n", + " \"days_since_last_purchase\"\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba63105f", + "metadata": {}, + "outputs": [], + "source": [ + "# Specify sampled Shapley feature attribution method with path_count parameter \n", + "# controlling the number of feature permutations to consider when approximating the Shapley values.\n", + "\n", + "explain_params = aiplatform.explain.ExplanationParameters(\n", + " {\"sampled_shapley_attribution\": {\"path_count\": 10}}\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a1cec81", + "metadata": {}, + "outputs": [], + "source": [ + "# https://cloud.google.com/vertex-ai/docs/reference/rest/v1beta1/ExplanationSpec\n", + "input_metadata = {\n", + " \"input_tensor_name\": serving_input,\n", + " \"encoding\": \"BAG_OF_FEATURES\",\n", + " \"modality\": \"numeric\",\n", + " \"index_feature_mapping\": feature_names,\n", + "}\n", + "\n", + "output_metadata = {\"output_tensor_name\": serving_output}\n", + "\n", + "input_metadata = aiplatform.explain.ExplanationMetadata.InputMetadata(input_metadata)\n", + "output_metadata = aiplatform.explain.ExplanationMetadata.OutputMetadata(output_metadata)\n", + "\n", + "explain_metadata = aiplatform.explain.ExplanationMetadata(\n", + " inputs={\"features\": input_metadata}, outputs={\"medv\": output_metadata}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8692547b", + "metadata": {}, + "source": [ + "## Deploy a Vertex `Endpoint` for online predictions" + ] + }, + { + "cell_type": "markdown", + "id": "2ba9cd05", + "metadata": {}, + "source": [ + "Before you use your model to make predictions, you need to deploy it to an `Endpoint` object. When you deploy a model to an `Endpoint`, you associate physical (machine) resources with that model to enable it to serve online predictions. Online predictions have low latency requirements; providing resources to the model in advance reduces latency. You can do this by calling the deploy function on the `Model` resource. This will do two things:\n", + "\n", + "1. Create an `Endpoint` resource for deploying the `Model` resource to.\n", + "2. Deploy the `Model` resource to the `Endpoint` resource.\n", + "\n", + "The `deploy()` function takes the following parameters:\n", + "\n", + "* `deployed_model_display_name`: A human readable name for the deployed model.\n", + "* `traffic_split`: Percent of traffic at the endpoint that goes to this model, which is specified as a dictionary of one or more key/value pairs. 
If only one model, then specify as { \"0\": 100 }, where \"0\" refers to this model being uploaded and 100 means 100% of the traffic.\n", + "* `machine_type`: The type of machine to use for training.\n", + "* `accelerator_type`: The hardware accelerator type.\n", + "* `accelerator_count`: The number of accelerators to attach to a worker replica.\n", + "* `starting_replica_count`: The number of compute instances to initially provision.\n", + "* `max_replica_count`: The maximum number of compute instances to scale to. In this lab, only one instance is provisioned.\n", + "* `explanation_parameters`: Metadata to configure the Explainable AI learning method.\n", + "* `explanation_metadata`: Metadata that describes your TensorFlow model for Explainable AI such as features, input and output tensors.\n", + "\n", + "Note: This can take about 15 minutes to provision prediction resources for your model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "726c0e82", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = model.deploy(\n", + " traffic_split={\"0\": 100},\n", + " machine_type=\"e2-standard-2\",\n", + " explanation_parameters=explain_params,\n", + " explanation_metadata=explain_metadata\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "9bc4f1c7", + "metadata": {}, + "source": [ + "## Get an online prediction and explanation from deployed model" + ] + }, + { + "cell_type": "markdown", + "id": "36aaa774", + "metadata": {}, + "source": [ + "Finally, you use your `Endpoint` to retrieve predictions and feature attributions. This is a customer instance retrieved from the test set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "875bab00", + "metadata": {}, + "outputs": [], + "source": [ + "# actual: 3181.04\n", + "test_instance_dict = {\n", + " \"n_purchases\": 2,\n", + " \"avg_purchase_size\": 536.5,\n", + " \"avg_purchase_revenue\": 1132.7,\n", + " \"customer_age\": 123,\n", + " \"days_since_last_purchase\": 32,\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "d0946246", + "metadata": {}, + "source": [ + "To request predictions, you call the `predict()` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b9f446c", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint.predict([test_instance_dict])" + ] + }, + { + "cell_type": "markdown", + "id": "4ba59e1d", + "metadata": {}, + "source": [ + "To retrieve explanations (predictions + feature attributions), call the `explain()` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c78e91f", + "metadata": {}, + "outputs": [], + "source": [ + "explanations = endpoint.explain([test_instance_dict])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "999cda11", + "metadata": {}, + "outputs": [], + "source": [ + "pd.DataFrame.from_dict(explanations.explanations[0].attributions[0].feature_attributions, orient='index').plot(kind='barh');" + ] + }, + { + "cell_type": "markdown", + "id": "195e9dcc", + "metadata": {}, + "source": [ + "Based on the feature attributions for this prediction, your model has learned that average purchase revenue and customer age had the largest marginal contribution in predicting this customer's monetary value over the 3-month test period. It also identified the relatively lengthy days since last purchase as negatively impacting the prediction. 
Using these insights, you can plan for an experiment to evaluate targeted marketing interventions for this repeat customer, such as volume discounts, to encourage this customer to purchase more frequently in order to drive additional revenue." + ] + }, + { + "cell_type": "markdown", + "id": "2fc312cf", + "metadata": {}, + "source": [ + "## Next steps" + ] + }, + { + "cell_type": "markdown", + "id": "30ab0ae3", + "metadata": {}, + "source": [ + "Congratulations! In this lab, you walked through a machine learning experimentation workflow using Google Cloud's BigQuery for data storage and analysis and Vertex AI machine learning services to train and deploy a TensorFlow model to predict customer lifetime value. You progressed from training a TensorFlow model locally to training on the cloud with Vertex AI and leveraged several new unified platform capabilities such as Vertex TensorBoard and Explainable AI prediction feature attributions." + ] + }, + { + "cell_type": "markdown", + "id": "0749f152", + "metadata": {}, + "source": [ + "## License" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d2cfd56", + "metadata": {}, + "outputs": [], + "source": [ + "# Copyright 2021 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + } + ], + "metadata": { + "environment": { + "name": "tf2-gpu.2-3.m75", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-3:m75" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/Dockerfile b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/Dockerfile new file mode 100644 index 0000000000..7315c375e4 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/Dockerfile @@ -0,0 +1,21 @@ +# Specifies base image and tag. +# https://cloud.google.com/vertex-ai/docs/training/pre-built-containers +# us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-3:latest +FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3 + +# Sets the container working directory. +WORKDIR /root + +# Copies the requirements.txt into the container to reduce network calls. +COPY requirements.txt . +# Installs additional packages. +RUN pip3 install -U -r requirements.txt + +# Copies the trainer code to the docker image. +COPY . /trainer + +# Sets the container working directory. +WORKDIR /trainer + +# Sets up the entry point to invoke the trainer. 
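+# Note: with WORKDIR /trainer and the package copied in above, "python -m trainer.task" resolves to /trainer/trainer/task.py.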
+ENTRYPOINT ["python", "-m", "trainer.task"] diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/cloudbuild.yaml b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/cloudbuild.yaml new file mode 100644 index 0000000000..ab3fbcc27f --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/cloudbuild.yaml @@ -0,0 +1,5 @@ +steps: +- name: 'gcr.io/cloud-builders/docker' + args: [ 'build', '-t', 'us-central1-docker.pkg.dev/dougkelly-vertex-demos/online-retail-clv/dnn-regressor:latest', '.' ] +images: +- 'us-central1-docker.pkg.dev/dougkelly-vertex-demos/online-retail-clv/dnn-regressor:latest' \ No newline at end of file diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/requirements.txt b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/requirements.txt new file mode 100644 index 0000000000..af3b7f30a2 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/requirements.txt @@ -0,0 +1,3 @@ +explainable-ai-sdk==1.3.0 +tensorflow-io==0.16.0 +pyarrow diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/trainer/__init__.py b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/trainer/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/trainer/model.py b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/trainer/model.py new file mode 100644 index 0000000000..f2fda59021 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/trainer/model.py @@ -0,0 +1,149 @@ +import os +import logging +import tempfile +import tensorflow as tf +from explainable_ai_sdk.metadata.tf.v2 import SavedModelMetadataBuilder +from tensorflow.python.framework import dtypes +from tensorflow_io.bigquery import BigQueryClient +from tensorflow_io.bigquery import BigQueryReadSession + + +# Model feature constants. +NUMERIC_FEATURES = [ + "n_purchases", + "avg_purchase_size", + "avg_purchase_revenue", + "customer_age", + "days_since_last_purchase", +] + +CATEGORICAL_FEATURES = [ + "customer_country" +] + +LABEL = "target_monetary_value_3M" + + +def caip_uri_to_fields(uri): + """Helper function to parse BQ URI.""" + # Remove bq:// prefix. + uri = uri[5:] + project, dataset, table = uri.split('.') + return project, dataset, table + + +def features_and_labels(row_data): + """Helper feature and label mapping function for tf.data.""" + label = row_data.pop(LABEL) + features = row_data + return features, label + + +def read_bigquery(project, dataset, table): + """TensorFlow IO BigQuery Reader.""" + tensorflow_io_bigquery_client = BigQueryClient() + read_session = tensorflow_io_bigquery_client.read_session( + parent="projects/" + project, + project_id=project, + dataset_id=dataset, + table_id=table, + # Pass list of features and label to be selected from BQ. + selected_fields=NUMERIC_FEATURES + [LABEL], + # Provide output TensorFlow data types for features and label. 
+ output_types=[dtypes.int64, dtypes.float64, dtypes.float64, dtypes.int64, dtypes.int64] + [dtypes.float64], + requested_streams=2) + dataset = read_session.parallel_read_rows() + transformed_ds = dataset.map(features_and_labels) + return transformed_ds + + +def rmse(y_true, y_pred): + """Custom RMSE regression metric.""" + return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true))) + + +def build_model(hparams): + """Build and compile a TensorFlow Keras DNN Regressor.""" + + feature_columns = [ + tf.feature_column.numeric_column(key=feature) + for feature in NUMERIC_FEATURES + ] + + input_layers = { + feature.key: tf.keras.layers.Input(name=feature.key, shape=(), dtype=tf.float32) + for feature in feature_columns + } + # Keras Functional API: https://keras.io/guides/functional_api + inputs = tf.keras.layers.DenseFeatures(feature_columns, name='inputs')(input_layers) + d1 = tf.keras.layers.Dense(256, activation=tf.nn.relu, name='d1')(inputs) + d2 = tf.keras.layers.Dropout(hparams['dropout'], name='d2')(d1) + # Note: a single neuron scalar output for regression. + output = tf.keras.layers.Dense(1, name='output')(d2) + + model = tf.keras.Model(input_layers, output, name='online-retail-clv') + + optimizer = tf.keras.optimizers.Adam(hparams['learning-rate']) + + # Note: MAE loss is more resistant to outliers than MSE. + model.compile(loss=tf.keras.losses.MAE, + optimizer=optimizer, + metrics=[['mae', 'mse', rmse]]) + + return model + + +def train_evaluate_explain_model(hparams): + """Train, evaluate, explain TensorFlow Keras DNN Regressor. + Args: + hparams(dict): A dictionary containing model training arguments. + Returns: + history(tf.keras.callbacks.History): Keras callback that records training event history. + """ + training_ds = read_bigquery(*caip_uri_to_fields(hparams['training-data-uri'])).prefetch(1).shuffle(hparams['batch-size']*10).batch(hparams['batch-size']).repeat() + eval_ds = read_bigquery(*caip_uri_to_fields(hparams['validation-data-uri'])).prefetch(1).shuffle(hparams['batch-size']*10).batch(hparams['batch-size']) + test_ds = read_bigquery(*caip_uri_to_fields(hparams['test-data-uri'])).prefetch(1).shuffle(hparams['batch-size']*10).batch(hparams['batch-size']) + + model = build_model(hparams) + logging.info(model.summary()) + + tensorboard_callback = tf.keras.callbacks.TensorBoard( + log_dir=hparams['tensorboard-dir'], + histogram_freq=1) + + # Reduce overfitting and shorten training times. + earlystopping_callback = tf.keras.callbacks.EarlyStopping(patience=2) + + # Ensure your training job's resilience to VM restarts. + checkpoint_callback = tf.keras.callbacks.ModelCheckpoint( + filepath= hparams['checkpoint-dir'], + save_weights_only=True, + monitor='val_loss', + mode='min') + + # Virtual epochs design pattern: + # https://medium.com/google-cloud/ml-design-pattern-3-virtual-epochs-f842296de730 + TOTAL_TRAIN_EXAMPLES = int(hparams['stop-point'] * hparams['n-train-examples']) + STEPS_PER_EPOCH = (TOTAL_TRAIN_EXAMPLES // (hparams['batch-size']*hparams['n-checkpoints'])) + + history = model.fit(training_ds, + validation_data=eval_ds, + steps_per_epoch=STEPS_PER_EPOCH, + epochs=hparams['n-checkpoints'], + callbacks=[[tensorboard_callback, + earlystopping_callback, + checkpoint_callback]]) + + logging.info(model.evaluate(test_ds)) + + # Create a temp directory to save intermediate TF SavedModel prior to Explainable metadata creation. + tmpdir = tempfile.mkdtemp() + + # Export Keras model in TensorFlow SavedModel format. 
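+    # The explainability-annotated copy written below is what lands in hparams['model-dir'] (AIP_MODEL_DIR, a GCS path set by Vertex AI).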
+    model.save(tmpdir)
+
+    # Annotate and save TensorFlow SavedModel with Explainable metadata to GCS.
+    builder = SavedModelMetadataBuilder(tmpdir)
+    builder.save_model_with_metadata(hparams['model-dir'])
+
+    return history
diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/trainer/task.py b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/trainer/task.py
new file mode 100644
index 0000000000..0c03902fa6
--- /dev/null
+++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/online-retail-clv-3M/trainer/task.py
@@ -0,0 +1,34 @@
+import os
+import argparse
+
+from trainer import model
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    # Vertex custom container training args. These are set by Vertex AI during training and can be overwritten.
+    parser.add_argument('--model-dir', dest='model-dir',
+                        default=os.environ['AIP_MODEL_DIR'], type=str, help='Model dir.')
+    parser.add_argument('--checkpoint-dir', dest='checkpoint-dir',
+                        default=os.environ['AIP_CHECKPOINT_DIR'], type=str, help='Checkpoint dir set during Vertex AI training.')
+    parser.add_argument('--tensorboard-dir', dest='tensorboard-dir',
+                        default=os.environ['AIP_TENSORBOARD_LOG_DIR'], type=str, help='Tensorboard dir set during Vertex AI training.')
+    parser.add_argument('--data-format', dest='data-format',
+                        default=os.environ['AIP_DATA_FORMAT'], type=str, help="Tabular data format set during Vertex AI training. E.g. 'csv', 'bigquery'.")
+    parser.add_argument('--training-data-uri', dest='training-data-uri',
+                        default=os.environ['AIP_TRAINING_DATA_URI'], type=str, help='Training data GCS or BQ URI set during Vertex AI training.')
+    parser.add_argument('--validation-data-uri', dest='validation-data-uri',
+                        default=os.environ['AIP_VALIDATION_DATA_URI'], type=str, help='Validation data GCS or BQ URI set during Vertex AI training.')
+    parser.add_argument('--test-data-uri', dest='test-data-uri',
+                        default=os.environ['AIP_TEST_DATA_URI'], type=str, help='Test data GCS or BQ URI set during Vertex AI training.')
+    # Model training args.
+ parser.add_argument('--learning-rate', dest='learning-rate', default=0.001, type=float, help='Learning rate for optimizer.') + parser.add_argument('--dropout', dest='dropout', default=0.2, type=float, help='Float percentage of DNN nodes [0,1] to drop for regularization.') + parser.add_argument('--batch-size', dest='batch-size', default=16, type=int, help='Number of examples during each training iteration.') + parser.add_argument('--n-train-examples', dest='n-train-examples', default=2638, type=int, help='Number of examples to train on.') + parser.add_argument('--stop-point', dest='stop-point', default=10, type=int, help='Number of passes through the dataset during training to achieve convergence.') + parser.add_argument('--n-checkpoints', dest='n-checkpoints', default=10, type=int, help='Number of model checkpoints to save during training.') + + args = parser.parse_args() + hparams = args.__dict__ + + model.train_evaluate_explain_model(hparams) diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/requirements.txt b/self-paced-labs/vertex-ai/train-deploy-tf-model/requirements.txt new file mode 100644 index 0000000000..cfb9a83554 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/requirements.txt @@ -0,0 +1,12 @@ +tensorflow==2.15.0 +pyarrow==10.0.1 +httplib2>=0.20.4 +grpcio-status>=1.38.1 +google-api-python-client>=1.8.0 +apache-beam>=2.28.0 +google-cloud-aiplatform[tensorboard]>=1.8.0 +six==1.16.0 +wget==3.2 +xlrd==2.0.1 +openpyxl==3.0.10 +pandas >= 1.5 \ No newline at end of file diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/data_download.py b/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/data_download.py new file mode 100644 index 0000000000..1525169733 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/data_download.py @@ -0,0 +1,188 @@ +# Copyright 2021 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import logging +import shutil +import wget +import argparse +import pandas as pd +from google.cloud import storage +from google.cloud import bigquery +from google.cloud.exceptions import NotFound, Conflict + +from dataset_schema import table_schema +from dataset_clean import dataset_clean_query +from dataset_ml import dataset_ml_query + +LOCAL_PATH ="./data" +FILENAME = "online_retail" + + +def download_url2gcs(args): + """ + args: + """ + + #set GCS client. + client = storage.Client() + + # Retrieve GCS bucket. + bucket = client.get_bucket(args.GCS_BUCKET) + blob = bucket.blob("data/online_retail.csv") + + #See if file already exists. 
+    if not blob.exists():
+        try:
+            os.mkdir(LOCAL_PATH)
+            logging.info('Downloading xlsx file...')
+            local_xlsx = wget.download(args.URL, out=f"{LOCAL_PATH}/{FILENAME}.xlsx")
+            logging.info('Converting xlsx -> csv...')
+            df = pd.read_excel(local_xlsx)
+            df.to_csv(f"{LOCAL_PATH}/{FILENAME}.csv", index=False)
+            logging.info('Uploading local csv file to GCS...')
+            blob.upload_from_filename(f"{LOCAL_PATH}/{FILENAME}.csv")
+            logging.info('Copied local csv file to GCS.')
+            # Delete all contents of the local tmp directory using shutil.rmtree() and handle exceptions.
+            try:
+                shutil.rmtree(LOCAL_PATH)
+                logging.info('Cleaning up local tmp data directory...')
+            except OSError:
+                logging.error('Error while deleting local tmp data directory.')
+
+        # Log any error raised while downloading, converting, or uploading the file.
+        except BaseException as error:
+            logging.error('An exception occurred: {}'.format(error))
+
+    # Skip the download if the file already exists in GCS.
+    else:
+        logging.warning('File already exists in GCS.')
+
+
+def upload_gcs2bq(args, schema):
+    """Load the staged CSV from GCS into a BigQuery raw table.
+    args: parsed command-line arguments.
+    schema: list of bigquery.SchemaField objects for the raw table.
+    """
+    # Construct a BigQuery client object.
+    client = bigquery.Client()
+
+    # Construct a full Dataset object to send to the API.
+    logging.info('Initializing BigQuery dataset.')
+    dataset = bigquery.Dataset(f"{args.PROJECT_ID}.{args.BQ_DATASET_NAME}")
+    # Specify the geographic location where the dataset should reside before creation.
+    dataset.location = args.BQ_LOCATION
+
+    try:
+        # Send the dataset to the API for creation, with an explicit timeout.
+        # Raises google.api_core.exceptions.Conflict if the Dataset already
+        # exists within the project.
+        dataset = client.create_dataset(dataset, timeout=30)  # Make an API request.
+    except Conflict:
+        logging.warning('Dataset %s already exists, not creating.', dataset.dataset_id)
+    else:
+        logging.info("Created dataset %s.%s", client.project, dataset.dataset_id)
+
+    try:
+        URI = f"gs://{args.GCS_BUCKET}/data/{FILENAME}.csv"
+        RAW_TABLE_ID = f"{args.PROJECT_ID}.{args.BQ_DATASET_NAME}.{args.BQ_RAW_TABLE_NAME}"
+
+        # Load job.
+        job_config = bigquery.LoadJobConfig(
+            schema=schema,
+            skip_leading_rows=1,
+            allow_jagged_rows=True,
+            write_disposition="WRITE_TRUNCATE",
+            source_format=bigquery.SourceFormat.CSV)
+        load_job = client.load_table_from_uri(source_uris=URI, destination=RAW_TABLE_ID, job_config=job_config)
+        logging.info('BQ raw dataset load job starting...')
+        load_job.result()  # Waits for the job to complete.
+        logging.info('BQ raw dataset load job complete.')
+    except BaseException as error:
+        logging.error('An exception occurred: {}'.format(error))
+
+    destination_table = client.get_table(RAW_TABLE_ID)  # Make an API request.
+    logging.info("Loaded %s rows into %s.", destination_table.num_rows, RAW_TABLE_ID)
+
+
+def make_dataset_clean_bq(args, query: str):
+    """Create the cleaned BigQuery table from the raw table.
+    args: parsed command-line arguments.
+    query: templated SQL with @CLEAN_TABLE_ID and @RAW_TABLE_ID placeholders.
+    """
+    client = bigquery.Client()
+    CLEAN_TABLE_ID = f"{args.PROJECT_ID}.{args.BQ_DATASET_NAME}.{args.BQ_CLEAN_TABLE_NAME}"
+    RAW_TABLE_ID = f"{args.PROJECT_ID}.{args.BQ_DATASET_NAME}.{args.BQ_RAW_TABLE_NAME}"
+
+    clean_query = query.replace("@CLEAN_TABLE_ID", CLEAN_TABLE_ID).replace("@RAW_TABLE_ID", RAW_TABLE_ID)
+
+    logging.info('BQ make clean dataset starting...')
+    try:
+        job = client.query(clean_query)
+        _ = job.result()
+        logging.info('BQ make clean dataset complete')
+    except BaseException as error:
+        logging.error('An exception occurred: {}'.format(error))
+
+    destination_table = client.get_table(CLEAN_TABLE_ID)  # Make an API request.
+ logging.info("Loaded %s rows into %s.",destination_table.num_rows, CLEAN_TABLE_ID) + + +def make_dataset_ml_bq(args, query: str): + """ + args: + query: + """ + client = bigquery.Client() + ML_TABLE_ID = f"{args.PROJECT_ID}.{args.BQ_DATASET_NAME}.{args.BQ_ML_TABLE_NAME}" + CLEAN_TABLE_ID = f"{args.PROJECT_ID}.{args.BQ_DATASET_NAME}.{args.BQ_CLEAN_TABLE_NAME}" + + ml_query = query.replace("@ML_TABLE_ID", ML_TABLE_ID).replace("@CLEAN_TABLE_ID", CLEAN_TABLE_ID) + + logging.info('BQ make ML dataset starting...') + try: + job = client.query(ml_query) + _ = job.result() + logging.info('BQ make ML dataset complete') + except BaseException as error: + logging.error('An exception occurred: {}'.format(error)) + + destination_table = client.get_table(ML_TABLE_ID) # Make an API request. + logging.info("Loaded %s rows into %s.",destination_table.num_rows, ML_TABLE_ID) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument("--PROJECT_ID", dest="PROJECT_ID", type=str, required=True) + parser.add_argument("--GCS_BUCKET", dest="GCS_BUCKET", type=str, required=True) + parser.add_argument("--URL", dest="URL", type=str, required=True) + parser.add_argument("--BQ_DATASET_NAME", dest="BQ_DATASET_NAME", type=str, default="online_retail") + parser.add_argument("--BQ_LOCATION", dest="BQ_LOCATION", type=str, default="US") + parser.add_argument("--BQ_RAW_TABLE_NAME", dest="BQ_RAW_TABLE_NAME", type=str, default="online_retail_clv_raw") + parser.add_argument("--BQ_CLEAN_TABLE_NAME", dest="BQ_CLEAN_TABLE_NAME", type=str, default="online_retail_clv_clean") + parser.add_argument("--BQ_ML_TABLE_NAME", dest="BQ_ML_TABLE_NAME", type=str, default="online_retail_clv_ml") + + args = parser.parse_args() + + logging.basicConfig( + level=logging.INFO, + format="\n %(asctime)s [%(levelname)s] %(message)s", + handlers=[logging.StreamHandler()] + ) + + download_url2gcs(args) + upload_gcs2bq(args, table_schema) + make_dataset_clean_bq(args, dataset_clean_query) + make_dataset_ml_bq(args, dataset_ml_query) \ No newline at end of file diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_clean.py b/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_clean.py new file mode 100644 index 0000000000..b99c6acf9f --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_clean.py @@ -0,0 +1,49 @@ +# Copyright 2021 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""TODO.""" + +dataset_clean_query = """ +CREATE OR REPLACE TABLE `@CLEAN_TABLE_ID` +AS ( +WITH + customer_daily_sales AS ( + SELECT + CustomerID AS customer_id, + Country AS customer_country, + EXTRACT(DATE FROM InvoiceDate) AS order_date, + COUNT(DISTINCT InvoiceNo) AS n_purchases, + SUM(Quantity) AS order_qty, + ROUND(SUM(UnitPrice * Quantity), 2) AS revenue + FROM + `@RAW_TABLE_ID` + WHERE + CustomerID IS NOT NULL + AND Quantity > 0 + GROUP BY + customer_id, + customer_country, + order_date) + +SELECT + customer_id, + customer_country, + order_date, + n_purchases, + order_qty, + revenue +FROM + customer_daily_sales +) + +""" \ No newline at end of file diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_ml.py b/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_ml.py new file mode 100644 index 0000000000..84d4e3e30d --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_ml.py @@ -0,0 +1,72 @@ +# Copyright 2021 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +dataset_ml_query = """ +CREATE OR REPLACE TABLE `@ML_TABLE_ID` +AS ( +WITH +-- Calculate features before CUTOFF_DATE date. + features AS ( + SELECT + customer_id, + customer_country, + COUNT(n_purchases) AS n_purchases, + AVG(order_qty) AS avg_purchase_size, + AVG(revenue) AS avg_purchase_revenue, + DATE_DIFF(MAX(order_date), MIN(order_date), DAY) AS customer_age, + DATE_DIFF(DATE('2011-09-01'), MAX(order_date), DAY) AS days_since_last_purchase + FROM + `@CLEAN_TABLE_ID` + WHERE + order_date <= DATE('2011-09-01') + GROUP BY + customer_id, + customer_country), + + -- Calculate customer target monetary value over historical period + 3M future period. + label AS ( + SELECT + customer_id, + SUM(revenue) AS target_monetary_value_3M + FROM + `@CLEAN_TABLE_ID` + WHERE + order_date < DATE('2011-12-01') + GROUP BY + customer_id + ) + +SELECT + features.customer_id, + features.customer_country, + features.n_purchases, -- frequency + features.avg_purchase_size, --monetary + features.avg_purchase_revenue, --monetary + features.customer_age, + features.days_since_last_purchase, --recency + label.target_monetary_value_3M, --target + CASE + WHEN MOD(ABS(FARM_FINGERPRINT(CAST(features.customer_id AS STRING))), 10) < 8 + THEN 'TRAIN' + WHEN MOD(ABS(FARM_FINGERPRINT(CAST(features.customer_id AS STRING))), 10) = 9 + THEN 'VALIDATE' + ELSE + 'TEST' END AS data_split +FROM + features +INNER JOIN label + ON features.customer_id = label.customer_id +); +""" \ No newline at end of file diff --git a/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_schema.py b/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_schema.py new file mode 100644 index 0000000000..f12cccb9f2 --- /dev/null +++ b/self-paced-labs/vertex-ai/train-deploy-tf-model/utils/dataset_schema.py @@ -0,0 +1,27 @@ +# Copyright 2021 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from google.cloud import bigquery + + +table_schema = [ + bigquery.SchemaField("InvoiceNo", "STRING"), + bigquery.SchemaField("StockCode", "STRING"), + bigquery.SchemaField("Description", "STRING", mode="NULLABLE"), + bigquery.SchemaField("Quantity", "INTEGER"), + bigquery.SchemaField("InvoiceDate", "TIMESTAMP"), + bigquery.SchemaField("UnitPrice", "FLOAT"), + bigquery.SchemaField("CustomerID", "STRING", mode="NULLABLE"), + bigquery.SchemaField("Country", "STRING"), +] \ No newline at end of file
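
For reference, here is a minimal local smoke-test sketch for the trainer package added above. It is not part of the lab code: it simply fills in the same `hparams` keys that `trainer/task.py` would otherwise derive from the `AIP_*` environment variables set by Vertex AI and calls `model.train_evaluate_explain_model()` directly. The project, dataset, and table names and the local output paths below are placeholders (assumptions), and running it requires the BigQuery split tables to exist and `tensorflow`, `tensorflow-io`, and `explainable-ai-sdk` to be installed.

```python
# Hypothetical local smoke test for the trainer package; all names below are placeholders.
import logging

from trainer import model

logging.getLogger().setLevel(logging.INFO)

hparams = {
    # Output locations task.py normally reads from the AIP_* environment variables.
    'model-dir': '/tmp/clv/model',
    'checkpoint-dir': '/tmp/clv/checkpoints',
    'tensorboard-dir': '/tmp/clv/tensorboard',
    'data-format': 'bigquery',
    # BigQuery URIs in the bq://project.dataset.table form expected by caip_uri_to_fields().
    'training-data-uri': 'bq://your-project.online_retail.clv_train',
    'validation-data-uri': 'bq://your-project.online_retail.clv_validate',
    'test-data-uri': 'bq://your-project.online_retail.clv_test',
    # Model training args, matching the task.py defaults.
    'learning-rate': 0.001,
    'dropout': 0.2,
    'batch-size': 16,
    'n-train-examples': 2638,
    'stop-point': 10,
    'n-checkpoints': 10,
}

history = model.train_evaluate_explain_model(hparams)
print(history.history.keys())
```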