The Watercooler components are:
- the AppService code (mainly Scala and JavaScript code)
  - built into a docker image
- the ADF (Azure DataFactory) components (pipelines, datasets, linked services, triggers etc.) which orchestrate the offline processing of data
  - an ARM template in a json file named pipelines.json
- the ADB (Azure Databricks) Spark jobs (Scala code) processing source data
  - built into jars
- the common Scala code used by both the AppService code and the ADB jobs
  - a Maven module which gets shaded into the jars and into the docker image
- the ADB PySpark jobs processing source data
  - Python scripts delivered as they are
- the docker image used for the Watercooler AppService
  - needs to be uploaded to a docker repository (Azure Container Registry is recommended) from where it can be accessed by the deployment script and the App Service
- the project artifacts zip, containing the necessary jars, scripts and ARM templates
- Install JDK 1.8
  - for example, to install OpenJDK, follow the appropriate installation steps from https://openjdk.java.net/install/ or https://github.com/AdoptOpenJDK/openjdk8-upstream-binaries/releases/tag/jdk8u292-b10 (take the latest zip build), or search online for a more detailed installation guide matching your OS
  - for Oracle JDK, more help can be found here: https://java.com/en/download/help/download_options.html
  - for OpenJDK, more help can be found here: https://openjdk.java.net/install/
  - required to build, develop and run the project's JVM based components (the AppService code and the ADB Scala Spark jobs)
  - set the $JAVA_HOME variable to the location of the JDK installation path (Observation: on Windows set JAVA_HOME in the System Environment Variables and also add %JAVA_HOME%\bin to PATH)
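A minimal sketch for setting JAVA_HOME on macOS/Linux (the installation path below is only an example; use the actual location of your JDK 1.8 installation):

```bash
# add to ~/.bashrc or ~/.zshrc; the JDK path is an example and depends on your installation
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"

# verify the JDK is picked up
java -version
```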
- Install latest Maven 3 version
  - download the latest Maven 3 binary distribution archive from https://maven.apache.org/download.cgi
  - installation steps: https://maven.apache.org/install.html
  - required to build, develop and run the project's JVM based components (the AppService code and the ADB Scala Spark jobs)
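A quick verification sketch; `mvn -version` also reports the JDK it resolved via JAVA_HOME, so it confirms both prerequisites at once:

```bash
# confirm Maven 3.x is installed and that it uses the intended JDK 1.8
mvn -version
```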
NOTE: For developing the application, only macOS/Linux based systems were used.
Building the application is possible from Windows systems (using Powershell, not the command-line interpreter), however solid knowledge of your development environment and of the differences between Windows/Unix systems is required. Some adaptations of the provided instructions, as well as some additional steps during deployment, might be required.
Observation: if you are on Windows and you encounter issues running `./prepare_artifacts.sh`, then it is mandatory to do one of the following:
- either download git-scm from http://git-scm.com/download/win and use git-bash to run the script that prepares the artifacts
- or run `sed -i 's/\r$//' install.sh` as specified in ./azure/README.md (note that on Windows the `sed` command can only be executed from git-bash or Cygwin)
- Build the necessary jars using Maven, run from the `jwc` folder.
  For faster builds, skip the unit tests with:
  `mvn clean install -DskipTests`
  To build the jars without skipping the tests, navigate to the `jwc` folder and run `mvn clean install`.
  However, in order to have your tests pass, you'll have to provide connectivity to a sql database.
  To provide access to the database, follow the steps described here.
  The resulting jars can be found in the `target` folder of the corresponding module.
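A minimal build sketch (the wildcard path to the produced jars is illustrative; the exact module layout under `jwc` may differ):

```bash
# build all modules from the jwc folder, skipping the unit tests
cd jwc
mvn clean install -DskipTests

# each module's jar ends up in its target folder, e.g.:
ls -1 */target/*.jar
```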
- Building the docker image
  Observation: this step is optional; if you don't have the possibility to push to a given registry, please use the default provided version: 0.2.1
  `docker build -t contosohub.azurecr.io/microsoft-gdc/watercooler:x.y.z .`
  `docker push contosohub.azurecr.io/microsoft-gdc/watercooler:x.y.z`
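A hedged end-to-end sketch of this optional step, assuming the `az` CLI is installed and you have push rights to the registry (the registry name and tag below reuse the examples from this document):

```bash
# authenticate docker against the Azure Container Registry
az acr login --name contosohub

# build the image from the folder containing the Dockerfile, then push it
docker build -t contosohub.azurecr.io/microsoft-gdc/watercooler:0.2.1 .
docker push contosohub.azurecr.io/microsoft-gdc/watercooler:0.2.1
```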
- built via `bin/prepare_artifacts.sh`
- needs to be uploaded to a "project deliverables AZBS storage account", to make it easily accessible to admins performing deployments
- contains:
  - the jars, python scripts and python wheel files, defining the Spark jobs that run on ADB via ADF pipelines
  - the pipelines.json file, defining the ADF entities
  - the schema sql file, containing the database schema creation statements
  - all the scripts and ARM templates used to deploy the application on a new environment
- navigate to the `./bin` folder and execute: `./prepare_artifacts.sh`
- the result of running the above command is a tar.gz file, which we recommend renaming to `wc-x.y.z.tar.gz`: `mv build.tar.gz wc-0.2.1.tar.gz`
- upload the `wc-x.y.z.tar.gz` build file to a specific storage account from where you can download it via wget: `wget https://testdeployment.blob.core.windows.net/watercooler-artifact/wc-x.y.z.tar.gz`
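One way to perform the upload is with the `az` CLI; a hedged sketch, where the storage account and container names are taken from the example wget URL above and should be replaced with your own:

```bash
# upload the build archive to the project deliverables storage account
az storage blob upload \
  --account-name testdeployment \
  --container-name watercooler-artifact \
  --name wc-0.2.1.tar.gz \
  --file ./wc-0.2.1.tar.gz \
  --auth-mode login
```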
In order to proceed with the installation in the Azure Cloud, please follow the instructions from ./azure/README.MD.
Please see the build pipeline documentation for details
Depending on what changed from one release to another, some or all of the steps below need to be performed:
- update the ADB libraries (jars, python scripts and python utils wheel)
  - detailed deployment steps are provided in the Deploying individual components section
  - the current jars are: jwc-events-creator.jar, jwc-profiles-extractor.jar
  - the python scripts are: 000_cleanup.py, 01_calendar_spark_processor.py, 01_1_calendar_events_attendance.py, 1_2_update_group_members_invitation_status.py, 02_profiles_spark_processor.py, 03_persons_to_events_dill_assembler.py, 04_generate_timetable_kmeans.py, 05_export_to_csv.py, 06_spark_export_to_sql.py
  - changes to the DB schema
- update the ADF entities (linked services, datasets, global parameters, pipelines, triggers etc.)
  - detailed deployment steps are provided in the Deploying individual components section
  - changes to the DB schema
- update the App Service
  - detailed deployment steps are provided in the Deploying individual components section
Perform the next steps if the existing data is to be overwritten (e.g. for deployments using older versions of simulated data)
- Stop ADF triggers
- Delete ADF triggers (mainly delete the window-based triggers, which would not run again automatically after deployment); a CLI sketch for stopping and deleting triggers follows this list
- Stop ongoing trigger runs and running pipelines (use "Cancel Recursive" where appropriate)
- Run cleanup pipeline End2EndCleanup
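A hedged sketch of stopping and deleting a trigger from the Azure CLI; it assumes the `datafactory` extension for `az` is installed, and all names are placeholders:

```bash
# stop a trigger, then delete it (window-based triggers typically need to be recreated later)
az datafactory trigger stop \
  --resource-group <adf_resource_group> \
  --factory-name <data_factory_name> \
  --name <trigger_name>

az datafactory trigger delete \
  --resource-group <adf_resource_group> \
  --factory-name <data_factory_name> \
  --name <trigger_name>
```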
Updating Azure DataFactory entities:
- update ADF linked services, if required
- update ADF global parameters, if required
- update ADF datasets, if required
  - if there are changes made to the watercooler table schema, the update has to be included in the pipelines ARM template update file described in the next step
- update ADF pipelines (two possible approaches: delete&recreate vs manual update)
  - delete&recreate (recommended approach):
    - this approach relies on creating a json ARM template file containing all the ADF entities which need to be recreated
      - this can be obtained by copying pipelines.json and deleting all irrelevant entities from it (e.g. linked services), as well as all dependencies
      - it should be named "ADF_update_<old_version>to<new_version>.json"
    - on environments using production data, the existing data from the Watercooler sql database should be preserved
      - in this case the deployment process might need to be customized so that as few ADF entities as possible are recreated and rerun
      - if non-backward-compatible changes are made to the pipelines and sql schema, then deleting the existing data is inevitable
  - manual update: requires the admin to fully understand the changes made between versions and to take the necessary steps to perform the migration directly from the ADF UI
- recreate ADF triggers based on current time
- start triggers that are relevant for the watercooler event creation pipeline
- wait for triggers to start pipelines and process new data
NOTE: Updating ADF on environments where git configuration is activated:
In order to be able to update ADF using ARM Template -> Import ARM Template, you have to first disconnect ADF from git. Execute the update and then reconnect ADF to git.
Sometimes, importing the template from the ADF UI can fail, so, alternatively, the import can be done from the Azure Bash Cloud Shell using the command
az deployment group create --resource-group <adf_resource_group> --template-file <arm_template_file_containing_relevant_changes>.json
The template file must first be uploaded to the Cloud Shell using the "Upload/Download files" button.
NOTE: When updating pipelines/datasets/triggers in ADF using ARM Template -> Import ARM Template, remove all linked services dependencies ("dependsOn": [...]) from pipelines/datasets/triggers in order to avoid defining them in the update json file.
The dependencies are only needed when deploying on a new ADF where the order in which linked service and pipelines/datasets are created counts.
- stop the Watercooler application from the App Service Overview page
- apply required DB migrations (DB schema or stored procedure changes made in the code, but not yet present on the target environment)
- this only needs to be explicitly done manually if the target env does not have flyway enabled
- deploy the latest watercooler docker image into the App Service
  - this can be done in several ways (see also the CLI sketch after this list):
    - from the App Service UI -> Deployment Center, using Azure Container Registry
      - for this to work, the App Service and the Container Registry must be in the same subscription
    - from the Azure CLI, using the following command:
      `az webapp config container set --name <app-service-name> --resource-group <existing_gdc_resource_group_name> --docker-custom-image-name contosohub.azurecr.io/microsoft-gdc/watercooler:<docker_image_tag> --docker-registry-server-url https://contosohub.azurecr.io -u jwc-readonly-token -p <password>`
- update required AppService env variables (Application Settings, Connection Settings) from the Configuration page
- start the application
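A hedged sketch of the CLI route for these steps: stop the App Service, switch the container image, then start it again (app name, resource group and registry password are placeholders):

```bash
# stop the App Service before switching the container image
az webapp stop --name <app-service-name> --resource-group <existing_gdc_resource_group_name>

# point the App Service at the new docker image
az webapp config container set \
  --name <app-service-name> \
  --resource-group <existing_gdc_resource_group_name> \
  --docker-custom-image-name contosohub.azurecr.io/microsoft-gdc/watercooler:<docker_image_tag> \
  --docker-registry-server-url https://contosohub.azurecr.io \
  -u jwc-readonly-token -p <password>

# start the application again
az webapp start --name <app-service-name> --resource-group <existing_gdc_resource_group_name>
```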
NOTE: The database update scripts don't have to be run manually on environments where flyway is enabled.
Flyway does this automatically when the application starts.
If the update scripts are run manually, then the application will fail at start-up because the flyway checksums won't match.
These libraries need to be deployed on the ADB cluster so that they can be run as spark jobs or used as utility code by such jobs.
In production, these Spark jobs are, in turn, only run by ADF pipelines.
Any such library can be deployed separately (e.g. on a development environment) to check that the latest changes work as expected.
The libraries can be deployed either from a project artifacts zip or from the local development environment.
Since deploying individual components (as opposed to deploying a whole new project build) is done on development environments
to quickly ensure a fix or feature works as expected before creating a new official build of the entire project, most often
individual components will be deployed from the local environment. However, there might be cases when deploying from the
project artifacts zip (downloaded from the project deliverables AZBS storage account) would make sense, therefore we are
also going to briefly describe this scenario.
Deploying from the local environment
- open a terminal
- change the current directory to the location of the library to upload
- make sure you have the Databricks CLI installed and configured to point to the desired ADB cluster (a combined upload-and-install sketch follows after this list)
  - connecting to several ADB clusters can be achieved by defining Connection Profiles
  - to check the currently defined profiles, as well as the default one, run `cat ~/.databrickscfg`
- upload the python scripts using the following command:
  `dbfs cp --overwrite --profile <profile> ./<artifact_to_upload> dbfs:/mnt/watercooler/scripts/`
- restart the ADB cluster
- install the new libraries (python utils wheel or jars) on the ADB cluster
- this step is explicitly required only if you are going to check that the new library works by directly running the ADB Spark job which it impacts. If you are going to simply run the ADF pipeline which depends on the library (more specifically, the pipeline which runs the ADB job that makes use of the library), then this step is not required
- now you can run the Spark job or ADF pipeline which is meant to make use of the new python scripts
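A hedged sketch of the full local flow, using the legacy Databricks CLI (the `databricks` and `dbfs` commands); the profile name, cluster id and artifact names are placeholders, while the dbfs target path is the one used above:

```bash
# install or upgrade the (legacy) Databricks CLI
pip install --upgrade databricks-cli

# upload a script and a jar to the cluster's mounted storage
dbfs cp --overwrite --profile <profile> ./<script_to_upload>.py dbfs:/mnt/watercooler/scripts/
dbfs cp --overwrite --profile <profile> ./<jar_to_upload>.jar dbfs:/mnt/watercooler/scripts/

# optionally install the jar as a cluster library and restart the cluster
databricks libraries install --profile <profile> --cluster-id <cluster_id> --jar dbfs:/mnt/watercooler/scripts/<jar_to_upload>.jar
databricks clusters restart --profile <profile> --cluster-id <cluster_id>
```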
Deploying from the project artifacts zip
This approach might be well suited when you have a slow internet connection locally, or when you are in a different geographical region than the target environment and want to deploy larger artifacts (e.g. large shaded jars).
If the resource you want to deploy was already built by the CI pipeline and uploaded to AZBS to the same region as the
target environment, then it might be faster than deploying it from local env.
- open the Azure Bash Cloud Shell from the Azure portal in your target environment
- download the project artifacts archive from the AZBS location where the CI pipeline uploaded it, using `wget <wc-x.y.z.tar.gz url>`
- extract it and change into the resulting `wc` folder (see the sketch below)
- continue with the instructions from ./azure/README.MD
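A hedged sketch of these Cloud Shell steps; the URL reuses the example from earlier in this document, and the archive is assumed to extract into a `wc` folder as the step above suggests:

```bash
# download, extract and enter the artifacts folder from the Azure Bash Cloud Shell
wget https://testdeployment.blob.core.windows.net/watercooler-artifact/wc-0.2.1.tar.gz
tar -xzf wc-0.2.1.tar.gz
cd wc
# then continue with ./azure/README.MD
```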
The `prepare_artifacts.sh` build script was tested on the following configurations:
- Windows 10 version 20H2 build: 19042.1052
- MacOS Catalina version 10.15.5 (19F101)
- MacOS BigSur version 11.4