Commit

deploy: c5a5abb

zzeppozz committed Oct 28, 2024
0 parents commit e7264d4
Showing 112 changed files with 12,626 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
config: ee2edea7cd0e405e075f4650e6ba2801
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/environment.pickle
Binary file added .doctrees/index.doctree
Binary file added .doctrees/pages/about.doctree
Binary file added .doctrees/pages/aws/automation.doctree
Binary file added .doctrees/pages/aws/aws_setup.doctree
Binary file added .doctrees/pages/aws/ec2_setup.doctree
Binary file added .doctrees/pages/aws/roles.doctree
Binary file added .doctrees/pages/history/aws_experiments.doctree
Binary file added .doctrees/pages/history/year3.doctree
Binary file added .doctrees/pages/history/year4_planA.doctree
Binary file added .doctrees/pages/history/year4_planB.doctree
Binary file added .doctrees/pages/history/year5.doctree
Binary file added .doctrees/pages/interaction/aws_prep.doctree
Binary file added .doctrees/pages/interaction/debug.doctree
Binary file added .doctrees/pages/interaction/deploy.doctree
Binary file added .doctrees/pages/workflow.doctree
Empty file added .nojekyll
Binary file added _images/lm_logo.png
45 changes: 45 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,45 @@
Welcome to LmBISON - RIIS Analysis
======================================

The BISON repository contains data and scripts to annotate GBIF occurrence records
with information regarding the geographic location and USGS RIIS status of each record.


Current
------------

.. toctree::
:maxdepth: 2

pages/about
pages/workflow

Setup AWS
------------

.. toctree::
:maxdepth: 2

pages/aws/aws_setup

Using BISON
------------

.. toctree::
:maxdepth: 2

pages/interaction/about

History
------------

.. toctree::
:maxdepth: 2

pages/history/year4_planB
pages/history/year4_planA
pages/history/year3
pages/history/year5
pages/history/aws_experiments

* :ref:`genindex`
12 changes: 12 additions & 0 deletions _sources/pages/about.rst.txt
@@ -0,0 +1,12 @@
About
========

The `Lifemapper BISON repository <https://github.com/lifemapper/bison>`_ is an open
source project supported by USGS award G19AC00211.

The aim of this repository is to provide a workflow for annotating and analyzing a
large set of United States specimen occurrence records for the USGS BISON project.

.. image:: ../.static/lm_logo.png
:width: 150
:alt: Lifemapper
140 changes: 140 additions & 0 deletions _sources/pages/aws/automation.rst.txt
@@ -0,0 +1,140 @@
Create lambda function to initiate processing
------------------------------------------------
* Create a lambda function, aws/events/bison_find_current_gbif_lambda.py, to execute
when the trigger condition is activated

* This trigger condition is a file deposited in the BISON bucket

* TODO: change to the first of the month

* The lambda function will delete the new file and test for the existence of
GBIF data for the current month (a sketch follows this list)

* TODO: change to mount GBIF data in Redshift, subset, unmount
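
A minimal sketch of such a handler with boto3 (bucket names are taken from this page;
the GBIF key prefix is an assumption)::

    import boto3
    from datetime import date

    S3 = boto3.client("s3")
    GBIF_BUCKET = "gbif-open-data-us-east-1"
    BISON_BUCKET = "bison-321942852011-us-east-1"

    def lambda_handler(event, context):
        # Delete the trigger file that was deposited in the bison bucket.
        key = event["Records"][0]["s3"]["object"]["key"]
        S3.delete_object(Bucket=BISON_BUCKET, Key=key)
        # Test for the existence of GBIF data for the current month.
        prefix = f"occurrence/{date.today():%Y-%m}-01/"
        resp = S3.list_objects_v2(Bucket=GBIF_BUCKET, Prefix=prefix, MaxKeys=1)
        return {"gbif_data_exists": resp.get("KeyCount", 0) > 0}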

Edit the execution role for lambda function
--------------------------------------------
* Under Configuration/Permissions, find the Execution role name
(bison_find_current_gbif_lambda-role-fb05ks88) automatically created for this function
* Open the role in a new window and, under Permissions policies, use Add permissions
to attach the following (see the sketch after this list):

* bison_s3_policy
* redshift_glue_policy
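
The same attachments can be made programmatically; a sketch, assuming both are
customer-managed policies in this account::

    import boto3

    iam = boto3.client("iam")
    ROLE = "bison_find_current_gbif_lambda-role-fb05ks88"
    for policy in ("bison_s3_policy", "redshift_glue_policy"):
        iam.attach_role_policy(
            RoleName=ROLE,
            PolicyArn=f"arn:aws:iam::321942852011:policy/{policy}",
        )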

Create trigger to initiate lambda function
------------------------------------------------

* Check for existence of new GBIF data
* Use a blueprint, Python, "Get S3 Object"
* Function name: bison_find_current_gbif_lambda
* S3 trigger:

* Bucket: arn:aws:s3:::gbif-open-data-us-east-1

* Create a rule in EventBridge to use as the trigger (a programmatic sketch follows
this list)

* Event source: AWS events or EventBridge partner events
* Sample event, "S3 Object Created", aws/events/test_trigger_event.json
* Creation method: Use pattern form
* Event pattern

* Event Source: AWS services
* AWS service: S3
* Event type: Object-Level API Call via CloudTrail
* Event Type Specifications

* Specific operation(s): GetObject
* Specific bucket(s) by name: arn:aws:s3:::bison-321942852011-us-east-1

* Select target(s)

* AWS service
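
A sketch of the rule and target creation with boto3 (the rule name is an assumption;
note that the event pattern matches on the bucket name, not its ARN)::

    import json
    import boto3

    events = boto3.client("events")
    pattern = {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["GetObject"],
            "requestParameters": {"bucketName": ["bison-321942852011-us-east-1"]},
        },
    }
    events.put_rule(Name="bison_gbif_trigger", EventPattern=json.dumps(pattern))
    events.put_targets(
        Rule="bison_gbif_trigger",
        # Target the lambda function created above (ARN elided).
        Targets=[{"Id": "1", "Arn": "arn:aws:lambda:..."}],
    )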


Lambda to query Redshift
--------------------------------------------

https://repost.aws/knowledge-center/redshift-lambda-function-queries

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift-data/client/execute_statement.html

* Connect to a serverless workgroup (bison), namespace (bison), database name (dev)

* When connecting to a serverless workgroup, specify the workgroup name and database
name. The database user name is derived from the IAM identity. For example,
arn:iam::123456789012:user:foo has the database user name IAM:foo. Also, permission
to call the redshift-serverless:GetCredentials operation is required.
* The caller needs the redshift:GetClusterCredentialsWithIAM permission for temporary
authentication with a role
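
A sketch of a query against the serverless workgroup with the Redshift Data API
(the SQL and table name are illustrative)::

    import time
    import boto3

    rsd = boto3.client("redshift-data")
    resp = rsd.execute_statement(
        WorkgroupName="bison", Database="dev", Sql="SELECT count(*) FROM public.riis;"
    )
    # Execution is asynchronous; poll until the statement finishes.
    while True:
        status = rsd.describe_statement(Id=resp["Id"])
        if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)
    if status["Status"] == "FINISHED":
        rows = rsd.get_statement_result(Id=resp["Id"])["Records"]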

Lambda to start EC2 for task
--------------------------------------------

Lambda functions must be single-function tasks that run in less than 15 minutes.
For complex or long-running tasks we start an EC2 instance containing bison code
and execute it in a docker container.

For each task, the lambda function should create a Spot EC2 instance with a template
containing userdata that will either 1) pull the GitHub repo, then build the docker
image, or 2) pull a docker image directly.
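
A sketch of the launch call from such a lambda function (the template is defined
below; the version selects the task)::

    import boto3

    ec2 = boto3.client("ec2")

    def start_task(version: str) -> str:
        # The template supplies the Spot request, userdata, and instance profile.
        resp = ec2.run_instances(
            LaunchTemplate={"LaunchTemplateName": "bison_spot_task", "Version": version},
            MinCount=1,
            MaxCount=1,
        )
        return resp["Instances"][0]["InstanceId"]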

Annotating the RIIS records with GBIF accepted taxa takes about 1 hour and uses
multiple bison modules.

EC2/Docker setup
....................

* Create the first EC2 Launch Template as a "one-time" Spot instance, no hibernation

* The Launch template should have the following settings::

    Name: bison_spot_task
    Application and OS Images: Ubuntu
    AMI: Ubuntu 24.04 LTS
    Architecture: 64-bit ARM
    Instance type: t4g.micro
    Key pair: bison-task-key
    Network settings/Select existing security group: launch-wizard-1
    Configure storage: 8 GB gp3 (default)
        Details - encrypted
    Advanced Details:
        IAM instance profile: bison_ec2_s3_role
        Shutdown behavior: Terminate
        Cloudwatch monitoring: Enable
        Purchasing option: Spot instances
        Request type: One-time

* Use the launch template to create a version for each task.
* The launch template task versions must have the task name in the description, and
have the following script in the userdata::

    #!/bin/bash
    sudo apt-get -y update
    sudo apt-get -y install docker.io
    sudo apt-get -y install docker-compose-v2
    git clone https://github.com/lifemapper/bison.git
    cd bison
    sudo docker compose -f compose.test_task.yml up
    sudo shutdown -h now


* For each task, **compose.test_task.yml** must be replaced with the appropriate compose file.
* On EC2 instance startup, the userdata script will execute.
* The compose file sets an environment variable (TASK_APP) containing a python module
to be executed from the Dockerfile.
* Tasks should deposit outputs and logfiles into S3.
* After completion, the docker container will stop automatically and the EC2 instance
will stop because of the shutdown command in the final line of the userdata script.
* **TODO**: once the workflow is stable, eliminate Docker build time by creating a Docker
image and downloading it in the userdata script.
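
A sketch of creating a task version programmatically, assuming the userdata script
above is saved in a local file::

    import base64
    import boto3

    ec2 = boto3.client("ec2")
    with open("userdata_test_task.sh", "rb") as f:  # hypothetical local copy
        userdata = base64.b64encode(f.read()).decode()
    ec2.create_launch_template_version(
        LaunchTemplateName="bison_spot_task",
        SourceVersion="1",
        VersionDescription="test_task",  # task name in the description
        LaunchTemplateData={"UserData": userdata},
    )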

Lambda setup
....................

Triggering execution
-------------------------
The first step may be executed on a schedule, such as the second day of the month (since
GBIF data is deposited on the first day of the month).

Upon successful completion, the deposition of output into S3 can trigger the
following steps.
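
A sketch of such a schedule as an EventBridge rule (the rule name is an assumption;
the target is the first step's lambda function, ARN elided)::

    import boto3

    events = boto3.client("events")
    # 00:00 UTC on the second day of every month.
    events.put_rule(
        Name="bison_monthly_start",
        ScheduleExpression="cron(0 0 2 * ? *)",
    )
    events.put_targets(
        Rule="bison_monthly_start",
        Targets=[{"Id": "1", "Arn": "arn:aws:lambda:..."}],
    )
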
50 changes: 50 additions & 0 deletions _sources/pages/aws/aws_setup.rst.txt
@@ -0,0 +1,50 @@
AWS Resource Setup
********************

Create policies and roles
===========================================================

The :ref:`bison_redshift_lambda_role` allows access to the bison Redshift
namespace/workgroup, lambda functions, EventBridge Scheduler, and S3 data.
The Trusted Relationships on this role allow each of these services to assume it.

The :ref:`bison_ec2_s3_role` allows an EC2 instance to access the public S3 data and
the bison S3 bucket. Its trust relationship grants AssumeRole to ec2 and s3 services.
This role will be assigned to an EC2 instance that will initiate
computations and compute matrices.

The :ref:`bison_redshift_s3_role` allows Redshift to access public S3 data and
the bison S3 bucket, and allows Redshift to perform glue functions. Its trust
relationship grants AssumeRole to redshift service.

Make sure that the same role granted to the namespace is used for creating an external
schema and lambda functions. When mounting external data as a redshift table to the
external schema, you may encounter an error indicating that the "dev" database does not
exist. This refers to the external database, and may indicate that the role used by the
command and/or namespace differs from the role granted to the schema upon creation.

Redshift Namespace and Workgroup
===========================================================

Namespace and Workgroup
------------------------------

A namespace is storage-related, containing database objects and users. A workgroup is
a collection of compute resources, such as security groups, and other properties and
limitations.
https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-workgroup-namespace.html
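
A sketch of creating the namespace and workgroup with boto3 (the capacity value is an
assumption)::

    import boto3

    rs = boto3.client("redshift-serverless")
    rs.create_namespace(
        namespaceName="bison",
        dbName="dev",
        iamRoles=["arn:aws:iam::321942852011:role/bison_redshift_s3_role"],
        defaultIamRoleArn="arn:aws:iam::321942852011:role/bison_redshift_s3_role",
    )
    rs.create_workgroup(
        workgroupName="bison",
        namespaceName="bison",
        baseCapacity=8,  # Redshift Processing Units
    )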

External Schema
------------------------
The command below creates an external schema, redshift_spectrum, and also creates a
**new** external database "dev". It appears in the console to be the same "dev"
database that contains the public schema, but it is separate. Also note that the IAM role
used to create the schema must match the role attached to the namespace::

    CREATE EXTERNAL SCHEMA redshift_spectrum
    FROM DATA CATALOG
    DATABASE 'dev'
    IAM_ROLE 'arn:aws:iam::321942852011:role/bison_redshift_s3_role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
