Commit

deploy: c5a5abb

zzeppozz committed Oct 28, 2024
0 parents commit e7264d4
Showing 112 changed files with 12,626 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
config: ee2edea7cd0e405e075f4650e6ba2801
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/environment.pickle
Binary file added .doctrees/index.doctree
Binary file added .doctrees/pages/about.doctree
Binary file added .doctrees/pages/aws/automation.doctree
Binary file added .doctrees/pages/aws/aws_setup.doctree
Binary file added .doctrees/pages/aws/ec2_setup.doctree
Binary file added .doctrees/pages/aws/roles.doctree
Binary file added .doctrees/pages/history/aws_experiments.doctree
Binary file added .doctrees/pages/history/year3.doctree
Binary file added .doctrees/pages/history/year4_planA.doctree
Binary file added .doctrees/pages/history/year4_planB.doctree
Binary file added .doctrees/pages/history/year5.doctree
Binary file added .doctrees/pages/interaction/aws_prep.doctree
Binary file added .doctrees/pages/interaction/debug.doctree
Binary file added .doctrees/pages/interaction/deploy.doctree
Binary file added .doctrees/pages/workflow.doctree
Empty file added .nojekyll
Binary file added _images/lm_logo.png
45 changes: 45 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,45 @@
Welcome to LmBISON - RIIS Analysis
======================================

The BISON repository contains data and scripts to annotate GBIF occurrence records
with information regarding the geographic location and USGS RIIS status of each record.


Current
------------

.. toctree::
:maxdepth: 2

pages/about
pages/workflow

Setup AWS
------------

.. toctree::
:maxdepth: 2

pages/aws/aws_setup

Using BISON
------------

.. toctree::
:maxdepth: 2

pages/interaction/about

History
------------

.. toctree::
:maxdepth: 2

pages/history/year4_planB
pages/history/year4_planA
pages/history/year3
pages/history/year5
pages/history/aws_experiments

* :ref:`genindex`
12 changes: 12 additions & 0 deletions _sources/pages/about.rst.txt
@@ -0,0 +1,12 @@
About
========

The `Lifemapper BISON repository <https://github.com/lifemapper/bison>`_ is an open
source project supported by USGS award G19AC00211.

The aim of this repository is to provide a workflow for annotating and analyzing a
large set of United States specimen occurrence records for the USGS BISON project.

.. image:: ../.static/lm_logo.png
:width: 150
:alt: Lifemapper
140 changes: 140 additions & 0 deletions _sources/pages/aws/automation.rst.txt
@@ -0,0 +1,140 @@
Create lambda function to initiate processing
------------------------------------------------
* Create a lambda function, aws/events/bison_find_current_gbif_lambda.py, to execute
when the trigger condition is activated

* This trigger condition is a file deposited in the BISON bucket

* TODO: change to the first of the month

* The lambda function will delete the new file and test for the existence of
GBIF data for the current month (a sketch follows this list)

* TODO: change to mount GBIF data in Redshift, subset, unmount
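
A minimal sketch of such a handler with boto3 (bucket names are taken from this page;
the GBIF key prefix is an assumption)::

    import boto3
    from datetime import date

    S3 = boto3.client("s3")
    GBIF_BUCKET = "gbif-open-data-us-east-1"
    BISON_BUCKET = "bison-321942852011-us-east-1"

    def lambda_handler(event, context):
        # Delete the trigger file that was deposited in the bison bucket.
        key = event["Records"][0]["s3"]["object"]["key"]
        S3.delete_object(Bucket=BISON_BUCKET, Key=key)
        # Test for the existence of GBIF data for the current month.
        prefix = f"occurrence/{date.today():%Y-%m}-01/"
        resp = S3.list_objects_v2(Bucket=GBIF_BUCKET, Prefix=prefix, MaxKeys=1)
        return {"gbif_data_exists": resp.get("KeyCount", 0) > 0}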

Edit the execution role for lambda function
--------------------------------------------
* Under Configuration/Permissions, find the Execution role name
(bison_find_current_gbif_lambda-role-fb05ks88) automatically created for this function
* Open the role in a new window and, under Permissions policies, use Add permissions
to attach the following (see the sketch after this list):

* bison_s3_policy
* redshift_glue_policy
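
The same attachments can be made programmatically; a sketch, assuming both are
customer-managed policies in this account::

    import boto3

    iam = boto3.client("iam")
    ROLE = "bison_find_current_gbif_lambda-role-fb05ks88"
    for policy in ("bison_s3_policy", "redshift_glue_policy"):
        iam.attach_role_policy(
            RoleName=ROLE,
            PolicyArn=f"arn:aws:iam::321942852011:policy/{policy}",
        )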

Create trigger to initiate lambda function
------------------------------------------------

* Check for existence of new GBIF data
* Use a blueprint, Python, "Get S3 Object"
* Function name: bison_find_current_gbif_lambda
* S3 trigger:

* Bucket: arn:aws:s3:::gbif-open-data-us-east-1

* Create a rule in EventBridge to use as the trigger (a programmatic sketch follows
this list)

* Event source: AWS events or EventBridge partner events
* Sample event, "S3 Object Created", aws/events/test_trigger_event.json
* Creation method: Use pattern form
* Event pattern

* Event Source: AWS services
* AWS service: S3
* Event type: Object-Level API Call via CloudTrail
* Event Type Specifications

* Specific operation(s): GetObject
* Specific bucket(s) by name: arn:aws:s3:::bison-321942852011-us-east-1

* Select target(s)

* AWS service
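
A sketch of the rule and target creation with boto3 (the rule name is an assumption;
note that the event pattern matches on the bucket name, not its ARN)::

    import json
    import boto3

    events = boto3.client("events")
    pattern = {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["GetObject"],
            "requestParameters": {"bucketName": ["bison-321942852011-us-east-1"]},
        },
    }
    events.put_rule(Name="bison_gbif_trigger", EventPattern=json.dumps(pattern))
    events.put_targets(
        Rule="bison_gbif_trigger",
        # Target the lambda function created above (ARN elided).
        Targets=[{"Id": "1", "Arn": "arn:aws:lambda:..."}],
    )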


Lambda to query Redshift
--------------------------------------------

https://repost.aws/knowledge-center/redshift-lambda-function-queries

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift-data/client/execute_statement.html

* Connect to a serverless workgroup (bison), namespace (bison), database name (dev)

* When connecting to a serverless workgroup, specify the workgroup name and database
name. The database user name is derived from the IAM identity. For example,
arn:iam::123456789012:user:foo has the database user name IAM:foo. Also, permission
to call the redshift-serverless:GetCredentials operation is required.
* The caller needs the redshift:GetClusterCredentialsWithIAM permission for temporary
authentication with a role
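
A sketch of a query against the serverless workgroup with the Redshift Data API
(the SQL and table name are illustrative)::

    import time
    import boto3

    rsd = boto3.client("redshift-data")
    resp = rsd.execute_statement(
        WorkgroupName="bison", Database="dev", Sql="SELECT count(*) FROM public.riis;"
    )
    # Execution is asynchronous; poll until the statement finishes.
    while True:
        status = rsd.describe_statement(Id=resp["Id"])
        if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)
    if status["Status"] == "FINISHED":
        rows = rsd.get_statement_result(Id=resp["Id"])["Records"]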

Lambda to start EC2 for task
--------------------------------------------

Lambda functions must be single-function tasks that run in less than 15 minutes.
For complex or long-running tasks we start an EC2 instance containing bison code
and execute it in a docker container.

For each task, the lambda function should create a Spot EC2 instance with a template
containing userdata that will either 1) pull the GitHub repo, then build the docker
image, or 2) pull a docker image directly.
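
A sketch of the launch call from such a lambda function (the template is defined
below; the version selects the task)::

    import boto3

    ec2 = boto3.client("ec2")

    def start_task(version: str) -> str:
        # The template supplies the Spot request, userdata, and instance profile.
        resp = ec2.run_instances(
            LaunchTemplate={"LaunchTemplateName": "bison_spot_task", "Version": version},
            MinCount=1,
            MaxCount=1,
        )
        return resp["Instances"][0]["InstanceId"]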

Annotating the RIIS records with GBIF accepted taxa takes about 1 hour and uses
multiple bison modules.

EC2/Docker setup
....................

* Create the first EC2 Launch Template as a "one-time" Spot instance, no hibernation

* The Launch template should have the following settings::

    Name: bison_spot_task
    Application and OS Images: Ubuntu
    AMI: Ubuntu 24.04 LTS
    Architecture: 64-bit ARM
    Instance type: t4g.micro
    Key pair: bison-task-key
    Network settings/Select existing security group: launch-wizard-1
    Configure storage: 8 GB gp3 (default)
        Details - encrypted
    Advanced Details:
        IAM instance profile: bison_ec2_s3_role
        Shutdown behavior: Terminate
        Cloudwatch monitoring: Enable
        Purchasing option: Spot instances
        Request type: One-time

* Use the launch template to create a version for each task.
* The launch template task versions must have the task name in the description, and
have the following script in the userdata::

    #!/bin/bash
    sudo apt-get -y update
    sudo apt-get -y install docker.io
    sudo apt-get -y install docker-compose-v2
    git clone https://github.com/lifemapper/bison.git
    cd bison
    sudo docker compose -f compose.test_task.yml up
    sudo shutdown -h now


* For each task, **compose.test_task.yml** must be replaced with the appropriate compose file.
* On EC2 instance startup, the userdata script will execute.
* The compose file sets an environment variable (TASK_APP) containing a python module
to be executed from the Dockerfile.
* Tasks should deposit outputs and logfiles into S3.
* After completion, the docker container will stop automatically and the EC2 instance
will stop because of the shutdown command in the final line of the userdata script.
* **TODO**: once the workflow is stable, eliminate Docker build time by creating a Docker
image and downloading it in the userdata script.
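
A sketch of creating a task version programmatically, assuming the userdata script
above is saved in a local file::

    import base64
    import boto3

    ec2 = boto3.client("ec2")
    with open("userdata_test_task.sh", "rb") as f:  # hypothetical local copy
        userdata = base64.b64encode(f.read()).decode()
    ec2.create_launch_template_version(
        LaunchTemplateName="bison_spot_task",
        SourceVersion="1",
        VersionDescription="test_task",  # task name in the description
        LaunchTemplateData={"UserData": userdata},
    )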

Lambda setup
....................

Triggering execution
-------------------------
The first step may be executed on a schedule, such as the second day of the month (since
GBIF data is deposited on the first day of the month).

Upon successful completion, the deposition of output into S3 can trigger the
following steps.
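
A sketch of such a schedule as an EventBridge rule (the rule name is an assumption;
the target is the first step's lambda function, ARN elided)::

    import boto3

    events = boto3.client("events")
    # 00:00 UTC on the second day of every month.
    events.put_rule(
        Name="bison_monthly_start",
        ScheduleExpression="cron(0 0 2 * ? *)",
    )
    events.put_targets(
        Rule="bison_monthly_start",
        Targets=[{"Id": "1", "Arn": "arn:aws:lambda:..."}],
    )
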
50 changes: 50 additions & 0 deletions _sources/pages/aws/aws_setup.rst.txt
@@ -0,0 +1,50 @@
AWS Resource Setup
********************

Create policies and roles
===========================================================

The :ref:`bison_redshift_lambda_role` allows access to the bison Redshift
namespace/workgroup, lambda functions, EventBridge Scheduler, and S3 data.
The Trusted Relationships on this role allow each of these services to assume it.

The :ref:`bison_ec2_s3_role` allows an EC2 instance to access the public S3 data and
the bison S3 bucket. Its trust relationship grants AssumeRole to ec2 and s3 services.
This role will be assigned to an EC2 instance that will initiate
computations and compute matrices.

The :ref:`bison_redshift_s3_role` allows Redshift to access public S3 data and
the bison S3 bucket, and allows Redshift to perform glue functions. Its trust
relationship grants AssumeRole to redshift service.

Make sure that the same role granted to the namespace is used for creating an external
schema and lambda functions. When mounting external data as a redshift table to the
external schema, you may encounter an error indicating that the "dev" database does not
exist. This refers to the external database, and may indicate that the role used by the
command and/or namespace differs from the role granted to the schema upon creation.

Redshift Namespace and Workgroup
===========================================================

Namespace and Workgroup
------------------------------

A namespace is storage-related, containing database objects and users. A workgroup is
a collection of compute resources, such as security groups, and other properties and
limitations.
https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-workgroup-namespace.html
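
A sketch of creating the namespace and workgroup with boto3 (the capacity value is an
assumption)::

    import boto3

    rs = boto3.client("redshift-serverless")
    rs.create_namespace(
        namespaceName="bison",
        dbName="dev",
        iamRoles=["arn:aws:iam::321942852011:role/bison_redshift_s3_role"],
        defaultIamRoleArn="arn:aws:iam::321942852011:role/bison_redshift_s3_role",
    )
    rs.create_workgroup(
        workgroupName="bison",
        namespaceName="bison",
        baseCapacity=8,  # Redshift Processing Units
    )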

External Schema
------------------------
The command below creates an external schema, redshift_spectrum, and also creates a
**new** external database "dev". It appears in the console to be the same "dev"
database that contains the public schema, but it is separate. Also note that the IAM role
used to create the schema must match the role attached to the namespace::

    CREATE EXTERNAL SCHEMA redshift_spectrum
    FROM DATA CATALOG
    DATABASE 'dev'
    IAM_ROLE 'arn:aws:iam::321942852011:role/bison_redshift_s3_role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
