This repository contains miscellaneous scripts for use by FMNH staff, interns, and volunteers. These scripts are generated within Google Colaboratory ("Colab"). The benefits of Colab include: in-browser Python code execution, instructions embedded within scripts, and usability by non-technologists. To execute a Colab script, you must have and be logged into a Google account.
To run any script: click on the name (above) and then click the blue "Open in Colab" button that appears near the top of the page. This will navigate you to the active script within Google Colab.
- About: Download images from public-facing web databases for digitized herbarium collections, based on your search query.
- Preconditions: A stable internet connection, and local storage space for the files.
- Inputs: A data download from the Pteridoportal or other related portal site (e.g. Bryophyte Portal).
- Outputs: A ZIP file containing all of the image files.
- About: Process a Zooniverse classification results file by flattening JSON strings into columns, making it suitable for spreadsheet analysis.
- Preconditions: A Zooniverse account, an already-existing Zooniverse project which has gathered results (i.e. public participants). Classification results file has been downloaded from Zooniverse. Any rows that you don't wish to be processed (e.g. testing phase results) are deleted.
- Inputs: A CSV file, which has been uploaded to your Google Drive (in the root "My Drive" folder).
- Outputs: A ZIP file containing one CSV file for each Workflow.
- Known Limitations: Currently, only two types of Tasks are extracted:
- Question
- Drawing (rectangle tool)
- About: Given a folder of images in Google Drive, this script creates a duplicate folder with train/test splits.
- Preconditions: A folder of images in Google Drive -- either a single folder with images and metadata intermingled, or multiple folders with only metadata in the root. (For the latter, only can do stratified sampling). Authenticate your Google Drive in Colab (tutorial).
- Inputs: A Google Drive folder containing one dataset.
- Outputs: A Google Drive folder (with editor access).
- Known Limitations:
- Sampling is not random -- since our specimen image files are often named
specimen_id-image_count.tif
and are retrieved alphabetically, we sample images evenly across the entire folder. - In a flat folder
- Sampling is not random -- since our specimen image files are often named
Thanks and credit to the Grainger Bioinformatics Center and FMNH botany collections.
Please consult the included license for information about use and redistribution.