data_cleaning_demo

A demo for occurrence data cleaning

Summary

The purpose of this demo is to show how someone could "clean" an occurrence dataset so that it can be used in some analysis like species distribution modeling. This demo is packaged as a Docker container so that it can easily be used in a number of different environments without the extra tasks associated with installation.

Data Cleaning

For this demo, we will clean a set of occurrence records for Heuchera species. The original data includes 11,490 records and is a CSV file that includes species name, decimal latitude, and decimal longitude. To clean this dataset, we will first remove any records that have less than four decimal places of precision. For this scenario, we will assume that these records will be used for creating a species distribution model and thus, records with identical latitude and longitude values for a species are redundant and will be removed. Finally, we only want to create models for species with at least 12 points, so any species that have less than 12 points will be removed.

Setup Steps

Before we can run the demo, we need to build the container. This documentation assumes that Docker has been installed for your environment (see: https://docs.docker.com/get-docker/).

1a. Clone the repository

git clone https://github.com/biotaphy/data_cleaning_demo.git

1b. Download the repository from GitHub (without using git)

If you don't have git installed, you can download the repository as a zip file. Navigate to https://github.com/biotaphy/data_cleaning_demo/releases/latest and download the source code under the Assets section. After download, uncompress the file.

Change to the repository directory

cd data_cleaning_demo

Build the docker image

docker build docker/ -t dc_demo

Run the data cleaning script

Run a bash shell in the container interactively

Mac / Linux: docker run -v "$(pwd)"/data:/demo -it dc_demo bash or ./docker_run_unix.sh

Windows: docker run -v %cd%/data:/demo -it dc_demo bash or docker_run_sindows.bat

Run the data cleaning example from the container

# clean_occurrences -r /demo/cleaning_report.json /demo/heuchera.csv /demo/clean_data.csv /demo/wrangler_conf.json

From your local machine, inspect the original and cleaned occurrence data files (data/heuchera.csv and data/clean_data.csv). The fields have not changed between the two files, but the cleaning step removed all records with less than 4 decimal places of precision, any duplicate points (where species name, latitude, and longitude are the same), and any species with less than 12 points. The result is a cleaned dataset with 6678 occurrence records down from the original 11,490. You can see a more detailed report of how many points were removed by each filter in the file data/cleaning_report.json. It should contain something like the following.

{
    "input_records": 11489,
    "output_records": 6677,
    "wranglers": {
        "decimal_precision_filter": {
            "removed": 2978
        },
        "unique_localities_filter": {
            "removed": 1800
        },
        "minimum_points_filter": {
            "removed": 34
        }
    }
}

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data		data
docker		docker
LICENSE		LICENSE
README.md		README.md
docker_run_unix.sh		docker_run_unix.sh
docker_run_windows.bat		docker_run_windows.bat

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

data_cleaning_demo

Summary

Data Cleaning

Setup Steps

Run the data cleaning script

About

Releases

Packages

Languages

License

biotaphy/data_cleaning_demo

Folders and files

Latest commit

History

Repository files navigation

data_cleaning_demo

Summary

Data Cleaning

Setup Steps

Run the data cleaning script

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages