Skip to content

openbridge/ob_pysh-db

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

61 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pysh-db

pysh-db The Data Science Toolkit (DSK)

pysh-db is intended to jump start someone's efforts in using Python (or Bash). It is a pre-configured Docker image containing the leading Python data science software packages installed and ready to run.

Why pysh-db?

This can be valuable for individuals or teams who do not have the time or interest setup up their own development environments. Also, they may have limitations around what they can or cannot install on their corporate laptop. It can be tricky to get all the software packages installed and compiled. Leveraging a Docker image that does that for them can help increase velocity by allowing them time to focus on working with data, not system administration activities.

Contents

What is Included in pysh-db?

There are a few Python and OS related items that are pre-installed for convenience

Python

Sometimes you need helper "packages" and "libraries" to perform various operations. In Python, the typical (simplest) approach is to a Python package manager called pip. The image comes with the following packages pre-installed:

  • setuptools
  • cffi
  • psycopg2
  • cryptography
  • numpy
  • matplotlib
  • pandas
  • dateutil
  • sqlalchemy
  • mysql-connector
  • pytz
  • six
  • wsgiref
  • awscli
  • boto
  • scipy
  • seaborn
  • statsmodels

Why include these? The two primary reasons for including these packages are;

  1. they are commonly used in data science circles so there is an expectation they would be part of a default.
  2. they are referenced by AWS for use in creating User Defined Functions (UDF). However, the versions AWS listed are MUCH older versions than what is installed in this container.

To install the Postgres library for Python we run the pip install psycopg2 .... -U command as part of the image build. The -U tells pip to upgrade it if it is already installed.

Alpine (OS)

The included image uses Alpine Linux as the base container OS. This images leverages alpine:edge which is the latest release available. In addition to the Python packages, the Alpine postgres-client and mariadb-client libraries are installed. Without postgres-client installed the psycopg2 package would not work. It also provides the ability to use the Postgres or MySQL command line tools psql and mysql. More examples on this later.

Here are the install OS packages:

  • postgresql-client
  • mariadb-client # MySQL
  • python
  • bash
  • curl
  • less
  • groff
  • jq
  • py-numpy
  • freetype
  • libpng

Requirements

What is Docker?

This container is used for virtualizing your Python and Postgres development or work environment using Docker. If you don't know what Docker is read "What is Docker?".

Install Docker

Once you have a sense of what Docker is, you can then install the software. It is free: "Get Docker". Select the Docker package that aligns with your environment (ie. OS X, Linux or Windows). If you have not used Docker before, take a look at the guides:

If you already have a Linux instance running as a host or VM, you can install Docker command line. For example, on CentOS you would run yum install docker -y and then start the Docker service.

At this point, you should have Docker installed. Now you can get your Python and Postgres services up and running

Getting Setup

Once you have Docker running, you can build a Python image and run the container..

Step 1: Building an image

The simplest way to get pysh-db is to pull from Docker Hub:

docker pull openbridge/ob_pysh-db

If you want to build the image from scratch using the Dockefile, the first step is to grab the Dockerfile locally. Then you can execute the build command:

docker build -t openbridge/ob_pysh-db .

Due to the Python packages, the build can take awhile. Pandas, Numpy and Scipy take considerable time to build. When you run the docker build command, feel free to grab a cup of coffee.

Step 2: Running The Container

Once you have completed building your image, you can run through some basic checks to make sure it is working as expected. For example, you can test if you can start the bash cli:

docker run -it openbridge/ob_pysh-db bash

Will result in putting you inside the container at the command prompt like this:

foo@442343470ad0:/#

You certainly do not need to enter into the container like that. You can easily run Python "outside" the container. For example, you can run a simple test to have the container echo the current Python version

docker run -it openbridge/ob_pysh-db python -V

Will result in:

Python 2.7.13

Not only does this image have Python, it also has the Postgres libraries. For example, you can run PSQL directly at the prompt:

docker run -it openbridge/ob_pysh-db psql -V

Which will return

psql (PostgreSQL) 9.5.4

You can do the same with the mysql client:

docker run -it openbridge/ob_pysh-db mysql -V

Which will return

mysql Ver 15.1 Distrib 10.1.20-MariaDB, for Linux (x86_64) using readline 5.1

Congratulations, you have a working container.

Step 3: Getting Your Local Database setup

If you havea remote database you will be using, then this section will not be relevant. However, if you want to test or build locally and need a working database, then this will help you get started.

Postgres and Amazon Aurora (Postgres)

The easiest path is simply to grab the official Postgres image is from Docker hub. Take a look at the options: https://hub.docker.com/_/postgres/. To get the image, you can run docker pull postgres. This will pull the image locally and allow you to run a local Postgres DB.

You can also create your own custom image if needed. There are some good articles/how-to-guides to follow here and here

MySQL, MariaDB and Amazon Aurora (MySQL)

Due to the various flavors of MySQL, you have a few choices; MySQL or MariaDB. The best course of action is to align with whatever exists elsewhere in your organization to ensure the broadest capability and reduce any frustrations relating to versioning issues.

Both links provide an overview on how to get your self up and running. To grab the latest image of either database, simply pull them locally:

  • docker pull mariadb
  • docker pull mysql

If you are interested in using MariaDB, this is a detailed guid on setting up a MariaDB environment: Installing & Using MariaDB

Curious what the differences between the MySQL and MariaDB are? Read about it here and here.

Amazon Aurora

There is no direct Amazon Aurora Docker container. However, Amazon states that:

Amazon Aurora database engine is fully compatible with MySQL 5.6 using the InnoDB storage engine. This means the code, applications, drivers, and tools you already use with your MySQL databases can be used with Amazon Aurora with little or no change. This also allows for easy migration of existing MySQL databases using standard MySQL import and export tools or using MySQL binlog replication.

If you are using Amazon Aurora, make sure you are pulling the 5.6 Mysql image locally for development purposes. This will reduce any issues when you are ready to deploy your work and need to interact with Aurora.

To pull a version, specify it in your pull command: docker pull mysql:5.6

Connecting to Postgres, Amazon Auroa or Redshift

The openbridge/ob_pysh-db container comes with Python psycopg2 and postgres-client. For connecting to a Postgres or Redshift with Python, the Psycopg2 package is needed. If you want some background on this package, go here.

Connecting With psql

psql is an interactive terminal for working with Postgres. If you want some more background on psql you can check [this guide] (http://postgresguide.com/utilities/psql.html) or the offical docs.

Below are some example connections for Redshift using psql. It includes the remote host (-h), port (-p), username (-U) and database name (-d). You will be prompted for a password assuming your connection parameters are correct.

docker run -it openbridge/ob_pysh-db psql -h *****.us-east-1.redshift.amazonaws.com -p 5439 -U username -d mydatabase

Replace these values with those for your Redshift cluster.

Example: Listing Schema Tables And Objects

You can list all tables in your current schema using just \d.

docker run -it openbridge/ob_pysh-db psql -h *****.us-east-1.redshift.amazonaws.com -p 5439 -U username -d mydatabase \d

Specify an additional name and psql will tell you details about that named object.

docker run -it openbridge/ob_pysh-db psql -h *****.us-east-1.redshift.amazonaws.com -p 5439 -U username -d mydatabase \d mytablename

Running Queries Via psql

Standard Input Via -c

docker run -it openbridge/ob_pysh-db psql -h *****.us-east-1.redshift.amazonaws.com -p 5439 -U username -d mydatabase -c 'SELECT * FROM mytable'

File Input Via -f

Instead of passing the query via -c you can also leverage an external file that contains the query you want to execute:

docker run -it openbridge/ob_pysh-db psql -h *****.us-east-1.redshift.amazonaws.com -p 5439 -U username -d mydatabase -q -f dump.sql

Uisng -f reads commands from a file rather than standard input.

Connecting to MySQL, MariaDB or Amazon Aurora

Connecting With mysql

mysql is a simple SQL shell with input line editing capabilities. It supports interactive and noninteractive use.

Using mysql is similar to using to psql. To see all the available options for mysql, simply run:

docker run -it openbridge/ob_pysh-db mysql --help

Here is a sample connection statement:

docker run -it openbridge/ob_pysh-db mysql -h 127.0.0.1 -u mysql -D mydatabase -P 3306

This will show you all the tables in the specified database

docker run -it openbridge/ob_pysh-db mysql -h 127.0.0.1 -u mysql -D mydatabase -P 3306 -e "show tables;"

Running Queries Via mysql

Use can run mysql command to execute local sql file:

docker run -it openbridge/ob_pysh-db mysql -h 127.0.0.1 -u mysql < "./sql/myqueries.sql"

Here is an example command passing the set of commands via -e:

mysql -h 127.0.0.1 -u mysql -e "use mydatabase; INSERT IGNORE INTO users (id, userid, passwd, subscriptionid, uid, gid, homedir, status, shell, count, accessed, modified) VALUES ('', 'bob', '1231232131lkdjadjalj', 'trial', '999', '999', '/home', 'active', '/sbin/nologin', 0, 'NOW()', 'NOW()');"

or

docker run -it openbridge/ob_pysh-db mysql -h 127.0.0.1 -u mysql -e "./sql/myqueries.sql"

Exporting Data From Your Database

There may be times that you want to export data from a database. There are a few methods to accomplish this and we will only cover a few possibilities.

Export CSV from Redshift via UNLOAD and psql

We will assume you have a table called customers and you want to export it

    "id" integer NOT NULL,
    "name" varchar(50),
    "email" varchar(50),
    "city" varchar(50) DEFAULT NULL,
    "state" char(2) DEFAULT NULL,
    "created_at" date DEFAULT NULL,

Please note that exports from Redshift require a output location on Amazon S3 and that you have the proper permissions to that location.

You will want to connect to your database with your AWS credentials:

 docker run -it -e AWS_ACCESS_KEY_ID=XXXX -e AWS_SECRET_ACCESS_KEY=XXX openbridge/ob_pysh-db psql -h *****.us-east-1.redshift.amazonaws.com -p 5439 -U username -d mydatabase

You will notice the that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are part of the run command. These reflect the key and secret required for Amazon to store the output from Redshift to a S3 bucket.

At the psql command prompt you will want to provide your UNLOAD command:

UNLOAD ('
  SELECT id, name, email, city, state, created_at  FROM (
    SELECT 1 as ordinal, \'id\' as id, \'name\' as name, \'email\' as email, \'city\' as city, \'state\' as state, \'created_at\' as created_at
    UNION all
    ( SELECT 2 as ordinal, CAST(id as varchar(255)) as id, name, email, city, state, CAST(created_at as varchar(255)) as created_at
    FROM customers )
  ) t ORDER BY ordinal
  TO 's3://mybucket/crm/customer_'
  CREDENTIALS 'aws_access_key_id=X; aws_secret_access_key=X'
  MANIFEST
  ESCAPE
  ADDQUOTES
  DELIMITER ','
  GZIP
  ALLOWOVERWRITE
  ESCAPE
  NULL AS '\\N';

The will create an export of files that are stored to S3. Based on the size of the export, it was broken into 3 parts as shown in the manifest file:

{
  "entries": [
    {"url":"s3://mybucket/crm/customer_0000_part_00"},
    {"url":"s3://mybucket/crm/customer_0001_part_00"},
    {"url":"s3://mybucket/crm/customer_0002_part_00"}
  ]
}

Each of the files customer_0000_part_00, customer_0001_part_00 and customer_0002_part_00 reflect the UNLOAD command and underlying query used.

If you want to pull those remote S3 files to your local compute, you can use the included awscli tools.

aws s3 cp s3://mybucket/crm /my/local/folder --recursive

Notes:

  1. UNLOAD by default creates encrypted files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE) 2.The S3 bucket specified in the command should be in the same region as your cluster. If they are in different regions, you will most likely see an error
  2. Rather than specify key-based access control by providing the access key ID and the secret access key, you can also use AWS IAM roles: CREDENTIALS 'aws_iam_role=arn:aws:iam::<account-id>:role/<role-name>'

Export Without Using UNLOAD command

A Redshift (only) example

 docker run -it openbridge/ob_pysh-db psql psql -h *****.eu-west-1.redshift.amazonaws.com -p 5439 -U test -d test -F, -A -c 'select * from  mytable' > mytable.csv

A Postgres (only) example:

 docker run -it openbridge/ob_pysh-db psql -U username -d mydatabase -c "Copy (Select * From mytablename LIMIT 10) To STDOUT With CSV HEADER DELIMITER ',';" > /path/to/store/mytablename.csv

Connecting Python To Your Database

There are hundreds, if not thousands, of articles and guides on how to use Python with databases. Here are a few to get your started:

Example: Basic psycopg2 Usage

Here is a basic Postgres connection in Python using psycopg2: `

import psycopg2
conn = psycopg2.connect(
    host="datawarehouse.cm4z2iunjfsc.us-west-2.redshift.amazonaws.com",
    user=redshift_user,
    port=port,
    password=redshift_pass,
    dbname=dbname)
cur = conn.cursor()
cursor.execute('select * from customer WHERE purchases < 70000')
for query in cursor:
    print str(query)

Redshift Test Script Using Python psycopg2

We have included a test script called redshift_test.py. This test script performs a very simple operation; it will connect to Redshift and print the tables associated with your connection details.

Usage

This script uses psycopg2 and argparse. These are included in pysh-db. If you are running them outside of Docker, you might need to run pip install psycopg2 argparse -U. This assumes you have pip installed. Don't have it? Get it here: https://pip.pypa.io/en/stable/installing/

To connect, you need to pass the connection details as arguments to the script as follows:

python redshift_test.py -h <host> -p <port>  -u <user> -x <password> -d <database>

Here is an example of passing sample arguments to Python. Replace with your own details:

python /local/path/to/wherever/youhave/redshift_test.py -h my.redshift.hostname.com -p 5439 -u myusername -x myawesomepass -d mydatabasename

Here is the same example using Docker:

docker run -it -v ~/Downloads/redshift_test.py:/redshift_test.py openbridge/ob_pysh-db python /redshift_test.py -h my.redshift.hostname.com -p 5439 -u myusername -x myawesomepass -d mydatabasename

Notice you mounted the script from the local machine in the Downloads directory to the root of the container: -v ~/Downloads/redshift_test.py:/redshift_test.py

This places the script inside the container and then you can run it just like you would normally.

Here is a sample of running a query in Redshift to list all tables in a given schema:

SELECT DISTINCT tablename
FROM pg_table_def
WHERE schemaname = 'yourschemanamehere'
ORDER BY tablename;

Data and Work File Persistence

You may have noticed that once you stop the container, if you previously wrote some data on the DB or some Python script, that data is lost. This is because by default Docker containers are not persistent. We can resolve this problem using a data container. Read about how to persist data: https://docs.docker.com/engine/tutorials/dockervolumes/

Example: Mounting Host Volume

Here is a quick example of mounting a host drive/directory via VOLUMES. Note the use of the -v command below. This signifies the host volume to be mounted.

In this example, your python app files are located locally at /src/app. So you want to make that available to your container. You end up mounting this to /app inside the container. This makes available your Python (or anything else) located in /src/app to the container.

Let's assume you have a query.py script you have worked on. The path to that file on the host would be /src/app/query.py. By mounting /src/app to /app, this will make available query.py within the container with a path of /src/app/query.py. To run query.py the docker command would look like:

docker run -it -v /src/app:/app openbridge/ob_pysh-db python /app/query.py

Per Docker:

The -v command mounts the host directory, /src/app, into the container at /app. If the path /app already exists inside the container's image, the /src/app mount overlays but does not remove the pre-existing content. Once the mount is removed, the content is accessible again. This is consistent with the expected behavior of the mount command.

Creating Your Own Custom Image

To persist configurations different than what is in the current Dockerfile, it is best to create your own "docker image". This is done via a custom Dockerfile. You can use this Dockerfile to create a custom image for dev purposes. What you name your image is up to you.

docker build -t python-dbdev .

Or

docker build -t python-my-db-tools .

Feel free to modify the Dockerfile as deemed appropriate. You can add and subtract default packages to fit your use case. There are a large number of choices available relating to versions and operating systems. You can see a complete list here: https://hub.docker.com/_/python/

Issues

If you have any problems with or questions about this image, please contact us through a GitHub issue.

Contributing

You are invited to contribute new features, fixes, or updates, large or small; we are always thrilled to receive pull requests, and do our best to process them as fast as we can.

Before you start to code, we recommend discussing your plans through a GitHub issue, especially for more ambitious contributions. This gives other contributors a chance to point you in the right direction, give you feedback on your design, and help you find out if someone else is working on the same thing.

References