A Terraform setup for provisioning an HDP big data analytics server instance on AWS. 🔥🔥🔥

Installing / Getting started

⚠️ Before running the scripts, create a remote S3 bucket to store the Terraform state.

  • By default, the name of the remote state bucket is terraform-hadoop.
  • If you want to use a bucket with a different name, make sure you replace the default remote bucket name mentioned in
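
As a minimal sketch of the bucket creation step, assuming the AWS CLI is installed and configured: the helper name `create_state_bucket` and the `DRY_RUN` switch are hypothetical and not part of this repository. The us-east-1 default is also an assumption. Outside us-east-1, S3 expects an explicit location constraint, which the helper adds.

```shell
# Create the remote state bucket for Terraform.
# `create_state_bucket` is a hypothetical helper; DRY_RUN=1 prints the
# command instead of running it.
create_state_bucket() {
  local bucket="${1:-terraform-hadoop}"   # default bucket name from this README
  local region="${2:-us-east-1}"          # assumed default region
  set -- aws s3api create-bucket --bucket "$bucket" --region "$region"
  # Outside us-east-1, S3 requires an explicit location constraint.
  if [ "$region" != "us-east-1" ]; then
    set -- "$@" --create-bucket-configuration "LocationConstraint=$region"
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example: preview the command first, then run it for real.
#   DRY_RUN=1 create_state_bucket
#   create_state_bucket
```

S3 bucket names are globally unique, so the default name may already be taken in your case. If so, pass a different name as the first argument.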


To configure the public IP address, replace the HostIp variable found in env > dev.tfvars | prod.tfvars,

> curl


💡 If you don't want to use global credentials, add AWS_PROFILE=<username> to each terraform and aws command given below.
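
One way to avoid retyping the prefix is a small wrapper that sets AWS_PROFILE for a single command only. This is a sketch: `with_profile` is a hypothetical helper name, and the profile called username is assumed to exist in your AWS credentials file.

```shell
# Run one command with AWS_PROFILE set for that invocation only.
# `with_profile` is a hypothetical helper, not part of terraform or the AWS CLI.
with_profile() {
  if [ "$#" -lt 2 ]; then
    echo "usage: with_profile <profile> <command> [args...]" >&2
    return 2
  fi
  local profile="$1"
  shift
  AWS_PROFILE="$profile" "$@"
}

# Example (assumes a profile named "username" is configured):
#   with_profile username terraform plan
#   with_profile username aws ec2 describe-instances
```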

Initialize terraform

> cd terraform/private_vpc
> terraform init

Create the AWS key pair that will be used to log in to the AWS instance. The same key pair is used for the other instances too,

> cd terraform/scripts # generate keys inside scripts
> aws ec2 create-key-pair --key-name hwsndbx --query 'KeyMaterial' --output text > hwsndbx.pem

Optional: Workspaces

> terraform workspace list # created at terraform init

To create two new workspaces,

> terraform workspace new dev
> terraform workspace new prod

To provision the resources in the dev workspace, first select that workspace,

> terraform workspace select dev
> terraform apply


Apply terraform script,

> terraform plan
> terraform apply -auto-approve

Optional: Apply the terraform script with an environment-specific variable file,

> terraform plan -var-file=./env/dev.tfvars
> terraform apply -auto-approve -var-file=./env/dev.tfvars
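
If you use workspaces together with the variable files, the two can be kept in step with a small lookup. This is a sketch under the assumption that each workspace is meant to use the tfvars file of the same name. The README does not state that, and `var_file_for` is a hypothetical helper.

```shell
# Map a workspace name to its variable file under ./env.
# `var_file_for` is a hypothetical helper; only dev and prod appear in this README.
var_file_for() {
  case "$1" in
    dev|prod) echo "./env/$1.tfvars" ;;
    *) echo "unknown workspace: $1" >&2; return 1 ;;
  esac
}

# Example, run from the terraform directory being applied:
#   ws="$(terraform workspace show)"
#   terraform plan -var-file="$(var_file_for "$ws")"
```

Rejecting unknown names means a run in the default workspace fails early instead of silently picking up the wrong variables.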

OpenVPN - Bastion Host

We need a reliable way to access the server, and we can't tie it to a local dynamic IP that changes every time. So we create a new EC2 instance running OpenVPN to act as the bastion host.

For OpenVPN setup refer to this video.

Change the openvpn_ami_id based on your specified region,

> aws --region=us-east-1 ec2 describe-images --owners aws-marketplace --filters 'Name=name,Values=OpenVPN Access Server 2.7.5*'
> cd terraform/bastion_host_openvpn
> terraform init
> terraform plan
> terraform apply

Setup OpenVPN

Read through this for more setup.

Connect to the OpenVPN instance using the assigned elastic ip,

> ssh -i ./scripts/hwsndbx.pem openvpnas@<elasticip>

Accept the default for every setting, then change the password,

> sudo passwd openvpn

Then go to the OpenVPN WebUI at https://<elastic-ip>:943. Use openvpn as the username and the password configured in the terminal above.

  • In Configuration > VPN Settings > Routing, enable "Should client Internet traffic be routed through the VPN?"
  • With this configuration, the VPN client IP address is translated before being presented to resources inside the VPC. That means the client’s original IP address is remapped to one belonging to the VPC IP address space.

Using domain

To use a domain, add the nameservers shown in the terraform apply output to the domain's DNS settings.

Optional: Add SSL Cert for https

Read more on adding SSL Cert.

You should now be able to access your VPN's admin GUI by going to https:///admin. However, your browser will show a warning because the SSL cert is not valid. You can bypass this warning to access the admin GUI, but we should set up a valid SSL cert.

  • Use ZeroSSL to obtain your certificate for free.

Walk through the wizard to create a new Let's Encrypt certificate. You will be required to verify your domain as part of this process.

Copy the Certificate, CA Bundle and Private Key to files.

Log in to your VPN access server GUI using the user openvpn and the password created on the server. Navigate to Settings > Web Server. From there, upload the Certificate, CA Bundle and Private Key files. Click validate, and save if there are no errors.

> ssh root@<host> "cat server.csr" | pbcopy # pbcopy copies to the macOS clipboard
> ssh root@<host> "cat server.key" | pbcopy

HDP Instance

Next, we will provision HDP as a spot instance. If you need it as a readily available instance instead, change directory to ``.

> cd terraform/hdp_instance
> terraform init
> terraform plan
> terraform apply


To connect over ssh, the key file needs a permission of 400, but by default it is 644,

> ls -la # to see the permission of the pem file
> chmod 400 ./scripts/hwsndbx.pem # same key for all
> ssh -i ./scripts/hwsndbx.pem ec2-user@<output_instance_ip>
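
The permission step matters because ssh refuses a private key that other users can read. The check can be sketched as below. `file_mode` and `ensure_key_mode` are hypothetical helper names, and the `stat` fallback is an assumption that covers both GNU (Linux) and BSD (macOS) variants.

```shell
# Print a file's permission bits in octal.
# Tries GNU stat first and falls back to BSD/macOS stat.
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# Restrict a private key to owner read-only if it is not already.
# `ensure_key_mode` is a hypothetical helper name.
ensure_key_mode() {
  local key="$1"
  if [ "$(file_mode "$key")" != "400" ]; then
    chmod 400 "$key"
  fi
}

# Example:
#   ensure_key_mode ./scripts/hwsndbx.pem
#   ssh -i ./scripts/hwsndbx.pem ec2-user@<output_instance_ip>
```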

Install HDP through docker,

> docker info
> cd /tmp/hdp-docker-sandbox/HDP_2.6.5
> sudo bash
> docker ps
> docker ps -a

To restart the containers,

> cd /tmp/hdp-docker-sandbox
> sudo bash
  • After it finishes, access Ambari through http://elastic-public-ip:8080/.
  • The default Ambari credentials are raj_ops:raj_ops and maria_dev:maria_dev. The default AmbariShell login credential is root:hadoop.

Basic commands

Docker troubleshooting

> sudo docker images
> sudo service docker restart
> sudo service docker status

Sandbox Bash

Read the Cloudera HDP sandbox and Apache Ambari shell commands documentation for more information.

To peek into the docker sandbox,

> docker exec -it <docker-sandbox-container-id> /bin/bash
> ssh root@localhost -p 2222 # or you can use this with password hadoop
> ambari-agent status
> ambari-agent start # if stopped start
> ambari-server restart

Setup Python

Hortonworks doesn't come with many resources out of the box for working with Python,

> sudo su -
> yum install python-pip -y
> pip install google-api-python-client==1.6.4
# > curl | python
# > pip install --ignore-installed pyparsing
> pip install mrjob==0.5.11 #MRJob
> yum install nano -y

Example data files and scripts to play with,

> sudo su - maria_dev
> wget
> wget
> hadoop fs -copyFromLocal /user/maria_dev/ml-100k/
> python RatingsBreakdown.p
> python -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar #mrjob manually copies the file to hdfs temp location and executes it
> hostname -I | awk '{print $1}' # get the ip
> python -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs://
> python -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs:///user/maria_dev/ml-100k/

If Python 3.6 is needed

Look into this script


Change the Ambari password once you have created the instance,

> docker exec -it sandbox-hdp /bin/bash
> ambari-admin-password-reset
> ambari-agent restart

Add Host IP to Mac

💡 The hosts file is C:\Windows\System32\drivers\etc\hosts on Windows or /etc/hosts on macOS.

If you want a local alias for the instance, add a line to your hosts file that maps the host IP to a name, so you can use it as a domain name locally. To save and exit the nano editor, press ctrl + o > enter > ctrl + x.

> sudo nano /etc/hosts # add the ip and map to a host
> sudo killall -HUP mDNSResponder # flush DNS cache
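
The manual edit can also be scripted. The sketch below appends an entry only when the name is not already mapped, so running it twice does not duplicate the line. `add_host_entry`, the documentation-range IP and the name sandbox.example are all hypothetical placeholders.

```shell
# Append "<ip><TAB><name>" to a hosts-format file unless the name is already mapped.
# `add_host_entry` is a hypothetical helper name.
add_host_entry() {
  local ip="$1"
  local name="$2"
  local file="${3:-/etc/hosts}"
  # Look for the exact name in any non-comment line, skipping the IP column.
  if awk -v n="$name" '!/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } END { exit !found }' "$file"; then
    echo "$name is already present in $file" >&2
  else
    printf '%s\t%s\n' "$ip" "$name" >> "$file"
  fi
}

# Example against a scratch copy (writing to /etc/hosts itself needs root):
#   cp /etc/hosts /tmp/hosts.new
#   add_host_entry 203.0.113.10 sandbox.example /tmp/hosts.new
#   sudo cp /tmp/hosts.new /etc/hosts
```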

Pausing and Resuming Instances

⚠️ Keep in mind that although a stopped instance incurs no compute charges, you may still incur charges for EBS storage and Elastic IP addresses associated with the instances.

Once the instances are created, stop them by executing,

> cd /tmp/hdp-docker-sandbox
> bash # pause the instance
> cd hdp_instance
> terraform output # get the id from output for hdp instance
> aws ec2 stop-instances --instance-ids <instance_id> --profile edutf
> cd bastion_host_openvpn
> terraform output # get the id from output for openvpn instance
> aws ec2 stop-instances --instance-ids <instance_id> --profile edutf

To start the instances again after a stop,

> cd bastion_host_openvpn
> terraform output # get the id from output for openvpn instance
> aws ec2 start-instances --instance-ids <instance_id> --profile edutf
> cd hdp_instance
> terraform output # get the id from output for hdp instance
> aws ec2 start-instances --instance-ids <instance_id> --profile edutf
> cd terraform/hdp_instance
> ssh -i ./scripts/hwsndbx.pem ec2-user@<instance_ip>
> cd /tmp/hdp-docker-sandbox
> bash # resume the instance
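
The stop and start commands above differ only in the action and the instance id, so they can be wrapped in one helper. This is a sketch: `ec2_power` and the `DRY_RUN` switch are hypothetical, and the default profile edutf is simply the one used in the commands above.

```shell
# Stop or start one EC2 instance by id.
# `ec2_power` is a hypothetical helper; DRY_RUN=1 prints the command
# instead of running it.
ec2_power() {
  local action="$1"
  local instance_id="$2"
  local profile="${3:-edutf}"   # profile name taken from the commands above
  case "$action" in
    stop|start) ;;
    *) echo "usage: ec2_power stop|start <instance_id> [profile]" >&2; return 2 ;;
  esac
  set -- aws ec2 "${action}-instances" --instance-ids "$instance_id" --profile "$profile"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example: preview, then stop the instance whose id came from `terraform output`.
#   DRY_RUN=1 ec2_power stop <instance_id>
#   ec2_power stop <instance_id>
```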

Spark Notebooks

> ps -ef
> kill -HUP <PID>
> bash spark


To destroy the terraform instance,

> terraform destroy -auto-approve



MIT © Murshid Azher.