SLAMDATA FOR AMAZON REDSHIFT

Introduction

This guide sets out how to automatically transform complex JSON data from data sources such as S3 and MongoDB into analytics-ready tables in Amazon Redshift on a schedule. It takes about 35 minutes to complete.

If you run into trouble while using this guide, please contact [email protected]

Installing REFORM via Docker

If you have installed REFORM via the AWS Marketplace, AMI or CloudFormation please skip to the next section.

Go to https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchInstanceWizard in your browser

Select Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type

Select t2.2xlarge

Click Review and Launch

Click Launch

Select Create a new key pair from the dropdown menu

Provide Key pair name as REFORM

Click Download Key Pair

Click Launch Instances

Make a note of the Instance ID, which starts with i-, shown after the message The following instance launches have been initiated:

Click that Instance ID

Scroll all the way to the right and click the name of the security group for this instance, e.g. launch-wizard-20

Click Actions

Click Edit inbound rules

Click Add Rule

In the new rule select the Type HTTP

In the new rule select the Source My IP

Click Save

Go to https://console.aws.amazon.com/ec2/v2/home#AllocateAddress in your browser

Click Allocate

Make a note of the Elastic IP

Click Close

Click the Elastic IP you made a note of previously

Click Actions

Click Associate address

Provide Instance as the Instance ID you made a note of previously

Click Associate

Go to https://console.aws.amazon.com/ec2/v2/home#Instances in your browser

Click the Instance ID you made a note of earlier

Click Connect and follow the instructions

Once connected run the following commands at your SSH terminal

sudo yum install docker

sudo service docker start

If you are a REFORM customer and don’t have a DockerHub account, please go to https://cloud.docker.com in your browser, create an account, and then contact [email protected] with your DockerHub account name.

Go to https://cloud.docker.com/u/slamdata/repository/docker/slamdata/slamdata in your browser, find the latest numbered version, and replace 2.0.0 in the following steps with that number

Run the following command at your SSH terminal and then provide your DockerHub account details

sudo docker login

Run the following command at your SSH terminal. It starts the REFORM container in the background (-d), publishes it on port 80 (-p 80:80), stores its data in the vdw-volume Docker volume (-v), and restarts it automatically if the instance reboots (--restart=always)

sudo docker run --restart=always -d -p 80:80 -v vdw-volume:/var/lib/slamdata/data --name vdw slamdata/slamdata:2.0.0

Getting the instance id and public IP address of the REFORM EC2 instance

If you have installed REFORM via Docker please skip to the next section.

Go to https://console.aws.amazon.com/ec2/v2/home#Instances in your browser

Find the instance named after the AMI or CloudFormation stack and make a note of its Instance ID and Public IP Address

Creating a virtual table with REFORM

We will now create a virtual table using REFORM. We’ll use this to test that the scheduling system is working correctly.

In your browser go to the Elastic IP or Public IP Address you made a note of earlier

For a more in-depth explanation of the following steps click Tutorial

Click Create virtual table

Click the Add button in the Data sources step

Provide Name with Post giraffe

Provide Connection Type with S3

Provide Bucket URI with https://s3.amazonaws.com/post-giraffe

Click Add

Click Source Post giraffe

Click Dir app-data

Click Data user-events.json

In the user-events.json step click All keys

In the All keys step click Key dateTime

In the dateTime step, in the Values section, click Column Number

In the Number step click As Seconds since 1970

In the Seconds since 1970 step, next to Column Datetime click Pick

In the dateTime step, in the Values section, click Column String

In the String step click As ISO datetime

In the ISO datetime step, next to Column Datetime click Pick

Click Merge

Click Values of key dateTime

Click Values of key dateTime 2

Click Done

In the All keys step click Key S

In the S step, in the Key section, click Column Boolean

In the Boolean step click As Number

In the Number step, next to Column Number, click Pick

In the All keys step click Key MAS

In the MAS step, in the Key section, click Column Boolean

In the Boolean step click As Number

In the Number step, next to Column Number, click Pick

Click Untitled table

Provide the table name Emails

Click Done

Using this same process you can create virtual tables which provide access to any JSON data in your data sources such as S3 and MongoDB.

Creating a Redshift cluster

Go to https://console.aws.amazon.com/vpc/home?region=us-east-1#CreateSecurityGroup in your browser

Provide Security group name with RedshiftPublic

Provide Description as Allows public inbound Redshift access

Provide VPC with your VPC

Click Create

Click on the Security Group ID which starts with sg-

Click Actions

Click Edit inbound rules

Click Add Rule

Provide Type as Redshift

Provide Source as Anywhere

Click Save rules

Go to https://console.aws.amazon.com/redshift/home in your browser

Click Launch cluster

Provide Cluster identifier as redshift-cluster-1

Provide Master user name as awsuser

Provide a Master password and make a note of this

Click Continue

Click Continue again

Provide VPC security groups with RedshiftPublic

Click Continue

Click Launch cluster

Click View all clusters

Wait for your cluster to finish creating

Click Query editor

Provide Cluster with redshift-cluster-1

Provide Database with dev

Provide Database user with awsuser

Provide password as the Master password you made a note of earlier

Click Connect

In the Redshift query editor run the following queries, with <password> swapped out for a strong password, and make a note of this table_creator password

Don’t forget the single quotes around the password

create user table_creator with password '<password>';

grant create on database dev to table_creator;

Click Clusters

Click redshift-cluster-1

Make a note of the Endpoint

Scroll down and make a note of the Cluster public key

Integrating REFORM and Redshift

Run the following commands at your SSH terminal

sudo yum install jq

sudo yum install postgresql

curl https://gist.githubusercontent.com/beckyconning/df58494b6eb83b93652a3a57686d020a/raw/copytable > copytable

chmod +x copytable

curl https://gist.githubusercontent.com/beckyconning/df58494b6eb83b93652a3a57686d020a/raw/transfer > transfer

chmod +x transfer

curl https://gist.githubusercontent.com/beckyconning/df58494b6eb83b93652a3a57686d020a/raw/manifest > manifest

chmod +x manifest

curl https://gist.githubusercontent.com/beckyconning/df58494b6eb83b93652a3a57686d020a/raw/alltables > alltables

chmod +x alltables

At your SSH terminal type the following without pressing return

echo "

Then at your SSH terminal paste the Cluster public key you made a note of, type the following and then press return

" >> ~/.ssh/authorized_keys
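The two-step echo above can also be done in one command; a sketch, assuming the Cluster public key you made a note of is pasted in place of the placeholder value below:

```shell
# Placeholder -- replace with the Cluster public key you made a note of earlier.
CLUSTER_PUBLIC_KEY="ssh-rsa AAAAB3NzaC1EXAMPLEKEY Amazon-Redshift"

# Append the key so the Redshift cluster can connect to this instance over SSH.
mkdir -p ~/.ssh
echo "$CLUSTER_PUBLIC_KEY" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```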

Go to https://s3.console.aws.amazon.com/s3/home in your browser

Click Create bucket

Provide a Bucket name and make a note of this

Click Create at the bottom left

Go to https://console.aws.amazon.com/iam/home#/policies$new in your browser

Click Choose a service

Provide Find a service as s3

Click S3

Click Read

Click Write

Click Resources

Click Add ARN in the bucket section

Provide Bucket name as the bucket name you made a note of earlier

Click Add

Click Add ARN in the object section

Provide Bucket name as the bucket name you made a note of earlier

Provide Object name as *

Click Add

Click Review policy

Provide Name as read-write-reform-manifest

Click Create policy

Go to https://console.aws.amazon.com/iam/home#/users$new in your browser

Provide User name as reform-manifest

Click Programmatic access

Click Next: Permissions

Click Attach existing policies directly

Provide Search as read-write-reform-manifest

Click the checkbox to the left of read-write-reform-manifest

Click Next: Tags

Click Next: Review

Click Create user

Make a note of the Access key ID

Click Show

Make a note of the Secret access key

Copy the following and paste it into a text editor

INSTANCE_HOST="<elastic-ip>" INSTANCE_USER="ec2-user" MANIFEST_BUCKET_URI="s3://<bucket-name>" PGPASSWORD="<table-creator-password>" REDSHIFT_HOST="<endpoint-host>" REDSHIFT_USER="table_creator" REDSHIFT_PORT=5439 REDSHIFT_DB="dev" AWS_ACCESS_KEY_ID="<access-key-id>" AWS_SECRET_ACCESS_KEY="<secret-access-key>" ./alltables "http://localhost"

In your text editor do the following:

  • Replace the value of INSTANCE_HOST with the Elastic IP you made a note of earlier
  • Replace the value of MANIFEST_BUCKET_URI with the s3:// followed by the bucket name you made a note of earlier
  • Replace the value of PGPASSWORD with the table_creator password you made a note of earlier
  • Replace the value of REDSHIFT_HOST with the Endpoint you made a note of earlier, up to but excluding the :
  • Replace the value of REDSHIFT_PORT with the number following the : in the endpoint you made a note of earlier
  • Replace the value of AWS_ACCESS_KEY_ID with the Access key ID you made a note of earlier
  • Replace the value of AWS_SECRET_ACCESS_KEY with the Secret access key you made a note of earlier
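The two Endpoint replacements above split the Endpoint at the colon; shell parameter expansion can do the same split at your terminal. A sketch with a made-up endpoint value:

```shell
# Made-up example of the Endpoint shown on the cluster page (host:port).
ENDPOINT="redshift-cluster-1.abc123xyz.us-east-1.redshift.amazonaws.com:5439"

# Everything before the colon is the host; everything after it is the port.
REDSHIFT_HOST="${ENDPOINT%:*}"
REDSHIFT_PORT="${ENDPOINT##*:}"

echo "$REDSHIFT_HOST"   # redshift-cluster-1.abc123xyz.us-east-1.redshift.amazonaws.com
echo "$REDSHIFT_PORT"   # 5439
```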

Copy this modified command, paste it into your SSH terminal and press return

The test table and any other virtual tables you made are now being streamed from your data sources into analytics-ready tables in Redshift.

If it worked correctly you should be shown something similar to the following

Scheduling

Enabling the Systems Manager for the EC2 Instance

Go to https://console.aws.amazon.com/iam/home#/roles$new in your browser

Click EC2

Click Next: Permissions

Provide Search with amazonec2roleforssm

Click the checkbox next to AmazonEC2RoleforSSM

Click Next: Tags

Click Next: Review

Provide Role name with ReformEC2Role

Click Create role

Go to https://console.aws.amazon.com/ec2/v2/home#Instances in your browser

Click the Instance ID you made a note of earlier

Click Actions

Click Instance settings

Click Attach/Replace IAM Role

Provide IAM role with ReformEC2Role

Click Apply

Creating a Systems Manager Automation document

Go to https://console.aws.amazon.com/systems-manager/documents/create-document in your browser

Provide Name as ReformToRedshiftThenShutdown

Provide Document type as Automation document

Provide Content as the following

Click Create document

Creating a Systems Manager Automation Assume Role

Go to https://console.aws.amazon.com/iam/home#/roles$new in your browser

Click EC2

Click Next: Permissions

Provide Search as amazonssmautomationrole

Click the checkbox next to AmazonSSMAutomationRole

Click Next: Tags

Click Next: Review

Provide Role name as ReformSSMAutomationAssumeRole

Click Create role

Click ReformSSMAutomationAssumeRole

Make a note of the Role ARN

Click Trust relationships

Click Edit trust relationship

Replace Policy document with the following

Click Update Trust Policy

Click Permissions

Click Add inline policy

Click Choose a service

Provide Find a service as iam

Click IAM

Provide Filter actions as passrole

Click PassRole

Click Resources

Click Add ARN

Provide the Role ARN you made a note of earlier

Click Add

Click Review policy

Provide Name as ReformSSMAutomationAssumeRolePassRole

Click Create policy

Creating a CloudWatch Event

Go to https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#rules:action=create in your browser

Click Schedule

Click Cron expression

Provide the cron expression 0 0 * * ? * (this runs every day at midnight UTC; see https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html for information on how to set your own schedule)
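CloudWatch Events cron expressions use six fields rather than classic cron's five; a commented breakdown of the expression above (the trailing variable is only there to make the sketch runnable):

```shell
# CloudWatch Events cron field order (six fields, left to right):
#
#   0 0 * * ? *
#   | | | | | |
#   | | | | | +-- Year          (* = every year)
#   | | | | +---- Day-of-week   (? = no specific value)
#   | | | +------ Month         (* = every month)
#   | | +-------- Day-of-month  (* = every day)
#   | +---------- Hours         (0 = midnight UTC)
#   +------------ Minutes       (0 = on the hour)
CRON_EXPRESSION="0 0 * * ? *"
```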

Click Add target

Click the target dropdown showing Lambda function

Click SSM Automation

Provide Document as ReformToRedshiftThenShutdown

Provide InstanceId with the Instance Id you made a note of earlier

Provide AutomationAssumeRole with the Role ARN you made a note of earlier

Provide InstanceHost with the Elastic IP you made a note of earlier

Provide ManifestBucketUri with s3:// followed by the Bucket name you made a note of earlier

Provide PgPassword with the table_creator password you made a note of earlier

Provide RedshiftHost with the endpoint you made a note of earlier, up to but excluding the :

Provide RedshiftPort with the number following the : in the endpoint you made a note of earlier

Provide RedshiftUser with table_creator

Provide AwsAccessKey with the Access key ID you made a note of earlier

Provide AwsSecretAccessKey with the Secret access key you made a note of earlier

Click the Configure details button

Provide Name as ScheduledReformToRedshiftThenShutdown

Click Create rule

Conclusion

All unarchived virtual tables in your copy of REFORM will be streamed automatically into Redshift on the schedule you specified.

REFORM will shut down after each scheduled transfer to reduce the costs of running an instance. If you would prefer it not to, remove the stopInstances object from the Systems Manager Automation document you created earlier.

You can create, edit and archive your virtual tables as you please by starting your instance and going to the Elastic IP you made a note of earlier in your browser.