I noticed recently with the new 4.x releases of Amazon Elastic MapReduce PrestoDB is now a supported app. PrestoDB is heralded as an amazing distributed SQL query engine, originally developed at facebook and then open sourced (https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920/). It supports multiple backends such as Hive (which covers any data you might have on HDFS or even AmazonS3), Cassandra, and even MySQL and PostgresSQL via JDBC.
AirBnB apparently uses PrestoDB internally and they have open sourced a project called Airpal, which is essentially a web-based query UI that allows users to interact with PrestoDB on the backend.
Amazon released a cool blog post that provides a CloudFormation stack for setting up an Airpal server: https://blogs.aws.amazon.com/bigdata/post/Tx1BF2DN6KRFI27/Analyze-Data-with-Presto-and-Airpal-on-Amazon-EMR. I decided to give it a shot and record my notes here:
Setting Up PrestoDB and Airpal
The first step is to spin up an EMR cluster with PrestoDB installed. This was simple and familiar to me. All it involves is a simple checkbox as shown in the blog post:
The next step is to run their special CloudFormation stack. This is a json document that describes how to spin up an instance with a certain Amazon machine image, and defines a bootstrap script that starts up the Airpal server. The stack is parameterized, so you the user need to provide the URL of your EMR master node, as that is how Airpal will connect to Presto on the backend.
I messed this part up at first by leaving off the http//: scheme, which led to this issue: https://github.com/airbnb/airpal/issues/128
After that hiccup, the next issue was modifying the EMR master security group so that the Airpal server can reach it on port 8889. I added the following TCP rule to enable connectivity:
Type Protocol Port Range Source
Custom TCP Rule TCP 8889 <Airpal server ip>/32
Next we need to actually define some tables in Hive. Unfortunately this requires SSHing directly into the EMR cluster to open the hive terminal, the whole process is described in the blog post. PrestoDB uses the hive metadata to actually define the schema, but it does not use the hive query execution engine.
Exploring the Airpal UI
With the tables defined we are finally ready to connect to the Airpal UI. Unfortunately I found the UI at this stage in the project pretty underwhelming.
First off the page looked pretty busted in my Chrome browser, with UI elements getting messed up as you interact with the different features:
The two inputs on the right start to impinge on one another, and the main nav bar under the query space gets lost underneath another element. (Update: everything seems to look fine in Safari, not sure what the deal is there)
Next when I launched a query it was unclear whether anything was actually happening on the back end. I don’t know if this is a result of me running the app on a t2.micro instance and having a slow request time, but there appeared to be no indicator of what queries were actually running. It usually took a hard refresh to make results appear. The only way I knew the app was working was by SSHing into the machine and tailing the server logs. (Update: After reading through the source code I’ve seen there is a progress bar but the layout problems were hiding it).
Finally when the results did appear the visualization was pretty underwhelming. Just a simple table without column sorting or searching. They do allow export to CSV which is great though. Given how amazing the AirBnB UI is, I had high hopes for this tool, but as the project is only on version 0.1 I think slick interfaces have taken a back seat to the core functionality, which appears to be solid.
Looking past Airpal, PrestoDB itself seems very promising. I SSH’ed directly into the cluster again and decided to compare the following simple GROUP BY query on the example wikipedia dataset using both the presto and hive CLIs:
select count(1), language from wikistats group by language;
Running in hive takes 147 seconds, as we have to wait for a whole MapReduce job to be setup and run in the background.
In presto the exact same query only takes 82 seconds, and it actually shows us what is going on with our query from the command line (as opposed to Hive which just pipes out MapReduce job monitoring info):
presto:default> select count(1), language from wikistats group by language;
Query 20160206_224112_00053_xtqaa, RUNNING, 2 nodes, 171 splits
0:27 [94.5M rows, 4.16GB] [3.46M rows/s, 156MB/s] [=========================>>>>>>>>>>>>>>>>>>>>> ] 59%
STAGES ROWS ROWS/s BYTES BYTES/s QUEUED RUN DONE
0.........R 0 0 0B 0B 0 1 0
1.......R 16.6K 607 435K 15.9K 0 2 0
2.....R 94.5M 3.46M 4.16G 156M 6 61 101
It would be a great feature to pipe this more detailed progress information back through the front end; I plan to see if it is possible and maybe open up a pull request sometime in the future.