Trying out Presto and Airpal on EMR

I noticed recently that with the new 4.x releases of Amazon Elastic MapReduce, PrestoDB is now a supported application. PrestoDB is heralded as an amazing distributed SQL query engine, originally developed at Facebook and then open sourced (https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920/). It supports multiple backends such as Hive (which covers any data you might have on HDFS or even Amazon S3), Cassandra, and even MySQL and PostgreSQL via JDBC.

Airbnb apparently uses PrestoDB internally, and they have open sourced a project called Airpal, which is essentially a web-based query UI that lets users interact with PrestoDB on the backend.

Amazon released a cool blog post that provides a CloudFormation stack for setting up an Airpal server: https://blogs.aws.amazon.com/bigdata/post/Tx1BF2DN6KRFI27/Analyze-Data-with-Presto-and-Airpal-on-Amazon-EMR. I decided to give it a shot and record my notes here:

Setting Up PrestoDB and Airpal

The first step is to spin up an EMR cluster with PrestoDB installed. This part was simple and familiar to me; all it involves is checking a box, as shown in the blog post:

Select install PrestoDB

The next step is to run their special CloudFormation stack. This is a JSON document that describes how to spin up an instance from a certain Amazon Machine Image and defines a bootstrap script that starts up the Airpal server. The stack is parameterized, so you, the user, need to provide the URL of your EMR master node, as that is how Airpal will connect to Presto on the backend.

I messed this part up at first by leaving off the http:// scheme, which led to this issue: https://github.com/airbnb/airpal/issues/128

After that hiccup, the next issue was modifying the EMR master security group so that the Airpal server can reach it on port 8889. I added the following TCP rule to enable connectivity:

Next we need to actually define some tables in Hive. Unfortunately this requires SSHing directly into the EMR cluster to open the Hive terminal; the whole process is described in the blog post. PrestoDB uses the Hive metastore to define the table schemas, but it does not use Hive's query execution engine.

Exploring the Airpal UI

With the tables defined we are finally ready to connect to the Airpal UI. Unfortunately I found the UI at this stage in the project pretty underwhelming.

First off, the page looked pretty busted in my Chrome browser, with UI elements getting messed up as you interact with the different features:

Airpal Messed Up

The two inputs on the right start to overlap one another, and the main nav bar under the query space gets lost underneath another element. (Update: everything seems to look fine in Safari, not sure what the deal is there.)

Next, when I launched a query, it was unclear whether anything was actually happening on the backend. I don’t know if this was a result of running the app on a t2.micro instance and having slow request times, but there appeared to be no indicator of which queries were actually running. It usually took a hard refresh to make results appear. The only way I knew the app was working was by SSHing into the machine and tailing the server logs. (Update: After reading through the source code I’ve seen there is a progress bar, but the layout problems were hiding it.)

Finally, when the results did appear, the visualization was pretty underwhelming: just a simple table without column sorting or searching. They do allow export to CSV, which is great. Given how amazing the Airbnb UI is, I had high hopes for this tool, but as the project is only at version 0.1, I think slick interfaces have taken a back seat to the core functionality, which appears to be solid.

Exploring PrestoDB

Looking past Airpal, PrestoDB itself seems very promising. I SSHed directly into the cluster again and decided to compare the following simple GROUP BY query on the example Wikipedia dataset using both the Presto and Hive CLIs:

select count(1), language from wikistats group by language;

Running it in Hive takes 147 seconds, as we have to wait for a whole MapReduce job to be set up and run in the background.

In Presto the exact same query takes only 82 seconds, and the CLI actually shows us what is going on with our query as it runs (as opposed to Hive, which just pipes out MapReduce job monitoring info):

It would be a great feature to pipe this more detailed progress information back through the front end; I plan to see if it is possible and maybe open up a pull request sometime in the future.

Density Map of 4-year Colleges in the Continental U.S.

I was recently working with the IPEDS dataset, which is a very interesting collection of information on all of the postsecondary institutions in the US. I was surprised to see data on over 7000 schools, most of which I had never heard of. This inspired me to create a visualization of just how many schools there are in the US and where they are located. The best way I could think of to visualize this was a contour plot where, instead of elevation, the metric is the density of colleges at each point in the US.

Here’s a screencap of the resulting visualization:

See it in HTML form at:

http://ipeds-visualization.s3-website-us-west-2.amazonaws.com/
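
For the curious, the density surface behind a plot like this can be sketched with d3's contourDensity (from the d3-contour module); the projection settings, bandwidth, and the shape of the schools array below are illustrative rather than what the page itself actually does:

// Sketch: project each school's (longitude, latitude) onto the map, estimate a
// 2D density over the projected points, and draw one path per contour level.
var width = 960, height = 600;
var projection = d3.geoAlbersUsa().scale(1280).translate([width / 2, height / 2]);

var points = schools
  .map(function (d) { return projection([d.longitude, d.latitude]); })
  .filter(function (p) { return p !== null; }); // drop points the projection can't place

var contours = d3.contourDensity()
  .x(function (d) { return d[0]; })
  .y(function (d) { return d[1]; })
  .size([width, height])
  .bandwidth(15)(points);

d3.select('svg')
  .selectAll('path')
  .data(contours)
  .enter().append('path')
  .attr('d', d3.geoPath())
  .attr('fill', 'none')
  .attr('stroke', 'steelblue');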

Loading Avro to Redshift

Recently, AWS added support for bulk loading Avro files directly into Redshift. Previously, Redshift only supported delimited flat files, fixed-width files, and JSON as input formats. This could create a hassle for some workflows, as you might end up transforming your data to one of these formats just to get it into Redshift.

I ran a few tests of how bulk loading Avro into Redshift works. I started with a simple Avro schema:

And what seemed to me like the corresponding SQL table:

Assuming we have an Avro DataFile in S3 that fits the above schema, we can load our table using the following COPY command:

The 'auto' format parameter seems to work the same as it does for bulk loading JSON: all top-level fields in the Avro schema are matched to their respective columns based on field name. Everything seems to just work, and the data ends up in our table in the form we would expect, converting properly between Avro and Postgres types.

One interesting thing I noticed is that if you have a top-level field with a complex data type such as {"name": "array_field", "type": {"type": "array", "items": "string"}} or {"name": "map_field", "type": {"type": "map", "values": "int"}}, and the table has a varchar column with the same name, the loader will serialize the value to JSON and store it in the varchar column. This is useful behavior, as Redshift has native JSON accessor functions that work pretty well.

I also tried an Avro/table combination with a blatant type mismatch (a string going into an int column), and the load failed with an error, as expected.

The next question is how the performance of bulk loading Avro compares to the other data formats. Using the above AllTypesTest Avro schema, I created three sets of test files with random data values, plus CSV equivalents for each one, then benchmarked loading them into a Redshift cluster. My results were as follows:

Cluster Instance Type: dc1.large
Cluster Size: 1 node

Records       Avro load time   Avro size   CSV load time   CSV size
1,000,000     55.783 s         39 MB       8.87 s          65 MB
10,000,000    167.285 s        387 MB      28.825 s        654 MB
100,000,000   1,520.653 s      3.8 GB      251.418 s       6.4 GB

As we can see, loading the exact same data from Avro takes roughly six times longer than from CSV. This is super disappointing, as I'm pretty sure that deserialization performance for Avro should be much better than for CSV (see one benchmark here). However, the performance is not debilitatingly horrible. If your workflows are already all Avro based, using Avro bulk loading could be the right call, as it's just so much cleaner than adding in a pre-load transformation step. Plus, in theory Avro should be faster, so I suspect that as AWS pushes backend updates we will see performance improve.

My code to generate the test Avro and CSV files was as follows:

Visualizing Medicaid and Medicare Spending

Recently I learned about a cool dataset that was released, containing a bunch of different aggregated statistics about the Medicaid and Medicare programs in the US. The data contains not only information about spending, but disease prevalence as well. This dataset is available for anyone to download here.

There are a ton of variables, with data for 5 different years for all 50 states and DC. I felt like an interactive visualization could be helpful for allowing people to explore this important data. I put something together using d3 to create a map-based visualization, with Angular to make it interactive. Because all the data is of a reasonable size (about 10 MB when serialized), I was able to make this a fully static page. I also added the ability to bookmark analyses via query params, which allowed me to link a few interesting trends I noticed at the bottom of the page.

Here’s what the finished result looks like:

Hosted as a static site on S3 here:

mmleads-puf-visualization.s3-website-us-west-2.amazonaws.com

One trip-up is that it seems to fail on iOS Safari, probably because the JSON blob with the data is of an unreasonable size. It would be cool to go back and figure out some way around that. I envision I could split the JSON up into smaller files and have them fetched more dynamically, kind of like a poor man's read-only DB.
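
To sketch the idea (the per-state file layout here is hypothetical): keep one small JSON file per state and fetch it lazily, memoizing it, the first time that state is selected.

// Poor man's read-only DB: fetch data/<STATE>.json on demand and cache it,
// instead of shipping the whole ~10MB blob up front.
var stateCache = {};

function loadState(stateCode) {
  if (stateCache[stateCode]) {
    return Promise.resolve(stateCache[stateCode]);
  }
  return fetch('data/' + stateCode + '.json')
    .then(function (response) { return response.json(); })
    .then(function (stateData) {
      stateCache[stateCode] = stateData;
      return stateData;
    });
}

// e.g. loadState('CA').then(renderState);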

Visualizing Manacher’s Algorithm

A commonly used algorithm question for interviews is finding the longest palindrome in a given string. It is a good question because there are so many ways to approach it. It is also a good question because many are surprised to learn that this can actually be done in linear time. The linear-time algorithm, attributed to Manacher, is pretty mind-boggling. I think this contributes to making the longest palindrome question good for interviews; even if you have seen Manacher's approach beforehand, it is pretty hard to get an intuitive grasp on it, and you would be hard-pressed to regurgitate it like you might with other algorithms.

In an effort to get an intuitive grasp of how Manacher’s algorithm works, I have created a visualization using D3.js, which renders each step in the process of finding the longest palindrome. It is interactive and allows you to enter your own string to run through Manacher’s algorithm. I have to say I really didn’t understand the beauty of the algorithm until I made this visualization. I would like to do this in the future for other tricky algorithms and post them as computer science learning aids.
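
For reference, the algorithm that the visualization steps through is compact enough to show in full; here is a plain JavaScript sketch of it (not the code behind the visualization itself):

// Manacher's algorithm: longest palindromic substring in linear time.
// Sentinel '#' characters are interleaved so even- and odd-length palindromes
// are handled uniformly.
function longestPalindrome(s) {
  if (s.length === 0) { return ''; }
  var t = '#' + s.split('').join('#') + '#';
  var radius = new Array(t.length).fill(0); // palindrome radius centered at t[i]
  var center = 0, right = 0;                // center / right edge of the rightmost palindrome seen

  for (var i = 0; i < t.length; i++) {
    if (i < right) {
      // Reuse the mirror position's radius, clipped to the known right boundary.
      radius[i] = Math.min(right - i, radius[2 * center - i]);
    }
    // Expand beyond what the mirror guarantees.
    while (i - radius[i] - 1 >= 0 && i + radius[i] + 1 < t.length &&
           t[i - radius[i] - 1] === t[i + radius[i] + 1]) {
      radius[i] += 1;
    }
    if (i + radius[i] > right) {
      center = i;
      right = i + radius[i];
    }
  }

  var best = 0;
  for (var j = 0; j < t.length; j++) {
    if (radius[j] > radius[best]) { best = j; }
  }
  var start = (best - radius[best]) / 2;
  return s.substr(start, radius[best]);
}

// e.g. longestPalindrome('bananas') === 'anana'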

The visualization is hosted as a static site on AWS S3:

http://manacher-viz.s3-website-us-east-1.amazonaws.com/

Here’s what the finished result looks like:

Creating a Fully Tested Web App

I’ve noticed that my standards for testing “frontend” code are much more lax than the standards I have for testing “backend” code. The attitude I often find myself taking is that the backend stuff is more “important”, and anything that goes wrong in the frontend is just a superficial error. I might say something like, “Oh, the response the server returned was actually correct, it just wasn’t displayed properly.” This is obviously incredibly stupid. After researching this I realized there are a lot of awesome technologies for testing web apps that I have yet to try. Here I present a proof of concept for a fully tested frontend: a minimal Angular app that has unit tests of the client-side JavaScript, and E2E tests to ensure the final rendered page is correct when shown in a browser.

The technology stack for this proof of concept is as follows:

Server: Express (Node.js web application framework)
App: Angular.js
Testing Environment: Karma.js
Testing Framework: Mocha.js
Assertion/Expectation Library: Chai.js
E2E Testing Framework: Protractor.js

First of all, there are a bunch of technologies floating around out there to choose from, and many alternatives at each level. Most of the decisions I made here follow the suggested tools for testing an Angular app. Protractor.js is the preferred E2E framework for Angular, and Karma.js is the preferred testing environment. Within Karma.js you have the option of choosing among a few alternatives for writing the actual tests; they state support for Jasmine, Mocha, or QUnit. Mocha apparently offers more flexibility than Jasmine, but for my part this decision was pretty arbitrary. Part of this flexibility is choosing your own assertion library to use with Mocha, something which is “batteries included” with the Jasmine testing framework. For this I am using Chai. My server itself is written using Express, just because of its simplicity.

So let's start with the app itself; we have one controller:
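
(In essence it is just the following; the module is named App and the controller IndexController, matching the tests later on.)

// A counter that starts at 0 and goes up by one each time increment() is called.
angular.module('App', [])
  .controller('IndexController', ['$scope', function ($scope) {
    $scope.theNumber = 0;

    $scope.increment = function () {
      $scope.theNumber += 1;
    };
  }]);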

And one view:

We will have a button that you click to make a number go up; like I said, a minimal static Angular app. But how do we know that increment() will do what it implies? In steps Karma. First we need to create a Karma config file, which tells Karma which JavaScript files to load and how exactly we want to test them. Our config looks like this:
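
(In sketch form; the file paths and browser here are placeholders for wherever angular, angular-mocks, and the controller code actually live.)

// karma.conf.js -- wire up mocha + chai and tell Karma which scripts to load.
module.exports = function (config) {
  config.set({
    frameworks: ['mocha', 'chai'],
    files: [
      'app/static/js/lib/angular.js',
      'app/static/js/lib/angular-mocks.js',
      'app/static/js/controllers.js',
      'app/static/js/tests/*.js'
    ],
    browsers: ['Chrome'],
    singleRun: true
  });
};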

Once we have this set up, we just need to create some tests in app/static/js/tests for Karma to discover. We will have one test file, corresponding to our one controller, IndexController:
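
(Sketched here in outline; the file name and test descriptions are illustrative, but the structure is what the breakdown below walks through.)

// app/static/js/tests/test_index_controller.js (name illustrative)
describe('App', function () {
  beforeEach(module('App'));

  describe('IndexController', function () {
    describe('$scope', function () {
      var scope;

      // Mock dependency injection: build a fresh scope and a live controller.
      beforeEach(inject(function ($controller, $rootScope) {
        scope = $rootScope.$new();
        $controller('IndexController', { $scope: scope });
      }));

      it('starts theNumber at 0', function () {
        expect(scope.theNumber).to.equal(0);
      });

      it('increments theNumber by one when increment() is called', function () {
        scope.increment();
        expect(scope.theNumber).to.equal(1);
      });
    });
  });
});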

These tests look pretty exotic, so let's break them down:

describe is a function provided to us by the Mocha framework; it simply defines a block within which tests can be defined.

Next we see beforeEach(module('App'));, which is known as a hook in the Mocha framework. It will execute before every test defined inside this block. The module here is angular.mock.module from the angular-mocks library, which means our App module will be registered and loaded before each test.

inject() also comes to us from the angular-mocks library; it allows us to do the mock dependency injection necessary to instantiate a controller.

Finally, once we are nested three describe blocks deep, we have an actual live controller, which gives us access to the attributes of its scope. Here we find our it() blocks, which are basically the individual test cases in the Mocha framework.

Inside these we have our expect calls; expect comes from Chai, our assertion library, and will throw an error if our test conditions are not met.

This is all well and good: we expect that the scope will start off with theNumber at 0, and that it will go up by one when increment() is called. But how can we be certain that our webpage will actually display these changes correctly to the user? In steps Protractor. We start with our Protractor configuration:
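
(A minimal sketch; the selenium address is webdriver-manager's default, and the spec path is illustrative.)

// protractor.conf.js -- point Protractor at a local selenium server and our spec.
exports.config = {
  seleniumAddress: 'http://localhost:4444/wd/hub',
  specs: ['test/e2e/index_spec.js']
};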

Protractor needs an actual browser to run, seeing as it must actually render the webpage. This config points it to a local selenium server (which we will spin up using the node module webdriver-manager, included with protractor), and then points it to a “spec” file, which contains our E2E tests:
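
(Again a sketch; the element ids and the local Express port are illustrative.)

// test/e2e/index_spec.js -- load the page served by the local express server
// and check that theNumber is rendered and updates when the button is clicked.
describe('the index page', function () {
  beforeEach(function () {
    browser.get('http://localhost:3000/');
  });

  it('displays theNumber and updates it on click', function () {
    var theNumber = element(by.id('the-number'));
    expect(theNumber.getText()).toEqual('0');

    element(by.id('increment-button')).click();
    expect(theNumber.getText()).toEqual('1');
  });
});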

This looks pretty similar to our Mocha tests. I'm not sure if it is using the same framework under the hood or just imitating it, but the consistency is nice. However, Protractor lets us actually visit and inspect our webpage! We start with a call to browser.get(), which points at our locally running Express.js server. Once the browser has loaded the page, we can use a jQuery-like interface to find and inspect elements. We have carefully id-tagged the elements of interest in our page, so that now we can easily find them and verify that our scope variable $scope.theNumber is indeed being displayed properly in the final rendered page. Hooray!

The code I have presented is available as a project seed on github: https://github.com/qwwqwwq/karma_protractor_seed

Using D3 In Rails

I’ve been learning Rails lately, and it's amazing how the framework lets you make a functional website with minimal thought or effort. I was cruising along until I decided to try to add a slick d3 visualization into my app; this part required a bit of poking around to get right. In the interest of helping others, I present here a minimal Rails app that displays a simple d3 visualization.

This is the humble graph we will be creating.

We begin with a fresh Rails 4.2.0 app, initialized by rails new. From there we need to modify or create the following files to display our chart:


app/assets/javascripts/graph.js        //actual d3 code goes here
app/assets/javascripts/application.js  //include the d3.js library
app/controllers/graph_controller.rb    //this controller will serve the data for our graph
app/views/graph/index.html.erb         //this template contains the rendering endpoint
config/routes.rb                       //route our graph and data endpoints
vendor/assets/javascripts/d3.js        //the d3 library itself

We start with the controller, which is very simple:

There are two methods here: index simply gives us the template that will hold the graph, and data gives us the JSON we will use to build the graph. One issue that arises is that Rails does not like just serving JSON; it defaults to trying to serve .erb. So we need to configure Rails to serve JSON from the data method, which we do in config/routes.rb like so:

Failing to change the default format for graph/data will result in an ActionController::UnknownFormat error.

Next we need some actual d3 code that will grab the data from graph/data and use it. We start by putting the d3.js library file in the canonical vendor/assets/javascripts folder. However, for our app to actually use any of the libraries under vendor, we need to include them by adding a require directive to app/assets/javascripts/application.js:
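
(Assuming the vendored file is named d3.js so Sprockets can find it on the asset path, the directive is just one line.)

//= require d3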

With that out of the way we can finally write some d3:
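
(A sketch using the d3 v3 API; the shape of the JSON from graph/data, an array of {name, value} objects, is an assumption here.)

// app/assets/javascripts/graph.js -- fetch the data and draw a bar chart into #graph.
d3.json('graph/data', function (error, data) {
  if (error) { throw error; }

  var width = 400, height = 200;
  var barWidth = width / data.length;

  var y = d3.scale.linear()
    .domain([0, d3.max(data, function (d) { return d.value; })])
    .range([0, height]);

  d3.select('#graph')
    .attr('width', width)
    .attr('height', height)
    .selectAll('rect')
    .data(data)
    .enter().append('rect')
    .attr('x', function (d, i) { return i * barWidth; })
    .attr('width', barWidth - 2)
    .attr('y', function (d) { return height - y(d.value); })
    .attr('height', function (d) { return y(d.value); });
});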

Pretty simple stuff. We will make an AJAX request to graph/data and then render a bar chart.

Finally, the last piece is our graph/index template; all we need there is an svg element with id graph:

And we’re done! Rails stitches together the rest for us. Feel free to clone this as a project seed:
https://github.com/qwwqwwq/rails_d3_seed

AWS Key Management Service

AWS announced a lot of cool new services at their re:Invent 2014 conference. Among these was their Key Management Service (KMS). The main idea behind this service is that managing your own secrets can be hard; now you can just let AWS do it for you.

The console interface to this service lives inside of the Identity and Access Management console:

IAM_location

I guess it is not important enough to warrant its own icon in the services tray yet. The KMS interface looks something like this:

KMS_interface

The main actions you can perform through this UI are creating keys and toggling existing keys between enabled and disabled. When you create a key, you define a pretty simple key policy by selecting which users will be allowed to administer the key and which users will be able to use it. Administrators are allowed to do things like disable the key, while users can only use the key for encryption and decryption. There is also the concept of grants, which allow one user to grant another user access to a KMS key.

Currently it seems like the only way to actually access the API for KMS is through the AWS Java SDK. Hopefully support will be added to the Python SDK (boto) soon. I hacked together a quick class to demonstrate a proof-of-concept for this service:

Which should return output like this:

One interesting thing is the concept of encryption context. As far as I can tell, encryption context is just a string-to-string map which you pass to KMS along with the plaintext you are encrypting. The eventual decryption call must pass the same encryption context to decrypt that ciphertext, or else the decryption will fail. It seems your encryption context cannot be empty; this is the reason for the {"hello": "goodbye"} map that I pass in my example. This was pretty confusing at first, as I was getting 400: Invalid EncryptionContext responses from the API for seemingly no reason. I'm not sure what the reasoning behind this is, or if I am just doing something wrong.

Also on the topic of encryption context: apparently there are two kinds of encryption context, the type used by the simple Encrypt/Decrypt endpoints and that used by the Grant endpoints. The main difference is that the former integrity-checks the context but does not incorporate it into the cipher, while the latter does incorporate it into the cipher. This allows you to require a certain encryption context when you make a grant to another user, of which the docs give one interesting example:

As an example, consider Amazon EBS. When an Amazon EBS volume is attached to an Amazon EC2 instance, a grant is created that allows only that instance to decrypt only that volume. This is accomplished by encoding the volume ID in the encryption context passed to the CreateGrant function. Without the encryption context, the Amazon EC2 instance would require access to all volumes encrypted under the key before it could decrypt a specific volume.

Pretty exciting stuff, KMS is clearly going to be a huge win for security conscious apps.

Algorithmic Detection of Header Rows

In text-based tabular data formats like CSV, a header row may or may not be included, and there is no reliable way to know in advance. What I propose is a probabilistic method for checking whether a text-based tabular file begins with a header row. Essentially it boils down to this: get the character frequencies in the first row and compare them to the character frequencies in the rest of the document. Calculate a permutation p-value by performing every such one-against-all comparison. If the first row's p-value beats alpha, it is a header.

Take the following example:

first_name last_name age height
john smith 23 185cm
jane doe 44 166cm
alvin harris 56 150cm
christine brown 22 176cm

A histogram of character frequencies for the first row vs. the rest of the rows looks like this:

These differences are often pronounced enough to be detected reliably, especially if the header row is free-form English words or abbreviations and the data is mostly numeric. It is probably not totally reliable, but if you can make some assumptions about your data, this type of technique could help automate data tasks.

The code to implement this sort of header detection looks like this:

Here I have chosen the Bhattacharyya distance as my distance metric for the character frequencies, but any distance metric probably works fine since we are using a permutation p-value. Source code in the style of a Python command-line utility is available on GitHub: determine_header.py
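
To make the one-against-all comparison concrete, here is a rough sketch of the same idea in JavaScript (distinct from the Python utility above); note that with n rows the permutation p-value can never be smaller than 1/n, so alpha has to be chosen with the number of rows in mind:

// Character frequency distribution of a string: char -> probability.
function charFrequencies(text) {
  var counts = {};
  for (var i = 0; i < text.length; i++) {
    counts[text[i]] = (counts[text[i]] || 0) + 1;
  }
  var freqs = {};
  for (var ch in counts) {
    freqs[ch] = counts[ch] / text.length;
  }
  return freqs;
}

// Bhattacharyya distance between two discrete distributions.
function bhattacharyya(p, q) {
  var coefficient = 0;
  for (var ch in p) {
    if (q[ch]) {
      coefficient += Math.sqrt(p[ch] * q[ch]);
    }
  }
  return -Math.log(coefficient);
}

// One-against-all distance for every row, then a permutation p-value for row 0.
function hasHeader(rows, alpha) {
  var distances = rows.map(function (row, i) {
    var rest = rows.filter(function (_, j) { return j !== i; }).join('');
    return bhattacharyya(charFrequencies(row), charFrequencies(rest));
  });
  var atLeastAsExtreme = distances.filter(function (d) {
    return d >= distances[0];
  }).length;
  return atLeastAsExtreme / distances.length <= alpha;
}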

Setting Up NuoDB on AWS

NuoDB is a scalable RDBMS that supports most of the functionality of MySQL. The appealing thing is that you can swap NuoDB in for MySQL when your application needs to scale up, without rethinking too much about how you interact with your database. I recently tried out NuoDB, and I thought I would share how I got it up and running on AWS, as the documentation I've seen has been kind of inconsistent and it took a little hacking around to get everything to work.

Step 1:

Obtain a NuoDB developer license here. Without this the multi-host configuration will not be functional.

Step 2:

On your EC2 dashboard go to Launch Instance > AWS Marketplace > search for NuoDB, then choose your instance size and region and spin it up. This will start up a UNIX instance with NuoDB already installed and running. Generate a private key for these instances if you don't have one already. For this example I'm using three instances, which I will refer to as A, B, and C. It's helpful if you label them as such in your dashboard, which should look something like this:

Step 3:

Under Network and Security > Security Groups there should be an automatically generated security group NuoDB-1-2-AutogenByAWSMP-. Select that group and under the Inbound tab create a custom TCP rule for port range 48005-48100, with the source set to the private IP of the instance. Do this for instances A and B, which will be our “Broker” instances.

Step 4:

Install your license on all instances. SSH into each instance using your private key; it's helpful to add the NuoDB executables to your path:

Copy your license key into a text file somewhere on each instance. The file should contain only the license key. Then we need to start up the NuoDB manager, which you can do like so (shown with the default credentials):

This statement will not return anything, but you can verify the license has been installed with:

Step 5:

Now we need to configure each of our instances. For each instance, set the config file at /opt/nuodb/etc/default.properties to the following:

Instance A:

Instance B:

Instance C:

Once each config file is set, relaunch the nuoagent on each instance in this order: A, B, C. You can do this with the following command:

Step 6:

Now that our hosts are set up to talk to one another, we need to give them some tasks. We need to assign one host to be a storage manager and another to be a transaction engine. We do this again through nuodbmgr; execute these commands on either of the broker instances (A or B).

I found that unless the archive is in /var/opt/nuodb/production-archives/ the storage manager will silently fail to start. Perhaps this is a result of some other misconfiguration, but it is not referenced in the NuoDB docs.

Next we start our transaction engine, from the same interactive nuodbmgr session:

At this point we have an empty database running. From the same instance you ran nuodbmgr on, you can check it out for yourself:

There should be no tables, as the database is empty. If you want, at this point you can load the “Hello World” data that is included with the NuoDB distro; it is a bunch of ice hockey statistics.

Step 7:

The reason we have configured two brokers in this example is so that you can connect to the database externally. The external broker (Instance A) advertises its external DNS name and communicates with the internal broker (Instance B), which also doubles as the transaction engine, and which in turn communicates with the storage manager (Instance C), where the data actually lives. Externally you can connect to the database via:

NuoDB also offers bindings and drivers in a variety of languages, in addition to the command-line tool nuosql for interacting with your database.