Loading Avro to Redshift

Recently, AWS added support for bulk loading Avro files directly to Redshift. Previously, Redshift only supported delimited flat files, fixed-width files, and JSON as input formats. This could be a hassle for some workflows, as you might end up transforming your data into one of these formats just to get it into Redshift.

I ran a few tests of how bulk loading Avro to Redshift works. I started with a simple Avro schema:
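Something along these lines, a record named AllTypesTest with a field for each scalar Avro type plus the array and map fields discussed below (the exact field names here are illustrative):

```json
{
  "type": "record",
  "name": "AllTypesTest",
  "fields": [
    {"name": "int_field", "type": "int"},
    {"name": "long_field", "type": "long"},
    {"name": "float_field", "type": "float"},
    {"name": "double_field", "type": "double"},
    {"name": "boolean_field", "type": "boolean"},
    {"name": "string_field", "type": "string"},
    {"name": "array_field", "type": {"type": "array", "items": "string"}},
    {"name": "map_field", "type": {"type": "map", "values": "int"}}
  ]
}
```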

And what seemed to me the natural corresponding SQL table:
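A sketch of that DDL, assuming the illustrative field names above; the scalar types follow the obvious Avro-to-SQL mapping, and the complex fields get varchar columns (for the reason discussed below), with sizes picked arbitrarily:

```sql
CREATE TABLE all_types_test (
    int_field     INTEGER,
    long_field    BIGINT,
    float_field   REAL,
    double_field  DOUBLE PRECISION,
    boolean_field BOOLEAN,
    string_field  VARCHAR(256),
    array_field   VARCHAR(1024),
    map_field     VARCHAR(1024)
);
```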

Assuming we have an Avro data file that fits the above schema staged in S3, we can load our table using the following COPY command:
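A sketch of that COPY, with the S3 path and IAM role as placeholders:

```sql
COPY all_types_test
FROM 's3://my-bucket/path/to/data.avro'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS AVRO 'auto';
```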

The ‘auto’ format parameter seems to work the same as it does for bulk loading JSON: all top-level fields in the Avro schema are matched to their respective columns based on field name. Everything seems to just work, and the data ends up in our table in the form we would expect, converting properly between Avro and Postgres types.

One interesting thing I noticed is that if you have a top-level field with a complex data type such as {"name": "array_field", "type": {"type": "array", "items": "string"}} or {"name": "map_field", "type": {"type": "map", "values": "int"}}, and a table that includes a varchar column with the same name, the loader will serialize the value to JSON and store it in the varchar column. This is useful behavior, as Redshift has native JSON accessor functions that work pretty well.
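For example, once the array and map values are sitting in varchar columns as JSON, Redshift's built-in accessors can pull values straight out of them (the map key here is illustrative):

```sql
-- Pull the first element out of the JSON-serialized array column
SELECT json_extract_array_element_text(array_field, 0)
FROM all_types_test;

-- Pull a value out of the JSON-serialized map column by key
SELECT json_extract_path_text(map_field, 'some_key')
FROM all_types_test;
```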

I also tried an Avro/table combination with a blatant type mismatch (a string loaded into an int column), and the load fails with an error, as expected.

The next question is how the performance of bulk loading Avro compares to the other data formats. Using the above AllTypesTest Avro schema, I created three sets of test files with random data values, plus a CSV equivalent for each one, then benchmarked loading them into a Redshift cluster. My results were as follows:

Cluster Instance Type: dc1.large
Cluster Size: 1 node

| Records     | Avro load time | Avro size | CSV load time | CSV size |
|-------------|----------------|-----------|---------------|----------|
| 1,000,000   | 55.783 s       | 39 MB     | 8.87 s        | 65 MB    |
| 10,000,000  | 167.285 s      | 387 MB    | 28.825 s      | 654 MB   |
| 100,000,000 | 1520.653 s     | 3.8 GB    | 251.418 s     | 6.4 GB   |

As we can see, loading the exact same data from Avro takes about six times longer than from CSV. This is super disappointing, as I'm pretty sure deserialization performance for Avro should be much better than for CSV (see one benchmark here). However, the performance is not debilitatingly horrible. If your workflows are already Avro based, using Avro bulk loading could still be the right call, as it's just so much cleaner than adding a pre-load transformation step. Plus, in theory Avro should be faster, so I suspect that as AWS pushes backend updates we will see performance improve.

My code to generate the test Avro and CSV files was as follows:
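In outline, something like this (a minimal sketch using the fastavro library; the field names, record counts, and output paths are the illustrative ones from above):

```python
import csv
import json
import random
import string

from fastavro import parse_schema, writer

# Parsed Avro schema for the AllTypesTest record shown earlier.
SCHEMA = parse_schema({
    "type": "record",
    "name": "AllTypesTest",
    "fields": [
        {"name": "int_field", "type": "int"},
        {"name": "long_field", "type": "long"},
        {"name": "float_field", "type": "float"},
        {"name": "double_field", "type": "double"},
        {"name": "boolean_field", "type": "boolean"},
        {"name": "string_field", "type": "string"},
        {"name": "array_field", "type": {"type": "array", "items": "string"}},
        {"name": "map_field", "type": {"type": "map", "values": "int"}},
    ],
})

FIELD_NAMES = [f["name"] for f in SCHEMA["fields"]]


def random_string(rng, n=10):
    return "".join(rng.choice(string.ascii_letters) for _ in range(n))


def record_stream(n_records, seed):
    # Seeded generator, so the Avro and CSV passes produce identical
    # data without holding every record in memory.
    rng = random.Random(seed)
    for _ in range(n_records):
        yield {
            "int_field": rng.randint(-(2**31), 2**31 - 1),
            "long_field": rng.randint(-(2**63), 2**63 - 1),
            "float_field": rng.random(),
            "double_field": rng.random(),
            "boolean_field": rng.random() < 0.5,
            "string_field": random_string(rng),
            "array_field": [random_string(rng, 5) for _ in range(3)],
            "map_field": {random_string(rng, 5): rng.randint(0, 100) for _ in range(3)},
        }


def generate(n_records, avro_path, csv_path, seed=42):
    # Write the Avro data file.
    with open(avro_path, "wb") as f:
        writer(f, SCHEMA, record_stream(n_records, seed))

    # Write the CSV equivalent; complex fields are serialized to JSON,
    # mirroring how Redshift stores them in varchar columns.
    with open(csv_path, "w", newline="") as f:
        w = csv.writer(f)
        for rec in record_stream(n_records, seed):
            w.writerow(
                json.dumps(rec[name]) if isinstance(rec[name], (list, dict))
                else rec[name]
                for name in FIELD_NAMES
            )


if __name__ == "__main__":
    generate(1_000_000, "test.avro", "test.csv")
```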