Thursday, December 01, 2011

Amazon.com's Simple Storage Service and Hayloft

Information technology is like fashion: you live long enough, you see it change many times. And sometimes the new stuff looks kinda like the old stuff. But that doesn't mean you don't get excited about it. Change is inevitable; you might as well learn to enjoy it.

I'm watching the IT world evolve to a model of handheld, battery-powered consumer devices, which these days we're calling smartphones and tablets, communicating wirelessly with enormous distributed computing systems, which we now call cloud computing. I'm all excited. For lots of reasons. Not the least of which is the fact that my professional career over the past thirty-five years has wobbled back and forth between developing software for small embedded systems and for high performance computers, gigantic distributed systems, and server farms. Turns out, the skill sets are the same. I am not the first person to have noticed this.

My most recent project is to develop Hayloft, a C++ object oriented interface to Amazon Web Services (AWS) Simple Storage Service (S3). My goal is for Hayloft to make it easy to use S3 from embedded devices. S3 is Amazon.com's web-based storage service that can (if you do it right) reliably and inexpensively store blobs of arbitrary data. How much data? A single blob, or object, can be up to five terabytes in size. And you can store a lot of objects. Objects are organized into buckets. Each bucket can be thought of as a web site with its own domain name, which means bucket names have to conform to the internet's Domain Name System (DNS) syntax. Each object is independently addressable via its own URL. To S3, these objects are simply opaque data files in the classic mass storage system reference model sense: they can range from HTML pages to scientific databases to virtual machine images to what have you.
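
To make the addressing concrete: an object's URL is just its bucket's DNS name prefixed onto the S3 endpoint, with the object key as the path. Here's a little sketch of my own (not part of Hayloft or libs3) that composes the virtual-hosted-style URL, using the same bucket and object names as the examples below.

#include <cstdio>
#include <string>

// Compose the virtual-hosted-style URL for an S3 object: the bucket name
// becomes the leftmost DNS labels, which is why it has to be DNS-legal.
static std::string objecturl(const std::string & bucket, const std::string & key) {
    return "http://" + bucket + ".s3.amazonaws.com/" + key;
}

int main(void) {
    // Prints http://bucket.hayloft.diag.com.s3.amazonaws.com/Object.txt
    std::printf("%s\n", objecturl("bucket.hayloft.diag.com", "Object.txt").c_str());
    return 0;
}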

(Indeed, I strongly suspect you could build a distributed mass storage system from S3 just as Cycle Computing built a 10,000-core distributed supercomputer from Amazon.com's Elastic Compute Cloud (EC2) for one of their customers. I'm pretty excited about that, too.)

Why would you want to access S3 from an embedded device? Can you think of any applications for an infinitely large, internet-accessible, web-addressable, reliable, secure storage system? Whether you are building clever consumer applications for the commercial space, developing sensor platforms for the defense and intelligence domain, or creating information processing applications for the industrial, financial, or medical markets, if the thought of this doesn't make your hands shake, you aren't thinking clearly. Seriously. Terabytes of storage are just a wireless connection away from your handheld and embedded applications.

Here's a little taste of what Hayloft is like. These examples are taken directly from the unit test suite included with Hayloft, with all the EXPECT and ASSERT statements removed. I've also removed all of the error recovery and consistency convergence code, which in any practical circumstance is very necessary; I'll talk more about that in later articles. Hayloft presents both a synchronous and an asynchronous interface. These examples use the simpler synchronous interface: when the C++ constructor completes, all the S3 work is done.

Here's a code snippet that creates a new bucket and writes a local file into an object in it. The user key id and secret access key, which are sort of like your login and password for AWS, and which are provided by Amazon.com, are in environment variables. Also in an environment variable is a bucket suffix that is appended to all bucket names to make them globally unique; I use my internet domain name. All of the other S3 parameters, like location constraint, endpoint, access control list, and so forth, can be omitted, because for experimenting the defaults are reasonable. Input and output are handled using Desperado I/O functors.

BucketCreate create("bucket");
PathInput input("./oldfile.txt");
Size bytes = size(input);
ObjectPut put("Object.txt", create, input, bytes);

By default, buckets and objects are created by Hayloft with private access. With just a few more lines you can specify public read access, so that you can retrieve the object with an ordinary web browser (and I have).

AccessPublicRead access;
Context context;
context.setAccess(access);
BucketCreate create("bucket", context);
Properties properties;
properties.setAccess(access);
PathInput input("./oldfile.txt");
Size bytes = size(input);
ObjectPut put("Object.txt", create, input, bytes, properties);

This object can now be accessed using the following URL.

http://bucket.hayloft.diag.com.s3.amazonaws.com/Object.txt
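
And since libs3, and therefore Hayloft, already sits on top of cURL, nothing stops you from fetching a publicly readable object with a few lines of libcurl yourself. A minimal sketch, with the error handling pared down just like the other examples:

#include <cstdio>
#include <curl/curl.h>

// Fetch the publicly readable object over plain HTTP and copy it to stdout.
int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (curl != 0) {
        curl_easy_setopt(curl, CURLOPT_URL, "http://bucket.hayloft.diag.com.s3.amazonaws.com/Object.txt");
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, stdout); // default write callback fwrites to this FILE *
        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK) {
            std::fprintf(stderr, "curl_easy_perform: %s\n", curl_easy_strerror(rc));
        }
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return 0;
}

Link it with -lcurl and you have a one-file S3 reader for anything you've made publicly readable.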

Here's a code snippet that reads the object back into a local file, then deletes the object.

PathOutput output("./newfile.txt");
ObjectGet get("Object.txt", create, output);
ObjectDelete delete("Object.txt", create);
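
The unit tests check the results with the EXPECT and ASSERT statements I removed. Outside a test framework, a byte-for-byte comparison of the two local files is one way to convince yourself the round trip worked. A small sketch using nothing but the standard library:

#include <fstream>
#include <sstream>
#include <string>

// Compare two local files byte for byte, e.g. ./oldfile.txt versus ./newfile.txt.
static bool identical(const char * path1, const char * path2) {
    std::ifstream file1(path1, std::ios::binary);
    std::ifstream file2(path2, std::ios::binary);
    if ((!file1) || (!file2)) { return false; }
    std::ostringstream data1;
    std::ostringstream data2;
    data1 << file1.rdbuf();
    data2 << file2.rdbuf();
    return data1.str() == data2.str();
}

Comparing ./oldfile.txt and ./newfile.txt after the get (before or after the delete, since both files are local) should show them identical.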

Other than some #include statements for header files, and some curly brackets and what not, that's about all there is to it, if you ignore (at your peril) error recovery. Here's a snippet that gets a table of contents from an existing bucket, then uses the C++ Standard Template Library (STL) to iterate through it and print the object names.

BucketManifest manifest("bucket");
BucketManifest::Manifest::const_iterator here = manifest.getManifest().begin();
BucketManifest::Manifest::const_iterator there = manifest.getManifest().end();
while (here != there) {
const char * key = here->first.c_str();
printf("%s\n", key);
++here;
}

While using Hayloft is easy, installing it may take a few minutes. Hayloft is built for GNU and Linux on top of Desperadito, my C++ systems programming library (a subset of the much larger Desperado), and libs3, Bryan Ischo's excellent C-based S3 library. libs3 is built on top of cURL, OpenSSL, and libxml2, all of which can probably be acquired through the standard package manager on your Linux system. The unit tests for Hayloft are built with Google Test (or Google Mock, which includes Google Test), Google's outstanding C++ unit testing framework, and Lariat, my thin layer over Google Test.

But once you get past the installation, starting to play with S3 in C++ applications is as simple as pulling out your credit card, signing up for an AWS account, and writing a few lines of code.
