Space Vatican

Ramblings of a curious coder

Using Glacier From Ruby With Fog

Glacier is Amazon’s latest storage service, designed for long term storage of data. It’s not a direct replacement for S3: retrieving data is slow and potentially expensive, so stay away if you expect to retrieve data often or if waiting a few hours for retrieval is too slow.

One good use case would be if you have to keep data for a very long time for regulatory reasons. Storage is a lot cheaper than S3: it costs $0.01/GB/month. By comparison, S3 charges $0.11/GB if you store less than 50TB, and its 1-5PB rate is $0.08/GB (more on pricing later). In case the long-termness needs hammering in, Amazon actually charges a fee for data deleted within 90 days of upload.

As of early September 2012 the AWS ruby sdk doesn’t include glacier support. If like me you want to use glacier from your ruby apps, one of your options is the fog gem that supports glacier from version 1.6 onwards.

Basic concepts

Glacier centers around 3 concepts: vaults, archives and jobs.

A vault is a collection of many archives. An archive is a single file stored in a vault, and jobs are how you ask Glacier to perform retrieval operations. Fog’s glacier support includes models, so creating a vault is done like this:

glacier = Fog::AWS::Glacier.new
vault = glacier.vaults.create :id => 'myvault'
#=>   <Fog::AWS::Glacier::Vault
# id="myvault",
# created_at=2012-09-03 19:22:47 UTC,
# last_inventory_at=nil,
# number_of_archives=0,
# size_in_bytes=0,
# arn="arn:aws:glacier:us-east-1:1234567890123456:vaults/myvault"

The stats in the vault description are only updated once a day, so don’t be alarmed when you upload files and the numbers don’t change. Even the daily update can be a bit out of date: I uploaded several hundred archives over the course of a day, and the stats shown the next morning (supposedly updated at midnight) only accounted for around half of them.

Getting data into Glacier is easy:

vault.archives.create(:body => 'data', :description => 'my first archive')
# =>   <Fog::AWS::Glacier::Archive
# id="gTesVWGCyKvv07zdXiYnbjvjIBpWKPi_F80dLklgpGSfoE_zj4htSy8mNuZNuMrGIzKT8WNgmKtirq_ZxL3bYj3G-3nw",
# description="my first archive",
# body="data"

Unless you’re tracking metadata externally you’ll probably want to set the description to something meaningful or you’ll have no idea what’s in the archive.

For larger files you’ll want to use multipart mode by supplying a file (or IO instance) as the body and the chunk size to use (must be a power of 2 multiple of 1MB).

vault.archives.create :body => File.new('somefile'), :multipart_chunk_size => 1024*1024

With a smaller part size you’ll waste less time if you need to retry chunks and you’ll use less memory but there will be more overhead from opening a connection every 1MB of the way. There is also a hard limit of 10000 chunks per upload, so you’ll need bigger chunks for large uploads. Fog uploads parts one at a time but if you drop down to the raw api you can upload parts in any order you want, and multiple parts at a time.
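The trade-off above suggests picking the smallest chunk size that still fits under the 10000 chunk limit. Here is a minimal sketch of that calculation; the helper name smallest_chunk_size is made up for illustration and is not part of fog, and it does not check any upper bound on the chunk size.

```ruby
MEGABYTE = 1024 * 1024
MAX_PARTS = 10_000 # hard limit on chunks per upload, as described above

# Hypothetical helper (not part of fog): returns the smallest power-of-2
# multiple of 1MB that lets file_size bytes fit within MAX_PARTS chunks.
def smallest_chunk_size(file_size)
  chunk_size = MEGABYTE
  chunk_size *= 2 while file_size > chunk_size * MAX_PARTS
  chunk_size
end

puts smallest_chunk_size(5 * 1024 * MEGABYTE) / MEGABYTE   # 5GB file
puts smallest_chunk_size(100 * 1024 * MEGABYTE) / MEGABYTE # 100GB file
```

The result could then be passed as :multipart_chunk_size, computing the size from File.size of the file you’re uploading.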

Jobs

Glacier stores the bulk of its data in some mysterious form of storage that is cheap and durable but very slow. To get your data you create a job: this asks amazon to place a copy of that data on fast storage. You can then access that copy of your data with an http request.

If you want to know what’s in a vault, arm yourself with patience and create an inventory retrieval job:

vault.jobs.create(:type => Fog::AWS::Glacier::Job::INVENTORY)
# =>   <Fog::AWS::Glacier::Job
#    id="SsRqiT0MNuA9bZy6EhCN9klYXOHtuiOsxZHa0vjGna-zQqTLCY4nbHpMLt-12aL4kZryUjoCvd-a",
#    action="InventoryRetrieval",
#    ...

After 3-4 hours the job will complete and you can request its output. The get_output method returns an Excon::Response whose body is the data you’re interested in: a json document describing the vault’s contents.

job.get_output.body
=>
{"VaultARN"=>"arn:aws:glacier:us-east-1:1234567890123456:vaults/vault1",
 "InventoryDate"=>"2012-09-04T00:02:26Z",
 "ArchiveList"=>
  [{"ArchiveId"=>
     "PXK1d6pQPtNDkoLxAtuzyvjrV7pVXiakLgMMZ7p0i7b2slXTXV7JP2mVFDc-mgOMkhwn7lA0Izn5jFKxYIKAQoHzA",
    "ArchiveDescription"=>"",
    "CreationDate"=>"2012-09-01T17:01:09Z",
    "Size"=>4,
    "SHA256TreeHash"=>
     "3a6eb0790f39ac87c94f3856b2dd2c5d110e6811602261a9a923d3bb23adc8b7"},
   {"ArchiveId"=>
     "kbB38fFWfaWvhUG7H-iBg8vPqr0Y1uwcY0nTxw4ETOJnOBudWhrTy3IOTxUKkmRHdmIkHt2e-e-OFmfvndj2T_REg",
    "ArchiveDescription"=>"",
    "CreationDate"=>"2012-09-02T00:14:37Z",
    "Size"=>9756860,
    "SHA256TreeHash"=>
     "a4b72323790e372e71e631c240dfa7aa4f43d78e5f49243a589ca3f2a87a14cd"}
  ]}

The vault inventory only gets updated once a day: the 4 hour job just retrieves the last computed inventory rather than calculating a fresh one. If you want something more up to date or with more information about the archives you’ll have to store that information outside of glacier. The list of vaults in your account is instantly accessible so if you don’t have too much data you could store each archive in its own vault (you’re limited to 1000 vaults per region).
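Once you have the inventory, it’s an ordinary Ruby hash, so summarising it is straightforward. The sketch below assumes the body has already been parsed into a hash (as the output above suggests) and uses a made-up, trimmed-down inventory with placeholder archive ids.

```ruby
require 'time'

# Made-up inventory in the same shape as the job output above.
inventory = {
  "InventoryDate" => "2012-09-04T00:02:26Z",
  "ArchiveList" => [
    { "ArchiveId" => "example-id-1", "ArchiveDescription" => "",
      "CreationDate" => "2012-09-01T17:01:09Z", "Size" => 4 },
    { "ArchiveId" => "example-id-2", "ArchiveDescription" => "db backup",
      "CreationDate" => "2012-09-02T00:14:37Z", "Size" => 9756860 }
  ]
}

archives = inventory["ArchiveList"]

# Total bytes stored, according to the last computed inventory.
total_bytes = archives.inject(0) { |sum, archive| sum + archive["Size"] }

# Ids of archives uploaded without a description.
undescribed = archives.select { |archive| archive["ArchiveDescription"].empty? }
                      .map { |archive| archive["ArchiveId"] }

# The most recently created archive.
newest = archives.max_by { |archive| Time.iso8601(archive["CreationDate"]) }

puts total_bytes
puts undescribed.inspect
puts newest["ArchiveId"]
```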

You can wait for the job to complete by polling, using the usual job.wait_for {ready?} idiom but you’re much better off using Glacier’s SNS integration. You can set this up on a per job basis:

vault.jobs.create :type => Fog::AWS::Glacier::Job::INVENTORY,
                  :sns_topic => 'arn:aws:sns:us-east-1:1234567890123456:mytopic'

or for all jobs linked to a vault:

vault.set_notification_configuration 'arn:aws:sns:us-east-1:1234567890123456:mytopic',
                                 %w(ArchiveRetrievalCompleted InventoryRetrievalCompleted)

When you get the notification, retrieve the job:

job = vault.jobs.get(job_id)

and then you can get its output. I’m not doing anything very fancy so I’ve just configured a topic that sends me an email, although obviously you could trigger automated processing of the job output off the back of this.
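If you did want to automate this, the handler mostly needs to pull the job id out of the notification. The sketch below is only illustrative: it assumes the notification message is a JSON document with JobId, Completed and StatusCode fields (modelled on Glacier’s job description, so check a real message before relying on it), and the sample payload and the completed_job_id helper are made up.

```ruby
require 'json'

# Hypothetical notification body: the field names are an assumption.
message = '{"Action":"InventoryRetrieval","Completed":true,' \
          '"JobId":"example-job-id","StatusCode":"Succeeded"}'

# Returns the job id if the message describes a successfully completed job,
# nil otherwise.
def completed_job_id(message)
  details = JSON.parse(message)
  return nil unless details["Completed"] && details["StatusCode"] == "Succeeded"
  details["JobId"]
end

job_id = completed_job_id(message)
puts job_id
# With a real vault you would then fetch the job as shown above:
#   job = vault.jobs.get(job_id) if job_id
```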

Because getting an inventory is so slow you might want to store your archive ids and what they correspond to outside of Glacier. For my backups I store metadata associated with archive ids in SimpleDB. I chose SimpleDB because it doesn’t have any fixed hourly costs (unlike DynamoDB for example) and I wanted it to be independent from any of the other datastores I use.

Retrieving archives

You retrieve an archive by creating an archive retrieval job.

vault.jobs.create(:type => Fog::AWS::Glacier::Job::ARCHIVE, :archive_id => '...' )

Again, be prepared to wait 3-4 hours for the job to complete. With archives you’ll often not want to load the whole archive into memory (which is what get_output does by default). Instead you can stream the output to an IO instance (such as a file):

File.open('myarchive', 'wb') do |f| # open for (binary) writing; the default mode is read-only
  job.get_output :io => f
end

You can also retrieve portions of a file. To only get the first 500 bytes of an archive:

File.open('myarchive', 'wb') do |f|
  # the range is an option to get_output, not to File.open
  job.get_output :io => f, :range => 0...500
end

Once completed jobs stick around for 24 hours. I’m not sure how this is supposed to work if you had an archive of the maximum size (40TB) allowed – to grab it all in 24 hours you’d have to average 485MB/s which would stress most network and storage systems. Even AWS import/export could be tricky – you’d have to ship them some sort of storage array.
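For the curious, the 485MB/s figure is just the archive size divided by the 24 hour window. A quick sketch of the arithmetic (the helper name is made up):

```ruby
# Average rate, in MB/s, needed to download size_bytes within the given
# number of hours.
def required_mb_per_second(size_bytes, hours)
  size_bytes / (1024.0 * 1024.0) / (hours * 3600)
end

puts required_mb_per_second(40 * 1024**4, 24).round # => 485
```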

Pricing

Storage pricing is fairly straightforward: there’s a flat per GB/month fee for any data you store ($0.01 in us-east-1, slightly more in other regions). Retrieval is where it gets interesting.

Obviously you pay the usual bandwidth charges if you’re transferring data out of the region (there is no bandwidth charge when transferring to EC2 instances in the same region). Then it appears simple: you can download 5% of your data per month for free and beyond that there is a $0.01/GB fee. In fact it’s rather complicated because it’s based on peak throughput, not total volume.

The monthly 5% allowance is converted into a daily allowance (~0.17%/day). If you burst through your daily allowance in any one day of the month then you’re billed (for the whole month) based on your peak hour in that day (I’ve no idea what happens if the amount of data stored changes substantially during a month).

If you retrieve 0.1% on one day and 0.1% the next day then you get charged nothing. If you were instead to retrieve 0.2% in one day, then Amazon would find your peak hour, deduct your prorated free usage and charge you for the remainder.

In the example on their pricing page, an account has 12TB stored so can retrieve 20GB per day for free, but bursts through that limit by retrieving 1GB/hour for 24 hours. The free allowance is 0.82GB/hour, so the chargeable volume is 0.18GB.

Amazon then charges you for that for every hour in the month: for a thirty day month you’d pay 720 * 0.18 * 0.01 = $1.30. Furthermore it would appear that the time period used is the time it takes Amazon to execute the job (i.e. 3-4 hours), not how long it takes for you to download the file. This would also imply that if you only download half the file you’re still paying for the whole file, because Amazon still had to transfer the whole file out of long term storage.

One consequence is that retrieving files one at a time is cheaper than retrieving them in parallel: if those 24 gigabytes were retrieved simultaneously and the jobs all took 4 hours then the peak rate would be 6GB/hour and, with the same 0.82GB/hour allowance, the monthly charge would be about $37.30 (720 * 5.18 * 0.01, assuming there weren’t any other periods of heavier usage that month) for retrieving the same bytes. Having had those expensive hours you can continue to retrieve data at that rate for the rest of the month for no extra cost.
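To make the comparison concrete, here is a rough sketch of the billing rule as I understand it from the pricing page: take the peak hourly rate, subtract the free hourly allowance, and charge the remainder for every hour of the month. The retrieval_charge helper is made up, and it uses the 0.82GB/hour allowance from Amazon’s example for both cases.

```ruby
FEE_PER_GB = 0.01 # retrieval fee in dollars, as quoted above

# Charge for a month given the peak retrieval rate and the free allowance,
# both in GB/hour. Nothing is owed if the peak stays within the allowance.
def retrieval_charge(peak_gb_per_hour, free_gb_per_hour, hours_in_month)
  billable = [peak_gb_per_hour - free_gb_per_hour, 0].max
  billable * FEE_PER_GB * hours_in_month
end

serial   = retrieval_charge(1.0, 0.82, 720) # 24GB at 1GB/hour
parallel = retrieval_charge(6.0, 0.82, 720) # 24GB in a single 4 hour burst

puts serial.round(2)
puts parallel.round(2)
```

Same bytes, same month, but the burst costs nearly thirty times as much.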

So try not to get too much data out, and if you do have to, do it in February of a non-leap year (since then you’ll only pay for 672 hours). It would also appear to be a good idea to store smallish archives so that you can be granular with your retrievals.

API

Of all the AWS APIs I’ve worked on for fog, Glacier’s was by far the easiest to work with. It’s more RESTful than some of the older ones, so slightly more predictable. Nearly all the API calls are idempotent: if a request fails (for example the network connection drops) it’s safe to retry exactly the same request without having to figure out whether the previous attempt succeeded. Best of all, the data it returns is JSON, so no XML parsing. Thanks, Amazon!
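Idempotency is what makes a blunt retry wrapper safe. Here’s a minimal sketch of one; with_retries is a made-up helper rather than anything fog provides, and the flaky request is simulated with a counter. In practice you would narrow the rescued classes to the network errors your HTTP client raises.

```ruby
# Hypothetical helper: runs the block up to max_attempts times, re-raising
# the last error if every attempt fails.
def with_retries(max_attempts, retriable = [StandardError])
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *retriable
    retry if attempts < max_attempts
    raise
  end
end

# Simulated flaky request: fails twice, then succeeds.
calls = 0
result = with_retries(3) do
  calls += 1
  raise IOError, "connection dropped" if calls < 3
  "ok"
end

puts result
puts calls
```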