Getting started with boto and Glacier

Amazon recently released Glacier, a new web service designed to store rarely accessed data. Thanks to boto, a Python interface to Amazon Web Services, it's easy to store and retrieve archives from Glacier.

If you have never heard about Amazon Glacier you should read the Amazon Glacier FAQ and the Amazon Glacier developer guide.

The basics

In Glacier, a backed-up file is an archive stored in a vault. To make an analogy with Amazon S3, an archive is like a key and a vault is like a bucket.

To download an archive, or even to get the vault inventory, you must first initiate a job, which completes within roughly 3-5 hours. You can optionally be notified via the Amazon Simple Notification Service (SNS) when the job is done; then you can download the result.
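Since a retrieval job takes hours, the usual pattern is to poll the job's completed flag (or rely on SNS). Here is a minimal, self-contained sketch of that polling loop; it uses a made-up stand-in job object instead of a real boto job so it runs without AWS credentials:

```python
import time


class FakeJob(object):
    """Made-up stand-in for a boto Glacier job; completes after a few polls."""
    def __init__(self, polls_until_done=3):
        self._remaining = polls_until_done

    @property
    def completed(self):
        # A real job would hit the Glacier API; this one just counts down
        self._remaining -= 1
        return self._remaining <= 0


def wait_for_job(job, poll_interval=10, sleep=time.sleep):
    """Poll job.completed until the job is done, sleeping between checks.

    Returns the number of times we had to sleep.
    """
    polls = 0
    while not job.completed:
        sleep(poll_interval)
        polls += 1
    return polls


# Injecting a no-op sleep so the example finishes instantly
print(wait_for_job(FakeJob(), sleep=lambda seconds: None))  # prints 2
```

With a real boto job you would refresh it with `vault.get_job(job_id)` on each iteration, as the class further down does.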

Also, Amazon specifies that you should maintain your own inventory.

Getting started with boto

Here is the bare minimum needed to store and retrieve an archive. You should also check the API Reference.

import boto

ACCESS_KEY_ID = "XXXXX"
SECRET_ACCESS_KEY = "XXXXX"

# boto.connect_glacier is a shortcut that returns a Layer2 instance
glacier_connection = boto.connect_glacier(aws_access_key_id=ACCESS_KEY_ID,
                                    aws_secret_access_key=SECRET_ACCESS_KEY)

vault = glacier_connection.create_vault("myvault")

# Uploading an archive
# ====================

# You must keep track of the archive_id
archive_id = vault.upload_archive("mybackup.tgz")

# Retrieving an archive
# =====================

# You must initiate a job to retrieve the archive
retrieve_job = vault.retrieve_archive(archive_id)

# or if the job is pending (with job_id = retrieve_job.id)
# retrieve_job = vault.get_job(job_id)

# You can check if the job is completed either manually, or via Amazon SNS
if retrieve_job.completed:
    retrieve_job.download_to_file("mybackup.tgz")

That's it!

Keeping track of the inventory

I chose to use shelve to store both the inventory and the waiting jobs.

Here is a simple class that can help you get started:
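If you have never used shelve, it is a persistent dictionary backed by a file. A quick, self-contained sketch of the pattern used below (the path and archive id are made up for illustration):

```python
import os
import shelve
import tempfile

# Made-up location for this sketch; the class below uses ~/.glaciervault.db
db_path = os.path.join(tempfile.mkdtemp(), "glaciervault.db")

# Store the filename => archive_id mapping
d = shelve.open(db_path)
d["archives"] = {"mybackup.tgz": "fake-archive-id"}
d.close()

# Reopen later and read it back, as a retrieval would
d = shelve.open(db_path)
archives = d.get("archives", {})
d.close()

print(archives["mybackup.tgz"])  # prints fake-archive-id
```

The class below wraps the open/close dance in a context manager so every access is that simple.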

(gist available here)

# encoding: utf-8
import os
import shelve
import boto.glacier
import boto
from boto.glacier.exceptions import UnexpectedHTTPResponseError

ACCESS_KEY_ID = "XXXXXXXXXXXXX"
SECRET_ACCESS_KEY = "XXXXXXXXXXX"
SHELVE_FILE = os.path.expanduser("~/.glaciervault.db")


class glacier_shelve(object):
    """
    Context manager for shelve
    """

    def __enter__(self):
        self.shelve = shelve.open(SHELVE_FILE)

        return self.shelve

    def __exit__(self, exc_type, exc_value, traceback):
        self.shelve.close()


class GlacierVault:
    """
    Wrapper for uploading/downloading archives to/from an Amazon Glacier vault.
    Makes use of shelve to store the archive id corresponding to each filename, along with waiting jobs.

    Backup:
    >>> GlacierVault("myvault").upload("myfile")

    Restore:
    >>> GlacierVault("myvault").retrieve("myfile")

    or to wait until the job is ready:
    >>> GlacierVault("myvault").retrieve("serverhealth2.py", True)
    """
    def __init__(self, vault_name):
        """
        Initialize the vault
        """
        layer2 = boto.connect_glacier(aws_access_key_id = ACCESS_KEY_ID,
                                    aws_secret_access_key = SECRET_ACCESS_KEY)

        self.vault = layer2.get_vault(vault_name)


    def upload(self, filename):
        """
        Upload filename and store the archive id for future retrieval
        """
        archive_id = self.vault.create_archive_from_file(filename, description=filename)

        # Storing the filename => archive_id data.
        with glacier_shelve() as d:
            if "archives" not in d:
                d["archives"] = dict()

            archives = d["archives"]
            archives[filename] = archive_id
            d["archives"] = archives

    def get_archive_id(self, filename):
        """
        Get the archive_id corresponding to the filename
        """
        with glacier_shelve() as d:
            if "archives" not in d:
                d["archives"] = dict()

            archives = d["archives"]

            if filename in archives:
                return archives[filename]

        return None

    def retrieve(self, filename, wait_mode=False):
        """
        Initiate a Job, check its status, and download the archive when it's completed.
        """
        archive_id = self.get_archive_id(filename)
        if not archive_id:
            return
        
        with glacier_shelve() as d:
            if "jobs" not in d:
                d["jobs"] = dict()

            jobs = d["jobs"]
            job = None

            if filename in jobs:
                # The job is already in shelve
                job_id = jobs[filename]
                try:
                    job = self.vault.get_job(job_id)
                except UnexpectedHTTPResponseError: # Returns a 404 if the job is no longer available
                    pass

            if not job:
                # Job initialization
                job = self.vault.retrieve_archive(archive_id)
                jobs[filename] = job.id
                job_id = job.id

            # Committing changes in shelve
            d["jobs"] = jobs

        print "Job {action}: {status_code} ({creation_date}/{completion_date})".format(**job.__dict__)

        # Checking manually if the job is completed every 10 seconds instead of using Amazon SNS
        if wait_mode:
            import time
            while True:
                job = self.vault.get_job(job_id)
                if not job.completed:
                    time.sleep(10)
                else:
                    break

        if job.completed:
            print "Downloading..."
            job.download_to_file(filename)
        else:
            print "Not completed yet"

Bakthat

You may also want to check out bakthat, a Python tool I wrote that allows you to compress, encrypt (symmetric encryption), and upload files directly to Amazon S3/Glacier. You can use it either from the command line or as a Python module.

Your feedback

Don't hesitate to ask if you have any questions!


© Thomas Sileo. Powered by Pelican.