Amazon recently released Glacier, a new web service designed for storing rarely accessed data. Thanks to boto, a Python interface to Amazon Web Services, it's very easy to store and retrieve archives from Glacier.
If you have never heard of Amazon Glacier, you should read the Amazon Glacier FAQ and the Amazon Glacier developer guide.
The basics
With Glacier, a backed-up file is an archive stored in a vault. To draw an analogy with Amazon S3, an archive is like a key and a vault is like a bucket.
To download an archive, and even to get the vault's inventory, you must first initiate a job, which completes within 3-5 hours; you can optionally get notified via the Amazon Simple Notification Service (SNS). Once the job is completed, you can download the result.
Also, Amazon specifies that you should maintain your own inventory.
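Since Glacier never gives you synchronous access to your archive list, it pays to keep a local mapping from day one. Here is a minimal sketch using a plain JSON file (the file path and helper names are my own illustration, not part of boto):

```python
import json
import os

# Hypothetical location for the local inventory file
INVENTORY_FILE = os.path.expanduser("~/.glacier_inventory.json")


def load_inventory(path=INVENTORY_FILE):
    """Return the filename => archive_id mapping, or an empty dict."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def record_archive(filename, archive_id, path=INVENTORY_FILE):
    """Remember which archive_id corresponds to an uploaded file."""
    inventory = load_inventory(path)
    inventory[filename] = archive_id
    with open(path, "w") as f:
        json.dump(inventory, f, indent=2)
```

The class shown later in this post does the same thing with shelve; the idea is identical, only the storage format differs.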
Getting started with boto
Here is the bare minimum needed to store and retrieve an archive. You should also check the API Reference.
import boto

ACCESS_KEY_ID = "XXXXX"
SECRET_ACCESS_KEY = "XXXXX"

# boto.connect_glacier is a shortcut that returns a Layer2 instance
glacier_connection = boto.connect_glacier(aws_access_key_id=ACCESS_KEY_ID,
                                          aws_secret_access_key=SECRET_ACCESS_KEY)

vault = glacier_connection.create_vault("myvault")

# Uploading an archive
# ====================

# You must keep track of the archive_id
archive_id = vault.upload_archive("mybackup.tgz")

# Retrieving an archive
# =====================

# You must initiate a job to retrieve the archive
retrieve_job = vault.retrieve_archive(archive_id)

# or, if the job is pending (with job_id = retrieve_job.id):
# retrieve_job = vault.get_job(job_id)

# You can check whether the job is completed either manually or via Amazon SNS
if retrieve_job.completed:
    retrieve_job.download_to_file("mybackup.tgz")
That's it!
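Since a retrieval job takes hours, you will typically poll its status if you don't use SNS. The poll-and-download loop can be sketched independently of boto; in this sketch, `get_job` stands for any callable such as `vault.get_job`, and the helper only assumes the returned object has a boolean `completed` attribute (the function name and `max_attempts` safeguard are my own):

```python
import time


def wait_for_job(get_job, job_id, poll_interval=10, max_attempts=None):
    """Poll a Glacier job until it is completed, then return it.

    get_job: a callable taking a job id, e.g. vault.get_job.
    """
    attempts = 0
    while True:
        job = get_job(job_id)
        if job.completed:
            return job
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError(
                "job {0} still pending after {1} polls".format(job_id, attempts))
        time.sleep(poll_interval)
```

This is essentially what the `wait_mode` branch of the class below does inline.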
Keeping track of the inventory
I chose to use shelve to store both the inventory and pending jobs.
Here is a simple class that can help you get started:
# encoding: utf-8
import os
import shelve
import time

import boto
from boto.glacier.exceptions import UnexpectedHTTPResponseError

ACCESS_KEY_ID = "XXXXXXXXXXXXX"
SECRET_ACCESS_KEY = "XXXXXXXXXXX"

SHELVE_FILE = os.path.expanduser("~/.glaciervault.db")


class glacier_shelve(object):
    """
    Context manager for shelve
    """
    def __enter__(self):
        self.shelve = shelve.open(SHELVE_FILE)
        return self.shelve

    def __exit__(self, exc_type, exc_value, traceback):
        self.shelve.close()


class GlacierVault:
    """
    Wrapper for uploading/downloading archives to/from an Amazon Glacier vault.

    Makes use of shelve to store the archive id corresponding to each filename,
    along with pending jobs.

    Backup:
    >>> GlacierVault("myvault").upload("myfile")

    Restore:
    >>> GlacierVault("myvault").retrieve("myfile")

    or, to wait until the job is ready:
    >>> GlacierVault("myvault").retrieve("serverhealth2.py", True)
    """
    def __init__(self, vault_name):
        """
        Initialize the vault
        """
        layer2 = boto.connect_glacier(aws_access_key_id=ACCESS_KEY_ID,
                                      aws_secret_access_key=SECRET_ACCESS_KEY)
        self.vault = layer2.get_vault(vault_name)

    def upload(self, filename):
        """
        Upload filename and store the archive id for future retrieval
        """
        archive_id = self.vault.create_archive_from_file(filename, description=filename)

        # Storing the filename => archive_id mapping
        with glacier_shelve() as d:
            if "archives" not in d:
                d["archives"] = dict()

            archives = d["archives"]
            archives[filename] = archive_id
            d["archives"] = archives

    def get_archive_id(self, filename):
        """
        Get the archive_id corresponding to the filename
        """
        with glacier_shelve() as d:
            if "archives" not in d:
                d["archives"] = dict()

            archives = d["archives"]

            if filename in archives:
                return archives[filename]

        return None

    def retrieve(self, filename, wait_mode=False):
        """
        Initiate a job, check its status, and download the archive when it's completed.
        """
        archive_id = self.get_archive_id(filename)
        if not archive_id:
            return

        with glacier_shelve() as d:
            if "jobs" not in d:
                d["jobs"] = dict()

            jobs = d["jobs"]
            job = None

            if filename in jobs:
                # The job is already in the shelve
                job_id = jobs[filename]
                try:
                    job = self.vault.get_job(job_id)
                except UnexpectedHTTPResponseError:
                    # Glacier returns a 404 if the job is no longer available
                    pass

            if not job:
                # Job initialization
                job = self.vault.retrieve_archive(archive_id)
                jobs[filename] = job.id

            job_id = job.id

            # Committing changes to the shelve
            d["jobs"] = jobs

        print("Job {0.action}: {0.status_code} ({0.creation_date}/{0.completion_date})".format(job))

        # Checking manually every 10 seconds whether the job is completed,
        # instead of using Amazon SNS
        if wait_mode:
            while True:
                job = self.vault.get_job(job_id)
                if job.completed:
                    break
                time.sleep(10)

        if job.completed:
            print("Downloading...")
            job.download_to_file(filename)
        else:
            print("Not completed yet")
Bakthat
You may also want to check out bakthat, a Python tool I wrote that lets you compress, encrypt (symmetric encryption), and upload files directly to Amazon S3/Glacier. You can use it either from the command line or as a Python module.
Your feedback
Don't hesitate to reach out if you have any questions!