The S3 Backup Challenge
When migrating a large file system to AWS S3, one of the surprising challenges is implementing a backup strategy. With file-system-based storage, you often back up the disk with periodic snapshots (daily, weekly, monthly, etc.). There is comfort in knowing that you have a plethora of time-capsuled copies stored away and, in the event of an extinction-level event, can recover most of your data.
It turns out this analogy doesn't easily transfer over to S3. In fact, after investigating the confusing alternatives available for S3 backup, it can feel like you are ending up with less recoverability than you have now. The reality is that you need to think about backup a little differently and focus on the recovery scenarios. In the final analysis, you are probably better off with S3 in this light than with traditional disk snapshots.
S3 Recovery Scenarios
Assuming that you have researched S3, you probably want to handle the following scenarios:
- S3 bucket fails. Unlikely, but the equivalent of a disk crash.
- Accidentally delete objects. Human error on our side.
- Permanently delete object versions. Again, human error on our side.
- Objects/data get corrupted. The corruption could propagate to multiple backups or versions.
- Accidentally delete bucket. The apocalypse scenario. The equivalent of rm -rf mybucket/
1. S3 Bucket Fails - Solution: Cross Region Replication
Problem: AWS has an infrastructure failure in S3 or the surrounding data center. AWS designs S3 Standard for 99.99% availability and 99.999999999% (11 nines) durability, so this scenario is unlikely. Still, the solution is to have the entire bucket copied to another bucket somewhere else. Fortunately, AWS provides a built-in mechanism to achieve this: Cross Region Replication (CRR).
Steps to enable CRR
1. Create a new bucket to replicate data into, e.g. mybucket-replica. Create it in a different region; for example, if your original bucket is in us-east-1, put the replication bucket in us-east-2. Make sure to enable versioning on the bucket.
2. On the original bucket, go to Management->Replication and Add Rule. Again, versioning must be enabled on your original bucket.
3. [Source] To replicate all content (going forward), keep the defaults and hit Next. [Destination] Select the new bucket that you created (Eg. mybucket-replica). Click Next. [Permissions] Select Create new role. Review and click Save.
Note: Replication does not retroactively copy objects that were already in the bucket, so you may need to copy existing objects over manually.
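The console steps above can also be written down as a replication configuration document, which is roughly what the console builds for you. The following is a minimal sketch, not a definitive setup: the role ARN and the rule ID are hypothetical placeholders, and it assumes you would save the printed JSON and pass it to aws s3api put-bucket-replication with --replication-configuration.

```python
import json


def build_replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Build a replication configuration that covers every new object."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "replicate-everything",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every key
                # Keep source-side delete markers out of the replica.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


config = build_replication_config(
    "arn:aws:iam::111111222222:role/my-replication-role",  # hypothetical role
    "mybucket-replica",
)
print(json.dumps(config, indent=2))
```

Leaving delete marker replication disabled is a deliberate choice for a backup copy: an accidental delete in the original bucket is then not mirrored into the replica.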
2. Accidentally Delete S3 Objects - Solution: S3 Bucket Versioning
This scenario covers perhaps the most common case, in which your own code or process inadvertently deletes objects. Fortunately, S3 versioning handles this case automatically, and better than a "brick-level" restore from snapshots. With versioning enabled, a delete does not actually remove the object from the bucket; S3 places a delete marker on top of the existing versions. While you normally reference the "current" version, you can revert to previous versions at any time. We won't go through the full mechanics of restoring previous versions here, but it is important to enable versioning on the bucket.
Steps to enable S3 bucket Versioning
1. Go into the bucket properties. Click on Versioning and then Enable versioning.
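To illustrate why a versioned delete is recoverable, here is a small sketch of the "undelete" idea. It assumes a listing shaped like the output of aws s3api list-object-versions (separate "Versions" and "DeleteMarkers" lists); the sample data and the function name are made up for illustration. Removing the delete marker it finds (with delete-object and that version ID) would make the previous version current again.

```python
from typing import Optional


def find_current_delete_marker(listing: dict, key: str) -> Optional[str]:
    """Return the version ID of the delete marker that currently hides `key`."""
    for marker in listing.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            return marker["VersionId"]
    return None  # the object is not hidden by a delete marker


# Hypothetical listing for an object that was "deleted" once.
sample_listing = {
    "Versions": [
        {"Key": "report.csv", "VersionId": "v1", "IsLatest": False},
    ],
    "DeleteMarkers": [
        {"Key": "report.csv", "VersionId": "dm1", "IsLatest": True},
    ],
}

marker_id = find_current_delete_marker(sample_listing, "report.csv")
print(marker_id)  # dm1
```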
3. Permanently Delete Object Versions - Solution: MFA Delete Protection (versioning)
So what if you accidentally delete the old (or all) previous versions of an object? To prevent this, you can require Multi-Factor Authentication (MFA) for all permanent deletions of objects. An administrator then has to provide a code from an MFA-enabled device to perform the deletion, making it very unlikely to happen without deliberate intent to delete that object.
There is a surprising lack of detail surrounding both the conceptual approach and the mechanics of implementing this. Part of the confusion is that there are two different contexts for "MFA Delete". In this case we are talking about the MFA Delete setting on the bucket's versioning configuration. This cannot be set from the console. You have to use the AWS CLI to enable it, and before doing that you need MFA set up on the account and an MFA device. Think of an MFA device as one of the RSA SecurID fobs with a constantly changing six-digit number.
Steps to enable S3 MFA Delete on a bucket
1. First, make sure that you have MFA enabled on your root account: IAM -> Activate MFA on your root account. You will need to set up a device. The easiest option is to download the Authenticator app from Google onto your phone. Scan the QR code shown in the AWS console with your phone and you now have MFA enabled.
2. Now you can set up MFA Delete on the versioning information for the bucket. Again, to do this you need to use the CLI. Here is a command to set this up. Note that the '123456' that you see below is an actual code shown on your MFA device (Authenticator app on your phone).
aws s3api put-bucket-versioning --bucket mybucket --versioning-configuration Status=Enabled,MFADelete=Enabled --mfa "arn:aws:iam::111111222222:mfa/root-account-mfa-device 123456"
3. While you cannot set this property in the console, you can see it. If you click on the bucket, you should see the properties on the right where you can verify that MFA delete is enabled.
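As a rough sketch of the two fiddly parts of these steps: the --mfa value is the device serial and the current code separated by a single space, and the result can be checked from the JSON that aws s3api get-bucket-versioning prints. The helper names below are hypothetical, and the sample output string is typed by hand rather than captured from a real bucket.

```python
import json


def mfa_argument(device_serial: str, code: str) -> str:
    """Format the value passed to --mfa: '<device serial> <six-digit code>'."""
    if not (code.isdigit() and len(code) == 6):
        raise ValueError("expected a six-digit MFA code")
    return f"{device_serial} {code}"


def mfa_delete_enabled(get_bucket_versioning_output: str) -> bool:
    """Interpret the JSON printed by get-bucket-versioning."""
    # The command prints nothing if versioning was never configured.
    status = json.loads(get_bucket_versioning_output or "{}")
    return status.get("Status") == "Enabled" and status.get("MFADelete") == "Enabled"


arg = mfa_argument("arn:aws:iam::111111222222:mfa/root-account-mfa-device", "123456")
print(arg)

sample_output = '{"Status": "Enabled", "MFADelete": "Enabled"}'  # hand-written sample
print(mfa_delete_enabled(sample_output))  # True
```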
4. Objects/data get corrupted - Solution: S3 Bucket Versioning
Another scenario that can occur is that some portion of your data gets corrupted. This corruption, if left untreated, can propagate to multiple snapshots or versions. With a finite number of backups, this could result in ALL of the snapshots becoming corrupted. Fortunately, if you don't expire out old versions, your S3 bucket can contain a full version history of each object making recovery as straightforward as finding the last uncorrupted version.
Steps to enable S3 bucket Versioning
(See scenario 2 above to enable bucket versioning)
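Finding the last uncorrupted version amounts to walking an object's version history from newest to oldest and stopping at the first version that passes an integrity check. The sketch below assumes version records shaped like entries from list-object-versions; the integrity check itself is a stand-in (a hypothetical set of known-bad version IDs), since in practice you would have to download each version and validate it in whatever way suits your data.

```python
from typing import Callable, Dict, List, Optional


def last_good_version(
    versions: List[Dict[str, str]],
    is_intact: Callable[[Dict[str, str]], bool],
) -> Optional[str]:
    """Return the newest version ID that passes the integrity check."""
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    newest_first = sorted(versions, key=lambda v: v["LastModified"], reverse=True)
    for version in newest_first:
        if is_intact(version):
            return version["VersionId"]
    return None  # every retained version is corrupted


# Hypothetical version history for a single object.
history = [
    {"VersionId": "v1", "LastModified": "2020-01-01T00:00:00Z"},
    {"VersionId": "v3", "LastModified": "2020-03-01T00:00:00Z"},
    {"VersionId": "v2", "LastModified": "2020-02-01T00:00:00Z"},
]
known_bad = {"v2", "v3"}  # stand-in for a real content check

good = last_good_version(history, lambda v: v["VersionId"] not in known_bad)
print(good)  # v1
```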
5. Accidentally Delete Bucket - Solution: MFA Delete Bucket Policy
Finally, we have the scenario that is most unsettling: the entire bucket getting inadvertently deleted. This feels much worse than any scenario in the disk/snapshot world because it is like losing the current file system AND ALL BACKUPS at the same time.
First of all, remember that we enabled cross-region replication in scenario 1. So we do have another copy of the entire bucket with ALL versions somewhere else AND it is up to date.
The key here is to make it virtually impossible to delete the bucket. We do this with a bucket policy that denies delete attempts without MFA. This is similar in spirit to MFA Delete on the bucket versioning, but it is set up differently.
Now, while the bucket policy that we put in place will require MFA, remember that a user can log into the console using MFA. Technically, they now have an MFA authentication and could delete the bucket from the console. A nice secondary check that you can include in a policy is the age of the authentication; that is, how long ago the user authenticated. One precaution that you can take is to set this limit, the MultiFactorAuthAge, to 1 second! A console login will always be older than that, so the delete cannot be done from the console. The only way to delete the bucket is then to change the policy first.
Steps to add an MFA bucket policy to a bucket
1. Go into the bucket and click Permissions then Bucket Policy
2. Enter or add something similar to the following policy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MFADeleteOnly",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:DeleteBucket",
            "Resource": "arn:aws:s3:::mybucket",
            "Condition": {
                "NumericGreaterThanIfExists": {
                    "aws:MultiFactorAuthAge": "1"
                }
            }
        }
    ]
}
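To make the effect of the condition concrete, here is a small sketch that mimics how NumericGreaterThanIfExists behaves for aws:MultiFactorAuthAge. It is only an illustration of the logic, not AWS's policy evaluator: the function name is made up, and None stands in for a request where the key is absent because no MFA was used.

```python
from typing import Optional

MAX_MFA_AGE_SECONDS = 1  # the "1" in the policy condition


def delete_bucket_denied(mfa_age_seconds: Optional[int]) -> bool:
    """Return True when the Deny statement would match the request."""
    if mfa_age_seconds is None:
        # IfExists: a missing key makes the condition true, so the Deny applies.
        return True
    # Key present: the Deny applies when the MFA login is older than the limit.
    return mfa_age_seconds > MAX_MFA_AGE_SECONDS


print(delete_bucket_denied(None))  # True: no MFA at all
print(delete_bucket_denied(3600))  # True: console login an hour ago
print(delete_bucket_denied(1))     # False: only a just-issued MFA session passes
```

In other words, both a request without MFA and a request from an ordinary MFA console session match the Deny, which is what makes the bucket so hard to delete by accident.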
Summary
With a little thought you can create a highly available S3-based storage environment that is resilient enough to recover from virtually all of the common mayhem scenarios. While the approach differs from traditional file system snapshots (backups), in many ways it provides better recovery in practice.
Thank you for your tutorial. It was very helpful. I thought I would let you know that Policies now have a different syntax, so your MFA bucket policy will not work.
Here is an example of the new syntax:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt001",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:DeleteBucket",
            "Resource": "arn:aws:s3:::mybucket",
            "Condition": {
                "NumericGreaterThanIfExists": {
                    "aws:MultiFactorAuthAge": "1"
                }
            }
        }
    ]
}
Notes:
Version: MUST be "2012-10-17"; this is the version of the AWS policy language, NOT your own version of the policy.
The entire statement must now be inside the "Statement": [...]
"Sid" must now be inside the "Statement"
"Sid" is truly your own personal reference, it has no bearing on the actions of the policy at all. Use this to help you remember what the statement is doing for you.
Cheers!
Jason
Thanks for the correction! I have updated the post.
By default, Amazon S3 stores buckets in a minimum of 3 geographically distant availability zones within the same region. So if one bucket fails in one region, then it should be quickly fixed, so CRR is not needed.
CRR is not needed if an AZ fails.
If a region fails, CRR will help (as it's cross-REGION replication).
This was very helpful thank you. Regarding step 5 - how can the whole bucket be deleted without the contents first being removed? As of 2020 anyway, trying to delete a non-empty bucket on aws dashboard will prompt a "only empty buckets can be deleted" message...
Is there some other way (e.g. through the CLI) to delete a bucket without first removing the contents?