We recently migrated our AWS deployment from EC2-Classic to EC2 VPC. The VPC model is a worthwhile change in its own right, as it means the vast majority of our instances need not be reachable from the public internet. In addition, an increasing number of AWS features are only available on the VPC platform, such as:
- Enhanced Networking
- T2, M4, C4 instances
- Flow logs
- Changing security groups of running instances
- Internal load balancers
There are also aspects that are just saner. For example, VPC security groups apply to EC2 instances, RDS instances, ElastiCache instances and so on, whereas in the classic world there are separate database security groups, cache security groups, Redshift cluster security groups etc., all doing basically the same thing but spread across umpteen services rather than being defined in one place. Amazon has a long list of fine-grained differences. One of the few feature regressions I can find is the lack of IPv6 support on load balancers.
For small deployments migration is easy: you can snapshot your instances or databases and recreate them in a VPC. This is obviously only possible if fairly extended downtime is acceptable and gets more and more difficult to coordinate as the complexity of the deployment grows.
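As a sketch of that simpler path for a single EC2 instance, using boto3 (the instance, subnet, security group IDs and instance type below are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Image the EC2-Classic instance, wait for the AMI, then launch it into a
# VPC subnet with VPC security groups. All IDs are placeholders.
image = ec2.create_image(InstanceId="i-0aaaaaaaaaaaaaaaa", Name="pre-vpc-migration")
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

ec2.run_instances(
    ImageId=image["ImageId"],
    InstanceType="m4.large",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0aaaaaaaaaaaaaaaa",
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
```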
A stumbling block for a more incremental migration used to be that there was no integration between EC2 security groups and VPC security groups. This made such a move nearly as involved as moving data centres (for example, see Instagram’s approach). Luckily this changed in early 2015 with the addition of ClassicLink, which allows EC2-Classic instances to be linked to a VPC, with VPC security groups controlling communication between these instances and VPC resources.
Deployment overview
Our deployment is comprised of:
- A number of web applications, each hosted on an autoscaled set of EC2 instances, behind an ELB.
- An RDS database.
- A MongoDB replica set, hosted on EC2 instances managed by us.
- An Elasticsearch cluster, also hosted on EC2 instances managed by us.
We needed a migration plan that would keep the amount of downtime reasonable. For us, downtime of up to 10-15 minutes in one of our scheduled windows is acceptable as an infrequent event. We were able to complete the migration with one such event.
We used CloudFormation to create subnets, security groups and routing tables so that we could guarantee that our staging and production environments were the same. This also provides us with a history of changes to network configuration.
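As a flavour of the approach, a deliberately tiny stack created with boto3 might look like the sketch below; the VPC ID, CIDR block and availability zone are placeholders, and a real template would declare all of the subnets, route tables and security groups:

```python
import json
import boto3

cfn = boto3.client("cloudformation", region_name="eu-west-1")

# A minimal template: a single private subnet in an existing VPC.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "PrivateSubnetA": {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": "vpc-0123456789abcdef0",
                "CidrBlock": "10.1.1.0/24",
                "AvailabilityZone": "eu-west-1a",
            },
        }
    },
}

cfn.create_stack(StackName="network-staging", TemplateBody=json.dumps(template))
```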
There are plenty of resources covering VPC design. Our basic goal was for all of our instances to be in private subnets, with the exception of bastion instances. One thing to watch out for is that instances in private subnets need to go through your NAT instance(s) to connect to AWS API endpoints, making them a critical piece of infrastructure. This is especially true if you use services such as DynamoDB as your primary data store. For S3 you can set up a VPC endpoint so that connectivity to S3 does not have to go through the NAT instance (VPC endpoints for more services are promised too).
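Creating the S3 endpoint is a single call; a minimal sketch with boto3, with placeholder VPC and route table IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Create a gateway endpoint for S3 so traffic from private subnets bypasses
# the NAT instances. The VPC and route table IDs are placeholders.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```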
We didn’t use Elastic IPs very much prior to VPC; however, these can also be migrated from Classic to VPC. Since moving to VPC we’ve assigned Elastic IPs to our NAT gateways, as this makes it easy for us to provide a small list of static IP addresses that connections may come from, for those partners with such requirements.
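Migrating an address is a one-liner; a sketch with boto3 and a placeholder address (once the move completes, the address is available on the VPC platform and can be associated with a NAT gateway or instance there):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Move an EC2-Classic Elastic IP onto the VPC platform.
ec2.move_address_to_vpc(PublicIp="203.0.113.10")
```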
Establishing connectivity
After a couple of false starts, we had a VPC set up with all the security groups and subnets we would need. (One early mistake was to set up the VPC in the 172.16.0.0/16 space: this overlaps with the addresses of the Amazon-provided DNS servers in EC2-Classic, which meant that classic-linked instances could not access DNS.)
The first step was to link all of our EC2 instances with our VPC. We also updated our provisioning scripts and autoscaling groups so that new instances would also be linked to the VPC with the same security groups. With connectivity between our EC2 instances and the VPC in place, we could start the actual migration.
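A sketch of the linking step with boto3; the VPC, security group and instance IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Enable ClassicLink on the VPC (a one-off step), then link each
# EC2-Classic instance to it with the VPC security groups it should use.
ec2.enable_vpc_classic_link(VpcId="vpc-0123456789abcdef0")

for instance_id in ["i-0aaaaaaaaaaaaaaaa", "i-0bbbbbbbbbbbbbbbb"]:
    ec2.attach_classic_link_vpc(
        InstanceId=instance_id,
        VpcId="vpc-0123456789abcdef0",
        Groups=["sg-0123456789abcdef0"],
    )
```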
Data stores
As recommended in the documentation, we first migrated our data stores, starting with our MongoDB replica set. The steps we performed were as follows (a rough sketch of the replica set changes follows the list):
- Created a new instance in the appropriate private subnet.
- Added it as a secondary member of the replica set.
- Promoted the new secondary to primary (after verifying that applications could connect to the new secondary instance). This required 30 seconds or so of downtime.
- Added a second EC2-VPC hosted secondary and arbiter.
- Removed the EC2-Classic MongoDB instances from the replica set & destroyed them.
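A rough sketch of adding the first VPC member and forcing the election, using pymongo. The hostnames are placeholders, it assumes MongoDB 3.0+ (where `replSetGetConfig` is available), and a real run would also want to check replication lag and member priorities first; the later members follow the same pattern.

```python
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

# Connect to the current primary (an EC2-Classic host; names are placeholders).
client = MongoClient("mongodb://classic-mongo-1.internal:27017")

# Add the new VPC-hosted instance as a secondary: append it to the replica
# set config and bump the config version.
config = client.admin.command("replSetGetConfig")["config"]
config["members"].append({
    "_id": max(m["_id"] for m in config["members"]) + 1,
    "host": "vpc-mongo-1.internal:27017",
})
config["version"] += 1
client.admin.command("replSetReconfig", config)

# Once the new secondary has caught up, force an election by stepping the
# old primary down; it is ineligible for re-election for 60 seconds.
try:
    client.admin.command("replSetStepDown", 60)
except AutoReconnect:
    # Older MongoDB versions close client connections when the primary
    # steps down, so this exception is expected here.
    pass
```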
We moved the Elasticsearch instances next. Our Elasticsearch cluster uses the cloud-aws plugin to discover cluster members based on instance tags. For each member of the cluster we did the following (the shard-draining call is sketched after the list):
- Created a new instance in the VPC.
- Waited for it to join the cluster.
- Migrated shards off the old instance, using the `cluster.routing.allocation.exclude._ip` setting.
- Destroyed the old instance.
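Draining a node boils down to one cluster settings call. A minimal sketch using the HTTP API via `requests`; the Elasticsearch endpoint and node address are placeholders:

```python
import requests

ES = "http://localhost:9200"   # any node in the cluster
OLD_NODE_IP = "10.234.12.34"   # private IP of the node being retired

# Exclude the old node from shard allocation; Elasticsearch relocates its
# shards onto the remaining (VPC) nodes in the background.
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._ip": OLD_NODE_IP}},
).raise_for_status()

# The node can be destroyed once it no longer holds any shards.
print(requests.get(f"{ES}/_cat/shards?v").text)
```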
The RDS instance was the most troublesome: while an EC2-Classic instance can use ClassicLink to access an RDS instance located in a VPC, the reverse is not true; an instance in the VPC cannot reach an RDS instance on EC2-Classic. The same restriction applies to services such as ElastiCache.
The approach we followed was to create a read replica inside the VPC. While it only takes a few clicks or API calls to create a replica in another region (inside or outside a VPC), creating a VPC replica in the same region isn’t supported. Instead you have to use RDS’s external replication support. We followed the steps listed here; a brief summary:
- Snapshot a replica of the database
- Create a new instance in the VPC from the snapshot
- Set the new instance to replicate from the existing instance
- Cut over to the new instance.
It’s important to create the RDS instance in a private subnet: if not, the instance’s endpoint will resolve to its public IP address on classic-linked instances, even though connectivity via the private IP address is possible. We wanted our RDS instance to be in a private subnet anyway, so this wasn’t an issue. With an instance in a private subnet, the endpoint resolves to the private IP address everywhere.
We used an SSH tunnel to allow our VPC read replica to connect to the existing RDS instance’s private hostname (this tunnel went via a classic-linked instance).
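As an illustration, and assuming a MySQL engine (the stored procedures below are the RDS-provided MySQL ones; hostnames, credentials and binlog coordinates are all placeholders), configuring the new VPC instance to replicate from the old one through the tunnel looks roughly like this:

```python
import pymysql

# Connect to the new VPC RDS instance; the endpoint is a placeholder.
conn = pymysql.connect(
    host="newdb.xxxxxxxxxxxx.eu-west-1.rds.amazonaws.com",
    user="admin",
    password="replace-me",
    autocommit=True,
)

with conn.cursor() as cur:
    # Point replication at the old instance, reached via the forwarded port
    # on the classic-linked tunnel instance. The binlog file and position
    # are taken when the snapshot is made.
    cur.execute(
        "CALL mysql.rds_set_external_master(%s, %s, %s, %s, %s, %s, %s)",
        ("ip-10-1-2-3.eu-west-1.compute.internal", 3307,
         "repl_user", "repl_password",
         "mysql-bin-changelog.000123", 456, 0),
    )
    cur.execute("CALL mysql.rds_start_replication")
```

At cutover time, `mysql.rds_stop_replication` and `mysql.rds_reset_external_master` are the corresponding procedures for stopping and clearing replication on the new instance.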
During one of our maintenance windows we disabled access to the old database, before disabling replication and pointing all our applications at the VPC-based RDS instance. This is the step that required several minutes of downtime (it could have been reduced a little if we had automated it further).
Application instances
First we replaced our load balancers with VPC-based load balancers. Classic-linked instances can be added to a VPC load balancer, so this does not require any change to the instances. Instances can be attached to multiple load balancers, and the load balancers attached to an autoscaling group can be changed at any time, so the following steps require no downtime (a code sketch follows the list):
- Attach the new load balancers to the existing autoscaling groups
- Update the DNS entries for the applications to point at the new load balancers
- Wait for traffic to the old load balancers to cease (this took between a few hours and a day or so, despite Amazon setting a TTL of 60 seconds on alias records)
- Detach the old load balancers from the autoscaling groups and destroy them.
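A sketch of the attach and detach steps with boto3; the autoscaling group and load balancer names are placeholders, and the DNS change in between was a Route 53 alias update:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Attach the new VPC load balancer alongside the old one; instances stay
# registered with both while DNS traffic drains.
autoscaling.attach_load_balancers(
    AutoScalingGroupName="webapp-asg",
    LoadBalancerNames=["webapp-vpc-elb"],
)

# ... update DNS, wait for traffic on the old load balancer to stop ...

autoscaling.detach_load_balancers(
    AutoScalingGroupName="webapp-asg",
    LoadBalancerNames=["webapp-classic-elb"],
)
```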
The very last step was the application instances themselves. The process for these was pretty simple (a sketch follows the list). For each application:
- Create a new VPC autoscaling group with a launch configuration otherwise identical to the existing one
- Set the desired instance count of the new autoscaling group to match the old one
- Wait for new instances to be available
- Set desired instance count of the old autoscaling group to 0
- Destroy the old autoscaling group
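A sketch of the swap for one application using boto3; group names and subnet IDs are placeholders, and the new launch configuration is assumed to be a copy of the old one that references VPC security groups:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Read the sizing of the old group so the new one matches it. The
# VPCZoneIdentifier (private subnets) is what places new instances in the VPC.
old = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["webapp-classic-asg"]
)["AutoScalingGroups"][0]

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="webapp-vpc-asg",
    # A copy of the old launch configuration, but using VPC security groups.
    LaunchConfigurationName="webapp-vpc-lc",
    MinSize=old["MinSize"],
    MaxSize=old["MaxSize"],
    DesiredCapacity=old["DesiredCapacity"],
    VPCZoneIdentifier="subnet-0aaaaaaaaaaaaaaaa,subnet-0bbbbbbbbbbbbbbbb",
    LoadBalancerNames=["webapp-vpc-elb"],
)

# Once the new instances are in service, drain the old group before deleting it.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="webapp-classic-asg", DesiredCapacity=0
)
```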
Timelines
The above steps could have been completed pretty quickly if needed: with sufficient resources and a degree of automation it would probably have been feasible to do this overnight. A lower-effort, lower-risk path for us was to do it gradually, over a period of weeks. Our experiments in our staging environment convinced us that the reliability of ClassicLink was not a concern, and we saw no issues during the migration, even though some instances relied on ClassicLink for several weeks.