JeffBeal.net

Job 6: Amazon Relational Database Service

About 12–18 months after I started at Amazon, the project I was working on lost a lot of its leadership and was eventually canceled. After a month or two helping out some other teams, we were told we were going to be reorganized into a completely different organization to work on some new projects for Amazon Web Services.

For some context, this was happening in 2008. Before joining Amazon the previous year, I hadn't heard of AWS. I had taken a few days to go through a small AWS hack-a-thon project when I joined Amazon, which introduced us to the basics of EC2, S3, SQS, and SimpleDB, but the services were not really in use internally at Amazon. At first, the idea of moving from a relatively central organization in the eCommerce Platform to what still seemed like an early experiment was a little unnerving, but once I understood the product we would be working on, I got pretty excited.

The product was the Relational Database Service. The pitch was simple — make it easy for AWS customers to provision and manage a relational database system in the cloud, and automate a lot of the more generic management aspects of the system. Three engineers and I from my original team, along with part of our testing team, joined with six engineers who had already laid out some of the initial groundwork for the system, but hadn't quite gotten things up and running.

Towards the end of my last project at Amazon, my team had taken some time to read up on the Scrum software development methodology, and had built what we thought was a pretty effective process for pushing out working software incrementally every three to four weeks. I had moved into a manager role, and at the very beginning of the RDS project, was focused on building my manager skills more than on being an engineer. We stayed divided into two sub-teams, with me continuing on as ScrumMaster for my original team.

Not too long after the project started, it became clear that the merger of two teams left the team with too many managers for the engineers we had. Towards the end of the last project, they had hired a Senior Manager to manage me, and he had come with us to RDS. The team we joined also had a Director, a Senior Manager, and a Manager, so looking strictly at titles, there were five managers for nine engineers. This was not working. Of the five managers, I was the only one who had worked recently as an engineer at Amazon, so they asked me — and I agreed — to move back to an engineering role.

As an engineer on the team, I found myself working on a lot of features that, while critical to take the product to the next stage, often didn't seem like the core piece of the product. The first feature that I remember putting significant effort into was making the database security groups work. This wasn't the central core of making the database work, but the security groups were essential to allow customers to secure their database instances from network intrusion. (Modern AWS has much better controls for this with their VPC system, but that didn't exist when we were working.)

Over the course of the next year or two, I found myself doing a lot of work that, similar to security groups, was essential but kind of around the edges of the database itself. After security groups, I worked on parameter groups and billing. I did a lot of work over time making sure that our process for deleting a database instance was completely robust to all sorts of error conditions that came up. The last feature I remember working on prior to the launch of RDS was the point-in-time restore feature, which allowed customers to restore a copy of their database to its exact state at any point in time over the past seven days.

The final process of launching RDS was a blur. There were so many late nights and weekends getting things ready to go. We had run a fairly extended private beta program and had tons of feedback from early customers to incorporate. AWS processes evolved over the course of our project, so we had to coordinate with the command-line interface team and the AWS console team to make sure that, instead of launching with only an API, we launched with full CLI and Console support. If I remember correctly, in the end we launched about two months behind schedule, but with quite a bit more functionality than we had originally intended as part of the launch.

By the time we launched RDS, the team had grown from just under ten engineers to about fifteen. At the same time, most of that surplus of managers we had at the beginning of the project had left the team. Instead of having a manager for every two engineers, we had one manager left for fifteen engineers and growing. For my overall contributions to the project, I earned a promotion to Senior Engineer around the time of the product launch, but technically held that title only briefly, as I decided to switch back to a manager role to help grow the team.

As a new manager on the team, one of my chief responsibilities was representing RDS at AWS's weekly operations meeting. Every Wednesday, every manager for every service spent two hours packed into a room, reporting on how well our services had stood up to customer traffic that week. If we had a major outage, we were expected to present an in-depth discussion of why our systems had failed, what we had done to fix it, and how it would never happen again. If there was time after going over all of the outages, they would randomly pick one of us to show our dashboards and metrics, and everybody in the room would look for what we were missing. I didn't love this meeting, but I definitely learned a lot from it. (I think that you could come up with far less stressful and more humane ways to get value similar to what this meeting provided, but that's a topic for another post.)

1: Lessons I'm Learning

I really cannot overstate how much I learned about building software working on this project. The engineers I worked with remain the best I have worked with (so far) and the product itself is, in many ways, the professional accomplishment of which I am most proud (so far). I will get into more detail about many aspects of this project over time, but for now, I want to focus on a couple.

Disagreements can be healthy

One of the practices we had on this team that I think is pretty unique in my experience was the amount of time we spent talking about what we would build before we built it. Especially in the beginning of the project, almost every feature started out with a design document which the team would review and discuss before an engineer started to actually write out the code. While I no longer think it is necessarily a good idea to go to this extreme, I remember that some of our pre-code discussions could get pretty heated as this team of extremely talented engineers would debate the various pros and cons of different approaches. At the time, I definitely remember feeling tons of frustration as my ideas would be dissected and debated by other members of the team, but in hindsight, I can see how much my skills as an engineer were strengthened through these discussions. I think that fostering an environment where a team can trust each other enough to stridently argue with one another, but then align on a team decision, is one of the most difficult and important factors in building a strong team.

Resiliency in the face of failure builds reliability

One of the biggest challenges we had working on RDS was the unreliability of our infrastructure. RDS was built almost entirely on top of existing AWS infrastructure (especially EC2 and EBS) and, at the time, those services were far from reliable. These systems embraced the concept of ephemeral hardware, and encouraged customers to accept the possibility that their EC2 instances could disappear at any moment, but databases had to provide much higher levels of availability and durability in order to provide their core value to customers. A lot of my contributions to this project over time came because I was usually the first engineer in the office, and would be reviewing our overnight test failures. In many cases, our tests failed because we had encountered a new failure scenario from our underlying systems. We would absolutely reach out to our partner teams to let them know of the issue, but rather than wait for them to improve their system, we figured out ways to be more resilient to each new error. RDS was far from perfect when we launched, but it was a successful product from Day One, and generally well-regarded as a reliable system. Its reliability did not come from a lack of underlying error, but from handling many different types of errors in a way that minimized their impact on the RDS customers.
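To make the idea of being resilient to errors from an underlying system a little more concrete, here is a minimal sketch of one common technique: retrying a failed call with exponential backoff and jitter. This is an illustration only, not how RDS actually handled failures. The names `TransientInfrastructureError` and `call_with_retries` are hypothetical, as are the default attempt count and delay.

```python
import random
import time


class TransientInfrastructureError(Exception):
    """Hypothetical error standing in for a recoverable failure in an underlying service."""


def call_with_retries(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientInfrastructureError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles on each attempt; a random delay under
            # that ceiling keeps many callers from retrying in lockstep.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            sleep(delay)


# Usage: an operation that fails twice before succeeding (assumed for illustration).
calls = {"count": 0}


def flaky_attach_volume():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientInfrastructureError("volume not ready")
    return "attached"


result = call_with_retries(flaky_attach_volume, sleep=lambda seconds: None)
```

Only the error type treated as transient is retried here; anything else propagates immediately, which is usually what you want when a failure is not expected to clear up on its own.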

Testing is hard

Another thing we did on this team was write a lot of tests. We had multiple types of tests, and not all tests would be written by the engineer who built the feature. Sometimes, we would have one engineer write the tests while another built the feature, and we also had dedicated test engineers who did nothing but write tests and testing frameworks. From the very start of the project, we had tests running hourly that would create a database, make sure you could write code against it, and then delete it and tear it down. Getting these tests right was hard, and I learned tremendous respect for those engineers with a mind for testing. I learned that the mindset for testing is more about breaking than about building, and thus there can be significant value in having people on the team with that mindset, coming along behind you, trying their hardest to break what you have built.
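The hourly create, use, and delete test described above could be sketched roughly as follows. This is a toy stand-in under stated assumptions: it uses a local SQLite file from Python's standard library rather than a real RDS instance, and the function name `run_lifecycle_check` and the `canary` table are hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile


def run_lifecycle_check():
    """Create a database, write and read a row, then tear everything down."""
    workdir = tempfile.mkdtemp()
    db_path = os.path.join(workdir, "canary.db")
    try:
        conn = sqlite3.connect(db_path)  # stands in for provisioning a database
        try:
            conn.execute("CREATE TABLE canary (id INTEGER PRIMARY KEY, note TEXT)")
            conn.execute("INSERT INTO canary (note) VALUES (?)", ("hello",))
            conn.commit()
            rows = conn.execute("SELECT note FROM canary").fetchall()
        finally:
            conn.close()
        return rows == [("hello",)]
    finally:
        # Teardown runs even if a step above failed, so a broken run
        # does not leave a stray database behind for the next one.
        shutil.rmtree(workdir, ignore_errors=True)


check_passed = run_lifecycle_check()
```

The part worth noticing is the `finally` block: a test like this is only useful on a schedule if every run cleans up after itself, including the runs that fail.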