On July 30 and August 10, we had outages on Etsy.com. We take site outages very seriously around here, dissecting what happened so that we can learn and prepare for the future. Since we’re all in this together, I think it’s important to share with you, our community, information about the outages, how we handled them, why we took the site down when we did, and what we’re doing to help prevent similar events.
To give some context, the events on 7/30 and 8/10 were unrelated, and didn’t have anything to do with public-facing changes or experiments.
July 30: Background
On July 30, we had an outage, and the EtsyStatus comments about it are here.
The Short Version:
- We needed to do a database upgrade to support new languages in Etsy’s future.
- We also needed to make an improvement to how our databases did nightly backups, because they were slowing the site down when they ran.
- We expected to make those improvements separately. Instead, they were accidentally made at the same time.
- In order to confirm that there was no data loss or corruption during the accidental upgrade, we took the site down while we verified everything was in order, which it was.
- We brought the site back up.
The Longer Version:
As you might be aware, Etsy is growing rather quickly globally. In order to ensure that sellers and buyers across the planet can be successful, we need to support languages other than English. We’ve already started with German, French, Italian, Dutch, and Japanese, but we want to support even more. We’re even looking for translators of new languages in the Etsy community.
Some languages have special characters in their alphabets that take up more space to store in a database. Without diving too far into technical details, suffice it to say that in order to support the special characters that some languages require to be complete, we needed to do an upgrade of the database server software.
Now this isn’t a run-of-the-mill change. It’s not the installation of another database server; it’s the upgrading of the 80+ databases that already have data in them. We couldn’t be too careful with this upgrade; we broke it up into a number of steps so that we could be very deliberate in the rollout. We took a good amount of care to plan the upgrade such that it could be done without having to shut down Etsy.com. Since this type of upgrade isn’t something that happens often, we wanted to take it slow and easy.
We did a number of things in order to have confidence that the upgrade would be safe to do:
- We upgraded test servers to confirm that everything would still work, and that we would get the special characters that were needed.
- We tested database backups and recovery, so that we could recover from any surprises in the upgrade if we needed to.
- We’ve got a lot of smart engineers here at Etsy, but we reached out to a number of external database experts to ask their advice, since they had experience with this upgrade. We dotted the i’s and crossed the t’s.
- Once we had confidence, we would upgrade one (and only one) server first, and have many eyes and alerts on it looking for anything out of the ordinary, ready to roll back quickly if need be.
The one live server we upgraded performed excellently. In fact, the upgrade made the database run faster, because there were performance improvements as well as bug fixes. This server ran for an entire week without any issues at all, and our plan was to upgrade the rest of the 80+ servers slowly over the following week, in a way that would be transparent to the community. This meant no downtime.
So there was a sane and careful plan for upgrading the servers over a period of time. The upgrade had gone out to only one server, and the upgrade for the remaining servers was basically placed “on-deck,” so that when we were ready, it would be straightforward.
In the meantime, we researched some issues (completely unrelated to the upgrade) that we had been seeing when we did nightly database backups. From time to time, as a server was finishing up its backup, it would stall or lock up for a handful of seconds, sometimes up to 30 seconds. This would mean that around 3 a.m. ET, anyone using the site who just happened to hit that one database during this time would have to wait 30 seconds for their response. This is an eternity when you’re trying to list an item or buy something.
This is obviously not good, and while there aren’t many people around the world using the site at that time, it’s still something we needed to fix. We eventually fixed it, and now don’t have the issue at all. We thought it was worth sharing, so we even posted on our engineering blog about it.
When we went to make this improvement on how the backups run on all of the servers, we used an automated tool whose responsibility it is to make sure all of the servers are consistent. After we tested the fix, we pushed the fix to the backups using the tool, and expected to confirm later that night that they did their backups successfully without pausing.
What we didn’t know at the time was that deploying the improvement to the backups also meant deploying the database upgrade as well. We weren’t ready to start upgrading the remaining databases; we only wanted to fix the bug with the backups.
It wasn’t clear to the engineer who deployed the backup fix that it would be coupled with the database upgrade. He was under the impression that only the backup fix would be pushed out, basically cutting in front of the upgrade that was currently “on-deck.”
This was a reasonable expectation, because in almost every case, we deploy improvements one at a time. We test first, and then deploy the improvement. In this case, we had tested and only partially deployed the upgrade, to one server only, in order to be extra sure over time that the upgrade was solid.
But the improvement to the backups didn’t cut in front of the upgrade; it pushed the upgrade into production along with it.
So over the course of 5-7 minutes, about 60% of Etsy’s database servers automatically upgraded themselves. This is exactly what the software was told to do, and it worked remarkably well. Except that it was a complete surprise to the engineer, whose intention was not to do the upgrade yet, but just to improve the database backups.
We ordinarily wouldn’t have upgraded databases while they were serving live traffic. The plan was (and always is) in those cases to bring the servers out of production, upgrade them, confirm they are behaving correctly, and put them back live. We can do this without anyone noticing because these servers are arranged in pairs. One in, one out. Upgrade, and swap.
But on July 30, they all upgraded themselves via our automation system while still being live. When we detected this was happening, we disabled the site in order to make sure we weren’t going to corrupt or lose any data, and manage the upgrade more gracefully.
We spent the majority of the outage time making sure that the data on the upgraded servers was intact and hadn’t been lost or corrupted, since the upgrade happened so suddenly. No loss was expected, but we take a “trust but verify” approach when it comes to data on Etsy.com.
Once we were able to confirm that the databases were correct and behaving normally, we brought the site back up. All the while, we tried to do our best to let the community know what was happening, on the EtsyStatus blog.
Of course, since the upgrade did happen, albeit surprisingly and intrusively, the new international languages and other improvements can now move forward.
So what did we learn from this event, and what are we doing to help prevent similar things from happening in the future?
First, we’re bolstering our automated tools to make it clearer to engineers what is being deployed. If the engineer who was deploying had seen that the “on-deck” changes would go out along with the backup improvement, he would have stopped and taken the upgrade out of the list of things to be deployed.
Next, we’re changing the way we do large upgrades such as this one. Before, you had to remember that something was “on-deck” when deploying something else, and even when an engineer is given the opportunity to review what is in the deploy, it can still be performed accidentally. So for large upgrades like this one, we’ll do them in such a way that they can’t be deployed alongside anything else; you have to explicitly specify that you deliberately want to perform the upgrade.
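As a sketch of the idea (the function and flag names here are hypothetical, not our actual deploy tooling): a deploy that includes a change marked as a large upgrade is refused unless the engineer explicitly asks for it.

```python
class DeployBlocked(Exception):
    """Raised when a large upgrade is bundled into an ordinary deploy."""

def deploy(changes, allow_upgrades=False):
    """Deploy a list of changes, refusing if a large upgrade is
    included without the engineer explicitly asking for it."""
    upgrades = [c for c in changes if c.get("large_upgrade")]
    if upgrades and not allow_upgrades:
        names = ", ".join(c["name"] for c in upgrades)
        raise DeployBlocked(
            f"Refusing to deploy: large upgrade(s) on deck ({names}). "
            "Re-run with allow_upgrades=True to deploy them deliberately."
        )
    return [c["name"] for c in changes]

# On July 30, the deploy effectively looked like this:
changes = [
    {"name": "backup-fix"},
    {"name": "db-upgrade", "large_upgrade": True},
]
# deploy(changes)  # now raises DeployBlocked instead of silently upgrading
```

With a guard like this, pushing the backup fix alone would have stopped with an error, and the engineer would have removed the upgrade from the list before retrying.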
We’re also building a number of tools to confirm that databases have the correct data, even in worst-case scenarios. We want to be able to confirm, as quickly as we can, that the data lines up with what we expect and that we’re storing it correctly in more than one place.
August 10: Background
On August 10, we had another outage, and the EtsyStatus comments about it are here.
The Short Version:
- We need to create unique ID numbers for the various elements on Etsy.com, such as shops, listings, treasuries, etc.
- The servers need to be told what range these numbers will fall in, so they can set aside space and memory for them.
- The space we set aside for some of the ID numbers wasn’t large enough.
- We took the site down in order to fix those “too small” ranges, and confirm that ID numbers that were expected to be unique weren’t colliding with others.
- After confirming all was okay, we brought the site back online again, and began proactively looking for and enlarging ranges that might overflow in the future.
The Long Version:
We have about 20 million registered accounts on Etsy, 2 million shops, over 100 million (sold and new) listings, as well as many new treasuries, teams, convos, tags, banners, etc. every second of every day. In order to keep track of these and other things in the servers, we need to assign them unique ID numbers.
When a new member registers, or a new shop opens, or a new listing is uploaded, we go to a special set of servers to get a new unique ID number for it. The job of those servers is to make sure that no IDs get reused for the same thing. For example, we don’t want two shops to get the same ID number, because if they did, we couldn’t be sure which one we should show listings from when a buyer wants to browse one of those shops.
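To make the idea concrete, here is a minimal sketch of what such an ID service does (this is an illustration, not our actual implementation): hand out the next number atomically, so that no two requests can ever receive the same ID.

```python
import threading

class IdGenerator:
    """Hands out unique, ever-increasing ID numbers.
    Safe to call from many threads at once."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def next_id(self):
        # The lock guarantees no two callers ever see the same value.
        with self._lock:
            value = self._next
            self._next += 1
            return value

# One generator per kind of thing: shops, listings, treasuries, ...
shops = IdGenerator()
print(shops.next_id())  # 1
print(shops.next_id())  # 2
```

The real servers do this at much larger scale and survive restarts, but the core guarantee is the same: every caller gets a number nobody else has.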
Because numbers are infinite, and computers are not, we have to give the servers a general range of numbers that we’re looking to store. We do that so they can set aside memory and disk space for those values. There are two main ranges for ID numbers at Etsy: an “INT” and a “BIGINT.” An “INT” is a number that can go from 1 up all the way to 2,147,483,647 (or about 2.1 billion), whereas a “BIGINT” can go from 1 up to 9,223,372,036,854,775,807 (or about 9.2 quintillion). If you’re interested in this topic, there’s a very technical Wikipedia page about it.
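Those two maxima come straight from the storage sizes: a signed INT uses 4 bytes (1 sign bit plus 31 value bits) and a signed BIGINT uses 8 bytes (1 sign bit plus 63 value bits). A quick check in Python:

```python
# A signed INT is stored in 4 bytes: 1 sign bit + 31 value bits.
INT_MAX = 2**31 - 1      # 2,147,483,647 (about 2.1 billion)

# A signed BIGINT is stored in 8 bytes: 1 sign bit + 63 value bits.
BIGINT_MAX = 2**63 - 1   # 9,223,372,036,854,775,807 (about 9.2 quintillion)

print(INT_MAX)     # 2147483647
print(BIGINT_MAX)  # 9223372036854775807
```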
In the case where we actually create the ID numbers, we use the BIGINT range. This is good because we can be sure that we’re not going to “run out” of numbers to generate for IDs for a very long time. Nine quintillion is a lot of anything.
After we create an ID number for the various things on the site (shops, members, listings, treasuries, photos, etc.), we store those ID numbers in other databases alongside the information that goes with it. Think of this as the index in the back of a reference book. For example, in the case where we want to show the shop page for shop number 123,456, we want to “ask” the databases queries like: “Give me all listings for shop number 123,456.” Then we take the answer that we get, and display them on the shop page. (This is a simplified example.)
Now some fields don’t need to be of the BIGINT type, because they won’t ever reach 9 quintillion (or it would take an astronomically long time before they could). In fact, using BIGINT everywhere would be a huge waste of space for the servers, space we could use for more important things.
In those cases, we can just use the type of INT, which I mentioned has a maximum of 2,147,483,647, or a little over 2 billion. An example of this is Etsy Teams. We have about 13,400 Teams on Etsy, which easily fits within the 2 billion range of possible numbers, with no danger of reaching 2.1 billion anytime soon.
On Friday, August 10, one of the places where we generate these unique ID numbers went over the 2,147,483,647 maximum. When the code went to insert these numbers into fields expecting values under 2.1 billion, the database refused and we got an error. Among those places were treasuries and treasury comments, and another was activity feeds.
When we noticed that we had places in the databases where the code would try to insert larger numbers than the INT maximum (2.1 billion) into a place that was defined as an INT, we took the site down purposefully in order to make sure we:
- Didn’t risk the loss or corruption of any data by leaving the site running with this condition.
- Traced the different places where we were trying to insert too large of a number into a field that wasn’t expecting it, and changed the database field to accept BIGINT numbers, not just INT numbers.
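In code terms, the problem boils down to something like this sketch (the column names and types here are illustrative, not our real schema): a value just past the INT maximum fits fine in a BIGINT column, but is refused by an INT column.

```python
INT_MAX = 2**31 - 1       # largest value an INT column accepts
BIGINT_MAX = 2**63 - 1    # largest value a BIGINT column accepts

# Illustrative column definitions -- not our real schema.
columns = {
    "treasury_id": INT_MAX,    # too small: the kind of field that broke
    "listing_id": BIGINT_MAX,  # plenty of headroom
}

def fits(column, value):
    """Would this value be accepted by the column, or refused with an error?"""
    return 1 <= value <= columns[column]

print(fits("treasury_id", 2_147_483_648))  # False -- one past the INT maximum
print(fits("listing_id", 2_147_483_648))   # True
```

The fix was the equivalent of changing the “too small” entries to the BIGINT maximum, so the same inserts succeed.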
Once that was done and we could confirm most of the site was working, we turned the site back on, but left treasuries and parts of activity feeds disabled because we weren’t yet in a place where we felt confident about those pieces. Being able to disable some features is one of the things we do, precisely for situations like this. We don’t want to prevent shoppers from buying items just because the Treasury and Activity Feed weren’t behaving correctly, so we brought the site up without them.
We then tracked down what ended up being about 100 treasuries that were affected by the outage, and that weren’t able to take any comments. We fixed these, and then started the process of re-calculating the activity feeds, which had gone stale during the time we were confirming everything.
What are the things we’re doing in order to prevent this from happening in the future?
This is what we call a “latent condition.” The ID numbers are ever increasing, and until it passed the maximum INT mark, all was well.
First, we’re making some changes to the database to become even stricter about rejecting values larger than expected, so that it fails in a louder, safer way. We would rather a member get an error message than allow data to be corrupted. We’re also making sure that in our development environment (where we test changes) all of the ID numbers are set above the max INT, so that any accidental breakage happens in that environment, not on the live Etsy production site.
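That development-environment change can be sketched like this (the counter names are hypothetical): push every ID counter past the INT maximum, so that any field still declared as an INT breaks during testing instead of in production.

```python
INT_MAX = 2**31 - 1

def seed_dev_counters(counters):
    """Push every development ID counter past the INT maximum, so any
    undersized field fails during development instead of in production."""
    return {name: max(value, INT_MAX + 1) for name, value in counters.items()}

# Hypothetical dev counters, well under the INT maximum...
dev = seed_dev_counters({"shop_id": 2_000_000, "listing_id": 90_000_000})
# ...every counter now starts above 2,147,483,647.
```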
We’re also building automated tests to prevent database fields from being created with the wrong type. These are the tests that are run while developers are writing new code, to make sure it will work as expected.
If the code tries to store a number into a database that is larger than the INT maximum (again, 2.1 billion) and the database is only told to accept numbers smaller than that, then the test will fail in a very obvious way, so that developers can notice it. Think of this as yet another “safety net” to prevent the same issue from occurring with new code in the future.
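Here is a sketch of what such a safety-net test might look like (the table and column names are made up for illustration): every column that stores a generated ID must be declared as a BIGINT, and the test fails loudly if one isn’t.

```python
# Illustrative schema snapshot -- in a real test this would be read
# from the development database, not hard-coded.
schema = {
    "treasuries.treasury_id": "BIGINT",
    "treasury_comments.treasury_id": "BIGINT",
    "activity_feed.item_id": "BIGINT",
}

def test_id_columns_are_bigint():
    """Fail loudly if any ID column has a range that is too small."""
    wrong = [col for col, kind in schema.items() if kind != "BIGINT"]
    assert not wrong, f"ID columns with too small a range: {wrong}"

test_id_columns_are_bigint()  # passes silently; fails on any INT column
```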
We’ve also been looking through the code to confirm that there are no other places where there are mismatched ranges like this. We’re also going to plot on a graph the values of all of the unique ID numbers, and send an alert to our 24×7 on-call team if we are anywhere near the maximum limit again. Alerts are another form of “safety net.”
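The alert itself is simple in principle; here’s a sketch, with a hypothetical threshold: page the on-call team once a counter has used up most of its column’s range, long before it actually hits the maximum.

```python
INT_MAX = 2**31 - 1

def needs_alert(current_id, column_max=INT_MAX, threshold=0.75):
    """Alert the on-call team once an ID counter has used up more
    than `threshold` of its column's available range."""
    return current_id / column_max > threshold

print(needs_alert(1_000_000_000))  # False -- under half the range used
print(needs_alert(1_900_000_000))  # True  -- roughly 88% used, time to act
```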
Other Outages and Degradations
Since the outage on August 10, we’ve spent a good deal of time tracking down the various places in all of the databases where numbers could have the wrong range assigned to them, and fixing them. During the outage we fixed the places that were broken, but we still had a number of places where the numbers weren’t yet at 2.1 billion, and would break in the future if we didn’t change them to allow the 9 quintillion range.
On August 18, while altering some data behind the scenes to have the expanded 9 quintillion range, the process was much harsher on one of the databases than we expected. That particular database table is the largest one we have at Etsy: the billing table. We designed the process of altering the ranges (from INT to BIGINT) to be as gentle and as slow as it needed to be, in order to do it without any members noticing slowness or blips during the operation.
In addition, we chose the lowest part of the week, traffic-wise, to do it: early Saturday morning. At about 5 a.m. ET, the database hit a part of the process that wasn’t expected: it needed to re-create all of its indexes.
What does that mean? Think of an index at the end of a large book. Now imagine pulling out those index pages, and rewriting that index all over again. It would mean that you’d have to reread the whole book, picking up keywords and phrases to put into your new index, making sure to put them into the index in alphabetical order. In the meantime, to search for anything in the book, you couldn’t use the index, because it’s not yet written.
This is a simplified version of what happened on Saturday morning. While the database was happily recreating its indexes, any queries that came in to it were so slow that they piled up like a traffic jam. We took the site offline in order to prevent the prolonging of the index rewriting. Once it was done and we were able to confirm everything was okay, we brought the site back up.
We weren’t yet confident about the bill-viewing feature at that time, so we left it disabled for a short while after the rest of the site was back up. Once we confirmed all was well, we re-enabled it.
In many of the situations I listed above, unexpected outages can present a decision:
- Keep the site online and risk either being too slow to be usable or taking in bad data.
- Take the site offline in order to fix the slowness and verify that the data is correct.
In each of the cases, we decided to take option #2, because it’s safer for the community.
After every outage we have, we hold a “postmortem” meeting, where we aim to come up with as many things as we can learn from the outage, in order to prevent similar incidents from happening in the future. We take those items and give them priority, even over new features. Some of them are quite simple, while others are more complex. In either case, we want to make sure that we learn from each and every event, even the ones the community doesn’t notice.
I wrote this blog post to give you the confidence you deserve that we take outages seriously, are willing to give detailed information about them, and that our aim is to learn from each one in order to lessen the possibility of another in the future.