
Amazon Simple Storage Service (US) Event: July 20, 2008

9:05 AM PDT We are currently experiencing elevated error rates with S3. We are investigating.
9:26 AM PDT We're investigating an issue affecting requests. We'll continue to post updates here.
9:48 AM PDT We are currently pursuing several paths of corrective action.
10:12 AM PDT We are continuing to pursue corrective action.
10:32 AM PDT We believe this is an issue with the communication between several Amazon S3 internal components. We do not have an ETA at this time, but we will continue to keep you updated.
11:01 AM PDT We're currently in the process of testing a potential solution.
11:22 AM PDT Testing is still in progress. We're working very hard to restore service to our customers.
11:45 AM PDT We are still in the process of testing a series of configuration changes aimed at bringing the service back online.
12:05 PM PDT We have now restored communication between a small subset of hosts. We are working on restoring internal communication across the rest of the fleet. Once communication is fully restored, we will work to restore request processing.
12:25 PM PDT We have restored communication between additional hosts and are continuing this work across the rest of the fleet. Thank you for your continued patience.
12:51 PM PDT The restored hosts are stable and we are moving forward in restoring communication between additional hosts.
1:17 PM PDT We continue to make incremental progress, and communication between additional hosts has been restored. We are continuing with the plan to restore communication across Amazon S3's large fleet of hosts.
1:38 PM PDT At this point, we are accelerating progress on restoring internal communication as all signs continue to look good.
2:03 PM PDT We have restored all internal communication between hosts in the EU and we are continuing to make progress in the US. Once all internal communication has been restored, we will start a multi-step process to begin accepting requests across Amazon S3 locations.
2:19 PM PDT A quick update to let you know that we have now also restored all internal communication between hosts in our West Coast facilities in the US.
2:36 PM PDT We have restored all internal communication across Amazon S3 hosts. We have started the multi-step process to begin accepting requests across Amazon S3 locations.
3:07 PM PDT We are attempting to bring EU back up now, followed by our US locations. EU will be first due to the smaller number of hosts. No data has been lost during this incident.
3:23 PM PDT EU service has been fully restored. We have been working on the US in parallel but restoration will take longer due to fleet size.
4:03 PM PDT We are in the process of restoring Amazon S3 US, and will shortly be bringing the web servers back online. We expect the service will be restored within 1 hour.
4:22 PM PDT US service has been partially restored. We continue to work to fully restore the service.
4:42 PM PDT We continue to restore US service. We expect that request processing will be fully restored within 15 minutes.
5:00 PM PDT Amazon S3 service has been restored and is returning to normal. We will continue to monitor closely.
5:12 PM PDT We are confirming that service in both the US and EU has been fully restored. We appreciate your patience. We will provide more detail on this event once we have completed a full investigation.
July 21, 7:19 PM PDT We wanted to share a brief note about what we observed during yesterday's event and where we are at this stage. As a distributed system, the different components of Amazon S3 need to be aware of each other's state. For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to. To share this state information across the system, we use a gossip protocol. Yesterday, we experienced a problem related to gossiping our internal state information, which left the system components unable to interact properly and caused customers' requests to Amazon S3 to fail. After exploring several alternatives, we determined that we had to temporarily take the service offline so that we could clear all gossiped state and restart gossip to rebuild the state.

These are sophisticated systems, and it generally takes a while to get to the root cause in such a situation. We're working very hard to do this and will provide more information here when we've fully investigated the incident. We also wanted to let you know that for this particular event, we'll be waiving our standard SLA process and applying the appropriate service credit to all affected customers for the July billing period. Customers will not need to send us an e-mail to request their credits, as these will be applied automatically. This transaction will be reflected in our customers' August billing statements.
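
For readers unfamiliar with the term, the gossip-based state sharing mentioned in the July 21 note can be illustrated with a small simulation. This is only a generic sketch of a push-pull gossip exchange, not Amazon S3's actual implementation, which the note does not describe. The names `Node`, `gossip_round`, `run_until_converged`, the versioned `view` dictionary, and the host names are all hypothetical. The `reset` method loosely mirrors the recovery step of clearing all gossiped state and restarting gossip to rebuild it.

```python
import random


class Node:
    """A hypothetical host that tracks what it believes about every other host."""

    def __init__(self, name, status):
        self.name = name
        # view maps host name -> (version, status); a host starts out
        # knowing only about itself.
        self.view = {name: (1, status)}

    def merge(self, other_view):
        """Adopt any entry that is new to us or has a higher version."""
        for name, (version, status) in other_view.items():
            if name not in self.view or version > self.view[name][0]:
                self.view[name] = (version, status)

    def reset(self, status):
        """Discard all gossiped state, keeping only a fresh entry for ourselves."""
        self.view = {self.name: (1, status)}


def gossip_round(nodes, rng):
    """Each node exchanges views with one randomly chosen peer (push-pull)."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        peer.merge(node.view)  # push our view to the peer
        node.merge(peer.view)  # pull the peer's (now merged) view back


def run_until_converged(nodes, rng, max_rounds=100):
    """Gossip until every node knows about every other node.

    Returns the number of rounds taken, or None if max_rounds was not enough.
    """
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if all(len(n.view) == len(nodes) for n in nodes):
            return round_no
    return None


if __name__ == "__main__":
    rng = random.Random(2008)
    fleet = [Node(f"host-{i}", "up") for i in range(8)]
    print("rounds to build state:", run_until_converged(fleet, rng))

    # Clear all gossiped state and let gossip rebuild it from scratch.
    for node in fleet:
        node.reset("up")
    print("rounds to rebuild state:", run_until_converged(fleet, rng))
```

In this toy model, every host must relearn the whole fleet after a reset, which gives some intuition for why restoration was staged and took longer in the larger US fleet than in the EU.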

