Since yesterday I have been seeing an exceptionally high ratio of failed or significantly delayed S3 operations. As you may know, uKeeper's internals run on AWS and make use of several services such as S3, SQS, DynamoDB and some others. S3 is especially important for the distributed data retrieval process. Practically all article-related resources, such as images and other attachments, come from S3.
S3 works as eventually consistent storage, which means a simple thing: in some cases data written to S3 is not immediately visible to all readers. That is fine if readers expect such behavior and can wait long enough for the updated data to become available. uKeeper is one of those smart readers and has worked this way since day one. Usually the waiting period was 100-200ms, and in very rare cases 500ms. As a paranoid developer I put in place a waiting period of up to 15sec, but since Apr 2012 I have seen only 4 cases with unusually high latency (~3sec).
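To make the "wait until the data is visible" idea concrete, here is a minimal sketch of a polling reader with a deadline. It is not uKeeper's actual code: `wait_until_visible`, the `fetch` callable (assumed to return `None` while the object is not yet readable) and the 100ms poll interval are all assumptions for illustration. Only the 15sec upper bound comes from the numbers above.

```python
import time
from typing import Callable, Optional


def wait_until_visible(
    fetch: Callable[[], Optional[bytes]],
    timeout: float = 15.0,   # upper bound mentioned in the post
    interval: float = 0.1,   # assumed poll interval, not from the post
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Poll `fetch` until it returns data or `timeout` seconds pass.

    `fetch` is a hypothetical callable that returns the object's bytes,
    or None while the object is not yet visible to this reader.
    """
    deadline = clock() + timeout
    while True:
        data = fetch()
        if data is not None:
            return data
        remaining = deadline - clock()
        if remaining <= 0:
            # Gave up: the caller decides what a missing object means.
            return None
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting. In production the defaults would be used, and `fetch` would wrap whatever S3 read call the application already makes.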
However, since yesterday S3 has been acting differently: a relatively large share of submitted objects is still not available hours after the write! I have informed AWS support, and it looks like they are trying to fix it. I see the number of such incidents decreasing dramatically, and for the last 12 hours I got just 4 delayed writes.
On the uKeeper side this issue initially caused “request rejected” for some users. As soon as the problem was detected I put in place a hot-fix that allows articles to be processed even if one of the resources failed or was delayed by S3. In this case the user will get the article, but it may have a missing picture. Please note: this is a really, really rare case now, and hopefully AWS will get it fixed completely very soon.
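The general shape of such a hot-fix can be sketched as follows. This is only an illustration of the "skip what is missing, keep the article" approach, not the real uKeeper change: `collect_resources` and the `load` callable (assumed to return `None` or raise when a resource cannot be read) are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def collect_resources(
    keys: Iterable[str],
    load: Callable[[str], Optional[bytes]],
) -> Tuple[Dict[str, bytes], List[str]]:
    """Load every resource that is available; record the rest as missing.

    `load` is a hypothetical callable that returns the resource bytes,
    returns None if the resource is not available, or raises on error.
    """
    loaded: Dict[str, bytes] = {}
    missing: List[str] = []
    for key in keys:
        try:
            data = load(key)
        except Exception:
            # A failed read is treated like a missing resource
            # instead of rejecting the whole article.
            data = None
        if data is None:
            missing.append(key)
        else:
            loaded[key] = data
    return loaded, missing
```

The article is then assembled from `loaded`, while `missing` can be logged or retried later, so a single delayed picture no longer turns into a rejected request.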
From this incident I learned a few important things about “what to do if AWS is acting strange”, and I am going to implement a new set of backup strategies for cases like this.
UPD: 01/05 13:57 CDT – The problem with S3 was resolved completely.