Over the last two weeks we had two crashes. This is totally unacceptable, even given uKeeper’s current beta status. The service is not supposed to fail under any circumstances and has to be rock-solid and highly reliable. Both problems were caused by unexpected events that led to extremely high memory utilization and, as a result, to the kernel terminating some of uKeeper’s processes.
This is what we are doing to resolve the problem and protect users against such unpleasant events:
- Investigating the cause of the memory utilization spikes. It looks like we have already identified and addressed it, but we are still monitoring the fix.
- Adding full, end-to-end automatic monitoring with self-recovery options.
- Switching most of the important elements of uKeeper (queues, task state and others) to a fully persistent model. This will allow auto-healing to be completely transparent and painless for users.
- Adding an extremely heavy set of stress tests to our existing integration tests.
- Getting ready for the new major 0.10.x release, with a much higher level of component isolation and wider and deeper distribution across multiple servers.
High availability and resilience have been our major goals from day one, and issues of this type get the highest priority on uKeeper’s to-do list.