Hi,
We're sorry to announce that because of a faulty component upgrade yesterday, we've had to restore BigDB from a backup that is about 10 hours old.
If you're affected by this rollback, have questions, or need further assistance, please contact us directly at info@player.io, and we'll work with you to resolve your issues and figure out compensation.
Sorry,
Player.IO.
Technical Details:
We're running part of BigDB on top of a piece of software called Cassandra. Yesterday we upgraded our Cassandra cluster to a newer version (Cassandra is still very much under active development), ensured that everything worked correctly, and, as per our usual operations, closely monitored the system for the next few hours.
However, what we didn't realize was that our Cassandra upgrade left the cluster in a slowly deteriorating state, causing it to silently fail more and more until it finally died completely. Upon identifying the issue, we were faced with a tough question: Do we try to fix the upgraded version of Cassandra, hopefully saving most of the data, or do we roll back to the previous version along with the database backup we took around 10 hours ago, restoring the system to a known working state but losing some data? After discussing various ways we might try to fix the upgraded version, we decided that since some data had already been lost, and since there were so many unknowns, the best solution was to roll back to the old cluster with the latest backup.
How we're going to prevent this type of issue in the future:
We did extensive testing of the new Cassandra cluster in a closed environment, including performing a test upgrade on a complete replica of the live cluster. However, since the problem manifested itself slowly over many hours, that testing didn't identify it.
In the future, we'll be adopting an approach where we perform shadow writes from our live systems to a test cluster running the newer version, until we know that the new version is stable enough for our usage.
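The key property of shadow writes is that the live write path must never depend on the test cluster: every write goes to the live cluster first, is then mirrored to the shadow cluster on a best-effort basis, and any shadow failure is logged rather than surfaced. As a rough illustration (not Player.IO's actual code; the class and store names here are hypothetical, and in-memory dictionaries stand in for the two Cassandra clusters):

```python
import logging

log = logging.getLogger("shadow")

class ShadowWriter:
    """Mirror every write to a shadow cluster running the new version.
    Shadow failures are logged but never affect the live request path."""

    def __init__(self, primary, shadow):
        self.primary = primary  # live cluster client
        self.shadow = shadow    # test cluster on the newer version

    def put(self, key, value):
        self.primary.put(key, value)      # live write must succeed
        try:
            self.shadow.put(key, value)   # best-effort mirror
        except Exception:
            log.exception("shadow write failed for key %r", key)

# In-memory stand-ins for the two clusters, for demonstration only.
class DictStore:
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value

class FailingStore(DictStore):
    def put(self, key, value):
        raise RuntimeError("new version misbehaving")

writer = ShadowWriter(DictStore(), FailingStore())
writer.put("player:1", {"coins": 100})  # live write lands despite shadow failure
```

Watching the shadow cluster's error logs and health over days of real traffic is what would have surfaced a slow deterioration like this one before any live data was at risk.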