

Outage Details

Postby Oliver » May 4th, 2011, 12:13 pm

Hi,

We're sorry to announce that because of a faulty component upgrade yesterday, we've had to restore BigDB data that is about 10 hours old.

If you're affected by this rollback, need further assistance, or have questions, please contact us at info@player.io, and we'll work with you directly to resolve your issues and figure out compensation.

Sorry,
Player.IO.

Technical Details:
We're running part of BigDB on top of a piece of software called Cassandra. Yesterday we upgraded our Cassandra cluster to a newer version (Cassandra is still very much under active development), verified that everything worked correctly, and, as per our usual operations, closely monitored the system for the next few hours.

However, what we didn't realize was that our Cassandra upgrade left the cluster in a slowly deteriorating state, causing it to silently fail more and more until it finally died completely. Upon identifying the issue, we were faced with a tough question: do we try to fix the upgraded version of Cassandra, hopefully saving most of the data, or do we roll back to the previous version along with the database backup we had taken around 10 hours earlier, thus restoring the systems to a known working state but losing some data? After discussing various ways we might try to fix the upgraded version, we decided that since some data had already been lost, and since there were so many unknowns, the best solution would be to roll back to the old cluster with the latest backup.

How we're going to prevent this type of issue in the future:
We did extensive testing of the new Cassandra cluster in a closed environment, including performing a test upgrade on a complete replica of the live cluster. However, since the problem manifested itself slowly, over many hours, that testing didn't identify it.

In the future, we'll be adopting an approach where we perform shadow writes from our live systems to a test cluster running the newer version, until we know that the new version is stable enough for our usage.
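
As a rough illustration of the shadow-write idea, here is a minimal Python sketch. It is not Player.IO's actual implementation: `InMemoryStore`, `ShadowWriter`, `put`, `get`, and `mismatched_keys` are hypothetical names, and the in-memory dictionary only stands in for a real cluster client. The point it tries to show is that every write goes to the live store first, is then mirrored to the store running the newer version, and that a failing shadow write is recorded rather than passed on to live traffic.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryStore:
    """Hypothetical stand-in for a database cluster client."""

    def __init__(self, fail_writes: bool = False) -> None:
        self.data: Dict[str, str] = {}
        self.fail_writes = fail_writes  # simulate a deteriorating cluster

    def put(self, key: str, value: str) -> None:
        if self.fail_writes:
            raise IOError("write failed")
        self.data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)


class ShadowWriter:
    """Writes to the live store, then mirrors each write to a shadow store."""

    def __init__(self, live: InMemoryStore, shadow: InMemoryStore) -> None:
        self.live = live
        self.shadow = shadow
        self.shadow_errors: List[Tuple[str, str]] = []

    def put(self, key: str, value: str) -> None:
        # The live write is authoritative: its errors reach the caller.
        self.live.put(key, value)
        # The shadow write must never affect live traffic, so a failure
        # is only recorded for later inspection.
        try:
            self.shadow.put(key, value)
        except Exception as exc:
            self.shadow_errors.append((key, repr(exc)))

    def mismatched_keys(self) -> List[str]:
        """Keys whose value in the shadow store differs from the live store."""
        return sorted(
            k for k, v in self.live.data.items() if self.shadow.get(k) != v
        )
```

Checking `shadow_errors` and `mismatched_keys()` over many hours, rather than only right after the upgrade, is what would give a slow, silent failure a chance to show up before the new version is trusted with live data.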
Oliver
.IO
 
Posts: 1159
Joined: January 12th, 2010, 8:29 am

Re: Outage Details

Postby cjcenizal » May 4th, 2011, 5:01 pm

Thank you for explaining what happened and for adjusting your methodology for preventing it from happening again. That's all I could ask for.
cjcenizal
Paid Member
 
Posts: 115
Joined: March 29th, 2011, 12:31 am

Re: Outage Details

Postby wbsaputra » May 5th, 2011, 3:09 am

I finally found out why I've been getting complaints from some users in my game about their stats rolling back.
It's okay, I know it's not easy to build a multiplayer shared server cluster. Cheer up.
wbsaputra
Paid Member
 
Posts: 150
Joined: June 29th, 2010, 4:38 pm

Re: Outage Details

Postby wbsaputra » May 14th, 2011, 10:19 am

I think the bug with new BigDB objects has come back... :cry:
wbsaputra
Paid Member
 
Posts: 150
Joined: June 29th, 2010, 4:38 pm

