Hi All,
As you all know, the servers had some errors!
I wanted to share exactly what happened… a TL;DR version, and then a longer version for the tech folks out there, just to keep you all informed.
TL;DR
Our new servers behave differently from our old servers during login… they create a bunch of data and don’t delete it… we missed that it was happening, and our database grew too large, ate up all our memory, and caused a problem.
Longer (Technical) Version
We were previously using a service for our servers called Parse. Parse closed down today, but we had successfully migrated off of it… the database was moved off last May to the infamous MongoDB (which is actually pretty awesome… the MongoDB errors you have received have actually been an issue with our servers, not Mongo), and the actual servers were moved over the last 2 weeks to something called “Parse Server” - written by the folks who made Parse to emulate how Parse operates.
What we didn’t realise is that Parse Server, during the LOGIN operation, creates a session object and doesn’t delete it.
Parse also created an object during LOGIN, but it automatically revoked the old one on the next LOGIN.
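For the tech folks who want to see what that accumulation looks like from the database side, here’s a rough sketch of the kind of query we can run against MongoDB. Treat the collection and field names (“_Session” and the “_p_user” pointer field) as assumptions about how Parse Server typically lays out session data, not a dump of our actual schema.

```typescript
import { MongoClient } from "mongodb";

// A rough look at how many session documents each user has accumulated.
// "_Session" and "_p_user" are assumptions about Parse Server's MongoDB layout.
async function topSessionHolders(uri: string) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const sessions = client.db().collection("_Session");

    // Group session documents by the user pointer and list the worst offenders.
    const worst = await sessions
      .aggregate([
        { $group: { _id: "$_p_user", sessionCount: { $sum: 1 } } },
        { $sort: { sessionCount: -1 } },
        { $limit: 10 },
      ])
      .toArray();

    console.log("Users with the most leftover sessions:", worst);
  } finally {
    await client.close();
  }
}
```

(If you ever try something like this yourself, run it against a secondary… an aggregation over tens of millions of rows is not something you want hitting the primary at peak hours.)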
This was undocumented, and frankly most applications don’t get enough logins for it to matter… GEMS OF WAR, ON THE OTHER HAND, has rather a lot of users, tens upon tens of thousands of them logging in every day, in many cases multiple times!
As any of you who work with databases will know, there is a LOT of telemetry to examine to make sure your database is healthy. We have access to all of that and look at it regularly… we’d been a little concerned about the increase in disk space, but hadn’t figured out what was causing it; we thought perhaps the new server was storing some extra data, and made a note to investigate if it kept growing… and then BAM! We started seeing page faults like crazy last night. It turns out that one table, SESSIONS, which we hadn’t looked at because it had never been a problem, had almost 50 million entries, and its indices were filling our memory.
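For anyone curious, this is the sort of check that would have caught it earlier. It’s only a sketch: collStats is a standard MongoDB command, but the “_Session” collection name is an assumption about how Parse Server stores sessions rather than a copy of our setup.

```typescript
import { MongoClient } from "mongodb";

// Print the document count, data size, and index size of the session collection.
// collStats is a standard MongoDB command; "_Session" is an assumed collection name.
async function reportSessionStats(uri: string) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const stats = await client.db().command({ collStats: "_Session" });

    const mb = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
    console.log(`documents:       ${stats.count}`);
    console.log(`data size (MB):  ${mb(stats.size)}`);
    console.log(`index size (MB): ${mb(stats.totalIndexSize)}`);
  } finally {
    await client.close();
  }
}
```

The index size number is the one that bit us: indices are what MongoDB tries hardest to keep in memory, so 50 million entries’ worth of them is exactly how you end up paging like crazy.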
Finally…
We haven’t solved the problem yet… though our tech leads assure us the fix is well underway. We’ve added extra memory to our database servers while we sort this out over the next couple of days, and we’ll keep you all posted on how it goes.
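To give a flavour of what the cleanup can look like, here’s a sketch of one possible approach… not necessarily exactly what our tech leads are doing. It assumes the session documents carry an expiresAt field stored as a real date, which may not match our schema exactly.

```typescript
import { MongoClient } from "mongodb";

// One possible cleanup: delete sessions that have already expired, then add a
// TTL index so MongoDB removes future ones automatically once they expire.
// Assumes "expiresAt" is stored as a real Date on the "_Session" collection.
async function pruneExpiredSessions(uri: string) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const sessions = client.db().collection("_Session");

    // One-off cleanup of everything that is already past its expiry date.
    const result = await sessions.deleteMany({ expiresAt: { $lt: new Date() } });
    console.log(`removed ${result.deletedCount} expired sessions`);

    // Ongoing hygiene: with expireAfterSeconds set to 0, MongoDB deletes each
    // document as soon as its expiresAt date has passed.
    await sessions.createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
  } finally {
    await client.close();
  }
}
```

The catch with anything like this is that each of those documents is a live login token, so you only want to remove ones that have genuinely expired… which is why both the one-off delete and the TTL index key off the expiry date rather than just nuking the table.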
I apologize that we weren’t around to keep you filled in last night… I’d seen the initial spikes at 2am, done some maintenance, watched them go back down and called it a day. We’ll try to keep a presence online a little more regularly while we’re fixing things this week.