Development Blog

Where we pretend to know how to code.

Maintenance is hard and what we learned, Pt 1.

Published: 2017-10-03

Author: Teddi

So the most recent weekend (2017/9/29-30) was quite interesting! The servers were down for approximately 27 hours with the physical machine down for about 16 hours. So what went wrong and why did we need to do the work we did?

To be up-front about one of the causes behind this: we here at [BB] have been fairly silent on a breach that we experienced around May. For those who are perceptive and keep track of the general infosec environment (or maybe you were just flat out affected by this), we were struck by a variant of WannaCry. Instead of encrypting our files and holding them to ransom, this variant installed a bitcoin miner on the server and turned it into a zombie machine that would respond to commands from a remote service.

Just to clear one thing up - no actual data loss / data breach occurred. No data was stolen. We hold no personal or sensitive details (such as payment / PayPal details) on Marruuk (Texas), and even then, we only ever hold tokens referring to that information; we never see your CC / personal details.

What happened was part of an automated attack with no direct interest in [BB] itself. Thankfully, due to the odd behaviour this particular variant caused, my suspicions were roused fairly quickly and we went into triage mode, attempting to isolate exactly what had happened and find ways to detect it. Unfortunately, due to the way it was embedded in the system, we weren’t able to clean the machine in a way that let us know for sure the infection was gone. However, the fact that it was receiving instructions on port 80 (the standard website port) meant we could block it from receiving them. Some time passed and we tentatively unblocked port 80 with what seemed to be no adverse effect, and throughout summer we allowed it to remain as is. We always had a plan in place to wipe the server - once a machine has been compromised, you cannot guarantee you’ve fully cleaned it up, so going nuclear is the only option you have.

What brought this plan forward was the fact that the server started accepting instructions again in September. This time, however, due to the nature of the instructions and what they were doing, I was actually able to discern what the cause was - a poorly documented Microsoft system designed to assist in automation and administration. I removed the offending entries and all issues subsided immediately.


However, this wasn’t good enough - we couldn’t guarantee that we had totally eradicated any potential backdoor or problems. The decision was made to take the server down, wipe it and improve our underlying system. At the time of writing, we’ve seen a complete success and we’ve been able to restore everything to normality fairly quickly (give or take a few lesser-played servers + the API).

But why were the servers down for so long? I thought you assured us 12 hours was the maximum?

It’s a well-known fact that Kaiden, Santamon and I are all English, living in the United Kingdom. None of us have physical access to the machine, so we need to either remote in or use a KVM over IP (basically a remote screen, mouse and keyboard etc. over the internet). I had asked the datacentre to mount an install image of an operating system on the KVM, to cut down on the time and fiddling the other methods would involve. In theory this part of the process should have taken no longer than an hour, and that’s being overly generous.

However, the .iso image kept unmounting itself, killing the installation constantly. I’d have to put a ticket in, attempt to resume the installation and, if that was impossible, restart from scratch. Some people define insanity as doing the same thing over and over again and expecting different results, and this was certainly it. The network speeds were appallingly slow and the .iso seemed to unmount after what seemed to be no more than 30 minutes. I had been trying since roughly 10am to get this process done and by the time I decided to take matters into my own hands, it was about 9:30pm. Not a bad thing that I’d taken an entire day off from my usual day-to-day activities to see this through.

The image that the datacentre was attempting to use was about 5.4GB in size, streamed from a remote location. I decided screw it - we can make this work better, we have to! Through the KVM, I manually set up a 200MB .iso of the OS. It was far lighter and nowhere near complete, but it gave me enough of what was required to see the job through. 20 minutes later and we had a working OS. It took another couple of hours to get the initial networking put together and by 2:30, or maybe 3am, Marruuk was finally alive. Not ready to start hosting servers, but she was alive.
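
For a rough sense of why the smaller image made such a difference, here is a back-of-the-envelope sketch in Python. It is not something we ran on the night. It assumes the installer has to pull the whole image before the mount drops (it may well need less), that the mount survives for the full 30 minutes, and that sizes are decimal megabytes. The function name and constants are made up for illustration.

```python
def min_throughput_mbit_s(image_mb: float, window_minutes: float) -> float:
    """Sustained throughput (Mbit/s) needed to stream image_mb megabytes
    within window_minutes. Assumes the whole image must be read."""
    window_seconds = window_minutes * 60
    return image_mb * 8 / window_seconds  # 8 bits per byte


# Figures from the post; the 30 minute window is the approximate
# time the .iso stayed mounted before dropping.
FULL_IMAGE_MB = 5400   # about 5.4GB
SMALL_IMAGE_MB = 200   # the lighter .iso
WINDOW_MINUTES = 30

if __name__ == "__main__":
    for label, size_mb in (("full", FULL_IMAGE_MB), ("small", SMALL_IMAGE_MB)):
        rate = min_throughput_mbit_s(size_mb, WINDOW_MINUTES)
        print(f"{label} image: needs about {rate:.2f} Mbit/s sustained")
```

Under those assumptions the full image needs 27 times the sustained throughput of the small one, which is the whole argument for going lightweight over a slow remote mount.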

To be continued in Pt. 2, where we elaborate on the new setup, a few other reasons why we did it, a bit about how we did it and the overall conclusion.
