Development Blog

Where we pretend to know how to code.

Maintenance is hard and what we learned, Pt 2.

Published: 2017-10-05

Author: Teddi

Apologies for the day’s break between these two posts. I couldn’t quite work out how to convey what I wanted to say, so I gave myself an extra day.

The new setup hasn’t changed the hardware in any way, shape or form, bar the fact that we can now access the full 2TB of our disks instead of the 1TB that the old RAID 1 configuration allowed (for clarity: when we first brought this server online years ago it was built with 1TB disks, both of which failed over time and were replaced with 2TB ones). Operating system-wise, we were previously running Windows Server 2008 R2, which proved to be an interesting learning experience. In essence, it comes with your standard consumer Windows features plus a few extras out of the box.
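To make the disk arithmetic concrete, here is a minimal sketch of why a mirror built on 1TB disks can stay at 1TB even after both disks are swapped for 2TB ones. It assumes a plain two-disk mirror that was never explicitly grown after the swap; the function name `raid1_usable_tb` and its parameters are made up for illustration and are not taken from any RAID tool.

```python
def raid1_usable_tb(member_sizes_tb, created_size_tb=None):
    """Usable capacity of a RAID 1 mirror, in TB.

    Every member holds a full copy of the data, so the array can never be
    larger than its smallest member. Assumption for this sketch: an array
    also keeps the size it was created with until someone explicitly grows it.
    """
    if len(member_sizes_tb) < 2:
        raise ValueError("a mirror needs at least two members")
    usable = min(member_sizes_tb)
    if created_size_tb is not None:
        usable = min(usable, created_size_tb)
    return usable


if __name__ == "__main__":
    print(raid1_usable_tb([1, 1]))                     # original 1TB disks -> 1
    print(raid1_usable_tb([1, 2], created_size_tb=1))  # one disk replaced -> 1
    print(raid1_usable_tb([2, 2], created_size_tb=1))  # both replaced, never grown -> 1
    print(raid1_usable_tb([2, 2]))                     # rebuilt from scratch -> 2
```

The point is simply that replacing the disks does not, by itself, hand back the extra space.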

For the time (and our experience) this was the logical and best option we had. It gave us a familiar environment with a neat GUI and, most importantly, the ability to install and run Garrysmod servers. Back in 2008 GMod could only run on Windows; no OSX or Linux options were available (well, Linux via WINE, sure, but this was hardly optimal). Over time I’ve come to learn about Windows’ own various shortcomings whilst also branching out. A really glaring issue, for example, is that on our setup Windows handled all networking on Core 0. This means that in a Denial of Service attack, Windows could be overwhelmed simply by being thrown more data than that one core could crunch, regardless of the fact that we had a 1Gbit pipe (at the time, your average attack was 30Mbit to 250Mbit). At a time when Denial of Service attacks were still treated primarily with disdain and nullroutes by datacentres, this proved to be a glaring hole; the only fixes were to use a Linux-based OS, which was capable of dealing with it better, or to install a hardware firewall at greater cost. Given the time, experience and finances of those involved, it was decided that the only option was to ride it out.

However, it’s not all negativity. Windows Server 2008 R2 served us from 2009 until September 2017, an entire 8 years. 8 years is an incredibly long time in tech, and there have been some interesting leaps and bounds in the server space. Cloud computing! Virtual machine prominence! Containers!11!

An impromptu meeting between various [BB] staff members was held on where we would go and how we would upgrade. While some people prefer to maintain the status quo and stick with what they know, we aim to upgrade to be bigger, better and easier. The less we have to worry about backups, redundancy and so on, the better! So our plan became this:

  1. The host machine will change from Windows Server 2008 R2 to Debian 9 Stretch.
    • This gives us all of the benefits of Linux with little to no downsides.
    • For example, Linux and Debian are both open, malleable systems. Open source means more eyes on the code, which in turn means bugs and security issues get patched incredibly quickly, often within hours of their announcement (as opposed to waiting for Microsoft’s Patch Tuesday).
    • Linux is incredible with some of the networking features that we plan to leverage, especially in terms of creating greater access to the most popular servers.
  2. We’ll deploy virtual machines for applications that require them. So for our Garrysmod servers, we deploy a virtual machine via KVM / QEMU that runs Windows Server 2016.
    • Running it in a virtual machine allows us to take snapshots of the OS, meaning we can roll back the entire VM if there’s a bad update or something else goes wrong.
    • It also means if we swap to new hardware, we can toss it straight over without having to reinstall everything, reconfigure and so on (for the most part at least).
    • We can manage and filter the connection between Linux and Windows. Denial of Service attack that isn’t being picked up by the datacentre firewall? We can filter it with greater effectiveness and block it from ever hitting the Windows VM.
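To give a feel for what filtering traffic before it reaches the VM can mean, here is a minimal sketch of a per-source token bucket, one common rate-limiting technique. It is purely illustrative and is not the configuration described in this post: on a real Linux host this kind of filtering would normally be done by the kernel’s packet filter rather than in Python, and the class name, rates and IP addresses below are all made up for the example.

```python
class TokenBucketFilter:
    """Per-source token bucket.

    Each source may send `rate` packets per second on average, with bursts
    of up to `burst` packets. Anything beyond that is dropped.
    """

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        # source address -> (tokens remaining, time of last update)
        self._state = {}

    def allow(self, source, now):
        """Return True if a packet from `source` at time `now` may pass."""
        tokens, last = self._state.get(source, (self.burst, now))
        # Refill for the time elapsed since the last packet, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[source] = (tokens - 1.0, now)
            return True
        self._state[source] = (tokens, now)
        return False


if __name__ == "__main__":
    packet_filter = TokenBucketFilter(rate=10, burst=20)
    # A flood of 1000 packets from one address in the same instant:
    passed = sum(packet_filter.allow("203.0.113.7", now=0.0) for _ in range(1000))
    print(passed)  # only the initial burst of 20 gets through
    # A well-behaved player on another address is unaffected:
    print(packet_filter.allow("198.51.100.2", now=0.0))  # True
```

Keeping one bucket per source means a single noisy address runs out of tokens without starving anyone else, which is the behaviour you want in front of a game server.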

I’m paraphrasing most of the benefits and keeping it short. It’s something that could realistically be fleshed out in its own blog post for those interested. The key point I’m trying to get across is that this is fantastically better. It gives us the redundancy we’ve been after for a long time, plus some neat bits of control.

When the server was initially configured, I had some misunderstandings about how the networking actually worked. On the second day this is what took up the majority of the time (about 2-3 hours in total) to configure and get working as intended. Once this was sorted and I was able to connect externally to both the Linux side and the Windows side at the same time, it meant the actual workload was complete. The next few hours were spent re-uploading all player data (130MB, compressed down to 20MB) and re-configuring the software. The last 3 hours of work ended up being a case of waiting on a license to be transferred and re-activated correctly so we could get the updates we needed to have everything running as before. To remedy the fact that this was taking much longer than expected, a manual, more crash-prone system was set up, which allowed us to resume service and let players get back online and playing on their favourite servers! Thankfully we only had to run in that less stable state for a few hours before we swapped it over.

All in all, if we ignore the trials and tribulations of the first day, this would have taken about 6-8 hours in the best case, well within the time-frame initially allocated. As shown, though, wrenches can be thrown into the plan, which delay things and make the work not so much harder as just plain longer. If I had to give advice to anyone doing this in future, it’d probably be these points:

  • Compartmentalisation is good. The more your systems can be spread apart (without intense fracturing / splitting) the better. This means you only need to move exactly what needs to be moved and no more.
    • An example of this is that the forums, donation server and so on remained online without any issue, because we don’t put everything on a single server.
  • Smaller input actions for bigger processing actions are ideal. A 200MB .iso is far less likely to fail in transfer than a 5.4GB .iso, especially on an unreliable connection.
  • Double, triple, quadruple check you have everything you need. Even if you expect a system to carry something over as part of its database, check again to make sure that this is the case.
    • A few minor configuration files were lost in the process as a result of me anticipating they were packed up elsewhere. Turns out this wasn’t the case.
    • Thankfully, that is all that was lost.
  • In the event things are taking longer than expected, even though the processes themselves are realistically straightforward and without issue, look for a way to deal with it faster and better instead of driving yourself insane with the same problem over and over.
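The "double, triple, quadruple check" point can be partly automated. Below is a minimal sketch, assuming you can read both the old and the new directory tree from one machine (or can save the first manifest before wiping anything). The function names `build_manifest` and `compare_manifests` are made up for this example and are not part of any tool used in the migration.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_manifests(before, after):
    """Return (missing, changed): files that vanished, and files whose
    contents differ between the two manifests."""
    missing = sorted(set(before) - set(after))
    changed = sorted(
        path for path in before
        if path in after and before[path] != after[path]
    )
    return missing, changed
```

Build one manifest from the old install before the move and another from the new install afterwards; anything listed in `missing` is a file you assumed had been packed up elsewhere. Note that this reads each file fully into memory, which is fine for small configuration files but would need chunked hashing for very large ones.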

Hopefully this post and the previous one were of interest to those who read them. I’m hoping to hit the sweet spot between being technical and conveying that technicality to those who may have less experience in these matters. While a couple of servers are still not 100% up and running, this should be done before the week is over. Thanks to everyone for bearing with us while we did this work and for holding on while we do the last few tweaks!
