While the trigger of this post may have been a series of unfortunate events and an ugly bit of flame bait that followed in my inbox some time last week, the inspiration is a certain German customer of Bytemobile.
So what exactly is the “German style” of upgrading? Here are a few rough notes based on the more than five on-site visits I’ve made to the aforementioned customer to support the upgrade of Bytemobile’s products.
- Test environment: This isn’t an environment with a bunch of (virtual) machines where new features are experimented with. With the exception of scale, it mirrors every single detail of the live environment. Nothing goes live before being installed in the test environment, with the installation documented down to the finest and most ridiculous detail you can think of (e.g. UTP cable colors) and rigorously tested in both a positive and a negative manner (yes, this includes things like pulling PSUs, redundant ones included, to test failover). And once a release goes live, both it and the previous “failover” release remain installed in the test environment. There is no such thing as a “simple configuration change”.
- Night shifts: No upgrade takes place during business hours. A group of people from both the customer and the vendors involved actually stays up all night. The upgrade process is kicked off after midnight and proceeds gradually, with a small but measurable number of users transitioned to the new environment at each step. All elements and units related to the new system are closely monitored for any abnormalities or alarms, and a full barrage of end-to-end tests is performed to spot anything out of the ordinary.
- Rollback: Rollback is always a possibility accounted for in the project schedule. In fact, the rollback window is scheduled to close before the sun rises and traffic volumes start increasing (typically some time around 6 a.m.). In the event that something goes wrong, the fewer people who notice, the better.
- Day shift: Regardless of the outcome, the people who stayed up all night get to enjoy lots of sleep, handing over to the day shift, either physically or over a conference call if remote personnel are involved. Traffic volumes now start to build up, so the day shift has its hands quite full: they continue to run tests and actively monitor the system to make sure it handles the load gracefully.
- Failure is an option: When complex systems and multiple vendors are involved, failure is not only an option but part of the business. Everyone works hard to avoid it, but it can and will occasionally happen. When it does, it’s not dealt with hastily. People make sure they have ample time to understand what went wrong and why, as well as why their rigorous prior testing and planning failed to catch it. Unless the issue was a really minor one and easily fixed (e.g. the network administrator neglecting to set an alarm clock; not that this has ever happened), the next activation is planned for at least a week into the future, more if the root cause is hard to address or other upgrades take place at that time.
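The night-shift procedure above, gradual cutover, continuous end-to-end checks, and a rollback deadline before the morning traffic ramp, can be sketched roughly as follows. Everything here is a hypothetical illustration: the function names, the step sizes, and the 6:00 deadline are my assumptions, not the customer’s actual procedure.

```python
from datetime import time

# Hypothetical sketch of the night-shift procedure: traffic is moved to the
# new system in small, measurable steps; end-to-end checks run after each
# step; and the whole attempt is rolled back if a check fails or if the
# rollback window (assumed here to close at 6:00) has expired.

ROLLBACK_DEADLINE = time(6, 0)  # traffic starts ramping up around sunrise


def gradual_cutover(steps, now, shift_traffic, run_checks, rollback,
                    deadline=ROLLBACK_DEADLINE):
    """Move traffic in increments; roll back on the first failed check,
    or if the rollback window closed before the cutover completed."""
    for fraction in steps:
        if not (time(0, 0) <= now() < deadline):
            rollback()                     # out of time: nobody should notice
            return "rolled_back"
        shift_traffic(fraction)            # e.g. 0.05 -> 5% of users moved
        if not run_checks():               # full barrage of end-to-end tests
            rollback()
            return "rolled_back"
    return "completed"


if __name__ == "__main__":
    moved = []
    result = gradual_cutover(
        steps=[0.05, 0.25, 0.5, 1.0],
        now=lambda: time(2, 30),               # well inside the window
        shift_traffic=moved.append,
        run_checks=lambda: moved[-1] <= 0.5,   # checks start failing past 50%
        rollback=lambda: moved.append(0.0),    # everyone back on the old system
    )
    print(result, moved[-1])  # -> rolled_back 0.0
```

The rollback callback is deliberately unconditional: as described above, the window exists precisely so that a failed attempt can be undone before anyone downstream notices.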
Dedicated to the unnamed engineers who feel that flipping a switch and working long hours to clean up the mess caused by poor in-advance testing and planning has any “German” qualities.