Providing production support for an application is one of the most challenging aspects of software development. Developers are assigned to the maintenance team and work on patching bugs in the application. They are also on call in case a production outage happens, in which case they work to get the application back on track as quickly as possible.
This article aims to provide a set of curated recommendations so that you can prevent bugs in production and find issues much more quickly. Handling these applications in production is a complicated task: often there is no documentation available, the application has been written in a legacy technology stack, or both. There are very few training sessions, and it’s common to be called in to provide support for an application about which you know little.
Many developers do not have experience handling an application in production. A whole array of issues in production environments cause bugs and outages, which can cost the company thousands and sometimes millions of dollars in lost revenue. Moreover, since the majority of developers have no exposure to that environment, they keep making the mistakes that, in turn, cause those issues. This list of tips should make your job less painful by passing on lessons from production experience.
Tip #1: Remove or automate all the configuration needed for the application to run.
How much configuration is required to get the software installed on a new server? In the past, this could sometimes take three days to complete every time there was a new developer on the team. Installing the application would require many steps that have to be performed manually. Over time, software evolves to new versions which become incompatible with those instructions, and of course, instructions aren’t usually updated. Suddenly, you’re spending way more time than necessary simply to get the application up and running.
With the advent of containerization, it has become much easier to get an application up and running in no time, with zero configuration. There is an added benefit: since a Docker image is self-contained, you run a much lower risk of running into issues with different versions of the operating system, languages, and frameworks used.
Likewise, simplify developer setup, so it does not take much time to be up and running, including IDE setup. A developer should be able to go from zero to hero in less than 30 minutes.
When a production issue happens, sometimes your best experts might not be available (e.g., vacation or sickness) and you want whomever you throw at the problem to be able to solve it, and quickly.
Tip #2: Don’t fall into the tech stack soup trap.
The fewer technologies used, the better. Of course, sometimes you have to use the right tool for the job. However, be careful not to overload on “right tools.” Even drinking water can result in serious health issues if you do it too much. Every new language and framework added to the tech stack has to go through a clearly defined decision-making process, with careful consideration of the impact.
- Do not add a new framework dependency just because you need a
- Do not add a completely new language just because you need to write a quick script to move files around.
A big dependency pile can make your life miserable when libraries become incompatible or when security vulnerabilities are found either in the frameworks themselves or in their transitive dependencies.
Moreover, remember that added stack complexity makes it challenging to find and train new developers for the team. People move on to new roles in other companies, and you have to replace them. Turnover is very high in engineering teams, even in companies recognized for great perks and work-life balance. You want to find the new team member as quickly as possible. Every new technology added to the stack increases the time it takes to find a candidate and can make new hires more and more expensive.
Tip #3: Logging must guide you to find the issue, not drown you with useless details.
Logging is very similar to comments. It’s necessary to document all the critical decisions being taken, plus all the information you will need while debugging. It isn’t simple, but with a little bit of experience, it’s possible to map out a few likely production outage scenarios and then put in the logging needed to solve at least those. Of course, logging evolves together with the codebase depending on what kind of issues show up. Generally speaking, you should have 80% of your logging in the most important 20% of your code: the part that will be used the most. Important information includes, for instance, the values of arguments passed into a method, the runtime types of child classes, and important decisions taken by the software, that is, the moments when it was at a crossroads and chose either left or right.
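As a rough illustration of that kind of logging, here is a minimal Python sketch. The `Shipment` classes, the `quote_shipping` function, and the free-shipping threshold are all hypothetical names invented for this example; only the standard `logging` module is real. It logs the incoming argument values, the runtime type of the object received, and which branch the code took.

```python
import logging

logger = logging.getLogger("orders")


class Shipment:
    """Hypothetical base class; the subclass below changes the pricing."""

    def cost(self, weight_kg: float) -> float:
        return 5.0 + 1.0 * weight_kg


class ExpressShipment(Shipment):
    def cost(self, weight_kg: float) -> float:
        return 15.0 + 2.0 * weight_kg


def quote_shipping(shipment: Shipment, weight_kg: float, order_total: float) -> float:
    # Log the argument values and the runtime type of the object we received.
    logger.info(
        "quote_shipping called: type=%s weight_kg=%s order_total=%s",
        type(shipment).__name__, weight_kg, order_total,
    )
    # Log the crossroads: which branch was taken, and why.
    if order_total >= 100.0:
        logger.info("decision=free_shipping reason=order_total>=100")
        return 0.0
    price = shipment.cost(weight_kg)
    logger.info("decision=charge_shipping price=%s", price)
    return price
```

To see the messages while experimenting, call `logging.basicConfig(level=logging.INFO)` before calling `quote_shipping`.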
Tip #4: Handle unexpected situations.
Map out very clearly what the assumptions of the code are. If a certain variable should only ever contain one of the values 2, 5, or 7, make sure it’s of an enum type, not int. The number one source of large production outages is when a certain assumption fails. Everybody is looking for the problem in the wrong place because they take a few things for granted.
Assumptions should be documented explicitly, and any violation of those assumptions should raise enough alarms that the production support team can quickly rectify the situation. There should also be code to prevent data from getting into an invalid state, or at least to create some sort of alert when it does. If certain information should be stored in one record, and suddenly there are two records, a warning should be fired.
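One way to encode the "2, 5, or 7" assumption in Python is sketched below. The `RetryPolicy` name and its member names are made up for illustration; the point is only that converting a raw integer into the enum fails immediately for any other value.

```python
from enum import IntEnum


class RetryPolicy(IntEnum):
    """Hypothetical setting whose only legal values are 2, 5, and 7."""

    LOW = 2
    MEDIUM = 5
    HIGH = 7


def parse_retry_policy(raw: int) -> RetryPolicy:
    # RetryPolicy(raw) raises ValueError for anything other than 2, 5, or 7,
    # so a bad value fails here, at the boundary, not deep inside the system.
    try:
        return RetryPolicy(raw)
    except ValueError:
        raise ValueError(f"retry policy must be one of 2, 5, 7; got {raw!r}") from None
```

With this in place, `parse_retry_policy(5)` returns `RetryPolicy.MEDIUM`, while `parse_retry_policy(3)` raises a `ValueError` that names the offending value.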
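The "one record, suddenly two" case could be guarded along these lines. This is only a sketch: the `load_single_profile` function and the row layout are assumptions made for the example, and a real system would likely route the warning to whatever alerting channel the support team watches.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("invariants")


def load_single_profile(rows: List[dict], customer_id: int) -> Optional[dict]:
    """Return the profile row for a customer, warning if more than one exists."""
    matches = [r for r in rows if r["customer_id"] == customer_id]
    if len(matches) > 1:
        # The documented assumption ("exactly one record") has failed: make noise.
        logger.warning(
            "assumption violated: expected 1 profile for customer_id=%s, found %d",
            customer_id, len(matches),
        )
    # Fall back to the first match so the caller can keep working.
    return matches[0] if matches else None
```

Returning the first match keeps the application running while the warning surfaces the broken assumption; raising an exception instead is an equally valid choice when continuing would be unsafe.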
Tip #5: It should be straightforward to replicate an issue happening to a customer.
One of the hardest steps is always to replicate the issue faced by the customer. Many times, you will spend 95% of the time trying to replicate the issue, and then the moment you can replicate it, it’s a matter of minutes to patch, test, and deploy. As such, the application architect should make sure that it’s tremendously simple and quick to replicate issues. Replication is slow largely because, to get to the same situation the customer is in, the developer has to do a significant amount of application configuration. There are many stored records that together make up the situation the customer is in, and the problem is that you as the developer have to guess exactly what the customer did. And sometimes, they have performed a sequence of steps, of which they only remember the last one.
Also, the customer will explain the issue in business terms, which the developer then has to translate into technical terms. And if the developer has less experience with the application, they will not know to ask for the missing details, since they don’t yet know what is missing. Copying the entire production database to your machine is usually infeasible. So there should be a tool to quickly import from the production database only the few records necessary to simulate the situation.
Say the customer has an issue with the Orders screen. You might have to import a few of their orders, their customer record, some order detail records, order configuration records, etc. Then you can load those records into a database running in a Docker container, launch that container, and just like that, you are seeing the same thing the customer is seeing. All of this, of course, should be done with the appropriate care to ensure no developer has access to sensitive data.
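A tool like that might look roughly like the following Python sketch. It uses SQLite only to stay self-contained; the `customers` and `orders` tables, their columns, and the `export_customer_slice` function are all hypothetical, and a real tool would cover every related table. Note that the name and email are replaced with placeholders rather than copied.

```python
import sqlite3


def export_customer_slice(
    prod: sqlite3.Connection, dev: sqlite3.Connection, customer_id: int
) -> int:
    """Copy one customer's rows from prod into dev; return the number of orders copied."""
    # Create the (hypothetical) target schema in the developer database.
    dev.executescript(
        """
        CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
        CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        """
    )
    row = prod.execute("SELECT id FROM customers WHERE id = ?", (customer_id,)).fetchone()
    if row is None:
        raise LookupError(f"no customer with id {customer_id}")
    # Mask sensitive fields instead of copying them out of production.
    dev.execute(
        "INSERT INTO customers (id, name, email) VALUES (?, ?, ?)",
        (customer_id, f"customer-{customer_id}", f"customer-{customer_id}@example.invalid"),
    )
    # Copy only this customer's orders, not the whole table.
    orders = prod.execute(
        "SELECT id, customer_id, total FROM orders WHERE customer_id = ?", (customer_id,)
    ).fetchall()
    dev.executemany("INSERT INTO orders (id, customer_id, total) VALUES (?, ?, ?)", orders)
    dev.commit()
    return len(orders)
```

The developer database produced this way holds the customer's order history but none of their personal details, which is usually enough to reproduce a screen-level bug.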
Tip #6: It should be obvious where to place the breakpoints in the application.
If you have a Customer screen, there should be some Customer object where you can place the breakpoints to debug an issue on that screen. Sometimes developers fall into abstraction fever and come up with some incredibly smart concepts on how to handle the user interface events. Instead, we should always rely on the KISS principle (Keep it Simple, St... er, Silly) and have one easily locatable method per UI event. The same goes for batch processing jobs and scheduled tasks: there should be an easy way to spot where to place breakpoints to assess whether that code is working or not.
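In code, "one easily locatable method per UI event" might look something like this Python sketch. The `CustomerScreen` class and its handler names are invented for illustration and are not tied to any particular UI framework.

```python
class CustomerScreen:
    """Hypothetical screen: each UI event maps to one plainly named method."""

    def __init__(self) -> None:
        self.saved: list = []

    def on_save_clicked(self, name: str) -> bool:
        # Put a breakpoint here to debug the Save button.
        cleaned = name.strip()
        if not cleaned:
            return False
        self.saved.append(cleaned)
        return True

    def on_delete_clicked(self, name: str) -> bool:
        # Put a breakpoint here to debug the Delete button.
        if name in self.saved:
            self.saved.remove(name)
            return True
        return False
```

Anyone searching the codebase for "Customer" and "save" lands directly on the method that runs when the button is clicked, with no dispatch layer to trace through first.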
Tip #7: Make sure all the external dependencies are explicitly documented.
Ideally, do this in the README file within the source control system so that the documentation cannot be lost. Document any external systems, databases, or resources that must be available for the application to run properly. Also, note which of these are optional, and add instructions on what to do when an optional dependency is not available.
Beyond Debugging Techniques
Once these recommendations are followed while creating new features or maintaining a system, production support will become a lot easier, and your company will spend a lot less time (and money) on it. As you already know, time is of the essence while troubleshooting production bugs and crashes: any minute that can be saved makes a big difference to the bottom line. Happy coding!