I’ve been a software engineer for more than four years. In my last job I worked in a kind of “software escalation engineer” role: I led the investigation of issues occurring in our customers’ production systems and trained both our team and the support team. My daily bread was weird, hard issues that I needed to resolve in a timely manner when possible, while providing daily updates to management, support, and the customer.
This has helped me a lot to become more open-minded and to find more creative ways to troubleshoot different types of issues, from network, database, and code problems to weird performance behavior in distributed applications. In this article I’ll show you some tips and tricks for how I narrow down an issue, from beginning to end.
A weird issue appeared. What can we do?
This is how an escalation case always comes into our hands, and as software engineers we will be fixing new problems that are not documented anywhere. Imagine an escalation that says “My customer cannot perform remote searches to their geo-distributed applications”. This may sound tricky, challenging, and scary at first, but I’m here to show how I like to work through such scenarios. Do not worry, the later sections go into more detail on what to do for specific kinds of issues; for now, let’s just focus on the initial approach to the example above.
1- Ask questions! Understand their scenario…
When a case like that comes in, I always ask questions, since that speeds up the investigation. Some questions that may arise are: In which specific data center is their application located? Is it only impacting searches? How frequently does this occur? Which application version are they using? Is it old or outdated? What does their architecture look like? Are they using on-premises systems or the cloud? Which error are they seeing? Is it a timeout, or another type of exception? As you can see, there are a lot of questions that can be asked in such a situation, and it is easy to lose track of why each one is being asked (I think that is why some support folks had a hard time following up), which leads me to the second point.
2- Explain what you are trying to achieve
When working on a worldwide team there is always a cultural barrier, as well as differences in English skills (I speak Spanish; English is my second language), so I understand that not everyone is equally comfortable with the language. Always explain and give context so you get better feedback from your team and your customers. If you show confidence and explain things in detail, you will ace that interaction!
3- Now you understand the issue… What’s next?
Let’s assume you now understand the issue. They have a distributed application made up of 10 worker applications and 3 master nodes; the masters perform remote searches against the workers, and you see the data on the masters. The machines are located in different places in the US, such as Chicago, New York, and Texas. The version is up to date, so they have all the latest patches, and the logs show several timeouts. So… what could be wrong here? You are right! Since the nodes are spread across different locations, the network could be the problem, or maybe a mix of latency and code.
I hope this preliminary analysis was clear and showed why understanding the issue matters, because now we move on to the more technical part: how do we troubleshoot this based on the evidence we have already gathered?
Troubleshooting tips & tricks
We are going to continue with the issue I just described, which we were able to narrow down to something at the network or application level.
Troubleshooting network-related issues!
Here you must understand how network connections work. Knowing TCP and UDP is key, and so is identifying how remote calls are made in your application, since not all applications communicate the same way: some use a Pub/Sub architecture, others use RPC calls, and others simply call endpoints.
I also recommend getting familiar with the OSI model. OSI is our friend when troubleshooting issues like this one. In our example, the issue might be related to the Network or Transport layer, or to the Application layer (since the application was also receiving and processing information).
At this point we have identified 3 layers where the issue in our example might lie: Network, Transport, and Application. So what’s next? Well, let’s identify a feasible test for each of them, because we need to confirm or rule out each hypothesis, which leads me to another tip.
Always get metrics!
When troubleshooting any type of issue related to network, performance, etc., metrics are gold! How will you know you are facing a problem if you do not have trustworthy data to compare against? A good way of getting metrics (depending on your issue) is to gather OS metrics such as CPU, disk I/O, RAM, thread dumps, etc. If you do not have the tools to do it, create your own scripts!
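As a minimal sketch of such a home-made script, the snippet below takes one timestamped sample using only the Python standard library. The function names `snapshot` and `to_log_line` are made up for this example, and it only covers load average and disk usage; RAM, disk I/O, and thread dumps would need other tools.

```python
import json
import os
import shutil
import time


def snapshot(path: str = ".") -> dict:
    """Collect one timestamped sample of cheap OS-level metrics."""
    usage = shutil.disk_usage(path)
    sample = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "cpu_count": os.cpu_count(),
        "disk_used_pct": round(100.0 * usage.used / usage.total, 2),
    }
    # os.getloadavg() only exists on Unix-like systems, so treat it as optional.
    try:
        sample["load_avg_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        sample["load_avg_1m"] = None
    return sample


def to_log_line(sample: dict) -> str:
    """Serialize a sample as one JSON line so it is easy to grep and parse later."""
    return json.dumps(sample, sort_keys=True)
```

Calling `to_log_line(snapshot())` on a schedule and appending the result to a file gives a baseline to compare against once the problem shows up.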
But how will we do it for the issue we are troubleshooting? Well, since it is a distributed application, all the requests go over the network, so we can create a script that gathers ICMP metrics (pings), saves each result with a timestamp, and parses the duration of each ping. In this case the script should run on a search master and ping each worker machine, so we can confirm or rule out an issue at the network layer.
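The ping script described above could look roughly like this. It is a sketch under assumptions: it shells out to the system `ping` command with the Unix-style `-c 1` flag, it expects output containing something like `time=12.3 ms`, and the function names and CSV layout are invented for illustration.

```python
import re
import subprocess
import time
from typing import Optional

# Matches the round-trip time in typical ping output, e.g. "time=12.3 ms".
_RTT_PATTERN = re.compile(r"time[=<]\s*(\d+(?:\.\d+)?)\s*ms")


def parse_rtt_ms(ping_output: str) -> Optional[float]:
    """Return the first round-trip time found in ping output, or None if there is none."""
    match = _RTT_PATTERN.search(ping_output)
    return float(match.group(1)) if match else None


def ping_once(host: str, timeout_s: int = 5) -> Optional[float]:
    """Send one echo request with the system ping command (Unix-style -c flag assumed)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return parse_rtt_ms(result.stdout)


def format_sample(host: str, rtt_ms: Optional[float]) -> str:
    """Build one CSV line: timestamp, host, round-trip time in ms (or 'timeout')."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    value = "timeout" if rtt_ms is None else f"{rtt_ms:.1f}"
    return f"{stamp},{host},{value}"
```

Running `format_sample(host, ping_once(host))` for every worker every few seconds from a master, and appending the lines to a file, produces a latency history that can later be lined up against the timeouts in the application logs.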
To rule out an application issue (the web server), we can follow the same approach but send a GET request to an endpoint instead, so we can tell whether the web server is dying or not. If there is a gap between the two metrics, for example the pings look healthy while the requests are slow or time out, the issue most likely sits in the transport layer.
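Here is a small sketch of that second probe and of the comparison between the two metrics. The function names, the 1000 ms threshold, and the wording of the verdicts are assumptions made for this example, not fixed rules; the verdict only says whether the problem looks like it is at the network layer or somewhere above it.

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def timed_get(url: str, timeout_s: float = 5.0) -> Optional[float]:
    """Return the elapsed time in ms for a GET request, or None if it failed.

    Timeouts, connection errors, and HTTP error statuses all count as failures.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            response.read()
    except (urllib.error.URLError, OSError):
        return None
    return (time.monotonic() - start) * 1000.0


def interpret(ping_ms: Optional[float], http_ms: Optional[float],
              slow_ms: float = 1000.0) -> str:
    """Compare a ping sample with an HTTP sample taken at about the same time."""
    if ping_ms is None:
        return "network: host did not answer ping"
    if http_ms is None:
        return "above network: host answers ping but the request failed"
    if http_ms - ping_ms > slow_ms:
        return "above network: request is much slower than the ping"
    return "ok"
```

Keep in mind that some networks block ICMP, so a missing ping reply on its own does not prove that the host is unreachable.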
Troubleshooting performance issues!
As I said before, we always need metrics, before and after any change. To provide good performance metrics we should also monitor the server’s behavior and resource consumption, for example with Grafana and Prometheus, to get good visibility into what our application consumes. It also helps to use the network debugger built into every browser to track the time each request takes.
One common error I have noticed is that everyone measures performance on the happy path, but not in the worst-case scenario. For example: if the application is fully loaded, how are those numbers affected? Can we improve how it behaves under load? This is really important.
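To make the happy path versus worst case comparison concrete, here is a minimal sketch that summarizes request latencies with percentiles. The two sample lists are made-up numbers for illustration only, and `summarize` is a hypothetical helper name.

```python
import statistics
from typing import Dict, List


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return the median and tail percentiles for a list of request latencies."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }


# Made-up measurements: the same endpoint on an idle system and under full load.
idle_ms = [20, 22, 21, 23, 19, 20, 22, 21, 20, 24]
loaded_ms = [20, 25, 30, 28, 400, 35, 900, 27, 31, 29]

idle_summary = summarize(idle_ms)
loaded_summary = summarize(loaded_ms)
```

With data like this the medians stay close to each other while the tail percentiles diverge sharply, which is exactly what a happy-path-only measurement hides.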
Troubleshooting non-replicable issues in the lab!
These are the hardest ones, since we do not have a local lab to play with and everything must be planned to avoid unnecessary loss of time. Here we must understand the circumstances in which the issue occurs and build a “history” of all the issues faced with the given application or machine. In my experience, issues that cannot be replicated in our lab are usually caused by something specific to the customer’s machine.
So it is important to find out whether there was a crash beforehand, or whether they added new software that could affect their system. I remember a time when a customer deleted data while moving an NFS and wanted us to fix that. Well… sometimes it is not possible.
Other tips that I want to share with you are:
- Understand what the user meant to do
- Always read logs and isolate classes
- Check network and communication protocols
- Do not forget ciphers
- Always get metrics!
- Create scripts and more scripts when needed!
- Do not trust what the customer says and always double-check
- The most basic thing could lead to incredible headaches, so always check the basics
- Check every module that plays a role in the communication, from the code to the network and hardware systems
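The “read logs and isolate classes” tip from the list above can be sketched as a small script that counts how many matching log lines each class produced. The log format here is hypothetical (`date time LEVEL logger.Class - message`), so the regular expression would need adapting to the real logs.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical log format: "<date> <time> <LEVEL> <logger.Class> - <message>"
_LINE = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<cls>[\w.$]+) - (?P<msg>.*)$")


def count_by_class(lines: Iterable[str], needle: str = "timeout") -> Counter:
    """Count log lines whose message contains `needle`, grouped by logging class."""
    counts: Counter = Counter()
    for line in lines:
        match = _LINE.match(line.strip())
        if match and needle.lower() in match.group("msg").lower():
            counts[match.group("cls")] += 1
    return counts


# Invented sample lines, only to show the expected input shape.
sample_lines = [
    "2024-01-01 10:00:00 ERROR search.RemoteClient - Read timeout after 30000 ms",
    "2024-01-01 10:00:01 INFO search.Scheduler - job started",
    "2024-01-01 10:00:02 WARN search.RemoteClient - connect timeout to worker",
    "2024-01-01 10:00:03 ERROR db.Pool - Timeout acquiring connection",
]
timeouts = count_by_class(sample_lines)
```

Sorting the result with `timeouts.most_common()` shows which class to look at first.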
This was a high-level overview of how to face those challenging issues. It does not contain all the details, but I wanted to share it as an example of how you can divide and conquer in order to make progress.