A software troubleshooting guide — Tips & Tricks

A weird issue appeared, what can we do?

This is how an escalation case usually lands in our hands. As software engineers, we often have to fix new problems that are not documented anywhere. Imagine an escalation that says: “My customer cannot perform remote searches to their geo-distributed applications.” This may sound tricky, challenging, and scary at first, but I’m here to show how I like to work through such scenarios. Do not worry, we will get to what to do for specific network, performance, code, and database issues later on; for now, let’s focus on the initial approach to the example above.

1- Ask questions! Understand their scenario…

When a case like that comes in, I always ask questions, since it speeds up the investigation. Some questions that may arise are: In which specific data center is their application located? Is it only impacting searches? How frequently does this occur? Which application version are they using? Is it old or outdated? What does their architecture look like? Are they using on-premises systems or the cloud? Which error are they seeing? Is it a timeout or another type of exception? As you can see, there are a lot of questions that can be asked in such a situation, and I think that is why some support folks have a hard time following up: it is easy to lose track of the reason behind those questions. That leads me to the second point.

2- Explain what you are trying to achieve

When working on a worldwide team, there is always a cultural barrier, as well as differences in English language skills (I speak Spanish; English is my second language), so I understand that not everyone has the same ease with the language. Always explain what you are after and give context, so you get better feedback from your team and your customers. If you show confidence and explain things in detail, you will ace that interaction!

3- Now you understand the issue… What’s next?

Let’s assume you understand the issue. They have a distributed application made up of 10 worker applications and 3 master nodes, in which the masters perform remote searches against the workers so that the data can be viewed on the masters. The nodes are located in different places in the US, such as Chicago, New York, and Texas. The version is up to date, so they have all the latest patches, and the logs show several timeouts. So, what could be wrong here? You are right! Since the nodes are spread across different locations, the network could be the problem, or maybe a mix of latency and code.

Troubleshooting tips & tricks

We are going to continue with the issue I have just described, which we were able to narrow down to something at the network or application level.

Troubleshooting network-related issues!

Here you must understand how network connections work; knowing TCP and UDP is key. Identifying how your application makes its remote calls is just as important, since not all applications communicate the same way: some use a Pub/Sub architecture, others use RPC calls, and others simply rely on endpoints.

Figure #1 OSI is my friend
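
For the timeout scenario above, one quick check at the transport layer is to time how long a plain TCP connection takes from a master to each worker. The following is only a minimal sketch using the Python standard library: the function names `tcp_connect_ms` and `probe` are made up for this example, and any host names or ports you pass in are your own assumptions about the environment. It measures the TCP handshake only, not the application-level search call.

```python
import socket
import time
from typing import Dict, Optional, Tuple


def tcp_connect_ms(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if it fails."""
    start = time.perf_counter()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return None


def probe(targets: Dict[str, Tuple[str, int]], attempts: int = 5,
          timeout: float = 3.0) -> Dict[str, dict]:
    """Connect to every target several times and summarize the results."""
    results = {}
    for name, (host, port) in targets.items():
        samples = [tcp_connect_ms(host, port, timeout) for _ in range(attempts)]
        ok = [s for s in samples if s is not None]
        results[name] = {
            "failures": attempts - len(ok),
            "min_ms": min(ok) if ok else None,
            "max_ms": max(ok) if ok else None,
        }
    return results
```

A call such as `probe({"worker-chicago": ("worker-chicago.example.internal", 9000)})` (a hypothetical host and port) would return the failure count and the fastest and slowest connect times for that worker. Large gaps between sites, or failures limited to one location, point toward the network rather than the code.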

Always get metrics!

When troubleshooting any type of issue related to network, performance, and so on, metrics are gold! How will you know you are facing a problem if you do not have trustworthy data to compare against? A good way of getting metrics (depending on your issue) is to gather OS metrics such as CPU, disk I/O, RAM, thread dumps, etc. If you do not have the tools to do it, create your own scripts!
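
As an illustration of the "create your own scripts" idea, here is a small sketch that takes a snapshot of a few OS-level numbers using only the Python standard library. The names `collect_snapshot` and `dump_threads` are invented for this example. Note the limits: it reports disk usage rather than disk I/O, it does not report RAM (that usually needs a third-party library such as psutil or an OS-specific tool), and the thread dump covers the Python process running the script, not another application.

```python
import json
import os
import shutil
import sys
import threading
import time
import traceback


def collect_snapshot(path: str = os.sep) -> dict:
    """Collect a small, JSON-serializable snapshot of basic OS metrics."""
    disk = shutil.disk_usage(path)
    snapshot = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": disk.total,
        "disk_used_bytes": disk.used,
        "disk_free_bytes": disk.free,
        "thread_count": threading.active_count(),
    }
    # Load average is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg"] = list(os.getloadavg())
        except OSError:
            pass
    return snapshot


def dump_threads() -> dict:
    """Return the current stack of every thread in this Python process."""
    names = {t.ident: t.name for t in threading.enumerate()}
    dumps = {}
    for ident, frame in sys._current_frames().items():
        label = names.get(ident, str(ident))
        dumps[label] = "".join(traceback.format_stack(frame))
    return dumps


if __name__ == "__main__":
    # One JSON line per run makes it easy to append to a file and compare later.
    print(json.dumps(collect_snapshot()))
```

Running it on a schedule before and after a change gives you a baseline to compare against.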

Troubleshooting performance issues!

As I said before, we always need metrics, before and after any change. To provide good performance metrics, we should also monitor server behavior and resource consumption, for example with Grafana and Prometheus, to get good visibility into what our application consumes. It also helps to use the browser’s network debugger to track how long each request takes.
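
To make the "before and after any change" comparison concrete, the sketch below times a callable several times and summarizes the samples with the median, a nearest-rank 95th percentile, and the maximum. The helper names `time_call`, `summarize`, and `compare` are made up for this example; in practice the callable would wrap whatever request or operation you are measuring.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def time_call(func: Callable[[], object], repeats: int = 20) -> List[float]:
    """Run func repeatedly and return each duration in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Summarize latency samples; p95 uses the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


def compare(before_ms: List[float], after_ms: List[float]) -> Dict[str, float]:
    """Return after minus before for each statistic (negative means faster)."""
    before, after = summarize(before_ms), summarize(after_ms)
    return {key: after[key] - before[key]
            for key in ("median_ms", "p95_ms", "max_ms")}
```

Looking at the 95th percentile and the maximum next to the median matters here, because timeouts tend to hide in the slowest requests while the typical request still looks fine.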

Troubleshooting non-replicable issues in the lab!

These are the hardest ones, since we do not have a local lab to play with and everything must be planned to avoid unnecessary time loss. Here we must understand under which circumstances the issue occurs and build a “history” of all the issues faced with the given application or machine. In my experience, issues that cannot be replicated in our lab are usually due to something specific to the customer’s machine.

  • Understand what the user meant to do
  • Always read logs and isolate classes
  • Check network and communication protocols
  • Do not forget ciphers
  • Always get metrics!
  • Create scripts and more scripts when needed!
  • Do not trust what the customer says and always double-check
  • The most basic thing could lead to incredible headaches, so always check the basics
  • Check every module that plays a role in the communication, from the code to the network and hardware systems
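
The "read logs and isolate classes" tip can be sketched as a small script that counts timeout messages per class, so you can see which component produces most of them. The log format is an assumption made for this example (timestamp, level, class name, " - ", message); the regular expression and the function name `count_by_class` are hypothetical and would need adapting to the real log layout.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed line format (hypothetical):
# 2024-05-01 10:00:00,123 ERROR com.example.search.RemoteSearchClient - Timeout after 30000 ms
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<cls>[\w.$]+) - (?P<msg>.*)$"
)


def count_by_class(lines: Iterable[str],
                   levels: Tuple[str, ...] = ("ERROR", "WARN"),
                   keyword: str = "timeout") -> Counter:
    """Count log lines per class whose message mentions the keyword."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # Skip lines that do not follow the assumed format.
        if match["level"] in levels and keyword.lower() in match["msg"].lower():
            counts[match["cls"]] += 1
    return counts
```

Passing an open log file to `count_by_class` and calling `most_common()` on the result gives a ranked list of the classes to look at first.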

Josué Carvajal

Sr. Security software engineer working in the DevSecOps area. CompTIA Sec+, C|EH