A software troubleshooting guide — Tips & Tricks

I’ve been a software engineer for more than 4 years now. In my last job, I worked in a kind of “software escalation engineer” role: I led the investigation of issues occurring in our customers’ production systems and trained both our team and the support team. My daily bread was weird, hard issues that I needed to resolve in a timely manner when possible, while providing daily updates to management, support, and the customer.

This has helped me become more open-minded and find more creative ways to troubleshoot different types of issues, from network, database, and code problems to weird performance behavior in distributed applications. In this article, I’ll show you some tips and tricks for how I narrow down an issue from beginning to end.

A weird issue appeared, what can we do?

1- Ask questions! Understand their scenario…

2- Explain what you are trying to achieve

3- Now you understand the issue… What’s next?

I hope this preliminary analysis was clear enough and that you got the point about the importance of understanding the issue. This matters because we are now going to move to the more technical part: how do we troubleshoot based on the evidence that we have already gathered?

Troubleshooting tips & tricks

Troubleshooting network-related issues!

I also recommend getting familiar with the OSI model. OSI is our friend when troubleshooting issues like this. In our example, the issue might be related to the Network or Transport layer, as well as to the Application layer (since the application was also receiving and processing information).

Figure #1 OSI is my friend

At this point, we have identified 3 layers where the issue in our example might live: Network, Transport, and Application. So what’s next? Well, let’s identify a feasible test for each of them, because we need to prove or discard each hypothesis before we can say our approach is good enough. This leads me to another tip.

Always get metrics!

But how will we do it for the issue that we are troubleshooting? Well, since it is a distributed application, all the requests go over the network, so we can create scripts that gather metrics from ICMP traces (pings), saving each result with a timestamp and parsing the duration of each ping. In this case, the script should be placed on a search master and ping each of the worker machines, so we can confirm or discard an issue in the network layer.
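A minimal sketch of such a ping script could look like the one below. It is only an illustration under a few assumptions: the system `ping` command is available and accepts the Linux/macOS `-c 1` flag, its output contains a `time=… ms` field, and the function names and worker host names are hypothetical.

```python
import csv
import re
import subprocess
import sys
import time
from datetime import datetime, timezone

# Matches the "time=12.3 ms" (or "time<1 ms") field printed by common ping tools.
_RTT_RE = re.compile(r"time[=<]\s*(\d+(?:\.\d+)?)\s*ms")


def parse_rtt_ms(ping_output):
    """Return the round-trip time in ms found in ping output, or None."""
    match = _RTT_RE.search(ping_output)
    return float(match.group(1)) if match else None


def ping_once(host, timeout_s=5):
    """Send one ICMP echo request; return the RTT in ms, or None on failure."""
    try:
        # Assumption: "-c 1" (send one packet) is the Linux/macOS flag.
        # Windows uses "-n 1" instead.
        result = subprocess.run(
            ["ping", "-c", "1", host],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return parse_rtt_ms(result.stdout)


def record_pings(hosts, out, rounds=1, interval_s=1.0):
    """Write one CSV row (UTC timestamp, host, rtt_ms) per host per round."""
    writer = csv.writer(out)
    for round_index in range(rounds):
        for host in hosts:
            rtt = ping_once(host)
            stamp = datetime.now(timezone.utc).isoformat()
            # An empty rtt column marks a lost or failed ping.
            writer.writerow([stamp, host, "" if rtt is None else rtt])
        if round_index < rounds - 1:
            time.sleep(interval_s)


# Example with hypothetical worker names, run from the search master:
# record_pings(["worker-1", "worker-2"], sys.stdout, rounds=60)
```

Keeping the failed pings in the output (as an empty duration) matters here, because lost packets are exactly the evidence we are looking for.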

To discard an application issue (the webserver), we can follow the same approach, but this time send a GET request to an endpoint so we can identify whether the webserver is dying or not. If there is a gap between the two sets of metrics, the issue is likely in the transport layer.
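The GET check can be sketched in the same spirit. This is only an illustration using Python's standard library. The `probe` name and the example URL are hypothetical, and bypassing the environment's proxy settings is a choice I made for this sketch so that the webserver itself is measured.

```python
import http.client
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

# Ignore proxy settings from the environment so the probe hits the server directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout_s=5.0):
    """GET the URL once; return (status_code_or_None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with _OPENER.open(url, timeout=timeout_s) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        status = err.code
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # No usable answer: connection refused, timeout, DNS failure, etc.
        status = None
    return status, time.monotonic() - start


def probe_line(url):
    """Format one probe result as 'timestamp url status elapsed_s'."""
    status, elapsed = probe(url)
    stamp = datetime.now(timezone.utc).isoformat()
    return f"{stamp} {url} {status} {elapsed:.3f}"


# Example with a hypothetical endpoint:
# print(probe_line("http://worker-1:8080/health"))
```

Running this on the same schedule as the pings gives two timelines with comparable timestamps, which makes it easier to see whether the webserver stopped answering while the host was still reachable.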

Troubleshooting performance issues!

One common mistake I have noticed is that people gather performance metrics for the happy path but not for the worst-case scenario. For example: if the application is fully loaded, how are those numbers affected? Can we improve how it behaves under load? This is really important.
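To make the happy path versus loaded comparison concrete, here is a small sketch that times the same operation at different concurrency levels. The `measure` helper is hypothetical, and `operation` stands in for whatever call you want to stress (for example, a request to your application). It is not a replacement for a real load-testing tool.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(operation):
    """Run the operation once and return how long it took, in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def measure(operation, total_calls, concurrency):
    """Run the operation total_calls times using `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = sorted(
            pool.map(lambda _: timed_call(operation), range(total_calls))
        )
    # Nearest-rank 95th percentile, computed with integer math.
    p95_index = (95 * len(durations) + 99) // 100 - 1
    return {
        "calls": len(durations),
        "median_s": statistics.median(durations),
        "p95_s": durations[p95_index],
        "max_s": durations[-1],
    }


# Example: compare one caller at a time against 50 concurrent callers.
# happy = measure(my_operation, total_calls=200, concurrency=1)
# loaded = measure(my_operation, total_calls=200, concurrency=50)
```

Comparing the median and the 95th percentile between the two runs shows how much the numbers degrade once the application is busy, which the happy path alone will never tell you.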

Troubleshooting non-replicable issues in the lab!

So it is important to understand whether there was any crash before, or whether they added new software that could affect their system. I remember a time when a customer deleted data while moving an NFS and wanted us to fix that. Well… sometimes it is not possible.

Other tips that I want to share with you are:

  • Understand what the user meant to do
  • Always read logs and isolate classes
  • Check network and communication protocols
  • Do not forget ciphers
  • Always get metrics!
  • Create scripts and more scripts when needed!
  • Do not trust what the customer says and always double-check
  • The most basic thing could lead to incredible headaches, so always check the basics
  • Check every module that plays a role in the communication, from the code to the network and hardware systems
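As an illustration of the "read logs and isolate classes" tip, the sketch below counts warning and error lines per class so the noisiest one stands out. The log layout (`date time LEVEL source - message`) and the class names in the example are assumptions; adjust the pattern to whatever your application actually writes.

```python
import re
from collections import Counter

# Assumed line layout: "<date> <time> <LEVEL> <class or logger name> - <message>"
_LINE_RE = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) +(?P<source>[\w.$]+) +- ")


def count_by_source(lines, levels=("ERROR", "WARN")):
    """Count log lines at the given levels, grouped by class or logger name."""
    counts = Counter()
    for line in lines:
        match = _LINE_RE.match(line)
        if match and match.group("level") in levels:
            counts[match.group("source")] += 1
    return counts


# Example with a hypothetical log file:
# with open("application.log", encoding="utf-8", errors="replace") as log:
#     for source, count in count_by_source(log).most_common(5):
#         print(count, source)
```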

This was a high-level overview of how to face those challenging issues. It does not contain all the details, but I wanted to share it as an example of how you can divide and conquer in order to make progress.

Sr. Security software engineer working in the DevSecOps area. CompTIA Sec+, C|EH

Josué Carvajal