This post is going to have fewer technical examples and be more about my troubleshooting methodology. The concepts I’m going to describe may seem rudimentary to some, intuitive to others, and eye-opening to a few. I’ve watched enough junior engineers wrestle with vague problems that I felt it was worth documenting my approach.
When you’re a junior developer or a computer science student, it’s easy to get used to curated problems — that is, bugs or issues that are scoped for you and may come with some guidance toward a solution. We’ve all gotten assignments like, “Use Dijkstra’s algorithm to…”, “User Jim Bob is having trouble printing, please call him and have him update his print driver”, or “When you get this page in the middle of the night, follow this troubleshooting document to restart x job.”
Don’t get me wrong, these problems can still be frustrating and can take a lot of work to resolve. Dijkstra’s algorithm is not simple, and walking a user through updating a print driver over the phone is something I still have nightmares about. But those problems give you a starting point: which algorithm to use, where to look (the print driver), or a document to follow. I classify those problems differently from what I think of as “vague issues”.
Vague issues are problems that don’t come with any guidance and don’t scope themselves for you. Things like, “This application is acting slow”, “we’re getting consistent reports of an error we can’t seem to reproduce”, or “a lot of users at different locations are complaining they can’t print”. Problems like these have no defined scope, no documented solution, and no easy fix. As a DevOps practitioner I’ve seen these problems with both my development hat and my Ops hat on. The specific tools you use with each are different, but my general approach is the same. I’ll give an example of each.
The steps to my routine are:
- Grasp the severity of the problem
- Pick wide, useful outer bounds for the broken system
- Divide the system into components based on your insight points
- Assess the insight point closest to the middle of the system and decide if the problem is closer to the symptom, or closer to the back end
- Shrink the system based on your decision, picking new components if it makes sense
- Repeat until you are confident you’ve found the problem
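The steps above can be sketched as a binary search over an ordered chain of components. This is only an illustration of the shape of the routine — the component names and the `looks_healthy` check are hypothetical stand-ins for real insight points:

```python
# A minimal sketch of the routine: binary-search an ordered chain of
# components, from the user-facing symptom to the back end.
# All names here are made up for illustration.

def find_faulty_component(components, looks_healthy):
    """Binary-search an ordered list of components.

    `looks_healthy(component)` is an insight check: True means everything
    from the symptom side up to and including this component looks normal,
    so the problem must lie further toward the back end.
    """
    lo, hi = 0, len(components) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_healthy(components[mid]):
            lo = mid + 1   # problem is closer to the back end
        else:
            hi = mid       # problem is at or before this insight point
    return components[lo]

chain = ["browser", "load balancer", "web server", "API", "database", "disks"]
# Pretend everything up to the API checks out, so the database is the culprit.
healthy = {"browser", "load balancer", "web server", "API"}
print(find_faulty_component(chain, lambda c: c in healthy))  # → database
```

In practice each `looks_healthy` call is a log read, a metric query, or an active test, and the "list" is the system you've bounded — but the halving motion is the same.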
Let’s dive in!
Grasp the severity of the problem
All troubleshooting and debugging is done with a mix of qualitative and quantitative data. By that I mean every time you solve a problem you are working with a mix of numbers (“The API is generating 50% 500 errors”, “90% of print jobs are never printing”) and feelings (“Users are furious, support is overwhelmed!”, “The app is so much slower than it was earlier today I can hardly use it!”). In most cases qualitative data is much easier to get on the fly.
Because you’ll often be working with qualitative data, you’ll want to take the temperature of the people giving it to you before you dive in. A small problem can be presented as a world-ending event if an important customer calls in. Ask questions that push users to pull quantitative data out of the anecdotes they’re telling you, like:
- How many users have called in?
- How long ago did this start?
- Are the users all from one location or several?
And some questions that help you understand the emotional state of the person answering them:
- How serious do you feel this problem is?
- How frustrated are the users?
This is a soft-skill interaction rather than a technical one, but it will help you understand how much time the person reporting the problem has spent working on it, and how calmly they are relaying what they’ve learned.
This is also a good time to decide if the problem is a “drop everything now” type of issue, or an “I’ll put this on my todo list for next week” one.
Pick wide, useful outer bounds for the broken system
This is your chance to brainstorm everything that could be related to the problem. Here are a few examples I’ve worked through recently.
A page of our application is reported as being slow
The outer bounds for the system run from the user’s web browser (they might be using an old or poorly performing machine) all the way back to the disk drives on the database server the page connects to.
Users at our corporate office couldn’t get to the internet
The outer bounds are the users themselves (I can’t rule out that a user is typing in a wrong web address), all the way to our ISP hand-off (I trust that if the ISP had a problem I would’ve seen an email).
A few general rules for picking your outer bounds when you’re having a problem:
- If you can’t confidently rule out that a component is contributing to the problem then include it in the system
- Check your confidence level on every component. Classify your knowledge into “I thinks” and “I knows”
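Those two rules can be sketched as a simple filter: a component is ruled out of the system only when your knowledge of it is an “I know”, never an “I think”. The components and verdicts below are made up for illustration:

```python
# A sketch of the inclusion rule: keep a component in scope unless you
# *know* it is fine. The components and verdicts here are hypothetical.
components = {
    "user's browser": "I think it's fine",   # anecdote only -> stays in scope
    "load balancer":  "I know it's fine",    # verified via metrics -> ruled out
    "web server":     "I think it's fine",
    "database":       "I think it's fine",
}

# Only "I know" verdicts shrink the system; "I think" keeps it in scope.
in_scope = [c for c, verdict in components.items()
            if not verdict.startswith("I know")]
print(in_scope)
```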
Divide the system into components based on your insight points
Any technology system is made up of dozens, possibly hundreds of components. You could think about a bug in terms of bits and bytes, but it’s not helpful. Instead, divide the system at your insight points: the places where you can gather logs or metrics, or inject an active test.
For our slow web page example, the components are:
- The load balancer that routes traffic to the web server — I can see load balancer metrics
- The SSL offloader that decrypts and re-encrypts the traffic for inspection — this is in nginx and we can see logs for timeouts
- The IPS/IDS device that does layer 7 inspection on the app — I can see if a rule suddenly started going off
- A downstream API that handles a synchronous call — I can see request completion times in the logs
- The disk drives the webserver writes logs to — I can see read/write latencies
- The database the application queries — I can have a DBA pull query plans and check for row locks
- The disk drives the database uses to store data — I can see read/write latencies
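Several of these insight points boil down to comparing a latency against a baseline. A tiny helper like the following works for any probe you can wrap in a callable — the names here are hypothetical, and the stand-in “request” is just a sleep:

```python
import time

def measure_latency_ms(call, samples=5):
    """Time a callable a few times and return the median latency in ms.

    `call` stands in for whatever probe fits the insight point: an HTTP
    request through the load balancer, a query against the database, etc.
    The median resists the odd outlier sample.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        timings.append((time.perf_counter() - start) * 1000)
    return sorted(timings)[len(timings) // 2]

# Example with a stand-in "request" that just sleeps ~20 ms.
latency = measure_latency_ms(lambda: time.sleep(0.02))
print(f"median latency: {latency:.1f} ms")
```

Run the same probe against a known-healthy baseline first; a number on its own tells you very little.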
For users at our corporate office not being able to access the internet, the insight points are:
- The DHCP server that hands out IP addresses to clients — I can inject an active test by renewing my laptop’s DHCP address
- The DNS server that resolves hostnames to IPs for our office — I can do an nslookup for an active test, or look at DNS caches
- The core switches that route traffic to our ISP — I can look at switch logs or traffic statistics on a port
- The firewall we use to allow traffic outbound — I can look at firewall CPU or logs
- Our ISP who handles the traffic outside of our building — I can use a vendor portal to check traffic stats, or call them to find out if there is a regional problem
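The first two of those active tests can be approximated from any machine on the network. These are rough stand-ins, not the real checks — at a real desk you’d renew the lease with `ipconfig /renew` or `dhclient` and resolve names with `nslookup`:

```python
import socket

def dns_resolves(hostname):
    """Active test for the DNS insight point: can this host be resolved?"""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False

def has_ip_address():
    """Rough stand-in for the DHCP check: does this machine have a
    non-loopback address it would use to reach out?"""
    try:
        # Connecting a UDP socket sends no packets; it just asks the OS
        # which local address it would pick to reach the target.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("192.0.2.1", 53))  # TEST-NET-1 address, never reached
            return not s.getsockname()[0].startswith("127.")
    except OSError:
        return False
```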
Assess the insight point closest to the middle of the system and decide if the problem is closer to the symptom, or closer to the back end
The idea here is that you are doing a binary search for the source of the problem. Looking in the middle of the system tells you which direction to move in.
Keep in mind the hardest problems to diagnose are the ones without a single root cause: problems with two or three small contributors. While you’re troubleshooting keep an open mind, and don’t dismiss a contributing factor because it isn’t the only source.
For our slow web page example
In this instance I would start on the web server and look at request counts and latencies. If I see a smaller number of requests than usual, the problem is likely before the traffic hits my app server: it could be an inbound IPS/IDS, the SSL offload, the load balancer, or even the internet connection into my data center.
If I see normal numbers of requests with high latency, I’ll move backwards towards the disks where the logs are stored, or the database itself.
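That decision rule can be written down directly. The ratios below compare current metrics to a healthy baseline, and the 0.7 and 1.5 thresholds are made-up illustrations, not recommendations:

```python
def next_direction(request_ratio, latency_ratio):
    """Decide which way to move from the web server insight point.

    request_ratio: current request count / baseline request count
    latency_ratio: current latency / baseline latency
    Thresholds are illustrative only.
    """
    if request_ratio < 0.7:
        return "toward the client"    # traffic is being lost before the app
    if latency_ratio > 1.5:
        return "toward the back end"  # requests arrive but complete slowly
    return "stay and gather more data"

print(next_direction(request_ratio=1.0, latency_ratio=3.0))  # → toward the back end
```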
For users at our corporate office not being able to access the internet
I start with a user’s machine and see if they are getting a DHCP address. If they are, can they resolve DNS entries? Can they ping their next hop on the way to the internet? If they can, the problem is closer to the internet. If they can’t, the problem is closer to the user’s laptop, like a DHCP server or a DNS server being down.
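The same walk from the laptop outward can be expressed as a short chain of checks — each failing check points at a different side of the middle. This is a sketch of the reasoning, not a diagnostic tool:

```python
def locate_network_problem(has_dhcp_lease, dns_works, can_reach_gateway):
    """Walk the checks from the user's laptop outward and report which
    side of the middle the problem is on. Purely illustrative."""
    if not has_dhcp_lease:
        return "close to the user: check the DHCP server"
    if not dns_works:
        return "close to the user: check the DNS server"
    if not can_reach_gateway:
        return "in the middle: check switching and routing"
    return "closer to the internet: check the firewall or the ISP"

# Laptop has a lease and DNS, and can reach its gateway -> look outward.
print(locate_network_problem(True, True, True))
```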
Shrink the system based on your decision, picking new components if it makes sense
At this point in either scenario you’ve shrunk your problem domain pretty well, which is great! You have fewer things to look at, and you’re probably moving in the right direction.
Important things to remember at this stage are:
- Don’t shrink your problem domain too quickly. Use what you’ve observed to pull in a little closer without skipping components you haven’t examined yet
- Reference what you’ve seen, but watch for your own bias. When you’re troubleshooting, it’s easy to blame something you don’t understand. Always question assumptions about pieces you feel you own
- Be expedient, but deliberate. When you’re working on a problem and people are hounding you for an ETA on a fix, it’s easy to get agitated. Don’t let urgency rush quality troubleshooting. It can make a problem drag out if you get spooked and move too quickly.
Repeat until you are confident you’ve found the problem
Rinse and repeat these steps! You’re iteratively narrowing in on the source of the problem. There is no substitute for steady, consistent, calm troubleshooting. Wild guesses based on too little information can muddy the waters.
Troubleshooting and debugging are hard. Most engineers, whether software engineers or IT engineers, would rather create something from scratch. But there is a lot of value in being able to take a system you aren’t familiar with, break apart its components, and start digging into its core functions and problems.