It was 5:17pm today, just as I was wrapping up work for the day, when my manager pinged me with the following chat message:
<manager>: Hi Jeremy - we have a <other team> ticket - escalated to <leader>, <leader>, etc. <principal> is on trying to advise as well. Are you available this evening if needed for diagnostics? <coworker> is on the call now
No obligation; just checking in to see what my availability is. Quickly thinking it over – I didn’t have any plans tonight, nothing in particular on my agenda. Why not? If I can help then someone else on the team won’t have to, and I don’t have anything better to do tonight.
<Jeremy>: yes
<Jeremy>: i'm free and available all evening
I synced up with my coworker and then joined the bridge line where the front line tech team was troubleshooting.
Last week I was chatting with a few software engineers who I work with, and I remember sharing my opinion that the most interesting problems are not obvious ones like crashes. In the grand scheme of things, crashes (usually generating a core dump) are rather easy to debug. The truly sinister problems are more like vague brown-outs. They never trigger your health monitoring or alarming. All your synthetic user transactions work. There are no errors in the logs. But one particular function or module of your application that usually completes in a few seconds suddenly starts taking 10 minutes to finish. This might be caused by, perhaps, a single SQL query that suddenly and inexplicably starts taking longer to complete. Whenever you run the SQL yourself, it completes in seconds. Yet you can see the pile-up of database connections, and unrelated parts of the application start experiencing delays acquiring new connections themselves… and nothing quite fails all the way, yet something isn’t working for your end users. An ominous backlog starts building up in some work queue which threatens much bigger problems if it’s not dealt with soon. You see the freight train coming straight towards you… torturously slow and unavoidable.