A recent online experience with a large department in my state government left me scratching my head and wondering who was watching what.
The relevant page was easy to find and, while I had to click through two more pages to get to the SaaS application, I was able to get through the first four pages without any problems. But the next page presented me with a “something went wrong” notice.
So I tried it again. Same message. I tried it in a different browser. Same message. I went away for a while, refreshed my cache and tried it again. Same message. I tried it on mobile. Same message.
A look at the page’s HTML suggested a back-end problem (it showed a 500, which is “Internal Server Error”), so I went looking for a Status page for this SaaS application. Nothing.
I looked for a mechanism to report problems with the application and found a link for reporting Customer Service items, but the URL included a domain name that wasn’t registered.
I found an IT Helpdesk phone number for this department and called it, but was met with a few seconds of dead air followed by a very short message that they were “currently experiencing system problems” before being disconnected.
I finally reached out to a C-level office in their IT area and had a conversation with a very patient person who at first blamed the user (“you’re on the wrong page”), but had enough patience with me to follow along, validate that I was in fact clicking the right buttons and going to the right pages, and encounter the same error that I did. She was still a little suspicious, but listened to my two other experiences (unresponsive helpdesk and DNS error), took notes and said she’d look into them.
We’ve all heard of CEOs who randomly (and anonymously) call their company’s customer service number and work a problem through the system to “walk a mile” in their customers’ shoes. It’s a good practice.
But there’s so much more . . .
- Be curious. Assuming this SaaS application is producing metrics [1], did anyone notice that no new business was flowing through the system yesterday?
- Watch the logs. The message I encountered indicated that some application log somewhere was collecting errors during my experience. There are many, many tools out there that can watch those logs and raise alerts that something is wrong. [2]
- Experience the flow. Set up a synthetic transaction monitor that clicks through the SaaS application and raises an alert when something unintended occurs. [3]
- Monitor the domains. Configure automated checks for every managed domain and subdomain, and have them raise an alert when something unintended occurs. [4]
- Test the calls. There may not be many ways for your robot to dial a number and test a call center or menu tree, but options do exist. At a minimum, two other things should have alerted this department that something was wrong:
  - Be curious (Why is service desk volume much lower today?)
  - Watch the logs (If the “system problems” situation isn’t throwing errors in the logs, configure the system to do so.)
- Build a Status page. There are automated services that will do this [5], but even a hand-coded HTML page at a separate domain on a separate platform will do the job. “The system is up”, “The system is down” and contact information: that’s all that most users want to know. (Knowing where to place links to this page is also important, so that users can find it.)
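To make the “monitor the domains” idea concrete, here is a minimal sketch in Python using only the standard library. It is not what this department runs (I have no idea what they run), and the names `check_url`, `resolves` and `fetch_status` are my own inventions for illustration. It checks that a hostname still resolves in DNS and that the page answers with something other than a 4xx or 5xx status.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    url: str
    ok: bool
    detail: str


def resolves(hostname: str) -> bool:
    """Return True if DNS returns at least one address for the hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, including 4xx and 5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we want to report.
        return err.code


def check_url(
    url: str,
    resolver: Callable[[str], bool] = resolves,
    fetcher: Callable[[str], int] = fetch_status,
) -> CheckResult:
    """Check DNS first, then the HTTP response, and describe any failure."""
    hostname = urllib.parse.urlsplit(url).hostname
    if not hostname or not resolver(hostname):
        return CheckResult(url, False, "DNS lookup failed")
    try:
        status = fetcher(url)
    except OSError as err:
        # Covers most network failures (URLError, timeouts, refused connections).
        return CheckResult(url, False, f"request failed: {err}")
    if status >= 400:
        return CheckResult(url, False, f"HTTP {status}")
    return CheckResult(url, True, f"HTTP {status}")
```

Run from a scheduler against a list of managed URLs, a loop like `for url in urls: result = check_url(url)` followed by an alert on any `result.ok` that is false would have caught both the unregistered domain and the 500 page. The resolver and fetcher are passed in as parameters only so the logic can be exercised without touching the network.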
Armchair system administration and system management are admittedly easy and gutless, and no doubt there is much more going on with this department than meets my external eye.
I would start with monitoring and alerting on log files, while assigning an owner to research and configure synthetic transaction monitoring, quickly pivoting to the automated monitoring of DNS entries and URLs. Building a Status page might need to be coordinated with other state-level departments, but doesn’t have to be terribly complicated. It might be possible to find an appropriate synthetic transaction monitoring tool to supply input to the Status page.
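As a rough sketch of that first step and the last one together, the Python below scans a window of log lines, raises an alert if the error rate is too high or if there is no activity at all (the “be curious” case), and feeds the result into a bare-bones Status page. Everything here is an assumption on my part: the `ERROR` marker, the 5% threshold, and the function names `evaluate_window` and `render_status_page` are hypothetical, and a real log format would need its own parsing.

```python
import html
from typing import Iterable, List


def evaluate_window(
    lines: Iterable[str],
    marker: str = "ERROR",          # assumed log-level token; adjust to the real format
    max_error_rate: float = 0.05,   # assumed threshold, not a recommendation
) -> List[str]:
    """Return a list of alert messages for one time window of log lines."""
    total = 0
    errors = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        if marker in line:
            errors += 1
    if total == 0:
        # Silence is a signal too: no business flowing through the system.
        return ["no log activity in this window"]
    rate = errors / total
    if rate > max_error_rate:
        return [f"error rate {rate:.0%} exceeds {max_error_rate:.0%}"]
    return []


def render_status_page(alerts: List[str], contact: str) -> str:
    """Render a minimal static Status page: up or down, plus contact details."""
    state = "The system is down" if alerts else "The system is up"
    return (
        "<!DOCTYPE html>\n"
        "<html><head><title>Status</title></head><body>\n"
        f"<h1>{state}</h1>\n"
        f"<p>Contact: {html.escape(contact)}</p>\n"
        "</body></html>\n"
    )
```

A scheduled job could call `evaluate_window` on the last few minutes of logs, send any alerts to whoever is on call, and write the output of `render_status_page` to a page hosted on a separate platform. In practice a person would probably confirm an outage before the page flips to “down”, but even this much would tell users more than I was able to find out.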
On the subject of Curiosity, keep watching these pages. I’ve long been interested in how to grow, foster, inculcate, encourage or teach curiosity, with varying levels of success. At a very simple level, what are the differences between the one person who asks the question “why” and another who walks right past? More to come.
I wish this department the best in shoring up their infrastructure. [6]
1. It should be delivering key performance indicators, at least.
2. My favorite is Splunk, but there are many others.
3. A good monitor will also shine light into application performance and usability beyond the “something went wrong” level.
4. Check DNS redirects perhaps twice a day and main entrance domains and URLs multiple times an hour, beyond the synthetic transaction monitoring.
5. Possibly even your synthetic transaction monitor.
6. If somehow my contact with them connects back to me, I’d be happy to consult with them. Please reach out.