In honor of SysAdmin Day, we share IT pros' memories of heroic deeds from the data center.
They say love is a battlefield, but as any system administrator knows, so is the data center! Whether it’s vanquishing pesky performance bottlenecks or preventing data loss, application downtime, and service outages, sysadmins are the first line of defense for today’s businesses. And as the sysadmin role continues to evolve with the growth of hyperconvergence, hybrid IT, and cloud computing, among other new technology trends, so too do the threats and challenges these unsung heroes must defeat day in and day out.
To celebrate System Administrator Appreciation Day 2016 and honor sysadmins everywhere for all the times they've come to the rescue to keep everything running smoothly, the folks at SolarWinds asked the THWACK community of 130,000 IT professionals for feedback on their most heroic sysadmin moments. On the following slides, enjoy a selection of their impressive tales of data center heroism along with illustrations of their superhero alter egos.
Happy SysAdmin Day 2016!
“One of my greatest sysadmin moments happened almost a decade ago. I was adding redundancy to our HP RX8600 server by adding two cells for RISC9000. Despite making only a small change, the next day our DBA reported ‘fuzzy checkpoints,’ and write operations were taking a minute, rather than a few seconds.
Our provider's technicians suggested an OS patch, which, as I suspected, resulted in a major failure and a severe outage that left the business down for days. We eventually found a backup that could be read to successfully restore the system, but the slowness issue persisted.
I remembered that the rather large cell boards we had added slid in through the back of the chassis, dangerously close to the SAN fiber cables. After I fiddled with the cables on a hunch, our DBA burst into the data center exclaiming that performance had bounced back. Now I knew it all boiled down to a problem with the fiber, and I replaced the cable immediately. On close inspection, it had a barely visible crush spot: not severe enough to cause errors to be reported, but enough to force packet retransmissions.
At the end of the day, I had averted a potentially disastrous payroll system crash, solved the database performance issue, and everyone got paid on time!”
“I'll never forget the day I put an end to a ransomware siege. The only information I had to go on was that 30,000 critical files were encrypted, each with its own ransom note. We had no idea what the source of the infection was or if there was anything else happening that hadn't yet been reported. Cue shutdown of all servers across the network.
After spending (read: wasting) lots of time looking at the encrypted files and combing the web to find out whether a decryption tool was available, I realized that the key to this problem lay in the ransom notes: every one of them had been created by the same user account. We now had the source. After a quick deletion of all the encrypted files and the associated ransom notes, followed by a restore of all the original data from the previous night’s backup, we were back in business.”
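The cleanup step described above — sweeping out the encrypted files and ransom notes before restoring from backup — can be sketched as a small script. This is a minimal illustration only: the encrypted-file extension and the ransom-note filename vary by ransomware family, and both names here are hypothetical.

```python
import os

# Hypothetical markers: real ransomware families use their own
# extensions and note filenames; adjust to match what you observe.
ENCRYPTED_EXT = ".locked"
NOTE_NAME = "HOW_TO_DECRYPT.txt"

def sweep(root):
    """Walk the tree, remove encrypted files and their ransom notes,
    and return the paths deleted (restore from backup afterwards)."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(ENCRYPTED_EXT) or name == NOTE_NAME:
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```

In practice you would run something like this only after confirming the infection source is contained, then restore the deleted paths from a known-good backup.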
“We were in the middle of a production migration when our SQL DBA called to say, 'Oops, I ran your ODS scripts against your production database.' We were already hours into the migration and did not have time to do a database restore and still meet our deadline to be up and operational at the start of business hours. But a few 'CTRL + F’s' later, we were scripting the statements to fix the database and make the deadline with enough time to spare for a few hours of sleep before business resumed.”
“We had a client a few years ago who routinely experienced website downtime due to frequent IIS server crashes. Their technical team’s go-to fix was to throw more resources at the system to see if that would help. Needless to say, it didn't. Once I was pulled in, I spent about an hour going over all the data available from the system in our monitoring tool, trying to correlate it with the times when the reported issues were happening. I ultimately found huge spikes in current connections to IIS, using an IIS SolarWinds Server & Application Monitor template we had applied to the system. It turns out the client’s web code would choke when the number of current connections rose above a specific level."
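The correlation step here — matching connection-count spikes against the reported crash windows — can be sketched with a few lines over exported monitoring samples. The threshold value and the sample format are hypothetical; a real investigation would use whatever level the web code was observed to choke at.

```python
# Hypothetical ceiling: the connection count above which the client's
# web code was observed to choke. Tune to match real observations.
THRESHOLD = 500

def spike_times(samples):
    """samples: list of (timestamp, current_connections) tuples exported
    from a monitoring tool. Returns the timestamps of every sample where
    current connections exceeded THRESHOLD, for comparison against the
    times the crashes were reported."""
    return [ts for ts, conns in samples if conns > THRESHOLD]
```

Lining up the returned timestamps against the helpdesk's outage reports is what confirms (or rules out) connection count as the trigger.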
“In 2009, I was working with the Marine Corps in Iraq. One night, the generators went out and when everything kicked back on, we discovered our primary classified SAN had been totally reset. Even NetApp techs had no idea how to recover the data. Imagine losing information that could possibly mean life or death, with no hope of any easy fix!
After about eight hours, an uncountable number of Rip It energy drinks and threatening the SAN with a large wrench (I kept it for that express purpose), I was able to get the SAN to begin reading disk data and rebuild itself with no reported data loss. I can’t for the life of me remember how I did it, but I received a Navy Achievement Medal and was never without Rip Its again.”