Site Outage continues
The site went down for an extended period. The root cause remains unknown, though it doesn't look like it was an attack. It has not been fixed, and it is not over. A string of techs at the web host have made minimal progress toward a diagnosis. I need to move us to another server & upgrade vBulletin, a long & painful process, the first step of which has been severely crippled by the current server's condition.
On November 19th, URC suddenly and without warning began to experience intermittent outages -- periods where pages would not load. This rapidly worsened to the point where the server was at a complete standstill.
Traffic was normal and there was no evidence of any sort of malicious activity, but the server's CPU load was unimaginably high. I worked for hours on the backend repeatedly restarting all services I could to allow brief windows of accessibility to diagnostic tools.
By late afternoon (still the 19th) I felt I had ruled out anything I had control over, and I sought help from level 1 technical support at our web hosting provider (the company with the server we lease). After a long chat the tech concluded that we were not under attack and nothing in the hardware had failed, but we were seeing a lot of search engine crawler traffic that could be directly tied to processes causing load spikes.
I reduced searchbot traffic significantly, a process which is fairly quick to initiate, but slow to bear results, and nothing changed with our server load.
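For reference, this kind of throttling happens mostly in robots.txt. The bot names and delay value below are illustrative, not our actual rules, and note that Googlebot ignores Crawl-delay entirely; its rate has to be turned down separately through Google's webmaster tools, which is part of why the change is quick to make but slow to take effect:

```
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Crawl-delay: 10
```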
What followed was six days of the most exasperating interactions I have ever had with any sort of technical or customer support. Three different techs in a row, on a seemingly endless loop impenetrable to logic, reasoning, and facts, told me the following:
The load is the fault of my forum software
The load is the fault of my forum database
The load is the fault of search engine crawlers
Actually nothing is wrong, load is "within normal limits"
During this time, here were the actual facts of the matter:
The forum was not merely disabled; I had taken it completely offline, off the Internet entirely, inaccessible and untouchable to anyone who didn't have command-line access to the server.
The forum database was sitting completely idle, untouched for days, save rare and very brief tests where it was re-enabled for minutes.
Search engine traffic, which multiple logs proved to be normal right through the 19th, had subsequently been decimated through my efforts, down to an ambient trickle (averages below one hit every three seconds).
With the forum non-existent, the database idle, and traffic decimated, server CPU load was now hovering between 10x & 25x normal. By "normal" here I mean where it had been with the forum & database running at full production load with completely unrestricted search crawler traffic.
At several points, techs refused to provide further service on the issue with CPU load at "only" 10x to 12x normal, because the server we're on is so overpowered for the application that 10x to 12x normal is within what they consider to be "acceptable limits."
With near-zero *nix knowledge I've struggled to navigate around the command line with Google in another window to discover and use tools that all of the techs who are paid to do nothing but deal with these exact servers all day seemed to know nothing about. I distilled the situation down to the very most basic and irrefutable facts that the server itself could directly report, yet I still struggled to bypass the walls of "it's your forum software" and "everything looks normal." Meanwhile, I had begun taking a fresh local backup of the entire site to my desktop. Due to the ongoing problems with the server, this download process would be repeatedly interrupted, timed out, and errored out.
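The "basic and irrefutable facts the server can directly report" came from a handful of standard tools along these lines (a minimal triage sketch, assuming a typical Linux server; none of these commands are specific to our host):

```shell
# Load averages for the last 1, 5, and 15 minutes -- the headline
# number the techs and I kept arguing about.
uptime

# The processes using the most CPU right now, highest first.
ps aux --sort=-%cpu | head -n 10

# How many processes exist per command name, to spot a service
# that is spawning copies of itself.
ps -eo comm= | sort | uniq -c | sort -rn | head -n 5
```

With the forum and database offline, anything near the top of that second listing is, by elimination, the host's problem and not ours.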
November 25th update: The latest tech working on the problem claimed to have restarted a misbehaving service, and logs show that when he did this, load returned to normal. It took a full day of turnaround for me to hear back exactly what he restarted, and I immediately requested that this auxiliary service, which is absolutely non-essential and has nothing to do with the forums or database, be permanently terminated.
November 26th update: Response on the ticket remains very slow, and there has been more finger-pointing at our forum as the source of all evil after I momentarily re-enabled it and it didn't work, because the underlying problem has not been solved. Hours after that brief test (followed by restarting/resetting all services I have access to), server load continued to spike to extremely unhealthy and irrefutably abnormal levels (more than 50x normal), with the forums nowhere to be found. I am now waiting for a tech to remove the aforementioned misbehaving service, which continues to be active and a top contributor to moment-to-moment CPU load, even though it's supposed to activate only once/day for a maintenance task. The great news is that at long last, my latest local backup of the site is complete.
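For the curious, a full-site backup of this kind boils down to two steps: dump the database, then archive the files. This is a sketch only; the paths and credential variables below are placeholders, not our real configuration:

```shell
#!/bin/sh
# Placeholder locations -- substitute your own web root and target dir.
WEB_ROOT="${WEB_ROOT:-./public_html}"
BACKUP_DIR="${BACKUP_DIR:-./backup}"
mkdir -p "$BACKUP_DIR" "$WEB_ROOT"

# 1. Dump the forum database first so files and data stay in sync.
#    Only attempted if credentials are actually supplied; mysqldump
#    will prompt for the password.
if [ -n "${DB_USER:-}" ] && [ -n "${DB_NAME:-}" ]; then
  mysqldump --single-transaction -u "$DB_USER" -p "$DB_NAME" \
    > "$BACKUP_DIR/forum_db.sql"
fi

# 2. Archive the web root. One compressed file survives a flaky,
#    repeatedly interrupted download far better than thousands of
#    small files transferred individually.
tar -czf "$BACKUP_DIR/site_files.tar.gz" "$WEB_ROOT"
echo "backup written to $BACKUP_DIR"
```

The catch, as described above, is that on a server pinned at 10x-50x normal load, even these two steps can take days instead of hours.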
As you can imagine, words alone cannot express the level of aggravation & disgust I have experienced, so I will keep the depths of my rage on this side of the keyboard.
Here is what is going to happen:
I need to get us the hell off this server. The first step in this process is backing up the last snapshot of the filesystem & database, which would normally happen overnight, but as explained above, ended up taking almost a week instead.
When we move, there will be more downtime. It will be painful, again.
Because we've had so much downtime and lost so much traffic & so many users, now is the best time we've ever had for an overdue forum software upgrade. This will bring still more pain, and we may lose some data and some functionality in the process, but we will also gain a ton of new functionality, including dramatically improved usability on mobile devices.
My sincerest apologies to everyone for all of this inconvenience and mess. I've been doing my best to fix this, but my best is not nearly enough, and clearly the expert technicians' best has been still worse for most of this life-shortening ordeal.