Outage - Database Stuck
... I can't get 1 day without issues apparently...
While I was asleep, it looks like the database got stuck in some sort of optimize process, with everything else stuck waiting on it.
Restarting the database server forced that process to clear and cleaned things up.
I suspect I got overzealous after things started working great and set the background worker count too high (these workers run automatically in the background doing things like updating contacts, downloading posts, etc).
I tuned that setting down, and increased the number of connections the database allows.
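For the curious, here is a rough sketch of the arithmetic behind that change. Every number and name in it is made up for illustration (these are not the actual settings on this server), and it assumes the worst case where each web process and each background worker holds one database connection at the same time.

```python
def connections_needed(web_servers: int, web_procs_per_server: int, workers: int) -> int:
    """Worst case: every web process and every background worker holds one connection."""
    return web_servers * web_procs_per_server + workers


def fits(limit: int, needed: int, reserve: int = 5) -> bool:
    """True if the worst case still leaves `reserve` connections free for admin and backups."""
    return needed + reserve <= limit


# Hypothetical "before": too many workers for the database's connection limit.
before = connections_needed(web_servers=2, web_procs_per_server=40, workers=80)
# Hypothetical "after": fewer workers, and a higher connection limit on the database.
after = connections_needed(web_servers=2, web_procs_per_server=40, workers=20)

print(before, fits(150, before))  # 160 False
print(after, fits(200, after))    # 100 True
```

The point is only that the worker count and the connection limit have to be tuned together: raising one without the other is how everything ends up queued behind the database.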
Dear god I feel like a newb... I used to do this stuff professionally, and I can really feel that it's been years.
The biggest performance issue I've had with this server for a while turns out to have been caused by the hypervisor copying the MAC address when I copied the server. It shouldn't have taken me nearly this long to identify the problem!
Aaaargh!
Performance Issues - Tentatively Resolved?
I feel really dumb for not noticing the cause sooner. I mostly attribute it to how rare the problem is and how inconsistently it occurred.
I use a virtual server environment for my servers, and one of my measures to improve reliability and performance was to have two instances of the webserver behind a load balancer. In layman's terms, whenever you connect you're assigned to whichever instance has the fewest connections, and they're otherwise identical (same files, connected to the same database, etc).
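A minimal sketch of that "fewest connections" rule, purely for illustration. The `Backend` class, the server names, and the counts are hypothetical; the real load balancer does this internally and I'm not showing its configuration here.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active: int = 0  # number of connections currently open to this webserver


def pick_least_connections(backends):
    """Return the backend with the fewest active connections (first one wins a tie)."""
    return min(backends, key=lambda b: b.active)


web1 = Backend("web1", active=3)
web2 = Backend("web2", active=1)

chosen = pick_least_connections([web1, web2])
chosen.active += 1  # the new visitor now counts against that backend
print(chosen.name)  # web2
```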
Well... turns out when I duplicated the server initially, the software decided not to change the MAC address. In layman's terms: the IP address of each server is different, but the network uses the MAC address to map IP addresses to boxes. If two boxes have the same MAC address, traffic is going to switch sporadically and randomly between them, and it often arrives at the box with the wrong IP address, which means half of the traffic is always getting rejected.
So your connection to the load balancer was fine, but it was struggling to connect to the webservers and the webservers were struggling to connect to the database.
Once I changed that, the server immediately became quite zippy!
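In hindsight, a check like the one below would have caught this much sooner. It's only a sketch: it assumes you already have a list of (IP address, MAC address) pairs from somewhere, such as a neighbour table or the hypervisor's VM list, and the addresses shown are invented examples.

```python
from collections import defaultdict


def find_duplicate_macs(neighbours):
    """Map each MAC address claimed by more than one IP to the sorted list of those IPs."""
    ips_by_mac = defaultdict(set)
    for ip, mac in neighbours:
        # MAC addresses are case-insensitive, so normalise before comparing.
        ips_by_mac[mac.lower()].add(ip)
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}


# Invented example: two webservers cloned with the same MAC, plus one healthy box.
table = [
    ("192.0.2.11", "52:54:00:AA:BB:01"),
    ("192.0.2.12", "52:54:00:aa:bb:01"),
    ("192.0.2.20", "52:54:00:aa:bb:02"),
]

print(find_duplicate_macs(table))
# {'52:54:00:aa:bb:01': ['192.0.2.11', '192.0.2.12']}
```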
My sincere apologies for the impact.
Outage - Self Resolved - Investigating
The server went down for a few hours today and resolved before I could look at it. It's also been a really bad day for my physical health so I've had very little capacity.
I am looking into why it happened but have no firm answers at this time.
As far as I can tell, the most obvious culprit is the database backup, which was running at that time.
It looks like the database is big enough that backups are no longer simple and I'll need to change my backup method.
The revised backup method is now set up. Tomorrow I'll test that it's working and then attempt a restore (on a second server, so no impact here).
I've also refreshed the virtual network interface, as I learned it may have been dropping packets for some unknown reason. It's working fine now.