Outage / Hardware Failure And Recovery
I'm going to start with my apologies for the 24-hour outage. Thankfully, there should be little impact beyond the downtime itself.
I've been running all of this on a single box, and had plans (last night, in fact) to expand to two boxes with a NAS backend to help protect against failures and improve performance.
Ironically, the morning I planned to make those updates was when I experienced some sort of disk failure on the server. I don't yet know exactly what happened or the extent of the failure (I'll diagnose that tonight, mostly to see whether I need a warranty replacement or whether there was a software cause).
The disk failure presented itself as bad blocks and mild data corruption which prevented multiple services from running.
The good news is that it appears nothing of real importance was corrupted. We lost the majority of one database table, but all it contained was a list of every ActivityPub contact the server could see, and its contents should be reconstructing automatically now (though that may obviously take a while).
It might be worth checking your friends/followers lists to make sure there are no major absences; I don't know whether this affects who you're following or just the basic contact info.
Beyond that, here's what's changed and what's changing, followed by accountability for my mistakes:
* The server now has a NAS backend with disk redundancy, which will protect against drive failures. It also allows resources to be shared between this server and a second one once I have the drive issues figured out.
* Once the original server is fixed/cleared, it will run a load-balanced second copy of the webserver (and eventually a copy of the database) to improve performance and reliability. (The NAS allows the two copies of the server to share the same media files.)
For accountability:
One of the things that made the failure take longer to recover from was that my database backups had failed to run due to a typo.
I could have sworn I had checked it to ensure it was running, but clearly I had not.
I fixed the error and confirmed the backup ran successfully this morning. Database backups run twice daily and are stored in a remote **encrypted** repository (borgbase.com). I will be checking periodically over the next week or two to confirm they keep running.
Additionally, with the NAS now set up and available, I am also taking full snapshots of the database twice a day. This gives me two independent avenues of recovery going forward, which should significantly improve reliability.
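Since the root cause here was a backup job that failed silently, a small independent freshness check is cheap insurance. This is only a sketch under assumptions: the directory, the `*.sql.gz` file pattern, and the ~13-hour threshold (for a twice-daily schedule) are placeholders, not this server's actual configuration.

```shell
#!/bin/sh
# Hypothetical freshness check: with twice-daily dumps, the newest
# backup should never be much older than ~13 hours. Running this from
# a separate cron entry means a typo in the backup job itself can't
# also silence the alert.
check_backup_fresh() {
    dir=$1            # directory the dumps land in (placeholder)
    max_age_min=$2    # alert threshold in minutes

    # count dumps modified more recently than the threshold
    recent=$(find "$dir" -name '*.sql.gz' -mmin -"$max_age_min" | wc -l)
    if [ "$recent" -eq 0 ]; then
        echo "STALE: no backup newer than ${max_age_min} minutes in $dir"
        return 1
    fi
    echo "OK: $recent recent backup(s) in $dir"
}
```

Cron mails any output by default, so the STALE message would have surfaced the broken backup within half a day instead of at restore time.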
Fuzzy Thumbnails
I'm not entirely certain what's causing this, but I do know what kicked it off.
I'm doing some migrations of the media files between boxes, and that seems to have somehow messed up the thumbnails and some cached remote images. However, I've seen no problems with any uploaded files (aside from thumbnail views of those files).
This appears to be resolving itself over time as the server updates contacts and recaches many of these files.
Migrations should be finished in the next day or two and there should be no significant downtime.
Server Crash
Regarding the downtime that happened last night (while I was trying to sleep, which is why it went on for so long):
The short version is that a bunch of stuff clogged up the pipes, hung, and just needed a good ol' fashioned restart to fix. I've made some changes to reduce the chance of that happening again. As always, I'm sorry for the trouble.
The longer version is that the PHP worker processes hung and bogged down the database, which brought the whole thing to a screeching halt.
I've changed the limits on those workers so they should have less impact and hopefully won't do that again (the downside is that they'll be a little slower to federate updates, but most of the time that shouldn't be noticeable).
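For reference, Friendica's worker concurrency is set in `config/local.config.php`. The fragment below is illustrative only; the value shown is an assumption, not this server's actual setting.

```php
<?php
// config/local.config.php (fragment) -- hypothetical values.
// system.worker_queues caps how many background workers run in
// parallel, so a batch of hung workers can't saturate the database.
return [
    'system' => [
        'worker_queues' => 4,  // fewer workers: slower federation, lighter DB load
    ],
];
```

The trade-off is exactly the one described above: a lower cap throttles federation throughput in exchange for a bounded database load.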
I've also taken advantage of the existing downtime to migrate the database to a second machine with more memory. I originally intended to upgrade the memory of the machine it was on, but unfortunately bought the wrong chips; that plan is still pending. Migrating it let me expand the database's memory allocation significantly, which should help performance. The downside is that the second machine is busier, so CPU contention can occasionally hurt performance (likely negligibly, but I'm not fully confident of that).
In the next week or two I plan to do further hardware upgrades, but with the database already migrated, there should be negligible downtime, if any.
Once that's done, I'm hoping to implement some high availability options to further reduce downtimes.
If you're experiencing particularly slow load times on the network page (the default homepage with your main feed), one thing on your end you can tune is how many items it tries to load at one time.
Go to Settings -> Display -> Content/Layout and you can change "Number of items displayed per page".
Especially since an item includes a post and all of its comments, this can make a drastic performance difference (on my personal feed, 40 items sometimes takes >30 seconds to load, but 20 takes <3 seconds).
Server News
in reply to Server News:
Below is the text that was shown on the site during the outage, with the updates, just for accountability:
Hardware Failure - Recovery Attempts In Progress
The primary disk used by the webserver, database, and some of my other servers appears to be failing critically.
I am attempting recovery efforts, but also have to work my day job. Automated backups have been running daily on the database, so the server should be restorable.
I'll update this page with notes when possible (it's on the same box, so it can potentially fail as well). I get off work at 5pm US CT; this is being written at 9:50am.
If you'd like to reach out, shiri [at] bailem.me for email and shiri:beeper.com for Matrix (if you prefer XMPP shiri_beeper.com@aria-net.org should work to use the public bifrost bridge).
Update 7PM: everything is migrated and recovery is underway. The database is attempting to recover on its own and I'm just waiting on it. Once it's done restoring, I will try to re-enable the server. If the database fails at that point, I'll restore it from the backup made early this morning.
Update 8PM: The database is just taking a long time to recover; it didn't help that default settings timed out the recovery and I had to restart it. The database runs on MariaDB, which keeps transaction logs for exactly this situation, so it can replay committed changes to recover data. I expect it to be back up in the next hour or two.
Update 8:30PM: It didn't help that a momentary power outage reset my progress.
Update 9:45PM: Upfront honesty: it looks like the backups were broken and I should have taken a closer look at them. However, the database is only mildly corrupted and I'm going to be able to do a dump and restore. That takes time since it's 20GB, but I feel confident it'll be good after this.
Update 1AM: We're just about at the end of it; I just have to figure out why I'm getting an odd session data error. Unfortunately I have to sleep, and that means calling it for the night.
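A note on the recovery timeout mentioned in the 8PM update: on a systemd-managed install, the likely culprit is the service manager's start timeout killing MariaDB mid-crash-recovery rather than the database itself giving up. That's an assumption about this setup; if it applies, a drop-in override like the following lets a long InnoDB recovery run to completion:

```ini
# /etc/systemd/system/mariadb.service.d/recovery.conf (hypothetical drop-in)
[Service]
# Allow crash recovery to take as long as it needs instead of the
# default start timeout killing and restarting the server mid-recovery.
TimeoutStartSec=infinity
```

After adding a drop-in like this, `systemctl daemon-reload` is needed before restarting the service.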
Server News
in reply to Server News:
Looks like there are bigger issues with the lost APContact table; I'm investigating to see what I can do about that.
*Hopefully* this is self-resolving as it refinds all the users.
Shiri Bailem
in reply to Server News:
@Server News Looks like it's resolved?
It was causing issues searching contacts and pulling up profiles, but I incidentally hit the DB update button in Friendica (usually used for major version updates) and it abruptly worked again, so it might have just been an issue with indices on the table.
Please let me know if you see any issues so I can investigate.