Outage / Hardware Failure And Recovery

Server News

server@foggyminds.com

(News)

I'm going to start with my apologies for the 24-hour outage. Thankfully there should be little impact beyond that.

I've been running all of this on a single box, and had plans (last night in fact) to expand it to two with a NAS backend to help protect it from failures and improve performance.

Ironically, the morning I had planned to make those updates was when the server experienced some sort of disk failure. I don't know exactly what happened or how severe the failure is (I'll diagnose that tonight, mostly to see whether I need a warranty replacement or whether there was a software cause).

The disk failure presented itself as bad blocks and mild data corruption which prevented multiple services from running.

The good news is that it appears nothing of real importance was corrupted. We lost the majority of one database table, but all that table contained was a list of every ActivityPub contact the server could see, and its contents should be reconstructing automatically now (though that may obviously take a while).

It might be worth checking your friends/followers to make sure there are no major absences; I don't know if this impacts who you're following or just the basic contact info.

Beyond that, here's what has changed and what is changing, followed by accountability for my mistakes:

* The server now has a NAS backend with disk redundancy, which will protect against drive failures. It will also help share resources between this server and a second server once I have the drive issues figured out.
* Once the original server is fixed/cleared it will be running a load-balanced second copy of the webserver (and eventually a copy of the database) to improve performance and reliability. (The NAS will allow two copies of the server to share the same media files.)

For accountability:

One of the things that made the failure take longer to recover from was the fact that my backups of the database had failed to run due to a typo.

I could have sworn I had checked it to ensure it was running, but clearly I had not.

I fixed the error and confirmed it ran successfully this morning. Database backups are run twice daily and backed up to a remote **encrypted** backup (borgbase.com). I will be checking periodically over the next week or two to confirm that it continues running.

Additionally, with the NAS now set up and available I am running full system snapshots of the database twice a day as well. This means I should have two avenues for recovery across two different methods going forward, which should significantly increase reliability.
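The "checking periodically" step above is the kind of thing that can be automated. What follows is only a minimal sketch, not what this server actually runs: it assumes the database dumps land as files in a local directory, and the `*.sql.gz` pattern, the 13-hour threshold (a little over half a day, to match twice-daily backups), and the function names are all hypothetical choices made for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup(backup_dir: Path, pattern: str = "*.sql.gz") -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None."""
    candidates = [p for p in backup_dir.glob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def backup_is_fresh(
    backup_dir: Path,
    max_age_hours: float = 13.0,   # assumed: twice-daily backups plus slack
    min_bytes: int = 1,            # assumed: an empty dump counts as a failure
    pattern: str = "*.sql.gz",     # assumed file naming, purely illustrative
    now: Optional[float] = None,
) -> bool:
    """True only if the newest backup exists, is recent, and is non-empty."""
    latest = newest_backup(backup_dir, pattern)
    if latest is None:
        # No backup at all, which is what a silently failing job looks like.
        return False
    info = latest.stat()
    current = time.time() if now is None else now
    age_hours = (current - info.st_mtime) / 3600
    return age_hours <= max_age_hours and info.st_size >= min_bytes
```

Something like this could be run from a scheduler and made to send an alert whenever `backup_is_fresh(Path("/path/to/backups"))` returns `False` (the path is a placeholder). It only checks that a recent, non-empty file exists; it does not prove the dump can actually be restored, which still needs an occasional manual test.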

in reply to Server News • 1 year ago •

Below is the text that was shown on the site during the outage with the updates, just for accountability:

Hardware Failure - Recovery Attempts In Progress

The primary disk used by the webserver, database, and some of my other servers appears to be failing critically.

I am attempting recovery efforts, but also have to work my day job. Automated backups have been running daily on the database so the server should be restorable.

I'll update this page with notes when possible (it's on the same box, so it can potentially fail as well). I get off work at 5pm US Central Time; this is being written at 9:50am.

If you'd like to reach out, shiri [at] bailem.me for email and shiri:beeper.com for Matrix (if you prefer XMPP shiri_beeper.com@aria-net.org should work to use the public bifrost bridge).

Update 7PM: Everything is migrated and recovery is underway. The database is attempting to recover on its own and I'm just waiting on it. Once it's done restoring, I will try to re-enable the server. If the database fails from that point, I'll restore the database from the backup made early this morning.

Update 8PM: The database is just taking a long time to recover. It didn't help that the default settings timed out the recovery and I had to restart it. The database is running on MariaDB, which has the good sense to design with various safety logs, so it's able to backtrack and rerun commands to recover data. I expect it to be back up in the next hour or two.

Update 8:30PM: Didn't help that I had a momentary power outage reset my progress.

Update 9:45PM: Upfront honesty, looks like the backups were broken and I should have taken a closer look at them. However, the database is only mildly corrupted and I'm going to be able to do a dump and restore. This takes time as it's 20GB in size, but I feel confident that it'll be good after this.

Update 1AM: We're just about at the end of it; I just have to figure out why I'm getting an odd session data error. Unfortunately I have to sleep, and that means calling it for the night.

in reply to Server News • 1 year ago •

Looks like there are bigger issues with the lost APContact table. I'm investigating to see what I can do about that.

*Hopefully* this is self-resolving as the server rediscovers all the users.
