Outage / Hardware Failure And Recovery
I'm going to start with my apologies for the 24-hour outage. Thankfully there should be little impact beyond that.
I've been running all of this on a single box, and had plans (last night in fact) to expand it to two with a NAS backend to help protect it from failures and improve performance.
Ironically, the morning I planned to make those updates was when I experienced some sort of disk failure on the server. I don't know exactly what happened or how severe the failure is (diagnosis will be tonight, mostly to see if I need to get a warranty replacement, or if there was some software cause).
The disk failure presented itself as bad blocks and mild data corruption which prevented multiple services from running.
The good news is that it appears nothing of real importance was corrupted. We lost the majority of one database table, but all the table contained was a list of every single ActivityPub contact the server could see, and its contents should be reconstructing automatically now (though that obviously may take a while).
It might be worth checking your friends/followers to make sure there are no major absences. I don't know if this impacts who you're following or just the basic contact info.
Beyond that, here's what's changed and changing followed by accountability for my mistakes:
* The server now has a NAS backend with disk redundancy, which will protect against drive failures. It will also let me share resources between this server and a second server once I have the drive issues figured out.
* Once the original server is fixed/cleared, it will be running a load-balanced second copy of the webserver (and eventually a copy of the database) to improve performance and reliability. (The NAS will allow two copies of the server to share the same media files.)
For accountability:
One of the things that made the failure take longer to recover from was the fact that my backups of the database had failed to run due to a typo.
I could have sworn I had checked it to ensure it was running, but clearly I had not.
I fixed the error and confirmed it ran successfully this morning. Database backups are run twice daily and backed up to a remote **encrypted** backup (borgbase.com). I will be checking periodically over the next week or two to confirm that it continues running.
Additionally, with the NAS now set up and available I am running full system snapshots of the database twice a day as well. This means I should have two avenues for recovery across two different methods going forward, which should significantly increase reliability.
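For anyone who wants to automate the "check that the backup actually ran" step, here's a minimal sketch in Python. This is not the script I run. The `*.sql.gz` file pattern, the 13-hour threshold (a bit over the twice-daily interval), and the function names are all assumptions for illustration. It just looks at the newest dump file in a directory and checks that it's recent and not empty.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup(backup_dir: Path, pattern: str = "*.sql.gz") -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None if there is none."""
    # The "*.sql.gz" pattern is an assumption; adjust it to whatever your dumps are called.
    candidates = [p for p in backup_dir.glob(pattern) if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def backup_is_fresh(
    backup_dir: Path,
    max_age_hours: float = 13.0,  # assumed: twice-daily backups plus some slack
    min_size_bytes: int = 1,      # an empty dump counts as a failed backup
    now: Optional[float] = None,
) -> bool:
    """True only if the newest backup exists, is non-empty, and is recent enough."""
    latest = newest_backup(backup_dir)
    if latest is None:
        return False
    info = latest.stat()
    if info.st_size < min_size_bytes:
        return False
    current = time.time() if now is None else now
    age_hours = (current - info.st_mtime) / 3600
    return age_hours <= max_age_hours
```

Run from cron with something like `backup_is_fresh(Path("/path/to/dumps"))` (the path is hypothetical) and alert whenever it returns `False`. That would catch a job that silently never ran, which is exactly the mistake I made.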
Fuzzy Thumbnails
I'm not entirely certain what's causing this, but I do know what kicked it off.
I'm doing some migrations of the media files between boxes and it seems to have somehow messed up the thumbnails and some cached remote images. However, I've seen no problems with any uploaded files (aside from thumbnail views of those files).
This appears to be resolving itself over time as the server updates contacts and recaches many of these files.
Migrations should be finished in the next day or two and there should be no significant downtime.
@Friendica Admins I could use some help cleaning up something on my instance.
I'm moving the storage around, and for some reason a lot of the thumbnails (not all) seem to have gotten corrupted. Regular files seem fine: full-size profile pictures are good, but the small versions are blurry.
Is there any good way to clear the storage of everything that's cached both in terms of thumbnails and data from other servers?
I live together with my sibling in a two bedroom, I'm the only one with a consistent income and doing my best to care for them. They struggle with borderline personality disorder, ptsd, bipolar, and bad anxiety. They aren't able to maintain a stable living situation on their own.
They've been in a relationship for most of the past year with someone who was great on the surface but insecure and shitty about it underneath. They kept taking their insecurities out on my sibling and isolating them, and started helping them with their meds and then stopped multiple times making their situation so much worse (and getting shitty if anyone else helped, so I wasn't even in the loop to help because he would have shit on them for even telling me about it).
The break up went catastrophically bad because he kept escalating until my sibling was in the worst episode I've ever seen and eventually antagonized them to the point of shoving him... then the cops showed up and arrested my sibling. (seriously, ACAB.)
Long story short, skipping a lot of detail elements, this has put us in the position to be a few hundred dollars behind in budget to cover rent and expenses, especially the court costs (ex isn't pressing charges, but in Texas that doesn't matter apparently). They have to get into alcohol counseling (they were in a bad way and went out with friends that night, the bar severely overserved them), as well as get back on their meds, and pay their bond... all within the next few days.
We've got leads on a lot of resources, but it's not going to make ends meet in the short term and we could really use help.
If you could send some money to help out, the best method is Cash App to $ShiriBailem (benefits of a unique name!)
If you can't, a boost/reshare would be greatly appreciated!
Server News
Below is the text that was shown on the site during the outage, including its updates, just for accountability:
Hardware Failure - Recovery Attempts In Progress
The primary disk used by the webserver, database, and some of my other servers appears to be failing critically.
I am attempting recovery efforts, but also have to work my dayjob. Automated backups have been running daily on the database so the server should be restorable.
I'll update this page with notes as possible (it's on the same box, so can potentially fail as well). I get off work at 5pm US CT time, this is being written at 9:50am.
If you'd like to reach out, shiri [at] bailem.me for email and shiri:beeper.com for Matrix (if you prefer XMPP shiri_beeper.com@aria-net.org should work to use the public bifrost bridge).
Update 7PM: everything is migrated and recovery is underway. Database is attempting to recover on its own and I'm just waiting on it. Once it's done restoring, I will try to re-enable the server. If the database fails from that point, I'll restore the database from the backup made early this morning.
Update 8PM: Database is just taking a long time to recover, didn't help that default settings timed out the recovery and I had to restart it. Database is running on MariaDB which has the good sense to design with various safety logs, so it's able to backtrack and rerun commands to recover data. I expect it to be back up in the next hour or two.
Update 8:30PM: Didn't help that I had a momentary power outage reset my progress.
Update 9:45PM: Upfront honesty, looks like the backups were broken and I should have taken a closer look at them. However, the database is only mildly corrupted and I'm going to be able to do a dump and restore. This takes time as it's 20GB in size, but I feel confident that it'll be good after this.
Update 1AM: we're just about to the end of it, I just have to figure out why I'm getting an odd session data error? Unfortunately I have to sleep and that means calling it for the night.
Server News
Looks like there's bigger issues with the lost APContact table, I'm investigating to see what I can do about that.
*Hopefully* this is self-resolving as it refinds all the users.
Shiri Bailem
@Server News Looks like it's resolved?
It was causing issues searching contacts and pulling up profiles. But I incidentally hit the db update button in Friendica (usually for major version updates) and it abruptly worked again, so it might have just been an issue with indices on the table.
Please let me know if you see any issues so I can investigate.