Downtime
Sorry for the downtime. The server crashed and I wasn't able to address it quickly or smoothly in my current mental state.
I took advantage of the downtime, and probably exacerbated it a little, and finally did the migration back to local hardware... so it *should* be more reliable.
What honestly stretched it out a lot is that I also applied the most recent stable Friendica version and that took forever to update the database.
It's with a heavy heart that I'm letting y'all know that I'm disabling open-registrations and *encouraging* all users to find a new home.
I'm not kicking anyone off my server, but unfortunately, given the political situation here in the US and the chance for things to go *very bad very quickly*, I cannot vouch for this server as reliable.
I live in Texas and need to figure out plans to evacuate at this point. I was holding out hope that we'd at least have status quo (as monstrously awful as it is) for longer.
And to make things worse, I'm a trans-woman, and they actively want to make my very existence (let alone presence online) illegal and have been building the machinery to make that a very fast process once Trump assumes office.
So I do not recommend this server any longer for those reasons. If you choose to stay, I'll continue to run it and support y'all.
Sorry for the downtime
The housing situation has been a pain. I'm still on the temporary server environment and have been hitting major resource bottlenecks.
I'm hoping to get a place sooner, but I've hit some roadblocks that very likely will push it drastically further out... in which case I'll need to spend more money on this environment to return things to stable.
Reminder that I'm covering this entirely out of my own pocket and these constraints are because I'm between homes and can't use my own much cheaper hardware. If you appreciate this instance I would very much appreciate a donation.
<Insert Profanities>
So the temporary system had some sort of failure, I'm not even 100% sure what caused it to be honest. It went down sometime yesterday and some of the virtual drives got corrupted, which caught the database and the virtual gateway device.
I was able to restore the system... most of the way. Thankfully there are backups of the database, but some of them were flawed as well. The most recent intact one was from 5/16, so 5 days were lost.
To be clear, this problem was exacerbated by the fact that there's not as much redundancy in the temporary setup (sadly it looks like it'll be a few more months before I have a place of my own and can spin up my own hardware again). But I'm going to still look at how I might get those in better shape.
As far as how long it took: I had a busy day yesterday and didn't see that the server was down until I was too exhausted to do anything about it, so it had to wait until I got off work today... each attempt at restoring the database takes around an hour, so that took *a while* to get restored.
Image Upload Trouble
If y'all have had issues uploading images... sorry about that.
I missed a setting when re-configuring the server after the transfer to its current environment. It's fixed now.
Longer explanation: there are two servers involved, a reverse proxy and the actual Friendica server. The Friendica server accepted uploads up to 100MB, but the proxy didn't have that setting, so an upload would run for a bit and then time out. I added the setting to the proxy and that solved it.
Server Migration Complete
Thank you for your patience!
The server has been migrated to a new dedicated remote server (hosted with OVH) and to the newest version of Friendica (2024.03, previously it was 2023.12).
Things should stabilize mostly for the time being, but I will be on the lookout for bugs.
Server Migration Update
The downtime for server transfer is coming up sooner now, I've had to twist a few things to make it work and my apologies for any inconvenience.
I deleted a large chunk of the media on the server, primarily focusing on data that hasn't been accessed in 60 days, but it does look like it hit a few more recent pieces.
If you've been with us for a few months you might have lost some old photo uploads, and some old contacts might have blurry profile photos for a while until the system decides to redownload them.
To be clear, this *only* impacted images, and we don't offer any guarantees against data loss on this server, especially with uploaded images. There's simply too much data there for me to reasonably back up at this time, and worse yet there really isn't any distinction in Friendica that allows me to back up only local media as opposed to cached remote media.
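As a rough illustration of an age-based cleanup like the one described above (not the actual procedure used on this server), here is a minimal Python sketch. The function name `find_stale_files` and the use of filesystem access times are assumptions. Many systems update access times lazily or not at all, so a rule like this is only approximate.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days=60, now=None):
    """Return sorted paths under `root` last accessed more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # st_atime is the last access time as reported by the filesystem
            if os.stat(path).st_atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Listing candidates first, and deleting only after reviewing the list, leaves room to spot files that look too recent to remove.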
Upcoming Short Downtime
I'm going through a financial crisis right now and am forced to couch surf for the next few months, so I'm not going to have a place or the server ***BUT*** it won't be offline more than a few hours.
There's going to be a short downtime sometime in the next few days. Things are a mess so I don't have an exact set time.
But I've temporarily rented a dedicated server from OVH for my needs and will be migrating the server there so that it can stay online while I work on getting a new place.
Additionally, when the time comes I will also be applying the new Friendica 2024.03 update.
Heads Up For Possible Outage
A reminder that I'm located in the US, and more particularly in Texas and this is a server run out of my home.
With the massive freeze incoming this weekend, there is a decent chance of a significant and extended power outage (Texas has a notoriously awful and poorly managed power grid, notably run completely separately from the rest of the country).
DDOSed by... facebook chat?
Apparently some facebook interface decided to DDOS the site a little over an hour ago.
It's not overwhelming the network, just an absolutely ridiculous number of requests.
I've solved it by instituting a global rate limit. It should be high enough to not affect anyone actually using the server.
The basic gist is that anything more than 10 requests a second gets a 429 error (Too Many Requests; like all error codes on this site, it'll give you a cute cat picture specific to that error). The limit is purely per second, so if you ever see that error, the limit will already have reset by the time you refresh.
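For anyone curious how a per-second global limit like this behaves, here is a minimal Python sketch. It assumes a simple fixed one-second window; the real limiter may count differently, and the class name `FixedWindowLimiter` is purely illustrative.

```python
class FixedWindowLimiter:
    """Allow at most `limit` requests per one-second window (global, not per user)."""

    def __init__(self, limit=10):
        self.limit = limit
        self.window = None  # the whole second currently being counted
        self.count = 0

    def check(self, now):
        """Return the HTTP status for a request arriving at time `now` (in seconds)."""
        window = int(now)  # requests within the same whole second share a window
        if window != self.window:
            # A new second has started, so the counter resets
            self.window = window
            self.count = 0
        self.count += 1
        return 200 if self.count <= self.limit else 429
```

Because the counter resets every second, a 429 seen by a person clicking around clears itself before they can reload the page.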
Aaaaaand We're back
So that one bad hard drive that was left went completely kaput and managed to throw the whole array into an unstable state. I couldn't boot the server until I got the replacement for the replacement drive.
Got that this morning, then did a few hours of tinkering to get the array to accept the new drive while the old drive was completely removed (it didn't like that lol). But once I got that in, everything came right back up.
Tomorrow I should be getting a replacement for the impaired server and I should be back to 100%.
After that, I intend to use the refund for the old one to get some extra SSDs into the two servers. That'll let me arrange things so that this site doesn't rely on the network storage and can be both faster and less prone to failure.
I made an oopsie
My apologies to everyone for the long extended outage yesterday.
For background, 2 of the hard drives in the storage array this site is using have gone bad. I got replacements in and set to work migrating storage.
Unfortunately I was overconfident in the process as I had never actually needed to perform such a migration before, let alone in a live environment used by others.
I made assumptions in how the tools would operate in swapping out the drives (I used 'pvmove'...) and didn't realize that the tool I selected would lock the entire filesystem until it was done. (it took ~10 hours to transfer a single disk...)
This was made worse by the fact that the second replacement drive was DOA (it actively prevented the system from booting, so I spent a couple hours troubleshooting that before I realized I hadn't knocked something loose... the system was basically just rejecting the new drive).
There's sadly more downtime to come before this is resolved *but* it should be drastically shorter. Next time I'll be using a different tool to transfer without locking the system, so the downtime will just be 2 reboots (1 to put in the new drive, 1 to take out the old).
The replacement to the replacement drive will be arriving on Tuesday.
Reduced Performance / Reliability
One of the servers went down from hardware failure, thankfully since I run this across multiple boxes with failover it means the site is (obviously) still up.
It might occasionally get a little spotty on connection and especially on performance until that server gets replaced as it means the remaining server is a tad bit overloaded.
It'll probably be a few weeks unfortunately as I don't have the spare funds to pre-purchase a replacement (the protection plan I purchased will cover it, but I've got to mail off the unit, wait for the money, then wait for financial stresses to pass enough that I can order the replacement... then a good week or two delivery time after that)
Short Planned Maintenance Tonight
My apologies if this is inconvenient, I opted to do it on shorter notice without a set hour because (a) there's not a lot of activity on the server and (b) I'm really impatient.
I'm doing a hardware upgrade that requires rebooting the network storage backend which will bring down everything for a short time. It should take well under 30 minutes to do the hardware swap and most of the downtime is just going to be the database starting back up (which often takes in the range of another 30 minutes).
As part of this I'll also be deploying some software updates that require a reboot to take effect.
WTF?
I honestly haven't the foggiest idea how this happened, but apparently the DNS settings got changed a few days ago on the servers with absolutely no explanation (and to junk nonsense settings for some reason). I'm going to keep an eye on them to make sure they don't change again.
Additionally I think that created a cascade that caused the other problems.
Any posts you've made over the past 2-3 days haven't been sent to other servers, but will start sending now.
As for the other problems, I think the bad DNS settings made many processes lag and use far more time and resources than usual, because every attempt to contact another server timed out on the DNS request.
DOS Overload
There have been some recent outages. I've tracked the root cause down to the server getting overloaded with requests (mostly updates from other servers). Those updates have been coming in faster than the server can process them, which prevents other requests from getting through.
I've made some tweaks that I believe have resolved it, fingers crossed.
Technical explanation:
The servers ran out of php-fpm threads to handle requests. Each was configured with a static count of 30 (60 total). They were definitely impacted significantly by memory leaks, which kept the count low.
I've changed it from static to ondemand and increased the count to 100 each. I'll probably go in and increase it again, since it's still pegged at that limit almost constantly. Thankfully, running on-demand seems to be keeping the memory usage per thread drastically lower.
Where the static assignment of 30 was eating up 8GB of ram, 100 on-demand threads is only taking up 1.3GB.
I'm going to increase it until it's either hitting memory constraints or it's no longer constantly at full capacity.
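To put the numbers above in per-worker terms, here is a small Python calculation. It assumes 1 GB = 1024 MB and that memory divides evenly across workers, which is a simplification since idle on-demand workers exit and real usage varies per request. The helper names are illustrative.

```python
MB_PER_GB = 1024


def per_worker_mb(total_gb, workers):
    """Average memory per worker in MB, given total usage in GB."""
    return total_gb * MB_PER_GB / workers


def max_workers(budget_gb, mb_per_worker):
    """How many workers fit in a memory budget at a given average size."""
    return int(budget_gb * MB_PER_GB // mb_per_worker)


static_mb = per_worker_mb(8, 30)        # static pool: 30 workers using 8 GB
ondemand_mb = per_worker_mb(1.3, 100)   # on-demand pool: 100 workers using 1.3 GB

print(f"static:   ~{static_mb:.0f} MB per worker")    # ~273 MB
print(f"ondemand: ~{ondemand_mb:.0f} MB per worker")  # ~13 MB
```

At the on-demand average, the same 8 GB could in principle hold several hundred workers (`max_workers(8, ondemand_mb)`), though that is an upper bound under these assumptions rather than a recommended setting.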
There's definitely some sort of time and code problem involved as it hit again this morning even with the previous changes, though this time it only impacted updates (making posts/comments/likes, getting new posts). I think reading was unaffected because those operations are faster and require significantly less memory.
For whatever reason, sometime around midnight the server gets hit with a bunch of requests that all seem to lock up, eat large quantities of memory, and then won't exit. (With on-demand, threads exit after 10s of being idle; there were over 100 threads running continuously from midnight until I killed them around 9am.) There was also a massive flood of updates from other servers at the same time, so I think it might just be a bunch of large servers sending bulk updates or some such.
New tuning to handle that: I put firmer time limits into PHP to prevent threads from running forever. There are two options for setting max times, and the first was getting ignored (I think Friendica overrode it?). The second should override that and kill any threads that run too long.
In addition, I set up a rate limiter on the inbox endpoint (where other servers send updates). This should help keep it from overloading the server: most of the time it'll just slow them down by a second or two, and when the server is overloaded the rate limit should help keep it accessible for users.
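A limiter that mostly delays requests rather than rejecting them can be sketched as below. This is an illustrative Python model only: the class name `DelayLimiter`, the rate, and the maximum delay are all assumptions and do not reflect the actual settings on the inbox endpoint.

```python
class DelayLimiter:
    """Space requests out to a steady rate, rejecting only when the queue gets too long."""

    def __init__(self, rate_per_sec, max_delay):
        self.interval = 1.0 / rate_per_sec  # minimum spacing between requests
        self.max_delay = max_delay          # longest wait tolerated before rejecting
        self.next_free = 0.0                # earliest time the next request may start

    def schedule(self, now):
        """Return how long a request arriving at `now` should wait, or None to reject it."""
        start = max(now, self.next_free)
        delay = start - now
        if delay > self.max_delay:
            return None  # too far behind: shed the request instead of queueing it
        self.next_free = start + self.interval
        return delay
```

Under light load every request gets a delay of zero. During a burst, senders are slowed by a growing delay until the cap is hit, at which point extra requests are turned away so that capacity remains for regular users.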
Oops
I made a performance tweak that shouldn't have had an impact, but it resulted in a non-error being flagged as an error (I was getting a 302, which really just means "look at this other address").
Fixed the tweak, otherwise should be a tiny bit better. I've got it recognizing a lot of the potential errors and better skipping between servers if one of them acts up.
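The idea of treating redirects as healthy and skipping to another server on a real error can be sketched like this in Python. The exact set of status codes the real check accepts is an assumption, and `probe` is a hypothetical callable that returns a backend's HTTP status.

```python
def is_healthy(status_code):
    """Treat 2xx and 3xx (including 302 redirects) as healthy; anything else as a failure."""
    return 200 <= status_code < 400


def pick_backend(backends, probe):
    """Return the first backend whose probe status is healthy, or None if all fail."""
    for backend in backends:
        if is_healthy(probe(backend)):
            return backend
    return None
```

For example, with one server answering 502 and another answering 302, the second is chosen rather than both being marked down.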
Public Status Page
I finally went and set up a public facing status page for the site.
You can go to https://status.foggyminds.com to see the site uptime. This will tell you if the site is currently reporting up, as well as every time it's gone down.
I've had StatusCake set up for a while so it emails me whenever the site goes down, but I haven't had the public page set up.
I'll try to keep up on putting notes on any downtime, though I can't promise they'll always be helpful (the most recent 15 minute downtime has me stumped as it resolved right as I sat down at my computer to look it up, and the error logs were very uninformative as to what may have happened).
Minor Caching Issue
I was notified of a little glitch causing some pages to show the admin view and excess two-factor prompts. This appears to have been a server caching issue. This would not have exposed any data or granted any special access, it was just a static cache of content being shown instead of sending it to the server code to save time.
The caching was only supposed to impact image files, but it looks like it somehow grabbed some page files too.
As server performance is now much better than it was when I implemented caching, and the performance difference is now negligible, I have opted to turn off caching for the time being.
If you experienced this problem, it should go away with just a refresh of the page.
Outage - Database Stuck
... I can't get 1 day without issues apparently...
When I was asleep, it looks like the database got stuck in some sort of optimize process with everything stuck waiting on that.
Restarting the database server forced that process to clear and cleaned things up.
I suspect what happened was I got overzealous after things started working great and I set the background worker count too high (these workers automatically run in the background doing things like updating contacts, downloading posts, etc).
I tuned that setting down, and increased the number of connections the database allows.
Performance Issues - Tentatively Resolved?
I feel really dumb for not noticing the cause sooner. I mostly attribute it to the rarity of the problem and the inconsistency with which it occurred.
I use a virtual server environment for my servers, and one of my measures to improve reliability and performance was to have two instances of the webserver behind a load balancer. In layman's terms, whenever you connect you're assigned to whichever one has the fewest connections, and they're otherwise identical (same files, same database, etc.).
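The "fewest connections" rule can be illustrated with a short Python sketch. The server names `web1` and `web2` are hypothetical, and a real load balancer also decrements the count when a connection closes.

```python
def pick_least_connections(active):
    """Given {server_name: active_connection_count}, return the least-loaded server.

    Ties are broken alphabetically so the result is deterministic.
    """
    return min(sorted(active), key=lambda name: active[name])


# Simulate four new connections arriving at two identical webservers
active = {"web1": 0, "web2": 0}
assignments = []
for _ in range(4):
    server = pick_least_connections(active)
    active[server] += 1
    assignments.append(server)

print(assignments)  # ['web1', 'web2', 'web1', 'web2']
```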
Well... it turns out that when I initially duplicated the server, the software decided not to change the MAC address. In layman's terms: the IP address of each server is different, but the network uses the MAC address to map IP addresses to boxes. If two boxes have the same MAC address, then traffic is going to sporadically and randomly switch between them, and whenever it lands on the wrong box it arrives with an IP address that box doesn't own, which means about half of the traffic is always getting rejected.
So your connection to the load balancer was fine, but it was struggling to connect to the webservers and the webservers were struggling to connect to the database.
Once I changed that the server immediately became quite zippy!
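A duplicated MAC address like the one described above is easy to check for once you have an inventory of machines. Here is a minimal Python sketch; the hostnames and addresses are made up, and how you collect the inventory depends on your virtualization setup.

```python
from collections import defaultdict


def find_duplicate_macs(hosts):
    """Given {hostname: mac}, return {mac: [hostnames]} for MACs used more than once."""
    by_mac = defaultdict(list)
    for host, mac in hosts.items():
        # Normalize case and separators so the same address always compares equal
        by_mac[mac.lower().replace("-", ":")].append(host)
    return {mac: sorted(names) for mac, names in by_mac.items() if len(names) > 1}


# Hypothetical inventory: web2 was cloned from web1 and kept its MAC address
hosts = {
    "web1": "52:54:00:AA:BB:01",
    "web2": "52:54:00:aa:bb:01",
    "db": "52:54:00:aa:bb:02",
}
duplicates = find_duplicate_macs(hosts)
print(duplicates)  # {'52:54:00:aa:bb:01': ['web1', 'web2']}
```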
My sincere apologies for the impact.
Outage - Self Resolved - Investigating
The server went down for a few hours today and resolved before I could look at it. It's also been a really bad day for my physical health so I've had very little capacity.
I am looking into why it happened but have no firm answers at this time.
As far as I can tell, the biggest most obvious culprit is the database backup which was happening at that time.
It looks like the database is big enough that backups are no longer simple and I'll need to change my backup method.
Revised method of backups set up, tomorrow I'll test that it's working and then attempt a restore (on a second server so no impact here).
I've also refreshed the virtual network interface as I learned it may have been dropping packets for some unknown reason and it's working fine now.
Login Screen Issue
There was an issue for a little while that I didn't notice until I rebooted my desktop in which visiting foggyminds.com/ would give you the login screen regardless of your login status (as opposed to foggyminds.com/network or any other address).
Apparently the load balancer was mistakenly caching the root page which caused it to always show the login page.
Server Performance
Just acknowledging that the server has had some spotty performance recently. I'm unable to figure out the cause but continuously investigating.
We now have two load balanced servers, and I can establish that it's not specific to either server. I've updated the load balancer with caching settings which should help alleviate a little (public images now will get cached and not have to go through PHP and database queries).
The database is showing no performance issues that I can see: when the page lags, the database queries do not.
Additionally, the server lag appears to be random and per request (as in another identical request made at the same time doesn't lag).
My focus is going to be on examining the PHP-FPM service on both nodes, they're both reporting slow execution.
Nitter Update
You may or may not have noticed that Twitter/X links on this site get rewritten. I have the Nitter addon installed that rewrites twitter.com links to use proxy pages that don't profit Twitter and don't expose you to the rest of Twitter's nonsense.
The Nitter site I was using (notabird.site) is no longer functioning, so I've switched to a similarly compatible site, traittor.net/.
Hardware Status
Outage / Hardware Failure And Recovery
I'm going to start with my apologies for the 24 hour outage. Thankfully there should be little impact beyond that.
I've been running all of this on a single box, and had plans (last night in fact) to expand it to two with a NAS backend to help protect it from failures and improve performance.
Ironically, the morning in which I planned to make those updates was when I experienced some sort of disk failure on the server. I don't know exactly what happened and the exact level of failure (diagnosis of that will be tonight, mostly to see if I need to get a warranty replacement, or if there was some software cause).
The disk failure presented itself as bad blocks and mild data corruption which prevented multiple services from running.
The good news is that it appears nothing of real importance was corrupted. We lost the majority of one database table, but all the table contained was a list of every single ActivityPub contact the server could see, and its contents should be reconstructing automatically now (though obviously that may take a while).
It might be worth checking your friends/followers to make sure there are no major absences, I don't know if this impacts who you're following or just the basic contact info.
Beyond that, here's what's changed and changing followed by accountability for my mistakes:
* The server now has a NAS backend with disk redundancy, which will protect against drive failures. It will also help share resources between this server and a second server once I have the drive issues figured out.
* Once the original server is fixed/cleared it will be running a load balanced second copy of the webserver (and eventually a copy of the database) to improve performance and reliability. (The NAS will allow two copies of the server to share the same media files.)
For accountability:
One of the things that made the failure take longer to recover from was the fact that my backups of the database had failed to run due to a typo.
I could have sworn I had checked it to ensure it was running, but clearly I had not.
I fixed the error and confirmed it ran successfully this morning. Database backups are run twice daily and backed up to a remote **encrypted** backup (borgbase.com). I will be checking periodically over the next week or two to confirm that it continues running.
Additionally, with the NAS now set up and available I am running full system snapshots of the database twice a day as well. This means I should have two avenues for recovery across two different methods going forward, which should significantly increase reliability.
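Since the original problem was a backup job that silently never ran, a periodic freshness check is one way to catch that early. Below is a minimal Python sketch; the 13-hour threshold (12 hours between twice-daily runs plus some slack) and the function name are assumptions, and obtaining the time of the last successful backup from the backup tool is left out.

```python
from datetime import datetime, timedelta


def backup_is_fresh(last_success, now, max_age=timedelta(hours=13)):
    """Return True if the most recent successful backup is recent enough.

    With backups twice a day, anything much older than 12 hours suggests a missed run.
    """
    return now - last_success <= max_age
```

A check like this would run on a schedule and raise an alert when it returns False, instead of relying on someone remembering to look.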
Looks like there's bigger issues with the lost APContact table, I'm investigating to see what I can do about that.
*Hopefully* this is self-resolving as it refinds all the users.
@Server News Looks like it's resolved?
It was causing issues searching contacts and pulling up profiles. But I incidentally hit the db update button in Friendica (usually for major version updates) and it abruptly worked again, so it might have just been an issue with indices on the table.
Please let me know if you see any issues so I can investigate.
Fuzzy Thumbnails
I'm not entirely certain what's causing this, but I do know what kicked it off.
I'm doing some migrations of the media files between boxes and it seems to have somehow messed up the thumbnails and some cached remote images. However, I've seen no problems with any uploaded files (aside from thumbnail views of those files).
This appears to be resolving itself over time as the server updates contacts and recaches many of these files.
Migrations should be finished in the next day or two and there should be no significant downtime.
Server Crash
Regarding the downtime that happened last night (while I was trying to sleep which is why it went on for so long).
The short version is a bunch of stuff clogged up the pipes, hung, and just needed a good ol' fashioned restart to fix. I've made some changes to help reduce the chance of that happening again, as always I'm sorry for the trouble.
The longer version is that the php worker processes hung and bogged down the database which brought the whole thing to a screeching halt.
I've changed the limits on those workers so they should have less impact and hopefully not do that again (the downside is that they'll be a little slower on federating updates, but most of the time that shouldn't be noticeable).
I've also taken advantage of the existing downtime to migrate the database over to a second machine with more memory. I originally intended to upgrade the memory of the machine it was on, but unfortunately made the mistake of buying the wrong chips. That plan is still pending. However by migrating it I was able to expand the memory usage significantly which should help performance, the downside is that it's a busier system so the CPU is occasionally busier and can sometimes have a negative performance impact (it's likely negligible but I'm not super confident of that).
In the next week or two I plan to do further hardware upgrades, but with the database already migrated there should be negligible downtime, if any.
Once that's done, I'm hoping to implement some high availability options to further reduce downtimes.
If you're experiencing particularly slow load times on the network page (the default homepage with your main feed), one thing on your end you can tune is how many items it tries to load at one time.
Go to Settings -> Display -> Content/Layout and you can change "Number of items displayed per page".
Especially since an item includes a post and all of its comments as one item, this can make a drastic performance difference (on my personal feed, 40 sometimes takes >30 seconds to load, but 20 takes <3 seconds).
Some updates since it's been a messy few days and I really should have said something sooner.
After a post of mine went viral, the server encountered some load issues that kinda cascaded.
I've been gradually reducing the impact through tuning, and have ordered more ram sticks to expand the available ram on the server, as well as give me room on another server to provide redundancy (so it can try to spread the load between boxes).
RAM sticks for the second box will arrive tomorrow, and I'll review how I want to approach things from there. No planned downtime for tomorrow, but there is a planned downtime next Tuesday when the other RAM sticks arrive.
Side note: I've made a custom error page to at least make it less painful, no blaring white, automatic attempts to refresh the page, and my email address to reach out about it if needed.
Planned Outage
Power company is coming to work on the power transformer next Monday (7/17).
I've been notified that the planned outage is for 9am to noon US Central Time.
Outage Report 2023-06-20
Server was down for a few hours because internet was down for that time, had a water leak in a neighboring apartment and had to cut the fuse that the internet was on.
Obviously got that situation resolved.
Outage Report
Server went down as I was asleep, it ran out of storage.
I've expanded the storage and have adjusted the settings to reduce storage consumption. The biggest factor being that I had user profile caching turned on and it looks like my server discovered a bunch of new servers resulting in a large uptick in cached avatars.
This setting has now been turned off, it should make for a minor performance hit on new users but otherwise the big impact is significantly reduced storage space usage.
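Running out of storage is the kind of failure a simple threshold check can warn about ahead of time. Here is a minimal Python sketch using the standard library; the 10% threshold and the function names are assumptions, not what this server actually monitors.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path="/", min_free=0.10):
    """Return True when less than `min_free` of the disk remains free."""
    return free_fraction(path) < min_free
```

Run on a schedule, a check like this gives some warning before a growing cache fills the disk.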
I'll pretty this up later, but I figured it's a good idea to have an account specifically for server updates and information separate from my personal account.
First update is server blocks:
poa .st, cawfee .club, wolfgirl .bar, noagendasocial .com - I verified that admins on all of these share and permit explicit hate speech (mostly homophobia)
detroitriotcity .com - Hate speech ***and*** Nazis
Vanessa, in reply to Server News:
Thanks for that