WTF?
I honestly haven't the foggiest idea how this happened, but apparently the DNS settings on the servers got changed a few days ago with absolutely no explanation (and to nonsense values, for some reason). I'm going to keep an eye on them to make sure they don't change again.
Additionally, I think that created a cascade that caused the other problems.
Any posts you've made over the past 2-3 days haven't been sent to other servers, but will start sending now.
As for the other problems, I think that's what caused so many processes to lag and take far longer and more resources than usual: any time one tried to contact another server, it timed out on the DNS request.
DoS Overload
There have been some recent outages of the server. I've tracked the root cause down to the server getting overloaded with requests (mostly updates from other servers). Those updates have been coming in faster than the server can process them, preventing other requests from getting through.
I've made some tweaks that I believe have resolved it, fingers crossed.
Technical explanation:
The servers ran out of php-fpm threads to handle requests. They were configured with a static count of 30 each (60 total), and memory leaks were hitting those threads hard, which is what kept the count that low.
I've changed the pool from static to ondemand and increased the count to 100 each. I'll probably go in and increase it again, since it's still pegged at that limit almost constantly, but thankfully running on-demand seems to be keeping the memory usage per thread drastically lower.
Where the static assignment of 30 was eating up 8GB of RAM, 100 on-demand threads are only taking up 1.3GB.
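(Assuming those figures are per server, that works out to roughly 8GB / 30 ≈ 270MB per thread under the static config versus 1.3GB / 100 ≈ 13MB per thread on-demand.)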
I'm going to increase it until it's either hitting memory constraints or it's no longer constantly at full capacity.
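For anyone curious, here's roughly what that change looks like in a php-fpm pool config. This is a sketch rather than the exact file from the server: the path is illustrative, and the idle timeout shown is just php-fpm's default of 10s.

    ; php-fpm pool config (e.g. /etc/php/fpm/pool.d/www.conf -- path is illustrative)

    ; before: a fixed pool of 30 workers per server, always resident in memory
    ;pm = static
    ;pm.max_children = 30

    ; after: workers are only spawned when requests come in, and exit when idle
    pm = ondemand
    pm.max_children = 100           ; hard ceiling on concurrent workers
    pm.process_idle_timeout = 10s   ; idle workers exit after 10 seconds
    ;pm.max_requests = 500          ; optional extra knob: recycle workers periodically to limit leak buildup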
Update:
There's definitely some sort of time and code problem involved, as it hit again this morning even with the previous changes, though this time it only impacted updates (making posts/comments/likes, getting new posts). I think reading was unaffected because those operations are faster and require significantly less memory.
For whatever reason, sometime around midnight the server gets hit with a bunch of requests that all seem to lock up, eat large quantities of memory, and never exit. (With on-demand, threads exit after 10s of being idle, yet there were over 100 threads running continuously from midnight until I killed them around 9am.) There was also a massive flood of updates from other servers corresponding to that, so I think it might just be a bunch of large servers sending bulk updates or some such.
New tuning to handle that: I put firmer time limits into PHP to prevent threads from running forever. There are two options for setting maximum execution times, and the first was getting ignored (I think Friendica overrode it? The second should override that and kill any threads that run too long).
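For reference, the usual pair here is PHP's max_execution_time, which an application can raise or remove at runtime with set_time_limit() (which would explain the first limit being ignored), and php-fpm's request_terminate_timeout, which is enforced by the FPM master process and terminates the worker regardless of what the script sets. Roughly (values illustrative):

    ; php.ini -- the application can override this limit at runtime
    max_execution_time = 300

    ; php-fpm pool config -- enforced by the FPM master process, so it applies
    ; even if the script (or Friendica) changes max_execution_time
    request_terminate_timeout = 5m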
In addition to that, I set up a rate limiter on the inbox endpoint (where other servers send updates). This should help keep those updates from overloading the server; the majority of the time it'll just slow them down by a second or two, but when the server is overloaded, the rate limit should help keep it accessible for users.
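The rate limiting itself isn't Friendica-specific; if nginx is the web server in front of php-fpm, a setup along these lines matches the behavior described (the endpoint path, zone name, and rates are illustrative):

    # nginx -- limit how quickly any single remote host can hit the inbox
    limit_req_zone $binary_remote_addr zone=inbox:10m rate=2r/s;

    server {
        # ... existing server config ...

        location = /inbox {
            # excess requests get delayed rather than rejected, so most of the
            # time senders are only slowed down by a second or two; only a
            # sustained flood gets turned away
            limit_req zone=inbox burst=10 delay=2;
            limit_req_status 503;

            # ... hand off to php-fpm as usual ...
        }
    }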