Aaaaaand We're back

So that one bad hard drive that was left went completely kaput and managed to throw the whole array into an unstable state. I couldn't boot the server until I got the replacement for the replacement drive.

Got that this morning, did a few hours of tinkering to get the array to accept the new drive while the old drive was completely removed (it didn't like that lol). But once I got that in, everything came right back up.

Tomorrow I should be getting a replacement for the impaired server and I should be back to 100%.

After that, I intend to use the refund for the old one to get some extra SSDs into the two servers. That'll let me arrange things so that this site doesn't rely on the network storage and can be both faster and less prone to failure.

Content warning: Discussion of Israel/Palestine

I would argue that if you believe that Israel should become a "state for all of its citizens" instead of "the state of the Jewish people", and include both Arabs and Jews as equal citizens, then you are an anti-Zionist. You don't need to call for the ethnic cleansing of Jews in Israel/Palestine to be an anti-Zionist. You just have to reject territorial expansionism, settler colonialism, and "Blood and Soil".

I consider myself anti-Zionist and find the idea of an ethnonationalist settler colony absolutely ridiculous, regardless of where it is. I am also Jewish, so there's that.

Nothing like making a small mistake that knocks all your servers offline for 10 hours... dear god that was slow...

I made an oopsie

My apologies to everyone for the extended outage yesterday.

For background, 2 of the hard drives in the storage array this site is using have gone bad. I got replacements in and set to work migrating storage.

Unfortunately I was overconfident in the process as I had never actually needed to perform such a migration before, let alone in a live environment used by others.

I made assumptions about how the tools would handle swapping out the drives (I used 'pvmove'...) and didn't realize that the tool I selected would lock the entire filesystem until it was done. (It took ~10 hours to transfer a single disk...)

This was made worse by the fact that the second replacement drive was DOA (it actively prevented the system from booting, so I spent a couple hours troubleshooting that before I realized I hadn't knocked something loose... the system was basically just rejecting the new drive).

There's sadly more downtime to come before this is resolved *but* it should be drastically shorter. Next time I'll be using a different tool to transfer without locking the system, so the downtime will just be 2 reboots (1 to put in the new drive, 1 to take out the old).

The replacement for the replacement drive will be arriving on Tuesday.

Dear lord I can't win lately... NAS went down and I basically had to kick it to get it back up, likely related to a dead disk...

Here's the tally of my environment:
* 2 Proxmox servers
  - 1 server failed, I need to send it in for a refund under the protection plan
  - I bought a server to replace it already... that server came with a busted RAM slot, meaning it's already on degraded performance (need to refund it, but want to get a replacement before I send it off...)
* 1 old desktop running as a NAS
  - 1 hard drive is throwing non-critical block errors, threatening to fail... I got the warranty exchange set up already but...
  - 1 hard drive failed entirely after I set up that warranty exchange and now I don't have the parity to spare a drive, so I need spare drives before I can start exchanging drives (those drives should arrive on Saturday)

And in all of this... spending way too much money I really shouldn't be spending at all...

And before someone thinks I'm throwing around a few K here... everything I'm getting is renewed, which is probably part of the problem but I sure as hell can't afford new. (Proxmox servers are around $100 each, drives around $45 each)

Oh, and the NAS going down is critical because those $100 boxes (a) don't have enough disk space for everything and (b) rely on it for high(-ish) availability transfers between the two nodes (this saved me when the one Proxmox system died).

Those are tiny form factor systems, but they have room for a second hard drive. I need to get a couple of hard drives to drop into them so they have better local storage. Afterwards I can set them up as ZFS and use Proxmox's features to sync the drives between the two nodes... that way they're not relying on the NAS for anything that's not just large.

Reduced Performance / Reliability

One of the servers went down from hardware failure. Thankfully, since I run this across multiple boxes with failover, the site is (obviously) still up.

It might occasionally get a little spotty on connection and especially on performance until that server gets replaced, as the remaining server is a tad bit overloaded.

It'll probably be a few weeks unfortunately as I don't have the spare funds to pre-purchase a replacement (the protection plan I purchased will cover it, but I've got to mail off the unit, wait for the money, then wait for financial stresses to pass enough that I can order the replacement... then a good week or two delivery time after that)