My homeserver keeps killing SSDs for no reason
It's happened twice now, which is two times too many. But it is pretty fucking funny.
For those who haven’t been keeping up for the last little bit, I’ve built a nice lil homeserver, name of WuffServ, which has been running since late September. It’s great! It’s been fun gardening my own little corner of the internet, hoarding data and media 😉, and just getting my paws dirty with system adminning little self-hosted things for myself and my cool pals. Right now, it’s running my Plex server, a fileserver, camera roll backups, adblocker DNS, a bunch of miscellaneous automation stuff, a project management tool, and some game servers, including a Garry’s Mod server which has hosted the Jazztronauts collab that has had critics raving!
However, in the background away from prying eyes, WuffServ has been racking up a kill count. Two SSDs of two different types have died now: one that had previously been running in a different spot doing minimal work for a while, and another that was only stuck in a couple of months ago. It’s funny as, but also a little concerning. And so, I’m gonna write about it! Because fuck knows I need something to write about on this webbed site lmao
The chip death box
The server itself is just regular old PC hardware I built myself. An Intel i5-14600K, a Z790 mobo for maybe some Thunderbolt shenanigans later, 64GB of RAM for future-proofing (luckily acquired before the RAMpocalypse), contained within a Fractal Define R5 so I can have a modestly sized ATX server with plenty of 3.5in bays for future storage expansion. Software-wise, it’s got Proxmox VE as the host OS, which has been incredible for spinning up VMs and Linux containers, and would go on to serve as the beginnings of exorcising Windows from my life (a story for another poast). At the same time, I picked up a 2TB Kingston NV3 NVMe drive to act as the container and VM storage drive. This drive has been totally fine. It’s good and it’s great! Yippee!
The drives that I have been having problems with are the ones I’ve been using to store the host OS. I will now chronicle the life and times of these drives, all taken from this world far too soon.
(Quick author’s note: I didn’t actually name the drives like this, but for the sake of telling a funny story, I’ll give them descriptive names to personify them a bit. This is a healthy thing to do I’m pretty sure.)
The life and times of WS_Fast: February 2025 - December 2025
The first host OS drive was a recycle job. We’ll call this drive WS_Fast, because it’s fast, compared to all of the 3.5in HDDs that would eventually go into the mix. For the longest time, I was running Pi-hole and a couple of other things off a Raspberry Pi, which since February 2025 had been using a Patriot P300 attached to an NVMe hat. Completely overkill for what I was doing with it, but somehow cheaper than a comparable MicroSD card. Once I decided to acquire all the rest of the parts for WuffServ, I decided to repurpose this drive for the host OS, both to save money and because it had already spent most of its time as a simple OS boot drive anyway. That, and once the big boy server was up and running, I wouldn’t have much use for the Raspberry Pi anymore, at least until I decide to move it on to something else.
For a time, it worked great! Installing Proxmox VE was no problem, and it ran that (as well as stored a couple of ISOs and CT templates) without any problems for the first few months. The trouble started one fateful night in December, just after Chrissy. I was trying to figure out why Plex was taking so long to transcode stuff for unrelated reasons. Then without warning, the web frontend for Proxmox just seized up and wouldn’t load up any of its content when clicking into submenus. Weird, I thought. Maybe it just needs a power cycle?
Nope. Turns out the boot SSD fried itself. This was confirmed by pulling apart my daily driver PC, sticking the drive in it as well (and for good measure, swapping boot drives between machines just to confirm the NVMe slot didn’t cark it), and never seeing it POST again in either machine. Very weird, as there was no indication of any issue with the host or the drive (at the time, I didn’t have any SMART monitoring set up, something which I have since remedied). I wasn’t even able to discern how many writes the drive had actually racked up before it died.
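(For anyone who wants to avoid the same blind spot: the monitoring I’ve since set up is basically just smartmontools doing its thing. Device paths and service names here are examples, so adjust for your own box.)

    # One-off health check (NVMe drives show up as /dev/nvme0, SATA as /dev/sdX)
    smartctl -a /dev/nvme0

    # Or let smartd poll everything and email root when something trips,
    # with a line like this in /etc/smartd.conf:
    #   DEVICESCAN -a -m root
    systemctl enable --now smartd    # the unit may be called smartmontools on Debian/PVE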
Even more tragic is that this was the only boot drive for the entire system. While I did have backups of the containers and VMs via PBS, it doesn’t do anything about backing up the host OS itself. After thinking about it, I ordered in a new set of SATA drives: a pair of Patriot P210 256GB drives, with a plan to reinstall PVE using a ZFS mirror configuration. We’ll call these drives WS_NotAsFast_1 and WS_NotAsFast_2, because they’re fast, but not as fast as WS_Fast. Luckily, the other NVMe (the 2TB one with all the VMs and CTs on it) was totally fine, so I was able to just copy over the configurations that were backed up from PBS (after re-adding it from scratch), and everything was back up and off to the races after a couple of hours.
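(In my case only the configs needed to come back, since the VM disks were all on the surviving NVMe. But for the record, pulling a whole container back down from PBS looks roughly like this; the storage name, IP, VMID, and snapshot timestamp are all placeholders.)

    # Re-add the PBS datastore to the fresh PVE install
    pvesm add pbs pbs-backups --server 192.168.1.50 --datastore wuffserv \
        --username backup@pbs --fingerprint <paste from the PBS dashboard>

    # Restore a container from its latest backup snapshot
    pct restore 101 pbs-backups:backup/ct/101/2025-12-28T02:00:00Z --storage local-nvme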
RIPRIG WS_Fast. Gone too soon. Maybe if you had have fared better in the silicon lottery, you would have served for years to come. o7
What did we learn from WS_Fast?
Redundancy is important! Because the boot drive was the one to fail, there was no hypervisor or operating system at all for the system to run, so the whole tower might as well be an ornament on the coffee table. If you’re feeling adventurous or frugal, you can totally get away with using a single drive for the host OS, and you’ll almost definitely be fine... unless you get extremely unlucky with some bad hardware, or hit other unpreventable disasters like power surges. If you want to add a little bit of redundancy, PVE makes it easy to install the OS on a ZFS mirror that you set up at install time, which is exactly what I did here. That way, if one drive dies (foreshadowing…), the system will still run in a degraded state, and you can replace the failed drive as soon as you possibly can.
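(The replacement dance itself is mercifully short. Pool and device names below are examples, and note that a PVE boot mirror also needs the bootloader partition sorted on the new disk, which is what proxmox-boot-tool is for. Check the PVE docs rather than trusting a random dog on the internet.)

    # See which half of the mirror carked it (rpool is PVE’s default boot pool name)
    zpool status rpool

    # Swap in the new drive, then tell ZFS to resilver onto it
    zpool replace rpool /dev/disk/by-id/ata-old-dead-drive /dev/disk/by-id/ata-shiny-new-drive

    # Watch the resilver chug along
    zpool status -v rpool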
Backups are also important! I had a pretty decent backup strategy for the containers and VMs, the actual important bits, but no backups for the host OS. For Proxmox VE in general, this isn’t a big deal, because ideally, you shouldn’t be doing much (if anything) on the host OS. It’s just there to run the hypervisor, and all of the actual work should be done within the containers and VMs. For me, that just meant I had to also reconfigure stuff like Tailscale and networking. Annoying bullshit, but whatever. Supposedly, PBS can back up the host as well with some CLI magic, but it’s not something I’ve tested thoroughly. Once again, not too important. Might be something for another poast though, watch this space 😉
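(The CLI magic in question is proxmox-backup-client, which can ship arbitrary directories, root filesystem included, off to a PBS datastore. Again, not something I’ve tested thoroughly, so treat this as a sketch; the repository string is a placeholder in user@realm@server:datastore format.)

    # Back up the host’s root filesystem as a pxar archive
    # (it won’t cross filesystem boundaries by default, so /proc, /sys etc. stay out)
    proxmox-backup-client backup root.pxar:/ --repository backup@pbs@192.168.1.50:wuffserv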
The life and times of WS_NotAsFast_2: December 2025 - April 2026
Fast forward to earlier this week. I was doing some routine maintenance on the server, updating the hypervisor and about 20 containers (and containers-within-containers), and decided to have a poke around the drives just to see how they were faring. Shockingly, one of the new SATA drives, the second in the new ZFS mirror pool, was showing as faulted with some read and checksum errors. Weird? I did a quick check of the SMART data, and it showed a couple of reallocated sectors, but nothing alarming. Okay, a bit concerning, but nothing urgent for the moment, so a quick zpool clear later to mark the drive as healthy again, and I went about the rest of my day, with a mind to keep an eye on things.
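(For reference, the sort of thing I was staring at, with the layout abridged and the error counts illustrative:)

    # What "zpool status rpool" was showing, roughly:
    #
    #   pool: rpool
    #  state: DEGRADED
    #         NAME                        STATE     READ WRITE CKSUM
    #         rpool                       DEGRADED     0     0     0
    #           mirror-0                  DEGRADED     0     0     0
    #             ata-Patriot_P210_..._1  ONLINE       0     0     0
    #             ata-Patriot_P210_..._2  FAULTED      9     0    14  too many errors

    # Reset the error counters and put the drive back in rotation (at your own risk)
    zpool clear rpool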
A few days later, I noticed the entire system was running like garbage. CT operations that should have taken only a few seconds were taking minutes, across multiple containers. Something on the hypervisor was getting in the way. Another check of the drive status, and it’s faulted again, this time with a ton more write and read errors. The first thing I checked was the SMART data once again. Only this time, I was greeted with a fun message: “No SMART values”. Huh? How can there be no SMART values? I only checked a couple of days ago and there were definitely values there. I tried smartctl directly through the shell, same deal. Okay, that’s fine, maybe a restart will fix it?
Nup. Once the system came back up (after hanging with no indication that it was actually shutting down), the drive was showing up as “UNAVAIL” in the ZFS pool, and the disk was no longer being listed on the system at all. It’s fucking dead, and WuffServ killed it again. Knowing that there were bad sectors already even this early on, I decided to just pull it for RMA and order in a replacement. Luckily, my foresight to use a mirror for the host OS meant that the other drive in the mirrored pair, WS_NotAsFast_1, was still happily chugging away without any issue! In the end, using the SMART data from that surviving drive (both halves of a mirror cop near-identical writes), I’ve determined that the dead one had only done about 95GBW. WS_Fast had done about 500GBW, and it lasted a good 10 months. This one did less than 1/5th of that, and lasted around half the time. That’s fucked.
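(For anyone playing along at home: most SATA SSDs report lifetime writes through a SMART attribute like Total_LBAs_Written, and turning that into gigabytes is just arithmetic. Attribute names and sector sizes vary by vendor, so this is a sketch, with a made-up raw value that happens to land on my drive’s ~95GBW.)

    # Pull the lifetime-writes counter from SMART
    smartctl -A /dev/sda | grep -i lbas_written

    # If RAW_VALUE comes back as, say, 185546875 LBAs at 512 bytes each:
    #   185546875 * 512 bytes = 95,000,000,000 bytes ≈ 95 GB written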
RIPRIG WS_NotAsFast_2. I feel nothing for you. Piece of shit drive I hate you.
What did we learn from WS_NotAsFast_2?
Fucked if I know lmfao. Maybe Patriot drives aren’t all they’re cracked up to be durability-wise. They’re pretty cheap, and as “they” say: you get what you pay for. Maybe I’m just plain unlucky and got a couple of duds. Genuinely, I have no idea whether that’s to blame, or whether something crazy is going on with this system. The use of ZFS might rear its ugly head later on when it comes to write amplification on the other drive, but that’s a problem for Future Umby.
There was also the problem of “I wasn’t alerted earlier that the boot pool was degraded”. I had been using the ProxMan iOS app for notifications, and it should have been set up to send every email bound for root to my phone, but for whatever reason, it didn’t. The only way around that is to get actual email alerts set up, so that any SMART and ZFS alerts actually end up in my inbox, and not at the whims of Apple’s APNs system. That ended up being its own odyssey to get working with Proton Mail, which might also be worth writing about at some point. Not now though, too tired.
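(The bones of that fix, for the impatient: ZFS’s event daemon, ZED, can email root when a pool degrades, and smartd can do the same for SMART trips, provided the box has a working mail relay behind it. The relevant knobs live in /etc/zfs/zed.d/zed.rc and look something like this; getting the relay itself talking to Proton Mail is the odyssey I mentioned.)

    # /etc/zfs/zed.d/zed.rc (excerpt)
    ZED_EMAIL_ADDR="root"
    ZED_EMAIL_PROG="mail"
    ZED_NOTIFY_VERBOSE=1   # also mail on routine events like resilver completion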
The end?
Okay, this was a bit ranty, but I just wanted to document the problems I’ve been having, in case I really am just that unlucky, or in case there’s a super secret hidden flag I missed somewhere in the PVE configuration. Hopefully this will be the last I write of sudden SSD failures for a while. After all, first time’s unlucky, second time’s a coincidence. Third time? Something’s wrong.
About the Author
Umby
Abandoned as a pup, Umby found a computer one day and started clacking his paws on the keyboard. He eventually used his insane comedy skills to grow a cult following as a 'funny internet dog thing'.