
Rebuilding my NAS server

My NAS server went wrong

I was gifted a little HP Microserver last year as a treat for helping a friend out. He said he got it cheap, but couldn't get it going, and that it had four 1TB spinning-rust disks in it and 8GB of RAM. It was a nice thought, and he knows I like to tinker with cyberjunk. I'd been wanting a backup solution for some time too, so I was well pleased.

It was originally running Windoze, and I'm not a fan, so I stuck the TrueNAS OS on it and set up NextCloud under that. I got my laptop backing up to it using the NextCloud sync agent, and kind of let it get on with it. The NC agent wouldn't run on my wife's PC because her OS was too old, but after I updated it to the latest Debian recently, I decided I should really get her set up to back up to the NAS too.

A small black cuboid-shaped HP Microserver with a blue light on the front, sitting on the floor under my desk. There is a table leg in the foreground, and an old monitor and some messy wires in the background.

TL;DR

As this is getting to be a loooong tale, I thought an upfront TL;DR was in order: after lots of trouble-shooting and testing, I ended up rolling my own backup solution instead of persevering with TrueNAS + NextCloud, kinda because "roll your own" is my other middle name, but mainly because the alternative was stable where the original was not.

TL;RA (Too Long; Read Anyway version)

So, if you're up for a long tale with more detail than necessary, here we go...

The agent started up OK, and I kicked off an initial sync. This was when things started to go wrong: it kept getting stuck, then restarting a few minutes later. I started a manual sync on my laptop as a test, and it did the same thing. I tried opening up the TrueNAS console and noticed that the uptime was about 1 minute; then I noticed a bunch of alerts on the little bell icon - all of the messages stated there had been an "unscheduled reboot", not a good sign. While I was looking, the console interface became unresponsive, and a few minutes later, when I was able to reconnect, the uptime was back to "less than a minute".

I couldn't find anything in the logs to indicate what was wrong, even when looking via the CLI console; just the normal log chatter interspersed with startup messages. No core dumps, plenty of free space on all the volumes, plenty of memory free. The only thing I could find was some pending updates for both NextCloud and TrueNAS, so I hit the button on the NextCloud updates. It ran partway through, then the box rebooted, and NextCloud refused to start, coming up in maintenance mode instead - I think because the reboot had interrupted the update. After some goggling round, I found that you can get back to normal mode by tweaking a setting in NC's config file, accessible only via the TrueNAS console CLI for the "jail" NC runs in. After more futzing round, I got it out of maintenance mode and got the update installed successfully.
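For anyone else stuck in maintenance mode, the escape route boils down to either flipping the flag in NextCloud's config.php or letting the occ tool do it. Something like the below - the jail name and the install path are assumptions based on the usual FreeBSD plugin layout, so adjust to suit:

# From the TrueNAS shell, get a console inside the NextCloud jail
# (jail name assumed to be "nextcloud" here)
iocage console nextcloud

# Option 1: edit config.php and change 'maintenance' => true to false
vi /usr/local/www/nextcloud/config/config.php

# Option 2: use NextCloud's occ tool, run as the web server user
su -m www -c "php /usr/local/www/nextcloud/occ maintenance:mode --off"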

But it still kept rebooting - in fact, it seemed to be doing so even more frequently than before. I started wondering if it was overheating, so I worked out how to display the CPU temperature; although it was running a bit hot - about 70 degrees C - that didn't seem critical enough to cause a reboot.
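In case it's useful to anyone: TrueNAS CORE is FreeBSD underneath, so the temperature shows up as a sysctl once the right thermal driver is loaded - coretemp for Intel, amdtemp for AMD (which driver you need is the only guesswork here):

kldload amdtemp                  # or coretemp on an Intel box, if not already loaded
sysctl dev.cpu.0.temperature     # per-core readings: dev.cpu.1.temperature, etc.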

I popped the case open, checked the heatsinks for dust buildup, finding hardly anything - not even worth vacuuming out. I reseated the RAM and reconnected all the leads, but it still kept rebooting. I noticed that the system SSD volume wiring was a bit of a "bodge" - an M.2 SSD stick on a SAS carrier card, with a SAS-to-SATA adapter shim connecting it to the single SATA port. Suspicious, but not definitively bad.

Open top view of HP Microserver showing the 'system volume' SSD - an M.2 stick attached to a M.2 to SAS carrier, attached to a SAS to SATA adapter, attached to SATA and power leads.

At this point, I decided to go for the TrueNAS updates; I managed to get these installed after a bit more messing round, but it definitely made things worse - the box was now struggling to stay up for more than about five minutes. It was beginning to look to me like there was some kind of hardware problem.

It's a good job I quite like this kind of puzzle, but I was getting rather frustrated by now. I started looking on the ebays for a cheap replacement, but I was shocked to find that these 10-plus-year-old machines still go for up to £100, often with less memory and no disks.

So I had to persist - if I could work out which part was faulty, maybe I could just replace that. Mind you, the mainboard still goes for about £40 on its own, but it would be better than trashing the whole box.

I booted into single-user mode and ran some basic torture tests: reading the disks holding the ZFS volumes end to end with the "dd" utility (no errors or reboots); generating heavy CPU load, again using "dd", to see if I could get it to overheat (it sat at about 100% CPU usage and around 75 degrees C for over an hour without rebooting); and trying to run MemTest86, which refused to boot whichever version I downloaded, even though the same versions worked on other machines. Whatever I did, I couldn't get the box to reboot.
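The torture tests themselves were nothing fancy - roughly along these lines, with the device names being whatever FreeBSD assigned to my disks (ada0 and friends are an assumption; geom disk list shows what's actually there):

# Read a data disk end to end (repeat for each disk; FreeBSD's dd uses
# lowercase size suffixes)
dd if=/dev/ada0 of=/dev/null bs=1m

# Hammer the CPU: reading the random device is CPU-bound, so run one of
# these per core in the background and watch the temperature
dd if=/dev/urandom of=/dev/null bs=64k &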

After taking a break for a few days, I decided I should try replacing the system SSD and running with a fresh OS. I already had a Debian Bookworm installer burnt onto a memory stick, and I found an old laptop spinny disk. This would also cut out the M.2 to SAS to SATA adapter chain, so it was an additional test. Anyway, Debian installed without any issues, but obviously couldn't access the ZFS volume on the main storage array until I'd installed ZFS-on-Linux. That took about half an hour to build and install, absolutely thrashing the spinny disk AND the CPU, but no reboots occurred, and once it was done I was able to mount the ZFS volume successfully. Curious.
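For the record, getting ZFS onto Bookworm was roughly this - the packages live in Debian's contrib archive area, so that has to be enabled in the apt sources first, and I'm assuming the pool was already called "silo", as it is now:

apt install linux-headers-amd64 zfs-dkms zfsutils-linux   # the dkms build is the slow, thrashy part
zpool import -f silo                                      # -f because the pool was last used by another system
zfs list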

I was beginning to think it wasn't a hardware problem. Similar tests to the ones I did in single-user mode on TrueNAS confirmed that I couldn't provoke a reboot on Debian.

Maybe it's just me, but I don't really fully trust ZFS-on-Linux. I know it's pretty proven, but it still feels kind of wrong - maybe it's the "illegal" mixing of GPL'd Linux kernel and the incompatibly licensed CDDL ZFS code; maybe it's because I first used ZFS on FreeBSD, and because TrueNAS is based on FreeBSD...

So I was basically wondering if pure FreeBSD would work OK - maybe I could install NextCloud directly on it? A quick check brought up a little how-to showing what you needed to do to get it going, so it seemed worth a try. I downloaded the latest FreeBSD Stable USB image, burned it on top of the failed MemTest86 stick, and installed it over Debian on the spinny disk. Not surprisingly, this went without issue - no unscheduled reboots - and I was immediately able to mount the big ZFS volume. Working through the install tutorial for NC, I hit a few snags, mainly because it was written for an earlier version of FreeBSD. I didn't really need any more excuses to come to the (probably misguided) realisation that I could roll my own in less time than it would take to work through the snags. This is not without precedent, as I have crafted quite a few backup systems in past jobs, so I'm not totally clueless in this department.

So that's what I did, and obviously, this yarn wouldn't be complete without explaining how! Don't forget that by reading past the TL;DR, you set yourself up for this - you can't back out now! And by this point, I certainly couldn't back out either.

I ended up with a backup script of essentially TWO functional lines of code, and a relatively simple arrangement using SSH key authentication between the computers and the NAS. As the main volume is ZFS, I can use its extensive snapshotting system to provide versioning for the backups, and I can do incremental "changes only" backups using rsync. I scheduled the backups to run on my and wifey's computers at different times using cron, and gave us a ZFS filesystem each so I can snapshot them individually. This means I can essentially pick old versions of the backups out "at will" (though I need to work out the details of that), and each backup only takes up the size of the changed/added files. I might consider adding compression at some point too, if it seems important.

After getting all this set up, I realised I probably ought to automate backups of the RPi that runs this site too, as I'd just been copying everything onto my laptop manually whenever it occurred to me - not a very clever method - so I had to modify the scheme somewhat to account for users with multiple computers. The script now looks like this:

#!/bin/bash

# Relies on having an SSH key installed on the backup server
# Each user must have their own ZFS filesystem, named after them, under /silo/backups
# 

SERVER=10.1.3.9
HOSTNAME=`hostname`

echo "Starting backup to $SERVER at `date +%Y%m%d%H%M`..."

# Create a zfs snapshot on the server, named after the date/time
# (the backticks expand locally, before ssh runs)
ssh $SERVER "zfs snapshot silo/backups/$LOGNAME@`date +%Y%m%d%H%M`"

cd $HOME

# Load list of dirs to sync from backup.spec file
SPEC=`cat backup.spec`

# Rsync everything over
# MAKE SURE THERE ARE NO TRAILING SLASHES ON THE SOURCE DIRECTORIES
rsync -av $SPEC $SERVER:backups/$HOSTNAME --delete

# Copy backup.log over so it's available on the server
scp backup.log $SERVER:backups/$HOSTNAME

NOTE 1: I need that reminder above the rsync command because I keep forgetting, and a trailing slash makes rsync dump the contents of the named directory straight into the root of the destination instead of into a subdirectory.

NOTE 2: Updated the environment variables to reflect what's actually set by cron ($USER -> $LOGNAME), and set $HOSTNAME ourselves because it doesn't seem to be set on all systems - "better safe than sorry". Also now copies the logfile over to the backup server.

NOTE 3: Separated the list of directories to back up into its own file (backup.spec) so that I don't overwrite it when I update the backup.sh script and copy it over to the other hosts.
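For completeness, the one-off setup the script relies on (the SSH key and the per-user filesystem mentioned in its comments) amounts to something like this - the username and NAS address below are just placeholders:

# On the NAS, as root: one ZFS filesystem per user under silo/backups
# (-p creates the parent dataset if it doesn't exist yet)
zfs create -p silo/backups/andy

# From each client: install that client's SSH public key on the NAS so
# cron can log in without a password
ssh-copy-id andy@nas.local

# backup.spec is just a list of directories relative to $HOME, e.g.
# (no trailing slashes!):
#   Documents Pictures projects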

It gets run by a cron job entry like this:

00 23 * * * /home/andy/backup.sh >> /home/andy/backup.log 2>&1

NOTE: Added "2>&1" to make sure errors are also included in the log file.

In backup.log, I get a date stamp, the name of the snapshot, and an activity report from rsync, showing which files were synchronised, so I can easily identify which snapshot will contain the data I need. That's a manual process for now, and will inevitably need some tweaking - for instance, I might change each of the $HOSTNAME directories to be its own filesystem for tidiness of snapshots, but I'm not entirely sure it's necessary. Gonna run with this for a while and see where it goes. It's certainly a lot more stable than it was before.
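Picking an old version out isn't automated yet, but because every snapshot is browsable read-only under the filesystem's hidden .zfs directory on the NAS, it shouldn't be much more than this (the snapshot name and file path here are made up):

# Find the snapshot you want, then copy the old file back out
zfs list -r -t snapshot silo/backups/andy
cp /silo/backups/andy/.zfs/snapshot/202405302300/laptop/Documents/notes.txt /tmp/   # hypothetical path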

I call that FIXED.