custom gentoo nas, part 6: inotifywait, write detection

I want to minimize writes to the boot/root filesystem for a bunch of bullshit reasons:

  • It’s a flash drive, so repeated writing will eventually wear it out,
  • “security”,
  • keeping myself honest from a reproducibility perspective, in terms of modifying the in-place system vs. the virtual machine image.

Some idealized version of the system would employ a really-read-only boot drive with some sort of overlay/union filesystem to capture changes to reasonable non-volatile storage (such as the RAID array).

In any case, the following has been useful to log writes, in order to determine what needs to be moved to spinning rust:

$# inotifywait --daemon -e modify,attrib,move,create,delete -r / --exclude='^/(dev|run|proc|sys|data|tmp|home)' --outfile=/data/system/var/log/file-modifications

Since this creates watches on the individual inodes, you will probably need to do something like this, first:

$# echo $(( 8 * (2 ** 20) )) > /proc/sys/fs/inotify/max_user_watches
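
To make that stick across reboots, the usual sysctl drop-in works (the filename here is just a suggestion):

    # /etc/sysctl.d/90-inotify.conf
    fs.inotify.max_user_watches = 8388608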

While I’d already proactively moved /var/log/ and /var/lib/ subdirectories for most services, the following had been missed, and turned out to be very write-heavy:

/var/lib/samba/private/msg.sock
/var/cache/samba
/var/cache/man
/var/lib/mlocate
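
The fix for each is the same move/symlink dance, sketched here with samba’s cache (stop the relevant service first, naturally):

$# systemctl stop smbd
$# mkdir -p /data/system/var/cache
$# mv /var/cache/samba /data/system/var/cache/samba
$# ln -s /data/system/var/cache/samba /var/cache/samba
$# systemctl start smbd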

After moving/symlinking these to /data/system/ on the RAID, there’s just a handful of regular writers left, which will need further investigation to understand how best to mitigate:

/var/lib/private/systemd/timesync/clock
/var/lib/systemd/timers/stamp-cron-hourly.timer
/var/log/{lastlog,wtmp}
/root/.bash_history
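
Finding the noisy paths is just a matter of counting in the outfile; something like this (the first field of each line is the watched directory):

$# awk '{print $1}' /data/system/var/log/file-modifications | sort | uniq -c | sort -rn | head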

custom gentoo nas, part 4: single-user, hostname changing, nas.target

When I boot the virtual machine with the NAS system software, I usually interrupt the boot to start in single-user mode. This gives me an opportunity to:

$# echo terra > /proc/sys/kernel/hostname

so that the virtual machine comes online with a distinct network identity.

For a while, after then returning to the normal boot, once the machine was online I would ssh in as quickly as possible to terminate a handful of services (tor, plex, sonarr, &c.), to prevent them from generally … interfering with the world.

After a while, I got tired of this, so I looked into creating a specific runlevel to separate multi-user from “running as a NAS”.

In the traditional runlevel system, runlevels 0 (halt), 1 (single-user), 3 (multi-user), 5 (graphical) and 6 (reboot) are spoken for, while 2 and 4 are “free” to be used for other purposes.

systemd takes a different (imho better) tack, with a bunch of semantically-defined “targets”, to which other units can be associated.

In my case, I created a nas.target, which you can think of as runlevel 4 if it makes you feel better.

# /etc/systemd/system/nas.target
# 2018-04-06, jsled: create a new target!
[Unit]
Description=Network Attached Storage
# Documentation=nfs://earth/data/system/ChangeLog
Requires=multi-user.target
Conflicts=rescue.service rescue.target
After=multi-user.target rescue.service rescue.target
AllowIsolate=yes

$# rm /etc/systemd/system/default.target
$# ln -s /etc/systemd/system/nas.target /etc/systemd/system/default.target

At this point, the system will boot to nas.target by default.
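
For what it’s worth, systemctl can do the same symlink shuffle itself:

$# systemctl set-default nas.target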

After this, one needs to modify /lib/systemd/system/{whatever}.service to replace its [Install] section with:

[Install]
WantedBy=nas.target

And then re-systemctl enable whatever.service to get it linked appropriately.
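
(systemctl reenable, which does the disable/enable pair in one step, is handy here:)

$# systemctl reenable whatever.service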

(Unfortunately, it seems that adding /etc/systemd/system/whatever.service.d/override.conf will add the target to the WantedBy set, not replace the value. :/)

At this point, booting the virtual machine becomes:

# boot to single user: add "single" or "systemd.unit=rescue.target" to the boot line.
# wait for single-user...
$# echo terra > /proc/sys/kernel/hostname
$# systemctl isolate multi-user

Then the boot will continue on to multi-user, without also loading all the services associated with nas.target.

And, yes, I should figure out how to configure grub / grub-mkconfig to add a boot-menu entry for single-user rather than editing the damn boot every time.
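
Something along these lines in /etc/grub.d/40_custom ought to do it (a sketch, untested; the kernel line and UUIDs are reused from part 1), followed by re-running grub-mkconfig -o /boot/grub/grub.cfg:

    # /etc/grub.d/40_custom (appended below the stock "exec tail" header)
    menuentry 'Gentoo (single-user)' {
        search --no-floppy --fs-uuid --set=root 563893f3-c262-4032-84ac-be12fddff66b
        linux   /vmlinuz-4.14.32-gentoo root=UUID=489dd7ad-a5e5-4727-8a9c-b11cca382038 ro init=/usr/lib/systemd/systemd systemd.unit=rescue.target
        initrd  /initramfs-4.14.32-gentoo.img
    }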

(And, it seems like there should be a kernel param for hostname, but I can’t seem to find it.)

In my case, the following services are associated with nas.target:

root@earth [~]# ls -l /etc/systemd/system/nas.target.wants/
total 0
lrwxrwxrwx 1 root root 40 Apr  6 18:25 avahi-daemon.service -> /lib/systemd/system/avahi-daemon.service
lrwxrwxrwx 1 root root 36 Apr  6 18:25 collectd.service -> /lib/systemd/system/collectd.service
lrwxrwxrwx 1 root root 35 Apr  6 18:25 grafana.service -> /lib/systemd/system/grafana.service
lrwxrwxrwx 1 root root 38 May  6 09:02 lm_sensors.service -> /lib/systemd/system/lm_sensors.service
lrwxrwxrwx 1 root root 38 Apr  6 18:25 nfs-server.service -> /lib/systemd/system/nfs-server.service
lrwxrwxrwx 1 root root 41 Apr 12 18:01 node_exporter.service -> /etc/systemd/system/node_exporter.service
lrwxrwxrwx 1 root root 38 Apr 15 09:16 nut-driver.service -> /lib/systemd/system/nut-driver.service
lrwxrwxrwx 1 root root 38 Apr 15 09:14 nut-server.service -> /lib/systemd/system/nut-server.service
lrwxrwxrwx 1 root root 45 Apr  6 18:25 plex-media-server.service -> /lib/systemd/system/plex-media-server.service
lrwxrwxrwx 1 root root 38 Apr  6 18:25 prometheus.service -> /lib/systemd/system/prometheus.service
lrwxrwxrwx 1 root root 36 Apr  6 18:26 sabnzbd@default.service -> /lib/systemd/system/sabnzbd@.service
lrwxrwxrwx 1 root root 34 May  6 09:02 smartd.service -> /lib/systemd/system/smartd.service
lrwxrwxrwx 1 root root 32 Apr  6 18:25 smbd.service -> /lib/systemd/system/smbd.service
lrwxrwxrwx 1 root root 34 Apr  6 18:27 sonarr.service -> /lib/systemd/system/sonarr.service
lrwxrwxrwx 1 root root 31 Apr  6 18:25 tor.service -> /lib/systemd/system/tor.service
lrwxrwxrwx 1 root root 47 Apr  6 18:25 transmission-daemon.service -> /lib/systemd/system/transmission-daemon.service

Basically, everything related to physical hardware (lm_sensors, nut), or proactive networking (nfs, samba, avahi, plex, sonarr/sabnzbd, tor).

custom gentoo nas, part 3: syslog-ng, systemd journal in memory

systemd, as is its nature, subsumes into itself a number of facilities that historically were separate processes and utilities.

In particular, system logging is now part of systemd, recorded in binary files, and accessible via the journalctl utility.

And while systemd has some compelling aspects, logging is one place where I wanted to keep using syslog-ng and logrotate, for a familiar experience and easily tailable, grepable files.

Both of these packages have support for integration with systemd and journald.

As well, with this OS image, I wanted to minimize writes to the boot/root filesystem, to prevent excess wear on the flash storage.

A key component of this is reconfiguring journald (/etc/systemd/journald.conf) with Storage=volatile, which ensures the journal is only retained in memory. While you’re at it, set RuntimeMaxUse to something reasonable; it really only needs to be large enough to spool logs for the window between boot and syslog-ng starting up to drain the journal into persistent storage.
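
Concretely, something like this (the 16M figure is just a guess at “reasonable”):

    # /etc/systemd/journald.conf
    [Journal]
    Storage=volatile
    RuntimeMaxUse=16M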

Then, simply configure syslog-ng with:

    # /etc/syslog-ng/syslog-ng.conf
    source src { systemd-journal(); [...] };

…to consume logs from journald once it starts. In my case, I have it configured to spool logs to:

    # /etc/syslog-ng/syslog-ng.conf
    destination messages { file("/data/system/var/log/messages"); };
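
…plus a log statement wiring source to destination (Gentoo’s stock config already has this shape):

    # /etc/syslog-ng/syslog-ng.conf
    log { source(src); destination(messages); };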

custom gentoo nas build, part 2: build and burn-in

The build was uneventful. It’s screwing and slotting things into places they’re supposed to be.

(As an aside, this struck me as quite true, as well: https://twitter.com/whitequark/status/986217575569735680 )

Burn-in proceeded in the usual phases: cpu and memory, then disks, with a focus on power consumption.

Initially I let memtest86 run for a day or so, just in case anything was horribly broken.

Then, booting into the excellent system-rescue-cd, I ran a combination of stress, primes, pi and bonnie for a while. I was also curious to see the values from the kill-a-watt, since a goal of this build was lower power consumption than the previous server. It broke out like this:

  • idle-idle (no drives): 20W

  • memtest86 (no drives): 30W

  • sysrescuecd (5 drives, no load): 46W

  • stress --cpu 4 : 75W

  • stress --cpu 8, pi --cpu 8 @ load of 16: 75W

  • stress + pi + bonnie: 85W

    • stress --cpu 8 --io 4 --vm 4

    • sysstress-cli 8 threads, 32M digits

    • bonnie++ -s 32g -n 512:1k:4g:100

I’ve never really monitored disk activity or verified that it was anywhere near its “maximums”; in the last few years, for no particular reason, I’d become interested in making sure of that early on in a burn-in. Plus, I figured it would function as a good way to throw a lot of activity/load at the disks, to see if they would fail early, before I started relying on them.

Bonnie++ was useful, but the real tool to use here is fio:

/dev/sda
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
randread: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
seqread: (g=2): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
seqwrite: (g=3): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
randmix: (g=4): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
seqmix: (g=5): rw=rw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.9
Starting 6 processes
randmix: Laying out IO file(s) (1 file(s) / 1024MB)
seqmix: Laying out IO file(s) (1 file(s) / 1024MB)

randwrite: (groupid=0, jobs=1): err= 0: pid=28702: Sat Mar 17 04:34:56 2018
  write: io=1024.0MB, bw=1666.6KB/s, iops=416 , runt=629182msec
    slat (usec): min=2 , max=223550 , avg=87.24, stdev=2098.82
    clat (usec): min=970 , max=858733 , avg=153511.58, stdev=77298.22
     lat (msec): min=1 , max=858 , avg=153.60, stdev=77.39
    clat percentiles (msec):
     |  1.00th=[   21],  5.00th=[   68], 10.00th=[   77], 20.00th=[   95],
     | 30.00th=[  112], 40.00th=[  126], 50.00th=[  141], 60.00th=[  155],
     | 70.00th=[  174], 80.00th=[  196], 90.00th=[  243], 95.00th=[  306],
     | 99.00th=[  437], 99.50th=[  486], 99.90th=[  570], 99.95th=[  627],
     | 99.99th=[  685]
    bw (KB/s)  : min=  265, max= 3419, per=100.00%, avg=1681.68, stdev=366.95
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.09%, 20=0.86%, 50=1.33%
    lat (msec) : 100=20.94%, 250=67.68%, 500=8.70%, 750=0.39%, 1000=0.01%
  cpu          : usr=0.75%, sys=2.29%, ctx=257413, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=0/d=262144, short=r=0/w=0/d=0
randread: (groupid=1, jobs=1): err= 0: pid=29173: Sat Mar 17 04:34:56 2018
  read : io=1024.0MB, bw=2121.1KB/s, iops=530 , runt=494173msec
    slat (usec): min=2 , max=431 , avg=34.47, stdev=12.83
    clat (usec): min=641 , max=1257.7K, avg=120602.90, stdev=81188.62
     lat (usec): min=645 , max=1257.8K, avg=120638.47, stdev=81188.53
    clat percentiles (msec):
     |  1.00th=[   15],  5.00th=[   28], 10.00th=[   39], 20.00th=[   58],
     | 30.00th=[   75], 40.00th=[   90], 50.00th=[  104], 60.00th=[  120],
     | 70.00th=[  139], 80.00th=[  169], 90.00th=[  223], 95.00th=[  277],
     | 99.00th=[  408], 99.50th=[  469], 99.90th=[  611], 99.95th=[  668],
     | 99.99th=[  840]
    bw (KB/s)  : min=  815, max= 4063, per=100.00%, avg=2121.73, stdev=226.86
    lat (usec) : 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.27%, 20=2.23%, 50=13.43%
    lat (msec) : 100=31.15%, 250=46.04%, 500=6.53%, 750=0.31%, 1000=0.02%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.91%, sys=2.58%, ctx=262450, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=262144/w=0/d=0, short=r=0/w=0/d=0
seqread: (groupid=2, jobs=1): err= 0: pid=29532: Sat Mar 17 04:34:56 2018
  read : io=1024.0MB, bw=176647KB/s, iops=44161 , runt=  5936msec
    slat (usec): min=1 , max=797 , avg= 5.57, stdev= 8.90
    clat (usec): min=172 , max=426325 , avg=1442.31, stdev=3509.15
     lat (usec): min=192 , max=426329 , avg=1448.08, stdev=3509.02
    clat percentiles (usec):
     |  1.00th=[  478],  5.00th=[  580], 10.00th=[  708], 20.00th=[  860],
     | 30.00th=[  940], 40.00th=[  988], 50.00th=[ 1032], 60.00th=[ 1112],
     | 70.00th=[ 1208], 80.00th=[ 1352], 90.00th=[ 2224], 95.00th=[ 2640],
     | 99.00th=[10816], 99.50th=[11072], 99.90th=[18048], 99.95th=[34560],
     | 99.99th=[199680]
    bw (KB/s)  : min=158784, max=184200, per=99.92%, avg=176508.64, stdev=9110.59
    lat (usec) : 250=0.01%, 500=1.42%, 750=10.73%, 1000=31.44%
    lat (msec) : 2=45.13%, 4=8.72%, 10=0.67%, 20=1.79%, 50=0.05%
    lat (msec) : 100=0.02%, 250=0.01%, 500=0.01%
  cpu          : usr=7.95%, sys=22.91%, ctx=33402, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=262144/w=0/d=0, short=r=0/w=0/d=0
seqwrite: (groupid=3, jobs=1): err= 0: pid=29538: Sat Mar 17 04:34:56 2018
  write: io=1024.0MB, bw=174704KB/s, iops=43676 , runt=  6002msec
    slat (usec): min=1 , max=8910 , avg= 7.66, stdev=18.59
    clat (usec): min=215 , max=66990 , avg=1456.09, stdev=2359.73
     lat (usec): min=219 , max=75611 , avg=1463.97, stdev=2360.57
    clat percentiles (usec):
     |  1.00th=[  490],  5.00th=[  588], 10.00th=[  596], 20.00th=[  620],
     | 30.00th=[  788], 40.00th=[  948], 50.00th=[ 1064], 60.00th=[ 1144],
     | 70.00th=[ 1288], 80.00th=[ 1576], 90.00th=[ 2288], 95.00th=[ 2768],
     | 99.00th=[11200], 99.50th=[11712], 99.90th=[34560], 99.95th=[37120],
     | 99.99th=[67072]
    bw (KB/s)  : min=154003, max=189544, per=100.00%, avg=176280.18, stdev=9821.98
    lat (usec) : 250=0.04%, 500=1.13%, 750=26.85%, 1000=14.71%
    lat (msec) : 2=44.63%, 4=10.35%, 10=0.33%, 20=1.69%, 50=0.24%
    lat (msec) : 100=0.02%
  cpu          : usr=10.06%, sys=29.26%, ctx=28879, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=0/d=262144, short=r=0/w=0/d=0
randmix: (groupid=4, jobs=1): err= 0: pid=29546: Sat Mar 17 04:34:56 2018
  read : io=838560KB, bw=1237.4KB/s, iops=309 , runt=677719msec
    slat (usec): min=2 , max=298226 , avg=72.39, stdev=1985.39
    clat (usec): min=758 , max=949674 , avg=129640.81, stdev=100279.49
     lat (usec): min=793 , max=949709 , avg=129714.28, stdev=100350.24
    clat percentiles (msec):
     |  1.00th=[   15],  5.00th=[   26], 10.00th=[   36], 20.00th=[   54],
     | 30.00th=[   71], 40.00th=[   86], 50.00th=[  102], 60.00th=[  122],
     | 70.00th=[  147], 80.00th=[  186], 90.00th=[  265], 95.00th=[  347],
     | 99.00th=[  486], 99.50th=[  529], 99.90th=[  644], 99.95th=[  685],
     | 99.99th=[  791]
    bw (KB/s)  : min=  416, max= 2448, per=100.00%, avg=1242.22, stdev=496.71
  write: io=210016KB, bw=317323 B/s, iops=77 , runt=677719msec
    slat (usec): min=4 , max=159793 , avg=103.09, stdev=2361.00
    clat (usec): min=612 , max=1371.7K, avg=308029.22, stdev=253485.64
     lat (usec): min=660 , max=1371.7K, avg=308133.44, stdev=253483.12
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[   18], 10.00th=[   32], 20.00th=[   60],
     | 30.00th=[   87], 40.00th=[  153], 50.00th=[  265], 60.00th=[  363],
     | 70.00th=[  453], 80.00th=[  553], 90.00th=[  676], 95.00th=[  766],
     | 99.00th=[  930], 99.50th=[  988], 99.90th=[ 1106], 99.95th=[ 1139],
     | 99.99th=[ 1221]
    bw (KB/s)  : min=   23, max=  752, per=100.00%, avg=310.13, stdev=155.44
    lat (usec) : 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.10%, 10=0.61%, 20=2.72%, 50=14.29%
    lat (msec) : 100=27.90%, 250=35.18%, 500=13.47%, 750=4.58%, 1000=1.05%
    lat (msec) : 2000=0.08%
  cpu          : usr=0.69%, sys=2.03%, ctx=258281, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=209640/w=0/d=52504, short=r=0/w=0/d=0
seqmix: (groupid=5, jobs=1): err= 0: pid=30064: Sat Mar 17 04:34:56 2018
  read : io=839528KB, bw=11785KB/s, iops=2946 , runt= 71234msec
    slat (usec): min=1 , max=69913 , avg=20.52, stdev=312.63
    clat (usec): min=118 , max=230218 , avg=20261.42, stdev=32464.50
     lat (usec): min=163 , max=230232 , avg=20282.56, stdev=32465.65
    clat percentiles (usec):
     |  1.00th=[  458],  5.00th=[  644], 10.00th=[  764], 20.00th=[  932],
     | 30.00th=[ 1096], 40.00th=[ 1432], 50.00th=[ 2192], 60.00th=[ 5664],
     | 70.00th=[10432], 80.00th=[49408], 90.00th=[76288], 95.00th=[92672],
     | 99.00th=[119296], 99.50th=[129536], 99.90th=[154624], 99.95th=[164864],
     | 99.99th=[220160]
    bw (KB/s)  : min= 8806, max=16135, per=100.00%, avg=11801.80, stdev=1352.19
  write: io=209048KB, bw=2934.7KB/s, iops=733 , runt= 71234msec
    slat (usec): min=2 , max=69808 , avg=29.70, stdev=443.22
    clat (usec): min=228 , max=200468 , avg=5715.79, stdev=8921.41
     lat (usec): min=268 , max=200490 , avg=5746.14, stdev=8940.66
    clat percentiles (usec):
     |  1.00th=[  580],  5.00th=[  756], 10.00th=[  860], 20.00th=[ 1012],
     | 30.00th=[ 1224], 40.00th=[ 1528], 50.00th=[ 2128], 60.00th=[ 2992],
     | 70.00th=[ 5856], 80.00th=[ 9152], 90.00th=[12992], 95.00th=[21888],
     | 99.00th=[43264], 99.50th=[53504], 99.90th=[83456], 99.95th=[91648],
     | 99.99th=[160768]
    bw (KB/s)  : min= 2096, max= 4079, per=100.00%, avg=2938.51, stdev=348.91
    lat (usec) : 250=0.03%, 500=1.18%, 750=7.28%, 1000=15.20%
    lat (msec) : 2=24.52%, 4=10.97%, 10=12.79%, 20=6.50%, 50=5.55%
    lat (msec) : 100=13.27%, 250=2.72%
  cpu          : usr=3.80%, sys=9.68%, ctx=114107, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=209882/w=0/d=52262, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=1666KB/s, minb=1666KB/s, maxb=1666KB/s, mint=629182msec, maxt=629182msec

Run status group 1 (all jobs):
   READ: io=1024.0MB, aggrb=2121KB/s, minb=2121KB/s, maxb=2121KB/s, mint=494173msec, maxt=494173msec

Run status group 2 (all jobs):
   READ: io=1024.0MB, aggrb=176646KB/s, minb=176646KB/s, maxb=176646KB/s, mint=5936msec, maxt=5936msec

Run status group 3 (all jobs):
  WRITE: io=1024.0MB, aggrb=174704KB/s, minb=174704KB/s, maxb=174704KB/s, mint=6002msec, maxt=6002msec

Run status group 4 (all jobs):
   READ: io=838560KB, aggrb=1237KB/s, minb=1237KB/s, maxb=1237KB/s, mint=677719msec, maxt=677719msec
  WRITE: io=210016KB, aggrb=309KB/s, minb=309KB/s, maxb=309KB/s, mint=677719msec, maxt=677719msec

Run status group 5 (all jobs):
   READ: io=839528KB, aggrb=11785KB/s, minb=11785KB/s, maxb=11785KB/s, mint=71234msec, maxt=71234msec
  WRITE: io=209048KB, aggrb=2934KB/s, minb=2934KB/s, maxb=2934KB/s, mint=71234msec, maxt=71234msec

Disk stats (read/write):
  sda: ios=739698/536984, merge=203565/93548, ticks=60811136/56677146, in_queue=117488702, util=100.00%
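
The job file itself wasn’t kept, but reconstructed from the output above it would have looked roughly like this (the directory, direct=1, and the per-job stonewalls are assumptions; rwmixread=80 matches the ~80/20 split seen in the mixed groups):

    # fio job file -- a reconstruction, not the original
    [global]
    ioengine=libaio
    iodepth=64
    bs=4k
    size=1g
    directory=/mnt/sda-scratch   # assumption: a scratch fs on /dev/sda
    direct=1

    [randwrite]
    stonewall
    rw=randwrite

    [randread]
    stonewall
    rw=randread

    [seqread]
    stonewall
    rw=read

    [seqwrite]
    stonewall
    rw=write

    [randmix]
    stonewall
    rw=randrw
    rwmixread=80

    [seqmix]
    stonewall
    rw=rw
    rwmixread=80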

After a couple of days of this, I was a/ impatient and b/ confident in the hardware. :) It was time to cut over to a new OS image.

nas build, part 1: software/OS plans

See part 0 for the previous in the series.


These posts are not going to be focused on the physical build, as that’s been readily covered (and better) by others.

I want to focus on the software side of the build, where I am (and everyone is, basically) doing something unique.

My goal has been the following:

  • Boot off a (ideally static/read-only) image…

  • …imaged to two thumbdrives (current & next)…

    • rollbacks become “just use the previous thumbdrive, not the newly-imaged one”
  • …derived from a virtual machine.

    • virtual machine can be checkpointed, backed up and speculatively modified

As such, I have a simple gentoo system on a 12GB virtual disk, with 5 1GB “drives” playing a placeholder role for the ZFS array in the real machine. I chose 12GB because while it’s easy to get 32GB thumbdrives, I wanted something even below 16GB, so that it could safely be (re)imaged onto such a thumbdrive without worrying about any sizing or boundary issues.

One thing that makes this setup work very well is the use of UUIDs in both grub command lines and /etc/fstab, instead of literal device names. Rather than having the (virtual disk) /dev/sda listed in /etc/fstab, and knowing that on boot a thumbdrive is going to be reported as /dev/sdf, /etc/fstab simply has the UUID (or LABEL) of the drive/partition, and the mounter just figures it out.

    # /dev/sda2 /boot       ext2        defaults    0 0
    UUID=563893f3-c262-4032-84ac-be12fddff66b   /boot       ext2        defaults    0 0
    # /dev/sda3 /       ext4        noatime     0 0
    UUID=489dd7ad-a5e5-4727-8a9c-b11cca382038   /   ext4    noatime 0 0
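
(The UUIDs themselves come from blkid; output along these lines:)

    $# blkid /dev/sda2
    /dev/sda2: UUID="563893f3-c262-4032-84ac-be12fddff66b" TYPE="ext2"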

Similarly, grub entries have root=UUID=[....], and devices are scanned to find the one that matches.

    echo    'Loading Linux 4.14.32-gentoo ...'
    linux   /vmlinuz-4.14.32-gentoo root=UUID=489dd7ad-a5e5-4727-8a9c-b11cca382038 ro init=/usr/lib/systemd/systemd
    echo    'Loading initial ramdisk ...'
    initrd  /initramfs-4.14.32-gentoo.img

The drives (both virtual and real) are in a ZFS raid-z2 configuration, as /dev/sda through /dev/sde. A primary volume exists as zfs_data (/data), with a subvolume for /home; this is a cleaner version of the previous btrfs (and, previous to that, mdadm/raid10) configuration, where /data/home was bind-mounted as /home.
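
A sketch of the pool/dataset creation (whole-disk names for brevity; /dev/disk/by-id names are the better practice):

    $# zpool create -m /data zfs_data raidz2 sda sdb sdc sdd sde
    $# zfs create -o mountpoint=/home zfs_data/home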

As packages were installed, anything that wants to touch /var/{lib,log} has been symlinked to /data/system/var/{...} on the RAID array.

While I’d still like to have something approaching a read-only boot device, in practice, so far, I’ve found that having a read/write boot and OS drive is great, because I can tweak configurations and even install packages “in-situ”, and then just make a record of changes I need to apply to the virtual machine the next time I spin it up to make a batch of changes.

I have a file on the NAS storage (/data/system/ChangeLog) which I use not only to record the details of tweaks to the virtual machine, but also to record changes I’ve made to the real machine that need to be reapplied to the virtual machine. As well, I use it to record general “TODOs” and notes/reminders about the process (like the sequence/details for installing a new kernel + zfs modules + dracut + grub, or a checklist of things to look for after package installs/upgrades: entries in /var/{lib,log}, systemd unit installs, &c.).

When I want to create an image of the boot device to copy onto a thumb drive, I do the following:

  1. VirtualBox clone the machine, selecting the “current state” option;

  2. Once cloned, remove the VirtualBox machine clone, electing to “retain files” (this prevents the next step from complaining about duplicate identifiers);

  3. Identify the 12GB disk .vdi file, and run e.g. VBoxManage clonemedium disk earth-2018-04-15-disk2.vdi earth-2018-04-15.img --format=raw;

  4. pv earth-2018-04-15.img > /dev/sdg (or whatever the usb drive is; one could probably do /dev/disk/by-label/{thumb-drive-device-label} or something, too.)
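
(A paranoia check that the copy took, using GNU cmp’s -n to stop comparing at the image length:)

    $# cmp -n $(stat -c %s earth-2018-04-15.img) earth-2018-04-15.img /dev/sdg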

nas build, part 0: introduction

Sometime in – h*ck, I don’t know, 2006? – I last built my personal computer.

At the time, it was a combination Serious Desktop Workhorse and File Server.

In time, I’ve replaced the former with a ${day_job}-provided machine, have relegated my personal computing to a dedicated VM isolated from those work concerns (plus devices), and the machine I had built has become a very unbalanced NAS server. It had far too much compute power (and thus baseline energy consumption) for the menial task it was left with. As well, the size of the RAID array was great at the time, but in an era of 4K video and TimeMachine backups, it’s limited.

So, I decided to replace it with a proper NAS device.

My goals were the following:

  • order-of-magnitude storage increase: the current box has 1.8TB; shooting for 18-24TB

  • a bit more redundancy: the current box is RAID10, and initially had a regular issue with /dev/hdc that thankfully resolved after a few replacements. Shipping from newegg to VT is ~2d if a drive does fail, but I’d rather have two-drive-failure capability, especially at this drive size / rebuild time.

  • serious power reduction: the previous box (old i7) drew 140W idle, and I was shooting for ~40W;

  • simplicity in OS management: the machine had been running linux-4.4.1 forever, and grub-0.99, because it’s too fragile to change and I’m a scaredy-pants

As such, the plan became:

  • 5-6 6±2 TB drives in a RAID5 or 6 or Z2 configuration

  • explicitly lower-power CPU, limited memory

  • (eventually-)read-only thumb-drive boot, OS managed via virtual machine


I want to take a moment as early as possible to recognize Brian Moses’ DIY NAS builds, which have been a significant guide for this build; by that I mean I stole his plans on the hardware side, with some minor tweaks.

I wound up with the following hardware/costs:

component      description                                     cost
case           SilverStone DS380B                               150
psu            Corsair SF450                                     85
motherboard    ASRock Z270M-ITX                                 130
cpu            Intel Core i5-7600T                              255
memory         Ballistix DDR4 2666 (2×8GB)                      185
drives (a)     ×3 WD Red 8TB NAS – WD80EFZX                     725
drives (b)     ×3 Seagate IronWolf 8TB NAS – ST8000VN0022       725
total                                                          2255

The SilverStone because of the hot-swappable drive bays.

The ASRock because it has 6 native SATA 6Gb/s ports.

The Core i5-7600T because of the balance of high benchmarking scores and modern features with a TDP of only 35W.


I decided to stick with Gentoo as the OS, because I love it and more importantly I’m comfortable with it.

In advance of building the new server (“earth”, to complement fire (the firewall), air (the wifi), and water (my personal VM)), I decided to at least upgrade the software side of my current server (phoenix) with a thumb-drive OS build.

So, this OS build is not only going to be the basis for the next server, but it is going to take over the current server’s OS.

I’ve leveraged UUID- and LABEL-based configuration in grub and /etc/fstab in order to have the OS image work in both the virtualized and real environments.

In particular:

In reality I have 4 1TB drives in a btrfs RAID10 configuration.

In the virtualized environment, I have 4 1GB “drives” in the same configuration.

Both are mounted as “/data”, but in /etc/fstab, it’s:

    LABEL="DATA" /data btrfs defaults,noatime,compress=lzo 0 0
    /data/home /home none bind 0 0

So no matter which is booting, the same thing is mounted.

For the boot disks, it’s all:

    # /dev/sda2 /boot ext2 defaults 0 0
    UUID=563893f3-c262-4032-84ac-be12fddff66b /boot ext2 defaults 0 0
    # /dev/sda3 / ext4 noatime 0 0
    UUID=489dd7ad-a5e5-4727-8a9c-b11cca382038 / ext4 noatime 0 0

So that no matter where the image is booted (virtual, thumbdrive, whatever) the mounts work fine.


See part 1 for the next in the series.