Building a Long Term Data Storage Solution

Documenting my journey in devising a reliable solution for long-term data storage


Introduction

Like most people, I used to keep all of my data and backups on a single hard disk. Every time I ran out of space, I bought a bigger hard disk drive and copied everything over from the old disk to the new one. There was no fault tolerance or reliability involved in this process.

I am about to run out of space on my latest 2 TB disk. Instead of buying a bigger hard disk, I’ve created a long-term data storage solution that’s reliable, expandable, fault tolerant, encrypted and portable. This article documents the system I’ve devised.

Architecture

FileSystem : ZFS

ZFS is both a filesystem and a volume manager, which is already a good enough reason for me to use it. That is just the tip of the iceberg; ZFS has a lot more to offer. I will publish my ZFS notes separately, but for now I’ll just mention some features that sweetened the deal for me.

Copy-on-Write Transactional Object Model

In ZFS, the blocks are arranged in a Merkle tree structure. Block pointers contain checksums of the blocks they point to, and these checksums are verified on every read. When a block is to be modified, ZFS does not modify it in place. Instead, the modified data is written to a newly allocated block and the pointers leading to it are updated. Synchronous writes are additionally logged to the ZIL (ZFS Intent Log), so that they are durable before the next transaction group is committed.
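The idea of checksummed block pointers and copy-on-write can be sketched with a toy block store. This is only an illustration, not how ZFS is implemented: the `BlockStore` class, its method names and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


class BlockStore:
    """Toy block store (hypothetical): blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}      # block address -> bytes
        self.next_addr = 0    # next free address

    def write(self, data):
        """Store data in a freshly allocated block and return a pointer.

        The pointer is (address, checksum of the block it points to).
        """
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr, hashlib.sha256(data).hexdigest()

    def read(self, pointer):
        """Read a block and verify it against the checksum held in the pointer."""
        addr, checksum = pointer
        data = self.blocks[addr]
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError(f"checksum mismatch at block {addr}")
        return data


store = BlockStore()
old_ptr = store.write(b"version 1")
# "Modifying" the block: the new data goes to a new block, and the caller
# switches to the new pointer. The old block is left untouched.
new_ptr = store.write(b"version 2")
```

Because the old block is left untouched, anything still holding `old_ptr` keeps reading the old version, while silent corruption of either block is caught by the checksum check on read.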

Data Integrity Preservation & Fault Tolerance

ZFS has a few different ways to protect against data degradation. I will discuss some of the ways in which ZFS implements it when I talk about the storage system design.

Pooled Storage

ZFS creates storage pools (zpools) that are made up of virtual devices (vdevs). Each virtual device contains multiple storage providers (HDDs) in one of the possible redundancy configurations listed here.

  • Mirror : Blocks are written to all storage providers in the vdev (similar to RAID 1)
  • RAIDZ1 : dynamic striping with single parity (similar to RAID 5)
  • RAIDZ2 : dynamic striping with double parity (similar to RAID 6)
  • RAIDZ3 : dynamic striping with triple parity (no RAID equivalent)

You can create datasets (the equivalent of filesystems) from the storage pool. To increase the space available in a pool, you just add more vdevs to it, ideally identical to the existing ones. There are a few caveats in how vdevs and zpools can be configured with storage providers, but I will talk about them in my ZFS notes.
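As a rough rule of thumb, the usable capacity of each layout can be estimated as below. This is a simplified sketch that assumes equally sized disks and ignores metadata, padding and reserved space; the function name is made up for the example.

```python
def usable_capacity_tb(layout, disks, disk_tb):
    """Approximate usable capacity of a single vdev, in TB.

    Assumes all disks have the same size and ignores filesystem overhead.
    """
    parity_disks = {"raidz1": 1, "raidz2": 2, "raidz3": 3}
    if layout == "mirror":
        # Every disk holds a full copy, so capacity equals one disk.
        return disk_tb
    if layout in parity_disks:
        parity = parity_disks[layout]
        if disks <= parity:
            raise ValueError(f"{layout} needs more than {parity} disks")
        return (disks - parity) * disk_tb
    raise ValueError(f"unknown layout: {layout}")


# A pool's capacity is the sum of its vdevs.
pool_tb = sum([usable_capacity_tb("mirror", 2, 8)])
print(pool_tb)  # 8
```

For example, a mirror of two 8 TB disks yields roughly 8 TB, while six 8 TB disks in RAIDZ2 would yield roughly 32 TB.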

Encryption & Compression

ZFS offers native encryption & compression options. I don’t have to rely on separate tools like LUKS or VeraCrypt.

Hardware

Before I decided to use ZFS as the primary filesystem & volume manager, I was planning on using a hardware RAID setup. Since I ended up choosing ZFS, I picked the hardware that works best with it. It is not recommended to pair hardware RAID with ZFS. If the storage solution is being built on a home server or PC, it is recommended to use an HBA controller. My home server is a Mini PC, so an HBA controller is not a viable option for me. Here is a list of my hardware choices.

Host System

The system that facilitates interaction with the storage system is a Mini PC I use as my home server. The home server runs NixOS. NixOS is an excellent choice to run on servers. It lets me experiment without the fear of breaking my system.

The storage system is connected to this Mini PC. The NixOS configuration.nix file contains the configuration to auto-import the ZFS pool and auto-mount the ZFS datasets at boot.

Configuration

There are a few ways I could’ve set up these HDDs in ZFS. Since reliability is the highest priority for me, I decided to go with the following setup.

Physical Disk Map

  • ZPOOL (ZPOOL0): contains a single VDEV with 8 terabyte capacity
    • VDEV (VDEV0): contains 2x 8TB physical disks in a mirror configuration
      • PHY (DISK0): Seagate Barracuda 8 TB 3.5" HDD
      • PHY (DISK1): Seagate Barracuda 8 TB 3.5" HDD

Here is the reasoning behind the choice.

  1. Two disks in a mirror configuration under a single vdev provide ample redundancy.
  2. The zpool (ZPOOL0) can be expanded in the future by adding more vdevs that are identical to VDEV0.
    • This also improves performance, as ZFS would then distribute writes across the vdevs in ZPOOL0.

ZPOOL0 was created with the following options

ZPOOL0 / VDEV0

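# Comments are kept above the command: a comment after a trailing "\" breaks the line continuation.
#   -O encryption=on           enable encryption
#   -O keyformat=passphrase    encryption key is a passphrase
#   -O keylocation=prompt      encryption key is received via CLI prompt
#   -O compression=on          enable compression (LZ4)
#   -O mountpoint=none         disable automount
#   -O xattr=sa                use system attributes to store extended attributes
#   -O acltype=posixacl        use POSIX ACLs for permissions
#   -o ashift=12               sector size = 2^12 bytes (4096 bytes), optimal for modern disks and SSDs
#   -O atime=off               do not update access times on reads
#   zpool0                     pool name
#   mirror <disk0> <disk1>     mirror vdev; the two devices must be on different physical disks (paths are examples)
zpool create \
    -O encryption=on \
    -O keyformat=passphrase \
    -O keylocation=prompt \
    -O compression=on \
    -O mountpoint=none \
    -O xattr=sa \
    -O acltype=posixacl \
    -o ashift=12 \
    -O atime=off \
    zpool0 \
        mirror \
        /dev/sda1 \
        /dev/sdb1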

[!INFO] No recordsize is set here, so the default of 128 KiB is used.

I then checked the status of the zpool using

zpool status

ZFS DATASETS

Here’s a diagram of the datasets on zpool0

[Diagram: datasets on zpool0]

The tags are self-explanatory. Datasets were created with the following options.

zfs create -o mountpoint=legacy zpool0/media
zfs create -o mountpoint=legacy zpool0/media/videos
  • I disabled compression for the media dataset and set the recordsize to 1M to improve performance (neither property is shown in the commands above).

  • mountpoint=legacy means that ZFS will not automatically mount the dataset for us. We can continue using traditional methods like /etc/fstab and mount to mount the datasets to paths. This is important because I wanted to make use of the NixOS hardware-configuration.nix file to auto-import the pool and auto-mount the datasets like regular partitions.
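To get a feel for what a larger recordsize means for big media files, the number of records per file can be estimated as follows. This is a back-of-the-envelope sketch; the 4 GiB file size is an assumed example, not a measurement from my pool.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def record_count(file_bytes, recordsize_bytes):
    """Number of records needed to hold a file (integer ceiling division)."""
    return -(-file_bytes // recordsize_bytes)


video = 4 * GIB  # hypothetical video file
print(record_count(video, 128 * KIB))  # 32768 records at the default recordsize
print(record_count(video, 1 * MIB))    # 4096 records at recordsize=1M
```

Fewer, larger records mean fewer block pointers and checksums to process for large sequential files, which is the usual argument for a 1M recordsize on media datasets.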

[!NOTE] Since I am encrypting the root dataset when the pool is created, I will not be encrypting the child datasets individually. But it is possible to encrypt datasets individually as follows.

zfs create -o mountpoint=legacy -o encryption=on -o keyformat=passphrase -o keylocation=prompt  zpool0/media
  • encryption=on enables encryption on the dataset and all the child datasets that live under this parent dataset. The default encryption algorithm since OpenZFS 0.8.4 is aes-256-gcm, and it’s good enough for my use case.

  • keyformat=passphrase signifies that a passphrase should be used to unlock the master key with which your data is encrypted.

  • keylocation=prompt lets ZFS know that the method of receiving the key would be through a prompt on the terminal

This configuration requires a user to enter the passphrase manually at each boot. This might be inconvenient for some users but I prefer this over automatic decryption.

The following code in my hardware-configuration.nix imports the pool automatically and auto-mounts the datasets. I enter the passphrase at each boot.

  boot.supportedFilesystems = [ "zfs" ];
  boot.zfs.forceImportRoot = false;
  boot.zfs.extraPools = [ "zpool0" ];

  fileSystems."/mnt/data/media" =
    { device = "zpool0/media";   # <pool>/<dataset>; the pool created above is zpool0
      fsType = "zfs";
    };

The alternative to having NixOS load the datasets automatically is to manually import the zpool, load the keys and mount the datasets. I do not use the manual method, but I am documenting it here just in case I need to load the ZFS storage on a non-NixOS machine in the future.

  1. Check the available zpools for import using
sudo zpool import
  2. Once you’ve identified the pool to import, use the following command to import it
sudo zpool import <poolname>
  3. Now load the encryption key for the pool with the following command; you’ll be prompted to enter the passphrase
sudo zfs load-key <poolname>
  4. Manually mount the datasets using
sudo mount -t zfs <poolname>/media /mnt/data/media
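If I ever need to do this repeatedly, the steps above could be wrapped in a small script. The following is a minimal sketch, assuming it is run as root on a machine with the ZFS tools installed; the function names are made up, and by default it only prints the commands instead of running them.

```python
import subprocess


def manual_mount_commands(pool, dataset, mountpoint):
    """Build the import, load-key and mount commands from the steps above."""
    return [
        ["zpool", "import", pool],
        ["zfs", "load-key", pool],
        ["mount", "-t", "zfs", f"{pool}/{dataset}", mountpoint],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(argv, check=True)


commands = manual_mount_commands("zpool0", "media", "/mnt/data/media")
run_all(commands)  # dry run: prints the three commands without executing them
```

Calling `run_all(commands, dry_run=False)` would actually execute the steps, with `zfs load-key` prompting for the passphrase on the terminal.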

Maintenance

There are a couple more things to take care of with this ZFS setup: snapshot backups and scrubbing.

Scrubbing

Scrubbing is a ZFS feature/tool to check the filesystem for errors and try to heal them. The recommended frequency for scrubbing ZFS running on consumer grade storage devices is weekly. There is no need to take the pool offline for scrubbing. Since scrubbing uses the checksums of the encrypted blocks, no decryption of datasets is required either.

Scrubbing has a lower priority than other operations, so any performance impact on reads and writes should be minimal while a scrub is in progress. However, I still want scrubbing to happen automatically at times when reads and writes to the pool are minimal. With NixOS as the host, I have the advantage of being able to define systemd units and timers within my configuration.nix file. Here’s what the code to set up automatic scrubbing using systemd & nix looks like.

  # NixOS options: description, serviceConfig, timerConfig and wantedBy map to the
  # [Unit], [Service], [Timer] and [Install] sections of the generated units.
  systemd =
    {
      services = {
        scrub-zpool0 = {
          description = "Scrub zpool0";
          documentation = [ "info:zpool-scrub" "man:zpool-scrub(8)" "https://openzfs.org/wiki" ];
          serviceConfig = {
            Type = "simple";
            ExecStart = "${pkgs.zfs}/bin/zpool scrub zpool0";
          };
        };
      };

      timers = {
        scrub-zpool0-timer = {
          description = "Timer for zpool0 scrubbing (every Wednesday at 3 AM)";
          # Start the timer together with timers.target
          wantedBy = [ "timers.target" ];
          timerConfig = {
            OnCalendar = "Wed *-*-* 03:00:00";
            Unit = "scrub-zpool0.service";
          };
        };
      };
    };
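To sanity-check what the OnCalendar expression "Wed *-*-* 03:00:00" means, here is a small sketch that computes the next Wednesday 03:00 after a given moment. It only mimics this one weekly schedule and is not a general parser for systemd calendar expressions; the function name is made up for the example.

```python
from datetime import datetime, timedelta


def next_weekly_run(now, weekday=2, hour=3):
    """Next occurrence of the given weekday (Monday=0, Wednesday=2) at hour:00:00.

    The result is always strictly after `now`.
    """
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # Already past this week's slot, so move to next week.
        candidate += timedelta(days=7)
    return candidate


# 1 January 2024 was a Monday, so the next run is Wednesday 3 January at 03:00.
print(next_weekly_run(datetime(2024, 1, 1, 12, 0)))
```

On the server itself, `systemctl list-timers` shows when a timer will actually fire next.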

Snapshot backups

A snapshot is a read-only, point-in-time view of a dataset. Because ZFS is copy-on-write, a snapshot takes up almost no space when it is created; it only holds on to the old blocks that the live dataset has since changed or deleted. So a snapshot of a 1 TB dataset in which 100 MB of blocks have changed since the snapshot was taken would only account for about 100 MB. It is possible to clone a snapshot to make it read-write, but snapshots themselves are read-only. At the time of writing this article, I do not have a system to save snapshots periodically, but I plan to have something set up soon. I will update this article to reflect the snapshot backup setup once it’s ready.
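The space accounting can be illustrated with a toy model in which a dataset is just a mapping from file offsets to block addresses. This is a simplification for illustration only: the 1 MB block size, the dataset size and the helper name are assumptions, and real ZFS accounting is more involved.

```python
def snapshot_unique_blocks(snapshot, live):
    """Blocks referenced by the snapshot but no longer by the live dataset."""
    return set(snapshot.values()) - set(live.values())


# Live dataset: 1000 blocks of 1 MB each, offset -> block address.
live = {offset: offset for offset in range(1000)}

# Taking a snapshot copies only the pointers, not the data.
snapshot = dict(live)

# Rewrite 100 blocks: copy-on-write puts the new data at new addresses.
next_free = 1000
for offset in range(100):
    live[offset] = next_free
    next_free += 1

unique_mb = len(snapshot_unique_blocks(snapshot, live))
print(unique_mb)  # 100
```

Only the 100 old blocks that the live dataset no longer references are charged to the snapshot; the other 900 blocks are shared between the two.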

Network Component

Now that the storage space has been made available to my home server, I use a few different ways to store and retrieve data from the storage.

SSHFS

SSHFS allows me to mount certain datasets from my server on my client devices via SSH/SFTP. This is going to be my primary mode of interaction with the various datasets. Clients rely on SSHFS for all datasets except the MEDIA & APPDATA datasets.

Media Services

The MEDIA dataset is a special kind of dataset. The files in this dataset are served by applications hosted on the home server via web interfaces. This is to ensure that the media files are accessible to more than just Linux/Unix clients. The applications serving the media files also take care of supporting different filetypes and codecs across devices. Here is a list of applications that serve media.

  • Photoprism : Photos
  • ArchiveBox : Archive Webpages
  • Kavita : E-Books & E-Magazines
  • FreshRSS : RSS Feeds
  • IceCast : Route Realtime Audio
  • Navidrome : Music
  • Jellyfin : Videos / IPTV
  • Baikal : CalDAV

The Future

Expansion

Currently, I need just over 2 TB to store all my data. In the future, when I inevitably run out of space on the 8TB vdev in my zpool, I plan to add another identical vdev, as I have two more empty slots in my HDD enclosure. The additional storage space from the new vdev will be seamlessly integrated into the zpool, making it available to all datasets. New writes will then be spread across both vdevs, with ZFS favouring the vdev that has more free space; data already stored on the first vdev is not rebalanced automatically. However, this is a concern for the future and not something I need to worry about right now.

Updates

RAM Expansion

I was able to upgrade the RAM on the machine from 16GB to 64GB. People say ZFS will use as much memory as it can get. Memory consumption idled at around 14GB when I had 16GB; since the upgrade to 64GB, it averages about 38 GB.

Kernel Parameter Adjustments

Because I am running really slow hard drives, heavy writes can sometimes cause increased memory latency. This even caused my RAM usage to hit 100% and crash the system with errors that look like:

INFO: task txg_sync:4208 blocked for more than 60 seconds.

I played around with a few zfs kernel parameters to make the system more reliable with ZFS. Here are some of the parameters I tweaked.

  1. zfs.zfs_arc_max - This parameter has nothing to do with the write issues. The ARC is specifically meant for caching reads. But I limited the size of the ARC in memory so that I have more free memory available for other things.

  2. zfs.zfs_dirty_data_max - This is the maximum amount of dirty data (data written to an open ZFS transaction group, or txg, but not yet synced to disk) that is allowed to accumulate in memory.

  3. zfs.zfs_dirty_data_sync - When there is at least this amount of dirty data, a transaction group sync is started.

  4. hung_task_timeout_secs - This article explains this parameter.

There are more parameters described on this page that are definitely worth exploring and experimenting with.
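These ZFS parameters take their values in bytes, which makes them easy to get wrong by a factor of 1024. The sketch below formats a size in GiB into the `zfs.<name>=<bytes>` form used above and reads back the value the loaded module is currently using. The sizes (8 GiB and 1 GiB) are placeholder values for illustration, not the ones I settled on, and the helper names are made up.

```python
from pathlib import Path

GIB = 1024 ** 3


def zfs_kernel_param(name, gib):
    """Format a size given in GiB as a 'zfs.<name>=<bytes>' kernel parameter."""
    return f"zfs.{name}={int(gib * GIB)}"


def current_value(name):
    """Read the value the loaded zfs module currently uses, if available."""
    path = Path("/sys/module/zfs/parameters") / name
    return int(path.read_text()) if path.exists() else None


params = [
    zfs_kernel_param("zfs_arc_max", 8),         # placeholder: cap the ARC at 8 GiB
    zfs_kernel_param("zfs_dirty_data_max", 1),  # placeholder: 1 GiB of dirty data
]
print(" ".join(params))
```

On a machine without the zfs module loaded, `current_value` simply returns `None`.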

SSD Addition

Over time, I realised that the docker containers reading from and writing to the APPDATA dataset were massively I/O throttled and were causing some stability issues. Since then, I have removed this dataset and moved APPDATA to a separate data pool backed by much faster SSDs.