
How to replace broken hard drive on a bare metal server? #786

Closed
onnimonni opened this issue Sep 20, 2024 · 17 comments

Labels
question Not a bug or issue, but a question asking for help or information

Comments

onnimonni commented Sep 20, 2024

Hey,

I'm preparing for the case where one or more of my hard drives eventually fails. To simulate this, I put one of my machines into rescue mode, completely wiped the partitions of one drive with wipefs -a /dev/nvme1n1, and rebooted (nvme1n1 contained the /boot ESP partition in my case, and nvme0n1 had the fallback boot).

It booted up nicely, and now I'm wondering: what is the recommended way to recreate the partitions on a new drive and let it join the existing ZFS pool?

I tried just deploying the same disko config again, but it fails because of the missing /boot partition:

$ nix run nixpkgs#nixos-rebuild -- switch --fast --flake .#myHost --target-host root@$MY_SERVER_IP --build-host root@$MY_SERVER_IP

...

updating GRUB 2 menu...
installing the GRUB 2 boot loader into /boot...
Installing for x86_64-efi platform.
/nix/store/dy8a03zyj7yyw6s3zqlas5qi0phydxf2-grub-2.12/sbin/grub-install: error: unknown filesystem.
/nix/store/a6y2v48wfbf8xzw6nhdzifda00g7ly7z-install-grub.pl: installation of GRUB EFI into /boot failed: No such file or directory
Failed to install bootloader
Shared connection to X.Y.Z.W closed.
warning: error(s) occurred while switching to the new configuration
Mic92 (Member) commented Sep 20, 2024

Disko can run incrementally. We don't recommend it for users who don't have good recovery options, since we have not tested all edge cases, but if you are testing, you can check whether it works for your configuration.
In this case you would run the disko CLI with --mode format instead of --mode disko.
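
For example, something like this (the config path and flake attribute are placeholders for your own setup):

$ nix run github:nix-community/disko -- --mode format ./your-disko-config.nix

or, if your disko config lives in a flake:

$ nix run github:nix-community/disko -- --mode format --flake .#yourHost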

onnimonni (Author) commented Sep 20, 2024

Thanks for the suggestion 🙇. And yes, this machine doesn't have anything important on it yet, so losing all of my data is okay.

I tried your suggestion by moving my flake and all .nix files into the server's /root/ path, and got this error:

[root@localhost:~]# ls -lah
total 19
-rw-r--r-- 1  501 lp   5639 Sep 20 09:19 disko-zfs.nix
-rw-r--r-- 1  501 lp    891 Sep 18 20:06 flake.nix
-rw-r--r-- 1  501 lp    910 Sep 19 13:46 myHost.nix
-rw-r--r-- 1  501 lp    198 Sep 20 09:03 programs.nix
-rw-r--r-- 1 root root    6 Sep 20 13:11 test.txt

[root@localhost:~]# nix --experimental-features "nix-command flakes" run github:nix-community/disko -- --mode format ./disko-fzs.nix
aborted: disko config must be an existing file or flake must be set

iFreilicht (Contributor) commented:

@onnimonni There's a typo in your command. You wrote fzs.nix instead of zfs.nix.
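
With the typo fixed, the command should be:

$ nix run github:nix-community/disko -- --mode format ./disko-zfs.nix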

iFreilicht added the question label on Sep 20, 2024
onnimonni (Author) commented:

Ah, that's true, thanks for the help. I guess this doesn't work because I needed to use a configurable list of drives:

$ nix run github:nix-community/disko -- --mode format ./disko-zfs.nix
warning: Nix search path entry '/nix/var/nix/profiles/per-user/root/channels' does not exist, ignoring
error:
       … while evaluating the attribute 'value'

         at /nix/store/p2zlnhfbwx66hmp4l8m3qyyj3yrfr9zh-9qq0zf30wi74pz66rr05zmxq0nv17q1p-source/lib/modules.nix:821:9:

          820|     in warnDeprecation opt //
          821|       { value = addErrorContext "while evaluating the option `${showOption loc}':" value;
             |         ^
          822|         inherit (res.defsFinal') highestPrio;

       … while calling the 'addErrorContext' builtin

         at /nix/store/p2zlnhfbwx66hmp4l8m3qyyj3yrfr9zh-9qq0zf30wi74pz66rr05zmxq0nv17q1p-source/lib/modules.nix:821:17:

          820|     in warnDeprecation opt //
          821|       { value = addErrorContext "while evaluating the option `${showOption loc}':" value;
             |                 ^
          822|         inherit (res.defsFinal') highestPrio;

       (stack trace truncated; use '--show-trace' to show the full trace)

       error: function 'anonymous lambda' called without required argument 'disko'

       at /root/disko-zfs.nix:17:1:

           16| # Only small modifications were needed, TODO: check if this could be srvos module too
           17| { lib, config, disko, ... }:
             | ^
           18| {

But I got it working by using the flake directly instead of just the disko config file. This was probably because I was using config and options in the disko config file.

$ nix run github:nix-community/disko -- --mode format --flake .#myHost

After this the partitions were created properly, but it didn't mount the nvme1n1 ESP boot partition to /boot or rejoin the degraded mirrored zpool:

[root@localhost:~]# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1     259:0    0  3.5T  0 disk
├─nvme0n1p1 259:2    0    1M  0 part
├─nvme0n1p2 259:3    0    1G  0 part /boot-fallback-dev-nvme0n1
└─nvme0n1p3 259:4    0  3.5T  0 part
nvme1n1     259:1    0  3.5T  0 disk
├─nvme1n1p1 259:5    0    1M  0 part
├─nvme1n1p2 259:7    0    1G  0 part
└─nvme1n1p3 259:9    0  3.5T  0 part

[root@localhost:~]# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot  3.48T  2.08G  3.48T        -         -     0%     0%  1.00x  DEGRADED  -

[root@localhost:~]# zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME                       STATE     READ WRITE CKSUM
        zroot                      DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            disk-_dev_nvme0n1-zfs  ONLINE       0     0     0
            12202984158813695731   UNAVAIL      0     0     0  was /dev/nvme1n1

I then tried to replace the old partition with the new one, but it failed:

$ zpool replace zroot 12202984158813695731 nvme1n1p3
invalid vdev specification
use '-f' to override the following errors:
/dev/nvme1n1p3 is part of active pool 'zroot'
$ zpool replace zroot 12202984158813695731 nvme1n1p3 -f
invalid vdev specification
the following errors must be manually repaired:
/dev/nvme1n1p3 is part of active pool 'zroot'
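
(I assume this is because the freshly created partition still carries the old pool's label; something like zpool labelclear -f /dev/nvme1n1p3 would presumably clear it, but I haven't tried that here.)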

onnimonni (Author) commented:

I did get the "new" disk back into the zpool by running:

$ zpool detach zroot 12202984158813695731
$ zpool attach zroot disk-_dev_nvme0n1-zfs nvme1n1p3

I then rebooted the machine, and the /boot partition from the "new" disk was now also mounted properly:

[root@localhost:~]# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme1n1     259:0    0  3.5T  0 disk
├─nvme1n1p1 259:1    0    1M  0 part
├─nvme1n1p2 259:2    0    1G  0 part /boot-fallback-dev-nvme0n1
└─nvme1n1p3 259:4    0  3.5T  0 part
nvme0n1     259:3    0  3.5T  0 disk
├─nvme0n1p1 259:5    0    1M  0 part
├─nvme0n1p2 259:6    0    1G  0 part /boot
└─nvme0n1p3 259:7    0  3.5T  0 part

[root@localhost:~]# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot  3.48T  2.08G  3.48T        -         -     0%     0%  1.00x    ONLINE  -

[root@localhost:~]# zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 2.32G in 00:00:02 with 0 errors on Sat Sep 21 09:12:54 2024
config:

	NAME                       STATE     READ WRITE CKSUM
	zroot                      ONLINE       0     0     0
	  mirror-0                 ONLINE       0     0     0
	    disk-_dev_nvme0n1-zfs  ONLINE       0     0     0
	    nvme0n1p3              ONLINE       0     0     0

errors: No known data errors

I'd be happy to hear feedback about this approach, but I'm glad to see this worked out 👍

I'm willing to summarize this into a guide and open a PR adding a docs/replace-broken-disk.md guide if you feel the steps here were the "right" ones.

iFreilicht (Contributor) commented Sep 21, 2024

Glad to hear it!

Hmm, I would say that ideally, disko should be able to do this automatically, like I mentioned in #107: modelling a degraded pool and re-attaching devices when disko runs.

Feel free to write a guide on this, I'll be happy to review it! Make sure that every step is extremely clear, and show a full configuration that allows readers to follow the exact steps. Ideally, go through all the steps again on your test machine and document them as you go, to make sure the guide actually works.

jan-leila commented:

In case someone else wants to take a stab at writing the documentation: the repo being used for your test configuration (and the relevant step in its history) is https://github.com/onnimonni/hetzner-auction-nixos-example/tree/45aaf7100167f08f417224fd6a1b1dac74795fb9, right @onnimonni?

jan-leila commented:

I took a stab at drafting some docs on this, but I haven't been able to test them because I don't have any unused hardware lying around. Feel free to take as much or as little inspiration from them as you would like.

Mic92 (Member) commented Sep 24, 2024

We usually simulate these steps with QEMU's NVMe emulation: https://qemu-project.gitlab.io/qemu/system/devices/nvme.html
You can just use truncate to create multiple disk images for testing.
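
For example:

$ truncate -s 10G nixos-nvme1.img nixos-nvme2.img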

Mic92 (Member) commented Sep 24, 2024

This is a script I had lying around:

#!/usr/bin/env nix-shell
#!nix-shell -i bash -p bash -p qemu_kvm -p iproute2

set -x -eu -o pipefail

CPUS="${CPUS:-$(nproc)}"
MEMORY="${MEMORY:-4096}"
SSH_PORT="${SSH_PORT:-2222}"
IMAGE_SIZE="${IMAGE_SIZE:-10G}"

extra_flags=()
if [[ -n ${OVMF-} ]]; then
  extra_flags+=("-bios" "$OVMF")
fi

# https://hydra.nixos.org/job/nixos/unstable-small/nixos.iso_minimal.x86_64-linux
iso=/nix/store/xgkfnwhi3c2lcpsvlpcw3dygwgifinbq-nixos-minimal-23.05pre483386.f212785e1ed-x86_64-linux.iso
nix-store -r "$iso"

for arg in "${@}"; do
  case "$arg" in
  prepare)
    truncate -s"$IMAGE_SIZE" nixos-nvme1.img nixos-nvme2.img
    ;;
  start)
    qemu-system-x86_64 -m "${MEMORY}" \
      -boot n \
      -smp "${CPUS}" \
      -enable-kvm \
      -cpu max \
      -netdev "user,id=mynet0,hostfwd=tcp::${SSH_PORT}-:22" \
      -device virtio-net-pci,netdev=mynet0 \
      -drive file=nixos-nvme2.img,if=none,id=nvme1,format=raw \
      -device nvme,serial=deadbeef1,drive=nvme1 \
      -drive file=nixos-nvme1.img,if=none,id=nvme2,format=raw \
      -device nvme,serial=deadbeef2,drive=nvme2 \
      -cdrom "$iso"/iso/*.iso \
      "${extra_flags[@]}"
    # after start, go to the console and run:
    # passwd
    # then you can ssh into the machine:
    # ssh -p 2222 nixos@localhost
    ;;
  destroy)
    # remove the disk images created by 'prepare'
    rm -f nixos-nvme1.img nixos-nvme2.img
    ;;
  *)
    echo "USAGE: $0 (prepare|start|destroy)"
    ;;
  esac
done
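
Assuming you save this as, say, nvme-test-vm.sh and make it executable, usage would be roughly:

$ ./nvme-test-vm.sh prepare
$ ./nvme-test-vm.sh start
# ...test disko / zfs via ssh -p 2222 nixos@localhost...
$ ./nvme-test-vm.sh destroy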

jan-leila commented:

Oh, that would be very useful for testing out the steps I have written. I'll give it a stab sometime later this week (likely on Friday or Saturday when I'm stuck on a plane/layover).

onnimonni (Author) commented:

@jan-leila thanks for writing the disko disk replacement docs. I tried to follow them, but I only have an Apple Silicon based laptop available, while my server is x86-64.

I think I have a working x86-64 Linux builder available locally, but when I run the disko format command from my own machine it fails like this:

$ nix run github:nix-community/disko -- --mode format --flake .#myHost root@my-machine
error: flake 'github:nix-community/disko' does not provide attribute 'apps.aarch64-darwin.default', 'defaultApp.aarch64-darwin', 'packages.aarch64-darwin.default' or 'defaultPackage.aarch64-darwin'

jan-leila commented:

looks like the tool only supports building on:

      supportedSystems = [
        "x86_64-linux"
        "i686-linux"
        "aarch64-linux"
        "riscv64-linux"
      ];

Assuming the remote-builds thing you are trying is a feature that exists (I have never used it myself and don't want to speak authoritatively on it), there should probably be a build for Macs (or maybe this tool needs to be split into two: one for provisioning, and one for calling the provisioner and providing it the config).

maybe open a separate issue about this?

Enzime (Member) commented Jan 13, 2025

I think you’ll want to use nixos-anywhere if you want to run disko from another machine

onnimonni (Author) commented Jan 13, 2025

True. I didn't realise this wasn't available when running remotely, so I just copied the flake onto the remote machine like I did last time:

$ rsync --exclude='/.git' -rvz . remote-machine:~/nixos-setup
$ ssh remote-machine
$ cd ~/nixos-setup
$ nix run github:nix-community/disko -- --mode format --flake .#myHost

I then bumped into the following version mismatch error:

#!/nix/store/5mh7kaj2fyv8mk4sfq1brwxgc02884wi-bash-5.2p37/bin/bash
echo 'Error: Attribute `nixosConfigurations.rusty.config.system.build.format` >&2
echo '       not found in flake `/home/onnimonni/nixos-setup`!' >&2
echo '       This is probably caused by the locked version of disko in the flake' >&2
echo '       being different from the version of disko you executed.' >&2
echo 'EITHER set the `disko` input of your flake to `github:nix-community/disko/latest`,' >&2
echo '       run `nix flake update disko` in the flake directory and then try again,' >&2
echo 'OR run `nix run github:nix-community/disko/v1.9.0 -- --help` and use one of its modes.' >&2
exit 1;

I then updated disko on the remote server:

$ nix flake update disko

After this I was successfully able to run:

$ nix run github:nix-community/disko -- --mode format --flake .#myHost

It returned lots of errors about things already existing, but for the disk I had wiped with wipefs -a /dev/sdj it was able to create the partitions properly.

Then I just needed to run:

$ sudo zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
config:

	NAME                                      STATE     READ WRITE CKSUM
	zroot                                     DEGRADED     0     0     0
	  raidz3-0                                DEGRADED     0     0     0
	    a5f884520cc0c629f9aabf3113b4195e1357  ONLINE       0     0     0
	    c15b8cf80fc7eb15f0852374eb0861f0614f  ONLINE       0     0     0
	    cb5ebf126e0235562c3df2f141049af9498d  ONLINE       0     0     0
	    506206511add20f75a71a6caf9639594f0a2  ONLINE       0     0     0
	    a34ab817beac9e6e6ee7ebac999bfa28cefd  ONLINE       0     0     0
	    9cdcde35541d317a52ede79cb582d7870480  ONLINE       0     0     0
	    7047569657311774709                   OFFLINE      0     0     0  was /dev/disk/by-partlabel/a4713f93250dc8afeef0293a7f55f234bdf2
	    73b8b509c9f5b9224758f21f71aca7f5cd71  ONLINE       0     0     0
	    648210092f8a256645c8b0b1f3d3ba72702c  ONLINE       0     0     0
	    878cd9e8f68a0ba330d57be8bf24557e4002  ONLINE       0     0     0
	    f32cbe44a97e6e29a0b7130bd4a2ae4625ae  ONLINE       0     0     0
	    fd9aca0fad450217f88d0494f5dcbbe2635e  ONLINE       0     0     0
	    b989a16ceadcc4e31fe6fc08204c0a64a8b0  ONLINE       0     0     0
	    e90e1bffb1cc0107ee872d9dd9aa25ea77e4  ONLINE       0     0     0
	    8e1b746840a6239c9114018c8d68d1bfc292  ONLINE       0     0     0
$ sudo zpool replace zroot 7047569657311774709 /dev/disk/by-partlabel/a4713f93250dc8afeef0293a7f55f234bdf2
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-partlabel/a4713f93250dc8afeef0293a7f55f234bdf2 is part of active pool 'zroot'
$ sudo zpool online zroot /dev/disk/by-partlabel/a4713f93250dc8afeef0293a7f55f234bdf2

And it started to resilver the missing partition.

If it had been a genuinely new disk, I would of course have changed the disks in my disko config, and the replace command above would have worked directly without needing the online command at the end.
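
In that case the command would have been roughly the following (with the partition label that disko created for the new disk):

$ sudo zpool replace zroot 7047569657311774709 /dev/disk/by-partlabel/<new-partition-label>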

Thanks a lot for the guide 👍

onnimonni (Author) commented:

I think you’ll want to use nixos-anywhere if you want to run disko from another machine

Can you show me the command to just run the disko format remotely?

When I ran nixos-anywhere, it just kexec'd my perfectly working system into the NixOS installer and booted it.

I would only want the new disk to be formatted and to join the zpool I have running.

Enzime (Member) commented Jan 13, 2025

Actually, I was mistaken; nixos-anywhere probably isn't the best fit for what you want.
