My entire volume is suddenly empty! #865

Open
relink2013 opened this issue Dec 13, 2024 · 1 comment

@relink2013

Operating system

Linux 6.6.62-Unraid

Description

vDSM was working just fine yesterday. Today I noticed the Drive sync client showed "Abnormal" and that the user home service is disabled. I logged in to DSM and everything is gone, completely.

The only things that still seem to be intact are the user accounts and basic system settings.

And the data truly does seem to be gone: the host directory that holds the disk image shows only 3.5 MB of used space.

Thankfully, some logs still remain, and there is a line for every shared folder I had that reads:

Metadata of shared folder [NAME] was deleted.

I also see that the volume was recreated, and by the looks of it more than once, considering it's now at volume5...

Docker compose

docker run \
  -d \
  --name='virtual-dsm' \
  --net='eth0' \
  --ip='192.168.1.12' \
  --cpuset-cpus='0,1,2,3,4,5,6,7,8,9,10,11' \
  --pids-limit 4096 \
  -e TZ="America/New_York" \
  -e HOST_OS="Unraid" \
  -e HOST_HOSTNAME="HOST" \
  -e HOST_CONTAINERNAME="virtual-dsm" \
  -e 'TCP_PORT_5000'='5000' \
  -e 'TCP_PORT_5001'='5001' \
  -e 'DISK_SIZE'='1000G' \
  -e 'RAM_SIZE'='16G' \
  -e 'CPU_CORES'='8' \
  -e 'DHCP'='Y' \
  -e 'DISK_FMT'='' \
  -e 'TCP_PORT_6522'='6522' \
  -e 'DISK2_SIZE'='34000G' \
  -e 'MAC'='XX-XX-XX-XX-XX-XX' \
  -e 'GUEST_SERIAL'='XXXXXXXXXXXXX' \
  -e 'HOST_SERIAL'='XXXXXXXXXXXXX' \
  -e 'HOST_MAC'='XX-XX-XX-XX-XX-XX' \
  -e 'HOST_MODEL'='DS918+' \
  -e 'ALLOCATE'='N' \
  -l net.unraid.docker.managed=dockerman \
  -l net.unraid.docker.webui='http://[IP]:[PORT:5000]' \
  -l net.unraid.docker.icon='http://192.168.1.10:8384/V-DSM%20new.png' \
  -v '/mnt/system-nvme/DSM_OS/':'/storage':'rw' \
  -v '/mnt/skydrift/DSM_Data/':'/storage2':'rw' \
  --device='/dev/kvm' \
  --device='/dev/vhost-net' \
  --stop-timeout 60 \
  --cap-add NET_ADMIN \
  --device-cgroup-rule='c *:* rwm' 'vdsm/virtual-dsm:latest' ; docker network connect npm virtual-dsm
f912422a5460be6b8b8985ac7c6e2b3a8bd1212be83143c65f85c5cc73405ece

Docker log

[ 43.945859] CIFS VFS: cifs_mount failed w/return code = -115
[ 44.775463] audit_printk_skb: 93 callbacks suppressed
[ 44.776426] audit: type=1325 audit(1734052320.304:43): table=filter family=2 entries=16
[ 44.782401] audit: type=1325 audit(1734052320.311:44): table=filter family=2 entries=16
[ 44.788152] audit: type=1325 audit(1734052320.317:45): table=filter family=2 entries=16
[ 44.794162] audit: type=1325 audit(1734052320.323:46): table=filter family=10 entries=30
[ 44.799452] audit: type=1325 audit(1734052320.328:47): table=filter family=10 entries=30
[ 44.804646] audit: type=1325 audit(1734052320.333:48): table=filter family=10 entries=30
[ 44.810701] audit: type=1325 audit(1734052320.340:49): table=filter family=2 entries=20
[ 44.816198] audit: type=1325 audit(1734052320.345:50): table=filter family=10 entries=34
[ 44.822492] audit: type=1325 audit(1734052320.351:51): table=filter family=2 entries=14
[ 44.828753] audit: type=1325 audit(1734052320.358:52): table=filter family=10 entries=21
[ 44.859445] Synotify use 16384 event queue size
[ 44.861302] Synotify use 16384 event queue size
[ 44.879495] Synotify use 16384 event queue size
[ 44.880518] Synotify use 16384 event queue size
[ 45.056424] capability: warning: `nginx' uses 32-bit capabilities (legacy support in use)
[ 45.697015] iSCSI:target_core_rodsp_server.c:1025:rodsp_server_init RODSP server started, login_key(cf91b9f63f64).
[ 45.704875] syno_extent_pool: module license 'Proprietary' taints kernel.
[ 45.706127] Disabling lock debugging due to kernel taint
[ 45.707644] iSCSI:extent_pool.c:766:ep_init syno_extent_pool successfully initialized
[ 45.718232] iSCSI:target_core_device.c:612:se_dev_align_max_sectors Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 45.720034] iSCSI:target_core_configfs.c:5763:target_init_dbroot db_root: cannot open: /etc/target
[ 45.721407] iSCSI:target_core_lunbackup.c:366:init_io_buffer_head 2048 buffers allocated, total 8388608 bytes successfully
[ 45.740372] iSCSI:target_core_file.c:152:fd_attach_hba RODSP plugin for fileio is enabled.
[ 45.743471] iSCSI:target_core_file.c:159:fd_attach_hba ODX Token Manager is enabled.
[ 45.745277] iSCSI:target_core_multi_file.c:91:fd_attach_hba RODSP plugin for multifile is enabled.
[ 45.746716] iSCSI:target_core_ep.c:795:ep_attach_hba RODSP plugin for epio is enabled.
[ 45.747978] iSCSI:target_core_ep.c:802:ep_attach_hba ODX Token Manager is enabled.
[ 45.815472] workqueue: max_active 1024 requested for vhost_scsi is out of range, clamping between 1 and 512
[ 47.205719] Synotify use 16384 event queue size
[ 47.207215] Synotify use 16384 event queue size
[ 47.288390] fuse init (API version 7.23)
[ 47.535752] findhostd uses obsolete (PF_INET,SOCK_PACKET)
[ 48.609940] Synotify use 16384 event queue size
[ 48.672535] Synotify use 16384 event queue size
[ 51.482155] audit_printk_skb: 348 callbacks suppressed
[ 51.483004] audit: type=1325 audit(1734052327.011:169): table=filter family=2 entries=16
[ 51.488334] audit: type=1325 audit(1734052327.017:170): table=filter family=2 entries=16
[ 51.493485] audit: type=1325 audit(1734052327.022:171): table=filter family=2 entries=16
[ 51.499499] audit: type=1325 audit(1734052327.028:172): table=filter family=10 entries=30
[ 51.505118] audit: type=1325 audit(1734052327.034:173): table=filter family=10 entries=30
[ 51.511118] audit: type=1325 audit(1734052327.040:174): table=filter family=10 entries=30
[ 51.516903] audit: type=1325 audit(1734052327.046:175): table=filter family=2 entries=20
[ 51.522982] audit: type=1325 audit(1734052327.052:176): table=filter family=10 entries=34
[ 51.529511] audit: type=1325 audit(1734052327.058:177): table=filter family=2 entries=14
[ 51.536174] audit: type=1325 audit(1734052327.065:178): table=filter family=10 entries=21
[ 54.916830] CIFS VFS: Error connecting to socket. Aborting operation.
[ 54.919939] CIFS VFS: cifs_mount failed w/return code = -115
[ 57.815586] audit_printk_skb: 96 callbacks suppressed
[ 57.816464] audit: type=1325 audit(1734052333.344:211): table=filter family=2 entries=16
[ 57.821311] audit: type=1325 audit(1734052333.350:212): table=filter family=2 entries=16
[ 57.827453] audit: type=1325 audit(1734052333.356:213): table=filter family=2 entries=16
[ 57.833333] audit: type=1325 audit(1734052333.362:214): table=filter family=10 entries=30
[ 57.838263] audit: type=1325 audit(1734052333.367:215): table=filter family=10 entries=30
[ 57.843411] audit: type=1325 audit(1734052333.372:216): table=filter family=10 entries=30
[ 57.849773] audit: type=1325 audit(1734052333.378:217): table=filter family=2 entries=20
[ 57.857122] audit: type=1325 audit(1734052333.386:218): table=filter family=10 entries=34
[ 57.862716] audit: type=1325 audit(1734052333.391:219): table=filter family=2 entries=14
[ 57.869136] audit: type=1325 audit(1734052333.398:220): table=filter family=10 entries=21
[ 58.085283] Synotify use 16384 event queue size
[ 58.321199] Synotify use 16384 event queue size
[ 59.562188] Synotify use 16384 event queue size
[ 59.563099] Synotify use 16384 event queue size
[ 64.774564] audit_printk_skb: 348 callbacks suppressed
[ 64.775588] audit: type=1325 audit(1734052340.304:337): table=filter family=2 entries=16
[ 64.781012] audit: type=1325 audit(1734052340.311:338): table=filter family=2 entries=16
[ 64.789080] audit: type=1325 audit(1734052340.319:339): table=filter family=2 entries=16
[ 64.795727] audit: type=1325 audit(1734052340.325:340): table=filter family=10 entries=30
[ 64.802583] audit: type=1325 audit(1734052340.332:341): table=filter family=10 entries=30
[ 64.808553] audit: type=1325 audit(1734052340.338:342): table=filter family=10 entries=30
[ 64.815033] audit: type=1325 audit(1734052340.345:343): table=filter family=2 entries=20
[ 64.821805] audit: type=1325 audit(1734052340.351:344): table=filter family=10 entries=34
[ 64.828056] audit: type=1325 audit(1734052340.358:345): table=filter family=2 entries=14
[ 64.834272] audit: type=1325 audit(1734052340.364:346): table=filter family=10 entries=21
[ 64.922999] CIFS VFS: Error connecting to socket. Aborting operation.
[ 64.924894] CIFS VFS: cifs_mount failed w/return code = -115
[ 74.982294] CIFS VFS: Error connecting to socket. Aborting operation.
[ 74.984660] CIFS VFS: cifs_mount failed w/return code = -115
[ 84.988497] CIFS VFS: Error connecting to socket. Aborting operation.
[ 84.991418] CIFS VFS: cifs_mount failed w/return code = -115
[ 768.936086] CIFS VFS: Error connecting to socket. Aborting operation.
[ 768.938607] CIFS VFS: cifs_mount failed w/return code = -115
[ 778.943283] CIFS VFS: Error connecting to socket. Aborting operation.
[ 778.946398] CIFS VFS: cifs_mount failed w/return code = -115

Screenshots (optional)

No response

@relink2013
Author

I'm starting to investigate whether the issue could be due to the way I had my volumes configured.

Volume1

  • Resided on a ZFS-formatted NVMe drive.
  • Stored DSM, all apps, databases, caches, settings, etc.
  • Stored all "appdata" for Docker containers.

Volume2

  • Resided on a large ZFS mirror pool.
  • Stored all data: all shared folders and user home folders.

Initial Setup

The way I originally accomplished this setup was pretty simple. I initially installed the vDSM Docker container with one vdisk on the NVMe drive (this became volume1). Once everything was set up, I shut it down and added another, much larger vdisk on my ZFS pool (this became volume2). I then created all my shared folders on volume2.

Way back, before I stored anything other than random test data, I rebuilt my ZFS pool, which wiped out volume2; no big deal at the time. When I did that, I recall DSM switched everything back to volume1. By "everything" I mean all the shared folders, user home folders, etc. So the data was gone, and everything defaulted back to volume1.

However, when I recreated what should have been volume2, I noticed it automatically appeared as volume3. I changed it back using the info from #763.

Experimenting & Lots of Crashing

During probably the first month or so of messing with vDSM, I was also experimenting with different ZFS configurations on the host system. So I was losing data every other day, but this was perfectly OK: I have backups of everything and I was intentionally experimenting.

This was also causing all kinds of havoc with vDSM, which again was fine by me. I wanted to poke as many holes in my setup as I could.

There were several times the ZFS array completely crashed on the host while vDSM was still running. This created a really interesting situation for vDSM, as it still thought volume2 was present even though it definitely was not. Of course, after restarting the container it realized volume2 was gone.

Present Day

The "experimenting" phase is over, and my host system has been rock solid for months now, as has vDSM.

Quick notes on ZFS

First: Do not mix enterprise drives and consumer drives; pick one and stick with it. They have vastly different controllers, and the consumer drives cannot communicate fast enough with the enterprise drives, which will cause ZFS to randomly report the consumer drive as "offline" even though the disk is perfectly fine.

Second: If you're using vDSM and you're storing your vdisk on a ZFS volume, make sure you put the vdisk in its own dataset, configure the dataset to use no (or minimal) compression, disable atime, and set the block size (recordsize) to match what DSM uses (4K, if I recall correctly). This will make a major performance difference.
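
For reference, a minimal sketch of those dataset settings, assuming a hypothetical dataset name under my "skydrift" pool and the 4K block size mentioned above (adjust to your own pool layout):

  # Hypothetical example: dedicated dataset for a vdisk image.
  zfs create \
    -o recordsize=4K \
    -o compression=off \
    -o atime=off \
    skydrift/vdsm-vdisk

  # Then point the container's second volume mount at the dataset's
  # mountpoint, e.g. -v '/mnt/skydrift/vdsm-vdisk/':'/storage2':'rw'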

Anyway, back to what happened... I have absolutely no idea, which is why I'm here. Everything, including vDSM, has been absolutely rock solid for a few months now, and then all of a sudden, POOF, everything is gone. I had literally just used Synology Drive a few hours prior to this happening.

Current Hypothesis

There are a select few Synology RackStations that are designed for DSM to be installed on a separate NVMe drive, with all data stored on a separate data volume. However, those models are designed to run this way; vDSM by itself likely is not.

I had vDSM as one of the first containers that started on boot, and I'm thinking there was a slight delay between the NVMe drive and the ZFS pool becoming available after a reboot. This is because I noticed in the DSM logs, before they disappeared too, that volume2 had climbed all the way to volume5... which just happens to coincide with roughly the number of times I have rebooted the host since my final setup of vDSM.

Current Plan to test this:

Basic test: Unraid makes it incredibly easy to simply add a delay to the startup of the Docker container.

Second level: If the delay seems to help, I will move both vdisk images into dedicated directories in their own datasets on their respective ZFS pools, disable the Docker container's auto start, and write a small bash script that runs on boot instead. The script will attempt to write and read a specified amount of data in both directories, and only when it can write and then read back the data correctly will it launch the vDSM container.
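
Something along these lines is what I have in mind; just a sketch, where the storage paths, test size, and container name are assumptions taken from my setup above:

  #!/bin/bash
  # Hypothetical boot-time check: write test data into each storage path,
  # read it back, compare, and only start the vDSM container when both
  # paths check out. Paths, test size, and container name are placeholders.
  DIRS=("/mnt/system-nvme/DSM_OS" "/mnt/skydrift/DSM_Data")
  TEST_SIZE_MB=64
  SRC=$(mktemp)

  dd if=/dev/urandom of="$SRC" bs=1M count="$TEST_SIZE_MB" 2>/dev/null

  for dir in "${DIRS[@]}"; do
    testfile="$dir/.vdsm_boot_check"
    # Write the test data and flush it to disk.
    if ! cp "$SRC" "$testfile" || ! sync "$testfile"; then
      echo "Write test failed for $dir, not starting vDSM" >&2
      exit 1
    fi
    # Read it back and make sure it matches what was written.
    if ! cmp -s "$SRC" "$testfile"; then
      echo "Read-back mismatch for $dir, not starting vDSM" >&2
      exit 1
    fi
    rm -f "$testfile"
  done

  rm -f "$SRC"
  docker start virtual-dsm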

Suggested Fix:

Add a health check to the Docker container so that vDSM will not boot unless all attached vdisks are actually accessible, plus a constant check of the vdisks' presence on the host: should any attached vdisk suddenly "disappear", immediately kill vDSM, hopefully before anything bad happens. I imagine a hard shutdown is less risky than writing to a non-existent disk.
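
Until something like that exists in the container itself, a rough host-side approximation of the idea might look like this (purely a sketch: the vdisk image paths, container name, and poll interval are assumptions, not anything this project ships):

  #!/bin/bash
  # Hypothetical host-side watchdog: if either vdisk image disappears while
  # the container is running, kill vDSM immediately. The image paths and
  # container name below are guesses based on my own setup, not defaults.
  VDISKS=("/mnt/system-nvme/DSM_OS/data.img" "/mnt/skydrift/DSM_Data/data2.img")
  CONTAINER="virtual-dsm"

  while sleep 5; do
    # Only check while the container is actually running.
    docker ps --format '{{.Names}}' | grep -qx "$CONTAINER" || continue
    for img in "${VDISKS[@]}"; do
      if [ ! -f "$img" ]; then
        echo "$img is missing, killing $CONTAINER" >&2
        docker kill "$CONTAINER"
        break
      fi
    done
  done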
