Creating a snapshot of a ZFS backed container with 0 bytes free results in hung lxd when doing snapshot operations #13466

Open
webdock-io opened this issue May 10, 2024 · 4 comments

@webdock-io

Ubuntu Noble
LXD 5.21.1 LTS

To reproduce: create an LXD container on a ZFS-backed pool with a quota set (we also set the refquota flag on the pool; not sure whether it matters), then fill the disk completely with dd. The profile's disk device is set to:

size: 450GB

And df shows

Filesystem              Size  Used Avail Use% Mounted on
lxd/containers/bigdata  420G  xxxG   xxxM xxx% /

If we then do

dd if=/dev/zero of=temp.bin bs=1G count=420

And make sure df shows 0 bytes available, and that zfs list on the host also shows 0 bytes available

And then do

lxc snapshot --reuse --no-expiry bigdata mysnapshot

LXD will hang forever. You should see the command just sitting there if you run ps aux. What's worse, if you kill the snapshot operation, other operations such as snapshot delete will also hang. The only remedy was to run snap restart lxd, after which LXD perked up immediately; we could then free some space and redo the snapshot (which worked fine as soon as some space was free on disk).

Snapshotting directly with zfs works instantly, so I suspect LXD is trying to write some data to the instance and that this write is what hangs. How much free space on a ZFS volume is required for a snapshot to work?

@capriciousduck

I see this one on my machine too. My lxd hung and it got me really confused. I couldn't understand what happened until I saw this issue. Any fix or thoughts on this?

@capriciousduck

Just wanted to check up on this.

Any update on this?

@MggMuggins
Contributor

I've reproduced this; I grabbed a stacktrace from all goroutines (runtime.Stack(buf, true)) while the operation was hanging and it looks like a deadlock:

goroutine 1744 [select, 2 minutes]:
github.com/canonical/lxd/lxd/locking.Lock({0x262dc58, 0x3cc02a0}, {0xc0020d1d70, 0x24})
	/home/wesley/Workspace/lxd/lxd/locking/lock.go:64 +0x12b
github.com/canonical/lxd/lxd/instance/drivers.(*common).updateBackupFileLock(0xc001631800, {0x262dc58, 0x3cc02a0})
	/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:1595 +0x125
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).Delete(0xc001631800, 0x1)
	/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3669 +0x55
github.com/canonical/lxd/lxd/instance/drivers.(*common).snapshotCommon.func1()
	/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:730 +0x22
github.com/canonical/lxd/shared/revert.(*Reverter).Fail(0xc003017bc8)
	/home/wesley/Workspace/lxd/shared/revert/revert.go:29 +0x34
github.com/canonical/lxd/lxd/instance/drivers.(*common).snapshotCommon(0xc002838480, {0x266e3e0, 0xc002838480}, {0xc002fb6490, 0xa}, {0x18?, 0x71d7dcc1fa68?, 0x0?}, 0x0)
	/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_common.go:743 +0x885
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).snapshot(0xc002838480, {0xc002fb6490, 0xa}, {0x102ad5e?, 0x0?, 0x0?}, 0x0)
	/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3437 +0x3b1
github.com/canonical/lxd/lxd/instance/drivers.(*lxc).Snapshot(0xc002838480, {0xc002fb6490, 0xa}, {0xc0013196c8?, 0xc001319788?, 0x0?}, 0x0)
	/home/wesley/Workspace/lxd/lxd/instance/drivers/driver_lxc.go:3449 +0xca
main.instanceSnapshotsPost.func2(0xc001686410?)
	/home/wesley/Workspace/lxd/lxd/instance_snapshot.go:333 +0x91
github.com/canonical/lxd/lxd/operations.(*Operation).Start.func1(0xc00099f680)
	/home/wesley/Workspace/lxd/lxd/operations/operations.go:287 +0x26
created by github.com/canonical/lxd/lxd/operations.(*Operation).Start in goroutine 1709
	/home/wesley/Workspace/lxd/lxd/operations/operations.go:286 +0x105
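
For anyone who wants to capture a similar dump, here is a minimal sketch of the runtime.Stack(buf, true) call mentioned above. The allStacks helper name and the starting buffer size are my own assumptions, not anything from the LXD code base; runtime.Stack truncates its output to the buffer it is given, so the sketch grows the buffer until the dump fits.

```go
package main

import (
	"fmt"
	"runtime"
)

// allStacks (hypothetical helper) returns the formatted stack traces of
// every goroutine in the current process.
func allStacks() string {
	buf := make([]byte, 64<<10) // assumed starting size: 64 KiB
	for {
		// all=true asks for every goroutine, not just the caller.
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return string(buf[:n])
		}
		// The dump filled the buffer, so it may be truncated; retry bigger.
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	fmt.Print(allStacks())
}
```

Goroutines that have been blocked for a while are annotated with their wait reason and duration, as in the "[select, 2 minutes]" header of the trace above.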

Indeed, the instance_updatebackupfile_PROJECT_INSTANCE lock is held throughout a snapshot operation. Instance Delete also acquires the lock, so when the snapshot creation fails and the revert step deletes the snapshot, Delete is unable to acquire the lock: the same goroutine (1744 in the trace) is still holding it.
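
The pattern can be sketched in isolation as follows. This is not the LXD code: namedLock, snapshot, deleteSnapshot and revertTimesOut are hypothetical stand-ins, and the simulated "out of space" error is an assumption for illustration. The sketch also passes a context with a deadline so that it terminates; in the trace above the goroutine simply stays parked in select.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// namedLock is a hypothetical non-reentrant lock: a second acquire blocks
// until release, even if the caller is the current holder.
type namedLock struct{ ch chan struct{} }

func newNamedLock() *namedLock { return &namedLock{ch: make(chan struct{}, 1)} }

// acquire blocks until the lock is free or ctx is done.
func (l *namedLock) acquire(ctx context.Context) (func(), error) {
	select {
	case l.ch <- struct{}{}:
		return func() { <-l.ch }, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// deleteSnapshot stands in for Delete: it takes the lock itself.
func deleteSnapshot(ctx context.Context, l *namedLock) error {
	release, err := l.acquire(ctx)
	if err != nil {
		return fmt.Errorf("delete: %w", err)
	}
	defer release()
	return nil
}

// snapshot holds the lock for the whole operation and, on failure, reverts
// by calling deleteSnapshot while the lock is still held.
func snapshot(ctx context.Context, l *namedLock) error {
	release, err := l.acquire(ctx)
	if err != nil {
		return err
	}
	defer release()

	createErr := errors.New("create snapshot: out of space") // simulated failure
	if err := deleteSnapshot(ctx, l); err != nil {
		return errors.Join(createErr, err)
	}
	return createErr
}

// revertTimesOut runs the sketch with a short deadline and reports whether
// the revert step gave up waiting for the lock.
func revertTimesOut() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	err := snapshot(ctx, newNamedLock())
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("revert blocked on its own lock:", revertTimesOut())
}
```

Without the deadline the second acquire would never return, which matches the hang reported here. Possible ways out, none of which I'm claiming is the eventual fix, would be releasing the lock before running the revert, or giving the revert path a delete variant that assumes the lock is already held.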

@MggMuggins MggMuggins added the Bug Confirmed to be a bug label May 22, 2024
@MggMuggins MggMuggins self-assigned this May 22, 2024
@tomponline tomponline added this to the lxd-6.2 milestone May 23, 2024
