Bug 1739080

Summary: Can't remove snapshot on latest rawhide
Product: [Community] LVM and device-mapper
Component: lvm2
lvm2 sub component: Snapshots
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Status: ON_QA
Severity: unspecified
Priority: urgent
Reporter: Vojtech Trefny <vtrefny>
Assignee: Zdenek Kabelac <zkabelac>
QA Contact: cluster-qe <cluster-qe>
CC: agk, heinzm, jbrassow, mcsontos, prajnoha, zkabelac
Flags: pm-rhel: lvm-technical-solution?, pm-rhel: lvm-test-coverage?
Fixed In Version: lvm2-2.03.06-1.fc32
Type: Bug

Description Vojtech Trefny 2019-08-08 12:56:28 UTC
Description of problem:

Removing a snapshot (normal, not thin) on recent rawhide fails: the lvremove command just hangs and the snapshot LV is not removed.

Version-Release number of selected component (if applicable):

  LVM version:     2.03.05(2) (2019-06-15)
  Library version: 1.02.163 (2019-06-15)
  Driver version:  4.40.0


Steps to Reproduce:

# vgcreate blivet_test /dev/vda1 /dev/vdb
# lvcreate -L 1G -n 00 blivet_test
# mkfs.ext4 /dev/blivet_test/00
# lvcreate --size 1G --snapshot --name snap blivet_test/00
# lvremove --force --yes blivet_test/snap

The lvremove command never finishes: there is no error and nothing suspicious in the journal.

Additional info:

This happens only if we try to remove the snapshot shortly after creating it; removing it after a few minutes works as expected.
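
To tell a real hang from a slow removal when rerunning the steps above, the lvremove call can be wrapped in a time limit. This is only a minimal sketch, assuming coreutils `timeout` is available; `run_with_limit` is a hypothetical helper name, not part of lvm2.

```shell
#!/bin/sh
# Hypothetical helper (not part of lvm2): run a command with a time limit
# and report whether it finished or was stopped by the limit.
# Relies on coreutils `timeout`, which exits with status 124 on expiry.
run_with_limit() {
    limit="$1"; shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG: '$*' did not finish within ${limit}s"
    else
        echo "DONE: '$*' exited with status $rc"
    fi
    return "$rc"
}
```

For example, `run_with_limit 60 lvremove --force --yes blivet_test/snap` can be run immediately after the lvcreate step (the 60 second limit is an arbitrary choice). `timeout` sends SIGTERM by default, so a process stuck as in this report may still need manual cleanup.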

Comment 1 Vojtech Trefny 2019-08-13 10:16:06 UTC
Backtrace from gdb

(gdb) bt
#0  0x00007f94d992418b in semop () from /lib64/libc.so.6
#1  0x00005599f74788a9 in _udev_wait ()
#2  0x00005599f7479f9c in dm_udev_wait ()
#3  0x00005599f74576ed in fs_unlock ()
#4  0x00005599f73b86aa in _lv_info ()
#5  0x00005599f73b9be7 in lv_info ()
#6  0x00005599f73ba276 in lv_check_not_in_use ()
#7  0x00005599f73be9d5 in lv_deactivate ()
#8  0x00005599f740178a in lv_remove_single ()
#9  0x00005599f7401e08 in lv_remove_with_dependencies ()
#10 0x00005599f73abe1e in lvremove_single ()
#11 0x00005599f73a7bcb in process_each_lv_in_vg ()
#12 0x00005599f73a93c0 in process_each_lv ()
#13 0x00005599f7390234 in lvremove ()
#14 0x00005599f738e344 in lvm_run_command ()
#15 0x00005599f738f5f3 in lvm2_main ()
#16 0x00007f94d9848193 in __libc_start_main () from /lib64/libc.so.6
#17 0x00005599f736bf4e in _start ()

Comment 2 Zdenek Kabelac 2019-08-14 13:31:11 UTC
After long debugging sessions, I think the bug is not on the lvm2 side.

There are recent changes on the systemd side that probably introduce some new asynchronous behavior into the udev worker logic.

Downgrading to systemd-242-1.fc31.x86_64.rpm makes the problem go away.

On the other hand, that downgrade is highly nontrivial on rawhide at the moment.

So for now there is no better workaround than 'don't do that': add a significant sleep between such commands.

That said, a few things found during debugging are worth fixing on the lvm2 side, but they are unrelated to the blocked wait on udev.
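
The sleep workaround mentioned above could be sketched roughly as follows. `delayed_run` is a hypothetical wrapper name, and the delay value is a guess (the report only says that waiting a few minutes works). The `udevadm settle` call is an extra assumption: it has not been confirmed in this bug to be sufficient on its own, so the sleep is kept.

```shell
#!/bin/sh
# Hypothetical wrapper: wait before running a command that removes a
# freshly created snapshot. Not an lvm2 feature, just a shell sketch.
delayed_run() {
    delay="$1"; shift
    # Assumption: letting the udev event queue drain first may help.
    # Skipped when udevadm is not installed; failures are ignored.
    if command -v udevadm >/dev/null 2>&1; then
        udevadm settle --timeout="${SETTLE_TIMEOUT:-30}" >/dev/null 2>&1 || true
    fi
    # The "significant sleep" from the comment above; length is a guess.
    sleep "$delay"
    "$@"
}
```

Usage would look like `delayed_run 120 lvremove --force --yes blivet_test/snap`, where 120 seconds is an arbitrary value.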

Comment 3 Zdenek Kabelac 2019-08-15 15:22:17 UTC
Just to keep this updated:

lvm2 issues a sequence of actions (a udev transaction) in which a single udev cookie is used for the resume of a device after a reload (with the error target), followed by its immediate removal.

Udevd (not yet sure why) is able to skip 'handling' of the resume (CHANGE) event if the device is already 'gone' (removed), so lvm2 ends up waiting for a 'dmsetup udevcomplete' that will never come.

This happens while 'other devices' are still suspended (the origin LV in the case of snapshot removal).
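
While lvremove is blocked like this, the outstanding cookie can be inspected from another shell. A hedged sketch: `dmsetup udevcookies` and `dmsetup udevcomplete_all` are existing dmsetup subcommands, but `count_cookies` and `pending_cookies` are hypothetical helper names, and the assumption that every cookie row starts with `0x` has not been checked against every dmsetup version.

```shell
#!/bin/sh
# Count cookie rows read from stdin.
# Assumption: each cookie row printed by `dmsetup udevcookies` starts with
# a hex cookie value (0x...), and the header line does not.
count_cookies() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is normalised with `|| true`.
    grep -c '^0x' || true
}

# Print the number of udev cookies still outstanding (needs root and
# a working device-mapper setup).
pending_cookies() {
    dmsetup udevcookies | count_cookies
}
```

If `pending_cookies` stays above zero while lvremove hangs, running `dmsetup udevcomplete_all` as root removes the cookies and wakes the waiting process. That only unblocks the wait; it does not make udevd process the skipped event.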

Comment 4 Zdenek Kabelac 2019-08-28 14:21:43 UTC
The recent commits to the master and stable-2.02 branches made lvm2 'usable' again with the latest udevd, which has the 'optimization' of dropping events for devices that are already gone.

(The last commit is this one: https://www.redhat.com/archives/lvm-devel/2019-August/msg00082.html)

lvm2 should now no longer schedule any 'syncs' while devices are suspended, and a few extra syncs were added in the middle of 'activation' and 'deactivation'.

There is likely more work to be done on some error paths where the syncing might still be missing.

But essentially all existing test suite runs are passing.

Comment 5 Fedora Update System 2019-11-14 01:12:08 UTC
lvm2-2.03.06-1.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.

Comment 6 Zdenek Kabelac 2019-11-14 09:22:31 UTC
Just for completeness, the fix for systemd-udevd that restores the correct behavior is https://github.com/yuwata/systemd/commit/3da84684a8d3acc85cc4c16b3f59459f6fb7ea0a (it will likely land in an upcoming version of the systemd package).