Bug 1921214

Summary: | mode to interrupt --splitcache operation doesn't work with striped wc origin volumes
---|---
Product: | Red Hat Enterprise Linux 8
Component: | lvm2
lvm2 sub component: | Cache Logical Volumes
Version: | 8.4
Status: | CLOSED ERRATA
Severity: | medium
Priority: | medium
Keywords: | Triaged
Hardware: | x86_64
OS: | Linux
Target Milestone: | rc
Target Release: | 8.0
Reporter: | Corey Marthaler <cmarthal>
Assignee: | David Teigland <teigland>
QA Contact: | cluster-qe <cluster-qe>
CC: | agk, heinzm, jbrassow, mcsontos, mpatocka, msnitzer, prajnoha, teigland, zkabelac
Fixed In Version: | lvm2-2.03.11-3.el8
Last Closed: | 2021-05-18 15:02:12 UTC
Type: | Bug
Description
Corey Marthaler
2021-01-27 17:44:08 UTC
I'm seeing this on any writecache LV. After lvconvert sets the cleaner option in the kernel, lvm ends up blocking for a long time waiting for udev cookie synchronization. I suspect the udev sync is, in some indirect way, being blocked by the cleaning that's happening. When udev is finally unblocked (after around 10 seconds in this test), the writecache is clean and can be detached immediately. In that sense the writecache detach is working correctly; it's just that the previously expected behavior is now altered by lvm getting stuck on udev.

16:46:29.442034 lvconvert[1668023] locking/locking.c:334 sync_local_dev_names begin 1611787589
16:46:29.442039 lvconvert[1668023] mm/memlock.c:595 Unlock: Memlock counters: prioritized:0 locked:0 critical:0 daemon:0 suspended:0
16:46:29.442046 lvconvert[1668023] activate/fs.c:492 Syncing device names
16:46:29.442063 lvconvert[1668023] device_mapper/libdm-common.c:2480 Udev cookie 0xd4da93c (semid 5242895) decremented to 1
16:46:29.442075 lvconvert[1668023] device_mapper/libdm-common.c:2766 Udev cookie 0xd4da93c (semid 5242895) waiting for zero
16:46:41.424511 lvconvert[1668023] device_mapper/libdm-common.c:2495 Udev cookie 0xd4da93c (semid 5242895) destroyed
16:46:41.424550 lvconvert[1668023] device_mapper/libdm-common.c:1486 test-fast_cvol: Skipping NODE_ADD (253,4) 0:6 0660 [trust_udev]
16:46:41.424559 lvconvert[1668023] device_mapper/libdm-common.c:1496 test-fast_cvol: Processing NODE_READ_AHEAD 8192 (flags=1)
16:46:41.424626 lvconvert[1668023] device_mapper/libdm-common.c:1250 test-fast_cvol (253:4): read ahead is 8192
16:46:41.424641 lvconvert[1668023] device_mapper/libdm-common.c:1375 test-fast_cvol: retaining kernel read ahead of 8192 (requested 8192)
16:46:41.424651 lvconvert[1668023] device_mapper/libdm-common.c:1486 test-mainone_wcorig: Skipping NODE_ADD (253,5) 0:6 0660 [trust_udev]
16:46:41.424659 lvconvert[1668023] device_mapper/libdm-common.c:1496 test-mainone_wcorig: Processing NODE_READ_AHEAD 8192 (flags=1)
16:46:41.424686 lvconvert[1668023] device_mapper/libdm-common.c:1250 test-mainone_wcorig (253:5): read ahead is 8192
16:46:41.424698 lvconvert[1668023] device_mapper/libdm-common.c:1375 test-mainone_wcorig: retaining kernel read ahead of 8192 (requested 8192)
16:46:41.424705 lvconvert[1668023] device_mapper/libdm-common.c:1486 test-mainone: Skipping NODE_ADD (253,6) 0:6 0660 [trust_udev]
16:46:41.424715 lvconvert[1668023] device_mapper/libdm-common.c:1496 test-mainone: Processing NODE_READ_AHEAD 8192 (flags=1)
16:46:41.424745 lvconvert[1668023] device_mapper/libdm-common.c:1250 test-mainone (253:6): read ahead is 8192
16:46:41.424757 lvconvert[1668023] device_mapper/libdm-common.c:1375 test-mainone: retaining kernel read ahead of 8192 (requested 8192)
16:46:41.424770 lvconvert[1668023] locking/locking.c:337 sync_local_dev_names end 1611787601

Created attachment 1751803 [details]
lvconvert splitcache debug output
lvconvert --splitcache test/main debug output. lvm has suspended LV main (dm-4), added cleaner=1 to the dm table, and resumed. At that point lvm unlocks the VG, which calls sync_local_dev_names, which waits for udev to finish any outstanding processing. lvm ends up waiting 8 seconds for udev to finish processing dm-4 and respond. The only change lvm has made to the dm device is adding the cleaner setting to the dm table, so there's nothing that udev actually needs to discover. But udev doesn't give us a way to tell it to ignore this dm device, even though we know it shouldn't bother with it, and anything udev does will just interfere with our own processing of the dm devs.
11:36:12.813295 lvconvert[1699031] locking/locking.c:334 sync_local_dev_names begin 1611855372
11:36:12.823406 lvconvert[1699031] activate/fs.c:492 Syncing device names
11:36:12.823462 lvconvert[1699031] device_mapper/libdm-common.c:2480 Udev cookie 0xd4d3012 (semid 5275683) decremented to 1
11:36:12.823469 lvconvert[1699031] device_mapper/libdm-common.c:2766 Udev cookie 0xd4d3012 (semid 5275683) waiting for zero
11:36:20.495421 lvconvert[1699031] device_mapper/libdm-common.c:2495 Udev cookie 0xd4d3012 (semid 5275683) destroyed
11:36:20.495461 lvconvert[1699031] device_mapper/libdm-common.c:1486 test-fast_cvol: Skipping NODE_ADD (253,5) 0:6 0660 [trust_udev]
11:36:20.495485 lvconvert[1699031] device_mapper/libdm-common.c:1496 test-fast_cvol: Processing NODE_READ_AHEAD 8192 (flags=1)
11:36:20.495553 lvconvert[1699031] device_mapper/libdm-common.c:1250 test-fast_cvol (253:5): read ahead is 8192
11:36:20.495566 lvconvert[1699031] device_mapper/libdm-common.c:1375 test-fast_cvol: retaining kernel read ahead of 8192 (requested 8192)
11:36:20.495575 lvconvert[1699031] device_mapper/libdm-common.c:1486 test-main_wcorig: Skipping NODE_ADD (253,6) 0:6 0660 [trust_udev]
11:36:20.495581 lvconvert[1699031] device_mapper/libdm-common.c:1496 test-main_wcorig: Processing NODE_READ_AHEAD 512 (flags=1)
11:36:20.495603 lvconvert[1699031] device_mapper/libdm-common.c:1250 test-main_wcorig (253:6): read ahead is 8192
11:36:20.495612 lvconvert[1699031] device_mapper/libdm-common.c:1375 test-main_wcorig: retaining kernel read ahead of 8192 (requested 512)
11:36:20.495617 lvconvert[1699031] device_mapper/libdm-common.c:1486 test-main: Skipping NODE_ADD (253,4) 0:6 0660 [trust_udev]
11:36:20.495624 lvconvert[1699031] device_mapper/libdm-common.c:1496 test-main: Processing NODE_READ_AHEAD 8192 (flags=1)
11:36:20.495649 lvconvert[1699031] device_mapper/libdm-common.c:1250 test-main (253:4): read ahead is 8192
11:36:20.495658 lvconvert[1699031] device_mapper/libdm-common.c:1375 test-main: retaining kernel read ahead of 8192 (requested 8192)
11:36:20.495670 lvconvert[1699031] locking/locking.c:338 sync_local_dev_names end 1611855380
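To make the sequence above concrete, here is a rough dmsetup equivalent of the reload-based path (a sketch only: the device name comes from the logs above, and $NEW_TABLE is a hypothetical placeholder for the existing writecache table with the cleaner option added to its optional arguments):

```sh
# Enable cleaner mode via a table reload, as lvm does here.
dmsetup suspend test-main
dmsetup reload test-main --table "$NEW_TABLE"   # same table plus 'cleaner'
dmsetup resume test-main    # emits a change uevent; udev reprocesses the device

# lvm's sync_local_dev_names then waits on the udev cookie semaphore until
# udev finishes its rule processing for the device, which is where the
# multi-second delay shows up.
```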
The next attachment will show the corresponding udev debug logs.
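For anyone reproducing this, udev debug logging like that in the next attachment can presumably be captured with something along these lines (hypothetical commands, not necessarily how the attached logs were produced):

```sh
# Turn up udev logging and follow the daemon's output.
udevadm control --log-priority=debug
journalctl -f -u systemd-udevd
```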
Created attachment 1751815 [details]
systemd-udev debug logs
This shows what systemd-udev was stalled doing for 8 seconds while lvm was waiting for it to complete processing of dm-4. It appears that it's probing dm-4 for raid information. Since lvm has set cleaner=1 for dm-4, writecache in the kernel is busy writing back data and is delayed in responding to the reads from systemd-udev probing it. udev shouldn't need to be probing this device, because all that has happened is lvm has added a field into the dm table. Anything udev does with this dm device is prone to interfere with lvm managing it.
The 8-second delay in the log section below corresponds to the 8-second delay in the lvm command log in the previous attachment.
Jan 28 11:36:12 null-04.lab.msp.redhat.com systemd-udevd[1699045]: IMPORT builtin 'blkid' /usr/lib/udev/rules.d/13-dm-disk.rules:23
Jan 28 11:36:12 null-04.lab.msp.redhat.com systemd-udevd[1699045]: probe /dev/dm-4 raid offset=0
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: LINK 'disk/by-uuid/ce62dcb6-4f37-4021-ac94-aecd8e692426' /usr/lib/udev/rules.d/13-dm-disk.rules:25
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: RUN '/usr/sbin/dmsetup udevcomplete $env{DM_COOKIE}' /usr/lib/udev/rules.d/95-dm-notify.rules:12
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: handling device node '/dev/dm-4', devnum=b253:4, mode=0600, uid=0, gid=0
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: preserve already existing symlink '/dev/block/253:4' to '../dm-4'
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: found 'b253:4' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fdm-name-test-main'
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: creating link '/dev/disk/by-id/dm-name-test-main' to '/dev/dm-4'
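The stall should also be observable by hand, since any read of the device is delayed while the target is cleaning; this is a hypothetical check rather than part of the attached logs:

```sh
# While cleaner mode is active, a manual probe of the dm device should block
# the same way udev's builtin blkid probe does (device path from the logs).
time blkid /dev/dm-4
```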
Thanks to Zdenek for helping me understand the lvm/udev interactions above. All of that raises a variety of issues and questions:

- A writecache LV may not respond to reads for a considerable period of time while cleaning. This kind of delay has the potential to expose problems in places that aren't prepared to handle delayed reads.
- udev should not need to process, much less *probe*, this LV after lvm has simply added a field into the dm table. That's both unnecessary and excessive interference. udev should be more intelligent here, or at least give subsystems a way to suppress this kind of behavior.
- lvm could try to work around these issues by sending a "cleaner" message to the kernel instead of setting the cleaner via a table reload. There may be similar or different issues with that, so I wouldn't want to commit to implementing it this late in the release.
- There are a couple of other minor changes I've found in the way lvm handles writecache dm devs, discovered while debugging this issue. It's possible that those might have some value even without the workaround above.
- This seems to be an issue for any LV, not only striped LVs as mentioned in the original report.
- The delays are inconsistent: sometimes they seem to happen consistently, other times not. It's not yet clear why. One possibility, based on a few observations, is that if a fs is mounted on the LV the delay may be less likely or shorter.
- This issue is not actually causing a failure of the splitcache command, which still succeeds. The udev delay is just causing the lvconvert command to block uninterruptibly while waiting on udev sync, which means the user does not have a chance to interrupt the command or see the write-back progress.

Created attachment 1751857 [details]
patch
It ended up being very simple to send a dm message instead of a table reload to set the cleaner mode in the kernel. This method keeps udev from getting involved, and it's also less work for lvm, since a suspend/resume cycle is quite a bit more work than just sending a message. I wouldn't say this fix is urgent, since the command isn't broken, but it does improve how things work, so I'll plan to push it for 8.4 if the bug has the necessary acks.
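For reference, the message-based path needs no suspend/resume cycle and generates no uevent; a minimal dmsetup sketch (device name from the logs above; the status field layout follows the kernel's dm-writecache documentation and may vary by kernel version):

```sh
# Set cleaner mode with a target message instead of a table reload;
# udev never gets involved.
dmsetup message test-main 0 cleaner

# Write-back progress is visible in the target status; dm-writecache reports:
# <error count> <total blocks> <free blocks> <blocks under writeback>
dmsetup status test-main
```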
pushed to git
https://sourceware.org/git/?p=lvm2.git;a=commit;h=a690d16d29ff7d5e0cc1adb1fed2621efd5fafa4

This scenario now passes with the latest lvm rpms.

kernel-4.18.0-283.el8 BUILT: Thu Feb 4 05:30:59 CST 2021
lvm2-2.03.11-3.el8 BUILT: Wed Feb 3 10:03:22 CST 2021
lvm2-libs-2.03.11-3.el8 BUILT: Wed Feb 3 10:03:22 CST 2021

SCENARIO - [empty_large_writecache_splitcache]
Create a writecache volume with a large pool, fill it, and start (then interrupt) a splitcache operation and continue using the origin volume.

*** Writecache info for this scenario ***
* origin (slow): /dev/sdc /dev/sde /dev/sdg
* pool (fast): /dev/sdb /dev/sdd
************************************

Adding "slow" and "fast" tags to corresponding pvs

Create origin (slow) volume
lvcreate --yes -i 3 -L 2G -n large_writecache_empty writecache_sanity @slow
lvcreate --yes -L 1G -n pool writecache_sanity /dev/sdb

Deactivate fast pool volume only before conversion to write cache

Create writecached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type writecache --cachevol writecache_sanity/pool writecache_sanity/large_writecache_empty

Activating volume: large_writecache_empty
Although the activate passed, errors were found in its output
LINE=17:44:55.614668 lvchange[61949] label/label.c:824 Scanned devices: read errors 0 process errors 0 failed 0

Placing an xfs filesystem on origin volume
Mounting origin volume
Writing files to /mnt/large_writecache_empty
Checking files on /mnt/large_writecache_empty

large_writecache_empty writecache_sanity Cwi-aoC--- 2.00g [pool_cvol] [large_writecache_empty_wcorig] 36.99 large_writecache_empty_wcorig(0)
[large_writecache_empty_wcorig] writecache_sanity owi-aoC--- 2.00g /dev/sdc(0),/dev/sde(0),/dev/sdg(0)
[pool_cvol] writecache_sanity Cwi-aoC--- 1.00g /dev/sdb(0)

Separating cache pool (lvconvert --yes --splitcache writecache_sanity/large_writecache_empty) from cache origin
Detaching writecache setting cleaner.
Sending splitcache pid 62050 a SIGINT
Detaching writecache cleaning 727052 blocks
This command can be cancelled and rerun to complete writecache detach.

Verifying proper cleaning cache setting: cleaner=1
Checking files on /mnt/large_writecache_empty
lvconvert --yes --splitcache /dev/writecache_sanity/large_writecache_empty
Checking files on /mnt/large_writecache_empty

Removing cache origin volume writecache_sanity/large_writecache_empty
lvremove -f /dev/writecache_sanity/large_writecache_empty

Fix verified in the latest rpms.

kernel-4.18.0-284.el8 BUILT: Mon Feb 8 04:33:33 CST 2021
lvm2-2.03.11-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
lvm2-libs-2.03.11-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
device-mapper-1.02.175-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
device-mapper-libs-1.02.175-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
device-mapper-event-1.02.175-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
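The scenario below repeats the same flow. Distilled, the interrupt-and-rerun behavior being verified looks like this (a minimal sketch: volume names come from the test, and the backgrounding/kill is hypothetical scripting rather than the actual harness):

```sh
# Start the detach; with a large dirty cache, lvconvert first enables
# cleaner mode and reports write-back progress.
lvconvert --yes --splitcache writecache_sanity/large_writecache_empty &
LVCONVERT_PID=$!

# The command can now be interrupted safely...
kill -INT "$LVCONVERT_PID"

# ...and rerun later to complete the detach once write-back has finished.
lvconvert --yes --splitcache writecache_sanity/large_writecache_empty
```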
SCENARIO - [empty_large_writecache_splitcache]
Create a writecache volume with a large pool, fill it, and start (then interrupt) a splitcache operation and continue using the origin volume.

*** Writecache info for this scenario ***
* origin (slow): /dev/sdc /dev/sde /dev/sdg
* pool (fast): /dev/sdb /dev/sdd
************************************

Adding "slow" and "fast" tags to corresponding pvs

Create origin (slow) volume
lvcreate --yes -i 3 -L 2G -n large_writecache_empty writecache_sanity @slow
lvcreate --yes -L 1G -n pool writecache_sanity /dev/sdb

Deactivate fast pool volume only before conversion to write cache

Create writecached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type writecache --cachevol writecache_sanity/pool writecache_sanity/large_writecache_empty

Activating volume: large_writecache_empty

Placing an xfs filesystem on origin volume
Mounting origin volume
Writing files to /mnt/large_writecache_empty
Checking files on /mnt/large_writecache_empty

large_writecache_empty writecache_sanity Cwi-aoC--- 2.00g [pool_cvol] [large_writecache_empty_wcorig] 36.15 large_writecache_empty_wcorig(0)
[large_writecache_empty_wcorig] writecache_sanity owi-aoC--- 2.00g /dev/sdc(0),/dev/sde(0),/dev/sdg(0)
[pool_cvol] writecache_sanity Cwi-aoC--- 1.00g /dev/sdb(0)

Separating cache pool (lvconvert --yes --splitcache writecache_sanity/large_writecache_empty) from cache origin
Detaching writecache setting cleaner.
Sending splitcache pid 9694 a SIGINT
Detaching writecache cleaning 734852 blocks
This command can be cancelled and rerun to complete writecache detach.

Verifying proper cleaning cache setting: cleaner=1
Checking files on /mnt/large_writecache_empty
lvconvert --yes --splitcache /dev/writecache_sanity/large_writecache_empty
Checking files on /mnt/large_writecache_empty

Removing cache origin volume writecache_sanity/large_writecache_empty
lvremove -f /dev/writecache_sanity/large_writecache_empty

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (lvm2 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1659