Bug 1921214

Summary: | mode to interrupt --splitcache operation doesn't work with striped wc origin volumes
---|---
Product: | Red Hat Enterprise Linux 8
Component: | lvm2
lvm2 sub component: | Cache Logical Volumes
Version: | 8.4
Status: | CLOSED ERRATA
Severity: | medium
Priority: | medium
Keywords: | Triaged
Hardware: | x86_64
OS: | Linux
Target Milestone: | rc
Target Release: | 8.0
Reporter: | Corey Marthaler <cmarthal>
Assignee: | David Teigland <teigland>
QA Contact: | cluster-qe <cluster-qe>
CC: | agk, heinzm, jbrassow, mcsontos, mpatocka, msnitzer, prajnoha, teigland, zkabelac
Fixed In Version: | lvm2-2.03.11-3.el8
Last Closed: | 2021-05-18 15:02:12 UTC
Type: | Bug
Description
Corey Marthaler
2021-01-27 17:44:08 UTC
I'm seeing this on any writecache LV. After lvconvert sets the cleaner option in the kernel, lvm ends up blocking for a long time waiting for udev cookie synchronization. I suspect the udev sync is, in some indirect way, being blocked by the cleaning that's happening. When udev is finally unblocked (after around 10 seconds in this test), the writecache is clean and can be detached immediately. In that sense the writecache detach is working correctly; it's just that the previously expected behavior is now altered by lvm getting stuck on udev.

16:46:29.442034 lvconvert[1668023] locking/locking.c:334 sync_local_dev_names begin 1611787589
16:46:29.442039 lvconvert[1668023] mm/memlock.c:595 Unlock: Memlock counters: prioritized:0 locked:0 critical:0 daemon:0 suspended:0
16:46:29.442046 lvconvert[1668023] activate/fs.c:492 Syncing device names
16:46:29.442063 lvconvert[1668023] device_mapper/libdm-common.c:2480 Udev cookie 0xd4da93c (semid 5242895) decremented to 1
16:46:29.442075 lvconvert[1668023] device_mapper/libdm-common.c:2766 Udev cookie 0xd4da93c (semid 5242895) waiting for zero
16:46:41.424511 lvconvert[1668023] device_mapper/libdm-common.c:2495 Udev cookie 0xd4da93c (semid 5242895) destroyed
16:46:41.424550 lvconvert[1668023] device_mapper/libdm-common.c:1486 test-fast_cvol: Skipping NODE_ADD (253,4) 0:6 0660 [trust_udev]
16:46:41.424559 lvconvert[1668023] device_mapper/libdm-common.c:1496 test-fast_cvol: Processing NODE_READ_AHEAD 8192 (flags=1)
16:46:41.424626 lvconvert[1668023] device_mapper/libdm-common.c:1250 test-fast_cvol (253:4): read ahead is 8192
16:46:41.424641 lvconvert[1668023] device_mapper/libdm-common.c:1375 test-fast_cvol: retaining kernel read ahead of 8192 (requested 8192)
16:46:41.424651 lvconvert[1668023] device_mapper/libdm-common.c:1486 test-mainone_wcorig: Skipping NODE_ADD (253,5) 0:6 0660 [trust_udev]
16:46:41.424659 lvconvert[1668023] device_mapper/libdm-common.c:1496 test-mainone_wcorig: Processing NODE_READ_AHEAD 8192 (flags=1)
16:46:41.424686 lvconvert[1668023] device_mapper/libdm-common.c:1250 test-mainone_wcorig (253:5): read ahead is 8192
16:46:41.424698 lvconvert[1668023] device_mapper/libdm-common.c:1375 test-mainone_wcorig: retaining kernel read ahead of 8192 (requested 8192)
16:46:41.424705 lvconvert[1668023] device_mapper/libdm-common.c:1486 test-mainone: Skipping NODE_ADD (253,6) 0:6 0660 [trust_udev]
16:46:41.424715 lvconvert[1668023] device_mapper/libdm-common.c:1496 test-mainone: Processing NODE_READ_AHEAD 8192 (flags=1)
16:46:41.424745 lvconvert[1668023] device_mapper/libdm-common.c:1250 test-mainone (253:6): read ahead is 8192
16:46:41.424757 lvconvert[1668023] device_mapper/libdm-common.c:1375 test-mainone: retaining kernel read ahead of 8192 (requested 8192)
16:46:41.424770 lvconvert[1668023] locking/locking.c:337 sync_local_dev_names end 1611787601

Created attachment 1751803 [details]
lvconvert splitcache debug output
lvconvert --splitcache test/main debug output. lvm has suspended LV main (dm-4), added cleaner=1 to the dm table, and resumed. At that point lvm unlocks the VG, which calls sync_local_dev_names, which waits for udev to finish any outstanding processing. lvm ends up waiting 8 seconds for udev to finish processing dm-4 and respond. The only change lvm has made to the dm device is adding the cleaner setting to the dm table, so there's nothing that udev actually needs to discover. But udev doesn't give us a way to tell it to ignore this dm device, even though we know it shouldn't bother with it, and anything udev does will just interfere with our own processing of the dm devs.
11:36:12.813295 lvconvert[1699031] locking/locking.c:334 sync_local_dev_names begin 1611855372
11:36:12.823406 lvconvert[1699031] activate/fs.c:492 Syncing device names
11:36:12.823462 lvconvert[1699031] device_mapper/libdm-common.c:2480 Udev cookie 0xd4d3012 (semid 5275683) decremented to 1
11:36:12.823469 lvconvert[1699031] device_mapper/libdm-common.c:2766 Udev cookie 0xd4d3012 (semid 5275683) waiting for zero
11:36:20.495421 lvconvert[1699031] device_mapper/libdm-common.c:2495 Udev cookie 0xd4d3012 (semid 5275683) destroyed
11:36:20.495461 lvconvert[1699031] device_mapper/libdm-common.c:1486 test-fast_cvol: Skipping NODE_ADD (253,5) 0:6 0660 [trust_udev]
11:36:20.495485 lvconvert[1699031] device_mapper/libdm-common.c:1496 test-fast_cvol: Processing NODE_READ_AHEAD 8192 (flags=1)
11:36:20.495553 lvconvert[1699031] device_mapper/libdm-common.c:1250 test-fast_cvol (253:5): read ahead is 8192
11:36:20.495566 lvconvert[1699031] device_mapper/libdm-common.c:1375 test-fast_cvol: retaining kernel read ahead of 8192 (requested 8192)
11:36:20.495575 lvconvert[1699031] device_mapper/libdm-common.c:1486 test-main_wcorig: Skipping NODE_ADD (253,6) 0:6 0660 [trust_udev]
11:36:20.495581 lvconvert[1699031] device_mapper/libdm-common.c:1496 test-main_wcorig: Processing NODE_READ_AHEAD 512 (flags=1)
11:36:20.495603 lvconvert[1699031] device_mapper/libdm-common.c:1250 test-main_wcorig (253:6): read ahead is 8192
11:36:20.495612 lvconvert[1699031] device_mapper/libdm-common.c:1375 test-main_wcorig: retaining kernel read ahead of 8192 (requested 512)
11:36:20.495617 lvconvert[1699031] device_mapper/libdm-common.c:1486 test-main: Skipping NODE_ADD (253,4) 0:6 0660 [trust_udev]
11:36:20.495624 lvconvert[1699031] device_mapper/libdm-common.c:1496 test-main: Processing NODE_READ_AHEAD 8192 (flags=1)
11:36:20.495649 lvconvert[1699031] device_mapper/libdm-common.c:1250 test-main (253:4): read ahead is 8192
11:36:20.495658 lvconvert[1699031] device_mapper/libdm-common.c:1375 test-main: retaining kernel read ahead of 8192 (requested 8192)
11:36:20.495670 lvconvert[1699031] locking/locking.c:338 sync_local_dev_names end 1611855380
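To make the sequence above concrete, here is a rough dmsetup equivalent of the reload-based path (a sketch only: the device name comes from the logs above, and $NEW_TABLE is a hypothetical placeholder for the existing writecache table with the cleaner option added to its optional arguments):

```sh
# Enable cleaner mode via a table reload, as lvm does here.
dmsetup suspend test-main
dmsetup reload test-main --table "$NEW_TABLE"   # same table plus 'cleaner'
dmsetup resume test-main    # emits a change uevent; udev reprocesses the device

# lvm's sync_local_dev_names then waits on the udev cookie semaphore until
# udev finishes its rule processing for the device, which is where the
# multi-second delay shows up.
```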
The next attachment will show the corresponding udev debug logs.
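For anyone reproducing this, udev debug logging like that in the next attachment can presumably be captured with something along these lines (hypothetical commands, not necessarily how the attached logs were produced):

```sh
# Turn up udev logging and follow the daemon's output.
udevadm control --log-priority=debug
journalctl -f -u systemd-udevd
```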
Created attachment 1751815 [details]
systemd-udev debug logs
This shows what systemd-udev was stalled doing for 8 seconds while lvm was waiting for it to complete processing of dm-4. It appears that it's probing dm-4 for raid information. Since lvm has set cleaner=1 for dm-4, writecache in the kernel is busy writing back data and is delayed in responding to the reads from systemd-udev probing it. udev shouldn't need to be probing this device, because all that has happened is lvm has added a field into the dm table. Anything udev does with this dm device is prone to interfere with lvm managing it.
The 8-second delay in the log section below corresponds to the 8-second delay in the lvm command log in the previous attachment.
Jan 28 11:36:12 null-04.lab.msp.redhat.com systemd-udevd[1699045]: IMPORT builtin 'blkid' /usr/lib/udev/rules.d/13-dm-disk.rules:23
Jan 28 11:36:12 null-04.lab.msp.redhat.com systemd-udevd[1699045]: probe /dev/dm-4 raid offset=0
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: LINK 'disk/by-uuid/ce62dcb6-4f37-4021-ac94-aecd8e692426' /usr/lib/udev/rules.d/13-dm-disk.rules:25
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: RUN '/usr/sbin/dmsetup udevcomplete $env{DM_COOKIE}' /usr/lib/udev/rules.d/95-dm-notify.rules:12
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: handling device node '/dev/dm-4', devnum=b253:4, mode=0600, uid=0, gid=0
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: preserve already existing symlink '/dev/block/253:4' to '../dm-4'
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: found 'b253:4' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fdm-name-test-main'
Jan 28 11:36:20 null-04.lab.msp.redhat.com systemd-udevd[1699045]: creating link '/dev/disk/by-id/dm-name-test-main' to '/dev/dm-4'
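The stall should also be observable by hand, since any read of the device is delayed while the target is cleaning; this is a hypothetical check rather than part of the attached logs:

```sh
# While cleaner mode is active, a manual probe of the dm device should block
# the same way udev's builtin blkid probe does (device path from the logs).
time blkid /dev/dm-4
```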
Thanks to Zdenek for helping me understand the lvm/udev interactions above. All of that raises a variety of issues and questions:

- A writecache LV may not respond to reads for a considerable period of time while cleaning. This kind of delay has the potential to expose problems in places that aren't prepared to handle delayed reads.
- udev should not need to process, much less *probe*, this LV after lvm has simply added a field into the dm table. That's both unnecessary and excessive interference. udev should be more intelligent here, or at least give subsystems a way to suppress this kind of behavior.
- lvm could try to work around these issues by sending a "cleaner" message to the kernel instead of setting the cleaner via a table reload. There may be similar or different issues with that, so I wouldn't want to commit to implementing it this late in the release.
- There are a couple of other minor changes I've found in the way lvm handles writecache dm devs, discovered while debugging this issue. It's possible that those might have some value even without the workaround above.
- This seems to be an issue for any LV, not only striped LVs as mentioned in the original report.
- The delays are inconsistent: sometimes they seem to happen consistently, other times not. It's not yet clear why. One possibility, based on a few observations, is that if a fs is mounted on the LV the delay may be less likely or shorter.
- This issue is not actually causing a failure of the splitcache command, which still succeeds. The udev delay is just causing the lvconvert command to block uninterruptibly while waiting on udev sync, which means the user does not have a chance to interrupt the command or see the write-back progress.

Created attachment 1751857 [details]
patch
It ended up being very simple to send a dm message instead of a table reload to set the cleaner mode in the kernel. This method keeps udev from getting involved, and it's also less work for lvm, since a suspend/resume cycle is quite a bit more work than just sending a message. I wouldn't say this fix is urgent, since the command isn't broken, but it does improve how things work, so I'll plan to push it for 8.4 if the bug has the necessary acks.
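For reference, the message-based path needs no suspend/resume cycle and generates no uevent; a minimal dmsetup sketch (device name from the logs above; the status field layout follows the kernel's dm-writecache documentation and may vary by kernel version):

```sh
# Set cleaner mode with a target message instead of a table reload;
# udev never gets involved.
dmsetup message test-main 0 cleaner

# Write-back progress is visible in the target status; dm-writecache reports:
# <error count> <total blocks> <free blocks> <blocks under writeback>
dmsetup status test-main
```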
pushed to git
https://sourceware.org/git/?p=lvm2.git;a=commit;h=a690d16d29ff7d5e0cc1adb1fed2621efd5fafa4

This scenario now passes with the latest lvm rpms.

kernel-4.18.0-283.el8 BUILT: Thu Feb 4 05:30:59 CST 2021
lvm2-2.03.11-3.el8 BUILT: Wed Feb 3 10:03:22 CST 2021
lvm2-libs-2.03.11-3.el8 BUILT: Wed Feb 3 10:03:22 CST 2021

SCENARIO - [empty_large_writecache_splitcache]
Create a writecache volume with a large pool, fill it, and start (then interrupt) a splitcache operation and continue using the origin volume.

*** Writecache info for this scenario ***
* origin (slow): /dev/sdc /dev/sde /dev/sdg
* pool (fast): /dev/sdb /dev/sdd
************************************

Adding "slow" and "fast" tags to corresponding pvs

Create origin (slow) volume
lvcreate --yes -i 3 -L 2G -n large_writecache_empty writecache_sanity @slow
lvcreate --yes -L 1G -n pool writecache_sanity /dev/sdb

Deactivate fast pool volume only before conversion to write cache

Create writecached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type writecache --cachevol writecache_sanity/pool writecache_sanity/large_writecache_empty

Activating volume: large_writecache_empty
Although the activate passed, errors were found in its output
LINE=17:44:55.614668 lvchange[61949] label/label.c:824 Scanned devices: read errors 0 process errors 0 failed 0

Placing an xfs filesystem on origin volume
Mounting origin volume
Writing files to /mnt/large_writecache_empty
Checking files on /mnt/large_writecache_empty

large_writecache_empty writecache_sanity Cwi-aoC--- 2.00g [pool_cvol] [large_writecache_empty_wcorig] 36.99 large_writecache_empty_wcorig(0)
[large_writecache_empty_wcorig] writecache_sanity owi-aoC--- 2.00g /dev/sdc(0),/dev/sde(0),/dev/sdg(0)
[pool_cvol] writecache_sanity Cwi-aoC--- 1.00g /dev/sdb(0)

Separating cache pool (lvconvert --yes --splitcache writecache_sanity/large_writecache_empty) from cache origin
Detaching writecache setting cleaner.
Sending splitcache pid 62050 a SIGINT
Detaching writecache cleaning 727052 blocks
This command can be cancelled and rerun to complete writecache detach.

Verifying proper cleaning cache setting: cleaner=1
Checking files on /mnt/large_writecache_empty
lvconvert --yes --splitcache /dev/writecache_sanity/large_writecache_empty
Checking files on /mnt/large_writecache_empty

Removing cache origin volume writecache_sanity/large_writecache_empty
lvremove -f /dev/writecache_sanity/large_writecache_empty

Fix verified in the latest rpms.

kernel-4.18.0-284.el8 BUILT: Mon Feb 8 04:33:33 CST 2021
lvm2-2.03.11-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
lvm2-libs-2.03.11-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
device-mapper-1.02.175-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
device-mapper-libs-1.02.175-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
device-mapper-event-1.02.175-4.el8 BUILT: Thu Feb 11 04:35:23 CST 2021
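The scenario below repeats the same flow. Distilled, the interrupt-and-rerun behavior being verified looks like this (a minimal sketch: volume names come from the test, and the backgrounding/kill is hypothetical scripting rather than the actual harness):

```sh
# Start the detach; with a large dirty cache, lvconvert first enables
# cleaner mode and reports write-back progress.
lvconvert --yes --splitcache writecache_sanity/large_writecache_empty &
LVCONVERT_PID=$!

# The command can now be interrupted safely...
kill -INT "$LVCONVERT_PID"

# ...and rerun later to complete the detach once write-back has finished.
lvconvert --yes --splitcache writecache_sanity/large_writecache_empty
```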
SCENARIO - [empty_large_writecache_splitcache]
Create a writecache volume with a large pool, fill it, and start (then interrupt) a splitcache operation and continue using the origin volume.

*** Writecache info for this scenario ***
* origin (slow): /dev/sdc /dev/sde /dev/sdg
* pool (fast): /dev/sdb /dev/sdd
************************************

Adding "slow" and "fast" tags to corresponding pvs

Create origin (slow) volume
lvcreate --yes -i 3 -L 2G -n large_writecache_empty writecache_sanity @slow
lvcreate --yes -L 1G -n pool writecache_sanity /dev/sdb

Deactivate fast pool volume only before conversion to write cache

Create writecached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type writecache --cachevol writecache_sanity/pool writecache_sanity/large_writecache_empty

Activating volume: large_writecache_empty

Placing an xfs filesystem on origin volume
Mounting origin volume
Writing files to /mnt/large_writecache_empty
Checking files on /mnt/large_writecache_empty

large_writecache_empty writecache_sanity Cwi-aoC--- 2.00g [pool_cvol] [large_writecache_empty_wcorig] 36.15 large_writecache_empty_wcorig(0)
[large_writecache_empty_wcorig] writecache_sanity owi-aoC--- 2.00g /dev/sdc(0),/dev/sde(0),/dev/sdg(0)
[pool_cvol] writecache_sanity Cwi-aoC--- 1.00g /dev/sdb(0)

Separating cache pool (lvconvert --yes --splitcache writecache_sanity/large_writecache_empty) from cache origin
Detaching writecache setting cleaner.
Sending splitcache pid 9694 a SIGINT
Detaching writecache cleaning 734852 blocks
This command can be cancelled and rerun to complete writecache detach.

Verifying proper cleaning cache setting: cleaner=1
Checking files on /mnt/large_writecache_empty
lvconvert --yes --splitcache /dev/writecache_sanity/large_writecache_empty
Checking files on /mnt/large_writecache_empty

Removing cache origin volume writecache_sanity/large_writecache_empty
lvremove -f /dev/writecache_sanity/large_writecache_empty

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (lvm2 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1659