Bug 1672336

Summary: lvm2 commands do not show logical volumes when lvmetad is used
Product: Red Hat Enterprise Linux 7
Reporter: Andrew Schorr <ajschorr>
Component: lvm2
lvm2 sub component: Other
Assignee: David Teigland <teigland>
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
CC: agk, cmarthal, heinzm, jbrassow, mcsontos, msnitzer, orion, pasik, prajnoha, rhandlin, teigland, thornber, toneata, villapla, zkabelac
Version: 7.6
Keywords: ZStream
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: lvm2-2.02.184-1.el7
Clones: 1698750
Last Closed: 2019-08-21 18:29:54 UTC
Type: Bug
Bug Blocks: 1698750

Description Andrew Schorr 2019-02-04 15:56:14 UTC
Description of problem:
After applying the most recent lvm2 update, which I was hoping would fix issues with mdraid version 0.90 devices, I was able to start using lvmetad again, but I now get no output from commands such as pvs, vgs, lvs, pvdisplay, vgdisplay, and lvdisplay.

Version-Release number of selected component (if applicable):
lvm2-2.02.180-10.el7_6.3.x86_64


How reproducible:
always

Steps to Reproduce:
1. Boot with lvmetad enabled.
2. Run "pvs", "vgs", or any other lvm2 reporting command.
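
A minimal check-and-run sketch (illustrative only; it assumes the stock RHEL 7 lvmetad configuration and systemd units):

  # confirm lvm is configured to use the lvmetad cache (expect use_lvmetad=1)
  lvmconfig global/use_lvmetad
  # confirm the daemon and its socket are active
  systemctl status lvm2-lvmetad.service lvm2-lvmetad.socket
  # these should list the PVs/VGs/LVs, but return nothing when the bug is hit
  pvs
  vgs
  lvs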

Actual results:
no output

Expected results:
information about pvs, vgs, lvs, etc.

Additional info:
I see these boot messages:
Feb 04 10:22:00 ajserver lvm[3115]: WARNING: lvmetad is being updated by another command (pid 1665).
Feb 04 10:22:00 ajserver lvm[3115]: WARNING: Not using lvmetad because cache update failed.
Feb 04 10:22:00 ajserver lvm[3115]: WARNING: Autoactivation reading from disk instead of lvmetad.
Feb 04 10:22:00 ajserver lvm[3115]: 1 logical volume(s) in volume group "vg_data" now active

Comment 2 Andrew Schorr 2019-03-20 02:05:14 UTC
I just rebooted into kernel 3.10.0-957.10.1.el7.x86_64, and things are working a bit better now. I still see some concerning boot messages:

Mar 19 21:59:49 ajserver lvm[1665]: WARNING: lvmetad is being updated by another command (pid 3113).
Mar 19 21:59:49 ajserver lvm[1665]: WARNING: Not using lvmetad because cache update failed.
Mar 19 21:59:49 ajserver lvm[3113]: 1 logical volume(s) in volume group "vg_data" now active

But the pvs, vgs, and lvs commands are producing useful output now. I'm not sure what's going on here.

Comment 3 David Teigland 2019-03-20 14:32:17 UTC
This commit may help (stable-2.02 branch):
https://sourceware.org/git/?p=lvm2.git;a=commit;h=dcf8f3111aea1179be1083ace772f3108b06c508

I will try to find a test rpm with this fix for you to try.

Comment 4 Marian Csontos 2019-03-21 08:13:03 UTC
This looks like another ZStream candidate. I am preparing a 7.6 scratch build.

Comment 5 Marian Csontos 2019-03-22 09:00:39 UTC
Could you please test with the following 7.6 build, which includes the patch: http://people.redhat.com/mcsontos/repos/bz1672336/
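
(For reference, a hedged sketch of applying such a test build, assuming the .repo file from that link has been saved into /etc/yum.repos.d/ first and that the device-mapper packages come from the same lvm2 build:)

  yum clean metadata
  yum update 'lvm2*' 'device-mapper*'
  reboot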

Comment 6 Andrew Schorr 2019-03-22 20:38:37 UTC
I installed the repo file, upgraded the rpms, and rebooted. After rebooting:

[schorr@ajserver ~]$ sudo journalctl -b | grep lvm
Mar 22 16:35:30 ajserver lvm[3109]: WARNING: lvmetad is being updated, retrying (setup) for 10 more seconds.
Mar 22 16:35:31 ajserver lvm[3109]: 1 logical volume(s) in volume group "vg_data" now active

And pvs, vgs, and lvs seem to be working properly.

Regards,
Andy

Comment 7 David Teigland 2019-04-05 14:17:36 UTC
*** Bug 1691826 has been marked as a duplicate of this bug. ***

Comment 8 Marian Csontos 2019-04-09 11:52:32 UTC
IIUC this is a race, so it may not be 100% reproducible. David, you fixed this, so you may have an idea how to test it. Are multiple VGs on MD a requirement, or just multiple VGs?

Comment 9 David Teigland 2019-04-09 14:54:03 UTC
Right, this bug is not directly reproducible (and doesn't require MD). What we probably want to do is something like:

- kill and restart lvmetad (and don't run any lvm commands, so it remains uninitialized)

- barrage lvmetad with a number of concurrent pvscan --cache -aay <device> commands.

In real life this comes about during system startup, where multiple devices appear to the system all at once and systemd/udev fires many pvscan services. But we might be able to simulate something similar by running 'udevadm trigger' (we'll need to check whether that produces serial or parallel pvscans).

What we want to do is set up a potential race among many pvscans trying to initialize lvmetad.

Comment 10 David Teigland 2019-04-09 15:18:36 UTC
udevadm trigger doesn't do that... we might have to cut out systemd/udev/services and test this by directly running a bunch of parallel 'pvscan --cache -aay <device>' commands against the uninitialized lvmetad.
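
A minimal sketch of that kind of test, based on comments 9 and 10 (the device names are placeholders, and it assumes lvmetad can be restarted directly, outside systemd socket activation):

  # kill and restart lvmetad so its cache is uninitialized;
  # run no other lvm commands in between
  pkill lvmetad
  lvmetad

  # barrage the uninitialized daemon with concurrent pvscan
  # autoactivation scans, one per PV device
  for dev in /dev/sdb /dev/sdc /dev/sdd; do
      pvscan --cache -aay "$dev" 2>&1 | grep -i lvmetad &
  done
  wait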

(Part of the problem we've had in this area is that our testing and simulation have been rather different from what the system actually runs, but systemd/udev do not make it easy to reproduce their actions outside of an actual system startup.)

Comment 14 Corey Marthaler 2019-07-01 22:17:13 UTC
We still see the "WARNING: Not updating lvmetad because cache update failed." messages at boot from time to time on a variety of test machines running the latest rpms. Does that warrant not marking this bug verified? We do see the proper pvs/vgs/lvs output after boot, however, and when we tried the quick check listed in comment #9 we saw no such messages.


A couple test machines just last week: 
[root@hayes-01 ~]#  grep "cache update failed" /var/log/message*
/var/log/messages-20190630:Jun 26 14:21:42 hayes-01 lvm: WARNING: Not using lvmetad because cache update failed.
/var/log/messages-20190630:Jun 26 14:21:42 hayes-01 lvm: WARNING: Not updating lvmetad because cache update failed.
/var/log/messages-20190630:Jun 26 16:18:17 hayes-01 lvm: WARNING: Not updating lvmetad because cache update failed.
/var/log/messages-20190630:Jun 26 16:18:17 hayes-01 lvm: WARNING: Not updating lvmetad because cache update failed.
/var/log/messages-20190630:Jun 26 16:18:17 hayes-01 lvm: WARNING: Not using lvmetad because cache update failed.
/var/log/messages-20190630:Jun 26 17:16:45 hayes-01 lvm: WARNING: Not using lvmetad because cache update failed.
/var/log/messages-20190630:Jun 26 17:16:45 hayes-01 lvm: WARNING: Not updating lvmetad because cache update failed.
/var/log/messages-20190630:Jun 26 17:16:45 hayes-01 lvm: WARNING: Not updating lvmetad because cache update failed.

[root@hayes-03 ~]#  grep "cache update failed" /var/log/message*
/var/log/messages-20190630:Jun 24 12:02:06 hayes-03 lvm: WARNING: Not using lvmetad because cache update failed.
/var/log/messages-20190630:Jun 24 12:02:06 hayes-03 lvm: WARNING: Not updating lvmetad because cache update failed.
/var/log/messages-20190630:Jun 24 14:01:54 hayes-03 lvm: WARNING: Not using lvmetad because cache update failed.
/var/log/messages-20190630:Jun 26 13:47:33 hayes-03 lvm: WARNING: Not using lvmetad because cache update failed.

Boot journal excerpt from hayes-01 around one of those failures:

Jun 26 14:21:41 hayes-01 systemd: Starting LVM2 PV scan on device 8:113...
Jun 26 14:21:41 hayes-01 lvm: WARNING: lvmetad is being updated, retrying (setup) for 10 more seconds.
Jun 26 14:21:41 hayes-01 lvm: WARNING: lvmetad is being updated, retrying (setup) for 10 more seconds.
Jun 26 14:21:41 hayes-01 lvm: WARNING: lvmetad is being updated, retrying (setup) for 10 more seconds.
Jun 26 14:21:41 hayes-01 multipathd: sdg: add path (uevent)
Jun 26 14:21:41 hayes-01 lvm: WARNING: lvmetad is being updated, retrying (setup) for 10 more seconds.
Jun 26 14:21:41 hayes-01 lvm: WARNING: lvmetad is being updated, retrying (setup) for 10 more seconds.
Jun 26 14:21:41 hayes-01 multipathd: sdg: spurious uevent, path already in pathvec
Jun 26 14:21:42 hayes-01 kernel: cryptd: max_cpu_qlen set to 1000
Jun 26 14:21:42 hayes-01 multipathd: sdf: add path (uevent)
Jun 26 14:21:42 hayes-01 multipathd: sdf: spurious uevent, path already in pathvec
Jun 26 14:21:42 hayes-01 multipathd: sdi: add path (uevent)
Jun 26 14:21:42 hayes-01 multipathd: sdi: spurious uevent, path already in pathvec
Jun 26 14:21:42 hayes-01 multipathd: sdk: add path (uevent)
Jun 26 14:21:42 hayes-01 multipathd: sdk: spurious uevent, path already in pathvec
Jun 26 14:21:42 hayes-01 multipathd: sdc: add path (uevent)
Jun 26 14:21:42 hayes-01 multipathd: sdc: spurious uevent, path already in pathvec
Jun 26 14:21:42 hayes-01 multipathd: sdb: add path (uevent)
Jun 26 14:21:42 hayes-01 multipathd: sdb: spurious uevent, path already in pathvec
Jun 26 14:21:42 hayes-01 kernel: AVX2 version of gcm_enc/dec engaged.
Jun 26 14:21:42 hayes-01 kernel: AES CTR mode by8 optimization enabled
Jun 26 14:21:42 hayes-01 kernel: alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
Jun 26 14:21:42 hayes-01 kernel: alg: No test for __generic-gcm-aes-aesni (__driver-generic-gcm-aes-aesni)
Jun 26 14:21:42 hayes-01 multipathd: sda: add path (uevent)
Jun 26 14:21:42 hayes-01 multipathd: sda: spurious uevent, path already in pathvec
Jun 26 14:21:42 hayes-01 kernel: dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.3)
Jun 26 14:21:42 hayes-01 systemd: Found device PERC_H330_Adp 2.
Jun 26 14:21:42 hayes-01 systemd: Activating swap /dev/disk/by-uuid/a65a2321-6664-44b9-9786-cdabc080c63a...
Jun 26 14:21:42 hayes-01 systemd: Found device PERC_H330_Adp 1.
Jun 26 14:21:42 hayes-01 lvm: WARNING: lvmetad is being updated by another command (pid 1190).
Jun 26 14:21:42 hayes-01 lvm: WARNING: lvmetad is being updated by another command (pid 1190).
Jun 26 14:21:42 hayes-01 lvm: WARNING: Not using lvmetad because cache update failed.
Jun 26 14:21:42 hayes-01 kernel: Adding 4194300k swap on /dev/sda2.  Priority:-2 extents:1 across:4194300k FS
Jun 26 14:21:42 hayes-01 systemd: Activated swap /dev/disk/by-uuid/a65a2321-6664-44b9-9786-cdabc080c63a.
Jun 26 14:21:42 hayes-01 systemd: Reached target Swap.
Jun 26 14:21:42 hayes-01 kernel: iTCO_vendor_support: vendor-support=0
Jun 26 14:21:42 hayes-01 lvm: WARNING: lvmetad is being updated by another command (pid 1190).
Jun 26 14:21:42 hayes-01 kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
Jun 26 14:21:42 hayes-01 kernel: iTCO_wdt: Found a Wellsburg TCO device (Version=2, TCOBASE=0x0460)
Jun 26 14:21:42 hayes-01 lvm: WARNING: lvmetad is being updated by another command (pid 1190).
Jun 26 14:21:42 hayes-01 lvm: WARNING: Not updating lvmetad because cache update failed.
Jun 26 14:21:42 hayes-01 kernel: iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
Jun 26 14:21:42 hayes-01 lvm: Command failed with status code 5.

Comment 15 David Teigland 2019-07-02 14:44:11 UTC
We should consider the fix incomplete unless/until we find the reason for those lvmetad update failures. There is one known cause for those messages: scanning all the disks legitimately takes more than 10 seconds (and there is a workaround for that unusual situation). But that doesn't seem likely here. We need to collect -vvvv debugging from all the pvscan commands run by the lvm2-pvscan services. (It might be as simple as setting log/file and log/level in lvm.conf.)
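
A hedged sketch of that lvm.conf change (the log file path is just an example; level 7 is the most verbose setting, equivalent to -vvvv):

  log {
      file = "/var/log/lvm2-pvscan-debug.log"
      level = 7    # write full debug output from every lvm command to the file
  }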

Comment 16 Corey Marthaler 2019-07-02 20:02:09 UTC
Moving this back to assigned based on comment #15. We can target this for rhel7.8 once we have the needed additional debugging.

Comment 18 David Teigland 2019-07-10 14:29:01 UTC
If we want one bz per fix, then this bz is fixed with the patch mentioned above, and comment 14 should be a new bz.  If we want this bz to be for all fixes related to this issue, then we can keep this open for further debugging.

Comment 19 David Teigland 2019-07-31 17:13:05 UTC
Let's call this bz fixed by the commit we know improves it, and open a new bug to collect debugging from any similar problems.