Description of problem:
Booting Fedora 29 hangs at 'a start job is running for dev-mapper...' and then drops into emergency mode. 'pvs' shows only the first PV:

  PV             VG   Fmt  Attr PSize    PFree
  /dev/nvme0n1p2 vg29 lvm2 a--  <446.13g <159.13g

After an 'lvm pvscan --cache --activate ay' both PVs are recognized:

  PV             VG   Fmt  Attr PSize    PFree
  /dev/md0       vg01 lvm2 a--    <7.28t  100.00g
  /dev/nvme0n1p2 vg29 lvm2 a--  <446.13g <159.13g

With 'init 5' the boot process can then be continued normally. The problem has occurred since I upgraded to Fedora 29 about a month ago; Fedora 27 had no such problem.

Version-Release number of selected component (if applicable):
Fedora 29

How reproducible:
Just boot / reboot

Steps to Reproduce:
1. boot

Actual results:
a start job is running for dev-mapper-vg01\0x2dlv01.device
a start job is running for dev-mapper-vg01\0x2dlv02.device
a start job is running for dev-mapper-vg01\0x2dlv08.device
a start job is running for dev-mapper-vg01\0x2dlv19.device

Expected results:

Additional info:
Created attachment 1526375 [details] journalctl -xb
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/tmp0 locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Creating vg29-tmp0
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-tmp0 (253:7).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-tmp0 (253:7).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/tmp0.
Feb 03 13:42:05 myhostmydom.local lvm[848]: 8 logical volume(s) in volume group "vg29" monitored
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activated 8 logical volumes in volume group vg29
Feb 03 13:42:05 myhostmydom.local lvm[1121]: 8 logical volume(s) in volume group "vg29" now active
Feb 03 13:42:05 myhostmydom.local lvmetad[853]: vg_lookup vgid 2FB2CO-8aRt-bY8B-rgQQ-A8YO-vMMa-wWdDTP name vg01 found incomplete mapping uuid none name none
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activated 8 logical volumes in volume group vg29
Feb 03 13:42:05 myhostmydom.local lvm[918]: 8 logical volume(s) in volume group "vg29" now active

I haven't seen this message before, but it looks like there's something off with the LVM configuration. I'll tentatively reassign to lvm.
Can you let us know what your lvm layout looks like (PVs/VGs/LVs)? I see that two PVs are detected, and the two pvscans for these PVs both seem to be activating vg29:

  lvm2-pvscan@259:2.service has begun starting up
  lvm2-pvscan@9:0.service has begun starting up

There doesn't seem to be any pvscan related to vg01. Can you let us know which PV holds vg01?
It would be a great help to have full lvm debug output (-vvvv) from the pvscan commands (it looks like you are already using a single -v). Thanks.
Sorry, I didn't notice that you already included the PV/VG layout in the original description; vg01 is on md0. There have been a number of fixes for detecting md devices recently. (These concern md devices that use the old md superblock at the end of the device. lvm no longer reads the end of every device to check for old md superblocks, so we have been adding other ways to know when to read the end of a device.) Could you include the lvm version so we can see if all the fixes are included?
The log section below shows that the md device has been started in the kernel before the pvscan for it is started. With a new enough version of lvm, pvscan should see the md device in the kernel, which should cause it to scan the end of devices and detect the PV on md0. (Full -vvvv debug output from the pvscan would be needed to verify this.)

The following commits are relevant to the described solution:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=a01e1fec0fe7c2fa61577c0e636e907cde7279ea
https://sourceware.org/git/?p=lvm2.git;a=commit;h=de2863739f2ea17d89d0e442379109f967b5919d
https://sourceware.org/git/?p=lvm2.git;a=commit;h=0e42ebd6d4012d210084a9ccf8d76f853726de3c

Feb 03 13:42:05 myhostmydom.local kernel: async_tx: api initialized (async)
Feb 03 13:42:05 myhostmydom.local kernel: md/raid:md0: device sdb1 operational as raid disk 1
Feb 03 13:42:05 myhostmydom.local kernel: md/raid:md0: device sde1 operational as raid disk 4
Feb 03 13:42:05 myhostmydom.local kernel: md/raid:md0: device sdd1 operational as raid disk 3
Feb 03 13:42:05 myhostmydom.local kernel: md/raid:md0: device sdc1 operational as raid disk 2
Feb 03 13:42:05 myhostmydom.local kernel: md/raid:md0: device sda1 operational as raid disk 0
Feb 03 13:42:05 myhostmydom.local kernel: md/raid:md0: raid level 5 active with 5 out of 5 devices, algorithm 2
Feb 03 13:42:05 myhostmydom.local kernel: md0: detected capacity change from 0 to 8001054310400
Feb 03 13:42:05 myhostmydom.local systemd[1]: Stopped Timer to wait for more drives before activating degraded array md0..
-- Subject: Unit mdadm-last-resort has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit mdadm-last-resort has finished shutting down.
Feb 03 13:42:05 myhostmydom.local systemd[1]: Starting LVM2 PV scan on device 9:0...
-- Subject: Unit lvm2-pvscan@9:0.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit lvm2-pvscan@9:0.service has begun starting up.
As a side note, lvm2-2.02.183.fc29 is the package with all the above patches.
Note: if the lvm2-pvscan@9:0.service was instantiated, it means 9:0 (the md0 device) was ready and readable, because it must have been scanned by blkid and blkid must have returned "ID_FS_TYPE=LVM2_member" based on the scan. So the MD device underneath was OK.

Below are the logs separated for each pvscan --cache -aay. I'm not 100% sure now which one is pvscan --cache -aay 9:0 and which one is pvscan --cache -aay 259:2 (pid 918 looks like it belongs to pvscan ... 259:2). The issue I see here is that vg29 is activated by both pvscans, even though the PVs belong to separate VGs! That's a bug.

$ grep 'lvm\[918\]' journalctl_xb.txt
Feb 03 13:42:04 myhostmydom.local lvm[918]: Scanning all devices to initialize lvmetad.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Scanning metadata from 1 devices to update lvmetad.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-usr (253:2).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suppressed vg29-usr (253:2) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suspending vg29-usr (253:2) with device flush
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-usr (253:2).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suppressed vg29-usr (253:2) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-usr (253:2).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-root (253:0).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suppressed vg29-root (253:0) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suspending vg29-root (253:0) with device flush
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-root (253:0).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suppressed vg29-root (253:0) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-root (253:0).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-swap (253:1).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suppressed vg29-swap (253:1) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suspending vg29-swap (253:1) with device flush
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-swap (253:1).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Suppressed vg29-swap (253:1) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-swap (253:1).
Feb 03 13:42:05 myhostmydom.local lvm[918]: 3 logical volume(s) in volume group "vg29" already active
Feb 03 13:42:05 myhostmydom.local lvm[918]: 3 existing logical volume(s) in volume group "vg29" monitored
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activating logical volume vg29/usr locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/usr.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activating logical volume vg29/root locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/root.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activating logical volume vg29/var locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/var.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Creating vg29-var
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-var (253:3).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-var (253:3).
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activating logical volume vg29/swap locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/swap.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activating logical volume vg29/data0 locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/data0.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Creating vg29-data0
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-data0 (253:4).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-data0 (253:4).
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activating logical volume vg29/tmp2 locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/tmp2.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Creating vg29-tmp2
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-tmp2 (253:5).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-tmp2 (253:5).
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activating logical volume vg29/tmp1 locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/tmp1.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Creating vg29-tmp1
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-tmp1 (253:6).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-tmp1 (253:6).
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activating logical volume vg29/tmp0 locally.
Feb 03 13:42:05 myhostmydom.local lvm[918]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/tmp0.
Feb 03 13:42:05 myhostmydom.local lvm[918]: Creating vg29-tmp0
Feb 03 13:42:05 myhostmydom.local lvm[918]: Loading table for vg29-tmp0 (253:7).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Resuming vg29-tmp0 (253:7).
Feb 03 13:42:05 myhostmydom.local lvm[918]: Activated 8 logical volumes in volume group vg29
Feb 03 13:42:05 myhostmydom.local lvm[918]: 8 logical volume(s) in volume group "vg29" now active

$ grep "lvm\[1121\]" journalctl_xb.txt
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Scanning all devices to initialize lvmetad.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Scanning metadata from 2 devices to update lvmetad.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Loading table for vg29-usr (253:2).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suppressed vg29-usr (253:2) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suspending vg29-usr (253:2) with device flush
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Loading table for vg29-usr (253:2).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suppressed vg29-usr (253:2) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Resuming vg29-usr (253:2).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Loading table for vg29-root (253:0).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suppressed vg29-root (253:0) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suspending vg29-root (253:0) with device flush
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Loading table for vg29-root (253:0).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suppressed vg29-root (253:0) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Resuming vg29-root (253:0).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Loading table for vg29-swap (253:1).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suppressed vg29-swap (253:1) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suspending vg29-swap (253:1) with device flush
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Loading table for vg29-swap (253:1).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Suppressed vg29-swap (253:1) identical table reload.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Resuming vg29-swap (253:1).
Feb 03 13:42:05 myhostmydom.local lvm[1121]: 4 logical volume(s) in volume group "vg29" already active
Feb 03 13:42:05 myhostmydom.local lvm[1121]: 4 existing logical volume(s) in volume group "vg29" monitored
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/usr locally.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/usr.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/root locally.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/root.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/var locally.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/var.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/swap locally.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/swap.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/data0 locally.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/data0.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/tmp2 locally.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/tmp2.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/tmp1 locally.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/tmp1.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activating logical volume vg29/tmp0 locally.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: activation/volume_list configuration setting not defined: Checking only host tags for vg29/tmp0.
Feb 03 13:42:05 myhostmydom.local lvm[1121]: Activated 8 logical volumes in volume group vg29
Feb 03 13:42:05 myhostmydom.local lvm[1121]: 8 logical volume(s) in volume group "vg29" now active

It looks like the 9:0 scan inside pvscan --cache -aay 9:0 is ignored for some reason; we'd see more from -vvvv output, as David suggests (just use "pvscan -vvvv --cache --activate ay %i" in /usr/lib/systemd/system/lvm2-pvscan@.service). Then please rerun the system with "debug" added on the kernel command line and, if the problem appears, run "lvmdump -u -l -s" to collect detailed lvm debug info and attach it here. Thanks.
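For reference, one way to apply that change without editing the packaged unit file is a systemd drop-in. This is an untested sketch: verify the original ExecStart with "systemctl cat lvm2-pvscan@.service" on your system and mirror it here, adding -vvvv.

```ini
# /etc/systemd/system/lvm2-pvscan@.service.d/debug.conf  (hypothetical path)
# Drop-in override adding -vvvv; the empty ExecStart= clears the
# packaged command before replacing it, per systemd unit semantics.
[Service]
ExecStart=
ExecStart=/usr/sbin/lvm pvscan -vvvv --cache --activate ay %i
```

Run "systemctl daemon-reload" after creating the file, and remove the drop-in once debugging is done, since -vvvv output is very large.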
Created attachment 1534220 [details] Journalctl with pvscan -vvvv and kernel debug option
Created attachment 1534221 [details] lvm pvscan --cache --activate ay -vvvv from the emergency console
Created attachment 1534222 [details] lvmdump -u -l -s from the emergency console
Thanks to all for your support ;-)

The rpm packages are from the latest Fedora release:

$ rpm -qa | grep ^lvm2
lvm2-2.02.183-1.fc29.x86_64
lvm2-libs-2.02.183-1.fc29.x86_64

So I've changed --verbose to -vvvv in the systemd lvm2-pvscan@.service, booted with debug, created an "lvmdump -u -l -s", and after that ran "lvm pvscan --cache --activate ay -vvvv" from the emergency console. The files are attached to this case. I hope this helps you.
Thank you very much, that identifies the bug. (It's not related to properly identifying the md device, as I mentioned above.)

Description of the bug:

The first pvscan[a] is for the nvme dev, and the second pvscan[b] is for the md0 dev. Both pvscans believe that they are doing lvmetad initialization (this is a bug we figured out for a different bz today).

- pvscan[a] gets the list of devices, which does not include md0 because md0 hasn't appeared yet.
- pvscan[a] begins the device scan based on the list of devs without md0. pvscan[a] is delayed a bit (looks like udev db delays), but this delay isn't essential to the bug.
- pvscan[a] is paused at this point while pvscan[b] runs...
- pvscan[b] gets the list of devices, including md0.
- pvscan[b] does the device scan, including both PVs and VGs.
- pvscan[b] begins lvmetad initialization including md0 and finishes the update, so lvmetad correctly holds both VGs.
- pvscan[b] pauses here (before the activation phase) and pvscan[a] runs...
- pvscan[a] does the lvmetad initialization *without* md0, which did not exist when it scanned devs. lvmetad has now been updated from pvscan[a] without md0!
- pvscan[b] now runs the activation phase based on the two VGs it saw above.
- pvscan[b] asks lvmetad for information about vg01 on md0, but lvmetad no longer knows about vg01 or md0.
- pvscan[b] gets an error from lvmetad, and lvmetad logs this as an error:

Feb 12 20:41:13 myhostmydom.local lvmetad[849]: vg_lookup vgid 2FB2CO-8aRt-bY8B-rgQQ-A8YO-vMMa-wWdDTP name vg01 found incomplete mapping uuid none name none
Feb 12 20:41:13 myhostmydom.local lvm[1165]: #cache/lvmetad.c:986 Asking lvmetad for VG 2FB2CO-8aRt-bY8B-rgQQ-A8YO-vMMa-wWdDTP vg01
Feb 12 20:41:13 myhostmydom.local lvm[1165]: #cache/lvmetad.c:785 Request to lookup VG 2FB2CO-8aRt-bY8B-rgQQ-A8YO-vMMa-wWdDTP in lvmetad did not find any matching obj

- pvscan[b] cannot get info for vg01 from lvmetad, so it quits without activating vg01.
(But pvscan[b] does activate the other VG on the nvme PV, because both pvscans mistakenly believe that they should be scanning and activating all VGs to initialize the system.) Also, lvmetad is left in an incorrect state.

As mentioned above, the other bug I sorted out earlier today causes multiple pvscans to believe that they should do lvmetad initialization. Fixing that bug will probably solve this one, although it's possible that the additional problems seen here could require additional fixes.
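The interleaving above can be condensed into a small model (illustrative only, not lvm source code): lvmetad's cache is treated as a blob that whichever pvscan "initializes" last replaces wholesale.

```python
# Illustrative model of the race described above (not lvm code).
# lvmetad's cache is represented as a dict that each "initializing"
# pvscan replaces wholesale with its own view of the devices.

# pvscan[a] built its device list before md0 appeared:
devs_seen_by_a = {"/dev/nvme0n1p2": "vg29"}
# pvscan[b] built its device list after md0 appeared:
devs_seen_by_b = {"/dev/nvme0n1p2": "vg29", "/dev/md0": "vg01"}

# pvscan[b] finishes its lvmetad initialization first (both VGs known)...
lvmetad_cache = dict(devs_seen_by_b)
# ...then pvscan[a] runs its stale initialization, overwriting it.
lvmetad_cache = dict(devs_seen_by_a)

# pvscan[b]'s activation phase now asks lvmetad about vg01 and gets
# nothing, matching the "vg_lookup ... found incomplete mapping" error.
assert "vg01" not in lvmetad_cache.values()
assert "vg29" in lvmetad_cache.values()
```

The ordering is the whole problem: if pvscan[a]'s stale view lands after pvscan[b]'s complete one, vg01 silently vanishes from the cache even though md0 is up and readable.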
Analysis of the bug causing multiple pvscans to believe they are doing lvmetad initialization, based on debug logs from Peter. This run emphasized the 10+ second delays in reading the devices, but that delay is not a necessary factor for hitting this bug.

Because the pvscans run from systemd and are all run concurrently, each of them thinks that it is handling the first-initialization case, so each one is scanning all devs and activating all VGs. The code that should serialize them isn't working because of timing issues (a combination of the scanning taking a long time and the new aio scanning changing the sequence/timing of things).

Each pvscan does:

1. lvmetad_token_matches() to check the lvmetad state, which detects if initialization is needed, and is supposed to wait for another pvscan that's doing initialization.
   a. this logs "Sending lvmetad get_global_info"
   b. see if another pvscan is updating lvmetad, and if so wait for that other pvscan to finish before continuing (this is what we want to happen, but it's not)
   c. in your logs, each pvscan comes back seeing lvmetad is uninitialized and is not being updated, so each one believes it needs to do initialization

2. lvmetad_pvscan_all_devs(cmd, 0) to scan devs and populate lvmetad.
   a. this starts by scanning all devs (with the new parallel/aio)
   b. then it calls _token_update() to tell lvmetad it is updating state
   c. this logs "Sending lvmetad token_update update in progress"
   d. any new pvscan that starts *now* will wait for this one to finish, but any other pvscan already running will be doing these same initialization steps (sadly they are already running in this case)
   e. it sends lvmetad info about each PV that it saw from 2.a
   f. it calls _token_update() to tell lvmetad that it's done populating the cache, and sets an actual token.

The timing of things is defeating the serialization logic:
- Step 2a is taking over 10 seconds.
- The scanning in 2a is taking place *before* pvscan sets the "update in progress" token in 2c.
Because the scan takes so long, it gives all the pvscans a chance to start and think that init is required. Before the new aio scan, I believe that "update in progress" would be set, *then* each device would be read serially, and as each was read, the info for that PV was sent to lvmetad. My first idea for fixing this is to set the "update in progress" token *before* scanning all the devs.

The timing of those two lvmetad serialization steps from your logs:

Feb 12 13:55:28.497284 fed.virt lvm[950]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
Feb 12 13:55:28.714728 fed.virt lvm[972]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
Feb 12 13:55:29.077168 fed.virt lvm[979]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
Feb 12 13:55:29.548005 fed.virt lvm[991]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
Feb 12 13:55:29.210852 fed.virt lvm[978]: #cache/lvmetad.c:257 Sending lvmetad get_global_info

(all pvscans now running, none have set "update in progress", so all think they need to do initialization)

10+ seconds later, all are done scanning devices, and they all begin to populate lvmetad in steps 2c - 2f above.

Feb 12 13:55:40.698830 fed.virt lvm[950]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Feb 12 13:55:40.784047 fed.virt lvm[972]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Feb 12 13:55:46.955881 fed.virt lvm[979]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Feb 12 13:55:47.952818 fed.virt lvm[991]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Feb 12 13:55:47.948535 fed.virt lvm[978]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
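The effect of moving the token update can be shown with a toy model (illustrative, not the lvm implementation): with concurrent startup, every pvscan reaches the token check before any of them finishes the slow scan, so a token set only after scanning serializes nothing.

```python
# Toy model (not lvm source) of the token handshake described above.
# Five pvscans (pids from the logs) each check the token before any of
# them finishes the 10+ second device scan; the only variable is when
# the "update in progress" token gets set.

def count_initializers(set_token_before_scan):
    token = "none"                 # lvmetad starts uninitialized
    initializers = []
    for pid in (950, 972, 979, 991, 978):
        if token == "none":        # step 1: check, decide to initialize
            initializers.append(pid)
            if set_token_before_scan:
                token = "update in progress"   # proposed fix: set now
        # (long parallel device scan happens here for each pvscan)
    token = "update in progress"   # buggy path: set only after scanning
    return len(initializers)

# Buggy ordering: all five pvscans decide to do lvmetad initialization.
assert count_initializers(False) == 5
# Setting the token before the scan lets only the first one through.
assert count_initializers(True) == 1
```

In the buggy ordering the token is still "none" at every check, so all five believe they are the initializer; setting it before the scan means the first checker blocks the rest, which is what the fix restores.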
This is the main fix that should solve the problem:
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=778d3c881d980dc576ab6638e7e0f7e3e59ce4e5

I've only tested this manually; I've not come up with an automated method of testing many parallel pvscan commands run against a new/uninitialized instance of lvmetad. We may also want to increase the default lvmetad_update_wait_time from 10 seconds to something larger.
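If the wait time is raised, it would go in lvm.conf. A sketch of what that could look like, taking the setting name from the comment above (check `man lvm.conf` on your lvm version for the exact section and default before using this):

```
# /etc/lvm/lvm.conf fragment (sketch, value chosen arbitrarily):
# give concurrent pvscans longer to wait for an in-progress lvmetad
# update before giving up.
global {
    lvmetad_update_wait_time = 30
}
```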
That first commit in comment 15 basically restores the old timing of initializing lvmetad from pvscan, and it should largely fix the problem. But, the timing change revealed a legitimate race in initializing lvmetad that could still appear and cause the same problems. I've written a patch that should fix that also, but I'm not sure if it's worth trying to fix at this point in rhel7. Both fixes are here: https://sourceware.org/git/?p=lvm2.git;a=shortlog;h=refs/heads/dev-dct-lvmetad-init-2
This message is a reminder that Fedora 29 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 29 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release.

If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.