Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Description of problem:
This is copying fedora bug 1672062.
The first pvscan[a] is for the nvme dev, and the second pvscan[b] is for the md0 dev.
Both pvscan's believe that they are doing lvmetad initialization (this is a bug we figured out for a different bz today.)
pvscan[a] gets the list of devices which does not include md0 because md0 hasn't appeared yet.
pvscan[a] begins the device scan based on the list of devs without md0,
pvscan[a] is delayed a bit (looks like udev db delays), but this delay isn't essential to the bug.
pvscan[a] is paused at this point while pvscan[b] runs...
pvscan[b] gets the list of devices including md0.
pvscan[b] does the device scan including both PVs and VGs.
pvscan[b] begins lvmetad initialization including md0 and finishes the update, so lvmetad correctly holds both vgs.
pvscan[b] pauses here (before the activation phase) and pvscan[a] runs...
pvscan[a] does the lvmetad initialization *without* md0 which did not exist when it scanned devs.
lvmetad has now been updated from pvscan[a] without md0!
pvscan[b] now runs the activation phase based on the two VGs it saw above.
pvscan[b] asks lvmetad for information about vg01 on md0, but lvmetad no longer knows about vg01 or md0.
pvscan[b] gets an error from lvmetad, and lvmetad logs this as an error:
Feb 12 20:41:13 myhostmydom.local lvmetad[849]: vg_lookup vgid 2FB2CO-8aRt-bY8B-rgQQ-A8YO-vMMa-wWdDTP name vg01 found incomple
Feb 12 20:41:13 myhostmydom.local lvm[1165]: #cache/lvmetad.c:986 Asking lvmetad for VG 2FB2CO-8aRt-bY8B-rgQQ-A8YO-v
Feb 12 20:41:13 myhostmydom.local lvm[1165]: #cache/lvmetad.c:785 Request to lookup VG 2FB2CO-8aRt-bY8B-rgQQ-A8YO-vMMa
pvscan[b] cannot get info for vg01 from lvmetad so it quits without activating vg01. (But pvscan[b] does activate the other v
Also, lvmetad is left in the incorrect state.
As mentioned above, the other bug I sorted out earlier today causes multiple pvscan's to believe that they should do lvmetad i
Analysis of the bug causing multiple pvscans to believe they are doing lvmetad initialization, based on debug logs from Peter.
Because the pvscans run from systemd, are all run concurrently, each of them thinks that it is doing the first initialization
VGs. The code that should serialize them isn't working because of timing issues (a combination of scanning taking a long time
Each pvscan does:
1. lvmetad_token_matches() to check the lvmetad state which detects
if initialization is needed, and is supposed to wait for another
pvscan that's doing initialization.
a. this logs "Sending lvmetad get_global_info"
b. see if another pvscan is updating lvmetad, and if so wait for
that other pvscan to finish before continuing.
(this is what we want to happen, but it's not)
c. in your logs, each pvscan comes back seeing lvmetad is uninitialized
and is not being updated, so each one believes it needs to do
initialization
2. lvmetad_pvscan_all_devs(cmd, 0) to scan devs and populate lvmetad.
a. this starts by scanning all devs (with the new parallel/aio)
b. then it calls _token_update() to tell lvmetad it is updating state
c. this logs "Sending lvmetad token_update update in progress"
d. any new pvscan that starts *now* will wait for this one to finish,
but any other pvscan already running will be doing these same
initialization steps (sadly they are already running in this case)
e. it sends lvmetad info about each PV that it saw from 2.a
f. it calls _token_update() to tell lvmetad that it's done populating
the cache, and sets an actual token.
The timing of things is defeating the serialization logic:
- Step 2a is taking over 10 seconds.
- The scanning in 2a is taking place *before* pvscan sets
the "update in progress" token in 2c. Because the scan takes so
long, it gives all the pvscans a chance to start and think
that init is required. Before the new aio scan, I believe
that "update in progress" would be set, *then* each device
would be read serially, and as each was read, the info for that
PV was sent to lvmetad.
My first idea for fixing this is to set the "update in progress" token
*before* scanning all the devs.
The timing of those two lvmetad serialization steps from your logs:
Feb 12 13:55:28.497284 fed.virt lvm[950]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
Feb 12 13:55:28.714728 fed.virt lvm[972]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
Feb 12 13:55:29.077168 fed.virt lvm[979]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
Feb 12 13:55:29.548005 fed.virt lvm[991]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
Feb 12 13:55:29.210852 fed.virt lvm[978]: #cache/lvmetad.c:257 Sending lvmetad get_global_info
(all pvscans now running, none have set "update in progress", so all think
they need to do initialization)
10+ seconds later, all are done scanning devices, and they all begin to
populate lvmetad in steps 2c - 2f above.
Feb 12 13:55:40.698830 fed.virt lvm[950]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Feb 12 13:55:40.784047 fed.virt lvm[972]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Feb 12 13:55:46.955881 fed.virt lvm[979]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Feb 12 13:55:47.952818 fed.virt lvm[991]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Feb 12 13:55:47.948535 fed.virt lvm[978]: #cache/lvmetad.c:559 Sending lvmetad token_update update in progress
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Comment 2Jonathan Earl Brassow
2019-04-04 20:46:34 UTC
a duplicate of bug 1672336 ?
If so, please close as such. We will use that one, since it has user comments.