Bug 1812802
Summary: lvchange --refresh succeeds with various warnings during stress test
Product: Red Hat Enterprise Linux 8
Component: lvm2
lvm2 sub component: LVM Metadata / lvmetad
Version: 8.2
Target Release: 8.0
Target Milestone: rc
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: high
Reporter: David Vaanunu <dvaanunu>
Assignee: David Teigland <teigland>
QA Contact: cluster-qe <cluster-qe>
CC: agk, heinzm, jbrassow, msnitzer, nsoffer, pasik, prajnoha, rhodain, teigland, zkabelac
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-05-26 17:40:32 UTC
Description (David Vaanunu, 2020-03-12 08:33:12 UTC)
Seems that some warnings are eliminated upstream:

    commit 957904933b2b0f94eddc6204ca2fa61e2df741a0 (HEAD -> master, origin/master, origin/HEAD)
    Author: David Teigland <teigland>
    Date:   Thu Mar 12 10:18:51 2020 -0500

        reduce device path error messsages

        When /dev entries or sysfs entries are changing due to concurrent
        lvm commands, it can cause warning/error messages about missing paths.

David, what about:

    WARNING: ignoring metadata seqno 1566 on /dev/mapper/x for seqno 1567 on /dev/mapper/y for VG vg-name.
    WARNING: Inconsistent metadata found for VG Bug.

These messages may be useful warnings when running lvm locally, but in the cluster setup used by RHV this is an expected condition. RHV can filter out these warnings, but it looks like LVM is too verbose by default, and it would help more users if LVM stopped writing these warnings to stderr.

Removing those warnings (about metadata inconsistency) for all users might be a good idea, but there are probably cases where we would need to replace them with different/better messages. Since it's a real issue and can indicate real problems for non-RHV users, we would need to do some careful research and testing before dropping them, i.e. it's not a trivial case (like the fopen messages) that we can push out quickly. The next time I'm working in this area it's something I'll look at, but for now you'll need to just ignore them.

(In reply to David Teigland from comment #2)

Thanks David. I filed bug 1814022 for suppressing such warnings until we have a better solution in LVM.

I think this case is effectively showing our label scanning is accessing devices which it should never open (clearly the configured filter only lists 1 device). Older versions of lvm2 always access (open/read...) only devices that have passed the filter.
The current version seems to have parallel code for label scanning which is probably not properly filtering scanned devices through the whole filter chain, and opens & reports messages from irrelevant devices, possibly causing some other issues as well. I believe the idea was to replace the current label scanning with a new io manager engine, which was supposed to address the issue, but that has not happened yet.

(In reply to Zdenek Kabelac from comment #4)
> I think this case is effectively showing our label scanning is accessing
> devices which it should never open
> (clearly filter command only lists 1 device).

I guess adding -vvvv to the lvm commands will shed more light on this. David, can you rerun the same tests with the --verbose flag?

    python2 extend.py run-regular \
        --verbose \
        --extend-delay 10 \
        --iterations 100 \
        --port 8000 \
        <manager-address> \
        <VG_NAME> \
        /dev/mapper/<LUN_ID> \
        2>run-regular.log

The log should show more info about the warnings.

(In reply to Zdenek Kabelac from comment #4)
> I think this case is effectively showing our label scanning is accessing
> devices which it should never open
> (clearly filter command only lists 1 device).

That is incorrect AFAIK; I would be interested to see any logging that indicates this.

> Older version of lvm2 always accesses (open/read...) only devices that have
> passed the filter.
>
> Current version seems to have parallel code for label scanning which is
> probably not properly filtering scanned devices through a whole filter chain
> and opens & reports messages from irrelevant devices causing possibly some
> other issues as well.

I have not seen this; please let me know if you find any evidence of that.

> I believe the idea was to replace current label scanning with a new io
> manager engine - which was supposed to address the issue, but it has not yet
> happened.

That should not make any difference here.
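The warning-filtering stopgap Nir mentions (bug 1814022) can be sketched as a helper that drops the two known-benign warning lines from lvm's stderr. The patterns below come from the messages quoted in this report; the function name is hypothetical, not vdsm's actual code:

```python
import re

# The two warnings this report shows to be expected under RHV's usage
# (concurrent remote commands with no distributed lock). Illustrative only.
BENIGN_WARNINGS = [
    re.compile(r"WARNING: ignoring metadata seqno \d+ on \S+ for seqno \d+ on \S+ for VG \S+"),
    re.compile(r"WARNING: Inconsistent metadata found for VG \S+"),
]

def filter_benign(stderr_text):
    """Return only the stderr lines that are not known-benign warnings."""
    return [line for line in stderr_text.splitlines()
            if not any(p.search(line) for p in BENIGN_WARNINGS)]
```

A caller would run the lvm command via subprocess, pass its captured stderr through filter_benign, and surface only what remains.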
Using the extend test I have collected a full debug log of an lvchange --refresh command that prints the "Inconsistent" message. As I've said a number of times in the past, this is expected given the way vdsm is using lvm. In the following I will explain that in more detail, using the debug log to show what is happening and why it's not a problem.

Steps in the lvm command:

dev cache scan
--------------

Creates a list of device path names from /dev dir entries (or udev). This is the list of devs available on the system that could potentially be scanned. This step does not open or read the devices.

Start of dev cache scan:
    label/label.c:1001 Finding devices to scan
    device/dev-cache.c:1173 Creating list of system devices.

Example of finding a path:
    device/dev-cache.c:712 Found dev ...

filtering
---------

Excludes items from the previous dev cache list based on various filters, e.g. the lvm.conf regex filter, zero-size devs, or a dev being a multipath component. The size check requires opening the dev so stat can be run. No devices are read. Some filters which require reading the dev (if not already excluded by regex), like the md component filter and signature filters, are deferred and applied later.

Start of filtering:
    label/label.c:1038 Filtering devices to scan

Example of a dev being skipped by the lvm.conf filter:
    filters/filter-regex.c:172 /dev/loop6: Skipping (regex)

Example of a filter being deferred to later because it requires reading:
    filters/filter-signature.c:32 filter signature deferred /dev/sde

label scan
----------

The list of devs from the dev cache scan, minus the devs already excluded by filtering, are opened and read here. After each dev is read, any deferred filters from the previous step are applied (e.g. signatures are checked now that the data is available), and devs are immediately excluded if they do not pass the deferred filters. Remaining devs are checked to see if they have an LVM label. If no LVM label, the dev is dropped here.
With an LVM label, the VG metadata is read from the dev and a summary of the metadata is saved in lvmcache structures (info and vginfo). (In the example below only one device passed the filters.)

Start of label scan:
    label/label.c:665 Scanning 1 devices for VG info

Data has been read from a device:
    label/label.c:719 Processing data from device /dev/sde 8:64 fd 4 block 0x5646f330e190

Deferred filters are applied to the device:
    label/label.c:379 Scan filtering /dev/sde

An lvm label is found on the device:
    label/label.c:323 /dev/sde: lvm2 label detected at sector 1

Start of metadata scan (first mda):
    format_text/text_label.c:497 Scanning /dev/sde mda1 summary.

Metadata is found:
    format_text/format-text.c:1657 Found metadata summary on /dev/sde at 85184512 size 77783 for VG vvgg

Summary of metadata is saved in lvmcache:
    cache/lvmcache.c:1435 lvmcache /dev/sde: now in VG vvgg with 2 mda(s).
    cache/lvmcache.c:1259 lvmcache /dev/sde: VG vvgg: set VGID to jVWyt6vKaEcboTssdp7OJTvWepW67eC9.
    cache/lvmcache.c:1634 lvmcache /dev/sde mda1 VG vvgg set seqno 2296 checksum b425d952 mda_size 77783

End of metadata scan (first mda):
    format_text/text_label.c:525 Scanned /dev/sde mda1 seqno 2296

Start of metadata scan (second mda):
    format_text/text_label.c:548 Scanning /dev/sde mda2 summary.

Metadata is found:
    format_text/format-text.c:1657 Found metadata summary on /dev/sde at 999617183744 size 77783 for VG vvgg

Summary of metadata is saved in lvmcache:
    (skipped because both mda1 and mda2 have the same metadata seqno 2296)

End of metadata scan (second mda):
    format_text/text_label.c:575 Scanned /dev/sde mda2 seqno 2296

End of label scan:
    label/label.c:798 Scanned devices: read errors 0 process errors 0 failed 0

Command processing
------------------

The command-specific code begins, usually via process_each_vg, process_each_lv or process_each_pv. Most commands are implemented by processing one or more VGs.
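The scan steps above follow a consistent rule: cheap filters (regex, size) run before any device is read, while read-requiring filters (signatures, md component) are deferred until label scan has the data in hand. A toy model of that pipeline, with illustrative filter logic that only stands in for lvm's real checks:

```python
import re

class Dev:
    def __init__(self, path, size, data=b""):
        self.path, self.size, self.data = path, size, data

# Cheap filters: applied during the filtering step, no device reads.
def regex_filter(dev):
    return not re.match(r"/dev/loop", dev.path)    # stands in for the lvm.conf regex filter

def size_filter(dev):
    return dev.size > 0                            # needs only a stat, not a read

# Deferred filter: needs device data, so it runs during label scan.
def signature_filter(dev):
    return not dev.data.startswith(b"MDRAID")      # stands in for md/signature checks

def scan(devs):
    # dev cache scan + filtering: narrow the candidate list without reading.
    candidates = [d for d in devs if regex_filter(d) and size_filter(d)]
    # label scan: read each remaining dev, then apply the deferred filters.
    return [d.path for d in candidates if signature_filter(d)]

devs = [Dev("/dev/loop6", 0),
        Dev("/dev/sdd", 100, b"MDRAID..."),
        Dev("/dev/sde", 100, b"LABELONE...")]
```

Here /dev/loop6 is rejected without being opened, /dev/sdd is read but then dropped by the deferred filter, and only /dev/sde reaches metadata scanning.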
Begin processing of a VG:
    toollib.c:3668 Processing VG vvgg jVWyt6-vKaE-cboT-ssdp-7OJT-vWep-W67eC9

Lock the VG (local flock plus a distributed lock for shared VGs):
    misc/lvm-flock.c:230 Locking /run/lock/lvm/V_vvgg RB

Read the VG metadata
--------------------

Now that a lock is held on the VG, read the VG metadata again. It was already read above during label scan, during which a summary of it was saved in lvmcache. Another process or host could have changed the VG metadata between the label scan and the flock being taken, so reread the metadata before using it for the command-specific actions.

Begin reading the VG:
    metadata/metadata.c:4700 Reading VG vvgg jVWyt6vKaEcboTssdp7OJTvWepW67eC9

Rescan metadata from devs:
    metadata/metadata.c:4747 Rescanning devices for vvgg

(Reusing the same code as in the label scan above, so most of the same messages are seen, but note the differences in the metadata found here vs the metadata details reported by the same messages above during the initial label scan. These differences illustrate the purpose of rescanning and how it's working correctly.)

Start of label scan:
    label/label.c:665 Scanning 1 devices for VG info

Data has been read from a device:
    label/label.c:719 Processing data from device /dev/sde 8:64 fd 4 block 0x5646f330e350

An lvm label is found on the device:
    label/label.c:323 /dev/sde: lvm2 label detected at sector 1

Start of metadata scan (first mda):
    format_text/text_label.c:497 Scanning /dev/sde mda1 summary.

Metadata is found:
    format_text/format-text.c:1657 Found metadata summary on /dev/sde at 85262336 size 77891 for VG vvgg

Summary of metadata is saved in lvmcache:
    cache/lvmcache.c:1435 lvmcache /dev/sde: now in VG vvgg with 2 mda(s).
    cache/lvmcache.c:1259 lvmcache /dev/sde: VG vvgg: set VGID to jVWyt6vKaEcboTssdp7OJTvWepW67eC9.
    cache/lvmcache.c:1634 lvmcache /dev/sde mda1 VG vvgg set seqno 2297 checksum a03ca8a9 mda_size 77891

End of metadata scan (first mda):
    format_text/text_label.c:525 Scanned /dev/sde mda1 seqno 2297

Start of metadata scan (second mda):
    format_text/text_label.c:548 Scanning /dev/sde mda2 summary.

Metadata is found:
    format_text/format-text.c:1657 Found metadata summary on /dev/sde at 999617339904 size 78001 for VG vvgg

Summary of metadata is saved in lvmcache:
    cache/lvmcache.c:1655 lvmcache /dev/sde mda2 VG vvgg newer seqno 2298 checksum f56ca155 mda_size 78001

(Here we observe something different. During the initial label scan, both mda1 and mda2 contained the same seqno 2296. The other host then wrote seqno 2297 after our initial label scan but before we rescanned mda1, where we saw 2297. Then the other host wrote seqno 2298 between us reading mda1 with 2297 and reading mda2 with 2298. Between local commands using flock, and remote commands using distributed locks, the seqno difference between 2296 and 2297 is normal and expected. But the difference between mda1 seqno 2297 and mda2 seqno 2298 is unique to RHV, because RHV has concurrent remote commands but no distributed lock.)

End of metadata scan (second mda):
    format_text/text_label.c:575 Scanned /dev/sde mda2 seqno 2298

End of label scan:
    label/label.c:798 Scanned devices: read errors 0 process errors 0 failed 0

Read the VG metadata for processing (all fields, not just summary):
    metadata/metadata.c:4853 Reading VG vvgg metadata from /dev/sde 4096

(Reading the VG metadata for processing uses the same data that was used to read the VG summary just above; no new io is performed.)
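The interleaving described in the parenthetical above can be reproduced with a toy model: the writer commits mda1 and mda2 one after the other, so a reader that shares no lock with the writer can catch the window where the two areas differ. This is an illustration of the race only, not lvm's code:

```python
class PV:
    """Toy physical volume with two metadata areas."""
    def __init__(self, seqno):
        self.mda1 = self.mda2 = seqno

pv = PV(2296)                 # initial label scan on host B sees 2296/2296

pv.mda1 = 2297                # host A writes seqno 2297: mda1 first...
pv.mda2 = 2297                # ...then mda2
seen_mda1 = pv.mda1           # host B rescans mda1 under its local lock: 2297

pv.mda1 = 2298                # host A writes seqno 2298 before host B
pv.mda2 = 2298                # reaches mda2
seen_mda2 = pv.mda2           # host B rescans mda2: 2298, mismatch with mda1

# lvm warns about the mismatch but proceeds with the newest copy.
newest = max(seen_mda1, seen_mda2)
if seen_mda1 != seen_mda2:
    print(f"WARNING: ignoring metadata seqno {min(seen_mda1, seen_mda2)} "
          f"for seqno {newest} for VG vvgg.")
```

With flock (same host) or lvmlockd (cross host), the two reads could not straddle a write, so the mismatch never appears.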
Start of metadata read (first mda):
    format_text/import.c:155 Reading metadata from /dev/sde at 85262336 size 77891 (+0)

Metadata is found:
    format_text/format-text.c:497 Found metadata on /dev/sde at 85262336 size 77891 for VG vvgg

Start of metadata read (second mda):
    format_text/import.c:155 Reading metadata from /dev/sde at 999617339904 size 78001 (+0)

Metadata is found:
    format_text/format-text.c:497 Found metadata on /dev/sde at 999617339904 size 78001 for VG vvgg

Warning is produced because different metadata is seen in mda1 vs mda2:
    metadata/metadata.c:4885 WARNING: ignoring metadata seqno 2297 on /dev/sde for seqno 2298 on /dev/sde for VG vvgg.
    metadata/metadata.c:4902 WARNING: Inconsistent metadata found for VG vvgg.

The lvm command will continue to be processed using the newest version of the metadata, 2298.

Created attachment 1671207 [details]
command debug output

The full debugging output that was used for comment 7.

Ahh, so according to comment 7 it looks like we are reading the proper set of devices, but we repeatedly read/access the same data during one command, since we have a lockless first scan.

So with current master we have:

    lvchange[8728] toollib.c:3808 Obtaining the complete list of VGs before processing their LVs
    lvchange[8728] toollib.c:3668 Processing VG vg XXXXXXXXXXXXXXXXX
    lvchange[8728] misc/lvm-flock.c:229 Locking /run/lock/lvm/V_vg RB
    lvchange[8728] misc/lvm-flock.c:113 _do_flock /run/lock/lvm/V_vg:aux WB

With the stable-2.02 branch we have:

    #cache/lvmcache.c:1570 Found VG info for 1 VGs
    #toollib.c:3815 Obtaining the complete list of VGs before processing their LVs
    #toollib.c:3677 Processing VG vg XXXXXXXXXXXXXXXXXX
    #locking/locking.c:331 Dropping cache for vg.
    #misc/lvm-flock.c:201 Locking /run/lock/lvm/V_vg RB
    #misc/lvm-flock.c:99 _do_flock /run/lock/lvm/V_vg:aux WB

So anything cached before is erased, and there cannot be inconsistencies between the metadata discovered & cached in the 1st scan and the 2nd scan.
But this brings back the question of the 'locking' mechanism RHEV deploys. Is this the case:

    1st host: changing lvm2 metadata on a single disk A in a loop
    2nd host: running another lvm2 command with disk A

and there is no locking between these 2 commands?

As in the attachment from comment 8: /dev/sde has seqno 2296 during the 'unlocked' scan, then after taking the VG lock seqno 2297 is reported from mda1 and seqno 2298 comes from mda2??

When lvm2 holds the lock, the metadata obtained from mda1 & mda2 must match. This situation should not be possible if the locking works properly between commands (unless I'm missing something??).

So how is the trace from comment 8 relevant to the original report of this bug? We need to see -vvvv from the 'RHEV' case.

Is RHEV now relying on lockless reading of metadata for read-only lvm2 commands, and using 2 metadata areas per PV? And we need to see how the locking works between running commands.

> 1st.host changing in loop lvm2 metadata on a single disk A
> 2nd.host running another lvm2 command with disk A - and there is no locking
> between these 2 commands ?

Correct, there is no locking. vdsm has its own clustering implementation that ensures only one host is updating the vg and other hosts are only performing safe lvm commands in that vg.

> As in attachment from comment 8:
>
> /dev/sde has seqno 2296 during 'unlocked' scan
> then after taking the VG lock seqno 2297 is reported from mda1 and seqno
> 2298 goes from mda2 ??
> When lvm2 helds the lock - obtained metadata from mda1 & mda2 must match.
> This situation should not be possible in the case the locking works properly
> between commands
> unless I'm missing something??)

You're missing that the commands are running on different hosts, so there is no locking. If the commands were on the same host, then flocks would prevent this, or if they were using lvmlockd that would also prevent it.

> So how does the trace from comment 8 is relevant with the original report of
> this bug ?
This bug was opened about the WARNING message about inconsistent metadata. That is now explained and shown to be harmless.

> We need to see -vvvv from 'RHEV' case.

I don't think we need any other traces unless there is another problem we need to explain.

> Is now RHEV relaying on lockless reading of metadata for read-only lvm2
> commands and uses 2 metadata areas per PV ?

I don't know if they have always used 2 mdas.

> And we need to see how the locking works between running commands.

vdsm coordinates which hosts are running which commands to ensure things are done safely. Based on my memory of their descriptions over the years, the basic idea is that one host does lvcreate/lvextend on behalf of other hosts, and then tells those other hosts to do lvchange -ay or lvchange --refresh.

(In reply to David Teigland from comment #10)
> > And we need to see how the locking works between running commands.
>
> vdsm coordinates which hosts are running which commands to ensure things are
> done safely. Based on my memory of their descriptions over the years, the
> basic idea is that one host does lvcreate/lvextend on behalf of other hosts,
> and then tells those other hosts to do lvchange -ay or lvchange --refresh.

Correct.

lvchange -ay/-an is used when a host needs to start/finish using a lv, for example when running a vm using this lv, or copying lv contents.

lvchange --refresh is done after a host has requested that the manager host extend a lv. When the host gets a reply that the extend was completed, it refreshes the volume. The volume is active and used by a running vm during this. This is used only for qcow2 images.

lvchange --refresh is also used when a lv is already active when we start to use it, to make sure we don't use lvs which are not synced with the current vg metadata.

We also run pvs, vgs, lvs, and vgck from other hosts.

(In reply to Nir Soffer from comment #11)
> (In reply to David Teigland from comment #10)
> > > And we need to see how the locking works between running commands.
> >
> > vdsm coordinates which hosts are running which commands to ensure things are
> > done safely. Based on my memory of their descriptions over the years, the
> > basic idea is that one host does lvcreate/lvextend on behalf of other hosts,
> > and then tells those other hosts to do lvchange -ay or lvchange --refresh.
>
> Correct.
>
> lvchange -ay/n is use when a host need to start/finish using a lv. For
> example when running a vm using this lv, or copying lv contents.
>
> lvchange --refresh is done after a host requested the manager host to extend
> a lv. When the host gets a reply that the extend was completed it refreshes
> the volume. This volume is active and used by a running vm during this. This
> used only for qcow2 images.
>
> lvchange --refresh is also used when a lv is active when we start to use it,
> to make sure we don't use lvs which are not synced with current vg metadata.
>
> We also run pvs, vgs, lvs, and vgck from other hosts.

The main question is: how is it protected when the 'vdsm master' is e.g. running an 'lvextend' command and a 'vdsm client' runs an 'lvs' command? Is it possible to have 'lvextend' running in parallel with 'lvs'? Is there some 'locking' engine for this on the vdsm side?

(In reply to Zdenek Kabelac from comment #12)
> (In reply to Nir Soffer from comment #11)
> > (In reply to David Teigland from comment #10)
> > > > And we need to see how the locking works between running commands.
> > >
> > > vdsm coordinates which hosts are running which commands to ensure things are
> > > done safely. Based on my memory of their descriptions over the years, the
> > > basic idea is that one host does lvcreate/lvextend on behalf of other hosts,
> > > and then tells those other hosts to do lvchange -ay or lvchange --refresh.
> >
> > Correct.
> >
> > lvchange -ay/n is use when a host need to start/finish using a lv. For
> > example when running a vm using this lv, or copying lv contents.
> >
> > lvchange --refresh is done after a host requested the manager host to extend
> > a lv. When the host gets a reply that the extend was completed it refreshes
> > the volume. This volume is active and used by a running vm during this. This
> > used only for qcow2 images.
> >
> > lvchange --refresh is also used when a lv is active when we start to use it,
> > to make sure we don't use lvs which are not synced with current vg metadata.
> >
> > We also run pvs, vgs, lvs, and vgck from other hosts.
>
> The main question is - how is it protected that 'vdsm master' is i.e.
> running 'lvextend'
> command - and 'vdms client' runs 'lvs' command. ?

The only protection is for having one host doing the read-write operations, using a sanlock lease.

> Is it possible to have command 'lvextend' running in parallel with 'lvs' -
> is there some 'locking' engine for this on vdsm side ?

Sure, we can have 50 "client" hosts doing lvs, pvs, vgs, lvchange --refresh, lvchange -ay/n, and vgck at the same time while the "master" host is running lvcreate, lvextend, lvremove, lvchange, vgchange.

As mentioned in comment 1, this has already been fixed by commit https://sourceware.org/git/?p=lvm2.git;a=commit;h=957904933b2b0f94eddc6204ca2fa61e2df741a0
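The division of labor Nir describes can be summarized as a partition of commands by role; the command sets below are taken from the comments, while the helper itself is a hypothetical sketch, not vdsm's code:

```python
# Metadata-writing commands: only the single "master" (SPM) host, which
# holds the sanlock lease, may run these.
MASTER_ONLY = {"lvcreate", "lvextend", "lvremove", "lvchange", "vgchange"}

# Read/activate commands: safe on any number of client hosts concurrently,
# even while the master is rewriting the VG metadata.
CLIENT_OK = {"lvs", "pvs", "vgs", "vgck",
             "lvchange --refresh", "lvchange -ay", "lvchange -an"}

def allowed(role, command):
    """Return True if a host in the given role may run the command."""
    if role == "master":
        return command in MASTER_ONLY or command in CLIENT_OK
    return command in CLIENT_OK
```

The point of the partition is exactly the race shown in comment 7: client-side commands tolerate the inconsistent-metadata window, so they need no cross-host lock.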