Bug 1693075
Summary: After live merge, top volume still has INTERNAL volume type due to inability to read LVM metadata properly
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Version: 4.2.8
Target Milestone: ovirt-4.4.1
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1715026 (view as bug list)
Last Closed: 2020-03-21 15:37:27 UTC
Reporter: Germano Veit Michel <gveitmic>
Assignee: Nir Soffer <nsoffer>
QA Contact: Avihai <aefrat>
Docs Contact:
CC: aefrat, dfediuck, lleistne, lsurette, mkalinin, mwest, nsoffer, pelauter, srevivo, teigland, tnisan, ycui
Keywords: Triaged, ZStream
Flags: lsvaty: testing_plan_complete-
Environment:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1703275, 1715026
Description
Germano Veit Michel
2019-03-27 06:39:48 UTC
This also seems to have caused the metadata not being written at all for a newly created volume. I'll update with more details tomorrow when I finish that one.

(In reply to Germano Veit Michel from comment #0)
> Differently to BZ1553133, here the VG is not getting corrupted, it just
> fails to properly read the metadata for short periods of time.

Not entirely sure actually. But here the seqno of the metadata is sequential, there are no jumps. Still, there are not that many copies in the first 128MB, as each is 1.6MB.

Nir, this seems like the same root cause as in bug 1553133. Although the outcome is a bit different (no corruption), the solution should fix both since we change the locking mechanism, right?

This seems like the temporary random failures reported by QE.

David, is this a result of lvm on another host trying to "fix" the metadata during lvchange --refresh or another command that should not change the metadata?

(In reply to Nir Soffer from comment #6)
> This seems like the temporary random failures reported by QE.
>
> David, is this a result of lvm on another host trying to "fix" the metadata
> during lvchange --refresh or another command that should not change the
> metadata?

While one host is writing VG metadata (as a normal change, not fixing anything), if another host tries to read the VG (e.g. lvs) at the same time, the lvs may see bad VG metadata and report errors. The locking_type 4 change from bug 1553133 should prevent the 'lvs' from "fixing" any transient problem it sees.

It shouldn't be difficult to test this with lvm directly. On one host run a loop that does repeated 'lvcreate' in the VG, and on the second host run a loop that does repeated 'lvs' in the same VG. I expect you may see the 'lvs' report errors periodically, especially when the VG metadata grows large.
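David's suggested experiment can be sketched as two loops to be run concurrently on two hosts sharing the same VG. This is only a sketch: the VG name `testvg`, the LV count, and the LV size are placeholder assumptions, and the helper names are made up.

```shell
# Placeholder VG name - substitute the real shared VG.
VG=testvg

# Host A: rewrite the VG metadata repeatedly by creating LVs.
writer_loop() {
    for i in $(seq 1 100); do
        lvcreate --name "lv$i" --size 128m "$VG"
    done
}

# Host B: read the VG in a loop; per David's expectation, lvs should
# report metadata errors periodically, especially once the VG metadata
# area grows large.
reader_loop() {
    while true; do
        lvs "$VG" || echo "lvs failed: $(date)"
    done
}
```

Running `writer_loop` on one host and `reader_loop` on another should reproduce the transient read failures without involving vdsm at all.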
(In reply to David Teigland from comment #7)
> (In reply to Nir Soffer from comment #6)

In this case we see an error in an lvs command on the SPM - the only host adding, removing, or extending LVs.

Is it possible that we had:

1. host A: create/extend/delete/change tags -> writing new metadata
2. host B: run lvs/vgs/pvs/lvchange --refresh -> try to fix inconsistent metadata
3. host A: run lvs -> see inconsistent metadata from bad fix by host B, fix it again

Steps 1 and 2 may happen in parallel.

> Is it possible that we had:
>
> 1. host A: create/extend/delete/change tags -> writing new metadata
> 2. host B: run lvs/vgs/pvs/lvchange --refresh -> try fix inconsistent
> metadata
> 3. host A: run lvs -> see inconsistent metadata from bad fix by host B, fix
> it again
>
> Step 1 and 2 may happen in parallel.
If host B was able to modify the metadata, then host A could report errors and try to fix things again. If the metadata buffer is large enough, you might be able to see evidence of those changes from previous copies of the metadata in the buffer.
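David's point about finding evidence of earlier changes could be checked by dumping the start of a PV and listing the seqno of every metadata copy still sitting in the circular metadata buffer. The helper below is hypothetical: the 128 MiB size comes from the comments above, and the device path is whatever PV holds the first metadata area.

```shell
# Hypothetical helper: list the seqno strings of all metadata copies in
# the first 128 MiB of a PV ($1 = PV device, e.g. a /dev/mapper/... path).
# grep -a treats the binary dump as text; -o prints only the matches.
dump_metadata_history() {
    dd if="$1" bs=1M count=128 2>/dev/null | grep -ao 'seqno = [0-9]*'
}
```

Gaps or repeats in the listed seqnos would be the kind of evidence David describes.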
David, do we have a way to detect when lvm tries to fix inconsistent metadata? Can we enable specific debug logs that will reveal when this happens?

We already collect anything lvm writes to stderr. Currently this goes to a debug log that is disabled by default, but we can change it to log warnings if this can help to diagnose such issues.

Is the code not using the locking_type 4 change from bug 1553133, which should prevent the unwanted repair? If not, then you would probably see this:

WARNING: Inconsistent metadata found for VG %s - updating to use version %u

(In reply to David Teigland from comment #11)
> Is the code not using the locking_type 4 change from bug 1553133 which
> should prevent the unwanted repair?

Not yet, integrating locking_type 4 is not that easy.

> If not, then you would probably see this:
> WARNING: Inconsistent metadata found for VG %s - updating to use version %u

Sounds good, I'll change the logging to log lvm stderr as warnings.

I think the attached patch is the best we can do for now. With the improved log, we will get warnings when lvm tries to fix metadata.

I think we should deliver the improved log in 4.3.5, and reopen this bug when we have more data from users.

Avihai, I think your team filed one or more bugs about the same issue, which may be a duplicate of this bug.

Possible duplicate: bug 1637405

(In reply to Nir Soffer from comment #13)
> I think the attached patch is the best we can do for now. With the improved
> log, we will get warnings when lvm tries to fix metadata.
>
> I think we should deliver the improved log in 4.3.5, and reopen this bug
> when we have more data from users.
>
> Avihai, I think your team filed one or more bugs about the same issue, which
> may be a duplicate of this bug.
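The logging change Nir describes - surfacing anything LVM writes to stderr as a warning instead of hiding it in a disabled debug log - can be sketched as below. This is an illustration of the idea, not vdsm's actual code; `run_and_warn` is a made-up name.

```python
import logging
import subprocess

log = logging.getLogger("storage.LVM")

def run_and_warn(cmd):
    # Run a command; if it succeeds but still wrote to stderr, log that
    # output as a WARNING so messages such as lvm's "Inconsistent metadata
    # found" become visible in the default log level.
    p = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if p.returncode == 0 and p.stderr:
        log.warning("Command %s succeeded with warnings: %r", cmd, p.stderr)
    return p
```

With this in place, an lvm command that silently "fixed" metadata would leave a WARN line in the vdsm log, which is exactly the evidence being asked for above.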
In the bug you mentioned, the deleteVolume VDSM error looks different, so I'm not sure about its similarity.

VDSM:

2019-05-04 21:55:30,237+0300 ERROR (tasks/7) [storage.TaskManager.Task] (Task='3f8954bd-093f-4dff-a69e-266965267a33') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1967, in deleteVolume
    vol.delete(postZero=postZero, force=force, discard=discard)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 588, in delete
    self.validateDelete()

However, bug 1666795, which was resolved in 4.3.2, shows similar errors in the VDSM log (VolumeDoesNotExist); maybe that one is more related?

(In reply to Avihai from comment #16)

The error in the vdsm log does not matter. It is a result of the actual issue, not the root cause.

When we run the automation tests again with the fix, we may get warnings from LVM showing the root cause.

The expected result is:

1. A random flow fails on the SPM when trying to access LVM, with a message about bad LVM metadata, like the one from comment 0:

   WARN (tasks/2) [storage.LVM] lvm lvs failed: 5 [] [' Metadata on /dev/mapper/36000d31000e8a600000000000000000c at 27487763185152 has wrong VG name "= [', '"pv1", 167317', ']', '}', 'segment9 {',...

2. At the same time, on another host, a WARN log message with:

   WARN [storage.LVM] Command [...] succeeded with warnings: [' WARNING: Inconsistent metadata found for VG xxx-yyy-zzz - updating to use version yy']

In this case we know that the LVM command running on that host tried to fix the VG metadata, which may corrupt the VG metadata - bug 1553133.
(In reply to Nir Soffer from comment #18)
> (In reply to Avihai from comment #16)
>
> The error in the vdsm log does not matter. It is a result of the actual
> issue, not the root cause.
>
> When we run the automation tests again with the fix, we may get warnings
> from LVM showing the root cause.
>
> The expected result is:
>
> 1. A random flow fails on the SPM when trying to access LVM, with a message
> about bad LVM metadata, like the one from comment 0:
>
>    WARN (tasks/2) [storage.LVM] lvm lvs failed: 5 [] [' Metadata on
>    /dev/mapper/36000d31000e8a600000000000000000c at 27487763185152 has
>    wrong VG name "= [', '"pv1", 167317', ']', '}', 'segment9 {',...
>
> 2. At the same time, on another host, a WARN log message with:
>
>    WARN [storage.LVM] Command [...] succeeded with warnings: [' WARNING:
>    Inconsistent metadata found for VG xxx-yyy-zzz - updating to use version
>    yy']
>
> In this case we know that the LVM command running on that host tried to fix
> the VG metadata, which may corrupt the VG metadata - bug 1553133.

Moved the discussion to the 4.3 clone, bug 1693075; please update all info there.
Also see my comment about retargeting this bug to 4.3.6 in that bug.

David, we see these LVM warnings in oVirt system tests:

2019-07-04 12:11:45,478-0400 WARN (jsonrpc/5) [storage.LVM] Command ['/sbin/lvm', 'lvs', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600140525911f82f9ce4e34b70c73bc2$|^/dev/mapper/3600140541592ee797ff458aba08a28a4$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }', '--noheadings', '--units', 'b', '--nosuffix', '--separator', '|', '--ignoreskippedcluster', '-o', 'uuid,name,vg_name,attr,size,seg_start_pe,devices,tags', '05a08319-1d06-41a1-a7ad-fccd5fa3768e'] succeeded with warnings: [' WARNING: Cannot find matching striped segment for 05a08319-1d06-41a1-a7ad-fccd5fa3768e/97fec0b7-b94c-46d1-a47d-494b4f0ccdcc.'] (lvm:391)

This warning repeated 35 times on a host which is not the SPM. What is the meaning of this warning?

I think it means that this host has an active LV that has been modified (e.g. extended) from another host, but has not been refreshed on this host. So lvm prints this warning when it sees that the metadata for the LV does not match the LV active in device mapper.

(In reply to David Teigland from comment #21)
> I think it means that this host has an active LV that has been modified
> (e.g. extended) from another host, but has not been refreshed on this host.
> So lvm prints this warning when it sees that the metadata for the LV does
> not match the LV active in device mapper.

This may happen during the automatic extend flow.
We see these warnings now in our system tests (49 warnings per run):

2019-07-12 15:18:34,034-0400 WARN (jsonrpc/7) [storage.LVM] Command ['/sbin/lvm', 'lvchange', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600140510698bd4b8b84143b0f7dbbc7$|^/dev/mapper/360014053544df693e4b4f63a6ba4a428$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }', '--autobackup', 'n', '-a', 'y', '--deltag', u'IU_651741de-1549-4659-88bb-37d117044df9', '--addtag', u'IU__remove_me_651741de-1549-4659-88bb-37d117044df9', '91047f09-f2d7-4453-8e11-483fed09556d/fdd994be-e45e-43b2-a614-9b9626cb0aef'] succeeded with warnings: [' Scan of VG 91047f09-f2d7-4453-8e11-483fed09556d from /dev/mapper/360014053544df693e4b4f63a6ba4a428 found metadata seqno 72 vs previous 71.', ' Scan of VG 91047f09-f2d7-4453-8e11-483fed09556d from /dev/mapper/360014053544df693e4b4f63a6ba4a428 found metadata seqno 72 vs previous 71.'] (lvm:391)

This should run only on the SPM host, so I wonder how lvm can see inconsistent metadata if this is the only host changing the VG/LVs. Does lvm try to fix the metadata in this flow?
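One way such a seqno mismatch can appear even with a single writer is a reader scanning the VG's metadata areas one after another without holding the VG lock, so a write committed between two scans shows up as different seqnos. The toy model below only illustrates that race; it is not lvm's code, and `scan_vg`/`read_area` are invented names.

```python
# Toy model of an unlocked label scan: read each metadata area in turn
# and report a mismatch if a writer bumped the seqno between reads.
def scan_vg(read_area, areas=2):
    """read_area(i) returns the metadata seqno found in area i."""
    previous = None
    for i in range(areas):
        seqno = read_area(i)
        if previous is not None and seqno != previous:
            return "found metadata seqno %d vs previous %d" % (seqno, previous)
        previous = seqno
    return "consistent at seqno %d" % previous

# A writer commits seqno 31 between the reader's two scans:
seqnos = iter([30, 31])
print(scan_vg(lambda i: next(seqnos)))  # found metadata seqno 31 vs previous 30
```

The mismatch string mirrors the "found metadata seqno 31 vs previous 30" warning in the log above; with no concurrent write, both scans return the same seqno and no warning is produced.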
This is the complete flow when this error happens:

2019-07-12 15:11:56,157-0400 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Calling 'Image.delete' in bridge with {u'force': u'false', u'imageID': u'cba6f922-1def-40b7-b197-c65a1088847d', u'postZero': u'false', u'storagepoolID': u'5eddd851-93f5-482d-b0c7-7245d98e66ea', u'storagedomainID': u'91047f09-f2d7-4453-8e11-483fed09556d', u'discard': True} (__init__:329)
2019-07-12 15:11:56,174-0400 INFO (jsonrpc/3) [vdsm.api] START deleteImage(sdUUID=u'91047f09-f2d7-4453-8e11-483fed09556d', spUUID=u'5eddd851-93f5-482d-b0c7-7245d98e66ea', imgUUID=u'cba6f922-1def-40b7-b197-c65a1088847d', postZero=u'false', force=u'false', discard=True) from=::ffff:192.168.201.4,59046, flow_id=7240e365, task_id=457a38dc-b1a5-4735-816f-65e6840ada4a (api:48)
2019-07-12 15:11:56,543-0400 WARN (jsonrpc/3) [storage.LVM] Command ['/sbin/lvm', 'lvchange', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600140510698bd4b8b84143b0f7dbbc7$|^/dev/mapper/360014053544df693e4b4f63a6ba4a428$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }', '--autobackup', 'n', '-a', 'y', '--deltag', u'IU_cba6f922-1def-40b7-b197-c65a1088847d', '--addtag', u'IU__remove_me_cba6f922-1def-40b7-b197-c65a1088847d', '91047f09-f2d7-4453-8e11-483fed09556d/034748fe-3c34-4d05-88d4-00a2a3856513'] succeeded with warnings: [' Scan of VG 91047f09-f2d7-4453-8e11-483fed09556d from /dev/mapper/360014053544df693e4b4f63a6ba4a428 found metadata seqno 31 vs previous 30.', ' Scan of VG 91047f09-f2d7-4453-8e11-483fed09556d from /dev/mapper/360014053544df693e4b4f63a6ba4a428 found metadata seqno 31 vs previous 30.'] (lvm:391)
2019-07-12 15:11:56,567-0400 INFO (jsonrpc/3) [vdsm.api] FINISH deleteImage return=None from=::ffff:192.168.201.4,59046, flow_id=7240e365, task_id=457a38dc-b1a5-4735-816f-65e6840ada4a (api:54)

This flow runs only on the SPM. Tested with latest CentOS 7.6.

On the other host in this setup, we see these LVM commands related to this VG before this warning on the SPM host:

2019-07-12 15:02:30,610-0400 INFO (jsonrpc/4) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['metadata'] (lvm:1448)
2019-07-12 15:02:30,904-0400 INFO (jsonrpc/4) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['ids'] (lvm:1448)
2019-07-12 15:02:31,092-0400 INFO (jsonrpc/4) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['leases'] (lvm:1448)
2019-07-12 15:02:31,452-0400 INFO (jsonrpc/4) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['inbox', 'outbox', 'xleases', 'master'] (lvm:1448)
2019-07-12 15:02:39,893-0400 INFO (jsonrpc/2) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['c8ab8b74-dfe7-4306-9924-bc970d4a1074'] (lvm:1448)
2019-07-12 15:02:40,718-0400 INFO (jsonrpc/4) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['c8ab8b74-dfe7-4306-9924-bc970d4a1074'] (lvm:1456)
2019-07-12 15:02:41,415-0400 INFO (jsonrpc/0) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['122328ea-bf51-43a6-8b10-fb0800aaa832'] (lvm:1448)
2019-07-12 15:02:42,159-0400 INFO (jsonrpc/3) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['122328ea-bf51-43a6-8b10-fb0800aaa832'] (lvm:1456)
2019-07-12 15:04:13,009-0400 INFO (jsonrpc/0) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['c8ab8b74-dfe7-4306-9924-bc970d4a1074'] (lvm:1448)
2019-07-12 15:04:14,238-0400 INFO (jsonrpc/5) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['c8ab8b74-dfe7-4306-9924-bc970d4a1074'] (lvm:1456)
2019-07-12 15:04:38,777-0400 INFO (jsonrpc/5) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['122328ea-bf51-43a6-8b10-fb0800aaa832'] (lvm:1448)
2019-07-12 15:04:40,551-0400 INFO (jsonrpc/4) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['122328ea-bf51-43a6-8b10-fb0800aaa832'] (lvm:1456)
2019-07-12 15:08:37,610-0400 INFO (jsonrpc/4) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['b798074f-c891-459f-aafe-fb54ce6d0134', 'c8ab8b74-dfe7-4306-9924-bc970d4a1074'] (lvm:1448)
2019-07-12 15:08:38,313-0400 INFO (jsonrpc/0) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['b798074f-c891-459f-aafe-fb54ce6d0134'] (lvm:1456)
2019-07-12 15:09:40,533-0400 INFO (vm/762f2d6f) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['122328ea-bf51-43a6-8b10-fb0800aaa832'] (lvm:1448)
2019-07-12 15:10:34,554-0400 INFO (mailbox-hsm/2) [storage.LVM] Refreshing LVs (vg=91047f09-f2d7-4453-8e11-483fed09556d, lvs=('122328ea-bf51-43a6-8b10-fb0800aaa832',)) (lvm:1472)
2019-07-12 15:11:20,133-0400 INFO (migsrc/762f2d6f) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['034748fe-3c34-4d05-88d4-00a2a3856513'] (lvm:1448)
2019-07-12 15:11:34,255-0400 INFO (migsrc/762f2d6f) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['034748fe-3c34-4d05-88d4-00a2a3856513'] (lvm:1456)
2019-07-12 15:11:34,491-0400 INFO (migsrc/762f2d6f) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['89750db5-703b-4da3-a971-a831422c319e'] (lvm:1448)
2019-07-12 15:11:35,045-0400 INFO (migsrc/762f2d6f) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['89750db5-703b-4da3-a971-a831422c319e'] (lvm:1456)
2019-07-12 15:11:35,397-0400 INFO (jsonrpc/4) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['122328ea-bf51-43a6-8b10-fb0800aaa832'] (lvm:1456)
2019-07-12 15:11:41,532-0400 INFO (jsonrpc/0) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['89750db5-703b-4da3-a971-a831422c319e'] (lvm:1448)
2019-07-12 15:11:42,054-0400 INFO (jsonrpc/0) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['89750db5-703b-4da3-a971-a831422c319e'] (lvm:1456)
2019-07-12 15:11:42,629-0400 INFO (vm/762f2d6f) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['122328ea-bf51-43a6-8b10-fb0800aaa832'] (lvm:1448)
2019-07-12 15:11:44,411-0400 INFO (vm/762f2d6f) [storage.LVM] Activating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['034748fe-3c34-4d05-88d4-00a2a3856513'] (lvm:1448)
2019-07-12 15:11:55,847-0400 INFO (vm/762f2d6f) [storage.LVM] Deactivating lvs: vg=91047f09-f2d7-4453-8e11-483fed09556d lvs=['034748fe-3c34-4d05-88d4-00a2a3856513'] (lvm:1456)

Is it possible that activating, deactivating, or refreshing another LV from another host caused LVM to try to fix the VG metadata without logging any warnings?

"found metadata seqno 31 vs previous 30" -- these messages usually mean that the metadata has changed between scanning consecutive metadata areas for the VG. This is during the label scan phase where VG locks are not held, so it could happen legitimately from time to time if you're reading a VG while it's being written from another host. LVM will not try to repair when it sees this.

I think I should look at complete debug logs of the LVM commands running during one of these events, to check exactly what's happening. It would be best to log lvm commands from all hosts during this event in case there are any changes being made that we don't expect.

Hi Nir,

Lots of work has been done on this bug so far, but no fix yet. With the work done on bug 1553133, could there be any progress on this BZ as well?

This looks like a duplicate of bug 1553133.

*** This bug has been marked as a duplicate of bug 1553133 ***

Removing stale needinfos.