+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1693075 +++
======================================================================

Description of problem:

Current chain:
xxx < 20d6814c < 7a5d99a4 (top)

Live Merge result:
xxx < 20d6814c (top)

Problem: 20d6814c VOLTYPE is set to INTERNAL instead of LEAF.

Checking the logs, it seems this is a result of the deleteVolume on the SPM failing to do [1]

2019-03-22 07:24:32,251+1300 INFO  (jsonrpc/1) [vdsm.api] START deleteVolume(sdUUID=u'0b0107b2-9d2a-4871-b4d6-76bdffba84cd', spUUID=u'58117215-0261-0167-0311-0000000002b6', imgUUID=u'778f6bf5-77e8-4f75-a41d-a1d5341f0c3d', volumes=[u'7a5d99a4-ad44-41f1-94cd-1d67b3d8c13c'], postZero=u'false', force=u'false', discard=False) from=::ffff:10.1.224.10,59366, flow_id=dcdde73b-3b1f-4259-89c1-3b63ff21a67a, task_id=478b1e32-aea5-4710-850a-8b525e5c165f (api:46)

2019-03-22 07:24:56,108+1300 WARN  (tasks/2) [storage.Volume] cannot finalize parent volume 20d6814c-1234-480d-8128-17fee729927d (blockVolume:625)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 617, in delete
    self.setParent(sc.BLANK_UUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1494, in setParent
    self.setParentMeta(puuid)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 756, in setParentMeta
    return self._manifest.setParentMeta(puuid)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 270, in setParentMeta
    self.setMetaParam(sc.PUUID, puuid)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 327, in setMetaParam
    meta = self.getMetadata()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 100, in getMetadata
    metaId = self.getMetadataId()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 81, in getMetadataId
    return (self.sdUUID, self.getMetaOffset())
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 85, in getMetaOffset
    md = getVolumeTag(self.sdUUID, self.volUUID, sc.TAG_PREFIX_MD)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 773, in getVolumeTag
    tags = lvm.getLV(sdUUID, volUUID).tags
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 970, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
LogicalVolumeDoesNotExistError: Logical volume does not exist: (u'0b0107b2-9d2a-4871-b4d6-76bdffba84cd/7a5d99a4-ad44-41f1-94cd-1d67b3d8c13c',)

This happened because the "lvm lvs" preceding it failed. And it failed with a very weird error:

2019-03-22 07:24:56,107+1300 WARN  (tasks/2) [storage.LVM] lvm lvs failed: 5 [] ['  Metadata on /dev/mapper/36000d31000e8a600000000000000000c at 27487763185152 has wrong VG name "= [', '"pv1", 167317', ']', '}', 'segment9 {', 'start_extent = 88', 'extent_count = 8', '', 'type = "striped"', 'stripe_count = 1', '', 'stripes = [', '"pv1", 16751" expected 0b0107b2-9d2a-4871-b4d6-76bdffba84cd.', '  Volume group "0b0107b2-9d2a-4871-b4d6-76bdffba84cd" not found', '  Cannot process volume group 0b0107b2-9d2a-4871-b4d6-76bdffba84cd'] (lvm:431)
2019-03-22 07:24:56,107+1300 WARN  (tasks/2) [storage.LVM] lv: 7a5d99a4-ad44-41f1-94cd-1d67b3d8c13c not found in lvs vg: 0b0107b2-9d2a-4871-b4d6-76bdffba84cd response (lvm:627)

So the VG name was "= [', '"pv1", 167317', ']', '}', 'segm....." which is actually a random piece of LVM metadata on that PV.
* The LVM metadata for this SD is 1.6MB (1300 LVs, high fragmentation of LV segments)
* There are several snapshots created/deleted around these times (Commvault)

Unlike BZ1553133, here the VG is not getting corrupted; it just fails to properly read the metadata for short periods of time.

It looks like there is some improper locking of lvm commands, or it is reading the lvm metadata while something else is still writing?

[1] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/storage/blockVolume.py#L616

Version-Release number of selected component (if applicable):
vdsm-4.20.46-1.el7ev.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64

How reproducible:
Not yet

(Originally by Germano Veit Michel)
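[Editorial note: for context on the 1.6MB figure above, a minimal sketch of how the VG metadata usage could be estimated with vgs. This is not vdsm code; it assumes the vg_mda_size and vg_mda_free report fields are available in this lvm2 version, and approximates the used metadata size as their difference.]

import subprocess

def vg_metadata_usage(vg_name):
    # Report the metadata area size and the free space left in it, in bytes;
    # the difference approximates how large the VG metadata text has grown.
    out = subprocess.check_output([
        "vgs", "--noheadings", "--units", "b", "--nosuffix",
        "-o", "vg_mda_size,vg_mda_free", vg_name]).decode()
    size, free = (int(float(x)) for x in out.split())
    return size - free

# Example (VG name taken from the logs above):
# print(vg_metadata_usage("0b0107b2-9d2a-4871-b4d6-76bdffba84cd"))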
This also seems to have caused the metadata not being written at all for a newly created volume; I'll update with more details tomorrow when I finish analyzing that one.

(Originally by Germano Veit Michel)
(In reply to Germano Veit Michel from comment #0)
> Unlike BZ1553133, here the VG is not getting corrupted; it just fails to
> properly read the metadata for short periods of time.

Not entirely sure actually. But here the seqno of the metadata is sequential; there are no jumps. Still, there are not that many copies in the first 128MB, as each one is 1.6MB.

(Originally by Germano Veit Michel)
Nir, this seems like the same root cause as in bug 1553133, although the outcome is a bit different (no corruption). The solution should fix both since we change the locking mechanism, right?

(Originally by Tal Nisan)
This seems like the temporary random failures reported by QE.

David, is this a result of lvm on another host trying to "fix" the metadata during lvchange --refresh or some other command that should not change the metadata?

(Originally by Nir Soffer)
(In reply to Nir Soffer from comment #6)
> This seems like the temporary random failures reported by QE.
>
> David, is this a result of lvm on another host trying to "fix" the metadata
> during lvchange --refresh or some other command that should not change the
> metadata?

While one host is writing VG metadata (as a normal change, not fixing anything), if another host tries to read the VG (e.g. lvs) at the same time, the lvs may see bad VG metadata and report errors. The locking_type 4 change from bug 1553133 should prevent the 'lvs' from "fixing" any transient problem it sees.

It shouldn't be difficult to test this with lvm directly. On one host run a loop that does repeated 'lvcreate' in the VG, and on the second host run a loop that does repeated 'lvs' in the same VG, as sketched below. I expect you may see the 'lvs' report errors periodically, especially when the VG metadata grows large.

(Originally by David Teigland)
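[Editorial note: a rough sketch of that two-host test, written as Python wrappers around the lvm commands. The VG name is a placeholder, and an lvremove is added so the writer loop can run indefinitely; this is an illustration, not an official reproducer.]

import subprocess
import sys

VG = "0b0107b2-9d2a-4871-b4d6-76bdffba84cd"  # placeholder shared VG name

def writer_loop(iterations=100):
    # Host A: repeatedly create and remove small LVs so the VG metadata is
    # rewritten over and over.
    for i in range(iterations):
        lv = "test-lv-%d" % i
        subprocess.check_call(["lvcreate", "--name", lv, "--size", "128m", VG])
        subprocess.check_call(["lvremove", "-f", "%s/%s" % (VG, lv)])

def reader_loop(iterations=1000):
    # Host B: repeatedly read the VG; watch stderr for transient
    # "wrong VG name" / "Inconsistent metadata" errors.
    for _ in range(iterations):
        p = subprocess.Popen(["lvs", VG], stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
        out, err = p.communicate()
        if p.returncode != 0 or err:
            sys.stderr.write("lvs rc=%d stderr=%r\n" % (p.returncode, err))

if __name__ == "__main__":
    if sys.argv[1:] == ["writer"]:
        writer_loop()
    else:
        reader_loop()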
(In reply to David Teigland from comment #7)
> (In reply to Nir Soffer from comment #6)

In this case we see an error in the lvs command on the SPM - the only host adding, removing or extending LVs.

Is it possible that we had:

1. host A: create/extend/delete/change tags -> writing new metadata
2. host B: run lvs/vgs/pvs/lvchange --refresh -> try to fix inconsistent metadata
3. host A: run lvs -> see inconsistent metadata from bad fix by host B, fix it again

Steps 1 and 2 may happen in parallel.

(Originally by Nir Soffer)
> Is it possible that we had:
>
> 1. host A: create/extend/delete/change tags -> writing new metadata
> 2. host B: run lvs/vgs/pvs/lvchange --refresh -> try to fix inconsistent
>    metadata
> 3. host A: run lvs -> see inconsistent metadata from bad fix by host B, fix
>    it again
>
> Steps 1 and 2 may happen in parallel.

If host B was able to modify the metadata, then host A could report errors and try to fix things again. If the metadata buffer is large enough, you might be able to see evidence of those changes from previous copies of the metadata in the buffer.

(Originally by David Teigland)
David, do we have a way to detect when lvm tries to fix inconsistent metadata? Can we enable specific debug logs that will reveal when this happens?

We already collect anything lvm writes to stderr. Currently this goes to a debug log that is disabled by default, but we can change it to log warnings if this can help to diagnose such issues.

(Originally by Nir Soffer)
Is the code not using the locking_type 4 change from bug 1553133, which should prevent the unwanted repair?

If not, then you would probably see this:

  WARNING: Inconsistent metadata found for VG %s - updating to use version %u

(Originally by David Teigland)
(In reply to David Teigland from comment #11)
> Is the code not using the locking_type 4 change from bug 1553133, which
> should prevent the unwanted repair?

Not yet, integrating locking_type 4 is not that easy.

> If not, then you would probably see this:
>   WARNING: Inconsistent metadata found for VG %s - updating to use version %u

Sounds good, I'll change the logging to log lvm stderr as a warning.

(Originally by Nir Soffer)
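[Editorial note: for illustration only - this is not the attached vdsm patch. A minimal sketch of the idea: run an lvm command and promote anything it writes to stderr from debug to warning level, so metadata "fixes" show up in the default log. The function name and error handling are hypothetical; only the logger name is taken from the log lines above.]

import logging
import subprocess

log = logging.getLogger("storage.LVM")

def run_lvm(args):
    # Run "lvm <subcommand> ..." and capture both streams.
    p = subprocess.Popen(["lvm"] + args, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    out, err = p.communicate()
    if err:
        # Previously this would have gone to a debug log that is disabled by
        # default; logging it as a warning makes repairs visible to users.
        log.warning("lvm %s wrote to stderr: %r", args[0], err)
    if p.returncode != 0:
        raise RuntimeError("lvm %s failed rc=%d" % (args[0], p.returncode))
    return out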
I think the attached patch is the best we can do for now. With the improved log, we will see warnings when lvm tries to fix metadata.

I think we should deliver the improved log in 4.3.5, and reopen this bug when we have more data from users.

Avihai, I think your team filed one or more bugs about the same issue, which may be a duplicate of this bug.

(Originally by Nir Soffer)
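[Editorial note: as a hedged aid for collecting that data, something like the following could scan a vdsm log for the repair warning David quoted above. The helper, pattern and default log path are illustrative assumptions, not part of vdsm.]

import re
import sys

PATTERN = re.compile(r"WARNING: Inconsistent metadata found for VG (\S+)")

def find_metadata_fixes(path="/var/log/vdsm/vdsm.log"):
    # Return (vg_name, log_line) pairs for every line showing that lvm
    # rewrote ("fixed") VG metadata.
    hits = []
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                hits.append((m.group(1), line.rstrip()))
    return hits

if __name__ == "__main__":
    for vg, line in find_metadata_fixes(*sys.argv[1:]):
        print("VG %s: %s" % (vg, line))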
Possible duplicate: bug 1637405 (Originally by Nir Soffer)
(In reply to Nir Soffer from comment #13)
> I think the attached patch is the best we can do for now. With the improved
> log, we will see warnings when lvm tries to fix metadata.
>
> Avihai, I think your team filed one or more bugs about the same issue, which
> may be a duplicate of this bug.

In the bug you mentioned, the deleteVolume VDSM error looks different, so I'm not sure about its similarity.

VDSM:
2019-05-04 21:55:30,237+0300 ERROR (tasks/7) [storage.TaskManager.Task] (Task='3f8954bd-093f-4dff-a69e-266965267a33') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1967, in deleteVolume
    vol.delete(postZero=postZero, force=force, discard=discard)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 588, in delete
    self.validateDelete()

However, bug 1666795, which was resolved in 4.3.2, shows similar errors in the VDSM log (VolumeDoesNotExist); maybe that one is more related?

(Originally by Avihai Efrat)
Nir,

Please provide a clear scenario so we can test/ack it.

Also, I understand the improved log will be available in 4.3.5, so this issue can only be fixed in 4.3.6; please retarget accordingly.
From clone bug 1693075 c#18 I understand that once we have the improved log in 4.3.5, we should see if the issue reproduces in our regression TierX tests (as there is no clear scenario) and provide the logs so you can debug and fix this issue. Please ack if this is the plan.

Also, as this issue will only be fixed in 4.3.6 or later, please retarget accordingly.
(In reply to Avihai from comment #18)
> From clone bug 1693075 c#18 I understand that once we have the improved log
> in 4.3.5, we should see if the issue reproduces in our regression TierX
> tests (as there is no clear scenario) and provide the logs so you can debug
> and fix this issue.

Yes.

> Also, as this issue will only be fixed in 4.3.6 or later, please retarget
> accordingly.

No, we plan to close this in 4.3.5, since we don't have enough data to do anything useful with this. We will reopen the bug if we get more data.

Can you add the missing ack? We need it to merge the patches to 4.3.
(In reply to Nir Soffer from comment #19)
> (In reply to Avihai from comment #18)
> > From clone bug 1693075 c#18 I understand that once we have the improved log
> > in 4.3.5, we should see if the issue reproduces in our regression TierX
> > tests (as there is no clear scenario) and provide the logs so you can debug
> > and fix this issue.
>
> Yes.
>
> > Also, as this issue will only be fixed in 4.3.6 or later, please retarget
> > accordingly.
>
> No, we plan to close this in 4.3.5, since we don't have enough data to
> do anything useful with this.

The fixes you posted are about improving the logs, not about fixing the customer issue, meaning the customer issue depends on those log patches. If I QA_ACK this bug now, it will go to 'ON_QA' without a proper fix to the customer issue itself, only with the log improvement patches. This is bad practice, and this is why I suggest opening a new bug for the LVM log improvement patches and marking this bug as 'depends on' that new bug.

> We will reopen the bug if we get more data.
> Can you add the missing ack? We need it to merge the patches to 4.3.

To QA_ACK this bug (the customer issue) we need patches that fix the issue, not the improved LVM logs (which should be in a separate bug).

So to ack we need:
1) A new bug for the improved logs, to be verified in 4.3.5 (please open a new bug with the LVM log improvement patches from this bug). Currently this bug depends on the improved LVM logs bug, which is why it should be opened separately and not mixed with this one.
2) A clear scenario/reproducer - can you provide one?
3) A fix to the customer issue - currently missing (and this is why I'm suggesting to retarget this bug to 4.3.6).
Moving back to ASSIGNED.

In short, this patch does not fix the issue but only improves logging so we can understand it. See my previous comment for details.
Logs were added. If the issue is encountered again, please add the new logs here and on bug 1693075.