Bug 1715026
| Summary: | [downstream clone - 4.3.5] After live merge, top volume still has INTERNAL volume type due to inability to read LVM metadata properly | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Avihai <aefrat> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.8 | CC: | aefrat, dfediuck, frolland, lsurette, mkalinin, mwest, nsoffer, srevivo, teigland, tnisan, ycui |
| Target Milestone: | ovirt-4.3.8 | Keywords: | Triaged, ZStream |
| Target Release: | --- | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | vdsm-4.30.20 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1693075 | Environment: | |
| Last Closed: | 2020-01-13 15:17:54 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1553133, 1693075 | | |
| Bug Blocks: | | | |
Description (RHV bug bot, 2019-05-29 12:22:20 UTC)
This also seems to have caused the metadata not being written at all for a newly created volume. I'll update with more details tomorrow when I finish that one.

(Originally by Germano Veit Michel)

(In reply to Germano Veit Michel from comment #0)
> Differently to BZ1553133, here the VG is not getting corrupted, it just
> fails to properly read the metadata for short periods of time.

Not entirely sure, actually. But here the seqno of the metadata is sequential; there are no jumps. Still, there are not that many copies in the first 128MB, as each is 1.6MB.

(Originally by Germano Veit Michel)

Nir, this seems like the same root cause as in bug 1553133. Although the outcome is a bit different (no corruption), the solution should fix both since we change the locking mechanism, right?

(Originally by Tal Nisan)

This seems like the temporary random failures reported by QE.

David, is this a result of lvm on another host trying to "fix" the metadata during lvchange --refresh or another command that should not change the metadata?

(Originally by Nir Soffer)

(In reply to Nir Soffer from comment #6)
> This seems like the temporary random failures reported by QE.
>
> David, is this a result of lvm on another host trying to "fix" the metadata
> during lvchange --refresh or another command that should not change the
> metadata?

While one host is writing VG metadata (as a normal change, not fixing anything), if another host tries to read the VG (e.g. lvs) at the same time, the lvs may see bad VG metadata and report errors. The locking_type 4 change from bug 1553133 should prevent the 'lvs' from "fixing" any transient problem it sees.

It shouldn't be difficult to test this with lvm directly. On one host run a loop that does repeated 'lvcreate' in the VG, and on the second host run a loop that does repeated 'lvs' in the same VG. I expect you may see the 'lvs' report errors periodically, especially when the VG metadata grows large.

(Originally by David Teigland)

(In reply to David Teigland from comment #7)
> (In reply to Nir Soffer from comment #6)

In this case we see an error in an lvs command on the SPM - the only host adding, removing, or extending LVs.

Is it possible that we had:

1. host A: create/extend/delete/change tags -> writing new metadata
2. host B: run lvs/vgs/pvs/lvchange --refresh -> try to fix inconsistent metadata
3. host A: run lvs -> see inconsistent metadata from bad fix by host B, fix it again

Steps 1 and 2 may happen in parallel.

(Originally by Nir Soffer)

> Is it possible that we had:
>
> 1. host A: create/extend/delete/change tags -> writing new metadata
> 2. host B: run lvs/vgs/pvs/lvchange --refresh -> try to fix inconsistent
> metadata
> 3. host A: run lvs -> see inconsistent metadata from bad fix by host B, fix
> it again
>
> Steps 1 and 2 may happen in parallel.
If host B was able to modify the metadata, then host A could report errors and try to fix things again. If the metadata buffer is large enough, you might be able to see evidence of those changes from previous copies of the metadata in the buffer.
(Originally by David Teigland)
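
A minimal sketch of the two-host stress test David describes, assuming a shared VG visible on both hosts; the VG name, LV size, and iteration counts are hypothetical:

```python
# A minimal sketch of the stress test described above. Assumptions
# (hypothetical): a shared VG named "test-vg" is visible on both hosts;
# the LV size and iteration counts are arbitrary.
import subprocess

VG = "test-vg"  # hypothetical shared VG name

def writer(iterations=500):
    """Run on host A: each lvcreate rewrites and grows the VG metadata."""
    for i in range(iterations):
        subprocess.run(
            ["lvcreate", "--name", "stress%04d" % i, "--size", "128m", VG],
            check=True)

def reader(iterations=5000):
    """Run on host B: repeatedly read the VG and report anything lvs
    writes to stderr, e.g. transient metadata errors seen mid-write."""
    for _ in range(iterations):
        res = subprocess.run(["lvs", VG], capture_output=True, text=True)
        if res.stderr:
            print("lvs stderr:", res.stderr.strip())
```

With the read-only locking_type 4 from bug 1553133 applied on the reading host (e.g. lvs --config 'global {locking_type=4}'), any transient inconsistency should only be reported, not "fixed".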
David, do we have a way to detect when lvm tries to fix inconsistent metadata? Can we enable specific debug logs that will reveal when this happens?

We already collect anything lvm writes to stderr. Currently this goes to a debug log that is disabled by default, but we can change it to log warnings if this can help to diagnose such issues.

(Originally by Nir Soffer)

Is the code not using the locking_type 4 change from bug 1553133, which should prevent the unwanted repair? If not, then you would probably see this:

    WARNING: Inconsistent metadata found for VG %s - updating to use version %u

(Originally by David Teigland)

(In reply to David Teigland from comment #11)
> Is the code not using the locking_type 4 change from bug 1553133 which
> should prevent the unwanted repair?

Not yet; integrating locking_type 4 is not that easy.

> If not, then you would probably see this:
> WARNING: Inconsistent metadata found for VG %s - updating to use version %u

Sounds good, I'll change the logging to log lvm stderr as warnings.

(Originally by Nir Soffer)

I think the attached patch is the best we can do for now. With the improved log, we will see warnings when lvm tries to fix metadata.

I think we should deliver the improved log in 4.3.5 and reopen this bug when we have more data from users.

Avihai, I think your team filed one or more bugs about the same issue, which may be a duplicate of this bug.

(Originally by Nir Soffer)

Possible duplicate: bug 1637405

(Originally by Nir Soffer)

(In reply to Nir Soffer from comment #13)
> I think the attached patch is the best we can do for now. With the improved
> log, we will see warnings when lvm tries to fix metadata.
>
> I think we should deliver the improved log in 4.3.5 and reopen this bug
> when we have more data from users.
>
> Avihai, I think your team filed one or more bugs about the same issue,
> which may be a duplicate of this bug.

In the bug you mentioned, the deleteVolume VDSM error looks different, so I'm not sure about its similarity.

VDSM:

    2019-05-04 21:55:30,237+0300 ERROR (tasks/7) [storage.TaskManager.Task] (Task='3f8954bd-093f-4dff-a69e-266965267a33') Unexpected error (task:875)
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
        return fn(*args, **kargs)
      File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
        return self.cmd(*self.argslist, **self.argsdict)
      File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
        return method(self, *args, **kwargs)
      File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1967, in deleteVolume
        vol.delete(postZero=postZero, force=force, discard=discard)
      File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 588, in delete
        self.validateDelete()

However, bug 1666795, which was resolved in 4.3.2, shows similar errors in the VDSM log (VolumeDoesNotExist); maybe that one is more related?

(Originally by Avihai Efrat)

Nir,

Please provide a clear scenario so we can test/ack it.

Also, I understand the improved log will be available in 4.3.5, so this issue can only be fixed in 4.3.6; please retarget accordingly.

From clone bug 1693075 c#18 I understand that once we have the improved log in 4.3.5, we should see if the issue reproduces in our regression TierX tests (as there is no clear scenario) and provide the logs so you can debug and fix this issue. Please ack if this is the plan.

Also, as this issue will only be fixed in 4.3.6 or later, please retarget accordingly.
(In reply to Avihai from comment #18)
> From clone bug 1693075 c#18 I understand that once we have the improved log
> in 4.3.5, we should see if the issue reproduces in our regression TierX
> tests (as there is no clear scenario) and provide the logs so you can debug
> and fix this issue.

Yes.

> Also, as this issue will only be fixed in 4.3.6 or later, please retarget
> accordingly.

No, we plan to close this in 4.3.5, since we don't have enough data to do anything useful with this. We will reopen the bug if we get more data.

Can you add the missing ack? We need it to merge the patches to 4.3.

(In reply to Nir Soffer from comment #19)
> No, we plan to close this in 4.3.5, since we don't have enough data to
> do anything useful with this.

The fixes you posted improve the logs; they do not fix the customer issue, meaning the fix for the customer issue depends on those log patches. If I QA_ACK this bug now, it will go to ON_QA without a proper fix for the customer issue itself, only with the log improvement patches. This is bad practice, which is why I suggest opening a new bug for the LVM log improvement patches and marking this bug as 'depends on' that new bug.

> We will reopen the bug if we get more data.
> Can you add the missing ack? We need it to merge the patches to 4.3.

To QA_ACK this bug (the customer issue) we need patches that fix the issue itself, not the improved LVM logs (which should be in a separate bug). So to ack we need:

1) The improved-logs bug, which you need to open, to be verified in 4.3.5 (please open a new bug with the LVM log improvement patches from this bug). This bug currently depends on the improved LVM logs, which is why they should be tracked as a separate bug and not mixed into this one.
2) A clear scenario/reproducer - can you provide one?
3) A fix for the customer issue - currently missing (which is why I suggest retargeting this bug to 4.3.6).

Moving back to ASSIGNED.

In short, this patch does not fix the issue but only improves logging so we can understand it. See my previous comment for details.

Logs were added. If the issue is encountered again, please add the new logs here and on bug 1693075.