Bug 1439984
| Field | Value |
|---|---|
| Summary | Don't try to deactivate in use LVs |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | Germano Veit Michel <gveitmic> |
| Component | vdsm |
| Assignee | Vojtech Juranek <vjuranek> |
| Status | CLOSED DEFERRED |
| QA Contact | Avihai <aefrat> |
| Severity | medium |
| Priority | medium |
| Version | 4.0.7 |
| CC | amarchuk, ebenahar, eshenitz, gveitmic, lsurette, mkalinin, srevivo, tnisan, ycui |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Last Closed | 2021-09-29 11:33:06 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | Storage |
| Cloudforms Team | --- |
| Bug Depends On | 1213786 |
| Bug Blocks | 902971, 1520566 |
Description (Germano Veit Michel, 2017-04-07 03:02:23 UTC)
Created attachment 1269541: vdsm debug logs

Reproduced both the create-from-template (step 5) and clone-VM (step 6) scenarios again, now with vdsm debug logs enabled. The LVs/volume IDs won't match those in the bug description, except for the base (parent) one, but it's the exact same scenario.
Raz, were those seen in QE?

(In reply to Raz Tamir from comment #3)
> Yes,
> In https://bugzilla.redhat.com/show_bug.cgi?id=1433052#c11

That's a specific bug in the new amend flow. While it's similar to the issue reported here, it's not the same.

The problem here is that the virt subsystem and the storage subsystem are separate. While the storage subsystem tries to do the right thing (tear down the volumes it prepared), it lacks the virt knowledge to understand that they are in use by running VMs. Moving these flows to SPDM (a goal for 4.2), where the engine manages preparing and tearing down instead of having this code mangled inside the storage operation, should resolve the issue.
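As a rough illustration of the missing guard (do not tear down a volume that a running VM still holds open), the check could look like the sketch below. This is a hypothetical example, not vdsm's actual code and not the engine-managed approach proposed above; the helper names and the direct lvs/lvchange calls are assumptions made only for illustration.

```python
# Hypothetical illustration, not vdsm's actual code: skip deactivation of LVs
# that are still open, for example because a running VM holds the device.
import subprocess


def lv_is_open(vg_name, lv_name):
    """Return True if LVM reports the LV device as open (lv_attr bit 6 is 'o')."""
    out = subprocess.check_output(
        ["lvs", "--noheadings", "-o", "lv_attr", "%s/%s" % (vg_name, lv_name)])
    attr = out.decode().strip()
    return len(attr) >= 6 and attr[5] == "o"


def deactivate_lv(vg_name, lv_name):
    """Deactivate the LV only when nothing has it open."""
    if lv_is_open(vg_name, lv_name):
        # An in-use LV must not be torn down; skipping it avoids the false
        # errors this bug is about.
        return
    subprocess.check_call(["lvchange", "-an", "%s/%s" % (vg_name, lv_name)])
```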
Thanks for clarifying.
We have seen this issue since 4.1.1.5/6 and I have been told it is covered by the bug mentioned in comment #3.
Just to make sure this gets the right attention: this bug is failing a lot of our automation tests, as most of our VMs are created from a template as a thin copy.

(In reply to Raz Tamir from comment #6)
> Thanks for clarifying,
> We saw this issue since 4.1.1.5/6 and I have been told this is covered by
> the bug mentioned in comment #3.
> Just to make sure this gets the right attention, this bug is failing a lot of
> our automation tests as most of our VMs are created from a template as thin
> copy

Why is it failing many of them? Say it is created wrongly, why should it affect many tests?

We are creating a new VM for (almost) each test. It saves the time spent removing snapshots, attached disks, etc. in the teardown phase.

(In reply to Raz Tamir from comment #6)
> Thanks for clarifying,
> We saw this issue since 4.1.1.5/6 and I have been told this is covered by
> the bug mentioned in comment #3.
> Just to make sure this gets the right attention, this bug is failing a lot of
> our automation tests as most of our VMs are created from a template as thin
> copy

Raz, can you share an example of such a failure? The issue described in this bug has probably existed in the codebase for quite some time (possibly since day one?). Let's make sure we aren't missing something.

Allon,
We don't see this anymore in our environments. What has changed is that our environments are nested and used to run on physical hosts running RHEL 7.3. We found some issues with nested virtualization on RHEL 7.3 hosts (bugs opened), moved to RHEL 7.2, and haven't seen this failure since then.

(In reply to Raz Tamir from comment #10)
> We found some issues with nested on rhel 7.3 hosts (bugs opened) and we
> moved to work with rhel 7.2 and haven't seen this failure since then

Hmmm, my reproducer above (oVirt 4.1) was running nested on 7.2 hosts. I had some lockups on 7.3 nested, then downgraded the kernel. Maybe I'm not imaginative enough, but I don't think this relates to nesting. Let me know if you need more tests.

Do we have any news on this? I don't see why the LVM behaviour should be related to nested hosts, or to which version the nested hosts are running. Do we have a clear understanding of the issue?

This looks like the same root cause as bug 1395941. Should be handled together.

(In reply to Allon Mureinik from comment #13)
> This looks like the same root cause as bug 1395941.

Which depends on a platform bug that is now targeted to RHEL 7.5.

> Should be handled together.

So it should be moved to 4.3? Is it really a blocker?

(In reply to Yaniv Kaul from comment #14)
> (In reply to Allon Mureinik from comment #13)
> > This looks like the same root cause as bug 1395941.
> Which depends on a platform bug that is now targeted to RHEL 7.5.

There's some discussion going on there, let's wait and see if we get it any earlier.

> > Should be handled together.
> So it should be moved to 4.3?

We'll see. If we can't solve it earlier, yes.

> Is it really a blocker?

IMHO, no.

(In reply to Allon Mureinik from comment #15)
> > > Should be handled together.
> > So it should be moved to 4.3?
> We'll see.
> If we can't solve it earlier, yes.

We're waiting for RHEL 7.5 for this, pushing out.

(In reply to Allon Mureinik from comment #13)
> This looks like the same root cause as bug 1395941.
> Should be handled together.

Allon, is this the correct bug? It does not seem relevant, but I might be missing something.

(In reply to Marina from comment #19)
> (In reply to Allon Mureinik from comment #13)
> > This looks like the same root cause as bug 1395941.
> > Should be handled together.
>
> Allon, is this the correct bug? It does not seem relevant, but I might be
> missing something.

Until qemu-img 2.10 (delivered in RHEL 7.5), creating a qcow2 layer required the entire backing chain, up to the root node, to be visible to the qemu-img process (i.e., the LVs had to be activated). This limitation is what caused the slowdown described in bug 1395941 (having to activate and deactivate multiple LVs), and it is, at least partially, the root cause of this bug (having to activate the same LV for multiple, uncoordinated flows).

In RHV we do not need qemu-img's safety features: the engine knows all the parameters of the new qcow layer, such as the size and the name of the backing volume, and sends them to vdsm, which calls qemu-img to create the new layer. With qemu-img 2.10 we can avoid activating the backing LVs, and circumvent the need to coordinate these activations and deactivations entirely.

Cloning from a template/VM/snapshot is a slightly trickier flow, since you really do need to activate the source volumes in order to copy the data from them. This will require some design work, and we need to see whether we prioritize it for 4.3 or not.
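To make the qemu-img point above concrete, a minimal sketch of creating a qcow2 layer without touching the (inactive) backing LV is shown below. This is a hypothetical example, not vdsm's actual command line; it assumes a qemu-img new enough (2.10 or later) to support the -u option, which skips opening the backing file, so the size and backing format have to be passed explicitly instead of being probed.

```python
# Hypothetical sketch, not vdsm's actual code.  With a sufficiently new
# qemu-img, the -u flag means the backing file is not opened, so the backing
# LV can stay deactivated; size and backing format are given explicitly.
import subprocess


def create_qcow2_layer(new_path, backing_path, size_bytes):
    subprocess.check_call([
        "qemu-img", "create",
        "-f", "qcow2",
        "-o", "backing_file=%s,backing_fmt=qcow2" % backing_path,
        "-u",  # do not open or verify the backing chain
        new_path,
        str(size_bytes),
    ])
```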
Germano,
Does the explanation from Allon make sense to you? Perhaps we don't need this bug implemented, or is it still relevant? If yes, how widely does it affect RHV customers now, or is the impact mainly on the supportability side of things?

Hi Marina,

(In reply to Marina from comment #21)
> Does the explanation from Allon make sense to you and maybe we don't need
> to have this bug implemented or your bug is still relevant?

It makes sense, but it does not make the bug less relevant.

> If yes - how widely does it affect RHV customers now? Or the impact is mainly on
> supportability side of things?

There is no actual effect on a production scenario. It is just vdsm doing something it should not do in the first place and then polluting the logs with false errors. I think at this point medium severity is more appropriate.

This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Addressing this bug will make it easier to troubleshoot storage-related issues in the logs. However, it is of medium severity.

This bug/RFE didn't get enough attention so far and is now flagged as pending close. Please review whether it is still relevant and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.

This bug didn't get any attention in a long time, and it's not planned for the foreseeable future. The oVirt development team has no plans to work on it. Please feel free to reopen if you have a plan for how to contribute this feature/bug fix.