Bug 1376754
| Summary: | Host is set to ERROR mode (cannot start VM) while being in Maintenance | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Alexandros Gkesos <agkesos> |
| Component: | ovirt-engine | Assignee: | Arik <ahadas> |
| Status: | CLOSED ERRATA | QA Contact: | Israel Pinto <ipinto> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.0.2 | CC: | ahadas, apinnick, ipinto, jentrena, lsurette, mavital, mgoldboi, michal.skrivanek, mperina, mtessun, ppostler, rbalakri, Rhev-m-bugs, srevivo, tjelinek, ykaul |
| Target Milestone: | ovirt-4.2.0 | | |
| Target Release: | 4.2.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, if a host was placed in maintenance mode and migration was cancelled while at least three virtual machines were attempting to migrate to it, the host ended up in an ERROR state. In the current release, the host does not move into an ERROR state in this situation. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-15 17:38:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 4
Martin Perina
2016-09-16 12:31:47 UTC
Also cc-ing Arik, as I think Virt should look at it as well, based on the description. Arik, can you take a look?

---

(In reply to Martin Perina from comment #4)
> Eli, could you please investigate?

From looking at the log and the scenario, this seems like a pure Virt issue. The fact that a VM is, for some reason, attempted to run on a host that is already in maintenance points to a possible race: the host was set to maintenance although some tasks (migration?) were still running and had this host in the <running hosts> list.

---

It looks like a VIRT bug. Currently the logic is: when a VM is migrated successfully, increment the counter of each host that the VM failed to migrate to in the process (there are 2 rerun attempts by default). When this counter reaches 3 for a given host, set it to ERROR status.

In a scenario where a host that VMs are being migrated to is switched to maintenance, we cancel those incoming migrations. If there are 3 (or more) VMs that were migrating to the host, and they then manage to migrate successfully to another host, the host will be switched to ERROR (overriding its MAINTENANCE state). I believe we haven't encountered this before because:
1. Usually you don't migrate VMs to a host that is about to go into maintenance.
2. There is a timing issue: the host has to switch to maintenance first, and only then must the counter reach 3 and override that status.

We need to give this some thought, since I'm not sure whether it is better not to count migration failures caused by deliberate cancellations, or not to switch a host that is in maintenance into the ERROR state (or both).

---

The bug does not cause any corruption and happens only in special cases. Targeting 4.1.

---

I believe the best would be to remove the Error host state. It's a relic from the past, when we didn't have a proper scheduler; it was a simple mechanism to improve the selection of migration destination hosts. It shouldn't be needed anymore.

This is a more invasive change, not suitable for 4.1, but since it's a very rare occurrence and there is no real impact on functionality, I suggest deferring it to 4.2.

---

(In reply to Michal Skrivanek from comment #16)
> I believe the best would be to remove the Error host state. It's a relic
> from the past when we didn't have a proper scheduler, it was a simple
> mechanism to improve selection of migration destination hosts. It shouldn't
> be needed anymore
>
> This is a more invasive change not suitable for 4.1, but since it's a very
> rare occurrence and there is no real impact on functionality I suggest to
> defer to 4.2

I agree that such a change can be risky in the current state of 4.1; as long as it rarely happens in customer environments, we should defer it to 4.2.

---

(In reply to Michal Skrivanek from comment #16)
> This is a more invasive change not suitable for 4.1, but since it's a very
> rare occurrence and there is no real impact on functionality I suggest to
> defer to 4.2

Any updates on this?

---

(In reply to Yaniv Kaul from comment #19)
> Any updates on this?

For the same reasons as above, I would propose deferring it to 4.3.

---

(In reply to Tomas Jelinek from comment #20)
> for same reasons as above - I would propose to deferring it to 4.3

I don't like deferring it over and over, as we have known about this issue for quite some time. So if we have a chance of getting this fixed in 4.2, I would prefer it. If we need to delay it to 4.3, that is somewhat OK with me, but then we need to ensure we deliver it in 4.3. I would not like to see this deferred again; otherwise we should say we don't want to fix it in the near future, as it is a corner case only, and close it out.

To summarize: if we can get this solved in 4.2, we should do it. If that is really not possible, we can postpone it to 4.3, but we must fix it then.

---

Verified.
Software version: 4.2.0-0.5.master.el7
Steps:
1. Run 4 VMs on host_1.
2. Start migration of all VMs to host_2.
3. Set host_1 and host_2 to maintenance.
4. Check that the hosts switch to maintenance.
5. All VMs migrate to host_3.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488
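The counter logic described in the comments, and one possible shape of the fix, can be sketched as follows. This is a minimal illustration only: the class and method names (`FailedMigrationTracker`, `reportFailedDestination`, `HostStatus`) are hypothetical and are not actual ovirt-engine identifiers. It assumes the fix combines both ideas raised above: ignore deliberate cancellations, and never let the counter override MAINTENANCE.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative host states; the real engine has many more.
enum HostStatus { UP, MAINTENANCE, ERROR }

// Hypothetical sketch of the per-host failed-migration counter described
// in the comments: when a VM finally migrates successfully, each host it
// failed to migrate to along the way gets its counter incremented, and a
// host reaching 3 failures is switched to ERROR.
class FailedMigrationTracker {
    private static final int MAX_FAILURES = 3;

    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, HostStatus> status = new HashMap<>();

    void setStatus(String host, HostStatus s) {
        status.put(host, s);
    }

    HostStatus getStatus(String host) {
        return status.getOrDefault(host, HostStatus.UP);
    }

    // Called once per host that the VM failed to migrate to, after the VM
    // migrates successfully elsewhere. `deliberateCancel` marks failures
    // caused by the engine cancelling the migration (e.g. the destination
    // was switched to maintenance).
    void reportFailedDestination(String host, boolean deliberateCancel) {
        // Fix sketch: skip counting deliberate cancellations, and never
        // override a MAINTENANCE host with ERROR.
        if (deliberateCancel || getStatus(host) == HostStatus.MAINTENANCE) {
            return;
        }
        int count = failures.merge(host, 1, Integer::sum);
        if (count >= MAX_FAILURES) {
            setStatus(host, HostStatus.ERROR);
        }
    }
}
```

With this guard in place, the scenario from the report (3+ VMs whose incoming migrations are cancelled because the destination enters maintenance, then migrate successfully to a third host) leaves the destination in MAINTENANCE rather than flipping it to ERROR.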