Bug 1376754 - Host is set to ERROR mode (cannot start VM) while being in Maintenance
Summary: Host is set to ERROR mode (cannot start VM) while being in Maintenance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.0.2
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Arik
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-16 10:34 UTC by Alexandros Gkesos
Modified: 2021-06-10 11:33 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if a host was placed in maintenance mode and migration was cancelled while at least 3 virtual machines were attempting to migrate to it, the host ended up in an ERROR state. In the current release, the host does not move into an ERROR state in this situation.
Clone Of:
Environment:
Last Closed: 2018-05-15 17:38:43 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:1488 0 None None None 2018-05-15 17:40:57 UTC
oVirt gerrit 84531 0 master MERGED core: drop the error status of hosts 2017-11-22 15:10:13 UTC

Comment 4 Martin Perina 2016-09-16 12:31:47 UTC
Eli, could you please investigate?

Comment 5 Oved Ourfali 2016-09-18 07:36:57 UTC
Also cc-ing Arik, as I think Virt should look at it as well, based on the description.

Arik - can you take a look?

Comment 6 Eli Mesika 2016-09-18 09:40:09 UTC
(In reply to Martin Perina from comment #4)
> Eli, could you please investigate?

From looking at the log and the scenario, it seems like a pure Virt issue.
The fact that a VM was, for some reason, attempted to run on a host that is already in Maintenance points to a possible race: the host was set to maintenance although some tasks (migration?) were still running and still had this host in the <running hosts> list.

Comment 7 Arik 2016-09-18 12:48:37 UTC
It looks like a VIRT bug.

Currently the logic is: when a VM is migrated successfully, increment the counter of each host that the VM failed to migrate to in the process (there are 2 rerun attempts by default). When this counter reaches 3 for a given host, set that host to ERROR status.

In a scenario where a host that VMs are being migrated to is switched to maintenance, we cancel those incoming migrations. If there are 3 (or more) VMs that were migrating to that host and they then manage to migrate successfully to another host, the host will be switched to ERROR (overriding its MAINTENANCE state).

I believe we haven't encountered this yet because:
1. Usually you don't migrate VMs to a host that is about to go into maintenance.
2. There is a timing issue - the host has to switch to maintenance first, and only then does the counter reach 3 and override that status.

We need to put some thought into this, since I'm not sure whether it is better not to count migration failures caused by deliberate cancellations, or not to switch a host that is in maintenance into ERROR state (or both..).
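
To make the failure counting easier to follow, here is a minimal, hypothetical sketch of the logic described above. The names (HostErrorTracker, HostStatus, vmFailedOnHost) are illustrative only and are not the actual ovirt-engine classes or methods:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the per-host failure counter described in comment 7.
enum HostStatus { UP, PREPARING_FOR_MAINTENANCE, MAINTENANCE, ERROR }

class HostErrorTracker {
    // Default threshold: three failed run attempts move a host to ERROR.
    private static final int FAILED_RUNS_THRESHOLD = 3;

    private final Map<String, Integer> failedRuns = new ConcurrentHashMap<>();
    private final Map<String, HostStatus> hostStatus = new ConcurrentHashMap<>();

    // Called once a VM finally migrates successfully, for every host the VM
    // failed to migrate to along the way (there are 2 rerun attempts by default).
    void vmFailedOnHost(String hostId) {
        int failures = failedRuns.merge(hostId, 1, Integer::sum);
        if (failures >= FAILED_RUNS_THRESHOLD) {
            // The problem: incoming migrations cancelled because the host went
            // into maintenance are counted as failures too, so this assignment
            // can overwrite a MAINTENANCE status with ERROR.
            hostStatus.put(hostId, HostStatus.ERROR);
        }
    }
}

Either of the two options mentioned above would break this chain: skip counting migrations that were cancelled deliberately, or never overwrite the status of a host that is in maintenance.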

Comment 8 Tomas Jelinek 2016-09-20 06:49:37 UTC
The bug does not cause any corruption and happens only in special cases. Targeting 4.1.

Comment 16 Michal Skrivanek 2016-12-21 08:56:54 UTC
I believe the best would be to remove the Error host state. It's a relic from the past, when we didn't have a proper scheduler; it was a simple mechanism to improve the selection of migration destination hosts. It shouldn't be needed anymore.

This is a more invasive change, not suitable for 4.1, but since it's a very rare occurrence and there is no real impact on functionality, I suggest deferring it to 4.2.
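
As a rough illustration of this direction (the merged change is "core: drop the error status of hosts", gerrit 84531): removing the state entirely amounts to deleting the ERROR transition, while a less invasive guard on the hypothetical sketch from comment 7 could look like the code below. This is not the actual patch, only a sketch of the idea:

// Drop-in replacement for vmFailedOnHost() in the earlier HostErrorTracker sketch.
void vmFailedOnHost(String hostId, boolean migrationWasCancelled) {
    if (migrationWasCancelled) {
        // The migration was aborted on purpose (e.g. the destination host went
        // into maintenance); it says nothing about the host's health.
        return;
    }
    int failures = failedRuns.merge(hostId, 1, Integer::sum);
    HostStatus current = hostStatus.get(hostId);
    if (failures >= FAILED_RUNS_THRESHOLD
            && current != HostStatus.MAINTENANCE
            && current != HostStatus.PREPARING_FOR_MAINTENANCE) {
        hostStatus.put(hostId, HostStatus.ERROR);
    }
}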

Comment 18 Moran Goldboim 2016-12-28 22:00:42 UTC
(In reply to Michal Skrivanek from comment #16)
> I believe the best would be to remove the Error host state. It's a relic
> from the past when we didn't have a proper scheduler, it was a simple
> mechanism to improve selection of migration destination hosts. It shouldn't
> be needed anymore
> 
> This is a more invasive change not suitable for 4.1, but since it's a very
> rare occurrence and there is no real impact on functionality I suggest to
> defer to 4.2

I agree that such a change can be risky given the current state of 4.1. As long as it rarely happens in customers' environments, we should defer it to 4.2.

Comment 19 Yaniv Kaul 2017-09-04 15:01:35 UTC
(In reply to Michal Skrivanek from comment #16)
> I believe the best would be to remove the Error host state. It's a relic
> from the past when we didn't have a proper scheduler, it was a simple
> mechanism to improve selection of migration destination hosts. It shouldn't
> be needed anymore
> 
> This is a more invasive change not suitable for 4.1, but since it's a very
> rare occurrence and there is no real impact on functionality I suggest to
> defer to 4.2

Any updates on this?

Comment 20 Tomas Jelinek 2017-10-27 09:28:54 UTC
(In reply to Yaniv Kaul from comment #19)
> (In reply to Michal Skrivanek from comment #16)
> > I believe the best would be to remove the Error host state. It's a relic
> > from the past when we didn't have a proper scheduler, it was a simple
> > mechanism to improve selection of migration destination hosts. It shouldn't
> > be needed anymore
> > 
> > This is a more invasive change not suitable for 4.1, but since it's a very
> > rare occurrence and there is no real impact on functionality I suggest to
> > defer to 4.2
> 
> Any updates on this?

For the same reasons as above, I would propose deferring it to 4.3.

Comment 21 Martin Tessun 2017-11-10 08:51:10 UTC
(In reply to Tomas Jelinek from comment #20)
> (In reply to Yaniv Kaul from comment #19)
> > (In reply to Michal Skrivanek from comment #16)
> > > I believe the best would be to remove the Error host state. It's a relic
> > > from the past when we didn't have a proper scheduler, it was a simple
> > > mechanism to improve selection of migration destination hosts. It shouldn't
> > > be needed anymore
> > > 
> > > This is a more invasive change not suitable for 4.1, but since it's a very
> > > rare occurrence and there is no real impact on functionality I suggest to
> > > defer to 4.2
> > 
> > Any updates on this?
> 
> for same reasons as above - I would propose to deferring it to 4.3

I don't like deferring it over and over; we have known about this issue for quite some time.
So if we have a chance of getting this fixed in 4.2, I would prefer that. If we need to delay it to 4.3, that is somewhat OK with me, but then we need to ensure we deliver it in 4.3. I would not like to see this deferred again; otherwise we should say that we don't want to fix it in the near future, as it is only a corner case, and close it out.

So to summarize:
If we can get this solved in 4.2, we should do it. If that is really not possible, we can postpone it to 4.3, but we need to fix it then.

Comment 27 Israel Pinto 2017-12-07 11:46:00 UTC
Verify:
Software version: 4.2.0-0.5.master.el7

Steps:
1. Run 4 VMs on host_1.
2. Start migration of all VMs to host_2.
3. Set host_1 and host_2 to maintenance.
4. Check that the hosts switch to Maintenance.
5. All VMs migrate to host_3.
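
For reference, roughly the same flow can be scripted against the engine with the oVirt Java SDK (org.ovirt.engine.sdk4). This is only a sketch: the engine URL, credentials and the host names host_1/host_2 are assumptions, and the destination for step 2 is left to the scheduler here rather than forced to host_2:

import org.ovirt.engine.sdk4.Connection;
import org.ovirt.engine.sdk4.ConnectionBuilder;
import org.ovirt.engine.sdk4.services.HostService;
import org.ovirt.engine.sdk4.services.HostsService;
import org.ovirt.engine.sdk4.services.VmsService;
import org.ovirt.engine.sdk4.types.Host;
import org.ovirt.engine.sdk4.types.Vm;

public class MaintenanceDuringMigrationCheck {
    public static void main(String[] args) throws Exception {
        // Connection details are assumptions for this sketch.
        Connection connection = ConnectionBuilder.connection()
                .url("https://engine.example.com/ovirt-engine/api")
                .user("admin@internal")
                .password("password")
                .insecure(true)
                .build();

        HostsService hostsService = connection.systemService().hostsService();
        VmsService vmsService = connection.systemService().vmsService();

        // Step 2: trigger migration of all VMs currently running on host_1
        // (in the manual test the destination was host_2; here the engine
        // scheduler picks it).
        for (Vm vm : vmsService.list().search("host=host_1").send().vms()) {
            vmsService.vmService(vm.id()).migrate().send();
        }

        // Step 3: put host_1 and host_2 into maintenance.
        for (String name : new String[] {"host_1", "host_2"}) {
            Host host = hostsService.list().search("name=" + name).send().hosts().get(0);
            HostService hostService = hostsService.hostService(host.id());
            hostService.deactivate().send();
        }

        // Step 4: poll the host statuses and confirm they report Maintenance
        // (not ERROR); polling loop omitted here for brevity.
        connection.close();
    }
}

On a fixed build, both hosts end up in Maintenance and the VMs on host_3, with neither host reporting ERROR.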

Comment 30 errata-xmlrpc 2018-05-15 17:38:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 31 Franta Kust 2019-05-16 13:06:18 UTC
BZ<2>Jira Resync

