Bug 1376754

Summary: Host is set to ERROR mode (cannot start VM) while being in Maintenance
Product: Red Hat Enterprise Virtualization Manager
Reporter: Alexandros Gkesos <agkesos>
Component: ovirt-engine
Assignee: Arik <ahadas>
Status: CLOSED ERRATA
QA Contact: Israel Pinto <ipinto>
Severity: medium
Docs Contact:
Priority: high
Version: 4.0.2
CC: ahadas, apinnick, ipinto, jentrena, lsurette, mavital, mgoldboi, michal.skrivanek, mperina, mtessun, ppostler, rbalakri, Rhev-m-bugs, srevivo, tjelinek, ykaul
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if a host was placed in maintenance mode and migration was cancelled while at least 3 virtual machines were attempting to migrate to it, the host ended up in an ERROR state. In the current release, the host does not move into an ERROR state in this situation.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-15 17:38:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 4 Martin Perina 2016-09-16 12:31:47 UTC
Eli, could you please investigate?

Comment 5 Oved Ourfali 2016-09-18 07:36:57 UTC
Also cc-ing Arik, as I think Virt should look at it as well, based on the description.

Arik - can you take a look?

Comment 6 Eli Mesika 2016-09-18 09:40:09 UTC
(In reply to Martin Perina from comment #4)
> Eli, could you please investigate?

From looking at the log and the scenario, it seems like a pure Virt issue.
The fact that a VM is, for some reason, attempted to run on a host that is already in maintenance points to a possible race: the host was set to maintenance although some tasks (migration?) were still running and had this host in the <running hosts> list.

Comment 7 Arik 2016-09-18 12:48:37 UTC
It looks like a VIRT bug.

Currently the logic is: when a VM is migrated successfully, increment the counter of each host that the VM failed to migrate to in the process (there are 2 rerun attempts by default). When this counter reaches 3 for a given host, set it to ERROR status.

In a scenario where a host that VMs are being migrated to is switched to maintenance, we cancel those incoming migrations. If there are 3 (or more) VMs that were migrating to the host and they then manage to migrate successfully to another host, the host will be switched to ERROR (overriding its MAINTENANCE state).

I believe we didn't encounter this yet because:
1. Usually you don't migrate VMs to a host that is about to go into maintenance
2. There is a timing issue - the host has to switch to maintenance first, and only then does the counter reach 3 and override that status

Need to put some thought into this, since I'm not sure whether it is better not to count migration failures caused by deliberate cancellations, or not to switch a host that is in maintenance into the ERROR state (or both..).
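The counter logic and both candidate fixes described above can be sketched as follows. This is a minimal illustration only, not the actual ovirt-engine code (which is Java); all names here (HostState, MigrationTracker, record_failed_attempt) are hypothetical, and the threshold of 3 mirrors the default rerun behavior described in comment 7.

```python
from collections import defaultdict
from enum import Enum

# Hypothetical model of the host states involved in this bug.
class HostState(Enum):
    UP = "up"
    MAINTENANCE = "maintenance"
    ERROR = "error"

ERROR_THRESHOLD = 3  # 2 reruns by default -> up to 3 failed attempts per host

class MigrationTracker:
    def __init__(self):
        self.failures = defaultdict(int)                  # host -> failed-migration count
        self.states = defaultdict(lambda: HostState.UP)   # host -> current state

    def record_failed_attempt(self, host, deliberately_canceled=False):
        """Called for each host a VM failed to migrate to, once the VM
        eventually migrates successfully elsewhere."""
        # Fix option 1: don't count deliberate cancellations, e.g. incoming
        # migrations canceled because the destination entered maintenance.
        if deliberately_canceled:
            return
        self.failures[host] += 1
        if self.failures[host] >= ERROR_THRESHOLD:
            # Fix option 2: never let ERROR override MAINTENANCE.
            if self.states[host] != HostState.MAINTENANCE:
                self.states[host] = HostState.ERROR

# The buggy scenario: 3 VMs had their incoming migrations to host_2
# canceled because host_2 went into maintenance, then migrated elsewhere.
tracker = MigrationTracker()
tracker.states["host_2"] = HostState.MAINTENANCE
for _ in range(3):
    tracker.record_failed_attempt("host_2", deliberately_canceled=True)
# With either guard in place, host_2 stays in MAINTENANCE instead of ERROR.
```

Genuine repeated failures against an UP host still trip the threshold and mark it ERROR, so the original purpose of the counter is preserved.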

Comment 8 Tomas Jelinek 2016-09-20 06:49:37 UTC
The bug does not cause any corruption and happens only in special cases. Targeting 4.1

Comment 16 Michal Skrivanek 2016-12-21 08:56:54 UTC
I believe the best approach would be to remove the Error host state. It's a relic from the past, when we didn't have a proper scheduler; it was a simple mechanism to improve the selection of migration destination hosts. It shouldn't be needed anymore.

This is a more invasive change, not suitable for 4.1, but since it's a very rare occurrence and there is no real impact on functionality, I suggest deferring it to 4.2.

Comment 18 Moran Goldboim 2016-12-28 22:00:42 UTC
(In reply to Michal Skrivanek from comment #16)
> I believe the best would be to remove the Error host state. It's a relic
> from the past when we didn't have a proper scheduler, it was a simple
> mechanism to improve selection of migration destination hosts. It shouldn't
> be needed anymore
> 
> This is a more invasive change not suitable for 4.1, but since it's a very
> rare occurrence and there is no real impact on functionality I suggest to
> defer to 4.2

I agree that such a change can be risky given the current state of 4.1; as long as it rarely happens in customer environments, we should defer to 4.2.

Comment 19 Yaniv Kaul 2017-09-04 15:01:35 UTC
(In reply to Michal Skrivanek from comment #16)
> I believe the best would be to remove the Error host state. It's a relic
> from the past when we didn't have a proper scheduler, it was a simple
> mechanism to improve selection of migration destination hosts. It shouldn't
> be needed anymore
> 
> This is a more invasive change not suitable for 4.1, but since it's a very
> rare occurrence and there is no real impact on functionality I suggest to
> defer to 4.2

Any updates on this?

Comment 20 Tomas Jelinek 2017-10-27 09:28:54 UTC
(In reply to Yaniv Kaul from comment #19)
> (In reply to Michal Skrivanek from comment #16)
> > I believe the best would be to remove the Error host state. It's a relic
> > from the past when we didn't have a proper scheduler, it was a simple
> > mechanism to improve selection of migration destination hosts. It shouldn't
> > be needed anymore
> > 
> > This is a more invasive change not suitable for 4.1, but since it's a very
> > rare occurrence and there is no real impact on functionality I suggest to
> > defer to 4.2
> 
> Any updates on this?

For the same reasons as above, I would propose deferring it to 4.3.

Comment 21 Martin Tessun 2017-11-10 08:51:10 UTC
(In reply to Tomas Jelinek from comment #20)
> (In reply to Yaniv Kaul from comment #19)
> > (In reply to Michal Skrivanek from comment #16)
> > > I believe the best would be to remove the Error host state. It's a relic
> > > from the past when we didn't have a proper scheduler, it was a simple
> > > mechanism to improve selection of migration destination hosts. It shouldn't
> > > be needed anymore
> > > 
> > > This is a more invasive change not suitable for 4.1, but since it's a very
> > > rare occurrence and there is no real impact on functionality I suggest to
> > > defer to 4.2
> > 
> > Any updates on this?
> 
> for same reasons as above - I would propose to deferring it to 4.3

I don't like deferring it over and over, as we have known about this issue for quite some time.
So if we have a chance of getting this fixed in 4.2, I would prefer that. If we need to delay to 4.3, that's somewhat OK with me, but we need to ensure we deliver in 4.3. I would not like this to be deferred again; otherwise we should say we don't want to fix it in the near future, as it is a corner case only, and close it out.

So to summarize:
If we can get this solved in 4.2, we should do it. If this is really not possible, we can postpone to 4.3, but we need to fix it then.

Comment 27 Israel Pinto 2017-12-07 11:46:00 UTC
Verify:
Software version: 4.2.0-0.5.master.el7

Steps:
1. Run 4 VMs on host_1
2. Start migration of all VMs to host_2
3. Set host_1 and host_2 to maintenance
4. Check that both hosts switch to maintenance
5. All VMs migrate to host_3

Comment 30 errata-xmlrpc 2018-05-15 17:38:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 31 Franta Kust 2019-05-16 13:06:18 UTC
BZ<2>Jira Resync