1064860 – VMs get stuck in 'Unknown' state when power management is not working.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Eli Mesika
QA Contact:	sefi litmanovich
Docs Contact:
URL:
Whiteboard:	infra
Depends On:
Blocks:	rhev3.5beta 1156165

Reported:	2014-02-13 12:39 UTC by Roman Hodain
Modified:	2016-02-10 19:37 UTC (History)
CC List:	14 users (show)
Fixed In Version:	ovirt-engine-3.5.0_beta
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-02-17 17:07:54 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	26466	0	'None'	ABANDONED	core: VMs moved to UNKNOWN after set to DOWN	2020-05-27 07:43:32 UTC
oVirt gerrit	28114	0	'None'	MERGED	core: move VM status handling to VdsManager.	2020-05-27 07:43:32 UTC

Description Roman Hodain 2014-02-13 12:39:00 UTC

Description of problem:
	
	VMs get stuck in the "Unknown" state when the hypervisor is rebooted and
fencing is not working, even after the hypervisor comes back up and the engine
detects that the VMs are down.


Version-Release number of selected component (if applicable):

	rhevm-3.3.0-0.46.el6ev.noarch

How reproducible:

	100%

Steps to Reproduce:

	1. Create a new DC with just one hypervisor (local storage)
	2. Start a VM on it
	3. Reboot it

Actual results:

	VM is set to the unknown state forever.

Expected results:

	The VM is temporarily marked as Unknown and later as Down when the
hypervisor comes up.

Additional info:
	
	It seems that this is caused by defunct fencing. When fencing fails
three times, the VMs are marked as Unknown, but the rerun treatment has
already happened because the hypervisor has already come up.

2014-02-12 13:42:46,359 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host dhcp-1-146.brq.redhat.com.
2014-02-12 13:42:46,531 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-63) vm TestVM running in db and not running in vds - add to rerun treatment. vds rhev-h.example.com
...
2014-02-12 13:43:16,669 INFO  [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-47) Attempt 3 to find fence proxy host failed...
2014-02-12 13:43:46,670 ERROR [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-47) Failed to run Power Management command on Host rhev-h.example.com, no running proxy Host was found.
2014-02-12 13:43:46,684 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-47) START, SetVmStatusVDSCommand( vmId = 4785f791-c535-4f64-97ef-fbd6a11bf8fd, status = Unknown), log id: 21cd9e52
2014-02-12 13:43:46,687 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-47) FINISH, SetVmStatusVDSCommand, log id: 21cd9e52
2014-02-12 13:43:46,724 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-47) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM TestVM was set to the Unknown status.
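
The ordering problem can be checked mechanically from the log. The following is only a minimal sketch: the excerpt is shortened from the lines above, and the helper names (`first_timestamp`, `unknown_set_after_rerun`) are made up for illustration and are not part of ovirt-engine.

```python
import re
from datetime import datetime

# Shortened excerpt of the engine log lines quoted above.
LOG = """\
2014-02-12 13:42:46,531 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-63) vm TestVM running in db and not running in vds - add to rerun treatment.
2014-02-12 13:43:46,670 ERROR [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-47) Failed to run Power Management command
2014-02-12 13:43:46,724 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-47) Message: VM TestVM was set to the Unknown status.
"""

# Matches the leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp of a log line.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def first_timestamp(log, needle):
    """Return the timestamp of the first line containing `needle`, or None."""
    for line in log.splitlines():
        match = TIMESTAMP.match(line)
        if match and needle in line:
            return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return None


def unknown_set_after_rerun(log):
    """True if the VM was set to Unknown after it entered rerun treatment."""
    rerun = first_timestamp(log, "add to rerun treatment")
    unknown = first_timestamp(log, "was set to the Unknown status")
    return rerun is not None and unknown is not None and unknown > rerun


if __name__ == "__main__":
    print(unknown_set_after_rerun(LOG))
```

In this excerpt the Unknown status is written roughly one minute after the VM was added to rerun treatment, which matches the sequence described above.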

Comment 1 Michal Skrivanek 2014-02-28 10:03:47 UTC

The fencing operation should be aborted when the host comes up in the meantime. Then the rerun treatment should work properly and not get overwritten by the failed fencing afterwards.

Comment 2 Eli Mesika 2014-03-18 13:05:12 UTC

Roman, you should right-click on the host in the Hosts list in the web admin UI and select "Confirm Host has been rebooted".

Please recheck with the above.

Comment 3 Arthur Berezin 2014-03-20 15:02:32 UTC

There's a user experience problem here: the user reboots a host with running VMs, and the VMs go to the Unknown state. There's no indication in the VMs tab (it is only hidden in the Events section) that the user should go back to the host level and confirm that the host has been rebooted.
Adding the User Experience keyword.

Comment 4 Roman Hodain 2014-03-21 11:02:21 UTC

(In reply to Arthur Berezin from comment #3)
> There's a user experience problem here: the user reboots a host with running
> VMs, and the VMs go to the Unknown state. There's no indication in the VMs
> tab (it is only hidden in the Events section) that the user should go back
> to the host level and confirm that the host has been rebooted.
> Adding the User Experience keyword.

I do not think that this is the problem here. The problem is that the hypervisor where the VM was running is already up and the VM is still in the Unknown state. Why would I mark a hypervisor that is up as rebooted?

Comment 5 Eli Mesika 2014-03-24 22:09:22 UTC

(In reply to Roman Hodain from comment #4)
> (In reply to Arthur Berezin from comment #3)
> > There's a user experience problem here: the user reboots a host with running
> > VMs, and the VMs go to the Unknown state. There's no indication in the VMs
> > tab (it is only hidden in the Events section) that the user should go back
> > to the host level and confirm that the host has been rebooted.
> > Adding the User Experience keyword.
> 
> I do not think that this is the problem here. The problem is that the
> hypervisor where the VM was running is already up and the VM is still in the
> Unknown state. Why would I mark a hypervisor that is up as rebooted?

I am just copying and pasting from your bug description:

Steps to Reproduce:

	1. Create a new DC with just one hypervisor (local storage)
	2. Start a VM on it
	3. Reboot it

So, you rebooted the host manually, right? If so, please test again: after you reboot the host, also right-click on it and select "Confirm Host has been rebooted".

BTW, there is no fencing issue here, since fencing cannot work when there is only one host in the DC (no proxy host available...).
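
To illustrate why a single-host DC has no proxy available, here is a minimal sketch. It assumes a simplified rule (any other host that is Up may act as proxy); the function name `find_fence_proxy` and the second host name are hypothetical, and the real proxy selection in the engine may apply further criteria.

```python
def find_fence_proxy(target, host_statuses):
    """Return the name of a host that could act as fence proxy, or None.

    Assumption for this sketch: any host other than `target` whose status
    is "Up" qualifies. `host_statuses` maps host name to status string.
    """
    for name in sorted(host_statuses):
        if name != target and host_statuses[name] == "Up":
            return name
    return None


# One host in the DC: the only host is the one that needs fencing.
single_host_dc = {"rhev-h.example.com": "NonResponsive"}

# Two hosts in the DC: the second (hypothetical) host can act as proxy.
two_host_dc = {
    "rhev-h.example.com": "NonResponsive",
    "other-host.example.com": "Up",
}

if __name__ == "__main__":
    print(find_fence_proxy("rhev-h.example.com", single_host_dc))
    print(find_fence_proxy("rhev-h.example.com", two_host_dc))
```

With only one host, the lookup always returns nothing, which corresponds to the "no running proxy Host was found" error in the log.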

Comment 6 Roman Hodain 2014-03-27 17:15:43 UTC

(In reply to Eli Mesika from comment #5)
> (In reply to Roman Hodain from comment #4)
> > (In reply to Arthur Berezin from comment #3)
> > > There's a user experience problem here: the user reboots a host with running
> > > VMs, and the VMs go to the Unknown state. There's no indication in the VMs
> > > tab (it is only hidden in the Events section) that the user should go back
> > > to the host level and confirm that the host has been rebooted.
> > > Adding the User Experience keyword.
> > 
> > I do not think that this is the problem here. The problem is that the
> > hypervisor where the VM was running is already up and the VM is still in the
> > Unknown state. Why would I mark a hypervisor that is up as rebooted?
> 
> I am just copying and pasting from your bug description:
> 
> Steps to Reproduce:
> 
> 	1. Create a new DC with just one hypervisor (local storage)
> 	2. Start a VM on it
> 	3. Reboot it
> 
> So, you rebooted the host manually, right? If so, please test again: after
> you reboot the host, also right-click on it and select "Confirm Host has
> been rebooted".
> 
> BTW, there is no fencing issue here, since fencing cannot work when there is
> only one host in the DC (no proxy host available...).

Hi,

I have tested your suggestion, but this is not possible. At the time when the VM is in the Unknown state, the hypervisor is already up:
	
Error while executing action: Cannot confirm 'Host has been rebooted' Host. Valid Host statuses are "Non operational", "Maintenance" or "Connecting".

Let me repeat what happens:

 - VM is up
 - host is up
 - host is down
 - fencing is triggered
 - fencing in progress (not working)
 - hypervisor is up
 - VM is marked as down
 - Fencing failed
 - VM is marked as Unknown.
 - Mark the hypervisor as rebooted. (not possible)

I still think that this is a fencing issue. Fencing is triggered, and if it fails, it marks the VMs as Unknown even if they have already been marked as Down by the hypervisor, which is already up.
It is not related only to local storage, but also to any case where fencing is not working.

Roman
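
The sequence above can be replayed as a small simulation. This is only a sketch under stated assumptions: the event names and the `skip_if_host_up` flag are invented for illustration, and the guard shown is one possible way to avoid the overwrite, not necessarily how the merged patch handles it.

```python
# Events in the order listed above; names are invented for this sketch.
EVENTS = [
    ("vm", "Up"),
    ("host", "Up"),
    ("host", "Down"),
    ("fence", "triggered"),
    ("host", "Up"),        # hypervisor comes back while fencing is in progress
    ("vm", "Down"),        # engine detects that the VM is not running
    ("fence", "failed"),   # late failure of the fencing attempt
]


def replay(events, skip_if_host_up=False):
    """Replay the events and return the final VM status.

    Without the guard, a failed fence always sets the VM to Unknown, so the
    last writer wins. With `skip_if_host_up=True` (hypothetical guard), the
    failure is ignored when the host is already Up again.
    """
    host_status = None
    vm_status = None
    for kind, value in events:
        if kind == "host":
            host_status = value
        elif kind == "vm":
            vm_status = value
        elif kind == "fence" and value == "failed":
            if skip_if_host_up and host_status == "Up":
                continue  # host recovered; keep the status it reported
            vm_status = "Unknown"
    return vm_status


if __name__ == "__main__":
    print(replay(EVENTS))
    print(replay(EVENTS, skip_if_host_up=True))
```

Without the guard the replay ends with the VM in Unknown; with it the VM stays Down, which is the expected result from the description.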

Comment 9 Einav Cohen 2014-06-02 17:55:06 UTC

(In reply to Arthur Berezin from comment #3)
> There's a user experience problem here: the user reboots a host with running
> VMs, and the VMs go to the Unknown state. There's no indication in the VMs
> tab (it is only hidden in the Events section) that the user should go back
> to the host level and confirm that the host has been rebooted.
> Adding the User Experience keyword.

Is this what this bug is about? I see that this BZ is in POST, so the problem reported here was solved; what you are saying is that we have a user-experience problem that, if I understand correctly, should be tracked separately from this issue. If so, please open a separate RFE for that. For now I removed the UserExperience keyword from this BZ.
My hunch is that this should be solved via a notification center or something similar that we can plan for 4.0; it is definitely not 3.5 material.
Thanks.

Comment 10 Arthur Berezin 2014-06-05 08:29:46 UTC

(In reply to Einav Cohen from comment #9)
> (In reply to Arthur Berezin from comment #3)
> > There's a user experience problem here: the user reboots a host with running
> > VMs, and the VMs go to the Unknown state. There's no indication in the VMs
> > tab (it is only hidden in the Events section) that the user should go back
> > to the host level and confirm that the host has been rebooted.
> > Adding the User Experience keyword.
> 
> Is this what this bug is about? I see that this BZ is in POST, so the
> problem reported here was solved; what you are saying is that we have a
> user-experience problem that, if I understand correctly, should be tracked
> separately from this issue. If so, please open a separate RFE for that. For
> now I removed the UserExperience keyword from this BZ.
> My hunch is that this should be solved via a notification center or
> something similar that we can plan for 4.0; it is definitely not 3.5 material.
> Thanks.

There are two issues here. The first is fixed by Eli's patch: VMs are marked as Unknown after the host was rebooted and fencing failed. The other is that there's no "Call for Action" in the VMs tab when the user is expected to manually confirm that a host was rebooted. I'll open a separate RFE for the second issue.

Comment 11 sefi litmanovich 2014-09-04 07:38:09 UTC

Verified with ovirt-engine-3.5.0-0.0.master.20140821064931.gitb794d66.el6.noarch.
vdsm-4.16.2-1.gite8cba75.el6.x86_64.

1. single host in datacenter is up (host has no power management configured).
2. create vm.
3. vm is up.
4. manually reboot the host.
5. host state connecting.
6. fencing failed for SPM host in DC, setting DC to non-operational
7. host state non-responsive.
8. vm state unknown.
9. host up.
10. vm down.
11. host is contending for SPM.
12. DC up host is SPM.

Comment 12 Eyal Edri 2015-02-17 17:07:54 UTC

RHEV 3.5.0 was released. Closing.
