Bug 1157421 - [vdsm] hosted-engine: [iSCSI support] HostedEngine VM is not resumed automatically after EIO
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.5.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Adam Litke
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks: 1067162 rhev35rcblocker rhev35gablocker
 
Reported: 2014-10-27 09:01 UTC by Elad
Modified: 2016-02-10 20:21 UTC (History)
15 users

Fixed In Version: vdsm-4.16.8.1-1.el6ev
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage


Attachments
vdsm.log (1.18 MB, application/x-gzip), 2014-10-27 09:01 UTC, Elad
non-HE setup (644.90 KB, application/x-gzip), 2014-10-29 10:04 UTC, Elad
logs1 (3.14 MB, application/x-gzip), 2014-12-09 19:50 UTC, Elad


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 35369 master MERGED irs: Directly register domain state change callbacks Never
oVirt gerrit 35436 master MERGED clientIF: Use a weakref.proxy when registering contEIOVMs Never
oVirt gerrit 35808 ovirt-3.5 MERGED clientIF: Use a weakref.proxy when registering contEIOVMs Never
oVirt gerrit 35809 ovirt-3.5 MERGED irs: Directly register domain state change callbacks Never

Description Elad 2014-10-27 09:01:19 UTC
Created attachment 950915 [details]
vdsm.log

Description of problem:
I disconnected the hosted-engine host from the iSCSI storage server that holds the engine VM's disk by unmapping the LUN. Then I reconnected the host by mapping the LUN back on the storage server. The HostedEngine VM wasn't resumed after the LUNs were exposed to the host again.

Version-Release number of selected component (if applicable):
rhev 3.5 vt7
rhel6.6 host
ovirt-hosted-engine-setup-1.2.1-1.el6ev.noarch
rhevm-3.5.0-0.17.beta.el6ev.noarch
vdsm-4.16.7.1-1.el6ev.x86_64


How reproducible:
Always


Steps to Reproduce:
1. Deploy hosted-engine using iSCSI
2. Once deployment is done, unmap the host from the LUN where the engine's VM disk is located. Wait for the VM to enter the paused state (check 'virsh -r list')
3. When the VM is paused, map the LUN back to the host.

Actual results:
The LUN is exposed back to the host:

multipath -ll:

3514f0c544760013a dm-16 XtremIO,XtremApp
size=40G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 5:0:0:2 sdp 8:240 active ready running

But the VM is not resumed from the paused state, even after waiting for 1 hour.

Thread-435::DEBUG::2014-10-27 10:33:15,574::BindingXMLRPC::1139::vds::(wrapper) return vmGetStats
'status': 'Paused'


Thread-1504::DEBUG::2014-10-27 10:57:46,460::BindingXMLRPC::1139::vds::(wrapper) return vmGetStats
'status': 'Paused'




Expected results:
The HostedEngine VM should be resumed automatically once the EIO condition clears.

Additional info: vdsm.log

Comment 1 Allon Mureinik 2014-10-28 20:50:17 UTC
Stupid question: does any other VM resume after being paused in RHEL 6.6?
Are we sure this is specific to HE?

Comment 2 Elad 2014-10-29 10:04:35 UTC
Created attachment 951714 [details]
non-HE setup

(In reply to Allon Mureinik from comment #1)
> Stupid question: does any other VM resume after being paused in RHEL 6.6?
> Are we sure this is specific to HE?

Checked on RHEL6.6 host, RHEL6.6 guest:

2014-10-29 11:49:19,077 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-96) VM vm_2 79c976e8-e700-4a93-9590-5a387931e33c moved from Paused --> Up


Attaching the logs

Comment 3 Adam Litke 2014-11-19 19:41:02 UTC
I have found the problem.  The clientIF singleton registers a domain state change callback, contEIOVMs, which is responsible for resuming the VM in this case.  Unfortunately the callback doesn't get registered properly due to two issues: 1) the current code requires a call to connectStoragePool in order to fully register the callbacks, and 2) the callback cannot be an instance method due to limitations of weakref.  See the linked gerrit patches for my proposed short-term solutions to the problem.
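The weakref limitation mentioned above is a standard Python gotcha: a bound method object is created fresh on every attribute access, so a weak reference taken directly to it dies immediately. A minimal sketch of the failure mode and the workaround the linked patches use (the ClientIF class below is a hypothetical stand-in, not the actual vdsm code):

```python
import weakref

class ClientIF:
    """Hypothetical stand-in for vdsm's clientIF singleton."""
    def contEIOVMs(self, sd_uuid, ok):
        # In vdsm this would resume the VMs paused with EIO on this domain.
        return "resuming VMs on %s" % sd_uuid

cif = ClientIF()

# A bound method object is created on every attribute access; nothing
# else keeps it alive, so a weakref to it is dead as soon as it is made.
dead = weakref.ref(cif.contEIOVMs)

# Holding a weakref.proxy to the *instance* instead keeps the callback
# reachable for as long as the clientIF singleton itself lives.
proxy = weakref.proxy(cif)
result = proxy.contEIOVMs("sd-uuid", True)
```

This is why registering `cif.contEIOVMs` directly never fired, while registering through a proxy to the instance does.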

Comment 4 Elad 2014-12-09 16:22:59 UTC
Adam, 
I'm working on verification of the issue and I noticed that instead of resuming the hosted-engine VM from paused, the VM is powered off and then started again.

MainThread::INFO::2014-12-09 17:42:07,403::hosted_engine::952::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`


MainThread::INFO::2014-12-09 17:42:21,452::hosted_engine::880::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`


In qemu log:

block I/O error in device 'drive-virtio-disk0': Input/output error (5)
qemu: terminating on signal 15 from pid 2142
2014-12-09 15:42:07.856+0000: shutting down
2014-12-09 15:42:25.100+0000: starting up


AFAIK, resuming a VM from paused due to I/O errors shouldn't involve these kinds of operations.

Based on that, I'm failing the verification and re-opening.

Comment 5 Allon Mureinik 2014-12-09 19:18:31 UTC
The VM comes up again - this is not a blocker.

Elad - can you please attach the full logs?

Adam - from the snippet, it seems that HE's code controls this. Can you confirm? If so, please move the BZ to the correct component.

Comment 6 Elad 2014-12-09 19:50:25 UTC
Created attachment 966441 [details]
logs1

Comment 7 Adam Litke 2014-12-09 20:32:28 UTC
(In reply to Allon Mureinik from comment #5)
> The VM comes up again - this is not a blocker.
> 
> Elad - can you please attach the full logs?
> 
> Adam - from the snippet, it seems that HE's code controls this. Can you
> confirm? If so, please move the BZ to the correct component.

Correct.  This is really a new situation.  The issue I fixed was a problem in vdsm where an event callback was not getting called.

What Elad is reporting is that hosted-engine detects that engine is down for an extended period of time and it makes the decision to power cycle the VM.  This may be the desired behavior for longer service interruptions.  In any case, moving to the hosted-engine component so they can decide if they want to change behavior.

Comment 9 Doron Fediuck 2014-12-10 14:02:39 UTC
The behavior described in comment 4 is the desired behavior for the engine
VM, as after some timeout it should be started on a different host (or the
same depending on the context).

Comment 10 Martin Sivák 2014-12-10 14:15:11 UTC
Hosted engine did exactly what it was supposed to do here.

It monitored the running VM:

MainThread::INFO::2014-12-09 17:28:21,217::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)

Storage domain failed:

MainThread::INFO::2014-12-09 17:28:44,161::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
MainThread::INFO::2014-12-09 17:30:36,529::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING

Was restored again:

MainThread::INFO::2014-12-09 17:31:02,855::hosted_engine::571::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Acquired lock on host id 1

And the VM recovered briefly:

MainThread::INFO::2014-12-09 17:31:02,855::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)

Engine stopped responding to http status requests and we started the recovery timers:

MainThread::INFO::2014-12-09 17:31:12,902::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up', 'detail': 'paused'}, 'bridge': True, 'mem-free': 13709.0, 'maintenance': False, 'cpu-load': 0.13020000000000001, 'gateway': True}
MainThread::INFO::2014-12-09 17:31:14,391::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 0)
MainThread::ERROR::2014-12-09 17:31:24,408::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 300 seconds

Engine did not recover so we asked it to shut down cleanly:

MainThread::ERROR::2014-12-09 17:36:24,110::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 0 seconds
MainThread::INFO::2014-12-09 17:36:36,483::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2014-12-09 17:36:46,505::hosted_engine::952::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2014-12-09 17:36:57,729::states::532::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Waiting on shutdown to complete (12 of 300 seconds)

Which did not happen so we forced it...

MainThread::INFO::2014-12-09 17:41:44,812::states::532::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Waiting on shutdown to complete (299 of 300 seconds)
MainThread::INFO::2014-12-09 17:41:57,381::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineForceStop (score: 2400)
MainThread::INFO::2014-12-09 17:42:07,403::hosted_engine::952::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`
MainThread::INFO::2014-12-09 17:42:08,863::hosted_engine::957::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout: Machine destroyed

and started it again on the best host.

MainThread::INFO::2014-12-09 17:42:10,107::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-12-09 17:42:20,130::states::458::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (2400), attempting to start engine VM

Which it did:

MainThread::INFO::2014-12-09 17:42:21,452::hosted_engine::880::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::ERROR::2014-12-09 17:42:22,017::hosted_engine::900::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2014-12-09 17:42:23,966::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStarting (score: 2400)
MainThread::INFO::2014-12-09 17:43:59,046::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
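The sequence above can be condensed into a small sketch of the agent's timeout-driven transitions. The state names come from the log; the function and the exact transition conditions are an illustration, not the actual ovirt-hosted-engine-ha state machine:

```python
# Illustrative sketch of the transitions visible in the log above.
BAD_HEALTH_TIMEOUT = 300   # "timeout in 300 seconds" before EngineStop
SHUTDOWN_TIMEOUT = 300     # clean-shutdown grace period before force-stop

def next_state(state, engine_healthy, elapsed):
    """Return the next agent state given health and seconds elapsed."""
    if state == "EngineUp":
        return "EngineUp" if engine_healthy else "EngineUpBadHealth"
    if state == "EngineUpBadHealth":
        if engine_healthy:
            return "EngineUp"
        return "EngineStop" if elapsed >= BAD_HEALTH_TIMEOUT else state
    if state == "EngineStop":
        # clean shutdown requested; force it once the grace period expires
        return "EngineForceStop" if elapsed >= SHUTDOWN_TIMEOUT else state
    if state == "EngineForceStop":
        return "EngineDown"       # VM destroyed
    if state == "EngineDown":
        return "EngineStarting"   # best-scoring host starts the VM
    if state == "EngineStarting":
        return "EngineUp" if engine_healthy else state
    return state
```

Walking this sketch with engine_healthy=False reproduces the path in the log: EngineUp, EngineUpBadHealth, EngineStop, EngineForceStop, EngineDown, EngineStarting, and finally back to EngineUp once the engine responds again.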


And next time you decide that something new might be wrong, please create a new bug instead of repurposing an old one.

Comment 11 Elad 2014-12-10 14:51:01 UTC
Martin, 
As part of the verification procedure, if the bug is not fixed or is fixed incorrectly, we are supposed to re-open it. By mistake I moved it to 'NEW' instead of 'ASSIGNED'.

Regarding the behaviour here - 
Resuming a guest from EIO is not supposed to involve a VM shutdown. I don't agree with comments #9 and #10.

Allon, your decision

Comment 12 Allon Mureinik 2014-12-10 15:59:23 UTC
Nope, Doron is the relevant stakeholder here.

The way I understand, there are two competing mechanisms here - 
(1) VDSM should try to unpause a VM once its domain is out of EIO
(2) HE tries to power off and power up the VM.

Inherently, these two race each other.
If (1) doesn't happen at all - it's a bug, and IMHO should be solved, unless the SLA stakeholders can explain otherwise.
If (1) does work, but wasn't observed here since (2) happened too fast, I don't think we'd want to touch this.
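To make the race concrete, a toy model (the timings and function are hypothetical; only the 300-second bad-health timeout comes from the logs in comment 10): whichever mechanism fires first determines what the tester observes.

```python
# Toy model of the race between (1) vdsm's auto-resume and (2) the HA
# agent's power-cycle. Only BAD_HEALTH_TIMEOUT reflects the logs above.
BAD_HEALTH_TIMEOUT = 300

def observed_recovery(resume_delay):
    """Which mechanism acts first, given seconds until vdsm resumes the VM."""
    if resume_delay < BAD_HEALTH_TIMEOUT:
        return "vdsm-resume"      # (1) wins: the VM simply unpauses
    return "ha-power-cycle"       # (2) wins: the VM is destroyed and restarted
```

In comment 4 the resume never happened within the timeout, so the power-cycle path was the one observed.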

Comment 13 Allon Mureinik 2014-12-16 20:20:07 UTC
If this behavior is what we define as correct, and we do not intend to change it (as noted in comment 12, IMHO, Doron is the stakeholder that should approve this), then by definition this bug should be returned to ON_QA for 3.5.0.

Comment 14 Elad 2014-12-17 07:27:36 UTC
As this is the desired behaviour, moving the bug to VERIFIED

Tested using rhev 3.5 vt13.1
ovirt-hosted-engine-setup-1.2.1-7.el6ev.noarch
ovirt-hosted-engine-ha-1.2.4-2.el6ev.noarch

Comment 15 Allon Mureinik 2015-02-16 19:14:07 UTC
RHEV-M 3.5.0 has been released, closing this bug.


