Bug 1619365 - VM HostedEngine is unexpectedly down with qemu error on source host after migration is completed
Summary: VM HostedEngine is unexpectedly down with qemu error on source host after migration is completed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: ---
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.2.7
Assignee: Simone Tiraboschi
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks: 1629887
 
Reported: 2018-08-20 15:44 UTC by Polina
Modified: 2018-11-02 14:38 UTC
8 users

Fixed In Version: ovirt-hosted-engine-ha-2.2.18-1.el7ev.noarch.rpm
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-02 14:38:24 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2+


Attachments
engine and vdsm (595.09 KB, application/x-gzip)
2018-08-20 15:44 UTC, Polina
hosted_engine logs (380.70 KB, application/x-gzip)
2018-08-21 12:25 UTC, Polina


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1567615 0 high CLOSED Hosted engine - engine restarted during hosted engine VM migration -ovirt_hosted_engine_ha.agent.hosted_engine.HostedEng... 2021-02-22 00:41:40 UTC
oVirt gerrit 94460 0 v2.2.z MERGED ha: fix lock detection for newer qemu 2021-02-02 06:21:18 UTC
oVirt gerrit 94461 0 v2.2.z MERGED ha: fix lock detection for newer qemu 2021-02-02 06:21:18 UTC

Internal Links: 1567615

Description Polina 2018-08-20 15:44:18 UTC
Created attachment 1477290 [details]
engine and vdsm

Description of problem:
The migration of the Hosted Engine VM completes successfully, and afterwards a qemu error occurs on the source host - "libvirtError: internal error: qemu unexpectedly closed the monitor:". After the error the HE score remains 0 for about 9 minutes.

Version-Release number of selected component (if applicable):
rhv-release-4.2.6-3-001.noarch

How reproducible: sometimes. The problem was seen during an automation tier2 run.

Steps to Reproduce:
1. Migrate the HE VM

Actual results: the VM migrates successfully, but then a qemu error is reported in the engine and vdsm logs (please see attachment).


Expected results: no error on the source host after migration


Additional info (for QE): the issue was seen while running the tier2 rhevmtests/compute/sla/ha_vm test.

Comment 1 Michal Skrivanek 2018-08-21 05:32:13 UTC
That error is from an attempted VM start, not from migration. HE agent logs are needed; can you please attach them?

Comment 2 Polina 2018-08-21 12:25:20 UTC
Created attachment 1477501 [details]
hosted_engine logs

The error happens on the source host after the migration process.
agent.log and broker.log are attached.

Comment 3 Michal Skrivanek 2018-08-21 15:16:36 UTC
It seems that after the migration finished, the local agent decided to run the HE VM again:
...
MainThread::INFO::2018-08-16 14:58:55,274::states::839::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Migration to None complete, no longer monitoring vm
MainThread::INFO::2018-08-16 14:58:55,640::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineMigratingAway-EngineDown) sent? sent
MainThread::INFO::2018-08-16 14:58:56,072::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-08-16 14:59:06,100::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
...

Simone/Martin?

Comment 4 Martin Sivák 2018-08-21 15:59:58 UTC
The fact that the agent tries to start the VM again might not be an issue, as the status from the destination might not have arrived yet.

But the code should identify the "lock is taken" situation and not end up in unexpectedly down state (which penalizes the host for 10 minutes).

Polina:

- I presume the engine is happily up and running on the second host right?
- Next time please attach the output of hosted-engine --vm-status from both hosts to make the situation clear (both are needed to check for whiteboard sync errors)

Comment 5 Simone Tiraboschi 2018-08-21 16:02:27 UTC
(In reply to Polina from comment #0)
> Steps to Reproduce:
> 1.migrate the HE VM 

How? triggering it from the engine?


(In reply to Michal Skrivanek from comment #3)
> seems after the migration finished the local agent decided to run the HE VM
> again

Yes, I think so.

host 1 found the engine VM migrating away:

MainThread::INFO::2018-08-16 14:56:05,503::states::421::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM found migrating away

So I suppose that the migration of the engine VM was triggered by the user from the engine side and ovirt-ha-agent on host 1 simply detected it.

Once the migration successfully completed, and before host 2 reported an updated status, host 1 was still the host with the best score and so it tried to start the engine VM again:

MainThread::INFO::2018-08-16 14:59:07,652::hosted_engine::948::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2018-08-16 14:59:07,755::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2018-08-16 14:59:07,775::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)

and sanlock prevented that as expected:

MainThread::INFO::2018-08-16 14:59:17,802::states::786::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is unexpectedly down.

and so host1 moved to EngineMaybeAway:

MainThread::INFO::2018-08-16 14:59:17,802::states::786::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is unexpectedly down.
MainThread::INFO::2018-08-16 14:59:17,822::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
MainThread::INFO::2018-08-16 14:59:17,979::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineMaybeAway) sent? sent
MainThread::INFO::2018-08-16 14:59:18,418::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineMaybeAway (score: 3400)

and so up to now everything worked as expected.

The issue is that, right after that, host 1 moves from EngineMaybeAway to EngineUnexpectedlyDown:

MainThread::INFO::2018-08-16 14:59:18,433::states::880::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm was unexpectedly shut down
MainThread::INFO::2018-08-16 14:59:18,454::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
MainThread::INFO::2018-08-16 14:59:18,545::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineMaybeAway-EngineUnexpectedlyDown) sent? sent
MainThread::INFO::2018-08-16 14:59:18,560::states::700::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Thu Aug 16 14:59:18 2018

Penalizing itself to 0 points for N minutes.
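
To make the penalty concrete, here is a minimal sketch of the behaviour (assumed names such as PENALTY_WINDOW and score(), not the actual ovirt-hosted-engine-ha code): once the VM is flagged as unexpectedly down, the host reports a score of 0 until the penalty window expires.

import time

# Rough illustration of the penalty described above (assumed names, not the
# real agent implementation): after an unexpected shutdown the host reports
# score 0 until the penalty window has elapsed, then returns to its base score.
PENALTY_WINDOW = 10 * 60   # hypothetical 10-minute penalty window, in seconds
BASE_SCORE = 3400          # the normal score seen in the logs above

def score(penalty_start, now=None):
    """Return 0 while the unexpected-shutdown penalty is still active."""
    now = time.time() if now is None else now
    if penalty_start is not None and now - penalty_start < PENALTY_WINDOW:
        return 0
    return BASE_SCORE

# Right after the shutdown the score is 0; it recovers once the window passes.
shutdown = time.time()
assert score(shutdown) == 0
assert score(shutdown, now=shutdown + PENALTY_WINDOW + 1) == BASE_SCORE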

In my opinion the issue comes from the handling of timeouts here:


MainThread::INFO::2018-08-16 14:59:17,822::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
...
MainThread::INFO::2018-08-16 14:59:18,454::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>

Comment 6 Simone Tiraboschi 2018-08-21 16:04:23 UTC
(In reply to Martin Sivák from comment #4)
> But the code should identify the "lock is taken" situation and not end up in
> unexpectedly down state (which penalizes the host for 10 minutes).

The host correctly went into EngineMaybeAway since it failed to start the engine VM due to sanlock; the issue is the EngineMaybeAway -> EngineUnexpectedlyDown transition due to a timeout after that.

Comment 7 Simone Tiraboschi 2018-08-21 16:13:18 UTC
(In reply to Simone Tiraboschi from comment #6)
> MainThread::INFO::2018-08-16
> 14:59:17,822::state_decorators::92::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(check) Timeout cleared while transitioning
> <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class
> 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
> ...
> MainThread::INFO::2018-08-16
> 14:59:18,454::state_decorators::92::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(check) Timeout cleared while transitioning
> <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'> -> <class
> 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>

We should have a 40-second timeout here:
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/agent/constants.py.in#L48

but it moved from EngineStarting to EngineMaybeAway and then to EngineUnexpectedlyDown in less than a second.

Maybe we introduced a regression with: https://gerrit.ovirt.org/#/c/90531/
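
For context, the intended behaviour of such a timeout guard can be sketched roughly like this (a simplified illustration with assumed names, not the actual state_decorators code): the agent should only leave EngineStarting once the timeout has actually expired.

import time

# Simplified sketch of a timeout-guarded transition (assumed names, not the
# real ovirt-hosted-engine-ha code): stay in EngineStarting until either the
# VM comes up or the timeout referenced above (~40 seconds) expires.
TIMEOUT = 40  # seconds

def next_state(timeout_start, vm_is_up, now=None):
    """Return (state_name, timeout_start); timeout_start is None once cleared."""
    now = time.time() if now is None else now
    if vm_is_up:
        return "EngineUp", None                 # success: clear the timeout
    if timeout_start is None:
        return "EngineStarting", now            # first check: start the timer
    if now - timeout_start < TIMEOUT:
        return "EngineStarting", timeout_start  # keep waiting
    return "EngineMaybeAway", None              # timeout expired: give up

# With this logic the agent would wait up to TIMEOUT seconds in EngineStarting
# instead of falling through to EngineMaybeAway within a second.
state, started = next_state(None, vm_is_up=False)
assert state == "EngineStarting"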

Comment 8 Polina 2018-08-22 07:03:44 UTC
(In reply to Martin Sivák from comment #4)
> The fact the agent tries to start the VM again might not be an issue as the
> status from the destination might have not arrived yet.
> 
> But the code should identify the "lock is taken" situation and not end up in
> unexpectedly down state (which penalizes the host for 10 minutes).
> 
> Polina:
> 
> - I presume the engine is happily up and running on the second host right?
> - Next time please attach the output of hosted-engine --vm-status from both
> hosts to make the situation clear (both are needed to check for whiteboard
> sync errors)

Yes, the engine is successfully Up on the second host. 
Since the problem shows up while an automation build is running, it is hard to capture the "hosted-engine --vm-status" output. If the situation happens during manual testing, it will be possible to get the --vm-status info.

(In reply to Simone Tiraboschi from comment #5)
> (In reply to Polina from comment #0)
> > Steps to Reproduce:
> > 1.migrate the HE VM 
> 
> How? triggering it from the engine?
> 
Yes, the migration is triggered by the user (by sending a REST command to the engine).
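
For reference, triggering such a migration from automation typically goes through the engine API; a minimal sketch with the Python SDK might look like this (hypothetical connection details, not the exact call used by the test suite):

import ovirtsdk4 as sdk

# Hedged sketch: ask the engine to migrate the HostedEngine VM.
# URL, credentials and CA file below are placeholders.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()
he_vm = vms_service.list(search='name=HostedEngine')[0]
# The engine picks the destination host; a specific host can also be requested.
vms_service.vm_service(he_vm.id).migrate()
connection.close()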

Comment 9 Andrej Krejcir 2018-08-22 15:31:44 UTC
Looking at the vdsm logs, the VM failed to start because of a libvirt error:


2018-08-16 14:59:11,729+0300 ERROR (vm/62c16ea2) [virt.vm] (vmId='62c16ea2-e456-4161-89f8-cd6061b43716') The vm start process failed (vm:948)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 877, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2895, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: qemu unexpectedly closed the monitor: 
2018-08-16T11:59:11.027056Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-08-16T11:59:11.114622Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-f6390d8a-3aad-4dd2-a03c-723e11a038b1,id=ua-f6390d8a-3aad-4dd2-a03c-723e11a038b1,bootindex=1: Failed to get shared "write" lock
Is another process using the image?


This error probably means that sanlock prevented the VM from starting. There is also an unrelated warning before the error.

The engine_health sub-monitor compares a specific string with this error message to detect a failure to start caused by sanlock. Here the message is different from what it expects, so it interprets the VM Down state as an unexpected failure.
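
A rough sketch of the kind of check involved (illustrative only - the pattern list and function name are assumptions, not the actual engine_health code) would classify a down VM whose exit message matches a known lock error as "blocked by a lock" instead of "unexpectedly down":

# Illustrative sketch (assumed patterns and names, not the real sub-monitor):
# treat a down VM as blocked by image/storage locking when its exit message
# matches one of the known lock-related errors, including the newer qemu
# image-locking message visible in the traceback above.
LOCK_ERROR_PATTERNS = (
    "Failed to acquire lock",                # sanlock-style message (assumed)
    "Is another process using the image?",   # newer qemu image-locking message
)

def down_due_to_lock(exit_message):
    """Return True if the VM exit message looks like lock contention."""
    return any(pattern in exit_message for pattern in LOCK_ERROR_PATTERNS)

# Example with the message from the vdsm log above:
msg = ('internal error: qemu unexpectedly closed the monitor: ... '
       'Failed to get shared "write" lock\nIs another process using the image?')
assert down_due_to_lock(msg)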

Comment 10 Simone Tiraboschi 2018-09-24 07:45:35 UTC
Verification hint:
The issue happens on file-based storage domains (NFS, Gluster) when the qemu internal locking gets triggered/detected before the sanlock one, so it is not 100% systematic.

In order to be sure you really hit the issue, please check that you see the 'Is another process using the image?' qemu error message in the vdsm logs.
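
A quick way to check for that marker (a trivial helper, assuming the default vdsm log location) could be:

# Minimal helper to confirm the reproduction condition described above by
# scanning the vdsm log for the qemu image-locking message. The log path is
# the usual vdsm default; adjust it if your setup differs.
MARKER = "Is another process using the image?"

def hit_the_issue(log_path="/var/log/vdsm/vdsm.log"):
    with open(log_path) as log:
        return any(MARKER in line for line in log)

if __name__ == "__main__":
    print("qemu lock error found in vdsm log:", hit_the_issue())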

Comment 11 Polina 2018-10-09 09:14:53 UTC
Verified on 
ovirt-release42-snapshot-4.2.7-0.2.rc2.20181007015214.gitfb30674.el7.noarch
Hosted Storage on NFS. 
Tested by running the automation test which found the problem (rhevmtests/compute/sla/ha_vm/ha_vm_test.py) several times.

Comment 12 Sandro Bonazzola 2018-11-02 14:38:24 UTC
This bugzilla is included in the oVirt 4.2.7 release, published on November 2nd 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

