Created attachment 1477290 [details]
engine and vdsm

Description of problem:
The migration of the Hosted Engine VM completes successfully, but afterwards a qemu error appears on the source host: "libvirtError: internal error: qemu unexpectedly closed the monitor:". After the error the HE score remains 0 for about 9 minutes.

Version-Release number of selected component (if applicable):
rhv-release-4.2.6-3-001.noarch

How reproducible:
Sometimes. The problem was seen during an automation tier2 run.

Steps to Reproduce:
1. Migrate the HE VM

Actual results:
The VM migrates successfully, but then a qemu error is reported in the engine and vdsm logs (please see attachment).

Expected results:
No error on the source host after migration.

Additional info:
For QE: seen while running the tier2 rhevmtests/compute/sla/ha_vm test.
That error is from an attempted VM start, not from migration. HE agent logs are needed; can you please attach them?
Created attachment 1477501 [details]
hosted_engine logs

The error happens on the source host after the migration process. agent.log and broker.log are attached.
It seems that after the migration finished, the local agent decided to run the HE VM again:

MainThread::INFO::2018-08-16 14:58:55,274::states::839::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Migration to None complete, no longer monitoring vm
MainThread::INFO::2018-08-16 14:58:55,640::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineMigratingAway-EngineDown) sent? sent
MainThread::INFO::2018-08-16 14:58:56,072::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-08-16 14:59:06,100::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
...

Simone/Martin?
The fact that the agent tries to start the VM again might not be an issue, as the status from the destination might not have arrived yet.

But the code should identify the "lock is taken" situation and not end up in the unexpectedly-down state (which penalizes the host for 10 minutes).

Polina:

- I presume the engine is happily up and running on the second host, right?
- Next time please attach the output of hosted-engine --vm-status from both hosts to make the situation clear (both are needed to check for whiteboard sync errors)
(In reply to Polina from comment #0)
> Steps to Reproduce:
> 1.migrate the HE VM

How? By triggering it from the engine?

(In reply to Michal Skrivanek from comment #3)
> seems after the migration finished the local agent decided to run the HE VM
> again

Yes, I think so.

Host 1 found the engine VM migrating away:
MainThread::INFO::2018-08-16 14:56:05,503::states::421::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM found migrating away

So I suppose that the migration of the engine VM was triggered by the user from the engine side and ovirt-ha-agent on host 1 simply detected it.

Once the migration successfully completed, and before host 2 reported an updated status, host 1 was still the host with the best score, so it tried to start the engine VM again:
MainThread::INFO::2018-08-16 14:59:07,652::hosted_engine::948::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2018-08-16 14:59:07,755::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2018-08-16 14:59:07,775::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)

and sanlock prevented that as expected:
MainThread::INFO::2018-08-16 14:59:17,802::states::786::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is unexpectedly down.

and so host 1 moved to EngineMaybeAway:
MainThread::INFO::2018-08-16 14:59:17,802::states::786::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is unexpectedly down.
MainThread::INFO::2018-08-16 14:59:17,822::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
MainThread::INFO::2018-08-16 14:59:17,979::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineMaybeAway) sent? sent
MainThread::INFO::2018-08-16 14:59:18,418::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineMaybeAway (score: 3400)

Up to now everything worked as expected. The issue is that, after that, host 1 moves from EngineMaybeAway to EngineUnexpectedlyDown:
MainThread::INFO::2018-08-16 14:59:18,433::states::880::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm was unexpectedly shut down
MainThread::INFO::2018-08-16 14:59:18,454::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
MainThread::INFO::2018-08-16 14:59:18,545::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineMaybeAway-EngineUnexpectedlyDown) sent? sent
MainThread::INFO::2018-08-16 14:59:18,560::states::700::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Thu Aug 16 14:59:18 2018

Penalizing itself to 0 points for N minutes.
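For reference, the score penalty described above (score forced to 0 for roughly 10 minutes after an unexpected VM shutdown) can be sketched like this. This is a minimal illustrative model, not the agent's actual code; the constant name and 600-second value are assumptions for illustration.

```python
# Minimal sketch of the score-penalty window: after an unexpected VM
# shutdown, the host's score is forced to 0 until the window elapses.
# BASE_SCORE matches the "score: 3400" seen in the logs; the window
# length is an assumed value, not the agent's actual constant.
BASE_SCORE = 3400
PENALTY_WINDOW = 600  # seconds; the "~10 minutes" mentioned in the thread

def effective_score(now, last_unexpected_shutdown):
    """Return 0 while inside the penalty window, BASE_SCORE otherwise."""
    if last_unexpected_shutdown is not None and \
            now - last_unexpected_shutdown < PENALTY_WINDOW:
        return 0
    return BASE_SCORE

assert effective_score(now=100, last_unexpected_shutdown=None) == 3400
assert effective_score(now=100, last_unexpected_shutdown=50) == 0     # penalized
assert effective_score(now=700, last_unexpected_shutdown=50) == 3400  # window over
```

This is why the reporter observed the HE score staying at 0 for about 9 minutes after the spurious EngineUnexpectedlyDown transition.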
In my opinion the issue comes from the handling of timeouts here:

MainThread::INFO::2018-08-16 14:59:17,822::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
...
MainThread::INFO::2018-08-16 14:59:18,454::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
(In reply to Martin Sivák from comment #4)
> But the code should identify the "lock is taken" situation and not end up in
> unexpectedly down state (which penalizes the host for 10 minutes).

The host correctly went into EngineMaybeAway since it failed to start the engine VM due to sanlock; the issue is the subsequent EngineMaybeAway -> EngineUnexpectedlyDown transition, caused by a timeout.
(In reply to Simone Tiraboschi from comment #6)
> MainThread::INFO::2018-08-16
> 14:59:17,822::state_decorators::92::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(check) Timeout cleared while transitioning
> <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class
> 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
> ...
> MainThread::INFO::2018-08-16
> 14:59:18,454::state_decorators::92::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(check) Timeout cleared while transitioning
> <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'> -> <class
> 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>

We should have a 40-second timeout here:
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/agent/constants.py.in#L48
but the agent moved from EngineStarting to EngineMaybeAway and then to EngineUnexpectedlyDown in less than a second.

Maybe we introduced a regression with:
https://gerrit.ovirt.org/#/c/90531/
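To make the suspected mechanism concrete, here is a simplified, hypothetical model of a state-transition timeout check. It is NOT the actual ovirt-hosted-engine-ha state_decorators code; the class and method names are invented for illustration. It shows how a grace-period timer behaves when used correctly, so that resetting ("clearing") the start timestamp at the wrong moment would let a state that should wait 40 seconds fall through immediately.

```python
# Illustrative model of a per-state grace-period check. The 40 s value
# mirrors the timeout referenced in constants.py.in above; the class is
# a hypothetical sketch, not the agent's real implementation.
import time

ENGINE_STARTING_TIMEOUT = 40  # seconds

class TimeoutCheck:
    def __init__(self, timeout):
        self.timeout = timeout
        self.started = None  # timestamp when the grace period began

    def expired(self, now=None):
        """Start the grace period on first call; report expiry afterwards."""
        now = time.monotonic() if now is None else now
        if self.started is None:
            self.started = now
            return False
        return (now - self.started) >= self.timeout

check = TimeoutCheck(ENGINE_STARTING_TIMEOUT)
assert check.expired(now=0.0) is False   # grace period just started
assert check.expired(now=39.0) is False  # still inside the 40 s window
assert check.expired(now=41.0) is True   # only now should the state give up
```

If instead the start timestamp is cleared on each transition and the next state's check immediately evaluates an already-satisfied condition, two transitions can fire within the same second, which matches the 14:59:17,822 -> 14:59:18,454 sequence in the logs.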
(In reply to Martin Sivák from comment #4) > The fact the agent tries to start the VM again might not be an issue as the > status from the destination might have not arrived yet. > > But the code should identify the "lock is taken" situation and not end up in > unexpectedly down state (which penalizes the host for 10 minutes). > > Polina: > > - I presume the engine is happily up and running on the second host right? > - Next time please attach the output of hosted-engine --vm-status from both > hosts to make the situation clear (both are needed to check for whiteboard > sync errors) Yes, the engine is successfully Up on the second host. Since the problem is shown up while automation build running it is hard to present the "hosted-engine --vm-status". If the situation happens while manual testing, it will allow to get the --vm-status info. (In reply to Simone Tiraboschi from comment #5) > (In reply to Polina from comment #0) > > Steps to Reproduce: > > 1.migrate the HE VM > > How? triggering it from the engine? > Yes, the migration is triggered by user (by sending rest command to engine).
Looking at the vdsm logs, the VM failed to start because of a libvirt error:

2018-08-16 14:59:11,729+0300 ERROR (vm/62c16ea2) [virt.vm] (vmId='62c16ea2-e456-4161-89f8-cd6061b43716') The vm start process failed (vm:948)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 877, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2895, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: qemu unexpectedly closed the monitor: 2018-08-16T11:59:11.027056Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-08-16T11:59:11.114622Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-f6390d8a-3aad-4dd2-a03c-723e11a038b1,id=ua-f6390d8a-3aad-4dd2-a03c-723e11a038b1,bootindex=1: Failed to get shared "write" lock
Is another process using the image?

This error probably means that sanlock prevented the VM from starting. There is also an unrelated NUMA warning before the error.

The engine_health sub-monitor compares a specific string against this error message to detect a failure to start caused by sanlock. Here the message differs from what it expects, so it interprets the VM's Down state as an unexpected failure.
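For illustration, this is a sketch of why substring matching on error text is fragile here. The function name and pattern list are hypothetical, not the real engine_health sub-monitor API; only the 'Is another process using the image?' message is taken from the log above, and the sanlock wording is an assumed example.

```python
# Hypothetical sketch: classifying a failed VM start as lock contention
# by substring matching, the approach the sub-monitor reportedly uses.
# If only one lock-holder's wording is listed, the qemu image-locking
# variant slips through and looks like an unexpected failure.
SANLOCK_ONLY_PATTERNS = (
    "Failed to acquire lock",               # assumed sanlock-style wording
)
EXTENDED_PATTERNS = SANLOCK_ONLY_PATTERNS + (
    "Is another process using the image?",  # qemu image-locking message
)

def is_lock_contention(error_message, patterns):
    """Return True if the start failure looks like disk-lock contention."""
    return any(p in error_message for p in patterns)

qemu_error = (
    'qemu-kvm: ... Failed to get shared "write" lock\n'
    "Is another process using the image?"
)
# The qemu wording is missed by the sanlock-only list but caught when
# the qemu message is added:
assert not is_lock_contention(qemu_error, SANLOCK_ONLY_PATTERNS)
assert is_lock_contention(qemu_error, EXTENDED_PATTERNS)
```

This matches the verification hint below: on file-based storage, qemu's own image locking can fire before sanlock, producing a message the sanlock-oriented check does not recognize.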
Verification hint: the issue happens on file-based storage domains (NFS, Gluster) when qemu's internal locking gets triggered/detected before the sanlock one, so it is not 100% systematic. To be sure you really hit the issue, please ensure you see the 'Is another process using the image?' qemu error message in the vdsm logs.
Verified on ovirt-release42-snapshot-4.2.7-0.2.rc2.20181007015214.gitfb30674.el7.noarch with Hosted Storage on NFS. Tested by running the automation test which found the problem several times: rhevmtests/compute/sla/ha_vm/ha_vm_test.py
This bugzilla is included in the oVirt 4.2.7 release, published on November 2nd 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.