Bug 1323969 - race on recovery prevents events from being delivered
Summary: race on recovery prevents events from being delivered
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.17.30
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.0.0-beta
Target Release: 4.17.999
Assignee: Piotr Kliczewski
QA Contact: Pavel Stehlik
URL:
Whiteboard: Infra
Depends On:
Blocks:
 
Reported: 2016-04-05 08:25 UTC by Francesco Romani
Modified: 2016-08-17 14:36 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-08-17 14:36:58 UTC
oVirt Team: Infra
Embargoed:
oourfali: ovirt-4.0.0?
rule-engine: planning_ack+
mperina: devel_ack+
pstehlik: testing_ack+


Attachments
full vdsm log attempting container recovery (152.18 KB, text/plain)
2016-04-05 08:25 UTC, Francesco Romani


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 56028 0 master MERGED events: ignore events when jsonrpc binding not available 2016-04-13 13:44:02 UTC

Description Francesco Romani 2016-04-05 08:25:01 UTC
Description of problem:
Could be split between virt and infra; targeting infra for the time being.
Experienced only during recovery, while tinkering with containers (https://github.com/mojaves/ovirt-site/blob/container-support/source/feature/container-support.html.md).
I believe this is triggered by recovery being much faster with containers. I am not sure whether it is a real risk when using regular VMs.

We need to wait for JSONRPC binding availability before starting recovery; otherwise VM recovery will fail at the final step with a KeyError, because the binding is not yet registered when the first status event is emitted:

Thread-12::DEBUG::2016-04-05 10:11:22,733::__init__::207::jsonrpc.Notification::(emit) Sending event {"params": {"notify_time": 4376331650, "dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5": {"status": "Up", "displayInfo": [{"tlsPort": "-1", "ipAddress": "0", "type": "vnc", "port": "-1"}], "hash": "-8241192262902672289", "cpuUser": "0.00", "displayIp": "0", "monitorResponse": "0", "elapsedTime": "630", "displayType": "vnc", "cpuSys": "0.00", "displayPort": "-1", "displaySecurePort": "-1", "timeOffset": "0", "clientIp": ""}}, "jsonrpc": "2.0", "method": "|virt|VM_status|dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5"}
Thread-12::ERROR::2016-04-05 10:11:22,733::vm::759::virt.vm::(_startUnderlyingVm) vmId=`dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 740, in _startUnderlyingVm
    self.send_status_event(**self._getRunningVmStats())
  File "/usr/share/vdsm/virt/vm.py", line 349, in send_status_event
    self._notify('VM_status', stats)
  File "/usr/share/vdsm/virt/vm.py", line 353, in _notify
    self.cif.notify(sub_id, **{self.id: params})
  File "/usr/share/vdsm/clientIF.py", line 160, in notify
    notification.emit(**kwargs)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 208, in emit
    self._cb(notification)
  File "/usr/share/vdsm/clientIF.py", line 163, in _send_notification
    self.bindings['jsonrpc'].reactor.server.send(message,
KeyError: 'jsonrpc'
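
For illustration, here is a minimal sketch of the kind of guard the linked gerrit change describes (ignore events while the jsonrpc binding is not yet available). Only the 'jsonrpc' key, self.bindings and _send_notification come from the traceback above; the rest is hypothetical, not the actual patch:

    import logging

    class ClientIF(object):
        def __init__(self):
            # 'jsonrpc' is registered only after the binding is started.
            self.bindings = {}
            self.log = logging.getLogger('vds')

        def _send_notification(self, message):
            # During early recovery the jsonrpc binding may not be
            # registered yet; drop the event instead of raising KeyError.
            if 'jsonrpc' not in self.bindings:
                self.log.warning('jsonrpc binding not available yet, '
                                 'ignoring event: %s', message)
                return
            # ... the original send via self.bindings['jsonrpc'].reactor.server
            # would go here ...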

The jsonrpc binding is started only later:

MainThread::DEBUG::2016-04-05 10:11:22,835::protocoldetector::206::vds.MultiProtocolAcceptor::(add_detector) Adding detector <yajsonrpc.stompreactor.StompDetector instance at 0x1696830>
MainThread::DEBUG::2016-04-05 10:11:22,836::executor::69::Executor::(start) Starting executor
MainThread::DEBUG::2016-04-05 10:11:22,837::executor::161::Executor::(__init__) Starting worker jsonrpc.Executor/0
MainThread::DEBUG::2016-04-05 10:11:22,837::executor::161::Executor::(__init__) Starting worker jsonrpc.Executor/1
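
Alternatively, per the "wait for JSONRPC binding availability" point above, recovery could simply block until the binding is up. A hypothetical sketch using a threading.Event; none of these names come from the vdsm code:

    import threading

    # Set by whatever code registers and starts the jsonrpc binding.
    jsonrpc_ready = threading.Event()

    def on_jsonrpc_binding_started():
        jsonrpc_ready.set()

    def recover_existing_vms(cif):
        # Block recovery until events can actually be delivered, so the
        # final send_status_event of each recovered VM cannot hit the
        # KeyError above.
        jsonrpc_ready.wait()
        # ... proceed with normal VM recovery using cif ...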


Version-Release number of selected component (if applicable):
Vdsm master

How reproducible:
I've only seen this when using containers.

Steps to Reproduce:
Hopefully there are shorter ways; however, this is how I do it:
1. Patch Vdsm to run containers: https://github.com/mojaves/ovirt-site/blob/container-support/source/feature/container-support.html.md#how-to-try-the-feature
2. Run one or more containers
3. Restart Vdsm

Actual results:
Vdsm fails to recover the VMs, with the aforementioned stack trace.

Expected results:
Vdsm recovers successfully

Additional info:
Please check the full Vdsm log attached.

Comment 1 Francesco Romani 2016-04-05 08:25:32 UTC
Created attachment 1143709 [details]
full vdsm log attempting container recovery

Comment 2 Oved Ourfali 2016-04-07 13:15:38 UTC
Francesco, trying to understand the impact: is it only the event that gets lost, or are there additional issues?

Comment 3 Francesco Romani 2016-04-07 13:19:52 UTC
(In reply to Oved Ourfali from comment #2)
> Francesco, trying to understand the impact: is it only the event that gets
> lost, or are there additional issues?

Should we face this bug, the recovery of VMs fails altogether:

Thread-12::INFO::2016-04-05 10:11:22,804::vm::1291::virt.vm::(setDownStatus) vmId=`dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5`::Changed state to Down: 'jsonrpc' (code=1)
Thread-12::INFO::2016-04-05 10:11:22,805::guestagent::345::virt.vm::(stop) vmId=`dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5`::Stopping connection
Thread-12::DEBUG::2016-04-05 10:11:22,805::vmchannels::234::vds::(unregister) Delete fileno 24 from listener.
Thread-12::DEBUG::2016-04-05 10:11:22,806::vmchannels::66::vds::(_unregister_fd) Failed to unregister FD from epoll (ENOENT): 24
Thread-12::DEBUG::2016-04-05 10:11:22,833::__init__::207::jsonrpc.Notification::(emit) Sending event {"params": {"notify_time": 4376331750, "dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5": {"status": "Down", "timeOffset": "0", "exitReason": 1, "exitMessage": "'jsonrpc'", "exitCode": 1}}, "jsonrpc": "2.0", "method": "|virt|VM_status|dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5"}
Thread-12::ERROR::2016-04-05 10:11:22,833::utils::374::root::(wrapper) Unhandled exception

And the VMs are thus destroyed by Engine:

jsonrpc.Executor/7::DEBUG::2016-04-05 10:11:37,813::__init__::511::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.destroy' in bridge with {u'vmID': u'dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5'}
jsonrpc.Executor/7::DEBUG::2016-04-05 10:11:37,815::API::310::vds::(destroy) About to destroy VM dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5
jsonrpc.Executor/7::DEBUG::2016-04-05 10:11:37,816::vm::3960::virt.vm::(destroy) vmId=`dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5`::destroy Called
jsonrpc.Executor/7::INFO::2016-04-05 10:11:37,816::vm::3875::virt.vm::(releaseVm) vmId=`dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5`::Release VM resources
jsonrpc.Executor/7::WARNING::2016-04-05 10:11:37,816::vm::335::virt.vm::(_set_lastStatus) vmId=`dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5`::trying to set state to Powering down when already Down
jsonrpc.Executor/7::INFO::2016-04-05 10:11:37,817::guestagent::345::virt.vm::(stop) vmId=`dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5`::Stopping connection
jsonrpc.Executor/7::INFO::2016-04-05 10:11:37,817::vm::3907::virt.vm::(_destroyVm) vmId=`dd38b452-0a9a-4e9e-a349-2f4ca73cc8c5`::_destroyVmGraceful attempt #0

OTOH I'd like to remark that I've never encountered this misbehaviour without my experimental container patches: I've never seen this happen on Vdsm master.

Comment 4 Pavel Stehlik 2016-08-16 18:48:13 UTC
Closed due to capacity. If it still reproduces, please reopen.

