Bug 1176407

Summary: JSON-RPC | engine gets stuck randomly | ERROR (SSL Stomp Reactor) Unable to process messages
Product: Red Hat Enterprise Virtualization Manager Reporter: Meni Yakove <myakove>
Component: ovirt-engine    Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE QA Contact: Meni Yakove <myakove>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.5.0    CC: aberezin, amureini, bazulay, danken, ecohen, gklein, iheim, lpeer, lsurette, lvernia, myakove, nsednev, oourfali, pstehlik, ratamir, rbalakri, Rhev-m-bugs, yeylon
Target Milestone: ---    Keywords: AutomationBlocker
Target Release: 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: vt13.5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1176528 (view as bug list) Environment:
Last Closed: 2015-02-17 17:13:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1164308, 1164311, 1176417    
Attachments:
Description Flags
vnic_profile_job_logs
none
new logs none

Description Meni Yakove 2014-12-21 13:26:36 UTC
Created attachment 971693 [details]
vnic_profile_job_logs

Description of problem:
Our automation jobs fail randomly with:
Engine:
2014-12-21 10:59:14,282 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages
2014-12-21 10:59:14,284 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
2014-12-21 10:59:14,284 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages
2014-12-21 10:59:14,285 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing

Server:
2014-12-21 10:43:51,129 ERROR [stderr] (Response tracker) Exception in thread "Response tracker" java.lang.NullPointerException
2014-12-21 10:43:51,130 ERROR [stderr] (Response tracker)       at org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker.removeRequestFromTracking(ResponseTracker.java:58)
2014-12-21 10:43:51,131 ERROR [stderr] (Response tracker)       at org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker.run(ResponseTracker.java:93)
2014-12-21 10:43:51,131 ERROR [stderr] (Response tracker)       at java.lang.Thread.run(Thread.java:745)

Switching to xml-rpc solves the issue.

Version-Release number of selected component (if applicable):
rhevm-3.5.0-0.26.el6ev.noarch

How reproducible:


Steps to Reproduce:
1. Automation jobs:
Port mirroring 
vNIC Profiles
Network migration

Comment 1 Piotr Kliczewski 2014-12-21 16:13:08 UTC
I checked the vdsm logs, and this is the issue found on the vdsm side:

Thread-2882::ERROR::2014-12-19 23:08:33,431::API::1499::vds::(_rollback) Cannot delete network u'sw162': It doesn't exist in the system
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1497, in _rollback
    yield rollbackCtx
  File "/usr/share/vdsm/API.py", line 1385, in setupNetworks
    supervdsm.getProxy().setupNetworks(networks, bondings, options)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
ConfigNetworkError: (27, "Cannot delete network u'sw162': It doesn't exist in the system")


which causes an error on the engine side.
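The failure mode in the description — one bad response producing repeated "Unable to process messages" errors — can be illustrated with a minimal, hypothetical sketch (the names below are illustrative, not the actual ovirt-vdsm-jsonrpc-java code): if an exception from parsing a single message escapes the worker's loop, every later message is lost, whereas a per-message try/catch confines the damage to the one bad response.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration: a response worker that catches per message,
// so one malformed response cannot stop processing of the rest.
public class ResponseWorkerSketch {
    public static List<Integer> processAll(List<String> messages,
                                           Function<String, Integer> parse) {
        List<Integer> results = new ArrayList<>();
        for (String msg : messages) {
            try {
                results.add(parse.apply(msg));
            } catch (RuntimeException e) {
                // Log and continue instead of letting the worker thread die.
                System.err.println("Exception thrown during message processing: " + e);
            }
        }
        return results;
    }
}
```

Under this sketch, a single unparsable message is logged and skipped while the remaining messages are still processed.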

I found an additional issue in the engine logs which seems to be unrelated to the networking issue:

2014-12-19 21:55:19,167 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (DefaultQuartzScheduler_Worker-30) [371dfc3f] transaction rolled back
2014-12-19 21:55:19,168 ERROR [org.ovirt.engine.core.bll.AddVmFromScratchCommand] (DefaultQuartzScheduler_Worker-30) [371dfc3f] Command org.ovirt.engine.core.bll.AddVmFromScratchCommand throw exception: java.lang.NullPointerException
	at org.ovirt.engine.core.bll.AddVmCommand.addDiskPermissions(AddVmCommand.java:1247) [bll.jar:]
	at org.ovirt.engine.core.bll.AddVmCommand$2.runInTransaction(AddVmCommand.java:825) [bll.jar:]
	at org.ovirt.engine.core.bll.AddVmCommand$2.runInTransaction(AddVmCommand.java:820) [bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:210) [utils.jar:]
	at org.ovirt.engine.core.bll.AddVmCommand.executeVmCommand(AddVmCommand.java:820) [bll.jar:]
	at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:104) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1191) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1330) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1955) [bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
	at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1354) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:352) [bll.jar:]
	at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:430) [bll.jar:]
	at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:411) [bll.jar:]
	at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:617) [bll.jar:]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_71]

Comment 2 Oved Ourfali 2014-12-22 07:28:15 UTC
Dan - can you look at this as well?

Comment 3 Piotr Kliczewski 2014-12-22 08:27:07 UTC
I pushed a patch to remove the side effect (NPE) of the setupNetworks failure. If the test case covers the failure, then we have a fix for this bug.

We still need to understand the nature of the setupNetworks failure.
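The NPE at ResponseTracker.removeRequestFromTracking:58 (the "Response tracker" trace in the description) is consistent with removing a request id whose tracking entry was already gone, e.g. after a timeout. A minimal null-guard sketch of such a fix, with hypothetical names (the actual patch may differ):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical reconstruction: a tracker mapping a host to its in-flight
// request ids. Without the guard, removing an id for a host with no entry
// dereferences a null list and throws a NullPointerException.
public class ResponseTrackerSketch {
    private final Map<String, List<Long>> requestsByHost = new ConcurrentHashMap<>();

    public void track(String host, long id) {
        requestsByHost.computeIfAbsent(host, h -> new CopyOnWriteArrayList<>()).add(id);
    }

    // Defensive removal: tolerate ids that were never tracked or already removed.
    public boolean removeRequestFromTracking(String host, long id) {
        List<Long> ids = requestsByHost.get(host);
        if (ids == null) {
            return false; // nothing tracked for this host; previously an NPE
        }
        return ids.remove(Long.valueOf(id));
    }
}
```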

Comment 4 Allon Mureinik 2014-12-22 09:05:04 UTC
(In reply to Piotr Kliczewski from comment #1)
> I found additional issue in the engine logs which seems to be unrelated to
> networking issue:
Agreed. 
I cloned it to bug 1176528 - let's track it there.

Comment 5 Nikolai Sednev 2014-12-22 09:25:29 UTC
Might be connected to 1162774 ?

Comment 6 Oved Ourfali 2014-12-22 09:40:00 UTC
(In reply to Nikolai Sednev from comment #5)
> Might be connected to 1162774 ?

Doesn't seem related.

Comment 7 Allon Mureinik 2014-12-22 09:41:13 UTC
(In reply to Oved Ourfali from comment #6)
> (In reply to Nikolai Sednev from comment #5)
> > Might be connected to 1162774 ?
> 
> Doesn't seem related.
Agreed.

Comment 8 Nikolai Sednev 2014-12-22 16:36:55 UTC
I was also unable to add two RHEL 7.0 hosts to RHEV-M vt13.4 with JSON-RPC enabled on both hosts by default, until I removed this functionality from them.
Used components:
vdsm-4.16.8.1-4.el7ev.x86_64
qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64
sanlock-3.2.2-2.el7.x86_64
libvirt-client-1.2.8-10.el7.x86_64
rhevm-3.5.0-0.26.el6ev.noarch

2014-12-22 11:01:11,587 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to brown-vdsc.qa.lab.tlv.redhat.com/10.35.106.12
2014-12-22 11:01:11,590 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
2014-12-22 11:01:12,672 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-6) [58aefbe4] Command GetCapabilitiesVDSCommand(HostName = purple-vds1.qa.lab.tlv.redhat.com, HostId = 90d25f9c-0e64-4c8e-9165-78e80b18219c, vds=Host[purple-vds1.qa.lab.tlv.redhat.com,90d25f9c-0e64-4c8e-9165-78e80b18219c]) execution failed. Exception: VDSNetworkException: java.net.SocketTimeoutException: connect timed out
2014-12-22 11:01:12,673 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-6) [58aefbe4] Host purple-vds1.qa.lab.tlv.redhat.com is not responding. It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued.
2014-12-22 11:01:12,686 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-6) [58aefbe4] Failure to refresh Vds runtime info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketTimeoutException: connect timed out
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:126) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:571) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:648) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source) [:1.7.0_71]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_71]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_71]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method) [rt.jar:1.7.0_71]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) [rt.jar:1.7.0_71]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) [rt.jar:1.7.0_71]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) [rt.jar:1.7.0_71]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) [rt.jar:1.7.0_71]
        at java.net.Socket.connect(Socket.java:579) [rt.jar:1.7.0_71]
        at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618) [jsse.jar:1.7.0_71]
        at org.ovirt.engine.core.utils.ssl.AuthSSLProtocolSocketFactory.createSocket(AuthSSLProtocolSocketFactory.java:293) [utils.jar:]
        at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) [commons-httpclient.jar:]
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361) [commons-httpclient.jar:]
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) [commons-httpclient.jar:]
:

Comment 9 Meni Yakove 2014-12-29 08:48:12 UTC
vt13.5

Comment 10 Raz Tamir 2014-12-30 13:56:38 UTC
Still occurs - vt13.5

ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages
11:59:26 2014-12-30 11:59:05,216 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages
11:59:26 2014-12-30 11:59:05,220 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
11:59:26 2014-12-30 11:59:07,745 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler_Worker-56) Exception during connection
11:59:26 2014-12-30 11:59:07,752 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
11:59:26 2014-12-30 11:59:07,758 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-56) [7e2170c7] Command SpmStatusVDSCommand(HostName = host_mixed_2, HostId = 825c1a69-19d0-4d64-8824-af76990d8c9e, storagePoolId = d83e1463-29dc-48a0-86e2-3fac3a41b15c) execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused

logs attached

Comment 11 Raz Tamir 2014-12-30 13:57:13 UTC
Created attachment 974407 [details]
new logs

Comment 12 Oved Ourfali 2014-12-30 17:45:35 UTC
Can you specify the flow in which it happened? Was the host responsive afterwards? Does it reproduce?

Comment 13 Piotr Kliczewski 2014-12-30 18:44:13 UTC
I checked the logs; the resulting issue is the same, but the cause is different:

Thread-64528::DEBUG::2014-12-30 11:59:04,007::API::545::vds::(migrationCreate) Migration create
Thread-64528::INFO::2014-12-30 11:59:04,007::clientIF::394::vds::(createVm) vmContainerLock acquired by vm 777439ba-28fe-4bd4-bad2-d602ea13b24d
Thread-64528::WARNING::2014-12-30 11:59:04,008::libvirtconnection::135::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-64528::CRITICAL::2014-12-30 11:59:04,009::libvirtconnection::137::root::(wrapper) taking calling process down.
MainThread::DEBUG::2014-12-30 11:59:04,010::vdsm::58::vds::(sigtermHandler) Received signal 15
Thread-64528::DEBUG::2014-12-30 11:59:04,010::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 7 level: 2 message: internal error client socket is closed
MainThread::DEBUG::2014-12-30 11:59:04,010::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
Thread-64528::INFO::2014-12-30 11:59:04,010::vm::6098::root::(_getNetworkIp) network rhevm: using 0
MainThread::INFO::2014-12-30 11:59:04,011::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
Detector thread::DEBUG::2014-12-30 11:59:04,012::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
MainThread::INFO::2014-12-30 11:59:04,014::vmchannels::188::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2014-12-30 11:59:04,014::momIF::91::MOM::(stop) Shutting down MOM


It seems that vdsm stopped/crashed during VM migration. I suggest opening another bug.

Comment 14 Oved Ourfali 2014-12-30 19:04:21 UTC
Based on comment #13, moving back to ON_QA. Raz, if this does not reproduce with the scenario in the description, please move to VERIFIED. Also, please open a new bug for the new scenario and attach the relevant logs there.

Comment 15 Raz Tamir 2014-12-31 10:17:20 UTC
Based on comment #14, moving to VERIFIED; I will open a new bug if needed.

Comment 16 Eyal Edri 2015-02-17 17:13:39 UTC
RHEV 3.5.0 was released. Closing.