Bug 1387949 - Engine commands stuck on hosts with: Unrecognized protocol: 'SUBSCRI'.
Summary: Engine commands stuck on hosts with: Unrecognized protocol: 'SUBSCRI'.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm-jsonrpc-java
Version: 3.6.9
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ovirt-4.1.0-alpha
: ---
Assignee: Piotr Kliczewski
QA Contact: Jiri Belka
URL:
Whiteboard:
Depends On:
Blocks: 1391370
TreeView+ depends on / blocked
 
Reported: 2016-10-24 01:21 UTC by Germano Veit Michel
Modified: 2020-03-11 15:19 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1391370 (view as bug list)
Environment:
Last Closed: 2017-04-25 00:56:42 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2724451 0 None None None 2016-10-24 23:37:14 UTC
Red Hat Product Errata RHEA-2017:1008 0 normal SHIPPED_LIVE vdsm-jsonrpc-java bug fix and enhancement update 2017-04-18 20:18:18 UTC
oVirt gerrit 60262 0 None MERGED control messages need specific order 2020-06-29 12:28:33 UTC

Comment 2 Germano Veit Michel 2016-10-24 01:40:41 UTC
By the way there are quite a few logging typos in lib/yajsonrpc/stomp.py, even in latest 4.18.5. Examples:

        raise ValueError("Contains illigal charachter `:`")

        raise ValueError("Containes invalid escape squence `\\%s`" % e.args[0])

Comment 20 Germano Veit Michel 2016-10-26 05:10:49 UTC
Ok, now here is something VERY interesting.

I've just setup a RHEL6.8-latest host in my 3.6.9-latest environment. Surprise, hit the same the customer is hitting.

First the versions, which look the same as the customer (all very latest)

Host:
# rpm -qa | egrep 'vdsm|json'
vdsm-python-4.16.38-1.el6ev.noarch
vdsm-yajsonrpc-4.16.38-1.el6ev.noarch
vdsm-xmlrpc-4.16.38-1.el6ev.noarch
vdsm-jsonrpc-4.16.38-1.el6ev.noarch
vdsm-4.16.38-1.el6ev.x86_64
vdsm-python-zombiereaper-4.16.38-1.el6ev.noarch
vdsm-cli-4.16.38-1.el6ev.noarch
python-simplejson-2.0.9-3.1.el6.x86_64

Engine:
# rpm -qa | egrep 'rhevm|json'
rhevm-iso-uploader-3.6.0-1.el6ev.noarch
rhevm-setup-plugin-websocket-proxy-3.6.9.2-0.1.el6.noarch
rhevm-setup-plugin-vmconsole-proxy-helper-3.6.9.2-0.1.el6.noarch
rhevm-spice-client-x86-msi-3.6-7.el6.noarch
rhevm-image-uploader-3.6.1-2.el6ev.noarch
rhevm-dependencies-3.6.1-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.9.2-0.1.el6.noarch
rhevm-restapi-3.6.9.2-0.1.el6.noarch
rhevm-extensions-api-impl-3.6.9.2-0.1.el6.noarch
rhevm-dbscripts-3.6.9.2-0.1.el6.noarch
rhevm-spice-client-x64-cab-3.6-7.el6.noarch
rhevm-cli-3.6.9.0-1.el6ev.noarch
rhevm-branding-rhev-3.6.0-10.el6ev.noarch
rhevm-websocket-proxy-3.6.9.2-0.1.el6.noarch
rhevm-doc-3.6.9-1.el6eng.noarch
rhevm-webadmin-portal-3.6.9.2-0.1.el6.noarch
rhevm-setup-plugins-3.6.5-1.el6ev.noarch
rhevm-tools-3.6.9.2-0.1.el6.noarch
rhevm-spice-client-x64-msi-3.6-7.el6.noarch
vdsm-jsonrpc-java-1.1.12-1.el6ev.noarch
json-c-0.11-12.el6.x86_64
rhevm-setup-base-3.6.9.2-0.1.el6.noarch
rhevm-backend-3.6.9.2-0.1.el6.noarch
rhevm-3.6.9.2-0.1.el6.noarch
rhevm-lib-3.6.9.2-0.1.el6.noarch
rhevm-setup-plugin-ovirt-engine-common-3.6.9.2-0.1.el6.noarch
rhevm-vmconsole-proxy-helper-3.6.9.2-0.1.el6.noarch
rhevm-spice-client-x86-cab-3.6-7.el6.noarch
rhevm-guest-agent-common-1.0.11-6.el6ev.noarch
rhevm-log-collector-3.6.1-1.el6ev.noarch
rhevm-tools-backup-3.6.9.2-0.1.el6.noarch
rhevm-setup-3.6.9.2-0.1.el6.noarch
rhevm-userportal-3.6.9.2-0.1.el6.noarch
rhevm-sdk-python-3.6.9.1-1.el6ev.noarch

During host installation it already had a hiccup. The engine timing out talking to vdsm, host went non-responding then back again.

Examples:

Engine:

2016-10-26 14:38:13,360 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-5) [2f44fe1d] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host RHEL6.8 (Address: 10.64.0.195).

2016-10-26 14:43:51,839 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-16) [] Command 'ListVDSCommand(HostName = RHEL6.8, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29c1dd12-4542-4c43-a83c-3727f42e1b2a', vds='Host[RHEL6.8,29c1dd12-4542-4c43-a83c-3727f42e1b2a]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

While in the host:

Detector thread::WARNING::2016-10-26 14:23:52,895::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:24:01,362::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:24:29,425::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:34:39,427::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:38:14,712::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:40:32,966::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'

I haven't tracked message by message like I did in comment #0, because the times are out of sync, but it looks exactly the same issue to me. 

And this time is in our LAN where there are no network issues, no VMs running, nothing. It's all idle.

Seems 100% reproducible to me.

While I am writing this my host already went Non-Responsive and then back to Up twice!

I assume you will want to reproduce this in your env to study it in more detail. Please let me know if you need anything else. 

Thanks,
Germano

Comment 22 Piotr Kliczewski 2016-10-26 07:20:14 UTC
Please enable debug log in your environment and provide them.

Comment 24 Germano Veit Michel 2016-10-27 05:11:20 UTC
Hi Piotr,

I actually modified the setup from yesterday by adding a second RHEL6 host. I captured about 10 minutes of logs, where you can see multiple timeouts and failures similar to the BZ description. And now with two Hosts the SPM is bouncing as well.

Debug logs are attached. Please see the audit.log file first.

Thanks,
Germano

Comment 25 Piotr Kliczewski 2016-10-28 09:35:06 UTC
Based on the logs I see that during setupNetworks the reconnect took sometime due to wrong order of stomp frames. After sometime the connection was established but we need to fix the order anyway.

Comment 34 Piotr Kliczewski 2016-11-04 08:10:27 UTC
Fix verified by using ovirt-engine-3.6.9.2 and vdsm-4.16.38

Comment 35 Germano Veit Michel 2016-11-07 04:00:03 UTC
Hi Piotr,

I have deployed the patched vdsm-jsonrpc-java-client-1.1.13.jar you provided me into my reproducer.

Within a bit more than 1 hour of testing (that is manually switching the SPM around), I had just one Not Responding/Bounce issue. Whereas before I usually to get one for each host every couple minutes. 

It seems much better, but I am not comfortable with this one occurrence. I am unable to reproduce it again though whereas it was extremely easy with the previous version.

I don't think anything else changed in this env in the meantime, so even if the issue is not 100% fixed it looks like we have a huge improvement.

I'll let it run for another 24h and report back if it happens again, looks like it won't.

By the way, that unrecognized protocol SUBSCRI error is gone as well (did not show up for that single occurrence as well).

Thanks,

Comment 36 Germano Veit Michel 2016-11-08 06:43:03 UTC
Hi Piotr,

More than 24h now. All good, did not happen again, logs all clean.

Thanks

Comment 37 Jiri Belka 2017-01-23 16:53:21 UTC
ok, vdsm-jsonrpc-java-1.3.7-1.el7ev.noarch


Note You need to log in before you can comment on or make changes to this bug.