Bug 1387949
| Summary: | Engine commands stuck on hosts with: Unrecognized protocol: 'SUBSCRI'. | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> | |
| Component: | vdsm-jsonrpc-java | Assignee: | Piotr Kliczewski <pkliczew> | |
| Status: | CLOSED ERRATA | QA Contact: | Jiri Belka <jbelka> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 3.6.9 | CC: | bazulay, dmoessne, gklein, gveitmic, jentrena, lsurette, mperina, oourfali, pkliczew, pstehlik, rabraham, rbalakri, Rhev-m-bugs, srevivo, ykaul | |
| Target Milestone: | ovirt-4.1.0-alpha | Keywords: | ZStream | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1391370 (view as bug list) | Environment: | ||
| Last Closed: | 2017-04-25 00:56:42 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1391370 | |||
Ok, now here is something VERY interesting.
I've just setup a RHEL6.8-latest host in my 3.6.9-latest environment. Surprise, hit the same the customer is hitting.
First the versions, which look the same as the customer (all very latest)
Host:
# rpm -qa | egrep 'vdsm|json'
vdsm-python-4.16.38-1.el6ev.noarch
vdsm-yajsonrpc-4.16.38-1.el6ev.noarch
vdsm-xmlrpc-4.16.38-1.el6ev.noarch
vdsm-jsonrpc-4.16.38-1.el6ev.noarch
vdsm-4.16.38-1.el6ev.x86_64
vdsm-python-zombiereaper-4.16.38-1.el6ev.noarch
vdsm-cli-4.16.38-1.el6ev.noarch
python-simplejson-2.0.9-3.1.el6.x86_64
Engine:
# rpm -qa | egrep 'rhevm|json'
rhevm-iso-uploader-3.6.0-1.el6ev.noarch
rhevm-setup-plugin-websocket-proxy-3.6.9.2-0.1.el6.noarch
rhevm-setup-plugin-vmconsole-proxy-helper-3.6.9.2-0.1.el6.noarch
rhevm-spice-client-x86-msi-3.6-7.el6.noarch
rhevm-image-uploader-3.6.1-2.el6ev.noarch
rhevm-dependencies-3.6.1-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.9.2-0.1.el6.noarch
rhevm-restapi-3.6.9.2-0.1.el6.noarch
rhevm-extensions-api-impl-3.6.9.2-0.1.el6.noarch
rhevm-dbscripts-3.6.9.2-0.1.el6.noarch
rhevm-spice-client-x64-cab-3.6-7.el6.noarch
rhevm-cli-3.6.9.0-1.el6ev.noarch
rhevm-branding-rhev-3.6.0-10.el6ev.noarch
rhevm-websocket-proxy-3.6.9.2-0.1.el6.noarch
rhevm-doc-3.6.9-1.el6eng.noarch
rhevm-webadmin-portal-3.6.9.2-0.1.el6.noarch
rhevm-setup-plugins-3.6.5-1.el6ev.noarch
rhevm-tools-3.6.9.2-0.1.el6.noarch
rhevm-spice-client-x64-msi-3.6-7.el6.noarch
vdsm-jsonrpc-java-1.1.12-1.el6ev.noarch
json-c-0.11-12.el6.x86_64
rhevm-setup-base-3.6.9.2-0.1.el6.noarch
rhevm-backend-3.6.9.2-0.1.el6.noarch
rhevm-3.6.9.2-0.1.el6.noarch
rhevm-lib-3.6.9.2-0.1.el6.noarch
rhevm-setup-plugin-ovirt-engine-common-3.6.9.2-0.1.el6.noarch
rhevm-vmconsole-proxy-helper-3.6.9.2-0.1.el6.noarch
rhevm-spice-client-x86-cab-3.6-7.el6.noarch
rhevm-guest-agent-common-1.0.11-6.el6ev.noarch
rhevm-log-collector-3.6.1-1.el6ev.noarch
rhevm-tools-backup-3.6.9.2-0.1.el6.noarch
rhevm-setup-3.6.9.2-0.1.el6.noarch
rhevm-userportal-3.6.9.2-0.1.el6.noarch
rhevm-sdk-python-3.6.9.1-1.el6ev.noarch
During host installation it already had a hiccup. The engine timing out talking to vdsm, host went non-responding then back again.
Examples:
Engine:
2016-10-26 14:38:13,360 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-5) [2f44fe1d] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host RHEL6.8 (Address: 10.64.0.195).
2016-10-26 14:43:51,839 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-16) [] Command 'ListVDSCommand(HostName = RHEL6.8, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29c1dd12-4542-4c43-a83c-3727f42e1b2a', vds='Host[RHEL6.8,29c1dd12-4542-4c43-a83c-3727f42e1b2a]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
While in the host:
Detector thread::WARNING::2016-10-26 14:23:52,895::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:24:01,362::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:24:29,425::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:34:39,427::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:38:14,712::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
Detector thread::WARNING::2016-10-26 14:40:32,966::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI'
I haven't tracked message by message like I did in comment #0, because the times are out of sync, but it looks exactly the same issue to me.
And this time is in our LAN where there are no network issues, no VMs running, nothing. It's all idle.
Seems 100% reproducible to me.
While I am writing this my host already went Non-Responsive and then back to Up twice!
I assume you will want to reproduce this in your env to study it in more detail. Please let me know if you need anything else.
Thanks,
Germano
Please enable debug log in your environment and provide them. Hi Piotr, I actually modified the setup from yesterday by adding a second RHEL6 host. I captured about 10 minutes of logs, where you can see multiple timeouts and failures similar to the BZ description. And now with two Hosts the SPM is bouncing as well. Debug logs are attached. Please see the audit.log file first. Thanks, Germano Based on the logs I see that during setupNetworks the reconnect took sometime due to wrong order of stomp frames. After sometime the connection was established but we need to fix the order anyway. Fix verified by using ovirt-engine-3.6.9.2 and vdsm-4.16.38 Hi Piotr, I have deployed the patched vdsm-jsonrpc-java-client-1.1.13.jar you provided me into my reproducer. Within a bit more than 1 hour of testing (that is manually switching the SPM around), I had just one Not Responding/Bounce issue. Whereas before I usually to get one for each host every couple minutes. It seems much better, but I am not comfortable with this one occurrence. I am unable to reproduce it again though whereas it was extremely easy with the previous version. I don't think anything else changed in this env in the meantime, so even if the issue is not 100% fixed it looks like we have a huge improvement. I'll let it run for another 24h and report back if it happens again, looks like it won't. By the way, that unrecognized protocol SUBSCRI error is gone as well (did not show up for that single occurrence as well). Thanks, Hi Piotr, More than 24h now. All good, did not happen again, logs all clean. Thanks ok, vdsm-jsonrpc-java-1.3.7-1.el7ev.noarch |
By the way there are quite a few logging typos in lib/yajsonrpc/stomp.py, even in latest 4.18.5. Examples: raise ValueError("Contains illigal charachter `:`") raise ValueError("Containes invalid escape squence `\\%s`" % e.args[0])