By the way there are quite a few logging typos in lib/yajsonrpc/stomp.py, even in latest 4.18.5. Examples: raise ValueError("Contains illigal charachter `:`") raise ValueError("Containes invalid escape squence `\\%s`" % e.args[0])
Ok, now here is something VERY interesting. I've just setup a RHEL6.8-latest host in my 3.6.9-latest environment. Surprise, hit the same the customer is hitting. First the versions, which look the same as the customer (all very latest) Host: # rpm -qa | egrep 'vdsm|json' vdsm-python-4.16.38-1.el6ev.noarch vdsm-yajsonrpc-4.16.38-1.el6ev.noarch vdsm-xmlrpc-4.16.38-1.el6ev.noarch vdsm-jsonrpc-4.16.38-1.el6ev.noarch vdsm-4.16.38-1.el6ev.x86_64 vdsm-python-zombiereaper-4.16.38-1.el6ev.noarch vdsm-cli-4.16.38-1.el6ev.noarch python-simplejson-2.0.9-3.1.el6.x86_64 Engine: # rpm -qa | egrep 'rhevm|json' rhevm-iso-uploader-3.6.0-1.el6ev.noarch rhevm-setup-plugin-websocket-proxy-3.6.9.2-0.1.el6.noarch rhevm-setup-plugin-vmconsole-proxy-helper-3.6.9.2-0.1.el6.noarch rhevm-spice-client-x86-msi-3.6-7.el6.noarch rhevm-image-uploader-3.6.1-2.el6ev.noarch rhevm-dependencies-3.6.1-1.el6ev.noarch rhevm-setup-plugin-ovirt-engine-3.6.9.2-0.1.el6.noarch rhevm-restapi-3.6.9.2-0.1.el6.noarch rhevm-extensions-api-impl-3.6.9.2-0.1.el6.noarch rhevm-dbscripts-3.6.9.2-0.1.el6.noarch rhevm-spice-client-x64-cab-3.6-7.el6.noarch rhevm-cli-3.6.9.0-1.el6ev.noarch rhevm-branding-rhev-3.6.0-10.el6ev.noarch rhevm-websocket-proxy-3.6.9.2-0.1.el6.noarch rhevm-doc-3.6.9-1.el6eng.noarch rhevm-webadmin-portal-3.6.9.2-0.1.el6.noarch rhevm-setup-plugins-3.6.5-1.el6ev.noarch rhevm-tools-3.6.9.2-0.1.el6.noarch rhevm-spice-client-x64-msi-3.6-7.el6.noarch vdsm-jsonrpc-java-1.1.12-1.el6ev.noarch json-c-0.11-12.el6.x86_64 rhevm-setup-base-3.6.9.2-0.1.el6.noarch rhevm-backend-3.6.9.2-0.1.el6.noarch rhevm-3.6.9.2-0.1.el6.noarch rhevm-lib-3.6.9.2-0.1.el6.noarch rhevm-setup-plugin-ovirt-engine-common-3.6.9.2-0.1.el6.noarch rhevm-vmconsole-proxy-helper-3.6.9.2-0.1.el6.noarch rhevm-spice-client-x86-cab-3.6-7.el6.noarch rhevm-guest-agent-common-1.0.11-6.el6ev.noarch rhevm-log-collector-3.6.1-1.el6ev.noarch rhevm-tools-backup-3.6.9.2-0.1.el6.noarch rhevm-setup-3.6.9.2-0.1.el6.noarch rhevm-userportal-3.6.9.2-0.1.el6.noarch rhevm-sdk-python-3.6.9.1-1.el6ev.noarch During host installation it already had a hiccup. The engine timing out talking to vdsm, host went non-responding then back again. Examples: Engine: 2016-10-26 14:38:13,360 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-5) [2f44fe1d] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host RHEL6.8 (Address: 10.64.0.195). 2016-10-26 14:43:51,839 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-16) [] Command 'ListVDSCommand(HostName = RHEL6.8, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29c1dd12-4542-4c43-a83c-3727f42e1b2a', vds='Host[RHEL6.8,29c1dd12-4542-4c43-a83c-3727f42e1b2a]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues While in the host: Detector thread::WARNING::2016-10-26 14:23:52,895::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI' Detector thread::WARNING::2016-10-26 14:24:01,362::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI' Detector thread::WARNING::2016-10-26 14:24:29,425::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI' Detector thread::WARNING::2016-10-26 14:34:39,427::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI' Detector thread::WARNING::2016-10-26 14:38:14,712::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI' Detector thread::WARNING::2016-10-26 14:40:32,966::protocoldetector::242::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'SUBSCRI' I haven't tracked message by message like I did in comment #0, because the times are out of sync, but it looks exactly the same issue to me. And this time is in our LAN where there are no network issues, no VMs running, nothing. It's all idle. Seems 100% reproducible to me. While I am writing this my host already went Non-Responsive and then back to Up twice! I assume you will want to reproduce this in your env to study it in more detail. Please let me know if you need anything else. Thanks, Germano
Please enable debug log in your environment and provide them.
Hi Piotr, I actually modified the setup from yesterday by adding a second RHEL6 host. I captured about 10 minutes of logs, where you can see multiple timeouts and failures similar to the BZ description. And now with two Hosts the SPM is bouncing as well. Debug logs are attached. Please see the audit.log file first. Thanks, Germano
Based on the logs I see that during setupNetworks the reconnect took sometime due to wrong order of stomp frames. After sometime the connection was established but we need to fix the order anyway.
Fix verified by using ovirt-engine-3.6.9.2 and vdsm-4.16.38
Hi Piotr, I have deployed the patched vdsm-jsonrpc-java-client-1.1.13.jar you provided me into my reproducer. Within a bit more than 1 hour of testing (that is manually switching the SPM around), I had just one Not Responding/Bounce issue. Whereas before I usually to get one for each host every couple minutes. It seems much better, but I am not comfortable with this one occurrence. I am unable to reproduce it again though whereas it was extremely easy with the previous version. I don't think anything else changed in this env in the meantime, so even if the issue is not 100% fixed it looks like we have a huge improvement. I'll let it run for another 24h and report back if it happens again, looks like it won't. By the way, that unrecognized protocol SUBSCRI error is gone as well (did not show up for that single occurrence as well). Thanks,
Hi Piotr, More than 24h now. All good, did not happen again, logs all clean. Thanks
ok, vdsm-jsonrpc-java-1.3.7-1.el7ev.noarch