Description of problem:
ovsdb-server-nb.log is logging SSL connection drops, while ovirt-provider-ovn seems to function correctly. It looks like an ovirt-provider-ovn component (python-openvswitch or ovsdbapp) is connecting incorrectly to the OVSDB Northbound DB.

Version-Release number of selected component (if applicable):
python-openvswitch-2.9.0-1.el7fdp.noarch
openvswitch-ovn-common-2.9.0-1.el7fdp.x86_64
openvswitch-2.9.0-1.el7fdp.x86_64
openvswitch-ovn-central-2.9.0-1.el7fdp.x86_64
RHV 4.2.2.6-0.1.el7

How reproducible:
100%

Steps to Reproduce:
1. Keep ovirt-provider-ovn running (OVSDB NB client).

Actual results:
Errors in ovsdb-server-nb.log:
2018-04-04T07:57:07.783Z|00070|reconnect|WARN|ssl:127.0.0.1:54216: connection dropped (Protocol error)
2018-04-04T08:02:08.281Z|00071|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2018-04-04T08:02:08.281Z|00072|jsonrpc|WARN|ssl:127.0.0.1:54264: receive error: Protocol error

Expected results:
No errors should appear.

Additional info:
Created attachment 1417128 [details] ovn logs
Created attachment 1417129 [details] packet capture
One observation from the logs is that the error happens every 5 minutes. Trying to identify operations that run every 5 minutes would be a good first step towards identifying the problem. I'm confused about the pcap file. It only covers about a quarter of a second, and none of the port numbers in the pcap match the port numbers in the log file.
Created attachment 1417332 [details] pair of pcap and logs
(In reply to Mark Michelson from comment #4)
> One observation from the logs is that the error happens every 5 minutes.
> Trying to identify operations that run every 5 minutes would be a good first
> step towards identifying the problem.

This is a listing of networks triggered by ovirt engine.

> I'm confused about the pcap file. It only covers about a quarter of a
> second, and none of the port numbers in the pcap match the port numbers in
> the log file.

Please find a matching pair in attachment 1417332 [details].
Looking at the traffic trace, I guess the issue could be due to the sequence of operations performed at connection tear-down. In particular, I think the client does not run the 'shutdown' syscall but just closes the socket as soon as it has no more data to transmit. When the client's TCP layer then receives new data from the ovn-nb DB, it sends back a RST segment to tear down the connection. What client are you using? ovsdbapp?
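The tear-down difference described above can be illustrated with plain TCP sockets (a minimal sketch; the SSL case additionally needs the TLS close_notify alert, but the syscall sequence is analogous). Calling shutdown() before close() lets the peer read a clean EOF instead of risking a RST:

```python
import socket
import threading

# Server side: send some data, then read until the client signals EOF.
def run_server(listener, result):
    conn, _ = listener.accept()
    with conn:
        conn.sendall(b"payload")           # server still has data to send
        result.append(conn.recv(1024))     # graceful peer: clean EOF (b"")

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
result = []
t = threading.Thread(target=run_server, args=(listener, result))
t.start()

client = socket.create_connection(listener.getsockname())
client.recv(1024)                  # drain the server's pending data first
client.shutdown(socket.SHUT_WR)    # send FIN: "no more data from me"
client.close()

t.join()
listener.close()
print(result)  # [b''] -- the server saw an orderly EOF, not a reset
```

Skipping the shutdown() and closing a socket that still has unread incoming data is what can make the kernel answer with a RST, which is consistent with the trace.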
(In reply to lorenzo bianconi from comment #7) > What client are you using? ovsdbapp? Yes, we are using ovsdbapp like in https://github.com/oVirt/ovirt-provider-ovn/blob/master/provider/ovndb/ovn_north.py#L98
Created attachment 1476242 [details]
Try to close the idl when the Connection is stopped

Can you test out this patch to see if it fixes the issue for you? I don't really know how RHEV is using ovsdbapp, and the 5-minute interval is suspicious; I wouldn't expect that to come from ovsdbapp itself.

If whatever is handling the ovsdbapp Connection is exiting without calling Connection.stop(), I'm not sure this patch will help, so you might also ensure that Connection.stop() is called in addition to applying the patch.
Sorry Terry, I am currently not able to verify your patch because I don't have it installed in my environment, and I also moved to a different team along the way. To the QE who takes it: verification should be easy. Install RHV 4.2 with ovirt-provider-ovn (engine-setup asks for it), keep it running for a while, and watch ovsdb-server-nb.log for errors as per the bug description. Dominik, do you think this error is triggered by ovirt-provider-ovn? Or is it enough to give it idle time to run and watch the log for errors?
(In reply to Terry Wilson from comment #11)
> Created attachment 1476242 [details]
> Try to close the idl when the Connection is stopped
>
> Can you test out this patch to see if it fixes the issue for you? I don't
> really know how RHEV is using ovsdbapp, and the 5-minute interval is
> suspicious and I wouldn't expect that to come from ovsdbapp itself.
>
> If whatever is handling the ovsdbapp Connection is exiting w/o calling
> Connection.stop(), I'm not sure if this patch will help, so you might also
> ensure that Connection.stop() is called in addition to applying the patch.

OK, so the network team gave me an environment to check this. The issue is still reproducible on RHV 4.2.6.3_SNAPSHOT-94. I applied the patch on connection.py, but I still see the error in the logs. See the attached archive.

Versions:
ovirt-provider-ovn-1.2.14-1.el7ev.noarch
openvswitch-2.9.0-55.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
openvswitch-ovn-central-2.9.0-55.el7fdp.x86_64
openvswitch-ovn-common-2.9.0-55.el7fdp.x86_64
python-openvswitch-2.9.0-55.el7fdp.noarch
python-ovsdbapp-0.6.0-1.el7ev.noarch
Created attachment 1476873 [details] logs and patched connections.py
ovirt-provider-ovn is using python2-ovsdbapp-0.6.0-1, which shuts down using 'close()' instead of 'stop()'. Let's see if we can update ovirt-provider-ovn to use a newer version of ovsdbapp. Is there a minimum version required for the new connection shutdown behavior?
Looking at the code here: https://github.com/oVirt/ovirt-provider-ovn/blob/2cfaeefb01d5ad112e98a09abfca032465affd8c/provider/neutron/neutron_api.py#L1164 it looks like it specifically calls close() on the 'ovsidl' returned by python-ovs, instead of calling stop() on the ovsdbapp Connection behind the OvnNbApiIdlImpl it returned, which is what the patch would affect. Changing the self.ovsidl.close() in __exit__ referenced at the link above to self.idl.ovsdb_connection.stop() would, I think, exercise the patch.
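For illustration, the suggested change to the provider's __exit__ would look roughly like this (a sketch only, not runnable standalone; the attribute names are taken from the linked neutron_api.py and may not match the provider exactly):

```python
# Hypothetical sketch of the change suggested above (names approximate,
# based on the linked neutron_api.py).
def __exit__(self, exc_type, exc_value, traceback):
    # Before: closed the raw python-ovs Idl directly, bypassing ovsdbapp's
    # Connection cleanup:
    #     self.ovsidl.close()
    # After: ask ovsdbapp to stop its Connection, which (with the attached
    # patch) also takes care of closing the underlying idl:
    self.idl.ovsdb_connection.stop()
```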
Created attachment 1480123 [details] Call stop() on the ovsdbapp Connection This patch to the ovn provider would be required to actually use the previous patch.
(In reply to Terry Wilson from comment #17)
> Created attachment 1480123 [details]
> Call stop() on the ovsdbapp Connection
>
> This patch to the ovn provider would be required to actually use the
> previous patch.

There are actually a couple of other places where this could happen. For example, in ovsdbapp/backend/ovs_idl/idlutils.py, get_schema_helper uses an RPC connection to get the schema. We run this code:

```python
from ovs import stream
from ovs import jsonrpc
import ovs

ovs.stream.Stream.ssl_set_private_key_file('/etc/pki/ovirt-engine/keys/ovirt-provider-ovn.key.nopass')
ovs.stream.Stream.ssl_set_certificate_file('/etc/pki/ovirt-engine/certs/ovirt-provider-ovn.cer')
ovs.stream.Stream.ssl_set_ca_cert_file('/etc/pki/ovirt-engine/ca.pem')

err, strm = stream.Stream.open_block(stream.Stream.open('ssl:127.0.0.1:6641'))
rpc = jsonrpc.Connection(strm)
req = jsonrpc.Message.create_request('get_schema', ['OVN_Northbound'])
err, resp = rpc.transact_block(req)
rpc.close()
```
Hello, I have a customer who is facing this same issue. The customer is running on 4.2.6, and I believe they were not seeing these errors on 4.2.5. Could you let us know why this is happening, and whether there is any workaround or a timeline for this issue to be resolved? Thanks, Siddhant Rao
Any progress on this issue? I'm seeing those errors on an empty installation of RHV 4.2.6. Also, I noticed that when I edit the OVS provider via the RHV Manager and click on Test, I always have to acknowledge the self-signed SSL cert; I thought it should store the cert, or maybe have a configurable CA. Side note: testing the connection also throws those error messages.
I can get rid of it by disabling "Automatic Synchronization". I do not use the default network provider, so I guess it is safe to disable it.
Support said that as long as I do not use the oVirt OVN provider, I can safely disable "Automatic Synchronization" for it and thereby get rid of the error every 5 minutes.
Comment on attachment 1476242 [details]
Try to close the idl when the Connection is stopped

Tested this patch, including the ovirt-provider-ovn change that enables this one to be triggered, and the problem persists.
Created attachment 1520038 [details]
patch verification process and logs

The process listed in this attachment can be followed to reproduce the issue, and to work on it, if desired.
The solution proposed in patches [0] and [1] didn't work. Terry, is there any extra info you require?

[0] - https://bugzilla.redhat.com/attachment.cgi?id=1480123
[1] - https://bugzilla.redhat.com/attachment.cgi?id=1476242
Clearing the needinfo on me, because Miguel replied.
Created attachment 1578130 [details] reproducer
Created attachment 1578131 [details] python-ovs fix
This issue isn't in ovsdbapp, but in python-ovs; see the attached reproducer script. ovs.stream.SSLStream has no special close() handling and so just closes the socket. tcpdumps showed a "close notify" from the server, but not from the client. A patch adding a close() method to SSLStream that calls shutdown and ignores any errors is attached. It fixed the issue for me. Can someone experiencing the issue verify that the patch works for them?
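For reference, such a close() override would look roughly like this (a sketch only, not the exact attached patch, which is authoritative; python-ovs uses pyOpenSSL for its SSL streams, and Stream is the base class in ovs/stream.py):

```python
from OpenSSL import SSL  # python-ovs builds its SSL streams on pyOpenSSL

class SSLStream(Stream):  # Stream as defined in ovs/stream.py
    def close(self):
        # pyOpenSSL's Connection.shutdown() sends the TLS close_notify
        # alert; without it, ovsdb-server logs "unexpected SSL connection
        # close". Errors are ignored because the peer may already be gone.
        try:
            self.socket.shutdown()
        except SSL.Error:
            pass
        return super(SSLStream, self).close()
```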
Created attachment 1589825 [details] terminal log of testing attachment 1578131 [details]
Comment on attachment 1589825 [details] terminal log of testing attachment 1578131 [details] Thanks, Terry, your fix seems to work!
reproduced on python-openvswitch-2.9.0-123.el7fdp.x86_64: [root@hp-dl380pg8-12 bz1780745]# bash -x setup.sh + systemctl start ovn-northd + ovs-pki init --force Creating controllerca... Creating switchca... + pushd /etc/openvswitch /etc/openvswitch ~/bz1780745 + ovs-pki req+sign northdb controller northdb-req.pem Wed Dec 18 01:50:03 EST 2019 fingerprint a0c1cb6e217047f7f5c30f415123add2099ad0b5 + ovs-pki req+sign nbctl controller nbctl-req.pem Wed Dec 18 01:50:03 EST 2019 fingerprint 3dd43257df24557e7a4e4c2c6da4d92e2b39e27d + popd ~/bz1780745 + chown -R openvswitch /etc/openvswitch + chown -R openvswitch /var/lib/openvswitch + ovn-nbctl set-ssl /etc/openvswitch/northdb-privkey.pem /etc/openvswitch/northdb-cert.pem /var/lib/openvswitch/pki/controllerca/cacert.pem + ovn-nbctl set-connection pssl:6641 + python ovsdb_ssl_test.py (0, None, "[u'hello world']") + tail -n 10 /var/log/messages Dec 18 01:50:02 hp-dl380pg8-12 ovsdb-server: ovs|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server-sb.log Dec 18 01:50:02 hp-dl380pg8-12 ovsdb-server: ovs|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.9.0 Dec 18 01:50:02 hp-dl380pg8-12 ovn-sbctl: ovs|00001|sbctl|INFO|Called as ovn-sbctl init Dec 18 01:50:02 hp-dl380pg8-12 ovn-ctl: Starting ovn-northd [ OK ] Dec 18 01:50:02 hp-dl380pg8-12 systemd: Started OVN northd management daemon. 
Dec 18 01:50:03 hp-dl380pg8-12 ovn-nbctl: ovs|00001|nbctl|INFO|Called as ovn-nbctl set-ssl /etc/openvswitch/northdb-privkey.pem /etc/openvswitch/northdb-cert.pem /var/lib/openvswitch/pki/controllerca/cacert.pem Dec 18 01:50:03 hp-dl380pg8-12 ovn-nbctl: ovs|00001|nbctl|INFO|Called as ovn-nbctl set-connection pssl:6641 Dec 18 01:50:03 hp-dl380pg8-12 ovsdb-server: ovs|00003|stream_ssl|WARN|SSL_read: unexpected SSL connection close Dec 18 01:50:03 hp-dl380pg8-12 ovsdb-server: ovs|00004|jsonrpc|WARN|ssl:127.0.0.1:43324: receive error: Protocol error Dec 18 01:50:03 hp-dl380pg8-12 ovsdb-server: ovs|00005|reconnect|WARN|ssl:127.0.0.1:43324: connection dropped (Protocol error) <==== error message [root@hp-dl380pg8-12 bz1780745]# rpm -qa | grep openvswitch kernel-kernel-networking-openvswitch-ovn-regression-bz1771854_replicate_old_schema-1.0-2.noarch openvswitch-ovn-host-2.9.0-123.el7fdp.x86_64 kernel-kernel-networking-openvswitch-ovn-common-1.0-6.noarch openvswitch-2.9.0-123.el7fdp.x86_64 openvswitch-ovn-central-2.9.0-123.el7fdp.x86_64 python-openvswitch-2.9.0-123.el7fdp.x86_64 openvswitch-selinux-extra-policy-1.0-14.el7fdp.noarch openvswitch-ovn-common-2.9.0-123.el7fdp.x86_64 Verified on python-openvswitch-2.9.0-124.el7fdp.x86_64: [root@hp-dl380pg8-12 bz1780745]# bash -x setup.sh + systemctl start ovn-northd + ovs-pki init --force Creating controllerca... Creating switchca... 
+ pushd /etc/openvswitch /etc/openvswitch ~/bz1780745 + ovs-pki req+sign northdb controller northdb-req.pem Wed Dec 18 01:53:46 EST 2019 fingerprint 299b89c654e1ffd36f10459e04bb63fe5189d84f + ovs-pki req+sign nbctl controller nbctl-req.pem Wed Dec 18 01:53:47 EST 2019 fingerprint 55cf8f65fd158748bd57626e6f11cc64620d338e + popd ~/bz1780745 + chown -R openvswitch /etc/openvswitch + chown -R openvswitch /var/lib/openvswitch + ovn-nbctl set-ssl /etc/openvswitch/northdb-privkey.pem /etc/openvswitch/northdb-cert.pem /var/lib/openvswitch/pki/controllerca/cacert.pem + ovn-nbctl set-connection pssl:6641 + python ovsdb_ssl_test.py (0, None, "[u'hello world']") + tail -n 10 /var/log/messages Dec 18 01:53:46 hp-dl380pg8-12 ovn-nbctl: ovs|00001|nbctl|INFO|Called as ovn-nbctl init Dec 18 01:53:46 hp-dl380pg8-12 ovn-ctl: /var/lib/openvswitch/ovnsb_db.db does not exist ... (warning). Dec 18 01:53:46 hp-dl380pg8-12 ovn-ctl: Creating empty database /var/lib/openvswitch/ovnsb_db.db [ OK ] Dec 18 01:53:46 hp-dl380pg8-12 ovsdb-server: ovs|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server-sb.log Dec 18 01:53:46 hp-dl380pg8-12 ovsdb-server: ovs|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.9.0 Dec 18 01:53:46 hp-dl380pg8-12 ovn-sbctl: ovs|00001|sbctl|INFO|Called as ovn-sbctl init Dec 18 01:53:46 hp-dl380pg8-12 ovn-ctl: Starting ovn-northd [ OK ] Dec 18 01:53:46 hp-dl380pg8-12 systemd: Started OVN northd management daemon. 
Dec 18 01:53:47 hp-dl380pg8-12 ovn-nbctl: ovs|00001|nbctl|INFO|Called as ovn-nbctl set-ssl /etc/openvswitch/northdb-privkey.pem /etc/openvswitch/northdb-cert.pem /var/lib/openvswitch/pki/controllerca/cacert.pem Dec 18 01:53:47 hp-dl380pg8-12 ovn-nbctl: ovs|00001|nbctl|INFO|Called as ovn-nbctl set-connection pssl:6641 <===== no error message [root@hp-dl380pg8-12 bz1780745]# rpm -qa | grep openvswitch kernel-kernel-networking-openvswitch-ovn-regression-bz1771854_replicate_old_schema-1.0-2.noarch openvswitch-ovn-central-2.9.0-124.el7fdp.x86_64 kernel-kernel-networking-openvswitch-ovn-common-1.0-6.noarch openvswitch-ovn-common-2.9.0-124.el7fdp.x86_64 openvswitch-ovn-host-2.9.0-124.el7fdp.x86_64 openvswitch-selinux-extra-policy-1.0-14.el7fdp.noarch openvswitch-2.9.0-124.el7fdp.x86_64 python-openvswitch-2.9.0-124.el7fdp.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:0165