Created attachment 1649028 [details]
journalctl engine and vdsm logs

Description of problem:
I see this issue on multiple environments, causing one of my hosts (1 out of 3) to become unresponsive in this build after clean reprovisioning (install RHEL 8.1 + install vdsm) to rhv-4.4.0-13. The issue occurs just after host deploy (the Ansible install finished).

ovn-controller.service status on the problematic host shows many "Private key does not match certificate public key" errors.

If I manually move the host to maintenance -> enroll certificate -> restart ovn-controller.service -> reinstall the host, the issue is resolved.

From journalctl on the host (storage-ge7-vdsm2.scl.lab.tlv.redhat.com):

Dec 23 11:46:32 storage-ge7-vdsm2.scl.lab.tlv.redhat.com vdsm[23859]: ERROR uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.35.162.6', 36176, 0, 0) at 0x7effe86139e8> (<class 'ssl.SSLError'>:[X509: KEY_VALUES_MISMATCH] key values mismatch (_ssl.c:3563) [/usr/lib64/python3.6/asyncore.py|readwrite|110] [/usr/lib64/python3.6/asyncore.py|handle_write_event|442] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_write|74] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|handle_write|190] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|_handle_io|194] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|_set_up_socket|156])
Dec 23 11:46:32 storage-ge7-vdsm2.scl.lab.tlv.redhat.com ovn-controller[20628]: ovs|00040|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private key assigned

Engine log (the engine clock shows +7 hours relative to the host's local time):

2019-12-23 18:46:32,207+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] EVENT_ID: VDS_ANSIBLE_INSTALL_FINISHED(561), Ansible host-deploy playbook execution has successfully finished on host host_mixed_2.
2019-12-23 18:46:32,343+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [7fc25b0f] Connecting to storage-ge7-vdsm2.scl.lab.tlv.redhat.com/10.35.82.76
2019-12-23 18:46:35,323+02 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (EE-ManagedThreadFactory-engine-Thread-4) [365d29e1] Retry failed
2019-12-23 18:48:34,677+02 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (EE-ManagedThreadFactory-engine-Thread-60) [365d29e1] Retry failed
2019-12-23 18:48:34,678+02 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] Host installation failed for host 'a5e32c47-498b-4d24-92a1-9a8bf332a03e', 'host_mixed_2': Network error during communication with the host
2019-12-23 18:48:34,685+02 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='a5e32c47-498b-4d24-92a1-9a8bf332a03e', status='NonResponsive', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2785942f
2019-12-23 18:48:34,694+02 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] FINISH, SetVdsStatusVDSCommand, return: , log id: 2785942f
2019-12-23 18:48:34,706+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_2 installation failed. Network error during communication with the host.?365d29e1?
Version-Release number of selected component (if applicable):
http://bob.eng.lab.tlv.redhat.com/builds/4.4/rhv-4.4.0-13/
Engine: ovirt-engine-setup-4.4.0-0.13.master.el7.noarch
Host:
ovn2.11-2.11.1-8.el8fdp.x86_64
rhv-openvswitch-ovn-host-2.11-6.el8ev.noarch
vdsm-4.40.0-164.git38a19bb.el8ev.x86_64

How reproducible:
30%; happens on 1 out of 3 hosts.

Steps to Reproduce:
1. Engine machine (not HE) installed with rhv-4.4.0-13.
2. Clean reprovision (install RHEL 8.1 from scratch and install vdsm) on 3 hosts.
3. Add the 3 hosts to the engine (done via automation), which triggers host deploy.

Actual results:
The host-deploy playbook execution finishes successfully also on the problematic host host_mixed_2 (storage-ge7-vdsm2.scl.lab.tlv.redhat.com), but when the engine tries to connect to this host to continue the installation, it fails.

Connecting from the engine to the host fails with:
2019-12-23 18:46:32,343+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [7fc25b0f] Connecting to storage-ge7-vdsm2.scl.lab.tlv.redhat.com/10.35.82.76
2019-12-23 18:46:35,323+02 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (EE-ManagedThreadFactory-engine-Thread-4) [365d29e1] Retry failed

The host stays 'non-responsive' forever.

Expected results:
Host deploy completes and the host becomes Up.

Additional info:
A) Checking this on another environment with the same issue, I saw that:
1) ovn-controller.service status shows many "Private key does not match certificate public key" errors.
[root@storage-ge11-vdsm3 ~]# systemctl status ovn-controller.service
● ovn-controller.service - OVN controller daemon
   Loaded: loaded (/usr/lib/systemd/system/ovn-controller.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-01-01 03:04:12 EST; 50min ago
  Process: 1410 ExecStop=/usr/share/openvswitch/scripts/ovn-ctl stop_controller (code=exited, status=0/SUCCESS)
  Process: 1428 ExecStart=/usr/share/openvswitch/scripts/ovn-ctl --no-monitor start_controller $OVN_CONTROLLER_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 1460 (ovn-controller)
    Tasks: 4 (limit: 17442)
   Memory: 3.5M
   CGroup: /system.slice/ovn-controller.service
           └─1460 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --private-key=/etc/pki/vdsm/keys/vdsmkey.pem --certificate=/etc/pki/vdsm/certs/vdsmcert.pem --ca-cert=>

Jan 01 03:53:17 storage-ge11-vdsm3.scl.lab.tlv.redhat.com ovn-controller[1460]: ovs|00389|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key>

B) Workaround found:
1) Move the host to maintenance (if you can't, sometimes restarting the engine + vdsm will help).
2) In webadmin -> Hosts -> choose host -> Installation -> Enroll Certificate -> wait until it's finished.
3) Connect to the host via ssh -> systemctl restart ovn-controller.service
4) In webadmin -> Hosts -> choose host -> Installation -> Reinstall -> wait until it's finished.
The host is up and running :)
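The key/certificate mismatch that the workaround fixes can be verified directly on the host, before and after enrolling the certificate. A minimal sketch, assuming only that openssl is available; the paths are the ones ovn-controller is started with per the service status above:

```shell
#!/bin/bash
# Compare the public key embedded in a certificate with the public key
# derived from a private key. A mismatch is exactly what produces the
# "Private key does not match certificate public key" errors above.
key_matches_cert() {
    local key="$1" cert="$2"
    local key_pub cert_pub
    key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || return 2
    cert_pub=$(openssl x509 -in "$cert" -pubkey -noout 2>/dev/null) || return 2
    [ "$key_pub" = "$cert_pub" ]
}

# Paths taken from the ovn-controller command line in the status output:
if key_matches_cert /etc/pki/vdsm/keys/vdsmkey.pem /etc/pki/vdsm/certs/vdsmcert.pem; then
    echo "key and certificate match"
else
    echo "MISMATCH (or files unreadable): re-enroll the certificate and restart ovn-controller"
fi
```

Running this before step 2 of the workaround should report a mismatch on the affected host, and a match after Enroll Certificate has finished.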
Created attachment 1649029 [details]
2nd ENV with the same issue - engine and host journalctl logs

Added engine and host journalctl logs from a 2nd reproduction of the same issue.
Avihai, why do you think that vdsm is not responsive due to OVN? I would guess that both are failing due to an incomplete key setup, and that this is a deployment bug, not network-specific.

I do not see vdsm.log in your attached logs; it should show the failed connections from the Engine. Does vdsm-cli work locally?

On another matter: this bug should not be assigned to me. Was it set so by default?
For some reason it was assigned to you by default. The network team didn't see this issue, but I did see it on Avihai's env.
(In reply to Dan Kenigsberg from comment #2)
> Avihai, why do you think that vdsm is not responsive due to ovn? I would
> guess that both are failing due to incomplete key setup. I would guess this
> is a deployment bug, not network-specific.

Indeed, I'm not sure, but this is the only error I see: in the host's journalctl log, ovn-controller has this SSL key issue, and the service logs show the same issue, so I assume it's the root cause here.

Journal log excerpt:
Dec 22 11:32:00 storage-ge8-vdsm2.scl.lab.tlv.redhat.com ovn-controller[24448]: ovs|00075|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private key assigned

Also, when I restarted ovn-controller.service and reinstalled the host, the issue was resolved. But sure, I'm not the expert here, you or Dominic are :)

> I do not see vdsm.log in your attached log, it should show failed
> connections from Engine.

I'll add it as well. This is what you see there:

2019-12-23 11:46:32,357-0500 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.35.162.6', 36176, 0, 0) at 0x7effe86139e8> (<class 'ssl.SSLError'>:[X509: KEY_VALUES_MISMATCH] key values mismatch (_ssl.c:3563) [/usr/lib64/python3.6/asyncore.py|readwrite|110] [/usr/lib64/python3.6/asyncore.py|handle_write_event|442] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_write|74] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|handle_write|190] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|_handle_io|194] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|_set_up_socket|156]) (betterAsyncore:179)

> Does vdsm-cli work locally?

Do you mean 'vdsm-client'? If so, then yes.

> On another matter: this bug should not be assigned to me. Was it set so by
> default?
Answered by Michael B.
Created attachment 1649076 [details] vdsm log with the issue
Other inputs that may be useful about the workaround/bug:

1) I also tried to reinstall the host with the correct user/password (not an ssh key), just to see if the ssh key was the issue, but the same issue still occurred. This was done WITHOUT enroll certificate + restart ovn-controller.service.

2) Enroll certificate plays a key part in working around this issue:
Before I manually run webadmin -> Hosts -> Installation -> Enroll Certificate, I see the SSL errors in the ovn-controller service [1]. After enroll certificate + restart ovn-controller.service, the errors no longer appear [2]. Maybe this "enroll certificate" step somehow gets missed on one out of 3 hosts when we add multiple hosts via ansible?

[1] BEFORE:
[root@rose11 ~]# systemctl status ovn-controller.service
● ovn-controller.service - OVN controller daemon
   Loaded: loaded (/usr/lib/systemd/system/ovn-controller.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-01-02 04:29:41 EST; 1min 46s ago
  Process: 23900 ExecStop=/usr/share/openvswitch/scripts/ovn-ctl stop_controller (code=exited, status=0/SUCCESS)
  Process: 23917 ExecStart=/usr/share/openvswitch/scripts/ovn-ctl --no-monitor start_controller $OVN_CONTROLLER_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 23951 (ovn-controller)
    Tasks: 4 (limit: 26213)
   Memory: 5.9M
   CGroup: /system.slice/ovn-controller.service
           └─23951 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --private-key=/etc/pki/vdsm/keys/vdsmkey.pem --certificate=/etc/pki/vdsm/certs/vdsmcert.pem --ca-cert>

Jan 02 04:30:12 rose11.scl.lab.tlv.redhat.com ovn-controller[23951]: ovs|00023|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private>
Jan 02 04:30:20 rose11.scl.lab.tlv.redhat.com ovn-controller[23951]: ovs|00024|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private>
Jan 02 04:30:28 rose11.scl.lab.tlv.redhat.com ovn-controller[23951]: ovs|00025|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private>

[2] AFTER:
[root@rose11 ~]# systemctl status ovn-controller.service
● ovn-controller.service - OVN controller daemon
   Loaded: loaded (/usr/lib/systemd/system/ovn-controller.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-01-02 04:31:30 EST; 3min 58s ago
  Process: 26914 ExecStop=/usr/share/openvswitch/scripts/ovn-ctl stop_controller (code=exited, status=0/SUCCESS)
  Process: 26931 ExecStart=/usr/share/openvswitch/scripts/ovn-ctl --no-monitor start_controller $OVN_CONTROLLER_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 26963 (ovn-controller)
    Tasks: 4 (limit: 26213)
   Memory: 6.4M
   CGroup: /system.slice/ovn-controller.service
           └─26963 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --private-key=/etc/pki/vdsm/keys/vdsmkey.pem --certificate=/etc/pki/vdsm/certs/vdsmcert.pem --ca-cert>

Jan 02 04:31:30 rose11.scl.lab.tlv.redhat.com systemd[1]: Starting OVN controller daemon...
Jan 02 04:31:30 rose11.scl.lab.tlv.redhat.com ovn-ctl[26931]: Starting ovn-controller [ OK ]
Jan 02 04:31:30 rose11.scl.lab.tlv.redhat.com systemd[1]: Started OVN controller daemon.
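A quick way to confirm the "AFTER" state is to count recent mismatch errors in the journal instead of eyeballing the status output. A hedged sketch (the helper name is made up; the final line assumes journald access to the ovn-controller unit on the host):

```shell
# Count "private key does not match" errors in log lines read from stdin,
# so the check can be fed from journalctl or from a saved log file.
ssl_mismatch_count() {
    grep -c 'Private key does not match certificate public key' || true
}

# Assumed invocation on the affected host; after the workaround this
# should settle at 0 for new log entries:
journalctl -u ovn-controller.service --since "10 minutes ago" 2>/dev/null | ssl_mismatch_count
```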
Also, I noticed that the workaround above usually worked (in 4 environments), but in some environments only removing the host and re-adding it after the workaround procedure finished did the trick.
Moving to infra. I was able to reproduce the certificate issue when 2 hosts are deployed at the same time, but it's some kind of race: it happened only during the 2nd attempt; the 1st attempt was successful.
However, this seems to be a concurrency issue; it seems to work OK when you install the hosts one by one.
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Tag 'ovirt-engine-4.4.0' doesn't contain patch 'https://gerrit.ovirt.org/106357'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.4.0 For more info please contact: infra
We don't see it in the GE build anymore. Checked a run with ovirt-engine-4.4.0-0.25.master.el8ev.noarch.
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.