Bug 1787195 - [4.4.0-13] 1 out of 3 hosts non-responsive after ansible install finished - ovn-controller[20628]: ovs|00040|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private key assigned
Summary: [4.4.0-13] 1 out of 3 hosts non-responsive after ansible install finished - ovn-controller[20628]: ovs|00040|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private key assigned
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: ovirt-host-deploy-ansible
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.4.0
Assignee: Dana
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-01 09:45 UTC by Avihai
Modified: 2020-05-20 20:03 UTC
CC List: 7 users

Fixed In Version: ansible-runner-service-1.0.1-3
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 20:03:53 UTC
oVirt Team: Infra
Embargoed:
pm-rhel: ovirt-4.4+
mperina: blocker?


Attachments
journalctl engine and vdsm logs (1.10 MB, application/gzip)
2020-01-01 09:45 UTC, Avihai, no flags
2nd ENV with the same issue engine and host journalctl logs (826.25 KB, application/gzip)
2020-01-01 09:47 UTC, Avihai, no flags
vdsm log with the issue (645.41 KB, application/x-xz)
2020-01-01 15:48 UTC, Avihai, no flags


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 106357 0 master MERGED runner: Synchronize the execute action 2021-02-01 18:17:48 UTC

Description Avihai 2020-01-01 09:45:57 UTC
Created attachment 1649028
journalctl engine and vdsm logs

Description of problem:
I see this issue on multiple ENVs: it causes one of my hosts (1 out of 3) to become unresponsive in this build after clean reprovisioning (install RHEL 8.1 + install vdsm) to rhv-4.4.0-13.

The issue occurs just after host deploy (the Ansible install finished).

The ovn-controller.service status shows many "Private key does not match certificate public key" errors on the problematic host.

If I manually move the host to maintenance -> enroll certificate -> restart ovn-controller.service -> reinstall the host, this solves the issue.
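(A quick way to confirm the mismatch that ovn-controller complains about is to compare the key and certificate it is started with; the --private-key/--certificate paths below are taken from the service command line shown later in this report. This is only a diagnostic sketch using standard openssl commands and assumes an RSA key:

[root@host ~]# openssl x509 -noout -modulus -in /etc/pki/vdsm/certs/vdsmcert.pem | openssl md5
[root@host ~]# openssl rsa -noout -modulus -in /etc/pki/vdsm/keys/vdsmkey.pem | openssl md5

If the two digests differ, the key and certificate do not belong together, which is consistent with the SSL_CTX_check_private_key error.)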

From the journalctl of the host (storage-ge7-vdsm2.scl.lab.tlv.redhat.com):

Dec 23 11:46:32 storage-ge7-vdsm2.scl.lab.tlv.redhat.com vdsm[23859]: ERROR uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.35.162.6', 36176, 0, 0) at 0x7effe86139e8> (<class 'ssl.SSLError'>:[X509: KEY_VALUES_MISMATCH] key values mismatch (_ssl.c:3563) [/usr/lib64/python3.6/asyncore.py|readwrite|110] [/usr/lib64/python3.6/asyncore.py|handle_write_event|442] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_write|74] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|handle_write|190] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|_handle_io|194] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|_set_up_socket|156])
Dec 23 11:46:32 storage-ge7-vdsm2.scl.lab.tlv.redhat.com ovn-controller[20628]: ovs|00040|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private key assigned

Engine log (+7 hours timezone difference):
2019-12-23 18:46:32,207+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] EVENT_ID: VDS_ANSIBLE_INSTALL_FINISHED(561), Ansible host-deploy playbook execution has successfully finished on host host_mixed_2.
2019-12-23 18:46:32,343+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [7fc25b0f] Connecting to storage-ge7-vdsm2.scl.lab.tlv.redhat.com/10.35.82.76
2019-12-23 18:46:35,323+02 WARN  [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (EE-ManagedThreadFactory-engine-Thread-4) [365d29e1] Retry failed

2019-12-23 18:48:34,677+02 WARN  [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (EE-ManagedThreadFactory-engine-Thread-60) [365d29e1] Retry failed
2019-12-23 18:48:34,678+02 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] Host installation failed for host 'a5e32c47-498b-4d24-92a1-9a8bf332a03e', 'host_mixed_2': Network error during communication with the host
2019-12-23 18:48:34,685+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='a5e32c47-498b-4d24-92a1-9a8bf332a03e', status='NonResponsive', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2785942f
2019-12-23 18:48:34,694+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] FINISH, SetVdsStatusVDSCommand, return: , log id: 2785942f
2019-12-23 18:48:34,706+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2) [365d29e1] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_2 installation failed. Network error during communication with the host.?365d29e1?


Version-Release number of selected component (if applicable):
http://bob.eng.lab.tlv.redhat.com/builds/4.4/rhv-4.4.0-13/
Engine:
ovirt-engine-setup-4.4.0-0.13.master.el7.noarch

Host:
ovn2.11-2.11.1-8.el8fdp.x86_64
rhv-openvswitch-ovn-host-2.11-6.el8ev.noarch
vdsm-4.40.0-164.git38a19bb.el8ev.x86_64

How reproducible:
30%, happens on 1 out of 3 hosts.

Steps to Reproduce:
1. Engine machine (not HE) installed with rhv-4.4.0-13.

2. Clean reprovision (install RHEL 8.1 from scratch and install vdsm) on 3 hosts.

3. Add the 3 hosts (done via automation) to the engine, triggering host deploy.

Actual results:
The host-deploy playbook execution finishes successfully on the problematic host host_mixed_2 (storage-ge7-vdsm2.scl.lab.tlv.redhat.com) as well, but when the engine tries to connect to this host to continue the installation, it fails.

Trying to connect from the engine to the host fails with:
2019-12-23 18:46:32,343+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [7fc25b0f] Connecting to storage-ge7-vdsm2.scl.lab.tlv.redhat.com/10.35.82.76
2019-12-23 18:46:35,323+02 WARN  [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (EE-ManagedThreadFactory-engine-Thread-4) [365d29e1] Retry failed

The host goes to 'non-responsive' and stays there.

Expected results:
All 3 hosts complete host deploy and come up.

Additional info:

A) Checking another ENV with the same issue, I saw that:

1) ovn-controller.service status shows many "Private key does not match certificate public key" errors.

[root@storage-ge11-vdsm3 ~]# systemctl status ovn-controller.service 
● ovn-controller.service - OVN controller daemon
   Loaded: loaded (/usr/lib/systemd/system/ovn-controller.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-01-01 03:04:12 EST; 50min ago
  Process: 1410 ExecStop=/usr/share/openvswitch/scripts/ovn-ctl stop_controller (code=exited, status=0/SUCCESS)
  Process: 1428 ExecStart=/usr/share/openvswitch/scripts/ovn-ctl --no-monitor start_controller $OVN_CONTROLLER_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 1460 (ovn-controller)
    Tasks: 4 (limit: 17442)
   Memory: 3.5M
   CGroup: /system.slice/ovn-controller.service
           └─1460 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --private-key=/etc/pki/vdsm/keys/vdsmkey.pem --certificate=/etc/pki/vdsm/certs/vdsmcert.pem --ca-cert=>

Jan 01 03:53:17 storage-ge11-vdsm3.scl.lab.tlv.redhat.com ovn-controller[1460]: ovs|00389|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key>

B) Workaround found:
1) Move the host to maintenance (if you can't, restarting the engine + vdsm sometimes helps)
2) In webadmin -> Hosts -> choose host -> Installation -> Enroll Certificate -> wait until it finishes
3) Connect to the host via ssh -> systemctl restart ovn-controller.service (a quick check for this step is sketched below)
4) In webadmin -> Hosts -> choose host -> Installation -> Reinstall -> wait until it finishes
After that the host is up and running :)
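(For step 3, a quick way to confirm the workaround took effect, sketched with the same unit and paths used elsewhere in this report; the -n 50 window is arbitrary:

[root@host ~]# systemctl restart ovn-controller.service
[root@host ~]# journalctl -u ovn-controller.service -n 50 | grep stream_ssl

No stream_ssl output after the restart means the "Private key does not match certificate public key" errors have stopped.)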

Comment 1 Avihai 2020-01-01 09:47:49 UTC
Created attachment 1649029
2nd ENV with the same issue engine and host journalctl logs

Added engine and host journalctl logs from a 2nd reproduction of the same issue.

Comment 2 Dan Kenigsberg 2020-01-01 10:07:10 UTC
Avihai, why do you think that vdsm is not responsive due to ovn? I would guess that both are failing due to incomplete key setup. I would guess this is a deployment bug, not network-specific.

I do not see vdsm.log in your attached logs; it should show failed connections from the Engine.
Does vdsm-cli work locally?

On another matter: this bug should not be assigned to me. Was it set so by default?

Comment 3 Michael Burman 2020-01-01 10:10:05 UTC
For some reason you were assigned by default.
The network team didn't see this issue, but I did see it on Avihai's env.

Comment 4 Avihai 2020-01-01 15:47:24 UTC
(In reply to Dan Kenigsberg from comment #2)
> Avihai, why do you think that vdsm is not responsive due to ovn? I would
> guess that both are failing due to incomplete key setup. I would guess this
> is a deployment bug, not network-specific.

Indeed, I'm not sure, but this is the only issue/error I see: in the host's journalctl log ovn-controller has this SSL key issue, and in the service logs I see the same error, so I assume it's the root cause here.

Journal log excerpt:
Dec 22 11:32:00 storage-ge8-vdsm2.scl.lab.tlv.redhat.com ovn-controller[24448]: ovs|00075|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private key assigned


Also, when I restarted ovn-controller.service and reinstalled the host, the issue was resolved.
But sure, I'm not the expert here, you or Dominic are :)


> I do not see vdsm.log in your attached log, it should show failed
> connections from Engine.
I'll add it as well.

This is what you see there:
2019-12-23 11:46:32,357-0500 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.35.162.6', 36176, 0, 0) at 0x7effe86139e8> (<class 'ssl.SSLError'>:[X509: KEY_VALUES_MISMATCH] key values mismatch (_ssl.c:3563) [/usr/lib64/python3.6/asyncore.py|readwrite|110] [/usr/lib64/python3.6/asyncore.py|handle_write_event|442] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_write|74] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|handle_write|190] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|_handle_io|194] [/usr/lib/python3.6/site-packages/vdsm/sslutils.py|_set_up_socket|156]) (betterAsyncore:179)


> Does vdsm-cli work locally?
Do you mean 'vdsm-client'? If so, then yes.
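(For reference, a minimal local liveness check looks something like the line below; vdsm-client and the Host getCapabilities verb are assumed to be available as in a standard vdsm installation:

[root@host ~]# vdsm-client Host getCapabilities > /dev/null && echo "vdsm responds locally"

This only proves the local API path works; the engine-side connection still fails on the SSL handshake as shown above.)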

> On another matter: this bug should not be assigned to me. Was it set so by
> default?
Answered by Michael B.

Comment 5 Avihai 2020-01-01 15:48:25 UTC
Created attachment 1649076
vdsm log with the issue

Comment 6 Avihai 2020-01-02 09:40:36 UTC
Other inputs that may be useful about the workaround/bug:

1) I also tried to reinstall the host with the correct user/password (not an SSH key), just to see if the SSH key was the issue, but the same issue still occurred.
This was done WITHOUT enroll certificate + restart of ovn-controller.service.

2) Enroll certificate plays a key part in working around this issue:


Before I manually run webadmin -> Hosts -> Installation -> Enroll Certificate, I see the SSL errors in the ovn-controller service [1].

After enroll certificate + restart of ovn-controller.service, the errors no longer appear.

Maybe this "enroll certificate" step somehow gets missed on one out of 3 hosts when we add multiple hosts via Ansible?
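(One way to check that theory on a broken host, sketched with the standard vdsm PKI path from the service output above, is to look at who issued the certificate that vdsm and ovn-controller are using and when it was created:

[root@host ~]# openssl x509 -noout -subject -issuer -dates -in /etc/pki/vdsm/certs/vdsmcert.pem

A certificate not issued by the engine CA, or one whose validity start does not line up with the host-deploy run, would point to the enrollment step being skipped or raced; this is only an assumption to verify, not a confirmed diagnosis.)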


[1] BEFORE:
[root@rose11 ~]# systemctl status ovn-controller.service 
● ovn-controller.service - OVN controller daemon
   Loaded: loaded (/usr/lib/systemd/system/ovn-controller.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-01-02 04:29:41 EST; 1min 46s ago
  Process: 23900 ExecStop=/usr/share/openvswitch/scripts/ovn-ctl stop_controller (code=exited, status=0/SUCCESS)
  Process: 23917 ExecStart=/usr/share/openvswitch/scripts/ovn-ctl --no-monitor start_controller $OVN_CONTROLLER_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 23951 (ovn-controller)
    Tasks: 4 (limit: 26213)
   Memory: 5.9M
   CGroup: /system.slice/ovn-controller.service
           └─23951 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --private-key=/etc/pki/vdsm/keys/vdsmkey.pem --certificate=/etc/pki/vdsm/certs/vdsmcert.pem --ca-cert>

Jan 02 04:30:12 rose11.scl.lab.tlv.redhat.com ovn-controller[23951]: ovs|00023|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private>
Jan 02 04:30:20 rose11.scl.lab.tlv.redhat.com ovn-controller[23951]: ovs|00024|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private>
Jan 02 04:30:28 rose11.scl.lab.tlv.redhat.com ovn-controller[23951]: ovs|00025|stream_ssl|ERR|Private key does not match certificate public key: error:140A80BE:SSL routines:SSL_CTX_check_private_key:no private>

[2] AFTER 
Jan 02 04:31:30 rose11.scl.lab.tlv.redhat.com systemd[1]: Started OVN controller daemon.
[root@rose11 ~]# systemctl status ovn-controller.service 
● ovn-controller.service - OVN controller daemon
   Loaded: loaded (/usr/lib/systemd/system/ovn-controller.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-01-02 04:31:30 EST; 3min 58s ago
  Process: 26914 ExecStop=/usr/share/openvswitch/scripts/ovn-ctl stop_controller (code=exited, status=0/SUCCESS)
  Process: 26931 ExecStart=/usr/share/openvswitch/scripts/ovn-ctl --no-monitor start_controller $OVN_CONTROLLER_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 26963 (ovn-controller)
    Tasks: 4 (limit: 26213)
   Memory: 6.4M
   CGroup: /system.slice/ovn-controller.service
           └─26963 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --private-key=/etc/pki/vdsm/keys/vdsmkey.pem --certificate=/etc/pki/vdsm/certs/vdsmcert.pem --ca-cert>

Jan 02 04:31:30 rose11.scl.lab.tlv.redhat.com systemd[1]: Starting OVN controller daemon...
Jan 02 04:31:30 rose11.scl.lab.tlv.redhat.com ovn-ctl[26931]: Starting ovn-controller [  OK  ]
Jan 02 04:31:30 rose11.scl.lab.tlv.redhat.com systemd[1]: Started OVN controller daemon.

Comment 7 Avihai 2020-01-02 09:44:45 UTC
Also, I noticed that the workaround mentioned usually worked (in 4 ENVs), but for some ENVs only removing the host and re-adding it after the workaround procedure finished did the trick.

Comment 9 Martin Perina 2020-01-06 10:24:50 UTC
Moving to infra. I was able to reproduce the certificate issue when 2 hosts are deployed at the same time, but it's some kind of race, because it happened only during the 2nd attempt; the 1st attempt was successful.

Comment 10 Michal Skrivanek 2020-01-06 11:51:39 UTC
However, this seems to be a concurrency issue. It seems to work OK when you install hosts one by one.

Comment 11 RHV bug bot 2020-01-24 19:50:16 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Tag 'ovirt-engine-4.4.0' doesn't contain patch 'https://gerrit.ovirt.org/106357']
gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.4.0

For more info please contact: infra

Comment 14 Lucie Leistnerova 2020-03-17 15:31:37 UTC
We don't see it in the GE build anymore.

Checked run with ovirt-engine-4.4.0-0.25.master.el8ev.noarch

Comment 16 Sandro Bonazzola 2020-05-20 20:03:53 UTC
This bug is included in the oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

