2115035 – After a Controller reboot ovn_metadata_agent goes into unhealthy state

Summary: After a Controller reboot ovn_metadata_agent goes into unhealthy state

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openvswitch
Sub Component:
Version:	17.0 (Wallaby)
Hardware:	Unspecified
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	ga
Target Release:	17.0
Assignee:	Miro Tomaska
QA Contact:	Eran Kuris
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	2114617
Depends On:
Blocks:	2114617

Reported:	2022-08-03 18:20 UTC by Julia Marciano
Modified:	2022-10-10 16:05 UTC
CC List:	13 users
Fixed In Version:	openvswitch2.17-2.17.0-32.1
Doc Type:	Bug Fix
Doc Text:	This update fixes a bug that caused intermittent SSL connection problems between services such as ovn-metadata-agent and the OVN southbound database.
Clone Of:
Environment:
Last Closed:	2022-09-21 12:24:42 UTC
Target Upstream Version:
Embargoed:
Flags:	mtomaska: needinfo- mtomaska: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-18010	0	None	None	None	2022-08-03 18:34:43 UTC
Red Hat Product Errata	RHEA-2022:6543	0	None	None	None	2022-09-21 12:25:14 UTC

Description Julia Marciano 2022-08-03 18:20:51 UTC

Description of problem:
In a TLS-Everywhere environment, after rebooting the controller node(s), an SSH connection to a cirros instance created after the reboot was refused:

[stack@undercloud-0 ~]$ ssh cirros.0.213
sss_ssh_knownhostsproxy: connect to host 10.0.0.213 port 22: Connection refused
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

The instance was hosted on compute-0. The ovn_metadata agent on this node appeared as unhealthy:
[root@compute-0 ~]# podman ps |grep ovn_metadata
d1a5e59b6515  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-neutron-metadata-agent-ovn:17.0_20220721.1  kolla_start           4 hours ago  Up 4 hours ago (unhealthy)              ovn_metadata_agent
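
The health state shown in the STATUS column above can also be extracted programmatically. The following is only a minimal sketch: container_health is a hypothetical helper, the shortened sample line is illustrative, and it assumes the default 'podman ps' column layout in which the container name is the last field.

```python
import re

# Health marker as it appears in the STATUS column of `podman ps`.
HEALTH_RE = re.compile(r"\((healthy|unhealthy|starting)\)")


def container_health(ps_line):
    """Return (container_name, health) for one line of `podman ps` output.

    health is None when the line carries no health marker.
    Assumes the container name is the last whitespace-separated field.
    """
    fields = ps_line.split()
    if not fields:
        return None
    match = HEALTH_RE.search(ps_line)
    health = match.group(1) if match else None
    return fields[-1], health


# Shortened, illustrative version of the line reported above.
line = (
    "d1a5e59b6515  example.local:8787/image:tag  kolla_start  "
    "4 hours ago  Up 4 hours ago (unhealthy)  ovn_metadata_agent"
)
result = container_health(line)
print(result)
```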

[root@compute-1 ~]# less /var/log/containers/neutron/ovn-metadata-agent.log 
…
2022-08-02 13:22:04.837 25475 ERROR ovsdbapp.backend.ovs_idl.connection   File "/usr/lib64/python3.9/ssl.py", line 1170, in send
2022-08-02 13:22:04.837 25475 ERROR ovsdbapp.backend.ovs_idl.connection     raise ValueError(
2022-08-02 13:22:04.837 25475 ERROR ovsdbapp.backend.ovs_idl.connection ValueError: non-zero flags not allowed in calls to send() on <class 'eventlet.green.ssl.GreenSSLSocket'>



Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Deploy a TLS-e HA overcloud.
2. Reboot the controller that holds the overcloud main VIP (it can be found in the output of the 'pcs status' command on a controller node).
3. Boot a VM.
4. Try to ssh to the VM.

Actual results:
Connection to the VM is refused.
ovn_metadata_agent container is in unhealthy state.

Expected results:
The VM is reachable via ssh.
All containers are healthy.

Additional info:

Comment 4 Miro Tomaska 2022-08-05 18:29:38 UTC

The root cause of this issue appears to be OVS switching from pyOpenSSL to the Python standard library ssl module [1].
For SSL sockets, Python's send() [2] does not allow a non-zero flags argument and raises ValueError, whereas the pyOpenSSL send function [3] accepted the argument and ignored it.

[1] https://github.com/openvswitch/ovs/commit/68543dd523bd00f53fa7b91777b962ccb22ce679 
[2] https://github.com/python/cpython/blob/main/Lib/ssl.py#L1141-L1156
[3] https://github.com/pyca/pyopenssl/blob/38f9b4e524ac6479d57021bba2270df84d85b672/src/OpenSSL/SSL.py#L1844
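
The behaviour described in this comment can be reproduced outside OVS with a short script. This is a minimal sketch rather than the OVS code path: send_flags_error is a hypothetical helper, the sockets come from socket.socketpair() on a POSIX system, and no TLS handshake is performed because the flags check happens before any data is written.

```python
import socket
import ssl


def send_flags_error(sock, flags):
    """Call sock.send(b"", flags); return the ValueError message, or None."""
    try:
        sock.send(b"", flags)
    except ValueError as exc:
        return str(exc)
    return None


# One connected pair stays plain, the other gets one end wrapped in TLS.
plain_a, plain_b = socket.socketpair()
tls_a, tls_b = socket.socketpair()

# Client-side context with verification disabled: this demo never
# completes a handshake, it only needs an SSLSocket object.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
tls_sock = ctx.wrap_socket(tls_a, do_handshake_on_connect=False)

# A plain socket accepts a non-zero flag; the SSL socket rejects it.
plain_error = send_flags_error(plain_a, socket.MSG_DONTWAIT)
tls_error = send_flags_error(tls_sock, socket.MSG_DONTWAIT)
print("plain socket:", plain_error)
print("SSL socket:  ", tls_error)

for s in (plain_a, plain_b, tls_sock, tls_b):
    s.close()
```

The plain socket reports no error, while the wrapped socket raises the same "non-zero flags not allowed in calls to send()" ValueError seen in the agent log, so a caller that passes flags has to pass 0 when the socket is an SSL socket.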

Comment 7 Miro Tomaska 2022-08-08 13:17:25 UTC

Patch is posted upstream for review.
https://github.com/ovsrobot/ovs/commit/f09a55946cc83583c2e93be632e50f51ea830322

Comment 9 spower 2022-08-09 16:17:52 UTC

trac team deemed this a GA blocker but not a blocker for beta

Comment 16 Julia Marciano 2022-08-23 00:16:05 UTC

Verified:

[stack@undercloud-0 ~]$ cat core_puddle_version 
RHOS-17.0-RHEL-9-20220816.n.2
[stack@undercloud-0 ~]$ 

[root@controller-0 ~]# rpm -qa|grep openvsw
openvswitch2.17-2.17.0-32.1.el9fdp.x86_64

After a hard reboot (echo b > /proc/sysrq-trigger) of controller-2, the ovn_metadata_agent containers are healthy on both compute nodes:
[heat-admin@compute-0 ~]$ sudo -i
[root@compute-0 ~]# podman ps|grep meta
00534cbdb30e  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-neutron-metadata-agent-ovn:17.0_20220816.1  kolla_start           23 hours ago  Up 23 hours ago (healthy)              ovn_metadata_agent

[root@compute-1 ~]# podman ps|grep metadata
1a553fa027e7  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-neutron-metadata-agent-ovn:17.0_20220816.1  kolla_start           23 hours ago  Up 23 hours ago (healthy)              ovn_metadata_agent
[root@compute-1 ~]# 

ssh connection to the newly created instance succeeded:
[Tue Aug 23 12:05:49 AM UTC 2022] Trying to ssh to 10.0.0.161
cirros
Instance instance_d1f5085f0e is reachable via 10.0.0.161

Verified by automated tests as well:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2017.0/view/PidOne/job/DFG-pidone-sanity-17.0_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-ansible-sts-sanity-tls-everywhere/75/artifact/.sh/ansible_sts-ha-tests.log

Comment 17 Miro Tomaska 2022-08-23 14:07:50 UTC

*** Bug 2114617 has been marked as a duplicate of this bug. ***

Comment 23 errata-xmlrpc 2022-09-21 12:24:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543
