1867992 – [OVN] shared gateway does not work with RHEL worker nodes

Summary: [OVN] shared gateway does not work with RHEL worker nodes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Jacob Tanenbaum
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:	1871935
Blocks:

Reported:	2020-08-11 10:56 UTC by zhaozhanqi
Modified:	2020-10-27 16:27 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:27:31 UTC
Target Upstream Version:
Embargoed:

Attachments
ovn-node-logs (9.69 KB, text/plain) 2020-08-11 12:37 UTC, zhaozhanqi	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2022	0	None	closed	Bug 1867992: Support RHEL7 workers by removing 'jq' commands from ovs setup	2020-12-17 17:51:55 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:27:50 UTC

Description zhaozhanqi 2020-08-11 10:56:33 UTC

Description of problem:

When scaling up a RHEL node in an OVN cluster, the ovn-node pod fails with the following error:

I0810 15:28:15.842012 21101 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 30935
I0810 15:28:15.842969 21101 ovs.go:157] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0810 15:28:15.848787 21101 ovs.go:160] exec(5): stdout: ""
I0810 15:28:15.848814 21101 ovs.go:161] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0810 15:28:15.848821 21101 ovs.go:163] exec(5): err: exit status 1
I0810 15:28:15.848833 21101 ovs.go:157] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0810 15:28:15.854481 21101 ovs.go:160] exec(6): stdout: ""
I0810 15:28:15.854508 21101 ovs.go:161] exec(6): stderr: ""
I0810 15:28:15.854512 21101 ovs.go:163] exec(6): err: exit status 2
F0810 15:28:15.854613 21101 ovnkube.go:129] failed to convert br-ex to OVS bridge: Link not found

Version-Release number of selected component (if applicable):
4.6

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:



Expected results:


Additional info:

Comment 1 zhaozhanqi 2020-08-11 12:37:49 UTC

Created attachment 1711078
ovn-node-logs

Comment 3 Tim Rozet 2020-08-12 15:42:52 UTC

Looking at your setup, your new nodes failed during ovs-configuration service:
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Starting Configures OVS with proper host networking configuration...
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + iface=
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + counter=0
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + '[' 0 -lt 12 ']'
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: ++ ip -j route show default
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: ovs-configuration.service: main process exited, code=exited, status=127/n/a
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: ++ jq -r '.[0].dev'
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: /usr/local/bin/configure-ovs.sh: line 14: jq: command not found
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: Option "-j" is unknown, try "ip -help".
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Unit ovs-configuration.service entered failed state.
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + iface=
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: ovs-configuration.service failed.
sh-4.2# which jq
which: no jq in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

jq is somehow missing on this node. Any idea how that could happen? Are you sure the new nodes have the right OS image?
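The failure above is a tool missing at service start. A minimal preflight sketch that would surface this before the configuration script runs is shown here; the helper name `missing_tools` is hypothetical and is not part of configure-ovs.sh.

```shell
#!/bin/sh
# Hypothetical helper (not part of configure-ovs.sh): print each command
# name given as an argument that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage on a node (not run here):
#   missing=$(missing_tools jq ip ovs-vsctl)
#   [ -z "$missing" ] || echo "missing tools: $missing" >&2
```

A check like this only covers the missing binary. The journal also shows `Option "-j" is unknown`, so the `ip` on this RHEL 7 node lacks JSON output as well, and installing jq alone would not make line 14 of the script work.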

Comment 4 Tim Rozet 2020-08-12 15:43:50 UTC

Looks like your new nodes have the wrong OS image:
zzhaoovn46-bpqgh-master-2                  Ready      master   28h   v1.19.0-rc.2+5241b27-dirty   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 46.82.202008102140-0 (Ootpa)   4.18.0-211.el8.x86_64         cri-o://1.19.0-71.rhaos4.6.git19455e9.el8-dev
zzhaoovn46-bpqgh-rhel-0                    NotReady   worker   27h   v1.19.0-rc.2+9932f63-dirty   10.0.1.6      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.18.2.el7.x86_64   cri-o://1.19.0-71.rhaos4.6.git19455e9.el7-dev

Comment 5 Tim Rozet 2020-08-12 18:43:31 UTC

I think we should be using RHEL 8.2 and not 7.8. Can someone confirm? If so, for RHEL 8.2 we need to answer the following questions:

1. Does RHEL 8.2 have jq by default? If not, that's a problem.
2. There were NetworkManager-specific fixes that went into a hotfix build for RHCOS 4.6 and are supposed to land in a later RHEL 8.2 z-stream, so without them this also won't work:
https://bugzilla.redhat.com/show_bug.cgi?id=1857775
https://bugzilla.redhat.com/show_bug.cgi?id=1820052

Comment 6 zhaozhanqi 2020-08-13 01:29:58 UTC

RHEL 7.8 has always been supported in 4.x versions (4.3/4.4/4.5), and there was no issue before.

Comment 7 Dan Winship 2020-08-14 12:58:02 UTC

I'm not sure exactly how non-RHCOS RHEL nodes work, but it sounds like we need to just make sure jq gets installed on them. I assume there must already be infrastructure somewhere (MCO?) for ensuring that the RPMs we need are available on all nodes...

Comment 8 Tim Rozet 2020-08-14 21:37:08 UTC

(In reply to Dan Winship from comment #7)
> I'm not sure exactly how non-RHCOS RHEL nodes work, but it sounds like we
> need to just make sure jq gets installed on them. I assume there must
> already be infrastructure somewhere (MCO?) for ensuring that the RPMs we
> need are available on all nodes...

It's BYO RHEL, so I think the user would have to include the package. I'm not sure if MCO can install the package. An alternative is to just stop using jq in the script. Additionally, we need fixes for NM OVS backported from 8.2 into 7.9z:
https://bugzilla.redhat.com/show_bug.cgi?id=1852106
https://bugzilla.redhat.com/show_bug.cgi?id=1820052
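
As a rough illustration of what dropping jq could look like (a sketch, not necessarily the change that was merged), the default-route device can be read from the plain-text output of `ip route show default` with awk, which avoids both jq and `ip -j`. The function name `default_route_dev` is hypothetical, and the sketch assumes the usual `default via <gateway> dev <iface> ...` line format.

```shell
#!/bin/sh
# Hypothetical helper: read plain `ip route show default` output on stdin
# and print the device name of the first default route found.
default_route_dev() {
  awk '$1 == "default" {
         # Scan the fields for the "dev" keyword; the next field is the device.
         for (i = 2; i < NF; i++)
           if ($i == "dev") { print $(i + 1); exit }
       }'
}

# Usage on a live host (not run here):
#   iface=$(ip route show default | default_route_dev)
```

Scanning for the `dev` keyword rather than taking a fixed column keeps the sketch working for routes without a `via` gateway. If no default route exists the function prints nothing, so a caller would still need a retry loop like the counter visible in the service log.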

Comment 9 Vadim Rutkovsky 2020-08-17 18:32:33 UTC

cc'ing Russell, as `jq` needs to be installed on hosts using openshift-ansible

Comment 10 Russell Teague 2020-08-17 18:57:22 UTC

Support packages are installed on RHEL workers based on this list:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/defaults/main.yml#L20

If jq is required, it would need to be added to that list.  `jq` has not been a requirement for any components previously.

Comment 11 Tim Rozet 2020-08-17 20:33:15 UTC

We can remove the use of jq; I was going to hold off until we can verify whether we can get backports for the NM OVS bugs.

Comment 17 errata-xmlrpc 2020-10-27 16:27:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
