Bug 1867992 - [OVN] shared gateway does not work with RHEL worker nodes
Summary: [OVN] shared gateway does not work with RHEL worker nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jacob Tanenbaum
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1871935
Blocks:
 
Reported: 2020-08-11 10:56 UTC by zhaozhanqi
Modified: 2020-10-27 16:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:27:31 UTC
Target Upstream Version:
Embargoed:


Attachments
ovn-node-logs (9.69 KB, text/plain)
2020-08-11 12:37 UTC, zhaozhanqi


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2022 | 0 | None | closed | Bug 1867992: Support RHEL7 workers by removing 'jq' commands from ovs setup | 2020-12-17 17:51:55 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:27:50 UTC

Description zhaozhanqi 2020-08-11 10:56:33 UTC
Description of problem:

When scaling up a RHEL node in an OVN cluster, the ovn-node pod fails with the following error:

I0810 15:28:15.842012 21101 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 30935
I0810 15:28:15.842969 21101 ovs.go:157] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0810 15:28:15.848787 21101 ovs.go:160] exec(5): stdout: ""
I0810 15:28:15.848814 21101 ovs.go:161] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0810 15:28:15.848821 21101 ovs.go:163] exec(5): err: exit status 1
I0810 15:28:15.848833 21101 ovs.go:157] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0810 15:28:15.854481 21101 ovs.go:160] exec(6): stdout: ""
I0810 15:28:15.854508 21101 ovs.go:161] exec(6): stderr: ""
I0810 15:28:15.854512 21101 ovs.go:163] exec(6): err: exit status 2
F0810 15:28:15.854613 21101 ovnkube.go:129] failed to convert br-ex to OVS bridge: Link not found
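
For readers of the log above: `ovs-vsctl br-exists` signals a missing bridge through its exit status (0 when the bridge exists, 2 when it does not), which is why exec(6) shows empty stdout and stderr alongside "exit status 2". A minimal sketch of decoding that status follows; the helper name `describe_br_exists_status` is hypothetical and is not part of ovnkube or OVS.

```shell
#!/bin/sh
# Hypothetical helper (an illustration only): maps the exit status of
# `ovs-vsctl br-exists <bridge>` to a readable message.
describe_br_exists_status() {
    case "$1" in
        0) echo "bridge exists" ;;
        2) echo "bridge does not exist" ;;
        # Any other status is treated here as a failure of the command itself.
        *) echo "ovs-vsctl failed (status $1)" ;;
    esac
}

# Intended use on a node:
#   ovs-vsctl --timeout=15 br-exists br-ex; describe_br_exists_status $?
```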

Version-Release number of selected component (if applicable):
4.6

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:



Expected results:


Additional info:

Comment 1 zhaozhanqi 2020-08-11 12:37:49 UTC
Created attachment 1711078
ovn-node-logs

Comment 3 Tim Rozet 2020-08-12 15:42:52 UTC
Looking at your setup, your new nodes failed during ovs-configuration service:
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Starting Configures OVS with proper host networking configuration...
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + iface=
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + counter=0
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + '[' 0 -lt 12 ']'
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: ++ ip -j route show default
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: ovs-configuration.service: main process exited, code=exited, status=127/n/a
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: ++ jq -r '.[0].dev'
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: /usr/local/bin/configure-ovs.sh: line 14: jq: command not found
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: Option "-j" is unknown, try "ip -help".
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Unit ovs-configuration.service entered failed state.
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + iface=
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: ovs-configuration.service failed.
sh-4.2# which jq
which: no jq in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

You are somehow missing jq on this node. Any idea how that would be possible? Are you sure the new nodes have the right OS image?
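
The `which jq` check above can be generalized into a small preflight test. The following is only a sketch: the function name `missing_commands` is hypothetical and nothing like it is known to exist in configure-ovs.sh. Note that it only tests whether a command resolves on PATH; it would not catch the second failure in the journal, where `ip` is present but does not understand `-j`.

```shell
#!/bin/sh
# Hypothetical preflight helper (not part of configure-ovs.sh): prints each
# named command that cannot be found on PATH, one per line.
missing_commands() {
    for cmd in "$@"; do
        # command -v is the POSIX way to test whether a command resolves.
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Example: report which of the tools used by the failing script are absent.
missing=$(missing_commands ip jq)
if [ -n "$missing" ]; then
    echo "missing required commands: $missing" >&2
fi
```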

Comment 4 Tim Rozet 2020-08-12 15:43:50 UTC
Looks like your new nodes have the wrong OS image:
zzhaoovn46-bpqgh-master-2                  Ready      master   28h   v1.19.0-rc.2+5241b27-dirty   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 46.82.202008102140-0 (Ootpa)   4.18.0-211.el8.x86_64         cri-o://1.19.0-71.rhaos4.6.git19455e9.el8-dev
zzhaoovn46-bpqgh-rhel-0                    NotReady   worker   27h   v1.19.0-rc.2+9932f63-dirty   10.0.1.6      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.18.2.el7.x86_64   cri-o://1.19.0-71.rhaos4.6.git19455e9.el7-dev

Comment 5 Tim Rozet 2020-08-12 18:43:31 UTC
I think we should be using RHEL 8.2 and not 7.8. Can someone confirm? If so, for RHEL 8.2 we need to answer the following questions:

1. Does RHEL 8.2 have jq by default? If not, that's a problem.
2. There were NetworkManager-specific fixes that went into a hotfix build for RHCOS 4.6 and are supposed to land in a different RHEL 8.2 z-stream later, so without those this also won't work:
https://bugzilla.redhat.com/show_bug.cgi?id=1857775
https://bugzilla.redhat.com/show_bug.cgi?id=1820052

Comment 6 zhaozhanqi 2020-08-13 01:29:58 UTC
RHEL 7.8 has always been supported in 4.x versions (4.3/4.4/4.5), and there were no issues before.

Comment 7 Dan Winship 2020-08-14 12:58:02 UTC
I'm not sure exactly how non-RHCOS RHEL nodes work, but it sounds like we need to just make sure jq gets installed on them. I assume there must already be infrastructure somewhere (MCO?) for ensuring that the RPMs we need are available on all nodes...

Comment 8 Tim Rozet 2020-08-14 21:37:08 UTC
(In reply to Dan Winship from comment #7)
> I'm not sure exactly how non-RHCOS RHEL nodes work, but it sounds like we
> need to just make sure jq gets installed on them. I assume there must
> already be infrastructure somewhere (MCO?) for ensuring that the RPMs we
> need are available on all nodes...

It's BYO RHEL, so I think the user would have to include the package. I'm not sure if MCO can install the package. Alternatively, we could just remove the use of jq from the script. Additionally, we need fixes backported for NM OVS from 8.2 into 7.9z:
https://bugzilla.redhat.com/show_bug.cgi?id=1852106
https://bugzilla.redhat.com/show_bug.cgi?id=1820052
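
As an illustration of the jq-free alternative mentioned in this comment: the journal in comment 3 shows the script running `ip -j route show default` piped into `jq -r '.[0].dev'`, and on RHEL 7.8 both halves fail (jq is not installed and `ip` rejects `-j`). A minimal sketch that parses the plain-text `ip route` output instead is shown below. The function name `default_iface_from_routes` is hypothetical, and this is not necessarily how the eventual fix was written.

```shell
#!/bin/sh
# Hypothetical helper: reads `ip route show default` text on stdin and prints
# the device of the first default route (the token following "dev").
default_iface_from_routes() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "dev") { print $(i + 1); exit }
        }
    }'
}

# Intended use on a node (plain-text output, so neither `ip -j` nor jq is needed):
#   iface=$(ip route show default | default_iface_from_routes)
```

If no default route is present, the function prints nothing, so a caller can keep a retry loop that tests for an empty `iface`.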

Comment 9 Vadim Rutkovsky 2020-08-17 18:32:33 UTC
cc'ing Russell as `jq` needs to be installed on hosts using openshift-ansible

Comment 10 Russell Teague 2020-08-17 18:57:22 UTC
Support packages are installed on RHEL workers based on this list:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/defaults/main.yml#L20

If jq is required, it would need to be added to that list.  `jq` has not been a requirement for any components previously.

Comment 11 Tim Rozet 2020-08-17 20:33:15 UTC
We can remove the use of jq; I was going to hold off until we can verify whether we can get backports for the NM OVS bugs.

Comment 17 errata-xmlrpc 2020-10-27 16:27:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

