Bug 1867992

Summary: [OVN] shared gateway does not work with RHEL worker nodes
Product: OpenShift Container Platform
Reporter: zhaozhanqi <zzhao>
Component: Networking
Assignee: Jacob Tanenbaum <jtanenba>
Networking sub component: ovn-kubernetes
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: anbhat, anusaxen, danw, huirwang, jtanenba, ricarril, rteague, trozet, vrutkovs, yanyang
Version: 4.6
Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:27:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1871935    
Bug Blocks:    
Attachments:
Description Flags
ovn-node-logs none

Description zhaozhanqi 2020-08-11 10:56:33 UTC
Description of problem:

When scaling up a RHEL worker node in an OVN cluster, the ovn-node pod fails with the following error:

I0810 15:28:15.842012 21101 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 30935
I0810 15:28:15.842969 21101 ovs.go:157] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0810 15:28:15.848787 21101 ovs.go:160] exec(5): stdout: ""
I0810 15:28:15.848814 21101 ovs.go:161] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0810 15:28:15.848821 21101 ovs.go:163] exec(5): err: exit status 1
I0810 15:28:15.848833 21101 ovs.go:157] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0810 15:28:15.854481 21101 ovs.go:160] exec(6): stdout: ""
I0810 15:28:15.854508 21101 ovs.go:161] exec(6): stderr: ""
I0810 15:28:15.854512 21101 ovs.go:163] exec(6): err: exit status 2
F0810 15:28:15.854613 21101 ovnkube.go:129] failed to convert br-ex to OVS bridge: Link not found
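
A note on reading the log above: per the ovs-vsctl manual, `br-exists` exits with status 0 when the bridge exists and 2 when it does not, so the `exit status 2` in exec(6) indicates that br-ex was never created on the node. A minimal sketch of interpreting that status follows; the function name `explain_br_exists` is illustrative only and is not part of ovn-kubernetes.

```shell
#!/bin/sh
# Hypothetical helper: translate an `ovs-vsctl br-exists` exit status into text.
explain_br_exists() {
    case "$1" in
        0) echo "bridge exists" ;;
        2) echo "bridge does not exist" ;;
        *) echo "ovs-vsctl failed (status $1)" ;;
    esac
}

# Possible use on a node (requires Open vSwitch to be installed):
#   ovs-vsctl --timeout=15 br-exists br-ex; explain_br_exists "$?"
```

Any status other than 0 or 2 is treated here as a failure of ovs-vsctl itself rather than an answer about the bridge.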

Version-Release number of selected component (if applicable):
4.6

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:



Expected results:


Additional info:

Comment 1 zhaozhanqi 2020-08-11 12:37:49 UTC
Created attachment 1711078 [details]
ovn-node-logs

Comment 3 Tim Rozet 2020-08-12 15:42:52 UTC
Looking at your setup, the ovs-configuration service failed on your new nodes:
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Starting Configures OVS with proper host networking configuration...
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + iface=
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + counter=0
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + '[' 0 -lt 12 ']'
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: ++ ip -j route show default
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: ovs-configuration.service: main process exited, code=exited, status=127/n/a
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: ++ jq -r '.[0].dev'
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: /usr/local/bin/configure-ovs.sh: line 14: jq: command not found
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: Option "-j" is unknown, try "ip -help".
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Unit ovs-configuration.service entered failed state.
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + iface=
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: ovs-configuration.service failed.
sh-4.2# which jq
which: no jq in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

You are somehow missing jq on this node. Any idea how that would be possible? Are you sure the new nodes have the right OS image?
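
A quick way to check a node for the commands the script depends on could look like the sketch below. The helper name `missing_commands` is hypothetical. Note that a presence check alone would not catch the second failure in the log, where `ip` exists but does not understand `-j`.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: print each named command that is not on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example for the script in the log above, which calls both `ip` and `jq`:
#   missing_commands ip jq
```

Empty output means every listed command was found.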

Comment 4 Tim Rozet 2020-08-12 15:43:50 UTC
Looks like your new nodes have the wrong OS image:
zzhaoovn46-bpqgh-master-2                  Ready      master   28h   v1.19.0-rc.2+5241b27-dirty   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 46.82.202008102140-0 (Ootpa)   4.18.0-211.el8.x86_64         cri-o://1.19.0-71.rhaos4.6.git19455e9.el8-dev
zzhaoovn46-bpqgh-rhel-0                    NotReady   worker   27h   v1.19.0-rc.2+9932f63-dirty   10.0.1.6      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.18.2.el7.x86_64   cri-o://1.19.0-71.rhaos4.6.git19455e9.el7-dev
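
To spot such nodes without reading the wide listing by eye, something like the following sketch may help. It assumes tab-separated "name, OS image" input, and the function name `non_rhcos_nodes` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter: read "NAME<TAB>OS IMAGE" lines on stdin and print the
# names of nodes whose OS image is not RHEL CoreOS.
non_rhcos_nodes() {
    awk -F '\t' '$2 !~ /CoreOS/ { print $1 }'
}

# Possible use against a live cluster (one tab-separated line per node):
#   oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' | non_rhcos_nodes
```

Matching on the string "CoreOS" is a simplification based on the two OS image values shown above.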

Comment 5 Tim Rozet 2020-08-12 18:43:31 UTC
I think we should be using RHEL 8.2 and not 7.8. Can someone confirm? If so, for RHEL 8.2 we need to answer the following questions:

1. Does RHEL 8.2 have jq by default? If not, that's a problem.
2. There were NetworkManager-specific fixes that went into a hotfix build for RHCOS 4.6 and are supposed to land in a different RHEL 8.2 z-stream later, so without those this also won't work:
https://bugzilla.redhat.com/show_bug.cgi?id=1857775
https://bugzilla.redhat.com/show_bug.cgi?id=1820052

Comment 6 zhaozhanqi 2020-08-13 01:29:58 UTC
RHEL 7.8 has always been supported in 4.x versions (4.3/4.4/4.5), and there were no issues before.

Comment 7 Dan Winship 2020-08-14 12:58:02 UTC
I'm not sure exactly how non-RHCOS RHEL nodes work, but it sounds like we need to just make sure jq gets installed on them. I assume there must already be infrastructure somewhere (MCO?) for ensuring that the RPMs we need are available on all nodes...

Comment 8 Tim Rozet 2020-08-14 21:37:08 UTC
(In reply to Dan Winship from comment #7)
> I'm not sure exactly how non-RHCOS RHEL nodes work, but it sounds like we
> need to just make sure jq gets installed on them. I assume there must
> already be infrastructure somewhere (MCO?) for ensuring that the RPMs we
> need are available on all nodes...

It's BYO RHEL, so I think the user would have to include the package. I'm not sure whether MCO can install the package. Alternatively, we could just stop using jq in the script. Additionally, we need the NM OVS fixes backported from 8.2 into 7.9z:
https://bugzilla.redhat.com/show_bug.cgi?id=1852106
https://bugzilla.redhat.com/show_bug.cgi?id=1820052

Comment 9 Vadim Rutkovsky 2020-08-17 18:32:33 UTC
cc'ing Russell, as `jq` needs to be installed on hosts using openshift-ansible

Comment 10 Russell Teague 2020-08-17 18:57:22 UTC
Support packages are installed on RHEL workers based on this list:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/defaults/main.yml#L20

If jq is required, it would need to be added to that list.  `jq` has not been a requirement for any components previously.

Comment 11 Tim Rozet 2020-08-17 20:33:15 UTC
We can stop using jq; I was going to hold off until we verify whether we can get backports for the NM OVS bugs.
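
As a rough illustration of what a jq-free lookup could look like (a sketch only, not necessarily the change that was eventually merged), the plain-text output of `ip route show default` can be parsed with awk. This avoids both jq and the `ip -j` option that the RHEL 7 node rejected. The function name `default_iface` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ip route show default` output on stdin and print
# the device of the first default route, without needing `ip -j` or jq.
default_iface() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "dev") { print $(i + 1); exit }
        }
    }'
}

# Possible use in a script:
#   iface=$(ip route show default | default_iface)
```

If there are several default routes, only the device of the first one is printed; with no default route the output is empty, which matches the empty `iface=` case the script already has to handle.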

Comment 17 errata-xmlrpc 2020-10-27 16:27:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196