Description of problem:
Scaling up a RHEL node on an OVN cluster fails with the following error in the ovn-node pod:

I0810 15:28:15.842012 21101 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 30935
I0810 15:28:15.842969 21101 ovs.go:157] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0810 15:28:15.848787 21101 ovs.go:160] exec(5): stdout: ""
I0810 15:28:15.848814 21101 ovs.go:161] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0810 15:28:15.848821 21101 ovs.go:163] exec(5): err: exit status 1
I0810 15:28:15.848833 21101 ovs.go:157] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0810 15:28:15.854481 21101 ovs.go:160] exec(6): stdout: ""
I0810 15:28:15.854508 21101 ovs.go:161] exec(6): stderr: ""
I0810 15:28:15.854512 21101 ovs.go:163] exec(6): err: exit status 2
F0810 15:28:15.854613 21101 ovnkube.go:129] failed to convert br-ex to OVS bridge: Link not found

Version-Release number of selected component (if applicable):
4.6

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
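For readers of the log above: per the ovs-vsctl documentation, `br-exists` exits with status 0 when the bridge exists and status 2 when it does not, so the "exit status 2" line means br-ex had not been created on the node. A minimal sketch of that mapping follows; the helper name `describe_br_exists_status` is hypothetical and is not part of ovn-kubernetes or Open vSwitch.

```shell
# Hypothetical helper: translate the exit status of `ovs-vsctl br-exists`
# into a readable message. It only interprets a status code and does not
# call ovs-vsctl itself.
describe_br_exists_status() {
    case "$1" in
        0) echo "bridge exists" ;;
        2) echo "bridge does not exist" ;;
        *) echo "ovs-vsctl error (status $1)" ;;
    esac
}

# Intended use on a node with Open vSwitch installed (not run here):
#   ovs-vsctl br-exists br-ex; describe_br_exists_status "$?"
```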
Created attachment 1711078 [details] ovn-node-logs
Looking at your setup, your new nodes failed during ovs-configuration service:

Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Starting Configures OVS with proper host networking configuration...
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + iface=
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + counter=0
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + '[' 0 -lt 12 ']'
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: ++ ip -j route show default
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: ovs-configuration.service: main process exited, code=exited, status=127/n/a
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: ++ jq -r '.[0].dev'
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: /usr/local/bin/configure-ovs.sh: line 14: jq: command not found
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: Option "-j" is unknown, try "ip -help".
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: Unit ovs-configuration.service entered failed state.
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 configure-ovs.sh[1038]: + iface=
Aug 11 12:31:54 zzhaoovn46-bpqgh-rhel-0 systemd[1]: ovs-configuration.service failed.

sh-4.2# which jq
which: no jq in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

You are missing jq somehow on this node. Any ideas how that would be possible? Are you sure the new nodes have the right OS image?
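The `which jq` check above can be generalized into a small preflight test that a node has the tools configure-ovs.sh calls. This is only a sketch: the function name `missing_commands` is hypothetical, and a PATH check alone would not catch the second failure in the log, where `ip` is present but too old to accept `-j`.

```shell
# Hypothetical preflight helper: print each named command that is not on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example: check the two tools that configure-ovs.sh invokes in the log.
missing=$(missing_commands ip jq)
if [ -n "$missing" ]; then
    echo "missing commands: $missing" >&2
fi
```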
Looks like your new nodes have the wrong OS image:

zzhaoovn46-bpqgh-master-2 Ready master 28h v1.19.0-rc.2+5241b27-dirty 10.0.0.7 <none> Red Hat Enterprise Linux CoreOS 46.82.202008102140-0 (Ootpa) 4.18.0-211.el8.x86_64 cri-o://1.19.0-71.rhaos4.6.git19455e9.el8-dev
zzhaoovn46-bpqgh-rhel-0 NotReady worker 27h v1.19.0-rc.2+9932f63-dirty 10.0.1.6 <none> Red Hat Enterprise Linux Server 7.8 (Maipo) 3.10.0-1127.18.2.el7.x86_64 cri-o://1.19.0-71.rhaos4.6.git19455e9.el7-dev
I think we should be using RHEL 8.2 and not 7.8. Can someone confirm? If so, for RHEL 8.2 we need to answer the following questions:
1. Does RHEL 8.2 have jq by default? If not, that's a problem.
2. There were NetworkManager-specific fixes that went into a hotfix build for RHCOS 4.6 and are supposed to land in a different RHEL 8.2 z-stream later, so without those this also won't work:
https://bugzilla.redhat.com/show_bug.cgi?id=1857775
https://bugzilla.redhat.com/show_bug.cgi?id=1820052
RHEL 7.8 has always been supported in 4.x versions (4.3/4.4/4.5), and there were no issues before.
I'm not sure exactly how non-RHCOS RHEL nodes work, but it sounds like we need to just make sure jq gets installed on them. I assume there must already be infrastructure somewhere (MCO?) for ensuring that the RPMs we need are available on all nodes...
(In reply to Dan Winship from comment #7)
> I'm not sure exactly how non-RHCOS RHEL nodes work, but it sounds like we
> need to just make sure jq gets installed on them. I assume there must
> already be infrastructure somewhere (MCO?) for ensuring that the RPMs we
> need are available on all nodes...

It's BYO RHEL, so I think the user would have to include the package. I'm not sure if MCO can install the package. An alternative is we could just remove using jq from the script.

Additionally we need fixes backported for NM OVS from 8.2 into 7.9z:
https://bugzilla.redhat.com/show_bug.cgi?id=1852106
https://bugzilla.redhat.com/show_bug.cgi?id=1820052
cc'ing Russel as `jq` needs to be installed on hosts using openshift-ansible
Support packages are installed on RHEL workers based on this list: https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/defaults/main.yml#L20 If jq is required, it would need to be added to that list. `jq` has not been a requirement for any components previously.
We can remove the use of jq; I was going to hold off until we can verify whether we can get backports for the NM OVS bugs.
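For illustration, one way the `ip -j route show default | jq -r '.[0].dev'` step seen in the service log could be done without jq or JSON output is to parse the plain-text route listing. This is a sketch under assumptions, not the change that actually shipped: the function name `default_iface_from_routes` is hypothetical, and it assumes the usual `default via <gw> dev <iface> ...` line format.

```shell
# Hypothetical replacement for: ip -j route show default | jq -r '.[0].dev'
# Reads plain `ip route show default` output on stdin and prints the device
# named after "dev" on the first default route. Prints nothing if there is
# no default route.
default_iface_from_routes() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "dev") { print $(i + 1); exit }
        }
    }'
}

# On a node this would be used as (not run here):
#   iface=$(ip route show default | default_iface_from_routes)
```

Reading from stdin keeps the parsing separate from the `ip` call, so the same function works on RHEL 7, where `ip` rejects `-j`, and can be exercised with canned input.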
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196