Description of problem:
The high CPU usage of OVN makes the internal network unstable, which causes the ingress controller to fail to route to the services.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
good

Steps to Reproduce:
1. Install a cluster
2. Add some namespaces and services with pods
3. Generate multiple routes

Actual results:
Unstable ingress routing to services

Expected results:
Stable ingress routing

Additional info:
https://drive.google.com/file/d/1Q9q-b5PTVBSyYvgBgAFfDgawKny1vLIf/view?usp=sharing
Some more details from the OKD bug report (https://github.com/openshift/okd/issues/405):
* doesn't affect 4.5
* NetworkManager and ovn-controller are the top CPU consumers
* seems to be caused by a race, as it no longer happens after several reboots
See also https://bugzilla.redhat.com/show_bug.cgi?id=1905579, which describes the same problem.
*** Bug 1905579 has been marked as a duplicate of this bug. ***
I still see this happening in OKD 4.7:
```
$ oc adm release info quay.io/openshift/okd:4.7.0-0.okd-2021-02-25-144700 --commit-urls | grep ovn
  ovn-kubernetes  https://github.com/openshift/ovn-kubernetes/commit/ef03521f5daede4fe0f8afd9f42035259636006b
```
In https://github.com/openshift/okd/issues/405 Dan mentioned it could be caused by a mismatch between the node name and the hostname:
>Given the investigation on the OVS bug bugzilla.redhat.com/show_bug.cgi?id=1905579 it seems something is making ovn-controller add/remove GENEVE ports to other nodes. We recently fixed a similar issue that was related to hostnames, because hostnames are used as the chassis record index. This was fixed in upstream ovn-kubernetes in ovn-org/ovn-kubernetes#1653 and fixed downstream in OpenShift via openshift/ovn-kubernetes#279

>Is each machine's hostname the same as the node name in the Kube API?

In my case the node hostname matches the node name:
```
[root@bmo core]# hostname
bmo.vrutkovs.eu
[root@bmo core]# oc get nodes
NAME                STATUS   ROLES           AGE    VERSION
bmo.vrutkovs.eu     Ready    master,worker   204d   v1.20.0+5fbfd19-1046
neptr.vrutkovs.eu   Ready    worker          204d   v1.20.0+5fbfd19-1046
```
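The hostname check above can also be done mechanically. The following is only a minimal sketch: the helper names `node_names` and `hostname_matches_node` are hypothetical, and it assumes the default tabular output of `oc get nodes` (a header row followed by one row per node, with the node name in the first column).

```python
def node_names(oc_get_nodes_output):
    """Return the NAME column from default `oc get nodes` output (assumed layout)."""
    lines = [line for line in oc_get_nodes_output.splitlines() if line.strip()]
    # Skip the header row; the node name is the first whitespace-separated field.
    return [line.split()[0] for line in lines[1:]]


def hostname_matches_node(hostname, oc_get_nodes_output):
    """True if the machine's hostname is exactly one of the node names."""
    return hostname in node_names(oc_get_nodes_output)
```

Feeding it the output of `hostname` and `oc get nodes` from a node gives a quick yes/no answer to the question quoted above. An exact string comparison is used on purpose, since a short hostname versus an FQDN would also count as a mismatch.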
Seems we're hitting https://bugzilla.redhat.com/show_bug.cgi?id=1903210, fixed in ovn 20.12.0-20 (https://github.com/ovn-org/ovn/commit/e7788554a7f5e824fc0d8afc6cbf20e94fe4245f). `ovnkube-node` is still using `ovn2.13-20.09.0-21.el8fdn.x86_64`, which predates the fix.
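To check whether a given OVN package already contains the fix, the package string can be compared against 20.12.0-20. This is a rough sketch, not a full RPM version comparison: `parse_nvra` and `has_fix` are hypothetical helpers, and the code assumes a `name-version-release.arch` string with a purely numeric dotted version and a release that starts with a number.

```python
import re

# Version and release that carry the fix, per the comment above (ovn 20.12.0-20).
FIXED = ((20, 12, 0), 20)


def parse_nvra(nvra):
    """Split 'name-version-release.arch' into (name, version tuple, release number)."""
    nvr, _arch = nvra.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    version_tuple = tuple(int(part) for part in version.split("."))
    match = re.match(r"\d+", release)
    if match is None:
        raise ValueError(f"release does not start with a number: {release!r}")
    return name, version_tuple, int(match.group())


def has_fix(nvra):
    """True if the package is at or above the fixed version-release (simplified compare)."""
    _name, version, release = parse_nvra(nvra)
    return (version, release) >= FIXED
```

With this simplified comparison, `has_fix("ovn2.13-20.09.0-21.el8fdn.x86_64")` is false because 20.09.0 sorts before 20.12.0. For anything beyond a quick check, a real RPM version comparison should be used instead.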
So how do we move forward on this? How can we make sure the update (which we hope will solve the problem) gets into the product stream, and from there into OKD?
The fix is unlikely to come to OKD 4.6; but is certainly possible for 4.7. FWIW.
(In reply to Dan Williams from comment #7)
> The fix is unlikely to come to OKD 4.6; but is certainly possible for 4.7.
> FWIW.

Perfect, thanks. OKD stable has moved to 4.7 builds, so backporting this to 4.6 is not a priority.
We also don't mind which version the fix lands in, as long as it is done ASAP.
Verified with 4.7.0-0.nightly-2021-03-14-223051 and passed. OVN is updated to ovn2.13-20.12.0-24.el8fdp:

sh-4.4# rpm -qa | grep ovn
ovn2.13-20.12.0-24.el8fdp.x86_64
ovn2.13-host-20.12.0-24.el8fdp.x86_64
ovn2.13-central-20.12.0-24.el8fdp.x86_64
ovn2.13-vtep-20.12.0-24.el8fdp.x86_64

Created 100 namespaces with a pod, service, and route in each namespace; all ingress routing is working well.
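The comment does not say how the 100 namespaces were populated. One possible way to generate a comparable load is sketched below. The function name `build_commands`, the `ovn-test` prefix, the pod name `hello`, port 8080, and the image reference are all placeholders chosen for illustration, and the sketch only builds the `oc` command lines for a dry run rather than executing them.

```python
import shlex


def build_commands(count, image, prefix="ovn-test", port=8080):
    """Build `oc` argument lists for `count` namespaces, each with a pod, service and route."""
    commands = []
    for i in range(count):
        ns = f"{prefix}-{i}"
        commands.append(["oc", "create", "namespace", ns])
        # One pod per namespace; the image is supplied by the caller.
        commands.append(["oc", "run", "hello", f"--image={image}", f"--port={port}", "-n", ns])
        # Expose the pod as a service, then expose the service as a route.
        commands.append(["oc", "expose", "pod", "hello", f"--port={port}", "-n", ns])
        commands.append(["oc", "expose", "service", "hello", "-n", ns])
    return commands


if __name__ == "__main__":
    # Dry run for a single namespace; the image reference is a placeholder.
    for cmd in build_commands(1, "registry.example.com/hello:latest"):
        print(shlex.join(cmd))
```

Calling `build_commands(100, image)` yields the commands for 100 namespaces. Each list can be passed to `subprocess.run` once the placeholders have been replaced with real values.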
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.3 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0821