Bug 1925475 - high CPU usage fails ingress controller
Summary: high CPU usage fails ingress controller
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Daniel Mellado
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1905579
Depends On: 1935604
Blocks:
Reported: 2021-02-05 10:23 UTC by Alexander Niebuhr
Modified: 2021-03-25 01:53 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1935604
Environment:
Last Closed: 2021-03-25 01:53:01 UTC
Target Upstream Version:


Links
| System | ID | Private | Priority | Status | Summary | Last Updated |
|---|---|---|---|---|---|---|
| Github | openshift ovn-kubernetes pull 451 | 0 | None | open | Bug 1925475: [release-4.7] Bump OVN to ovn2.13-20.12.0-24.el8fdp | 2021-03-05 09:17:44 UTC |
| Red Hat Product Errata | RHBA-2021:0821 | 0 | None | None | None | 2021-03-25 01:53:16 UTC |

Description Alexander Niebuhr 2021-02-05 10:23:04 UTC
Description of problem:
High CPU usage by OVN makes the internal network unstable, which causes the ingress controller to fail to route to the services.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
good

Steps to Reproduce:
1. install cluster
2. add some namespaces and services with pods
3. generate multiple routes

Actual results:
unstable ingress routing to services

Expected results:
stable ingress routing

Additional info:
https://drive.google.com/file/d/1Q9q-b5PTVBSyYvgBgAFfDgawKny1vLIf/view?usp=sharing

Comment 1 Vadim Rutkovsky 2021-02-05 10:31:23 UTC
Some more details from OKD bugreport - https://github.com/openshift/okd/issues/405:

* doesn't affect 4.5
* NetworkManager and ovn-controller are top CPU consumers
* seems to be caused by a race, as after several reboots this doesn't happen anymore

Comment 2 Kai-Uwe Rommel 2021-02-05 11:12:50 UTC
See also here. Same problem.
https://bugzilla.redhat.com/show_bug.cgi?id=1905579

Comment 3 Vadim Rutkovsky 2021-02-25 22:29:48 UTC
*** Bug 1905579 has been marked as a duplicate of this bug. ***

Comment 4 Vadim Rutkovsky 2021-02-25 22:37:25 UTC
I still see this happening in OKD 4.7:
```
$ oc adm release info quay.io/openshift/okd:4.7.0-0.okd-2021-02-25-144700 --commit-urls | grep ovn      
  ovn-kubernetes                                 https://github.com/openshift/ovn-kubernetes/commit/ef03521f5daede4fe0f8afd9f42035259636006b
```

In https://github.com/openshift/okd/issues/405 Dan mentioned it can be a mismatch between node name and a hostname:
>Given the investigation on the OVS bug bugzilla.redhat.com/show_bug.cgi?id=1905579 it seems something is making ovn-controller add/remove GENEVE ports to other nodes. We recently fixed a similar issue that was related to hostnames, because hostnames are used as the chassis record index. This was fixed in upstream ovn-kubernetes in ovn-org/ovn-kubernetes#1653 and fixed downstream in OpenShift via openshift/ovn-kubernetes#279
>Is each machine's hostname the same as the node name in the Kube API?

In my case node hostname matches the nodename:
```
[root@bmo core]# hostname
bmo.vrutkovs.eu
[root@bmo core]# oc get nodes
NAME                STATUS   ROLES           AGE    VERSION
bmo.vrutkovs.eu     Ready    master,worker   204d   v1.20.0+5fbfd19-1046
neptr.vrutkovs.eu   Ready    worker          204d   v1.20.0+5fbfd19-1046

```
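
The hostname-versus-node-name comparison done by eye above can also be scripted. The following is a minimal sketch, not something from the original report: the function name `hostname_is_node` is hypothetical, and it only checks for an exact, whole-line match of the hostname in a list of node names.

```shell
# Hypothetical helper (bash): succeed if the given hostname appears verbatim
# in a newline-separated list of node names.
hostname_is_node() {
    local host="$1" nodes="$2"
    # -x: match the whole line, -F: fixed string (dots are not regex wildcards)
    printf '%s\n' "$nodes" | grep -qxF -- "$host"
}

# Usage on a node (requires oc and cluster access):
#   nodes="$(oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"
#   hostname_is_node "$(hostname)" "$nodes" && echo match || echo MISMATCH
```

An exact match is used on purpose, since a short hostname such as `bmo` against a node named `bmo.vrutkovs.eu` is the kind of mismatch the quoted question is asking about.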

Comment 5 Vadim Rutkovsky 2021-02-25 22:49:58 UTC
Seems we're hitting https://bugzilla.redhat.com/show_bug.cgi?id=1903210, fixed in ovn 20.12.0-20 (https://github.com/ovn-org/ovn/commit/e7788554a7f5e824fc0d8afc6cbf20e94fe4245f).

`ovnkube-node` is using `ovn2.13-20.09.0-21.el8fdn.x86_64`
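
To tell whether a given `ovn2.13` build is at or past the build named above as containing the fix (20.12.0-20), a version comparison could look like the sketch below. The function name `ovn_has_fix` is hypothetical, the threshold is simply copied from this comment, and the code assumes bash, GNU `sort -V`, and a base package string of the form `ovn2.13-<version>-<release>.el8...` (subpackages such as `ovn2.13-host` are not handled).

```shell
# Hypothetical helper (bash): report whether an ovn2.13 package string is at
# or beyond the build this comment names as fixed (20.12.0-20).
# Assumes GNU sort with -V (version sort) is available.
FIXED_BUILD="20.12.0-20"

ovn_has_fix() {
    local pkg="$1" build lowest
    # Strip the "ovn2.13-" name prefix and the ".el8..." dist/arch suffix,
    # leaving "<version>-<release>", e.g. "20.09.0-21".
    build="${pkg#ovn2.13-}"
    build="${build%%.el*}"
    # The lower of the two versions sorts first; the fix is present when the
    # threshold is the lower (or equal) one.
    lowest="$(printf '%s\n%s\n' "$FIXED_BUILD" "$build" | sort -V | head -n1)"
    [ "$lowest" = "$FIXED_BUILD" ]
}

# Usage inside the ovnkube-node container:
#   ovn_has_fix "$(rpm -q ovn2.13)" && echo "fix present" || echo "fix missing"
```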

Comment 6 Kai-Uwe Rommel 2021-02-26 09:20:48 UTC
So how will we progress and solve the problem?
How can we get the update (which we hope will solve the problem) into the product stream, and thus also into OKD?

Comment 7 Dan Williams 2021-03-01 16:28:36 UTC
The fix is unlikely to come to OKD 4.6; but is certainly possible for 4.7. FWIW.

Comment 8 Vadim Rutkovsky 2021-03-01 17:38:30 UTC
(In reply to Dan Williams from comment #7)
> The fix is unlikely to come to OKD 4.6; but is certainly possible for 4.7.
> FWIW.

Perfect, thanks. OKD stable has moved to 4.7 builds, so backporting this to 4.6 is not a priority

Comment 9 Kai-Uwe Rommel 2021-03-01 18:39:34 UTC
We also don't care for which version the fix is implemented ... as long as it is done ASAP.

Comment 12 Hongan Li 2021-03-16 06:46:56 UTC
Verified with 4.7.0-0.nightly-2021-03-14-223051; passed.

OVN is updated to ovn2.13-20.12.0-24.el8fdp:

sh-4.4# rpm -qa | grep ovn
ovn2.13-20.12.0-24.el8fdp.x86_64
ovn2.13-host-20.12.0-24.el8fdp.x86_64
ovn2.13-central-20.12.0-24.el8fdp.x86_64
ovn2.13-vtep-20.12.0-24.el8fdp.x86_64


Created 100 namespaces with a pod, service, and route in each namespace; all ingress routing is working well.
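
A load of this shape could be generated with a loop along the following lines. This is only a sketch, not the script used for the verification: the function name `gen_load_cmds`, the namespace prefix `ingress-test-`, the pod name `web`, and port 8080 are all assumptions, and the container image has to be supplied by the caller. The function prints the `oc` commands rather than running them, so they can be reviewed first.

```shell
# Sketch (bash): emit the oc commands that create N namespaces, each with one
# pod, one service, and one route. All names and the port are placeholders,
# not taken from the original verification.
gen_load_cmds() {
    local count="$1" image="$2" i ns
    for i in $(seq 1 "$count"); do
        ns="ingress-test-$i"
        echo "oc create namespace $ns"
        echo "oc -n $ns run web --image=$image --port=8080"   # pod
        echo "oc -n $ns expose pod web --port=8080"           # service
        echo "oc -n $ns expose service web"                   # route
    done
}

# Review the output first, then pipe it to a shell to apply:
#   gen_load_cmds 100 <your-image> | sh
```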

Comment 14 errata-xmlrpc 2021-03-25 01:53:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.3 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0821

