Bug 1905579

Summary:

high CPU usage (100%) of NetworkManager and ovsdb-server

Product:

[Fedora] Fedora

Reporter:

Kai-Uwe Rommel <kai-uwe.rommel>

Component:

ovn

Assignee:

Numan Siddique <nusiddiq>

Status:

CLOSED DUPLICATE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

CC:

acardace, aconole, bgalvani, cglombek, chrisw, dcbw, fgiudici, fleitner, lkundrak, mclasen, mf.flip, numan.siddique, nusiddiq, philipp.dallig, rsevilla, rstrode, sandmann, tgraf, tredaelli, vrutkovs

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-02-25 22:29:48 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
NetworkManager trace	none
ovsdb-server.log and ovs-vswitchd.log	none

Description Kai-Uwe Rommel 2020-12-08 15:26:20 UTC

Description of problem:
After installing an OKD 4.6-20201127 cluster, on some (but not all) nodes (both master and workers affected) there is very high CPU load caused by NetworkManager and ovsdb-server. After rebooting the cluster node, CPU load is normal but the problem comes back after a couple of hours/days.

Version-Release number of selected component (if applicable):
Fedora CoreOS 33.20201124.10.1
installed on vSphere UPI

How reproducible:
100%, see above

Steps to Reproduce:
See above.

Actual results:
Much too high CPU usage of OKD cluster nodes.

Expected results:
Lower CPU usage.

Additional info:
https://github.com/openshift/okd/issues/405

Comment 1 Kai-Uwe Rommel 2020-12-21 18:44:34 UTC

Anyone care to comment?

Comment 2 Kai-Uwe Rommel 2021-01-27 15:04:14 UTC

Anyone working on this ???

Comment 3 Christian Glombek 2021-01-29 15:00:43 UTC

@rsevilla, I'm not entirely sure you're the right person to tag here, but could you take a look?

Comment 4 Flavio Leitner 2021-02-05 11:26:13 UTC

Hi,

The ovsdb-server is reactive, so most probably its load is due to work requested by NetworkManager.
Therefore, I am going to reassign this to NetworkManager to understand what is going on.
If this turns out to be an ovsdb-server issue please reassign back.
Thanks
fbl

Comment 5 Beniamino Galvani 2021-02-05 16:27:58 UTC

Please run the following commands as root when NetworkManager shows high CPU usage:

nmcli general logging level trace
sleep 20
nmcli general logging level info
journalctl -u NetworkManager --since="20 seconds ago" > log.txt
gzip log.txt

and attach the resulting log.txt.gz file. Thank you.

Comment 6 Kai-Uwe Rommel 2021-02-05 20:57:52 UTC

Created attachment 1755303 [details]
NetworkManager trace

Comment 7 Beniamino Galvani 2021-02-08 08:46:13 UTC

In the log I see a constant stream of "update" notifications (up to 30 per second) sent by the ovsdb to NM, which cause CPU usage on NM. From what I can see, those updates are not triggered by NM. Reassigning back to ovs to investigate the source of those messages.

Comment 8 Kai-Uwe Rommel 2021-02-09 16:14:41 UTC

Who will be picking up?

Comment 9 Flavio Leitner 2021-02-11 13:04:56 UTC

Could you please attach ovs-vswitchd.log and ovsdb-server.log?
Thanks
fbl

Comment 10 Flavio Leitner 2021-02-11 13:08:39 UTC

And also the output of 'ovsdb-tool -m show-log' please.

Comment 11 Kai-Uwe Rommel 2021-02-11 13:24:51 UTC

Created attachment 1756365 [details]
ovsdb-server.log and ovs-vswitchd.log

This OKD cluster was set up from scratch on Feb 7th.
One master and one (of the total of three) worker node currently show the problem.
The logs are from the worker node.

Comment 12 Kai-Uwe Rommel 2021-02-11 13:27:33 UTC

This cluster is a dev cluster and currently not yet in use.
If you wish to take a closer look yourself, we can certainly arrange a screen sharing session.

But: so far, all (!) OKD 4.6 clusters that I have installed (all UPI on vSphere) show this problem.
And other users have reported in the linked Github issue that they see the same problem.
So you should also easily be able to reproduce the problem.

Comment 13 Flavio Leitner 2021-02-11 17:32:55 UTC

The connection with ovsdb-server is not stable:
2021-02-11T00:00:00.872Z|68128|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2021-02-11T00:00:01.556Z|68129|jsonrpc|WARN|unix#114061: send error: Broken pipe
2021-02-11T00:00:01.556Z|68130|reconnect|WARN|unix#114061: connection dropped (Broken pipe)
2021-02-11T00:00:01.699Z|68131|jsonrpc|WARN|unix#114067: send error: Broken pipe
2021-02-11T00:00:01.702Z|68132|reconnect|WARN|unix#114067: connection dropped (Broken pipe)
2021-02-11T00:00:02.198Z|68133|jsonrpc|WARN|unix#114087: send error: Broken pipe
2021-02-11T00:00:02.198Z|68134|reconnect|WARN|unix#114087: connection dropped (Broken pipe)
[...]


And something is adding and removing ports in a loop:
2021-02-08T23:45:03.246Z|06661|bridge|INFO|bridge br-int: added interface 8afd5d7763b626a on port 640
2021-02-08T23:45:07.264Z|06668|bridge|INFO|bridge br-int: deleted interface 8afd5d7763b626a on port 640
[...]
2021-02-08T23:45:17.755Z|06678|bridge|INFO|bridge br-int: added interface 6e70d311fdae791 on port 642
2021-02-08T23:45:35.258Z|06684|bridge|INFO|bridge br-int: deleted interface 6e70d311fdae791 on port 642
[...]

But what I think is really happening seems to be this:
record 31064: 2021-02-11 13:18:15.826 "ovn-controller: modifying OVS tunnels '56689d89-4ec6-4ea1-976e-b2555eba6b31'"
  table Bridge row "br-int" (493fe701):

record 31065: 2021-02-11 13:18:15.860 "ovn-controller: modifying OVS tunnels '56689d89-4ec6-4ea1-976e-b2555eba6b31'"
  table Bridge row "br-int" (493fe701):

record 31066: 2021-02-11 13:18:15.889 "ovn-controller: modifying OVS tunnels '56689d89-4ec6-4ea1-976e-b2555eba6b31'"
  table Bridge row "br-int" (493fe701):

record 31067: 2021-02-11 13:18:15.922 "ovn-controller: modifying OVS tunnels '56689d89-4ec6-4ea1-976e-b2555eba6b31'"
  table Bridge row "br-int" (493fe701):


It seems like ovn-controller is modifying that tunnel 32x per second, which then an update is pushed to NetworkManager.


I am going to reassign to OVN to see why it is doing that.


fbl

Comment 14 Kai-Uwe Rommel 2021-02-23 15:18:12 UTC

Any news? Almost two weeks have passed by again since last entry here...

Comment 15 Numan Siddique 2021-02-23 15:50:12 UTC

What is the OVN version used here ?

Can you please attach ovn-controller logs ?

Looks to me there could be some misconfiguration. Can you also attach OVN databases ?
ovnnb_db.db, ovnsb_db.db and the local ovsdb-server conf.db

Thanks
Numan

Comment 16 Kai-Uwe Rommel 2021-02-23 15:58:11 UTC

Hello Numan, what exactly do you mean with ovn-controller logs?
Pod logs from the ovnkube-master/ovnkube-node pods? Something else?
And where would I find the db files?
Sorry, I'm not into the details of OVNKubernetes ... :-(

Comment 17 Numan Siddique 2021-02-23 19:13:45 UTC

(In reply to Kai-Uwe Rommel from comment #16)
> Hello Numan, what exactly do you mean with ovn-controller logs?
> Pod logs from the ovnkube-master/ovnkube-node pods? Something else?
> And where would I find the db files?
> Sorry, I'm not into the details of OVNKubernetes ... :-(

Yes. Logs from the ovnkube pods.

I think ovn-controller is run in the ovnkube-node pods.  And ovn dbs should be running in ovnkube-master.

If you login to nbdb and sbdb containers runnng in ovnkube-master, the ovn db will be present - /etc/ovn/ovnnb_db.db 
and /etc/ovn/ovnsb_db.db

Thanks

Comment 18 Kai-Uwe Rommel 2021-02-23 19:22:32 UTC

There is a ovnkube-node pod on each node of the cluster. And a total of three ovnkube-master pods. Logs and db's from each?
Or just samples from one of each?

Comment 19 Numan Siddique 2021-02-23 19:25:55 UTC

Logs of all ovn-controllers running in ovnkube-nodes and dbs from any one of the master node is good enough.

Comment 20 Numan Siddique 2021-02-23 19:27:33 UTC

And also the ovsdb conf.db from the node where you are seeing this issue. This file should be present in th host - /etc/openvswitch

Comment 21 Kai-Uwe Rommel 2021-02-24 14:01:00 UTC

I wanted to make the ticket public. Christian Glombek apparently changed this now (I couldn't).
I'll upload the requested logs/files later today.
I was distracted by other important work. Sorry.

Comment 22 Kai-Uwe Rommel 2021-02-24 15:13:06 UTC

I'm seeing the CPU issue at the moment on all three worker and on one master node of my sample cluster.
It's different on other clusters.
If you look at the ovnkube-node (ovn-controller) logs, the four large ones are those which show the problem.

I copied the /etc/openvswitch/conf.db from the master-03 node which shows the problem. But it is huge (1151 MB)!
It's huge on all nodes which show the problem and small (around 40k) on the nodes without the problem.

I zipped up everything as logs20210224.zip and uploaded here.
The files compressed very well...

Comment 23 Kai-Uwe Rommel 2021-02-24 15:21:36 UTC

Ok, the upload here does not work, I only get some "POST internal error" even though I'm below the specified limit.
Please get the requested log files here: http://domino.ars.de/file/logs20210224.7z
(The download link will be valid for 7 days only.)

Comment 24 Numan Siddique 2021-02-24 18:48:20 UTC

I'm pretty sure you're hitting this issue - https://bugzilla.redhat.com/show_bug.cgi?id=1903210

The patch to fix this issue is already merged upstream - https://github.com/ovn-org/ovn/commit/e7788554a7f5e824fc0d8afc6cbf20e94fe4245f

The fix is available in Fedora in the version - ovn-20.12.0-19

https://src.fedoraproject.org/rpms/ovn/blob/rawhide/f/ovn.spec#_554

Please update your ovnkube container images with this or higher version of OVN packages.

Thanks
Numan

Comment 25 Kai-Uwe Rommel 2021-02-25 21:46:26 UTC

How could I update the ovnkube images ... this is OKD and OKD is updating itself.
So when will this be in OKD?

Comment 26 Vadim Rutkovsky 2021-02-25 22:29:48 UTC

Wrong product and component, OKD doesn't use OVN from Fedora

*** This bug has been marked as a duplicate of bug 1925475 ***