Description of problem:
After installing an OKD 4.6-20201127 cluster, some (but not all) nodes (both masters and workers are affected) show very high CPU load caused by NetworkManager and ovsdb-server. After rebooting the affected node, CPU load is normal, but the problem comes back after a couple of hours or days.

Version-Release number of selected component (if applicable):
Fedora CoreOS 33.20201124.10.1 installed on vSphere UPI

How reproducible:
100%, see above

Steps to Reproduce:
See above.

Actual results:
Much too high CPU usage on OKD cluster nodes.

Expected results:
Lower CPU usage.

Additional info:
https://github.com/openshift/okd/issues/405
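For anyone trying to confirm the symptom before digging in, this is roughly how the load can be checked on an affected node (a minimal sketch; only the process names come from the report above, and the exact commands are just one way of looking):

# top CPU consumers, non-interactively
top -b -n 1 -o %CPU | head -n 20

# or sample just the two suspect processes every 5 seconds
pidstat -u 5 -C "NetworkManager|ovsdb-server"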
Anyone care to comment?
Anyone working on this?
@rsevilla, I'm not entirely sure you're the right person to tag here, but could you take a look?
Hi,

The ovsdb-server is reactive, so most probably its load is due to work requested by NetworkManager. Therefore, I am going to reassign this to NetworkManager to understand what is going on. If this turns out to be an ovsdb-server issue, please reassign it back.

Thanks,
fbl
Please run the following commands as root when NetworkManager shows high CPU usage:

nmcli general logging level trace
sleep 20
nmcli general logging level info
journalctl -u NetworkManager --since="20 seconds ago" > log.txt
gzip log.txt

and attach the resulting log.txt.gz file. Thank you.
Created attachment 1755303 [details]
NetworkManager trace
In the log I see a constant stream of "update" notifications (up to 30 per second) sent by ovsdb to NM, which causes the CPU usage in NM. From what I can see, those updates are not triggered by NM. Reassigning back to ovs to investigate the source of those messages.
Who will be picking this up?
Could you please attach ovs-vswitchd.log and ovsdb-server.log?

Thanks,
fbl
And also the output of 'ovsdb-tool -m show-log' please.
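For reference, assuming the database is at its default location on the node, that would be something like:

# -m can be given more than once for additional detail per record
ovsdb-tool -m show-log /etc/openvswitch/conf.db > show-log.txt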
Created attachment 1756365 [details]
ovsdb-server.log and ovs-vswitchd.log

This OKD cluster was set up from scratch on Feb 7th. One master node and one of the three worker nodes currently show the problem. The logs are from the worker node.
This cluster is a dev cluster and currently not yet in use. If you wish to take a closer look yourself, we can certainly arrange a screen sharing session. But: so far, all (!) OKD 4.6 clusters that I have installed (all UPI on vSphere) show this problem. And other users have reported in the linked Github issue that they see the same problem. So you should also easily be able to reproduce the problem.
The connection with ovsdb-server is not stable:

2021-02-11T00:00:00.872Z|68128|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2021-02-11T00:00:01.556Z|68129|jsonrpc|WARN|unix#114061: send error: Broken pipe
2021-02-11T00:00:01.556Z|68130|reconnect|WARN|unix#114061: connection dropped (Broken pipe)
2021-02-11T00:00:01.699Z|68131|jsonrpc|WARN|unix#114067: send error: Broken pipe
2021-02-11T00:00:01.702Z|68132|reconnect|WARN|unix#114067: connection dropped (Broken pipe)
2021-02-11T00:00:02.198Z|68133|jsonrpc|WARN|unix#114087: send error: Broken pipe
2021-02-11T00:00:02.198Z|68134|reconnect|WARN|unix#114087: connection dropped (Broken pipe)
[...]

And something is adding and removing ports in a loop:

2021-02-08T23:45:03.246Z|06661|bridge|INFO|bridge br-int: added interface 8afd5d7763b626a on port 640
2021-02-08T23:45:07.264Z|06668|bridge|INFO|bridge br-int: deleted interface 8afd5d7763b626a on port 640
[...]
2021-02-08T23:45:17.755Z|06678|bridge|INFO|bridge br-int: added interface 6e70d311fdae791 on port 642
2021-02-08T23:45:35.258Z|06684|bridge|INFO|bridge br-int: deleted interface 6e70d311fdae791 on port 642
[...]

But what I think is really happening is this:

record 31064: 2021-02-11 13:18:15.826 "ovn-controller: modifying OVS tunnels '56689d89-4ec6-4ea1-976e-b2555eba6b31'" table Bridge row "br-int" (493fe701):
record 31065: 2021-02-11 13:18:15.860 "ovn-controller: modifying OVS tunnels '56689d89-4ec6-4ea1-976e-b2555eba6b31'" table Bridge row "br-int" (493fe701):
record 31066: 2021-02-11 13:18:15.889 "ovn-controller: modifying OVS tunnels '56689d89-4ec6-4ea1-976e-b2555eba6b31'" table Bridge row "br-int" (493fe701):
record 31067: 2021-02-11 13:18:15.922 "ovn-controller: modifying OVS tunnels '56689d89-4ec6-4ea1-976e-b2555eba6b31'" table Bridge row "br-int" (493fe701):

It seems like ovn-controller is modifying that tunnel about 32 times per second, and each modification then pushes an update to NetworkManager. I am going to reassign to OVN to see why it is doing that.

fbl
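To put a number on that rate, the show-log output can be grouped by second, e.g. (a rough sketch; show-log.txt is assumed to be the captured output of 'ovsdb-tool -m show-log' from above):

# count "modifying OVS tunnels" transactions per second
grep 'modifying OVS tunnels' show-log.txt \
  | awk '{print $3, substr($4, 1, 8)}' \
  | sort | uniq -c | sort -rn | head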
Any news? Almost two weeks have passed again since the last entry here...
What is the OVN version used here? Can you please attach the ovn-controller logs? It looks to me like there could be some misconfiguration. Can you also attach the OVN databases? ovnnb_db.db, ovnsb_db.db and the local ovsdb-server conf.db

Thanks,
Numan
Hello Numan, what exactly do you mean by ovn-controller logs? Pod logs from the ovnkube-master/ovnkube-node pods? Something else? And where would I find the db files? Sorry, I'm not into the details of OVNKubernetes ... :-(
(In reply to Kai-Uwe Rommel from comment #16)
> Hello Numan, what exactly do you mean by ovn-controller logs?
> Pod logs from the ovnkube-master/ovnkube-node pods? Something else?
> And where would I find the db files?
> Sorry, I'm not into the details of OVNKubernetes ... :-(

Yes, the logs from the ovnkube pods. I think ovn-controller runs in the ovnkube-node pods, and the OVN dbs run in ovnkube-master. If you log in to the nbdb and sbdb containers running in ovnkube-master, the OVN databases will be present at /etc/ovn/ovnnb_db.db and /etc/ovn/ovnsb_db.db.

Thanks
There is a ovnkube-node pod on each node of the cluster. And a total of three ovnkube-master pods. Logs and db's from each? Or just samples from one of each?
Logs of all ovn-controllers running in the ovnkube-node pods, plus the dbs from any one of the master nodes, are good enough.
And also the ovsdb conf.db from the node where you are seeing this issue. This file should be present on the host in /etc/openvswitch.
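Roughly, collecting all of that could look like the following (a sketch only; the openshift-ovn-kubernetes namespace, the app=ovnkube-node label and the container names are assumptions based on a default OVNKubernetes deployment, and ovnkube-master-xxxxx is a placeholder pod name):

# ovn-controller logs from every ovnkube-node pod
for pod in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o name); do
  oc -n openshift-ovn-kubernetes logs "$pod" -c ovn-controller > "$(basename "$pod").log"
done

# OVN databases from one ovnkube-master pod
oc -n openshift-ovn-kubernetes cp ovnkube-master-xxxxx:/etc/ovn/ovnnb_db.db ovnnb_db.db -c nbdb
oc -n openshift-ovn-kubernetes cp ovnkube-master-xxxxx:/etc/ovn/ovnsb_db.db ovnsb_db.db -c sbdb

# local Open vSwitch database, copied from the affected host itself
scp core@<affected-node>:/etc/openvswitch/conf.db .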
I wanted to make the ticket public. Christian Glombek apparently changed this now (I couldn't). I'll upload the requested logs/files later today. I was distracted by other important work. Sorry.
I'm seeing the CPU issue at the moment on all three worker nodes and on one master node of my sample cluster. It's different on other clusters. If you look at the ovnkube-node (ovn-controller) logs, the four large ones are the ones that show the problem. I copied /etc/openvswitch/conf.db from the master-03 node, which shows the problem. But it is huge (1151 MB)! It's huge on all nodes that show the problem and small (around 40 KB) on the nodes without the problem. I zipped up everything as logs20210224.zip and uploaded it here. The files compressed very well...
Ok, the upload here does not work; I only get a "POST internal error" even though I'm below the specified size limit. Please get the requested log files here: http://domino.ars.de/file/logs20210224.7z (The download link will be valid for 7 days only.)
I'm pretty sure you're hitting this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1903210

The patch to fix this issue is already merged upstream:
https://github.com/ovn-org/ovn/commit/e7788554a7f5e824fc0d8afc6cbf20e94fe4245f

The fix is available in Fedora in version ovn-20.12.0-19:
https://src.fedoraproject.org/rpms/ovn/blob/rawhide/f/ovn.spec#_554

Please update your ovnkube container images with this or a higher version of the OVN packages.

Thanks,
Numan
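To see which OVN version the currently deployed images actually ship, something along these lines should work (again a sketch; the namespace and label are the same assumptions as in the collection commands above):

# print the OVN version from a running ovn-controller container
oc -n openshift-ovn-kubernetes exec -c ovn-controller \
  $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o name | head -n 1) \
  -- ovn-controller --version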
How would I update the ovnkube images? This is OKD, and OKD updates itself. So when will this fix be in OKD?
Wrong product and component; OKD doesn't use OVN from Fedora.

*** This bug has been marked as a duplicate of bug 1925475 ***