Bug 2214885 - [16.2.2] high ovs-vswitchd CPU usage on controller (most spent in native_queued_spin_lock_slowpath)
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Timothy Redaelli
QA Contact: qding
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-14 03:26 UTC by Robin Cernin
Modified: 2023-07-13 07:25 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:
ldenny: needinfo? (tredaelli)
rcernin: needinfo-


Links:
Red Hat Issue Tracker FD-2982 (last updated 2023-06-23 09:58:05 UTC)
Red Hat Issue Tracker OSP-25780 (last updated 2023-06-14 03:28:43 UTC)

Description Robin Cernin 2023-06-14 03:26:05 UTC
Description of problem:

Note: when we refer to the "non-working" node, we mean the node on which we observe the high CPU usage.

It is the controller node; by contrast, the compute node does not show any problems with the same configuration.

We are observing a continuous flow of the following message in /var/log/messages:

~~~
kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
~~~

Number of flows:

(working) ovs-dpctl dump-flows | wc -l
2509

(non-working) ovs-dpctl dump-flows  | wc -l
8634

(working) grep "drop recirc action" /var/log/messages | wc -l
0

(non-working) grep "drop recirc action" /var/log/messages | wc -l
225783
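
For comparison, it may also help to capture the upcall statistics on both nodes. A minimal sketch (run as root on each node so ovs-appctl can reach the local ovs-vswitchd):

~~~
# Datapath flow counts plus the handler/revalidator thread pools in use.
ovs-appctl upcall/show

# Datapath-level statistics: upcalls hit/missed/lost and current flow count.
ovs-dpctl show
~~~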


Version-Release number of selected component (if applicable):

16.2.2

How reproducible:

# perf top -g -p $(pidof ovs-vswitchd)

Most of the time (~90%) in all handlerX threads is spent in native_queued_spin_lock_slowpath.
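
To make the comparison between nodes easier to share, a short recorded profile can be captured as well. A minimal sketch (the 30-second window and output file name are arbitrary):

~~~
# Record ~30s of call graphs for all ovs-vswitchd threads on the affected node.
perf record -g -p "$(pidof ovs-vswitchd)" -o ovs-vswitchd.perf.data -- sleep 30

# Summarize where the time goes; native_queued_spin_lock_slowpath dominates here.
perf report -i ovs-vswitchd.perf.data --stdio | head -n 50
~~~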

Actual results:

The total CPU utilization of ovs-vswitchd is ~600%, with peaks of 1500%.

Expected results:

The total CPU utilization of ovs-vswitchd is <100%.

Additional info:

We are not sure whether the two are related, but first we would like to understand why

~~~
kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
~~~

keeps appearing in /var/log/messages.

Secondly, we would like to understand why ovs-vswitchd spends so much CPU time in native_queued_spin_lock_slowpath.

Would lowering the number of handler and revalidator threads (other_config:n-handler-threads and other_config:n-revalidator-threads) help? A sketch of the relevant knobs is included at the end of this comment.

Note, however, that both nodes have a high number of CPU cores (>160).
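
If we do experiment with the thread counts, this is a minimal sketch of the knobs involved (the values below are placeholders, not recommendations):

~~~
# Current settings (if unset, ovs-vswitchd sizes the thread pools from the CPU count).
ovs-vsctl get Open_vSwitch . other_config

# Example: cap the handler and revalidator pools instead of scaling them to >160 cores.
ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=8
ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=4
~~~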

Comment 2 ldenny 2023-06-14 03:47:28 UTC
One thing I would like to mention that we discovered during troubleshooting: this environment does not have OVN DVR enabled, so a lot of the traffic coming from the compute nodes has to be routed through the controller nodes. That may explain the difference in ovs-vswitchd CPU usage between the controller and compute nodes, but we're unsure.

This upstream bug [1] looks quite similar, but we don't see any `blocked 1000 ms waiting for revalidator127 to quiesce` messages (a quick check for these is sketched below).

[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
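
A minimal sketch of that check, assuming ovs-vswitchd logs to the usual location on these nodes (the path may differ if logging is redirected):

~~~
# Count revalidator "quiesce" warnings like the ones described in the upstream bug.
grep -c "waiting for .* to quiesce" /var/log/openvswitch/ovs-vswitchd.log
~~~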

Comment 4 Rodolfo Alonso 2023-06-23 09:56:29 UTC
Hello:

In the Neutron team we initially thought this issue was related to router HA (VRRP) traffic, but this environment uses OVN, so that is not the problem.

Investigating a bit, I found that this problem could be related to an outdated glibc library. According to the upstream bugs [1][2][3], the issue was fixed in [4], with target milestone glibc 2.29. The version installed in an OSP 16.2 deployment, which uses RHEL 8.4, is glibc-2.28-151.el8.x86_64 (see the version check sketched after the references).

Regards.

[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
[2] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1839592
[3] https://github.com/openvswitch/ovs-issues/issues/175
[4] https://sourceware.org/bugzilla/show_bug.cgi?id=23861
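
A minimal sketch of that version check, plus a heuristic look for a backported rwlock fix in the RHEL package (the grep pattern is only a guess at the changelog wording):

~~~
# Installed glibc on the node (expected: glibc-2.28-151.el8 on RHEL 8.4).
rpm -q glibc

# Look for any backport of the upstream pthread rwlock fix in the package changelog.
rpm -q --changelog glibc | grep -i -m 5 rwlock
~~~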

Comment 7 ldenny 2023-06-27 03:13:32 UTC
Hi Rodolfo, 

Is the glibc version something we can get updated in RHEL?

I'm glad you agree this seems to be the issue, but I'm not sure how we can prove it. Are we able to compile a test version of OVS against the updated glibc?

There is a reproducer program linked here [1] that I will test in my RHOSP 16.2 lab. If we can get a test build of OVS, I could install or compile it in the lab and see whether it at least fixes the reproducer.


Not sure who to set needinfo on, sorry, so just being cautious.
 
Thanks!

Comment 9 Rodolfo Alonso 2023-06-27 08:18:14 UTC
Hi Lewis:

Sorry, I was expecting you to know the answer to this question. I guess that if there is a bug in a core system library, we can update it. In any case, if you can compile and test the new glibc version and prove that it fixes the issue in OVS, we can ask the kernel folks to push the fix.

Regards.

Comment 10 ldenny 2023-06-27 09:46:47 UTC
Okay cool, 

We will test compiling OVS against glibc 2.29 and report back. If we can prove it resolves the issue, we will have a good argument for the kernel folks to update it.

Cheers!

Comment 11 ldenny 2023-07-04 01:23:49 UTC
After looking into this some more, I've come to the conclusion that it's not really possible to test with glibc 2.29.

Firstly, it's my understanding that OVS consumes libc as a dynamic library, so recompiling OVS won't be necessary (a quick check of this is sketched below); however, updating libc on the host is not as straightforward or safe as I assumed [1].
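
A minimal check of that assumption on the affected node (the binary path is the usual RHEL location and may differ):

~~~
# ovs-vswitchd should resolve libc/libpthread dynamically from the host glibc.
ldd /usr/sbin/ovs-vswitchd | grep -E 'libc\.|libpthread'

# Which package provides the resolved C library.
rpm -qf /lib64/libc.so.6
~~~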

I can't reproduce the issue in my lab, and I can't recommend that the customer attempt to update libc in their production environment.

Are we able to get some other ideas from the OVS team? 

One solution would be deploying RHOSP 17, which shouldn't have this issue since it is based on RHEL 9 with glibc 2.34.

[1] https://access.redhat.com/discussions/3244811#comment-2024011

