Bug 2214885

Summary: [16.2.2] high ovs-vswitchd CPU usage on controller (most spent in native_queued_spin_lock_slowpath)
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Robin Cernin <rcernin>
Component: openvswitch
Sub component: other
Assignee: Timothy Redaelli <tredaelli>
QA Contact: qding
Docs Contact:
Status: NEW
Severity: medium
Priority: medium
CC: apevec, casantos, chrisbro, chrisw, ctrautma, hakhande, jappleii, ldenny, qding, ralonsoh, scohen, tredaelli
Version: RHEL 8.0
Flags: ldenny: needinfo? (tredaelli), rcernin: needinfo-
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Robin Cernin 2023-06-14 03:26:05 UTC
Description of problem:

Note: when we refer to the non-working node, we mean the node on which we observe the high CPU usage.

That is the controller node; by contrast, the compute node with the same configuration does not show any problems.

We are observing a continuous flow of the following message in /var/log/messages:

~~~
kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
~~~

Number of flows:

(working) ovs-dpctl dump-flows | wc -l
2509

(non-working) ovs-dpctl dump-flows  | wc -l
8634

(working) grep "drop recirc action" /var/log/messages | wc -l
0

(non-working) grep "drop recirc action" /var/log/messages | wc -l
225783
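
For tracking this over time on both nodes, a small loop along these lines can correlate the datapath flow count with the message count (a rough sketch built from the same commands as above; the 10-second interval is an arbitrary choice):

~~~
# Rough monitoring sketch (not verified on this environment):
# every 10 seconds, print the datapath flow count and the cumulative
# number of "drop recirc action" messages seen so far.
while true; do
    flows=$(ovs-dpctl dump-flows | wc -l)
    drops=$(grep -c "drop recirc action" /var/log/messages)
    echo "$(date +%T) flows=${flows} recirc_drops=${drops}"
    sleep 10
done
~~~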


Version-Release number of selected component (if applicable):

16.2.2

How reproducible:

# perf top -g -p $(pidof ovs-vswitchd)

We can see that most of the time (~90%) in all handlerX threads is spent in native_queued_spin_lock_slowpath.
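
To capture the same data for offline analysis rather than watching perf top live, something like this should work (the 30-second window is an arbitrary choice):

~~~
# Record call graphs for ovs-vswitchd for ~30 seconds, then summarize.
perf record -g -p $(pidof ovs-vswitchd) -- sleep 30
perf report --stdio | head -50
~~~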

Actual results:

The total CPU usage of ovs-vswitchd is ~600%, with peaks of 1500%.

Expected results:

The total CPU usage of ovs-vswitchd should be below 100%.

Additional info:

We are not sure whether the two symptoms are related, but first we would like to understand why:

~~~
kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
~~~

is appearing in the messages.

Secondly, regarding native_queued_spin_lock_slowpath: we would like to understand why ovs-vswitchd spends so much CPU time in it.

Would lowering the number of revalidator and handler threads help?

Note, however, that both nodes have a high number of CPU cores (>160).
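
If we do want to experiment with lowering the thread counts, the usual OVS knobs are other_config:n-handler-threads and other_config:n-revalidator-threads; a sketch follows (the values below are arbitrary, only meant to illustrate capping the counts rather than letting them scale with the >160 cores):

~~~
# Show the current other_config map (missing keys mean OVS sizes the
# thread pools from the number of CPU cores).
ovs-vsctl get Open_vSwitch . other_config

# Illustrative values only -- would need to be validated for this environment.
ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=8
ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=4
~~~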

Comment 2 ldenny 2023-06-14 03:47:28 UTC
One thing worth mentioning that we discovered during troubleshooting: this environment does not have OVN DVR enabled, so a lot of the traffic coming from the compute nodes has to be routed through the controller nodes. That may explain the difference in ovs-vswitchd CPU usage between the controller and compute nodes, but we're not sure.

This upstream bug [1] looks like a close match, but we don't see any `blocked 1000 ms waiting for revalidator127 to quiesce` messages.

[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
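
A quick way to double-check for those messages, assuming the default ovs-vswitchd log location:

~~~
# Count any revalidator "quiesce" warnings like the ones in the upstream bug.
grep -c "to quiesce" /var/log/openvswitch/ovs-vswitchd.log
~~~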

Comment 4 Rodolfo Alonso 2023-06-23 09:56:29 UTC
Hello:

In the Neutron team we initially thought that this issue was related to router HA VRRP traffic, but this environment is using OVN, so that is not the problem.

Investigating a bit, I found that this problem could be related to an outdated glibc library. According to the upstream bugs [1][2][3], the issue was fixed in [4], target milestone 2.29. The version installed in an OSP16.2 deployment, which uses RHEL 8.4, is glibc-2.28-151.el8.x86_64.

Regards.

[1]https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
[2]https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1839592
[3]https://github.com/openvswitch/ovs-issues/issues/175
[4]https://sourceware.org/bugzilla/show_bug.cgi?id=23861
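
For what it's worth, confirming the glibc actually installed on the affected node is straightforward (standard RHEL commands, nothing specific to this deployment):

~~~
# glibc package installed on the node
rpm -q glibc

# runtime version reported by the library itself
/lib64/libc.so.6 --version | head -1
~~~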

Comment 7 ldenny 2023-06-27 03:13:32 UTC
Hi Rodolfo, 

Is the glibc version something we can get updated in RHEL?

I'm glad you agree this seems to be the issue, but I'm not sure how we can prove it. Are we able to compile a test version of OVS with the updated glibc?

There is a reproducer program linked here [1] that I will test in my RHOSP16.2 lab. If we can get a test version of OVS, I could install or compile it in the lab and at least see whether it fixes the reproducer program.


Not sure who to set needinfo on, sorry, so just being cautious.
 
Thanks!

Comment 9 Rodolfo Alonso 2023-06-27 08:18:14 UTC
Hi Lewis:

Sorry, I was expecting you to know the answer to this question. I guess that if there is a bug in a kernel library, we can update it. In any case, if you can compile and test this new glibc version and prove that it fixes the issue in OVS, we can ask the kernel folks to push the fix.

Regards.

Comment 10 ldenny 2023-06-27 09:46:47 UTC
Okay, cool.

We will test compiling OVS against glibc 2.29 and report back. If we can prove it resolves the issue, we will have a good argument for the kernel folks to update it.

Cheers!

Comment 11 ldenny 2023-07-04 01:23:49 UTC
After looking into this some more, I've come to the conclusion that it's not really possible to test with glibc 2.29.

Firstly, it's my understanding that OVS consumes libc as a dynamic library, so recompiling OVS won't be necessary; and updating libc on the host is not as straightforward or safe as I assumed [1].
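
(The dynamic-linking assumption above can be confirmed with something like the following; the binary path is the usual RHEL default:)

~~~
# Show which libc ovs-vswitchd is dynamically linked against.
ldd /usr/sbin/ovs-vswitchd | grep libc
~~~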

I can't reproduce the issue in my lab, and I can't recommend that the customer attempt to update libc in their production environment.

Are we able to get some other ideas from the OVS team? 

One solution would be deploying RHOSP17, which shouldn't have this issue, as it has been updated to RHEL 9 and glibc 2.34.

[1] https://access.redhat.com/discussions/3244811#comment-2024011