Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.

Bug 2214885

Summary:	[16.2.2] high ovs-vswitchd CPU usage on controller (most spent in native_queued_spin_lock_slowpath)
Product:	Red Hat Enterprise Linux Fast Datapath	Reporter:	Robin Cernin <rcernin>
Component:	openvswitch	Assignee:	Timothy Redaelli <tredaelli>
openvswitch sub component:	other	QA Contact:	qding
Status:	CLOSED EOL	Docs Contact:
Severity:	medium
Priority:	medium	CC:	apevec, casantos, chrisbro, chrisw, ctrautma, dmaley, hakhande, jappleii, jmitterm, ldenny, qding, ralonsoh, scohen, tredaelli
Version:	RHEL 8.0	Flags:	rcernin: needinfo-
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-10-08 17:49:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Robin Cernin 2023-06-14 03:26:05 UTC

Description of problem:

Note: When we are referring to a non-working node, it is the node that we are observing high CPU usage.

It is the controller node, on the contrary the compute node doesn't show any problems with the same configuration.

We are observing continuous flow of the following messages in /var/log/messages:

~~~
kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
~~~

Number of flows:

(working) ovs-dpctl dump-flows  | wc -l                                  │itch: ovs-system: deferred action limit reached, drop recirc action
2509

(non-working) ovs-dpctl dump-flows  | wc -l
8634

(working) grep "drop recirc action" /var/log/messages | wc -l
0

(non-working) grep "drop recirc action" /var/log/messages | wc -l
225783


Version-Release number of selected component (if applicable):

16.2.2

How reproducible:

# perf top -g -p $(pidof ovs-vswitchd)

We can see most of the time: ~90% on all handlerX threads spent in  native_queued_spin_lock_slowpath

Actual results:

The total CPU usage util of ovs-vswitchd ~600% with peaks of 1500%

Expected results:

The total CPU usage util of ovs-vswitchd <100%

Additional info:

We are not sure whether the two are related, but first we would like to understand why:

~~~
kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
~~~

is appearing in the messages.

secondly the native_queued_spin_lock_slowpath, we would like to understand why the ovs-vswitchd spends so much CPU time in it.

Would lowering the number of n-threads for revalidator and handler help?

However both of the nodes are with high number of CPU cores >160.

Comment 2 ldenny 2023-06-14 03:47:28 UTC

One thing I would like to mention we discovered during the troubleshooting is this environment does not have OVN DVR enabled, so a lot of the traffic coming from the compute nodes needs to route through the controller nodes. Maybe that is the difference between the controller and compute ovs-vswitchd cpu usage but we're unsure.

This upstream bug looks quite accurate [1] but we don't see any `blocked 1000 ms waiting for revalidator127 to quiesce` messages.

[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264

Comment 4 Rodolfo Alonso 2023-06-23 09:56:29 UTC

Hello:

In the Neutron team we initially thought that this issue was related to the router HA VRRP traffic. But this environment is using OVN thus this is not the problem.

Investigating a bit I found that this problem could be related to an outdated glibc library. According to the U/S bugs [1][2][3], this issue was fixed in [4], target milestone 2.29. The version installed in a OSP16.2 deployment, using RHEL8.4, is glibc-2.28-151.el8.x86_64.

Regards.

[1]https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
[2]https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1839592
[3]https://github.com/openvswitch/ovs-issues/issues/175
[4]https://sourceware.org/bugzilla/show_bug.cgi?id=23861

Comment 7 ldenny 2023-06-27 03:13:32 UTC

Hi Rodolfo, 

Is the glibc version something we can get updated in RHEL?

I'm glad you agree this seems to be the issue but I'm not sure how we can prove this, are we able to compile a test version of OVS with the updated version?

There is a reproducer program linked here [1] I will test in my RHOSP16.2 lab, if we can get a test version of OVS I could install or compile in the lab and see if it fixes the reproducer program at least.


Not sure who to set needinfo on sorry so just being cautious.
 
Thanks!

Comment 8 ldenny 2023-06-27 03:14:06 UTC

[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1839592/comments/18

Comment 9 Rodolfo Alonso 2023-06-27 08:18:14 UTC

Hi Lewis:

Sorry, I was expecting you to know the answer to this question. I guess that if there is a bug in a kernel library, we can update it. In any case, if you can compile and test this new glibc version, proving that fixes the issue in OVS, we can call kernel folks to push this fix.

Regards.

Comment 10 ldenny 2023-06-27 09:46:47 UTC

Okay cool, 

We will test compiling ovs with the glibc 2.29 and report back, if we can prove it resolves the issue we will have good argument for the kernel folks to update.

Cheers!

Comment 11 ldenny 2023-07-04 01:23:49 UTC

After looking into this some more, I've come to the conclusion that it's not really possible to test with glibc 2.9. 

Firstly it's my understanding that OVS is consuming libc as a dynamic library so compiling OVS won't be necessary, and updating libc on the host is not as straight forward or safe as I assumed[1]. 

I can't reproduce the issue in my lab and I can't recommend the customer attempt to update libc in their production environment.

Are we able to get some other ideas from the OVS team? 

One solution would be deploying RHOSP17 which shouldn't have this issue as we've updated to RHEL9 and libc 2.34 

[1] https://access.redhat.com/discussions/3244811#comment-2024011

Comment 17 ovs-bot 2024-10-08 17:49:14 UTC

This bug did not meet the criteria for automatic migration and is being closed.
If the issue remains, please open a new ticket in https://issues.redhat.com/browse/FDP

Comment 19 Red Hat Bugzilla 2025-02-07 04:25:02 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days