Description of problem:

Note: when we refer to the "non-working" node, we mean the node on which we observe the high CPU usage. It is a controller node; the compute nodes, with the same configuration, do not show the problem.

We are observing a continuous flow of the following messages in /var/log/messages:
~~~
kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
~~~

Number of datapath flows:

(working) ovs-dpctl dump-flows | wc -l
2509

(non-working) ovs-dpctl dump-flows | wc -l
8634

(working) grep "drop recirc action" /var/log/messages | wc -l
0

(non-working) grep "drop recirc action" /var/log/messages | wc -l
225783

Version-Release number of selected component (if applicable):
16.2.2

How reproducible:
# perf top -g -p $(pidof ovs-vswitchd)
Most of the time (~90%) on all handlerX threads is spent in native_queued_spin_lock_slowpath.

Actual results:
Total CPU utilization of ovs-vswitchd is ~600%, with peaks of 1500%.

Expected results:
Total CPU utilization of ovs-vswitchd <100%.

Additional info:
We are not sure whether the two symptoms are related, but first we would like to understand why
~~~
kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
~~~
is appearing in the messages. Secondly, we would like to understand why ovs-vswitchd spends so much CPU time in native_queued_spin_lock_slowpath. Would lowering the number of revalidator and handler threads help? Both nodes have a high number of CPU cores (>160).
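For reference, a minimal sketch of how the handler/revalidator thread counts could be inspected and lowered on the node; the values below are only illustrative, not tested recommendations:
~~~
# Show current upcall handlers/revalidators, datapath flow count and flow limit
ovs-appctl upcall/show

# Show any thread-count overrides already set (empty output means OVS sizes the
# pools automatically from the number of CPU cores, which is large on a >160-core box)
ovs-vsctl --no-wait get Open_vSwitch . other_config

# Illustrative lower values; ovs-vswitchd recreates the thread pools on the fly
ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=16
ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=8
~~~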
One thing we discovered during troubleshooting is that this environment does not have OVN DVR enabled, so a lot of the traffic coming from the compute nodes has to be routed through the controller nodes. That may explain the difference in ovs-vswitchd CPU usage between the controller and compute nodes, but we're not sure. The upstream bug [1] looks like a close match, but we don't see any `blocked 1000 ms waiting for revalidator127 to quiesce` messages (the check we ran is sketched below). [1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
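For completeness, this is the kind of check we used; the log path assumes the default host install of OVS on OSP 16.2:
~~~
# Look for the revalidator/handler quiesce warnings described in the upstream bug [1]
grep -E "blocked [0-9]+ ms waiting for (handler|revalidator)[0-9]+ to quiesce" \
    /var/log/openvswitch/ovs-vswitchd.log
~~~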
Hello: In the Neutron team we initially thought that this issue was related to router HA VRRP traffic, but this environment is using OVN, so that is not the cause. Investigating a bit, I found that this problem could be related to an outdated glibc library. According to the upstream bugs [1][2][3], this issue was fixed by [4], target milestone glibc 2.29. The version installed in an OSP 16.2 deployment, on RHEL 8.4, is glibc-2.28-151.el8.x86_64. Regards.
[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
[2] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1839592
[3] https://github.com/openvswitch/ovs-issues/issues/175
[4] https://sourceware.org/bugzilla/show_bug.cgi?id=23861
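A quick way to check what a given host carries; note that the changelog keywords are an assumption on my part, the fix may be referenced differently in RHEL packaging:
~~~
# Installed glibc version (OSP 16.2 / RHEL 8.4 ships glibc-2.28-151.el8)
rpm -q glibc

# Look for a backport of the upstream rwlock fix in the package changelog
rpm -q --changelog glibc | grep -iE "rwlock|23861"
~~~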
Hi Rodolfo, Is the glibc version something we can get updated in RHEL? I'm glad you agree this seems to be the issue, but I'm not sure how we can prove it. Are we able to compile a test version of OVS against the updated glibc? There is a reproducer program linked here [1] that I will test in my RHOSP 16.2 lab (build/run sketch below); if we can get a test build of OVS, I could install or compile it in the lab and see whether it at least fixes the reproducer. Not sure who to set needinfo on, sorry, so just being cautious. Thanks!
[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1839592/comments/18
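In case it helps, this is how I plan to build and run the reproducer from [1]; rwlock_repro.c is just the name I will save the attachment as:
~~~
# Build the pthread rwlock reproducer attached to the Launchpad bug
gcc -O2 -pthread -o rwlock_repro rwlock_repro.c

# On an affected glibc (< 2.29) the threads are expected to stall;
# on a fixed glibc the program should run to completion
./rwlock_repro
~~~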
Hi Lewis: Sorry, I was expecting you to know the answer to this question. I guess that if there is a bug in a core system library, we can update it. In any case, if you can compile and test with this new glibc version and prove that it fixes the issue in OVS, we can ask the glibc folks to push the fix. Regards.
Okay, cool. We will test building OVS against glibc 2.29 and report back; if we can prove it resolves the issue, we will have a good argument for getting glibc updated. Cheers!
After looking into this some more, I've come to the conclusion that it's not really possible to test with glibc 2.29. Firstly, it's my understanding that OVS consumes libc as a dynamic library, so recompiling OVS won't be necessary (the check I used is sketched below); and updating libc on the host is not as straightforward or safe as I assumed [1]. I can't reproduce the issue in my lab, and I can't recommend that the customer attempt to update libc in their production environment. Are we able to get some other ideas from the OVS team? One solution would be deploying RHOSP 17, which shouldn't have this issue as it is based on RHEL 9 and glibc 2.34.
[1] https://access.redhat.com/discussions/3244811#comment-2024011
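For reference, this is how I confirmed that ovs-vswitchd consumes glibc dynamically, so the fix would have to come from the host glibc package rather than from a rebuild of OVS; the binary path assumes the default RHEL 8 install:
~~~
# ovs-vswitchd links libc/libpthread dynamically
ldd /usr/sbin/ovs-vswitchd | grep -E "libc|libpthread"
~~~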