Bug 2107020 - openvswitch2.15 slowness is observed on the openshift nodes
Summary: openvswitch2.15 slowness is observed on the openshift nodes
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: ---
Assignee: Mike Pattrick
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-07-14 07:12 UTC by Vinya Nema
Modified: 2022-10-19 13:25 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-12 14:00:42 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Vinya Nema 2022-07-14 07:12:05 UTC
The logs reveal that OVS is really high CPU on this one node. It looks like after reboot there are a ton of stale ovs interfaces which take 10s of minutes to clean up, at that point ports start binding, but there is overall slowness happening with OVS on that node. Messages like this from the OVS journal stand out:



Jul 13 16:06:24 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00115|ofproto_dpif_upcall(revalidator85)|INFO|Spent an unreasonably long 39112ms dumping flows
Jul 13 16:06:24 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00116|ofproto_dpif_upcall(revalidator85)|WARN|Number of datapath flows (10515) twice as high as current dynamic flow limit (8050).  Starting to delete flows unconditionally as an emergency measure.


Jul 13 16:10:16 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00396|ovs_rcu(urcu4)|WARN|blocked 1001 ms waiting for main to quiesce
Jul 13 16:10:17 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00397|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for main to quiesce
Jul 13 16:10:19 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00398|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for main to quiesce
Jul 13 16:10:23 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00399|ovs_rcu(urcu4)|WARN|blocked 8000 ms waiting for main to quiesce
Jul 13 16:10:31 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00400|ovs_rcu(urcu4)|WARN|blocked 16000 ms waiting for main to quiesce
Jul 13 16:10:47 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00401|ovs_rcu(urcu4)|WARN|blocked 32000 ms waiting for main to quiesce

Jul 13 16:12:08 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|65580|timeval|WARN|Unreasonably long 112909ms poll interval (7ms user, 95ms system)

openvsw+    3671 86.6  0.1 7854036 614884 ?      -    Jul12 1813:36 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach



sosreport: https://access.redhat.com/support/cases/#/case/03263433/discussion?attachmentId=a096R00002pWQ82QAG

Comment 1 Aaron Conole 2022-07-14 13:56:24 UTC
Is it possible that we don't have the pull request from https://github.com/openshift/ovn-kubernetes/pull/914 applied?  This smells a little bit like https://bugzilla.redhat.com/show_bug.cgi?id=1986216 but I am not 100% positive.  Especially because I don't see why cleanup should take 10s of minutes?

Comment 4 Dan Williams 2022-07-14 14:48:10 UTC
(In reply to Aaron Conole from comment #1)
> Is it possible that we don't have the pull request from
> https://github.com/openshift/ovn-kubernetes/pull/914 applied?  This smells a
> little bit like https://bugzilla.redhat.com/show_bug.cgi?id=1986216 but I am
> not 100% positive.  Especially because I don't see why cleanup should take
> 10s of minutes?

4.8.z does not have #914 yet; it was only backported down to 4.9.

But wouldn't it take a lot of hard reboots without waiting to drain the node of pods before we'd build up enough stale interfaces in the OVS DB to cause these problems?

Comment 16 Eelco Chaudron 2022-07-20 10:15:43 UTC
As Aaron is out this week, let me add some notes:

Some feedback on #1, looking at the port information I see you have 476 interfaces.
Is this what is expected in this environment, or are there a lot of stale entries?


Looking at the log messages the following sequence:

2022-07-13T16:04:28.390Z|00379|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for main to quiesce
2022-07-13T16:04:29.390Z|00380|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for main to quiesce
2022-07-13T16:04:31.390Z|00381|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for main to quiesce
2022-07-13T16:04:35.390Z|00382|ovs_rcu(urcu4)|WARN|blocked 8000 ms waiting for main to quiesce
2022-07-13T16:04:43.390Z|00383|ovs_rcu(urcu4)|WARN|blocked 16001 ms waiting for main to quiesce
2022-07-13T16:04:59.390Z|00384|ovs_rcu(urcu4)|WARN|blocked 32000 ms waiting for main to quiesce
2022-07-13T16:05:31.390Z|00385|ovs_rcu(urcu4)|WARN|blocked 64000 ms waiting for main to quiesce

I've seen this in the past, and in general, this is due to slow response from syscalls.
So if this happens again we should try to see what is going on with the main thread.

The same might be true for the other thread's "Unreasonably long" log messages:

2022-07-13T16:06:24.740Z|00015|timeval(handler32)|WARN|Unreasonably long 2147ms poll interval (0ms user, 6ms system)
2022-07-13T16:06:24.740Z|00024|timeval(handler14)|WARN|Unreasonably long 2146ms poll interval (0ms user, 6ms system)
2022-07-13T16:06:24.740Z|00011|timeval(handler20)|WARN|Unreasonably long 2146ms poll interval (0ms user, 6ms system)
2022-07-13T16:06:24.740Z|00012|timeval(handler26)|WARN|Unreasonably long 3307ms poll interval (0ms user, 7ms system)
2022-07-13T16:06:24.740Z|00014|timeval(handler44)|WARN|Unreasonably long 2148ms poll interval (0ms user, 6ms system)
2022-07-13T16:06:24.740Z|00014|timeval(handler16)|WARN|Unreasonably long 3243ms poll interval (2ms user, 5ms system)

I would like to get the following at the time the "blocked xxx ms waiting for main to quiesce" is happening:

- date
- pidstat -t -p ALL 1 1
- for i in {1..5}; do echo "RUN $i"; more /proc/$(pidof ovs-vswitchd)/task/*/syscall | cat; sleep 0.5; done
- core dump of ovs-vswitchd
- Fresh sos report after the core dump

As a reference take a look at BZ1668554 where a similar issue was troubleshot.

Comment 17 Flavio Leitner 2022-07-22 15:16:58 UTC
Besides the request in comment#16, maybe the stale interfaces are causing a loop in the traffic or something else.

I wonder if it is possible to start ovsdb but not ovs-vswitchd and then delete all unavailable ports using
'ovs-vsctl --no-wait del-port <stale port>' and then see if the node works normally after starting ovs-vswitchd?
The idea is to rule out the stale interfaces.

Another thing calling my attention is this log entry due to a IGMP traffic, which may indicate a problem processing multicast traffic.

Jul 13 16:03:13 <host> ovs-vswitchd[3671]: ovs|65545|dpif|WARN|system@ovs-system: execute ct(zone=151,nat),recirc(0x1461a61),ct(zone=45,nat),recirc(0x1461a62),ct(zone=131,nat),recirc(0x1461a63),ct(zone=176,nat),recirc(0x1461a64),ct(zone=123,nat),recirc(0x1461a65),ct(zone=94,nat),recirc(0x1461a66),ct(zone=22,nat),recirc(0x1461a67),ct(zone=116,nat),recirc(0x1461a68),ct(zone=77,nat),recirc(0x1461a69),ct(zone=80,nat),recirc(0x1461a6a),ct(zone=62,nat),recirc(0x1461a6b),ct(zone=165,nat),recirc(0x1461a6c),ct(zone=252,nat),recirc(0x1461a6d),ct(zone=154,nat),recirc(0x1461a6e),ct(zone=231,nat),recirc(0x1461a6f),ct(zone=184,nat),recirc(0x1461a70),ct(zone=244,nat),recirc(0x1461a71),ct(zone=236,nat),recirc(0x1461a72),ct(zone=173,nat),recirc(0x1461a73),ct(zone=241,nat),recirc(0x1461a74),ct(zone=201,nat),recirc(0x1461a75),ct(zone=208,nat),recirc(0x1461a76),ct(zone=203,nat),recirc(0x1461a77),ct(zone=67,nat),recirc(0x1461a78),ct(zone=228,nat),recirc(0x1461a79),ct(zone=43,nat),recirc(0x1461a7a),ct(zone=136,nat),recirc(0x1461a7b),ct(zone=127,nat),recirc(0x1461a7c),ct(zone=61,nat),recirc(0x1461a7d),ct(zone=166,nat),recirc(0x1461a7e),ct(zone=216,nat),recirc(0x1461a7f),ct(zone=129,nat),recirc(0x1461a80),ct(zone=84,nat),recirc(0x1461a81),ct(zone=70,nat),recirc(0x1461a82),ct(zone=75,nat),recirc(0x1461a83),ct(zone=172,nat),recirc(0x1461a84),ct(zone=192,nat),recirc(0x1461a85),ct(zone=199,nat),recirc(0x1461a86),ct(zone=249,nat),recirc(0x1461a87),ct(zone=230,nat),recirc(0x1461a88),ct(zone=41,nat),recirc(0x1461a89),ct(zone=104,nat),recirc(0x1461a8a),ct(zone=107,nat),recirc(0x1461a8b),ct(zone=164,nat),recirc(0x1461a8c),ct(zone=72,nat),recirc(0x1461a8d),ct(zone=233,nat),recirc(0x1461a8e),ct(zone=175,nat),recirc(0x1461a8f),ct(zone=218,nat),recirc(0x1461a90),ct(zone=158,nat),recirc(0x1461a91),ct(zone=156,nat),recirc(0x1461a92),ct(zone=117,nat),recirc(0x1461a93),ct(zone=179,nat),recirc(0x1461a94),ct(zone=186,nat),recirc(0x1461a95),ct(zone=169,nat),recirc(0x1461a96),ct(zone=159,nat),recirc(0x1461a97),ct(zone=31,nat),recirc(0x1461a98),ct(zone=134,nat),recirc(0x1461a99),ct(zone=222,nat),recirc(0x1461a9a),ct(zone=213,nat),recirc(0x1461a9b),ct(zone=227,nat),recirc(0x1461a9c),ct(zone=30,nat),recirc(0x1461a9d),ct(zone=58,nat),recirc(0x1461a9e),ct(zone=153,nat),recirc(0x1461a9f),ct(zone=97,nat),recirc(0x1461aa0),ct(zone=193,nat),recirc(0x1461aa1),ct(zone=64,nat),recirc(0x1461aa2),ct(zone=125,nat),recirc(0x1461aa3),ct(zone=73,nat),recirc(0x1461aa4),ct(zone=237,nat),recirc(0x1461aa5),ct(zone=168,nat),recirc(0x1461aa6),ct(zone=204,nat),recirc(0x1461aa7),ct(zone=86,nat),recirc(0x1461aa8),ct(zone=210,nat),recirc(0x1461aa9),ct(zone=79,nat),recirc(0x1461aaa),ct(zone=115,nat),recirc(0x1461aab),ct(zone=178,nat),recirc(0x1461aac),ct(zone=51,nat),recirc(0x1461aad),ct(zone=148,nat),recirc(0x1461aae),ct(zone=42,nat),recirc(0x1461aaf),ct(zone=91,nat),recirc(0x1461ab0),ct(zone=122,nat),recirc(0x1461ab1),ct(zone=33,nat),recirc(0x1461ab2),ct(zone=177,nat),recirc(0x1461ab3),ct(zone=211,nat),recirc(0x1461ab4),ct(zone=78,nat),recirc(0x1461ab5),ct(zone=152,nat),recirc(0x1461ab6),ct(zone=251,nat),recirc(0x1461ab7),ct(zone=118,nat),recirc(0x1461ab8),ct(zone=38,nat),recirc(0x1461ab9),ct(zone=195,nat),recirc(0x1461aba),ct(zone=110,nat),recirc(0x1461abb),ct(zone=161,nat),recirc(0x1461abc),ct(zone=56,nat),recirc(0x1461abd),ct(zone=133,nat),recirc(0x1461abe),ct(zone=250,nat),recirc(0x1461abf),ct(zone=28,nat),recirc(0x1461ac0),ct(zone=243,nat),recirc(0x1461ac1),ct(zone=55,nat),recirc(0x1461ac2),ct(zone=246,nat),recirc(0x1461ac3),ct(zone=71,nat),recirc(0x1461ac4),ct(zone=183,nat),recirc(0x1461ac5),ct(zone=113,nat),recirc(0x1461ac6),ct(zone=191,nat),recirc(0x1461ac7),ct(zone=205,nat),recirc(0x1461ac8),ct(zone=143,nat),recirc(0x1461ac9),ct(zone=235,nat),recirc(0x1461aca),ct(zone=69,nat),recirc(0x1461acb),ct(zone=232,nat),recirc(0x1461acc),ct(zone=182,nat),recirc(0x1461acd),ct(zone=194,nat),recirc(0x1461ace),ct(zone=114,nat),recirc(0x1461acf),ct(zone=242,nat),recirc(0x1461ad0),ct(zone=76,nat),recirc(0x1461ad1),ct(zone=163,nat),recirc(0x1461ad2),ct(zone=128,nat),recirc(0x1461ad3),ct(zone=223,nat),recirc(0x1461ad4),ct(zone=24,nat),recirc(0x1461ad5),ct(zone=217,nat),recirc(0x1461ad6),ct(zone=146,nat),recirc(0x1461ad7),ct(zone=44,nat),recirc(0x1461ad8),ct(zone=35,nat),recirc(0x1461ad9),ct(zone=130,nat),recirc(0x1461ada),ct(zone=181,nat),recirc(0x1461adb),ct(zone=124,nat),recirc(0x1461adc),ct(zone=46,nat),recirc(0x1461add),ct(zone=37,nat),recirc(0x1461ade),ct(zone=111,nat),recirc(0x1461adf),ct(zone=234,nat),recirc(0x1461ae0),ct(zone=187,nat),recirc(0x1461ae1),ct(zone=215,nat),recirc(0x1461ae2),ct(zone=226,nat),recirc(0x1461ae3),ct(zone=149,nat),recirc(0x1461ae4),ct(zone=144,nat),recirc(0x1461ae5),ct(zone=167,nat),recirc(0x1461ae6),ct(zone=82,nat),recirc(0x1461ae7),ct(zone=99,nat),recirc(0x1461ae8),ct(zone=214,nat),recirc(0x1461ae9),ct(zone=150,nat),recirc(0x1461aea),ct(zone=174,nat),recirc(0x1461aeb),ct(zone=81,nat),recirc(0x1461aec),ct(zone=90,nat),recirc(0x1461aed),ct(zone=180,nat),recirc(0x1461aee),ct(zone=247,nat),recirc(0x1461aef),ct(zone=202,nat),recirc(0x1461af0),ct(zone=63,nat),recirc(0x1461af1),ct(zone=52,nat),recirc(0x1461af2),ct(zone=137,nat),recirc(0x1461af3),ct(zone=120,nat),recirc(0x1461af4),ct(zone=198,nat),recirc(0x1461af5),ct(zone=87,nat),recirc(0x1461af6),ct(zone=23,nat),recirc(0x1461af7),ct(zone=157,nat),recirc(0x1461af8),ct(zone=200,nat),recirc(0x1461af9),ct(zone=100,nat),recirc(0x1461afa),ct(zone=65,nat),recirc(0x1461afb),ct(zone=209,nat),recirc(0x1461afc),ct(zone=225,nat),recirc(0x1461afd),ct(zone=49,nat),recirc(0x1461afe),ct(zone=60,nat),recirc(0x1461aff),ct(zone=103,nat),recirc(0x1461b00),ct(zone=106,nat),recirc(0x1461b01),ct(zone=83,nat),recirc(0x1461b02),ct(zone=141,nat),recirc(0x1461b03),ct(zone=53,nat),recirc(0x1461b04),ct(zone=220,nat),recirc(0x1461b05),ct(zone=95,nat),recirc(0x1461b06),ct(zone=68,nat),recirc(0x1461b07),ct(zone=147,nat),recirc(0x1461b08),ct(zone=108,nat),recirc(0x1461b09),ct(zone=162,nat),recirc(0x1461b0a),ct(zone=121,nat),recirc(0x1461b0b),ct(zone=27,nat),recirc(0x1461b0c),ct(zone=190,nat),recirc(0x1461b0d),ct(zone=32,nat),recirc(0x1461b0e),ct(zone=92,nat),recirc(0x1461b0f),ct(zone=239,nat),recirc(0x1461b10),ct(zone=170,nat),recirc(0x1461b11),ct(zone=29,nat),recirc(0x1461b12),ct(zone=132,nat),recirc(0x1461b13),ct(zone=96,nat),recirc(0x1461b14),ct(zone=109,nat),recirc(0x1461b15),ct(zone=93,nat),recirc(0x1461b16),ct(zone=224,nat),recirc(0x1461b17),ct(zone=185,nat),recirc(0x1461b18),ct(zone=142,nat),recirc(0x1461b19),ct(zone=253,nat),recirc(0x1461b1a),ct(zone=206,nat),recirc(0x1461b1b),ct(zone=135,nat),recirc(0x1461b1c),ct(zone=88,nat),recirc(0x1461b1d),ct(zone=207,nat),recirc(0x1461b1e),ct(zone=140,nat),recirc(0x1461b1f),ct(zone=240,nat),recirc(0x1461b20),ct(zone=229,nat),recirc(0x1461b21),ct(zone=48,nat),recirc(0x1461b22),ct(zone=119,nat),recirc(0x1461b23),ct(zone=196,nat),recirc(0x1461b24),ct(zone=47,nat),recirc(0x1461b25),ct(zone=171,nat),recirc(0x1461b26),ct(zone=26,nat),recirc(0x1461b27),ct(zone=138,nat),recirc(0x1461b28),ct(zone=74,nat),recirc(0x1461b29),ct(zone=34,nat),recirc(0x1461b2a),ct(zone=112,nat),recirc(0x1461b2b),ct(zone=248,nat),recirc(0x1461b2c),ct(zone=188,nat),recirc(0x1461b2d),ct(zone=160,nat),recirc(0x1461b2e),ct(zone=85,nat),recirc(0x1461b2f),ct(zone=189,nat),recirc(0x1461b30),ct(zone=155,nat),recirc(0x1461b31),ct(zone=36,nat),recirc(0x1461b32),ct(zone=50,nat),recirc(0x1461b33),ct(zone=245,nat),recirc(0x1461b34),ct(zone=89,nat),recirc(0x1461b35),ct(zone=57,nat),recirc(0x1461b36),ct(zone=25,nat),recirc(0x1461b37),ct(zone=126,nat),recirc(0x1461b38),ct(zone=3,nat),recirc(0x1461b39),ct(zone=39,nat),recirc(0x1461b3a),ct(zone=219,nat),recirc(0x1461b3b),ct(zone=212,nat),recirc(0x1461b3c),ct(zone=197,nat),recirc(0x1461b3d),ct(zone=101,nat),recirc(0x1461b3e),ct(zone=59,nat),recirc(0x1461b3f),ct(zone=145,nat),recirc(0x1461b40),ct(zone=66,nat),recirc(0x1461b41),ct(zone=40,nat),recirc(0x1461b42),ct(zone=139,nat),recirc(0x1461b43),ct(zone=221,nat),recirc(0x1461b44),ct(zone=238,nat),recirc(0x1461b45),ct(zone=105,nat),recirc(0x1461b46),ct(zone=54,nat),recirc(0x1461b47)

failed (Message too long) on packet igmp,vlan_tci=0x0000,dl_src=0a:58:0a:82:00:01,dl_dst=01:00:5e:00:00:01,nw_src=10.130.0.1,nw_dst=224.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=17,igmp_code=10

Comment 20 Eelco Chaudron 2022-07-26 14:18:15 UTC
(In reply to Flavio Leitner from comment #17)
> 
> Another thing calling my attention is this log entry due to a IGMP traffic,
> which may indicate a problem processing multicast traffic.
... 
> failed (Message too long) on packet
> igmp,vlan_tci=0x0000,dl_src=0a:58:0a:82:00:01,dl_dst=01:00:5e:00:00:01,
> nw_src=10.130.0.1,nw_dst=224.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=17,
> igmp_code=10

I talked about this with Dumitru and they are aware of this, but should not cause this problem.


Note You need to log in before you can comment on or make changes to this bug.