The logs reveal that OVS is really high CPU on this one node. It looks like after reboot there are a ton of stale ovs interfaces which take 10s of minutes to clean up, at that point ports start binding, but there is overall slowness happening with OVS on that node. Messages like this from the OVS journal stand out: Jul 13 16:06:24 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00115|ofproto_dpif_upcall(revalidator85)|INFO|Spent an unreasonably long 39112ms dumping flows Jul 13 16:06:24 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00116|ofproto_dpif_upcall(revalidator85)|WARN|Number of datapath flows (10515) twice as high as current dynamic flow limit (8050). Starting to delete flows unconditionally as an emergency measure. Jul 13 16:10:16 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00396|ovs_rcu(urcu4)|WARN|blocked 1001 ms waiting for main to quiesce Jul 13 16:10:17 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00397|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for main to quiesce Jul 13 16:10:19 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00398|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for main to quiesce Jul 13 16:10:23 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00399|ovs_rcu(urcu4)|WARN|blocked 8000 ms waiting for main to quiesce Jul 13 16:10:31 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00400|ovs_rcu(urcu4)|WARN|blocked 16000 ms waiting for main to quiesce Jul 13 16:10:47 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|00401|ovs_rcu(urcu4)|WARN|blocked 32000 ms waiting for main to quiesce Jul 13 16:12:08 hpegencontrol3.hpedl360.dyn.nesc.nokia.net ovs-vswitchd[3671]: ovs|65580|timeval|WARN|Unreasonably long 112909ms poll interval (7ms user, 95ms system) openvsw+ 3671 86.6 0.1 7854036 614884 ? - Jul12 1813:36 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach sosreport: https://access.redhat.com/support/cases/#/case/03263433/discussion?attachmentId=a096R00002pWQ82QAG
Is it possible that we don't have the pull request from https://github.com/openshift/ovn-kubernetes/pull/914 applied? This smells a little bit like https://bugzilla.redhat.com/show_bug.cgi?id=1986216 but I am not 100% positive. Especially because I don't see why cleanup should take 10s of minutes?
(In reply to Aaron Conole from comment #1) > Is it possible that we don't have the pull request from > https://github.com/openshift/ovn-kubernetes/pull/914 applied? This smells a > little bit like https://bugzilla.redhat.com/show_bug.cgi?id=1986216 but I am > not 100% positive. Especially because I don't see why cleanup should take > 10s of minutes? 4.8.z does not have #914 yet; it was only backported down to 4.9. But wouldn't it take a lot of hard reboots without waiting to drain the node of pods before we'd build up enough stale interfaces in the OVS DB to cause these problems?
As Aaron is out this week, let me add some notes: Some feedback on #1, looking at the port information I see you have 476 interfaces. Is this what is expected in this environment, or are there a lot of stale entries? Looking at the log messages the following sequence: 2022-07-13T16:04:28.390Z|00379|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for main to quiesce 2022-07-13T16:04:29.390Z|00380|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for main to quiesce 2022-07-13T16:04:31.390Z|00381|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for main to quiesce 2022-07-13T16:04:35.390Z|00382|ovs_rcu(urcu4)|WARN|blocked 8000 ms waiting for main to quiesce 2022-07-13T16:04:43.390Z|00383|ovs_rcu(urcu4)|WARN|blocked 16001 ms waiting for main to quiesce 2022-07-13T16:04:59.390Z|00384|ovs_rcu(urcu4)|WARN|blocked 32000 ms waiting for main to quiesce 2022-07-13T16:05:31.390Z|00385|ovs_rcu(urcu4)|WARN|blocked 64000 ms waiting for main to quiesce I've seen this in the past, and in general, this is due to slow response from syscalls. So if this happens again we should try to see what is going on with the main thread. The same might be true for the other thread's "Unreasonably long" log messages: 2022-07-13T16:06:24.740Z|00015|timeval(handler32)|WARN|Unreasonably long 2147ms poll interval (0ms user, 6ms system) 2022-07-13T16:06:24.740Z|00024|timeval(handler14)|WARN|Unreasonably long 2146ms poll interval (0ms user, 6ms system) 2022-07-13T16:06:24.740Z|00011|timeval(handler20)|WARN|Unreasonably long 2146ms poll interval (0ms user, 6ms system) 2022-07-13T16:06:24.740Z|00012|timeval(handler26)|WARN|Unreasonably long 3307ms poll interval (0ms user, 7ms system) 2022-07-13T16:06:24.740Z|00014|timeval(handler44)|WARN|Unreasonably long 2148ms poll interval (0ms user, 6ms system) 2022-07-13T16:06:24.740Z|00014|timeval(handler16)|WARN|Unreasonably long 3243ms poll interval (2ms user, 5ms system) I would like to get the following at the time the "blocked xxx ms waiting for main to quiesce" is happening: - date - pidstat -t -p ALL 1 1 - for i in {1..5}; do echo "RUN $i"; more /proc/$(pidof ovs-vswitchd)/task/*/syscall | cat; sleep 0.5; done - core dump of ovs-vswitchd - Fresh sos report after the core dump As a reference take a look at BZ1668554 where a similar issue was troubleshot.
Besides the request in comment#16, maybe the stale interfaces are causing a loop in the traffic or something else. I wonder if it is possible to start ovsdb but not ovs-vswitchd and then delete all unavailable ports using 'ovs-vsctl --no-wait del-port <stale port>' and then see if the node works normally after starting ovs-vswitchd? The idea is to rule out the stale interfaces. Another thing calling my attention is this log entry due to a IGMP traffic, which may indicate a problem processing multicast traffic. Jul 13 16:03:13 <host> ovs-vswitchd[3671]: ovs|65545|dpif|WARN|system@ovs-system: execute ct(zone=151,nat),recirc(0x1461a61),ct(zone=45,nat),recirc(0x1461a62),ct(zone=131,nat),recirc(0x1461a63),ct(zone=176,nat),recirc(0x1461a64),ct(zone=123,nat),recirc(0x1461a65),ct(zone=94,nat),recirc(0x1461a66),ct(zone=22,nat),recirc(0x1461a67),ct(zone=116,nat),recirc(0x1461a68),ct(zone=77,nat),recirc(0x1461a69),ct(zone=80,nat),recirc(0x1461a6a),ct(zone=62,nat),recirc(0x1461a6b),ct(zone=165,nat),recirc(0x1461a6c),ct(zone=252,nat),recirc(0x1461a6d),ct(zone=154,nat),recirc(0x1461a6e),ct(zone=231,nat),recirc(0x1461a6f),ct(zone=184,nat),recirc(0x1461a70),ct(zone=244,nat),recirc(0x1461a71),ct(zone=236,nat),recirc(0x1461a72),ct(zone=173,nat),recirc(0x1461a73),ct(zone=241,nat),recirc(0x1461a74),ct(zone=201,nat),recirc(0x1461a75),ct(zone=208,nat),recirc(0x1461a76),ct(zone=203,nat),recirc(0x1461a77),ct(zone=67,nat),recirc(0x1461a78),ct(zone=228,nat),recirc(0x1461a79),ct(zone=43,nat),recirc(0x1461a7a),ct(zone=136,nat),recirc(0x1461a7b),ct(zone=127,nat),recirc(0x1461a7c),ct(zone=61,nat),recirc(0x1461a7d),ct(zone=166,nat),recirc(0x1461a7e),ct(zone=216,nat),recirc(0x1461a7f),ct(zone=129,nat),recirc(0x1461a80),ct(zone=84,nat),recirc(0x1461a81),ct(zone=70,nat),recirc(0x1461a82),ct(zone=75,nat),recirc(0x1461a83),ct(zone=172,nat),recirc(0x1461a84),ct(zone=192,nat),recirc(0x1461a85),ct(zone=199,nat),recirc(0x1461a86),ct(zone=249,nat),recirc(0x1461a87),ct(zone=230,nat),recirc(0x1461a88),ct(zone=41,nat),recirc(0x1461a89),ct(zone=104,nat),recirc(0x1461a8a),ct(zone=107,nat),recirc(0x1461a8b),ct(zone=164,nat),recirc(0x1461a8c),ct(zone=72,nat),recirc(0x1461a8d),ct(zone=233,nat),recirc(0x1461a8e),ct(zone=175,nat),recirc(0x1461a8f),ct(zone=218,nat),recirc(0x1461a90),ct(zone=158,nat),recirc(0x1461a91),ct(zone=156,nat),recirc(0x1461a92),ct(zone=117,nat),recirc(0x1461a93),ct(zone=179,nat),recirc(0x1461a94),ct(zone=186,nat),recirc(0x1461a95),ct(zone=169,nat),recirc(0x1461a96),ct(zone=159,nat),recirc(0x1461a97),ct(zone=31,nat),recirc(0x1461a98),ct(zone=134,nat),recirc(0x1461a99),ct(zone=222,nat),recirc(0x1461a9a),ct(zone=213,nat),recirc(0x1461a9b),ct(zone=227,nat),recirc(0x1461a9c),ct(zone=30,nat),recirc(0x1461a9d),ct(zone=58,nat),recirc(0x1461a9e),ct(zone=153,nat),recirc(0x1461a9f),ct(zone=97,nat),recirc(0x1461aa0),ct(zone=193,nat),recirc(0x1461aa1),ct(zone=64,nat),recirc(0x1461aa2),ct(zone=125,nat),recirc(0x1461aa3),ct(zone=73,nat),recirc(0x1461aa4),ct(zone=237,nat),recirc(0x1461aa5),ct(zone=168,nat),recirc(0x1461aa6),ct(zone=204,nat),recirc(0x1461aa7),ct(zone=86,nat),recirc(0x1461aa8),ct(zone=210,nat),recirc(0x1461aa9),ct(zone=79,nat),recirc(0x1461aaa),ct(zone=115,nat),recirc(0x1461aab),ct(zone=178,nat),recirc(0x1461aac),ct(zone=51,nat),recirc(0x1461aad),ct(zone=148,nat),recirc(0x1461aae),ct(zone=42,nat),recirc(0x1461aaf),ct(zone=91,nat),recirc(0x1461ab0),ct(zone=122,nat),recirc(0x1461ab1),ct(zone=33,nat),recirc(0x1461ab2),ct(zone=177,nat),recirc(0x1461ab3),ct(zone=211,nat),recirc(0x1461ab4),ct(zone=78,nat),recirc(0x1461ab5),ct(zone=152,nat),recirc(0x1461ab6),ct(zone=251,nat),recirc(0x1461ab7),ct(zone=118,nat),recirc(0x1461ab8),ct(zone=38,nat),recirc(0x1461ab9),ct(zone=195,nat),recirc(0x1461aba),ct(zone=110,nat),recirc(0x1461abb),ct(zone=161,nat),recirc(0x1461abc),ct(zone=56,nat),recirc(0x1461abd),ct(zone=133,nat),recirc(0x1461abe),ct(zone=250,nat),recirc(0x1461abf),ct(zone=28,nat),recirc(0x1461ac0),ct(zone=243,nat),recirc(0x1461ac1),ct(zone=55,nat),recirc(0x1461ac2),ct(zone=246,nat),recirc(0x1461ac3),ct(zone=71,nat),recirc(0x1461ac4),ct(zone=183,nat),recirc(0x1461ac5),ct(zone=113,nat),recirc(0x1461ac6),ct(zone=191,nat),recirc(0x1461ac7),ct(zone=205,nat),recirc(0x1461ac8),ct(zone=143,nat),recirc(0x1461ac9),ct(zone=235,nat),recirc(0x1461aca),ct(zone=69,nat),recirc(0x1461acb),ct(zone=232,nat),recirc(0x1461acc),ct(zone=182,nat),recirc(0x1461acd),ct(zone=194,nat),recirc(0x1461ace),ct(zone=114,nat),recirc(0x1461acf),ct(zone=242,nat),recirc(0x1461ad0),ct(zone=76,nat),recirc(0x1461ad1),ct(zone=163,nat),recirc(0x1461ad2),ct(zone=128,nat),recirc(0x1461ad3),ct(zone=223,nat),recirc(0x1461ad4),ct(zone=24,nat),recirc(0x1461ad5),ct(zone=217,nat),recirc(0x1461ad6),ct(zone=146,nat),recirc(0x1461ad7),ct(zone=44,nat),recirc(0x1461ad8),ct(zone=35,nat),recirc(0x1461ad9),ct(zone=130,nat),recirc(0x1461ada),ct(zone=181,nat),recirc(0x1461adb),ct(zone=124,nat),recirc(0x1461adc),ct(zone=46,nat),recirc(0x1461add),ct(zone=37,nat),recirc(0x1461ade),ct(zone=111,nat),recirc(0x1461adf),ct(zone=234,nat),recirc(0x1461ae0),ct(zone=187,nat),recirc(0x1461ae1),ct(zone=215,nat),recirc(0x1461ae2),ct(zone=226,nat),recirc(0x1461ae3),ct(zone=149,nat),recirc(0x1461ae4),ct(zone=144,nat),recirc(0x1461ae5),ct(zone=167,nat),recirc(0x1461ae6),ct(zone=82,nat),recirc(0x1461ae7),ct(zone=99,nat),recirc(0x1461ae8),ct(zone=214,nat),recirc(0x1461ae9),ct(zone=150,nat),recirc(0x1461aea),ct(zone=174,nat),recirc(0x1461aeb),ct(zone=81,nat),recirc(0x1461aec),ct(zone=90,nat),recirc(0x1461aed),ct(zone=180,nat),recirc(0x1461aee),ct(zone=247,nat),recirc(0x1461aef),ct(zone=202,nat),recirc(0x1461af0),ct(zone=63,nat),recirc(0x1461af1),ct(zone=52,nat),recirc(0x1461af2),ct(zone=137,nat),recirc(0x1461af3),ct(zone=120,nat),recirc(0x1461af4),ct(zone=198,nat),recirc(0x1461af5),ct(zone=87,nat),recirc(0x1461af6),ct(zone=23,nat),recirc(0x1461af7),ct(zone=157,nat),recirc(0x1461af8),ct(zone=200,nat),recirc(0x1461af9),ct(zone=100,nat),recirc(0x1461afa),ct(zone=65,nat),recirc(0x1461afb),ct(zone=209,nat),recirc(0x1461afc),ct(zone=225,nat),recirc(0x1461afd),ct(zone=49,nat),recirc(0x1461afe),ct(zone=60,nat),recirc(0x1461aff),ct(zone=103,nat),recirc(0x1461b00),ct(zone=106,nat),recirc(0x1461b01),ct(zone=83,nat),recirc(0x1461b02),ct(zone=141,nat),recirc(0x1461b03),ct(zone=53,nat),recirc(0x1461b04),ct(zone=220,nat),recirc(0x1461b05),ct(zone=95,nat),recirc(0x1461b06),ct(zone=68,nat),recirc(0x1461b07),ct(zone=147,nat),recirc(0x1461b08),ct(zone=108,nat),recirc(0x1461b09),ct(zone=162,nat),recirc(0x1461b0a),ct(zone=121,nat),recirc(0x1461b0b),ct(zone=27,nat),recirc(0x1461b0c),ct(zone=190,nat),recirc(0x1461b0d),ct(zone=32,nat),recirc(0x1461b0e),ct(zone=92,nat),recirc(0x1461b0f),ct(zone=239,nat),recirc(0x1461b10),ct(zone=170,nat),recirc(0x1461b11),ct(zone=29,nat),recirc(0x1461b12),ct(zone=132,nat),recirc(0x1461b13),ct(zone=96,nat),recirc(0x1461b14),ct(zone=109,nat),recirc(0x1461b15),ct(zone=93,nat),recirc(0x1461b16),ct(zone=224,nat),recirc(0x1461b17),ct(zone=185,nat),recirc(0x1461b18),ct(zone=142,nat),recirc(0x1461b19),ct(zone=253,nat),recirc(0x1461b1a),ct(zone=206,nat),recirc(0x1461b1b),ct(zone=135,nat),recirc(0x1461b1c),ct(zone=88,nat),recirc(0x1461b1d),ct(zone=207,nat),recirc(0x1461b1e),ct(zone=140,nat),recirc(0x1461b1f),ct(zone=240,nat),recirc(0x1461b20),ct(zone=229,nat),recirc(0x1461b21),ct(zone=48,nat),recirc(0x1461b22),ct(zone=119,nat),recirc(0x1461b23),ct(zone=196,nat),recirc(0x1461b24),ct(zone=47,nat),recirc(0x1461b25),ct(zone=171,nat),recirc(0x1461b26),ct(zone=26,nat),recirc(0x1461b27),ct(zone=138,nat),recirc(0x1461b28),ct(zone=74,nat),recirc(0x1461b29),ct(zone=34,nat),recirc(0x1461b2a),ct(zone=112,nat),recirc(0x1461b2b),ct(zone=248,nat),recirc(0x1461b2c),ct(zone=188,nat),recirc(0x1461b2d),ct(zone=160,nat),recirc(0x1461b2e),ct(zone=85,nat),recirc(0x1461b2f),ct(zone=189,nat),recirc(0x1461b30),ct(zone=155,nat),recirc(0x1461b31),ct(zone=36,nat),recirc(0x1461b32),ct(zone=50,nat),recirc(0x1461b33),ct(zone=245,nat),recirc(0x1461b34),ct(zone=89,nat),recirc(0x1461b35),ct(zone=57,nat),recirc(0x1461b36),ct(zone=25,nat),recirc(0x1461b37),ct(zone=126,nat),recirc(0x1461b38),ct(zone=3,nat),recirc(0x1461b39),ct(zone=39,nat),recirc(0x1461b3a),ct(zone=219,nat),recirc(0x1461b3b),ct(zone=212,nat),recirc(0x1461b3c),ct(zone=197,nat),recirc(0x1461b3d),ct(zone=101,nat),recirc(0x1461b3e),ct(zone=59,nat),recirc(0x1461b3f),ct(zone=145,nat),recirc(0x1461b40),ct(zone=66,nat),recirc(0x1461b41),ct(zone=40,nat),recirc(0x1461b42),ct(zone=139,nat),recirc(0x1461b43),ct(zone=221,nat),recirc(0x1461b44),ct(zone=238,nat),recirc(0x1461b45),ct(zone=105,nat),recirc(0x1461b46),ct(zone=54,nat),recirc(0x1461b47) failed (Message too long) on packet igmp,vlan_tci=0x0000,dl_src=0a:58:0a:82:00:01,dl_dst=01:00:5e:00:00:01,nw_src=10.130.0.1,nw_dst=224.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=17,igmp_code=10
(In reply to Flavio Leitner from comment #17) > > Another thing calling my attention is this log entry due to a IGMP traffic, > which may indicate a problem processing multicast traffic. ... > failed (Message too long) on packet > igmp,vlan_tci=0x0000,dl_src=0a:58:0a:82:00:01,dl_dst=01:00:5e:00:00:01, > nw_src=10.130.0.1,nw_dst=224.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=17, > igmp_code=10 I talked about this with Dumitru and they are aware of this, but should not cause this problem.