Bug 2107020
| Summary: | openvswitch2.15 slowness is observed on the openshift nodes | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vinya Nema <vnema> |
| Component: | Networking | Assignee: | Mike Pattrick <mpattric> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | urgent | CC: | aconole, akaris, bzvonar, dcbw, echaudro, fleitner, midu, mmorency, mpattric, pprahlad, swasthan, trozet |
| Version: | 4.8 | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-10-12 14:00:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Vinya Nema
2022-07-14 07:12:05 UTC
Is it possible that we don't have the pull request from https://github.com/openshift/ovn-kubernetes/pull/914 applied? This smells a little bit like https://bugzilla.redhat.com/show_bug.cgi?id=1986216 but I am not 100% positive. Especially because I don't see why cleanup should take 10s of minutes? (In reply to Aaron Conole from comment #1) > Is it possible that we don't have the pull request from > https://github.com/openshift/ovn-kubernetes/pull/914 applied? This smells a > little bit like https://bugzilla.redhat.com/show_bug.cgi?id=1986216 but I am > not 100% positive. Especially because I don't see why cleanup should take > 10s of minutes? 4.8.z does not have #914 yet; it was only backported down to 4.9. But wouldn't it take a lot of hard reboots without waiting to drain the node of pods before we'd build up enough stale interfaces in the OVS DB to cause these problems? As Aaron is out this week, let me add some notes:
Some feedback on #1, looking at the port information I see you have 476 interfaces.
Is this what is expected in this environment, or are there a lot of stale entries?
Looking at the log messages the following sequence:
2022-07-13T16:04:28.390Z|00379|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for main to quiesce
2022-07-13T16:04:29.390Z|00380|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for main to quiesce
2022-07-13T16:04:31.390Z|00381|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for main to quiesce
2022-07-13T16:04:35.390Z|00382|ovs_rcu(urcu4)|WARN|blocked 8000 ms waiting for main to quiesce
2022-07-13T16:04:43.390Z|00383|ovs_rcu(urcu4)|WARN|blocked 16001 ms waiting for main to quiesce
2022-07-13T16:04:59.390Z|00384|ovs_rcu(urcu4)|WARN|blocked 32000 ms waiting for main to quiesce
2022-07-13T16:05:31.390Z|00385|ovs_rcu(urcu4)|WARN|blocked 64000 ms waiting for main to quiesce
I've seen this in the past, and in general, this is due to slow response from syscalls.
So if this happens again we should try to see what is going on with the main thread.
The same might be true for the other thread's "Unreasonably long" log messages:
2022-07-13T16:06:24.740Z|00015|timeval(handler32)|WARN|Unreasonably long 2147ms poll interval (0ms user, 6ms system)
2022-07-13T16:06:24.740Z|00024|timeval(handler14)|WARN|Unreasonably long 2146ms poll interval (0ms user, 6ms system)
2022-07-13T16:06:24.740Z|00011|timeval(handler20)|WARN|Unreasonably long 2146ms poll interval (0ms user, 6ms system)
2022-07-13T16:06:24.740Z|00012|timeval(handler26)|WARN|Unreasonably long 3307ms poll interval (0ms user, 7ms system)
2022-07-13T16:06:24.740Z|00014|timeval(handler44)|WARN|Unreasonably long 2148ms poll interval (0ms user, 6ms system)
2022-07-13T16:06:24.740Z|00014|timeval(handler16)|WARN|Unreasonably long 3243ms poll interval (2ms user, 5ms system)
I would like to get the following at the time the "blocked xxx ms waiting for main to quiesce" is happening:
- date
- pidstat -t -p ALL 1 1
- for i in {1..5}; do echo "RUN $i"; more /proc/$(pidof ovs-vswitchd)/task/*/syscall | cat; sleep 0.5; done
- core dump of ovs-vswitchd
- Fresh sos report after the core dump
As a reference take a look at BZ1668554 where a similar issue was troubleshot.
Besides the request in comment#16, maybe the stale interfaces are causing a loop in the traffic or something else. I wonder if it is possible to start ovsdb but not ovs-vswitchd and then delete all unavailable ports using 'ovs-vsctl --no-wait del-port <stale port>' and then see if the node works normally after starting ovs-vswitchd? The idea is to rule out the stale interfaces. Another thing calling my attention is this log entry due to a IGMP traffic, which may indicate a problem processing multicast traffic. Jul 13 16:03:13 <host> ovs-vswitchd[3671]: ovs|65545|dpif|WARN|system@ovs-system: execute ct(zone=151,nat),recirc(0x1461a61),ct(zone=45,nat),recirc(0x1461a62),ct(zone=131,nat),recirc(0x1461a63),ct(zone=176,nat),recirc(0x1461a64),ct(zone=123,nat),recirc(0x1461a65),ct(zone=94,nat),recirc(0x1461a66),ct(zone=22,nat),recirc(0x1461a67),ct(zone=116,nat),recirc(0x1461a68),ct(zone=77,nat),recirc(0x1461a69),ct(zone=80,nat),recirc(0x1461a6a),ct(zone=62,nat),recirc(0x1461a6b),ct(zone=165,nat),recirc(0x1461a6c),ct(zone=252,nat),recirc(0x1461a6d),ct(zone=154,nat),recirc(0x1461a6e),ct(zone=231,nat),recirc(0x1461a6f),ct(zone=184,nat),recirc(0x1461a70),ct(zone=244,nat),recirc(0x1461a71),ct(zone=236,nat),recirc(0x1461a72),ct(zone=173,nat),recirc(0x1461a73),ct(zone=241,nat),recirc(0x1461a74),ct(zone=201,nat),recirc(0x1461a75),ct(zone=208,nat),recirc(0x1461a76),ct(zone=203,nat),recirc(0x1461a77),ct(zone=67,nat),recirc(0x1461a78),ct(zone=228,nat),recirc(0x1461a79),ct(zone=43,nat),recirc(0x1461a7a),ct(zone=136,nat),recirc(0x1461a7b),ct(zone=127,nat),recirc(0x1461a7c),ct(zone=61,nat),recirc(0x1461a7d),ct(zone=166,nat),recirc(0x1461a7e),ct(zone=216,nat),recirc(0x1461a7f),ct(zone=129,nat),recirc(0x1461a80),ct(zone=84,nat),recirc(0x1461a81),ct(zone=70,nat),recirc(0x1461a82),ct(zone=75,nat),recirc(0x1461a83),ct(zone=172,nat),recirc(0x1461a84),ct(zone=192,nat),recirc(0x1461a85),ct(zone=199,nat),recirc(0x1461a86),ct(zone=249,nat),recirc(0x1461a87),ct(zone=230,nat),recirc(0x1461a88),ct(zone=41,nat),recirc(0x1461a89),ct(zone=104,nat),recirc(0x1461a8a),ct(zone=107,nat),recirc(0x1461a8b),ct(zone=164,nat),recirc(0x1461a8c),ct(zone=72,nat),recirc(0x1461a8d),ct(zone=233,nat),recirc(0x1461a8e),ct(zone=175,nat),recirc(0x1461a8f),ct(zone=218,nat),recirc(0x1461a90),ct(zone=158,nat),recirc(0x1461a91),ct(zone=156,nat),recirc(0x1461a92),ct(zone=117,nat),recirc(0x1461a93),ct(zone=179,nat),recirc(0x1461a94),ct(zone=186,nat),recirc(0x1461a95),ct(zone=169,nat),recirc(0x1461a96),ct(zone=159,nat),recirc(0x1461a97),ct(zone=31,nat),recirc(0x1461a98),ct(zone=134,nat),recirc(0x1461a99),ct(zone=222,nat),recirc(0x1461a9a),ct(zone=213,nat),recirc(0x1461a9b),ct(zone=227,nat),recirc(0x1461a9c),ct(zone=30,nat),recirc(0x1461a9d),ct(zone=58,nat),recirc(0x1461a9e),ct(zone=153,nat),recirc(0x1461a9f),ct(zone=97,nat),recirc(0x1461aa0),ct(zone=193,nat),recirc(0x1461aa1),ct(zone=64,nat),recirc(0x1461aa2),ct(zone=125,nat),recirc(0x1461aa3),ct(zone=73,nat),recirc(0x1461aa4),ct(zone=237,nat),recirc(0x1461aa5),ct(zone=168,nat),recirc(0x1461aa6),ct(zone=204,nat),recirc(0x1461aa7),ct(zone=86,nat),recirc(0x1461aa8),ct(zone=210,nat),recirc(0x1461aa9),ct(zone=79,nat),recirc(0x1461aaa),ct(zone=115,nat),recirc(0x1461aab),ct(zone=178,nat),recirc(0x1461aac),ct(zone=51,nat),recirc(0x1461aad),ct(zone=148,nat),recirc(0x1461aae),ct(zone=42,nat),recirc(0x1461aaf),ct(zone=91,nat),recirc(0x1461ab0),ct(zone=122,nat),recirc(0x1461ab1),ct(zone=33,nat),recirc(0x1461ab2),ct(zone=177,nat),recirc(0x1461ab3),ct(zone=211,nat),recirc(0x1461ab4),ct(zone=78,nat),recirc(0x1461ab5),ct(zone=152,nat),recirc(0x1461ab6),ct(zone=251,nat),recirc(0x1461ab7),ct(zone=118,nat),recirc(0x1461ab8),ct(zone=38,nat),recirc(0x1461ab9),ct(zone=195,nat),recirc(0x1461aba),ct(zone=110,nat),recirc(0x1461abb),ct(zone=161,nat),recirc(0x1461abc),ct(zone=56,nat),recirc(0x1461abd),ct(zone=133,nat),recirc(0x1461abe),ct(zone=250,nat),recirc(0x1461abf),ct(zone=28,nat),recirc(0x1461ac0),ct(zone=243,nat),recirc(0x1461ac1),ct(zone=55,nat),recirc(0x1461ac2),ct(zone=246,nat),recirc(0x1461ac3),ct(zone=71,nat),recirc(0x1461ac4),ct(zone=183,nat),recirc(0x1461ac5),ct(zone=113,nat),recirc(0x1461ac6),ct(zone=191,nat),recirc(0x1461ac7),ct(zone=205,nat),recirc(0x1461ac8),ct(zone=143,nat),recirc(0x1461ac9),ct(zone=235,nat),recirc(0x1461aca),ct(zone=69,nat),recirc(0x1461acb),ct(zone=232,nat),recirc(0x1461acc),ct(zone=182,nat),recirc(0x1461acd),ct(zone=194,nat),recirc(0x1461ace),ct(zone=114,nat),recirc(0x1461acf),ct(zone=242,nat),recirc(0x1461ad0),ct(zone=76,nat),recirc(0x1461ad1),ct(zone=163,nat),recirc(0x1461ad2),ct(zone=128,nat),recirc(0x1461ad3),ct(zone=223,nat),recirc(0x1461ad4),ct(zone=24,nat),recirc(0x1461ad5),ct(zone=217,nat),recirc(0x1461ad6),ct(zone=146,nat),recirc(0x1461ad7),ct(zone=44,nat),recirc(0x1461ad8),ct(zone=35,nat),recirc(0x1461ad9),ct(zone=130,nat),recirc(0x1461ada),ct(zone=181,nat),recirc(0x1461adb),ct(zone=124,nat),recirc(0x1461adc),ct(zone=46,nat),recirc(0x1461add),ct(zone=37,nat),recirc(0x1461ade),ct(zone=111,nat),recirc(0x1461adf),ct(zone=234,nat),recirc(0x1461ae0),ct(zone=187,nat),recirc(0x1461ae1),ct(zone=215,nat),recirc(0x1461ae2),ct(zone=226,nat),recirc(0x1461ae3),ct(zone=149,nat),recirc(0x1461ae4),ct(zone=144,nat),recirc(0x1461ae5),ct(zone=167,nat),recirc(0x1461ae6),ct(zone=82,nat),recirc(0x1461ae7),ct(zone=99,nat),recirc(0x1461ae8),ct(zone=214,nat),recirc(0x1461ae9),ct(zone=150,nat),recirc(0x1461aea),ct(zone=174,nat),recirc(0x1461aeb),ct(zone=81,nat),recirc(0x1461aec),ct(zone=90,nat),recirc(0x1461aed),ct(zone=180,nat),recirc(0x1461aee),ct(zone=247,nat),recirc(0x1461aef),ct(zone=202,nat),recirc(0x1461af0),ct(zone=63,nat),recirc(0x1461af1),ct(zone=52,nat),recirc(0x1461af2),ct(zone=137,nat),recirc(0x1461af3),ct(zone=120,nat),recirc(0x1461af4),ct(zone=198,nat),recirc(0x1461af5),ct(zone=87,nat),recirc(0x1461af6),ct(zone=23,nat),recirc(0x1461af7),ct(zone=157,nat),recirc(0x1461af8),ct(zone=200,nat),recirc(0x1461af9),ct(zone=100,nat),recirc(0x1461afa),ct(zone=65,nat),recirc(0x1461afb),ct(zone=209,nat),recirc(0x1461afc),ct(zone=225,nat),recirc(0x1461afd),ct(zone=49,nat),recirc(0x1461afe),ct(zone=60,nat),recirc(0x1461aff),ct(zone=103,nat),recirc(0x1461b00),ct(zone=106,nat),recirc(0x1461b01),ct(zone=83,nat),recirc(0x1461b02),ct(zone=141,nat),recirc(0x1461b03),ct(zone=53,nat),recirc(0x1461b04),ct(zone=220,nat),recirc(0x1461b05),ct(zone=95,nat),recirc(0x1461b06),ct(zone=68,nat),recirc(0x1461b07),ct(zone=147,nat),recirc(0x1461b08),ct(zone=108,nat),recirc(0x1461b09),ct(zone=162,nat),recirc(0x1461b0a),ct(zone=121,nat),recirc(0x1461b0b),ct(zone=27,nat),recirc(0x1461b0c),ct(zone=190,nat),recirc(0x1461b0d),ct(zone=32,nat),recirc(0x1461b0e),ct(zone=92,nat),recirc(0x1461b0f),ct(zone=239,nat),recirc(0x1461b10),ct(zone=170,nat),recirc(0x1461b11),ct(zone=29,nat),recirc(0x1461b12),ct(zone=132,nat),recirc(0x1461b13),ct(zone=96,nat),recirc(0x1461b14),ct(zone=109,nat),recirc(0x1461b15),ct(zone=93,nat),recirc(0x1461b16),ct(zone=224,nat),recirc(0x1461b17),ct(zone=185,nat),recirc(0x1461b18),ct(zone=142,nat),recirc(0x1461b19),ct(zone=253,nat),recirc(0x1461b1a),ct(zone=206,nat),recirc(0x1461b1b),ct(zone=135,nat),recirc(0x1461b1c),ct(zone=88,nat),recirc(0x1461b1d),ct(zone=207,nat),recirc(0x1461b1e),ct(zone=140,nat),recirc(0x1461b1f),ct(zone=240,nat),recirc(0x1461b20),ct(zone=229,nat),recirc(0x1461b21),ct(zone=48,nat),recirc(0x1461b22),ct(zone=119,nat),recirc(0x1461b23),ct(zone=196,nat),recirc(0x1461b24),ct(zone=47,nat),recirc(0x1461b25),ct(zone=171,nat),recirc(0x1461b26),ct(zone=26,nat),recirc(0x1461b27),ct(zone=138,nat),recirc(0x1461b28),ct(zone=74,nat),recirc(0x1461b29),ct(zone=34,nat),recirc(0x1461b2a),ct(zone=112,nat),recirc(0x1461b2b),ct(zone=248,nat),recirc(0x1461b2c),ct(zone=188,nat),recirc(0x1461b2d),ct(zone=160,nat),recirc(0x1461b2e),ct(zone=85,nat),recirc(0x1461b2f),ct(zone=189,nat),recirc(0x1461b30),ct(zone=155,nat),recirc(0x1461b31),ct(zone=36,nat),recirc(0x1461b32),ct(zone=50,nat),recirc(0x1461b33),ct(zone=245,nat),recirc(0x1461b34),ct(zone=89,nat),recirc(0x1461b35),ct(zone=57,nat),recirc(0x1461b36),ct(zone=25,nat),recirc(0x1461b37),ct(zone=126,nat),recirc(0x1461b38),ct(zone=3,nat),recirc(0x1461b39),ct(zone=39,nat),recirc(0x1461b3a),ct(zone=219,nat),recirc(0x1461b3b),ct(zone=212,nat),recirc(0x1461b3c),ct(zone=197,nat),recirc(0x1461b3d),ct(zone=101,nat),recirc(0x1461b3e),ct(zone=59,nat),recirc(0x1461b3f),ct(zone=145,nat),recirc(0x1461b40),ct(zone=66,nat),recirc(0x1461b41),ct(zone=40,nat),recirc(0x1461b42),ct(zone=139,nat),recirc(0x1461b43),ct(zone=221,nat),recirc(0x1461b44),ct(zone=238,nat),recirc(0x1461b45),ct(zone=105,nat),recirc(0x1461b46),ct(zone=54,nat),recirc(0x1461b47) failed (Message too long) on packet igmp,vlan_tci=0x0000,dl_src=0a:58:0a:82:00:01,dl_dst=01:00:5e:00:00:01,nw_src=10.130.0.1,nw_dst=224.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=17,igmp_code=10 (In reply to Flavio Leitner from comment #17) > > Another thing calling my attention is this log entry due to a IGMP traffic, > which may indicate a problem processing multicast traffic. ... > failed (Message too long) on packet > igmp,vlan_tci=0x0000,dl_src=0a:58:0a:82:00:01,dl_dst=01:00:5e:00:00:01, > nw_src=10.130.0.1,nw_dst=224.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=17, > igmp_code=10 I talked about this with Dumitru and they are aware of this, but should not cause this problem. |