Bug 1509994 - vswitchd high cpu usage and various warnings on openvswitch logs

Product:          Red Hat OpenStack
Component:        openvswitch
Version:          10.0 (Newton)
Hardware:         Unspecified
OS:               Unspecified
Status:           CLOSED INSUFFICIENT_DATA
Severity:         medium
Priority:         medium
Reporter:         Eduard Barrera <ebarrera>
Assignee:         Eelco Chaudron <echaudro>
QA Contact:       Ofer Blaut <oblaut>
CC:               amuller, apevec, atragler, chrisw, ebarrera, echaudro, jlebon, jlibosva, jmelvin, jraju, jsitnick, ljozsa, mapiccin, marjones, ovs-team, pablo.iranzo, pcaruana, pmannidi, rcernin, rhos-maint, rkhan, srevivo, tredaelli, vshevche
Target Milestone: ---
Target Release:   10.0 (Newton)
Keywords:         Reopened, Triaged, ZStream
Doc Type:         If docs needed, set a value
Type:             Bug
Last Closed:      2018-06-29 14:30:22 UTC
Description
Eduard Barrera
2017-11-06 14:24:43 UTC
Can you get me some more details, as my crystal ball needs repair :)

- What version of OVS are you using?
- OVS configuration, i.e. DPDK or kernel datapath?
- How many, and what kind of, OpenFlow rules are installed?
- How many datapath rules are installed, especially around the time of the warnings? A dump would help here.
- The full log file

At a quick glance, the revalidator379 and handler340 threads are working on something that takes a long time. The above information request might be a start; maybe you can throw in an SOS report?

Looking at the logs and configuration from the SOS reports, it looks like OVS is spending time waiting on the kernel for Netlink-related messages/data. The "Unreasonably long 1259ms poll interval" messages are probably because getting Netlink messages takes a long time. There is not much else going on in the revalidator threads that takes time.

The "ofproto_dpif_upcall(revalidator379)|INFO|Spent an unreasonably long 2146ms dumping flows" messages are the result of the above, as OVS wants all revalidator threads working collectively on a run to be done in time. Having two of these extreme poll intervals in a single thread would cause this (>2000ms).

Is there anything else going on in the system that could explain a long latency in the kernel for Netlink-related messages?

I've been trying to simulate the issue you are seeing on the customer's system, but I'm not getting anywhere close to the CPU utilization or the "unreasonably long time spent" messages in a normal scenario. However, I can see a similar issue when the kernel is stealing cycles from the process, for example due to softirqs. So, to get a better understanding of what is happening on this specific system, I would like to get some perf statistics. Below is an example of what to get. Note that the perf data file is around 2G in size.
- Get the following when you detect high CPU utilization and/or "unreasonably long time spent" messages:

  pidstat -u -t -p `pidof ovs-vswitchd`,`pidof ovsdb-server` 5 12 > ovs_pidstat.txt &
  perf record -p `pidof ovs-vswitchd` -g --call-graph dwarf sleep 60

- Once you have collected the data, do the following:

  mv perf.data perf.data_openvswitch
  perf script -F tid -i perf.data_openvswitch | sort -u | grep -o '[0-9]*' | \
    xargs -n1 -I{} \
    perf report -i perf.data_openvswitch --no-children \
      --percentage relative --stdio --tid {} -g none > perf_reports.txt

- Gather the following files (and let me know which previous SOS report they match):

  perf.data_openvswitch
  perf_reports.txt
  ovs_pidstat.txt

Closed this BZ for now; please re-open when the requested information is available.

I analyzed the perf data that was sent, and I see that the kernel is spending a decent amount of time on processing new ARP entries for received packets. I see multiple threads competing for the lock to add a new neighbor entry, causing the high CPU utilization on the OVS threads (and potentially others). Can you investigate whether a lot of ARP activity is going on on this node?

Eduard, we have been waiting for over a month on comment 24, i.e. troubleshooting to figure out why this looks like an ARP loop. If you take the perf traces again, as in comment 17, I can quickly see if this is the same ARP-related issue.

First I want to clarify that I was not aware that the way to recover from this was to reboot the nodes! I concentrated my research on node 0. First, looking at the messages log, I see a lot of IPv6 duplicate address detection, which once again makes me believe there is some kind of network loop.

$ pwd
/cases/01967187/sosreport-20180321-144239/overcloud-neutronnetwork-0.localdomain
$ grep "IPv6 duplicate address" messages | more
Mar 21 13:15:02 overcloud-neutronnetwork-0 kernel: IPv6: qg-66e66b6e-6a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe5a:5667 detected!
Mar 21 13:15:02 overcloud-neutronnetwork-0 kernel: IPv6: qg-b2ed3c2f-4c: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe12:e856 detected!
Mar 21 13:20:31 overcloud-neutronnetwork-0 kernel: IPv6: qg-66e66b6e-6a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe5a:5667 detected!
Mar 21 13:20:32 overcloud-neutronnetwork-0 kernel: IPv6: qg-b2ed3c2f-4c: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe12:e856 detected!
Mar 21 13:29:51 overcloud-neutronnetwork-0 kernel: IPv6: qg-66e66b6e-6a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe5a:5667 detected!
Mar 21 13:29:52 overcloud-neutronnetwork-0 kernel: IPv6: qg-b2ed3c2f-4c: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe12:e856 detected!
Mar 21 13:36:47 overcloud-neutronnetwork-0 kernel: IPv6: qg-66e66b6e-6a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe5a:5667 detected!
Mar 21 13:36:47 overcloud-neutronnetwork-0 kernel: IPv6: qg-b2ed3c2f-4c: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe12:e856 detected!
Mar 21 13:41:02 overcloud-neutronnetwork-0 kernel: IPv6: qg-e4f29484-0a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe71:17b5 detected!
...
$ grep "IPv6 duplicate address" messages | wc -l
443

Back to the perf data.
First, you can still see that most of the OVS time is spent in kernel space:

02:11:36 PM  0  -  3926  0.60  14.60  0.00  15.20  28  |__handler353
02:11:36 PM  0  -  3944  0.80  69.40  0.00  70.20  42  |__handler371
02:11:36 PM  0  -  3952  0.80  10.40  0.00  11.20  50  |__revalidator379
02:11:36 PM  0  -  3953  0.40   5.00  0.00   5.40  32  |__revalidator380
02:11:36 PM  0  -  3954  0.60   8.80  0.00   9.40   2  |__revalidator381
02:11:36 PM  0  -  3955  0.60   7.20  0.00   7.80   8  |__revalidator382
02:11:36 PM  0  -  3956  0.80   7.60  0.00   8.40  16  |__revalidator383
02:11:36 PM  0  -  3957  0.60  10.20  0.00  10.80  10  |__revalidator384
02:11:36 PM  0  -  3958  0.80   8.00  0.00   8.80  14  |__revalidator385
02:11:36 PM  0  -  3959  0.60   8.60  0.00   9.20  36  |__revalidator386
02:11:36 PM  0  -  3960  0.80   9.60  0.00  10.40  44  |__revalidator387
02:11:36 PM  0  -  3961  0.80  10.00  0.00  10.80  40  |__revalidator388
02:11:36 PM  0  -  3962  0.60   6.20  0.00   6.80  22  |__revalidator389
02:11:36 PM  0  -  3963  0.60   5.20  0.00   5.80   6  |__revalidator390
02:11:36 PM  0  -  3964  1.00  11.20  0.00  12.20   2  |__revalidator391
02:11:36 PM  0  -  3965  0.80   7.20  0.00   8.00  46  |__revalidator392

Looking at the perf data, it is again the kernel stealing time from OVS. Same as before, it is fighting for the ARP neighbor table lock. For example, handler371:

40.30% handler371 [kernel.vmlinux] [k] __write_lock_failed
        |
        ---__write_lock_failed
           |
           --24.20%--_raw_write_lock_bh
                     __neigh_create

37.87% handler371 [kernel.vmlinux] [k] netlink_broadcast_filtered
        |
        ---netlink_broadcast_filtered
           |
           --22.75%--nlmsg_notify
                     rtnl_notify
                     __neigh_notify
                     |
                     |--11.58%--neigh_cleanup_and_release
                     |          __neigh_create
                     |          arp_process
                     |          arp_rcv

After writing a script to analyze the perf data taken with the probes, I found no IP that stood out in taking the lock.
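The pidstat snapshot above (per-thread user vs. system CPU) can be reduced to a quick "%system per thread" ranking with sort and awk. This is a minimal sketch on hypothetical sample lines; the column positions assume the pidstat -u -t layout shown above (time, AM/PM, UID, TGID, TID, %usr, %system, %guest, %CPU, CPU, thread name) with the header stripped:

```shell
# Hypothetical sample of pidstat -u -t data lines (not the customer's real output).
cat > /tmp/pidstat_sample.txt <<'EOF'
02:11:36 PM 0 - 3944 0.80 69.40 0.00 70.20 42 |__handler371
02:11:36 PM 0 - 3952 0.80 10.40 0.00 11.20 50 |__revalidator379
02:11:36 PM 0 - 3953 0.40 5.00 0.00 5.40 32 |__revalidator380
EOF
# Sort by %system (field 7) descending and print "thread %system",
# to spot threads whose time is mostly spent in the kernel.
sort -k7 -rn /tmp/pidstat_sample.txt | awk '{print $11, $7}'
```

On the sample data this puts |__handler371 on top, mirroring the real output above where handler371 burned ~69% system time.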
However, there are still 2M+ ARP packets processed in the 60-second capture, so 33K a second :(

From the script output, collecting all ARPs together, here is the top (cat perf_arp_process.txt | sort -n | uniq -c | sort -n -r > arps_sort_ip.txt):

14313 - arp_process() on IP 0.119.192.127 for bridge qg-ac146400-2c
7329 - arp_process() on IP 10.8.255.254 for bridge qg-90938aa3-9b
7328 - arp_process() on IP 10.8.255.254 for bridge qg-49eed100-0e
7324 - arp_process() on IP 10.8.255.254 for bridge qg-54b18a5f-9b
7323 - arp_process() on IP 10.8.255.254 for bridge qg-60434a2b-94
7322 - arp_process() on IP 10.8.255.254 for bridge qg-643e4b88-ef
7321 - arp_process() on IP 10.8.255.254 for bridge qg-3f2288dc-79
7320 - arp_process() on IP 10.8.255.254 for bridge qg-fff63afd-27
7320 - arp_process() on IP 10.8.255.254 for bridge qg-8559b7f1-c6
7319 - arp_process() on IP 10.8.255.254 for bridge qg-6355e480-5e
7318 - arp_process() on IP 10.8.255.254 for bridge qg-e1935639-b4
7318 - arp_process() on IP 10.8.255.254 for bridge qg-b416607f-03
7318 - arp_process() on IP 10.8.255.254 for bridge qg-8b6dd0bd-be
7318 - arp_process() on IP 10.8.255.254 for bridge qg-84565145-8c
7317 - arp_process() on IP 10.8.255.254 for bridge qg-e10cac61-96
7317 - arp_process() on IP 10.8.255.254 for bridge qg-ddcfce33-36
7316 - arp_process() on IP 10.8.255.254 for bridge qg-61deed1f-65
7316 - arp_process() on IP 10.8.255.254 for bridge qg-184ce8a6-cf
7316 - arp_process() on IP 10.8.255.254 for bridge qg-0b20756a-e5
7315 - arp_process() on IP 10.8.255.254 for bridge qg-f4d18865-77
7313 - arp_process() on IP 10.8.255.254 for bridge qg-a03d187d-2b
7313 - arp_process() on IP 10.8.255.254 for bridge qg-46bc8538-77
7311 - arp_process() on IP 10.8.255.254 for bridge qg-9e821e61-f5
7311 - arp_process() on IP 10.8.255.254 for bridge qg-83754c33-3c
7310 - arp_process() on IP 10.8.255.254 for bridge qg-2bb7a784-28
7309 - arp_process() on IP 10.8.255.254 for bridge qg-af6068b2-b2
7308 - arp_process() on IP 10.8.255.254 for bridge qg-8ec08f18-e0
7308 - arp_process() on IP 10.8.255.254 for bridge qg-02c639c2-5a
7307 - arp_process() on IP 10.8.255.254 for bridge qg-6816025b-5a
7306 - arp_process() on IP 10.8.255.254 for bridge qg-e56f38c7-00
7305 - arp_process() on IP 10.8.255.254 for bridge qg-e479b248-68
7304 - arp_process() on IP 10.8.255.254 for bridge qg-77cb3d4d-75
7302 - arp_process() on IP 10.8.255.254 for bridge qg-361cfed0-08
7302 - arp_process() on IP 10.8.255.254 for bridge qg-23ccf949-0b
7301 - arp_process() on IP 10.8.255.254 for bridge qg-b51c6064-b4
7301 - arp_process() on IP 10.8.255.254 for bridge qg-69d974c4-5c
7301 - arp_process() on IP 10.8.255.254 for bridge qg-22c53a2f-e2
7299 - arp_process() on IP 10.8.255.254 for bridge qg-d753f577-d6
7299 - arp_process() on IP 10.8.255.254 for bridge qg-52e2f0b1-34
7295 - arp_process() on IP 10.8.255.254 for bridge qg-0b6f302e-87
7294 - arp_process() on IP 10.8.255.254 for bridge qg-26b088c2-2d
7292 - arp_process() on IP 10.8.255.254 for bridge qg-39304955-3e
7291 - arp_process() on IP 10.8.255.254 for bridge qg-0fcabf84-5c
7285 - arp_process() on IP 10.8.255.254 for bridge qg-22e94934-b0
7284 - arp_process() on IP 10.8.255.254 for bridge qg-5c6a125e-f2
7283 - arp_process() on IP 10.8.255.254 for bridge qg-e4c6fc0f-32
7283 - arp_process() on IP 10.8.255.254 for bridge qg-40ed7eca-fb
7282 - arp_process() on IP 10.8.255.254 for bridge qg-35bf0221-d2
7280 - arp_process() on IP 10.8.255.254 for bridge qg-d047bd16-61
7280 - arp_process() on IP 10.8.255.254 for bridge qg-b89f3f92-c3
7280 - arp_process() on IP 10.8.255.254 for bridge qg-7d85facc-20
7276 - arp_process() on IP 10.8.255.254 for bridge qg-6b9e5681-08
7273 - arp_process() on IP 10.8.255.254 for bridge qg-8601a87e-db
7271 - arp_process() on IP 10.8.255.254 for bridge qg-1cbe5b34-ce
7265 - arp_process() on IP 10.8.255.254 for bridge qg-0db1e23b-dd
7264 - arp_process() on IP 10.8.255.254 for bridge qg-543ef736-26
7263 - arp_process() on IP 10.8.255.254 for bridge qg-57df27ee-95
7256 - arp_process() on IP 10.8.255.254 for bridge qg-2abf881e-af
7249 - arp_process() on IP 10.8.255.254 for bridge qg-d1ead6fd-08
7235 - arp_process() on IP 10.8.255.254 for bridge qg-384ca19e-3a
7156 - arp_process() on IP 10.8.255.254 for bridge qg-c368347c-b9
7154 - arp_process() on IP 10.8.255.254 for bridge qg-2990c397-bd
7153 - arp_process() on IP 10.8.255.254 for bridge qg-c4aebfbb-7b
7152 - arp_process() on IP 10.8.255.254 for bridge qg-feff5b1c-d4
7151 - arp_process() on IP 10.8.255.254 for bridge qg-cb0d8bc4-93
7150 - arp_process() on IP 10.8.255.254 for bridge qg-fdb48344-82
7150 - arp_process() on IP 10.8.255.254 for bridge qg-4ae14248-7b
7148 - arp_process() on IP 10.8.255.254 for bridge qg-8cbb51de-5e
7146 - arp_process() on IP 10.8.255.254 for bridge qg-d118def9-33
7145 - arp_process() on IP 10.8.255.254 for bridge qg-83fe93fa-3f
7142 - arp_process() on IP 10.8.255.254 for bridge qg-97969e2f-b9
7141 - arp_process() on IP 10.8.255.254 for bridge qg-10c08ef3-0c
7140 - arp_process() on IP 10.8.255.254 for bridge qg-f64811d2-4b
7138 - arp_process() on IP 10.8.255.254 for bridge qg-83e0932d-7a
7136 - arp_process() on IP 10.8.255.254 for bridge qg-bab83275-5c
7136 - arp_process() on IP 10.8.255.254 for bridge qg-595aa19a-7e
7135 - arp_process() on IP 10.8.255.254 for bridge qg-122a25c0-9c
7134 - arp_process() on IP 10.8.255.254 for bridge qg-bd43420b-a9
7134 - arp_process() on IP 10.8.255.254 for bridge qg-7a833dc0-8e
7133 - arp_process() on IP 10.8.255.254 for bridge qg-444f66b1-3c
7131 - arp_process() on IP 10.8.255.254 for bridge qg-7bd5b526-54
7131 - arp_process() on IP 10.8.255.254 for bridge qg-54b920a6-8e
7130 - arp_process() on IP 10.8.255.254 for bridge qg-f3fff380-f0
7128 - arp_process() on IP 10.8.255.254 for bridge qg-db665e8e-52
7127 - arp_process() on IP 10.8.255.254 for bridge qg-adec81f6-50
7123 - arp_process() on IP 10.8.255.254 for bridge qg-b2ed3c2f-4c
7123 - arp_process() on IP 10.8.255.254 for bridge qg-9af1cb40-3f
7123 - arp_process() on IP 10.8.255.254 for bridge qg-1b608741-91
7122 - arp_process() on IP 10.8.255.254 for bridge qg-d068fba5-ad
7122 - arp_process() on IP 10.8.255.254 for bridge qg-c3afd50e-41
7120 - arp_process() on IP 10.8.255.254 for bridge qg-874fee56-00
7119 - arp_process() on IP 10.8.255.254 for bridge qg-e3dd4e16-98
7117 - arp_process() on IP 10.8.255.254 for bridge qg-f95b2a6f-74
7113 - arp_process() on IP 10.8.255.254 for bridge qg-5ec846ad-c2
7112 - arp_process() on IP 10.8.255.254 for bridge qg-ff9475bc-1c
7112 - arp_process() on IP 10.8.255.254 for bridge qg-22599fd1-75
7111 - arp_process() on IP 10.8.255.254 for bridge qg-6ff1014c-24
7110 - arp_process() on IP 10.8.255.254 for bridge qg-9702727b-1a
7105 - arp_process() on IP 10.8.255.254 for bridge qg-f2a71092-25
7105 - arp_process() on IP 10.8.255.254 for bridge qg-8bd65435-ea
7105 - arp_process() on IP 10.8.255.254 for bridge qg-4d1957a2-7f
7104 - arp_process() on IP 10.8.255.254 for bridge qg-15e67ba9-87
7104 - arp_process() on IP 10.8.255.254 for bridge qg-137bf9c9-9d
7103 - arp_process() on IP 10.8.255.254 for bridge qg-cc26d659-55
7103 - arp_process() on IP 10.8.255.254 for bridge qg-28431b7f-dc
7102 - arp_process() on IP 10.8.255.254 for bridge qg-f9eb38d0-68
7102 - arp_process() on IP 10.8.255.254 for bridge qg-69552fe6-88
7101 - arp_process() on IP 10.8.255.254 for bridge qg-a237007e-e2
7101 - arp_process() on IP 10.8.255.254 for bridge qg-208f986a-a2
7101 - arp_process() on IP 10.8.255.254 for bridge qg-0f459ae0-84
7099 - arp_process() on IP 10.8.255.254 for bridge qg-3b2ff0fe-3f
7098 - arp_process() on IP 10.8.255.254 for bridge qg-66a332af-22
7097 - arp_process() on IP 10.8.255.254 for bridge qg-5515d120-e8
7096 - arp_process() on IP 10.8.255.254 for bridge qg-e29d90b3-39
7095 - arp_process() on IP 10.8.255.254 for bridge qg-f45e01da-60
7094 - arp_process() on IP 10.8.255.254 for bridge qg-eff3812e-b1
7094 - arp_process() on IP 10.8.255.254 for bridge qg-ddd4c341-05
7094 - arp_process() on IP 10.8.255.254 for bridge qg-7b796925-6b
7093 - arp_process() on IP 10.8.255.254 for bridge qg-dbcc915a-90
7092 - arp_process() on IP 10.8.255.254 for bridge qg-c155c29d-e6
7092 - arp_process() on IP 10.8.255.254 for bridge qg-103d267d-26
7091 - arp_process() on IP 10.8.255.254 for bridge qg-df7e187a-83
7090 - arp_process() on IP 10.8.255.254 for bridge qg-e4f29484-0a
7090 - arp_process() on IP 10.8.255.254 for bridge qg-b5d4bded-f3
7090 - arp_process() on IP 10.8.255.254 for bridge qg-8fe7fb1b-3b
7088 - arp_process() on IP 10.8.255.254 for bridge qg-36271b6d-97
7085 - arp_process() on IP 10.8.255.254 for bridge qg-eecf06ec-17
7085 - arp_process() on IP 10.8.255.254 for bridge qg-265a827f-aa
7084 - arp_process() on IP 10.8.255.254 for bridge qg-546bed7c-91
7083 - arp_process() on IP 10.8.255.254 for bridge qg-66e66b6e-6a
7083 - arp_process() on IP 10.8.255.254 for bridge qg-3be17ae0-c5
7082 - arp_process() on IP 10.8.255.254 for bridge qg-e4778e69-8f
7077 - arp_process() on IP 10.8.255.254 for bridge qg-7492bb7d-c7
7076 - arp_process() on IP 10.8.255.254 for bridge qg-a0bf0fa4-bf
7069 - arp_process() on IP 10.8.255.254 for bridge qg-af34ac65-dd
7068 - arp_process() on IP 10.8.255.254 for bridge qg-e6560ca2-0f
7065 - arp_process() on IP 10.8.255.254 for bridge qg-11aeea40-b0
7062 - arp_process() on IP 10.8.255.254 for bridge qg-c90eb60e-51
7060 - arp_process() on IP 10.8.255.254 for bridge qg-3aa49939-a8
7059 - arp_process() on IP 10.8.255.254 for bridge qg-30360910-77
7055 - arp_process() on IP 10.8.255.254 for bridge qg-40d89ead-8e
6872 - arp_process() on IP 10.8.255.254 for bridge qg-9c85b855-bb
6833 - arp_process() on IP 10.8.255.254 for bridge qg-59513191-c3
4024 - arp_process() on IP 10.8.255.254 for bridge qg-6015eff6-2c

As you can see, it's almost all for 10.8.255.254!!!
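The aggregation used above (sort | uniq -c | sort -n -r) can be reproduced on a few hypothetical sample lines from a perf_arp_process.txt-style file; the bridge names and the 192.0.2.1 address below are made up for illustration:

```shell
# Hypothetical sample: one line per arp_process() hit, as emitted by the
# analysis script described above.
cat > /tmp/perf_arp_sample.txt <<'EOF'
- arp_process() on IP 10.8.255.254 for bridge qg-aaaa
- arp_process() on IP 10.8.255.254 for bridge qg-aaaa
- arp_process() on IP 10.8.255.254 for bridge qg-bbbb
- arp_process() on IP 192.0.2.1 for bridge qg-cccc
EOF
# Count identical lines and rank by frequency, highest first:
sort /tmp/perf_arp_sample.txt | uniq -c | sort -rn
```

The top line of the output carries the most frequent (IP, bridge) pair, which is how 10.8.255.254 stood out in the real data.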
The same top list, sorted by bridge:

14384 qg-49eed100-0e
14380 qg-d047bd16-61
14380 qg-ac146400-2c
14379 qg-8b6dd0bd-be
14375 qg-a03d187d-2b
14375 qg-77cb3d4d-75
14374 qg-90938aa3-9b
14373 qg-61deed1f-65
14372 qg-d753f577-d6
14372 qg-9e821e61-f5
14372 qg-60434a2b-94
14371 qg-b51c6064-b4
14371 qg-8ec08f18-e0
14371 qg-6816025b-5a
14370 qg-fff63afd-27
14370 qg-184ce8a6-cf
14370 qg-0b20756a-e5
14368 qg-e1935639-b4
14368 qg-02c639c2-5a
14366 qg-af6068b2-b2
14365 qg-b416607f-03
14364 qg-e10cac61-96
14357 qg-6355e480-5e
14355 qg-39304955-3e
14354 qg-54b18a5f-9b
14348 qg-e4c6fc0f-32
14337 qg-643e4b88-ef
14323 qg-3f2288dc-79
14321 qg-8601a87e-db
14288 qg-40ed7eca-fb
14259 qg-84565145-8c
14257 qg-e56f38c7-00
14253 qg-361cfed0-08
14252 qg-69d974c4-5c
14250 qg-52e2f0b1-34
14250 qg-2bb7a784-28
14249 qg-ddcfce33-36
14249 qg-0b6f302e-87
14248 qg-22e94934-b0
14247 qg-83754c33-3c
14247 qg-0db1e23b-dd
14246 qg-384ca19e-3a
14244 qg-26b088c2-2d
14241 qg-e479b248-68
14241 qg-7d85facc-20
14239 qg-46bc8538-77
14232 qg-22c53a2f-e2
14230 qg-8559b7f1-c6
14228 qg-23ccf949-0b
14223 qg-0fcabf84-5c
14198 qg-5c6a125e-f2
14197 qg-b89f3f92-c3
14183 qg-543ef736-26
14179 qg-2abf881e-af
14166 qg-6b9e5681-08
14162 qg-1cbe5b34-ce
14141 qg-35bf0221-d2
14089 qg-57df27ee-95
14028 qg-d1ead6fd-08
13971 qg-f4d18865-77
13832 qg-4ae14248-7b
13830 qg-feff5b1c-d4
13820 qg-10c08ef3-0c
13819 qg-cb0d8bc4-93
13819 qg-c368347c-b9
13819 qg-8bd65435-ea
13819 qg-2990c397-bd
13818 qg-7b796925-6b
13817 qg-c4aebfbb-7b
13817 qg-122a25c0-9c
13816 qg-30360910-77
13814 qg-97969e2f-b9
13814 qg-83e0932d-7a
13814 qg-5ec846ad-c2
13813 qg-36271b6d-97
13812 qg-3be17ae0-c5
13808 qg-69552fe6-88
13808 qg-137bf9c9-9d
13807 qg-9af1cb40-3f
13806 qg-bab83275-5c
13805 qg-4d1957a2-7f
13802 qg-8fe7fb1b-3b
13801 qg-eecf06ec-17
13800 qg-af34ac65-dd
13800 qg-83fe93fa-3f
13800 qg-66e66b6e-6a
13799 qg-265a827f-aa
13795 qg-59513191-c3
13795 qg-103d267d-26
13794 qg-3b2ff0fe-3f
13794 qg-3aa49939-a8
13791 qg-9c85b855-bb
13784 qg-f64811d2-4b
13778 qg-fdb48344-82
13775 qg-8cbb51de-5e
13774 qg-bd43420b-a9
13774 qg-7bd5b526-54
13770 qg-e4778e69-8f
13769 qg-b2ed3c2f-4c
13763 qg-54b920a6-8e
13761 qg-f45e01da-60
13759 qg-f95b2a6f-74
13757 qg-e4f29484-0a
13753 qg-5515d120-e8
13752 qg-f3fff380-f0
13752 qg-e3dd4e16-98
13751 qg-b5d4bded-f3
13748 qg-dbcc915a-90
13746 qg-df7e187a-83
13744 qg-c3afd50e-41
13744 qg-adec81f6-50
13744 qg-874fee56-00
13744 qg-595aa19a-7e
13744 qg-0f459ae0-84
13741 qg-c90eb60e-51
13740 qg-ff9475bc-1c
13740 qg-f2a71092-25
13740 qg-22599fd1-75
13739 qg-d068fba5-ad
13739 qg-7492bb7d-c7
13739 qg-546bed7c-91
13739 qg-444f66b1-3c
13738 qg-db665e8e-52
13738 qg-40d89ead-8e
13737 qg-a0bf0fa4-bf
13737 qg-6ff1014c-24
13736 qg-eff3812e-b1
13736 qg-e29d90b3-39
13736 qg-66a332af-22
13736 qg-15e67ba9-87
13734 qg-f9eb38d0-68
13734 qg-e6560ca2-0f
13732 qg-c155c29d-e6
13731 qg-11aeea40-b0
13730 qg-9702727b-1a
13725 qg-ddd4c341-05
13718 qg-cc26d659-55
13715 qg-1b608741-91
13708 qg-a237007e-e2
13707 qg-208f986a-a2
13604 qg-7a833dc0-8e
13395 qg-28431b7f-dc
13315 qg-d118def9-33

Just looking at the offending IP "10.8.255.254", it's the default gateway for those bridges:

./ip_netns_exec_qrouter-5c09c2c9-aa8a-430e-9cfe-15329987141d_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-86439ab9-1f
./ip_netns_exec_qrouter-5c332f62-90b2-4692-861a-a9dcf02c34f8_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-12f59f63-fe
./ip_netns_exec_qrouter-5ddf4e51-77e6-4cc9-9803-a6e97582ee77_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-bba31e3c-ef
./ip_netns_exec_qrouter-5f49f3e1-d314-4fca-8d8b-6aad553ce408_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-3665ec12-75
./ip_netns_exec_qrouter-5fb64034-3c0a-4e86-b389-5d2ec63d02dd_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ea157c4b-07
./ip_netns_exec_qrouter-6bed7f94-3133-45e0-898e-eca61b933b39_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-c15b0f85-28
./ip_netns_exec_qrouter-6c8bc503-aea8-403a-a2fe-661130f19199_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-da6e9966-97
./ip_netns_exec_qrouter-6c44fd76-7691-4525-941c-78a4388a2b32_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-beaefb67-91
./ip_netns_exec_qrouter-6c57b7f2-9d10-488a-bf8a-02c9a1f2f346_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-2c3dc165-0f
./ip_netns_exec_qrouter-6c43978e-50af-4375-bd3b-241c44d243d6_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-64529548-00
./ip_netns_exec_qrouter-6e4f9544-2390-4a2c-81cd-60b501152c28_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-5e776593-0b
./ip_netns_exec_qrouter-6ed8fe47-b9c7-4051-a43c-0075aafb9f60_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ab65ab61-5f
./ip_netns_exec_qrouter-7b041ff9-dddc-4d11-bafe-82240b9c1f92_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-0c278b9f-54
./ip_netns_exec_qrouter-7c0ad7ac-702a-4f10-ab9c-ca19f9b3a851_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ed5e6f9a-69
./ip_netns_exec_qrouter-7cbb333c-4e5a-4e9f-a2af-53d5cb79677b_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-4df186a5-58
./ip_netns_exec_qrouter-7e989393-9ee5-4aa6-ab8d-20d8d4c57e3f_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-d87d0217-a9
./ip_netns_exec_qrouter-8a64d973-03da-4300-8057-689b798ee72e_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-3b981cbc-04
./ip_netns_exec_qrouter-8ad70288-6ccd-4b03-8bf2-7ffb62e96b7a_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-638d415b-64
./ip_netns_exec_qrouter-8b0727e2-5ec1-4601-aff7-bf6b9b747b0b_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ded1538c-32
./ip_netns_exec_qrouter-8d72f7c6-4de9-4a40-879a-bed62a9c75b7_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-7f45ac2c-e2
./ip_netns_exec_qrouter-60a4e9c7-5d00-4fc2-aed9-bddb86da9cf5_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-a03bd6ba-31
./ip_netns_exec_qrouter-60aa8692-0a28-4bcc-9832-4391d341df7a_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-224c0b73-cc
./ip_netns_exec_qrouter-64d96bfe-5cdd-4049-ac05-b4a029e65555_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-d6df8bc2-15
./ip_netns_exec_qrouter-69f4cc16-d6a5-4e08-9177-200b8d9cdbb0_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-4826353f-c1
./ip_netns_exec_qrouter-71a059c8-e2f5-49af-85f8-c5ca7b862b5d_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-6f95d44f-4c
./ip_netns_exec_qrouter-74fbd52c-3d34-4ec1-bf9d-b19cca055695_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-e6a598f3-a6
./ip_netns_exec_qrouter-75ac2846-b1e8-46fa-b9a3-782accaa2d1c_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-8230930b-95
./ip_netns_exec_qrouter-79f00d6a-3196-4fc3-aa57-8685a5d54f22_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-dfbaa6d0-ec
./ip_netns_exec_qrouter-79f2c073-f64d-4c27-8946-6fb83ee34d4f_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-b99bcb4c-5f
./ip_netns_exec_qrouter-80f535dc-5c8d-4085-a61e-4a85e559e190_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-7333f82e-d0
./ip_netns_exec_qrouter-87f12d93-149d-41d4-b8b7-df7a14bd0e0f_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-72792747-70
./ip_netns_exec_qrouter-607fdbe5-3fcb-4a48-bdaa-6ec7726684e6_ip_route_show_table_all:2:default via 10.8.255.254 dev qg-6b748d98-f7
./ip_netns_exec_qrouter-638e3b23-8ad7-48ce-a060-b9b12ef1cac3_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-fff75d9b-a7
./ip_netns_exec_qrouter-750e5071-462e-4815-9a4e-008a38ab4b6f_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-3e7bca60-96
./ip_netns_exec_qrouter-6327b790-2793-40de-8d34-8d86c7fe2167_ip_route_show_table_all:2:default via 10.8.255.254 dev qg-7ecf1a19-d8
./ip_netns_exec_qrouter-7329d473-49a6-4a08-b98b-c3726493e088_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-64e636e6-31
./ip_netns_exec_qrouter-61146b1e-2fed-4027-8d89-163f0d4e0692_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-0aab2c63-db
./ip_netns_exec_qrouter-77460b71-cd0b-4d0e-9a1f-fee5e989a7f3_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ff2cfc2c-f1
./ip_netns_exec_qrouter-78362e05-d861-47ea-8442-506cc9d18f4b_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-051df8cf-4b
./ip_netns_exec_qrouter-747135c4-4517-4ccd-bd01-6e1427cb5c7d_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-e87567ea-75

Did the default gateway go down? Why are those interfaces generating so many ARP requests?

Created attachment 1413149 [details]
Raw analysis of perf data
Adding some clarification: the IP address is the SOURCE IP of the ARP packets. So, in this case, the router seems to generate a lot of ARP packets (or they are being looped).

Thanks Eelco, we're looking into the issue together with the IT/Network team, and they have opened a support case with Juniper.

I quickly looked at the traces supplied, and this time I see only +/- 3K packets for 10.8.255.254 out of the 337K, so +/- 1% (taken in 30 seconds).

There is still a lot of ARP going on, and most of it is gratuitous ARPs. For example, with the following filter, arp.src.proto_ipv4==172.31.0.1, I see 4360 packets in 31 seconds. I assume I see multiple tunnel endpoints. However, I also see two MACs trying to use the same IP here (open the trace with Wireshark and it will tell you which ones).

Is this some kind of redundancy at the OSP level? It still does not look like an OVS problem, but this could be the effect of an earlier problem causing some redundancy mechanism to kick in and starve the kernel with excessive ARPs.

Eduard, please further analyze the traces and see if something is wrong at a higher level than OVS, i.e. why we see so many (gratuitous) ARP packets.

(In reply to Eelco Chaudron from comment #41)
> I quickly looked at the traces supplied, and this time I see only +/- 3K
> packets for 10.8.255.254 out of the 337k, so +/- 1% (taken in 30 seconds).
>
> There is still a lot of arp going on, most of them are gratuitous ARPs.
> For example with the following filter, arp.src.proto_ipv4==172.31.0.1, I see
> 4360 packets in 31 seconds. I assume I see multiple tunnel endpoints.
> However, I also see two macs trying to use the same IP here (open the trace
> with wireshark and it will tell you which ones).
>
> Is this some kind of redundancy at the OSP level? It still does not look
> like an OVS problem, but this could be the effect of an earlier problem
> causing some redundancy to kick and is starving the kernel due to excessive
> ARPs?
> Eduard please further analyze the traces and see if something is wrong at a
> higher level than OVS, i.e. why we see so many (gratuitous) ARP packets.

The 172.31.0.1 looks like it was the default gateway for the qdhcp-23d8767d-0f7c-4fb9-b182-08bde4026ccc namespace:

ip_netns_exec_qdhcp-23d8767d-0f7c-4fb9-b182-08bde4026ccc ip_route_show_table_all:default via 172.31.0.1 dev tap400998b6-0f

Trying to get some extra details about it.

This IP was just an example; there are a lot of IPs in the traces that show potential duplicate MACs sharing IPs. In addition, was it ever researched why you see a lot of duplicate IPv6 addresses? See the first part of comment #36. Maybe there is also some kind of loop issue there.

The following BZ 1558336 looks similar. The actual symptom seems different, but they see the same kind of ARP behavior.

Closing BZ for now due to lack of response. Please reopen if needed when additional information is available.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.
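One way to quantify gratuitous ARPs without opening the capture in Wireshark is to scan tcpdump -nn text output: a gratuitous ARP request has the sender IP equal to the target IP, printed as "who-has X tell X". A minimal sketch on hypothetical sample lines (the addresses and timestamps below are made up, and the field positions assume tcpdump's "ARP, Request who-has <target> tell <sender>, length N" format):

```shell
# Hypothetical tcpdump -nn style ARP lines, not from the customer capture.
cat > /tmp/arp_sample.txt <<'EOF'
12:00:00.000001 ARP, Request who-has 172.31.0.1 tell 172.31.0.1, length 46
12:00:00.000002 ARP, Request who-has 10.8.0.5 tell 10.8.255.254, length 46
12:00:00.000003 ARP, Request who-has 172.31.0.1 tell 172.31.0.1, length 46
EOF
# Count requests where target ($5) equals sender ($7, trailing comma stripped),
# i.e. gratuitous ARPs.
awk '/ARP, Request who-has/ { t=$5; s=$7; sub(/,$/, "", s); if (t == s) c++ }
     END { print c+0 }' /tmp/arp_sample.txt
# prints 2 for the sample above
```

On a real capture this would be fed from something like tcpdump -nn -r trace.pcap arp, giving a quick gratuitous-vs-total ratio.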