Bug 1509994 - vswitchd high cpu usage and various warnings on openvswitch logs
Summary: vswitchd high cpu usage and various warnings on openvswitch logs
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Eelco Chaudron
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-06 14:24 UTC by Eduard Barrera
Modified: 2024-06-13 20:49 UTC
CC List: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-29 14:30:22 UTC
Target Upstream Version:
Embargoed:


Attachments
RAW analysis of perf data (5.85 MB, application/x-gzip)
2018-03-26 13:07 UTC, Eelco Chaudron
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1526306 0 high CLOSED ovs-vswitchd service hangs with Error too many open files 2022-03-13 14:35:40 UTC
Red Hat Issue Tracker OSP-4759 0 None None None 2022-08-16 11:43:40 UTC

Internal Links: 1526306

Description Eduard Barrera 2017-11-06 14:24:43 UTC
Description of problem:

Sometimes openvswitch starts consuming too much CPU and various warnings appear in the logs. A small sample of these log entries is:


2017-11-06T08:02:17.682Z|01575|timeval(revalidator386)|WARN|faults: 118 minor, 0 major
2017-11-06T08:02:17.682Z|01576|timeval(revalidator386)|WARN|context switches: 2054 voluntary, 3 involuntary
2017-11-06T08:02:17.682Z|02262|ofproto_dpif_upcall(revalidator379)|INFO|Spent an unreasonably long 2146ms dumping flows
2017-11-06T08:02:18.061Z|44872|ovs_rcu|WARN|blocked 1001 ms waiting for handler340 to quiesce
2017-11-06T08:02:18.061Z|05555|ovs_rcu(urcu11)|WARN|blocked 1001 ms waiting for handler340 to quiesce
2017-11-06T08:02:18.152Z|05133|timeval(handler340)|WARN|Unreasonably long 1259ms poll interval (0ms user, 692ms system)
2017-11-06T08:02:18.152Z|05134|timeval(handler340)|WARN|faults: 47 minor, 0 major
2017-11-06T08:02:18.152Z|05135|timeval(handler340)|WARN|context switches: 554 voluntary, 1 involuntary
2017-11-06T08:02:18.346Z|01454|poll_loop(revalidator389)|INFO|Dropped 169 log messages in last 13669 seconds (most recently, 13666 seconds ago) due to excessive rate
2017-11-06T08:02:18.347Z|01455|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.353Z|01456|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.354Z|01457|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.361Z|01458|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.361Z|01459|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.367Z|01460|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.367Z|01461|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.368Z|01462|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.373Z|01463|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:18.382Z|01464|poll_loop(revalidator389)|INFO|wakeup due to [POLLIN] on fd 126 (FIFO pipe:[89884]) at lib/ovs-thread.c:306 (52% CPU usage)
2017-11-06T08:02:19.753Z|01349|timeval(revalidator391)|WARN|Unreasonably long 1371ms poll interval (0ms user, 853ms system)
2017-11-06T08:02:19.753Z|01350|timeval(revalidator391)|WARN|context switches: 90 voluntary, 0 involuntary


Version-Release number of selected component (if applicable):
OSP 10


How reproducible:
from time to time on different controller nodes


Steps to Reproduce:
1. Not sure
2.
3.

Comment 3 Eelco Chaudron 2017-11-09 12:52:26 UTC
Can you get me some more details, as my crystal ball needs repair :)

- What version of OVS are you using?
- OVS configuration, i.e. DPDK/kernel datapath?
- How many, and what kind of OpenFlow rules are installed?
- How many datapath rules are installed, especially around the time of the warnings? A dump would help here.
- The full log file

At a quick glance, the revalidator379 and handler340 threads are working on something that takes a long time. The information requested above might be a start; maybe you can throw in an SOS report?

Comment 5 Eelco Chaudron 2017-11-09 15:36:52 UTC
Looking at the logs and configuration from the SOS reports, it looks like OVS is spending time waiting on the kernel for Netlink-related messages/data.

The "Unreasonably long 1259ms poll interval" messages are probably caused by netlink messages taking a long time to retrieve; there is not much else going on in the revalidator threads that takes time.

The "ofproto_dpif_upcall(revalidator379)|INFO|Spent an unreasonably long 2146ms dumping flows"-style messages are a result of the above, as OVS expects all revalidator threads working collectively on a run to finish in time. Having two of these extreme poll intervals in a single thread would cause this (>2000ms).

Is there anything else going on in the system that could explain a long latency in the kernel for netlink related messages?

Comment 7 Eelco Chaudron 2017-11-16 13:10:55 UTC
I've been trying to simulate the issue you are seeing on the customer's system,
but I'm not getting anywhere close to the CPU utilization or "unreasonably long
time spent" messages in a normal scenario. However, I can see a similar issue
when the kernel is stealing cycles from the process, for example, due to softirqs.

So, to get a better understanding of what is happening on this specific
system, I would like to get some perf statistics. Below is an example of what to
collect. Note that the perf data file is around 2G in size.


- Get the following when you detect high CPU utilization and/or "unreasonably long time spent" messages:

  pidstat -u -t -p `pidof ovs-vswitchd`,`pidof ovsdb-server` 5 12 > ovs_pidstat.txt &
  perf record -p `pidof ovs-vswitchd` -g --call-graph dwarf sleep 60


- Once you have collected the data do the following:

  mv perf.data perf.data_openvswitch
  perf script -F tid -i perf.data_openvswitch | sort -u | grep -o '[0-9]*' | \
    xargs -n1 -I{} \
      perf report -i perf.data_openvswitch --no-children \
        --percentage relative --stdio --tid {} -g none > perf_reports.txt

- Gather the following files (and let me know the previous SOS report that matches):

  perf.data_openvswitch
  perf_reports.txt
  ovs_pidstat.txt
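
If it is easier, the capture could be wrapped in a small helper that waits for the
high-CPU condition and then runs the exact commands above. This is only a sketch;
the 80% threshold, the /proc-based CPU estimate, and the polling interval are my
assumptions, not part of the request:

  #!/bin/bash
  # Illustrative helper: wait until ovs-vswitchd (all threads) exceeds a CPU
  # threshold, then run the pidstat/perf capture described above.
  THRESHOLD=80                      # percent, arbitrary example
  PID=$(pidof ovs-vswitchd)
  CLK_TCK=$(getconf CLK_TCK)

  cpu_pct() {
      # utime+stime delta over one second, from /proc/<pid>/stat fields 14/15.
      local t0 t1
      t0=$(awk '{print $14 + $15}' "/proc/$PID/stat")
      sleep 1
      t1=$(awk '{print $14 + $15}' "/proc/$PID/stat")
      echo $(( (t1 - t0) * 100 / CLK_TCK ))
  }

  until [ "$(cpu_pct)" -ge "$THRESHOLD" ]; do
      sleep 5
  done

  pidstat -u -t -p "$(pidof ovs-vswitchd)","$(pidof ovsdb-server)" 5 12 > ovs_pidstat.txt &
  perf record -p "$(pidof ovs-vswitchd)" -g --call-graph dwarf sleep 60
  mv perf.data perf.data_openvswitch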

Comment 9 Eelco Chaudron 2017-12-14 14:04:52 UTC
Closing this BZ for now; please re-open when the requested information is available.

Comment 21 Eelco Chaudron 2018-01-23 14:12:32 UTC
I analyzed the perf data that was sent, and I see that the kernel is spending a decent amount of time processing new ARP entries for received packets. I see multiple threads competing for the lock to add a new neighbor entry, causing the high CPU utilization in the OVS threads (and potentially others).

Can you investigate to see if a lot of ARP activity is going on on this node?
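
A few quick, non-intrusive checks along these lines could help answer that (a sketch only; the 10-second window, interface selection, and output parsing are assumptions and may need adjusting on the node):

  # Rough count of ARP packets seen in ~10 seconds on all interfaces:
  timeout 10 tcpdump -i any -nn arp 2>/dev/null | wc -l

  # Top ARP sender IPs in the same window (parsing assumes the default
  # "Request who-has X tell Y" tcpdump output):
  timeout 10 tcpdump -i any -nn arp 2>/dev/null \
    | grep -o 'Request who-has [0-9.]* tell [0-9.]*' \
    | awk '{print $NF}' | sort | uniq -c | sort -rn | head

  # Quick look at the neighbour table and the per-CPU ARP cache counters
  # (entries, allocations, lookups, GC runs):
  ip -s neigh show | head
  cat /proc/net/stat/arp_cache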

Comment 28 Eelco Chaudron 2018-03-12 12:39:55 UTC
Eduard, we have been waiting for over a month on comment 24, i.e. the troubleshooting to figure out why this looks like an ARP loop.


If you take the perf traces again, as in comment 17, I can quickly see if this is the same ARP-related issue.

Comment 35 Eelco Chaudron 2018-03-26 12:58:14 UTC
First I want to clarify that I was not aware that the way to recover from this
was to reboot the nodes!

I concentrated my research on node 0. Looking at the messages log first, I see a
lot of IPv6 duplicate address detection, which once again makes me believe
there is some kind of network loop.

$ pwd
/cases/01967187/sosreport-20180321-144239/overcloud-neutronnetwork-0.localdomain

$ grep "IPv6 duplicate address" messages | more
Mar 21 13:15:02 overcloud-neutronnetwork-0 kernel: IPv6: qg-66e66b6e-6a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe5a:5667 detected!
Mar 21 13:15:02 overcloud-neutronnetwork-0 kernel: IPv6: qg-b2ed3c2f-4c: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe12:e856 detected!
Mar 21 13:20:31 overcloud-neutronnetwork-0 kernel: IPv6: qg-66e66b6e-6a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe5a:5667 detected!
Mar 21 13:20:32 overcloud-neutronnetwork-0 kernel: IPv6: qg-b2ed3c2f-4c: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe12:e856 detected!
Mar 21 13:29:51 overcloud-neutronnetwork-0 kernel: IPv6: qg-66e66b6e-6a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe5a:5667 detected!
Mar 21 13:29:52 overcloud-neutronnetwork-0 kernel: IPv6: qg-b2ed3c2f-4c: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe12:e856 detected!
Mar 21 13:36:47 overcloud-neutronnetwork-0 kernel: IPv6: qg-66e66b6e-6a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe5a:5667 detected!
Mar 21 13:36:47 overcloud-neutronnetwork-0 kernel: IPv6: qg-b2ed3c2f-4c: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe12:e856 detected!
Mar 21 13:41:02 overcloud-neutronnetwork-0 kernel: IPv6: qg-e4f29484-0a: IPv6 duplicate address 2620:52:0:8f0:f816:3eff:fe71:17b5 detected!
...

$ grep "IPv6 duplicate address" messages | wc -l
443
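
The same log can be broken down per interface and address to see which qg- ports
are affected (the awk field positions assume the syslog layout shown above):

$ grep "IPv6 duplicate address" messages | awk '{print $7, $11}' | sort | uniq -c | sort -rn | head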

Back to the perf data. First, you can still see that most of the OVS time is
spent in kernel space:

02:11:36 PM     0         -      3926    0.60   14.60    0.00   15.20    28  |__handler353
02:11:36 PM     0         -      3944    0.80   69.40    0.00   70.20    42  |__handler371
02:11:36 PM     0         -      3952    0.80   10.40    0.00   11.20    50  |__revalidator379
02:11:36 PM     0         -      3953    0.40    5.00    0.00    5.40    32  |__revalidator380
02:11:36 PM     0         -      3954    0.60    8.80    0.00    9.40     2  |__revalidator381
02:11:36 PM     0         -      3955    0.60    7.20    0.00    7.80     8  |__revalidator382
02:11:36 PM     0         -      3956    0.80    7.60    0.00    8.40    16  |__revalidator383
02:11:36 PM     0         -      3957    0.60   10.20    0.00   10.80    10  |__revalidator384
02:11:36 PM     0         -      3958    0.80    8.00    0.00    8.80    14  |__revalidator385
02:11:36 PM     0         -      3959    0.60    8.60    0.00    9.20    36  |__revalidator386
02:11:36 PM     0         -      3960    0.80    9.60    0.00   10.40    44  |__revalidator387
02:11:36 PM     0         -      3961    0.80   10.00    0.00   10.80    40  |__revalidator388
02:11:36 PM     0         -      3962    0.60    6.20    0.00    6.80    22  |__revalidator389
02:11:36 PM     0         -      3963    0.60    5.20    0.00    5.80     6  |__revalidator390
02:11:36 PM     0         -      3964    1.00   11.20    0.00   12.20     2  |__revalidator391
02:11:36 PM     0         -      3965    0.80    7.20    0.00    8.00    46  |__revalidator392

Looking at the perf data, it is indeed the kernel stealing time from OVS. As
before, it is fighting for the ARP neighbor table lock. For example, handler371:


40.30%  handler371  [kernel.vmlinux]        [k] __write_lock_failed
            |
            ---__write_lock_failed
               |
                --24.20%--_raw_write_lock_bh
                          __neigh_create

37.87%  handler371  [kernel.vmlinux]        [k] netlink_broadcast_filtered
            |
            ---netlink_broadcast_filtered
               |
                --22.75%--nlmsg_notify
                          rtnl_notify
                          __neigh_notify
                          |
                          |--11.58%--neigh_cleanup_and_release
                          |          __neigh_create
                          |          arp_process
                          |          arp_rcv


After writing a script to analyze the perf data taken with the probes, I found
no single IP that stood out in taking the lock. However, there are still 2M+ ARP
packets processed in the 60-second capture, so about 33K a second :(

From the script output, collecting all ARPs together, here is the top
(cat perf_arp_process.txt | sort -n | uniq -c | sort -n -r > arps_sort_ip.txt);
an equivalent per-bridge one-liner is sketched at the end of this comment:


  14313 - arp_process() on IP 0.119.192.127   for bridge qg-ac146400-2c
   7329 - arp_process() on IP 10.8.255.254    for bridge qg-90938aa3-9b
   7328 - arp_process() on IP 10.8.255.254    for bridge qg-49eed100-0e
   7324 - arp_process() on IP 10.8.255.254    for bridge qg-54b18a5f-9b
   7323 - arp_process() on IP 10.8.255.254    for bridge qg-60434a2b-94	
   7322 - arp_process() on IP 10.8.255.254    for bridge qg-643e4b88-ef
   7321 - arp_process() on IP 10.8.255.254    for bridge qg-3f2288dc-79
   7320 - arp_process() on IP 10.8.255.254    for bridge qg-fff63afd-27
   7320 - arp_process() on IP 10.8.255.254    for bridge qg-8559b7f1-c6
   7319 - arp_process() on IP 10.8.255.254    for bridge qg-6355e480-5e
   7318 - arp_process() on IP 10.8.255.254    for bridge qg-e1935639-b4
   7318 - arp_process() on IP 10.8.255.254    for bridge qg-b416607f-03
   7318 - arp_process() on IP 10.8.255.254    for bridge qg-8b6dd0bd-be
   7318 - arp_process() on IP 10.8.255.254    for bridge qg-84565145-8c
   7317 - arp_process() on IP 10.8.255.254    for bridge qg-e10cac61-96
   7317 - arp_process() on IP 10.8.255.254    for bridge qg-ddcfce33-36
   7316 - arp_process() on IP 10.8.255.254    for bridge qg-61deed1f-65
   7316 - arp_process() on IP 10.8.255.254    for bridge qg-184ce8a6-cf
   7316 - arp_process() on IP 10.8.255.254    for bridge qg-0b20756a-e5
   7315 - arp_process() on IP 10.8.255.254    for bridge qg-f4d18865-77
   7313 - arp_process() on IP 10.8.255.254    for bridge qg-a03d187d-2b
   7313 - arp_process() on IP 10.8.255.254    for bridge qg-46bc8538-77
   7311 - arp_process() on IP 10.8.255.254    for bridge qg-9e821e61-f5
   7311 - arp_process() on IP 10.8.255.254    for bridge qg-83754c33-3c
   7310 - arp_process() on IP 10.8.255.254    for bridge qg-2bb7a784-28
   7309 - arp_process() on IP 10.8.255.254    for bridge qg-af6068b2-b2
   7308 - arp_process() on IP 10.8.255.254    for bridge qg-8ec08f18-e0
   7308 - arp_process() on IP 10.8.255.254    for bridge qg-02c639c2-5a
   7307 - arp_process() on IP 10.8.255.254    for bridge qg-6816025b-5a
   7306 - arp_process() on IP 10.8.255.254    for bridge qg-e56f38c7-00
   7305 - arp_process() on IP 10.8.255.254    for bridge qg-e479b248-68
   7304 - arp_process() on IP 10.8.255.254    for bridge qg-77cb3d4d-75
   7302 - arp_process() on IP 10.8.255.254    for bridge qg-361cfed0-08
   7302 - arp_process() on IP 10.8.255.254    for bridge qg-23ccf949-0b
   7301 - arp_process() on IP 10.8.255.254    for bridge qg-b51c6064-b4
   7301 - arp_process() on IP 10.8.255.254    for bridge qg-69d974c4-5c
   7301 - arp_process() on IP 10.8.255.254    for bridge qg-22c53a2f-e2
   7299 - arp_process() on IP 10.8.255.254    for bridge qg-d753f577-d6
   7299 - arp_process() on IP 10.8.255.254    for bridge qg-52e2f0b1-34
   7295 - arp_process() on IP 10.8.255.254    for bridge qg-0b6f302e-87
   7294 - arp_process() on IP 10.8.255.254    for bridge qg-26b088c2-2d
   7292 - arp_process() on IP 10.8.255.254    for bridge qg-39304955-3e
   7291 - arp_process() on IP 10.8.255.254    for bridge qg-0fcabf84-5c
   7285 - arp_process() on IP 10.8.255.254    for bridge qg-22e94934-b0
   7284 - arp_process() on IP 10.8.255.254    for bridge qg-5c6a125e-f2
   7283 - arp_process() on IP 10.8.255.254    for bridge qg-e4c6fc0f-32
   7283 - arp_process() on IP 10.8.255.254    for bridge qg-40ed7eca-fb
   7282 - arp_process() on IP 10.8.255.254    for bridge qg-35bf0221-d2
   7280 - arp_process() on IP 10.8.255.254    for bridge qg-d047bd16-61
   7280 - arp_process() on IP 10.8.255.254    for bridge qg-b89f3f92-c3
   7280 - arp_process() on IP 10.8.255.254    for bridge qg-7d85facc-20
   7276 - arp_process() on IP 10.8.255.254    for bridge qg-6b9e5681-08
   7273 - arp_process() on IP 10.8.255.254    for bridge qg-8601a87e-db
   7271 - arp_process() on IP 10.8.255.254    for bridge qg-1cbe5b34-ce
   7265 - arp_process() on IP 10.8.255.254    for bridge qg-0db1e23b-dd
   7264 - arp_process() on IP 10.8.255.254    for bridge qg-543ef736-26
   7263 - arp_process() on IP 10.8.255.254    for bridge qg-57df27ee-95
   7256 - arp_process() on IP 10.8.255.254    for bridge qg-2abf881e-af
   7249 - arp_process() on IP 10.8.255.254    for bridge qg-d1ead6fd-08
   7235 - arp_process() on IP 10.8.255.254    for bridge qg-384ca19e-3a
   7156 - arp_process() on IP 10.8.255.254    for bridge qg-c368347c-b9
   7154 - arp_process() on IP 10.8.255.254    for bridge qg-2990c397-bd
   7153 - arp_process() on IP 10.8.255.254    for bridge qg-c4aebfbb-7b
   7152 - arp_process() on IP 10.8.255.254    for bridge qg-feff5b1c-d4
   7151 - arp_process() on IP 10.8.255.254    for bridge qg-cb0d8bc4-93
   7150 - arp_process() on IP 10.8.255.254    for bridge qg-fdb48344-82
   7150 - arp_process() on IP 10.8.255.254    for bridge qg-4ae14248-7b
   7148 - arp_process() on IP 10.8.255.254    for bridge qg-8cbb51de-5e
   7146 - arp_process() on IP 10.8.255.254    for bridge qg-d118def9-33
   7145 - arp_process() on IP 10.8.255.254    for bridge qg-83fe93fa-3f
   7142 - arp_process() on IP 10.8.255.254    for bridge qg-97969e2f-b9
   7141 - arp_process() on IP 10.8.255.254    for bridge qg-10c08ef3-0c
   7140 - arp_process() on IP 10.8.255.254    for bridge qg-f64811d2-4b
   7138 - arp_process() on IP 10.8.255.254    for bridge qg-83e0932d-7a
   7136 - arp_process() on IP 10.8.255.254    for bridge qg-bab83275-5c
   7136 - arp_process() on IP 10.8.255.254    for bridge qg-595aa19a-7e
   7135 - arp_process() on IP 10.8.255.254    for bridge qg-122a25c0-9c
   7134 - arp_process() on IP 10.8.255.254    for bridge qg-bd43420b-a9
   7134 - arp_process() on IP 10.8.255.254    for bridge qg-7a833dc0-8e
   7133 - arp_process() on IP 10.8.255.254    for bridge qg-444f66b1-3c
   7131 - arp_process() on IP 10.8.255.254    for bridge qg-7bd5b526-54
   7131 - arp_process() on IP 10.8.255.254    for bridge qg-54b920a6-8e
   7130 - arp_process() on IP 10.8.255.254    for bridge qg-f3fff380-f0
   7128 - arp_process() on IP 10.8.255.254    for bridge qg-db665e8e-52
   7127 - arp_process() on IP 10.8.255.254    for bridge qg-adec81f6-50
   7123 - arp_process() on IP 10.8.255.254    for bridge qg-b2ed3c2f-4c
   7123 - arp_process() on IP 10.8.255.254    for bridge qg-9af1cb40-3f
   7123 - arp_process() on IP 10.8.255.254    for bridge qg-1b608741-91
   7122 - arp_process() on IP 10.8.255.254    for bridge qg-d068fba5-ad
   7122 - arp_process() on IP 10.8.255.254    for bridge qg-c3afd50e-41
   7120 - arp_process() on IP 10.8.255.254    for bridge qg-874fee56-00
   7119 - arp_process() on IP 10.8.255.254    for bridge qg-e3dd4e16-98
   7117 - arp_process() on IP 10.8.255.254    for bridge qg-f95b2a6f-74
   7113 - arp_process() on IP 10.8.255.254    for bridge qg-5ec846ad-c2
   7112 - arp_process() on IP 10.8.255.254    for bridge qg-ff9475bc-1c
   7112 - arp_process() on IP 10.8.255.254    for bridge qg-22599fd1-75
   7111 - arp_process() on IP 10.8.255.254    for bridge qg-6ff1014c-24
   7110 - arp_process() on IP 10.8.255.254    for bridge qg-9702727b-1a
   7105 - arp_process() on IP 10.8.255.254    for bridge qg-f2a71092-25
   7105 - arp_process() on IP 10.8.255.254    for bridge qg-8bd65435-ea
   7105 - arp_process() on IP 10.8.255.254    for bridge qg-4d1957a2-7f
   7104 - arp_process() on IP 10.8.255.254    for bridge qg-15e67ba9-87
   7104 - arp_process() on IP 10.8.255.254    for bridge qg-137bf9c9-9d
   7103 - arp_process() on IP 10.8.255.254    for bridge qg-cc26d659-55
   7103 - arp_process() on IP 10.8.255.254    for bridge qg-28431b7f-dc
   7102 - arp_process() on IP 10.8.255.254    for bridge qg-f9eb38d0-68
   7102 - arp_process() on IP 10.8.255.254    for bridge qg-69552fe6-88
   7101 - arp_process() on IP 10.8.255.254    for bridge qg-a237007e-e2
   7101 - arp_process() on IP 10.8.255.254    for bridge qg-208f986a-a2
   7101 - arp_process() on IP 10.8.255.254    for bridge qg-0f459ae0-84
   7099 - arp_process() on IP 10.8.255.254    for bridge qg-3b2ff0fe-3f
   7098 - arp_process() on IP 10.8.255.254    for bridge qg-66a332af-22
   7097 - arp_process() on IP 10.8.255.254    for bridge qg-5515d120-e8
   7096 - arp_process() on IP 10.8.255.254    for bridge qg-e29d90b3-39
   7095 - arp_process() on IP 10.8.255.254    for bridge qg-f45e01da-60
   7094 - arp_process() on IP 10.8.255.254    for bridge qg-eff3812e-b1
   7094 - arp_process() on IP 10.8.255.254    for bridge qg-ddd4c341-05
   7094 - arp_process() on IP 10.8.255.254    for bridge qg-7b796925-6b
   7093 - arp_process() on IP 10.8.255.254    for bridge qg-dbcc915a-90
   7092 - arp_process() on IP 10.8.255.254    for bridge qg-c155c29d-e6
   7092 - arp_process() on IP 10.8.255.254    for bridge qg-103d267d-26
   7091 - arp_process() on IP 10.8.255.254    for bridge qg-df7e187a-83
   7090 - arp_process() on IP 10.8.255.254    for bridge qg-e4f29484-0a
   7090 - arp_process() on IP 10.8.255.254    for bridge qg-b5d4bded-f3
   7090 - arp_process() on IP 10.8.255.254    for bridge qg-8fe7fb1b-3b
   7088 - arp_process() on IP 10.8.255.254    for bridge qg-36271b6d-97
   7085 - arp_process() on IP 10.8.255.254    for bridge qg-eecf06ec-17
   7085 - arp_process() on IP 10.8.255.254    for bridge qg-265a827f-aa
   7084 - arp_process() on IP 10.8.255.254    for bridge qg-546bed7c-91
   7083 - arp_process() on IP 10.8.255.254    for bridge qg-66e66b6e-6a
   7083 - arp_process() on IP 10.8.255.254    for bridge qg-3be17ae0-c5
   7082 - arp_process() on IP 10.8.255.254    for bridge qg-e4778e69-8f
   7077 - arp_process() on IP 10.8.255.254    for bridge qg-7492bb7d-c7
   7076 - arp_process() on IP 10.8.255.254    for bridge qg-a0bf0fa4-bf
   7069 - arp_process() on IP 10.8.255.254    for bridge qg-af34ac65-dd
   7068 - arp_process() on IP 10.8.255.254    for bridge qg-e6560ca2-0f
   7065 - arp_process() on IP 10.8.255.254    for bridge qg-11aeea40-b0
   7062 - arp_process() on IP 10.8.255.254    for bridge qg-c90eb60e-51
   7060 - arp_process() on IP 10.8.255.254    for bridge qg-3aa49939-a8
   7059 - arp_process() on IP 10.8.255.254    for bridge qg-30360910-77
   7055 - arp_process() on IP 10.8.255.254    for bridge qg-40d89ead-8e
   6872 - arp_process() on IP 10.8.255.254    for bridge qg-9c85b855-bb
   6833 - arp_process() on IP 10.8.255.254    for bridge qg-59513191-c3
   4024 - arp_process() on IP 10.8.255.254    for bridge qg-6015eff6-2c

As you can see, it's almost all for 10.8.255.254!!!
The same top list, sorted by bridge:

  14384 qg-49eed100-0e
  14380 qg-d047bd16-61
  14380 qg-ac146400-2c
  14379 qg-8b6dd0bd-be
  14375 qg-a03d187d-2b
  14375 qg-77cb3d4d-75
  14374 qg-90938aa3-9b
  14373 qg-61deed1f-65
  14372 qg-d753f577-d6
  14372 qg-9e821e61-f5
  14372 qg-60434a2b-94
  14371 qg-b51c6064-b4
  14371 qg-8ec08f18-e0
  14371 qg-6816025b-5a
  14370 qg-fff63afd-27
  14370 qg-184ce8a6-cf
  14370 qg-0b20756a-e5
  14368 qg-e1935639-b4
  14368 qg-02c639c2-5a
  14366 qg-af6068b2-b2
  14365 qg-b416607f-03
  14364 qg-e10cac61-96
  14357 qg-6355e480-5e
  14355 qg-39304955-3e
  14354 qg-54b18a5f-9b
  14348 qg-e4c6fc0f-32
  14337 qg-643e4b88-ef
  14323 qg-3f2288dc-79
  14321 qg-8601a87e-db
  14288 qg-40ed7eca-fb
  14259 qg-84565145-8c
  14257 qg-e56f38c7-00
  14253 qg-361cfed0-08
  14252 qg-69d974c4-5c
  14250 qg-52e2f0b1-34
  14250 qg-2bb7a784-28
  14249 qg-ddcfce33-36
  14249 qg-0b6f302e-87
  14248 qg-22e94934-b0
  14247 qg-83754c33-3c
  14247 qg-0db1e23b-dd
  14246 qg-384ca19e-3a
  14244 qg-26b088c2-2d
  14241 qg-e479b248-68
  14241 qg-7d85facc-20
  14239 qg-46bc8538-77
  14232 qg-22c53a2f-e2
  14230 qg-8559b7f1-c6
  14228 qg-23ccf949-0b
  14223 qg-0fcabf84-5c
  14198 qg-5c6a125e-f2
  14197 qg-b89f3f92-c3
  14183 qg-543ef736-26
  14179 qg-2abf881e-af
  14166 qg-6b9e5681-08
  14162 qg-1cbe5b34-ce
  14141 qg-35bf0221-d2
  14089 qg-57df27ee-95
  14028 qg-d1ead6fd-08
  13971 qg-f4d18865-77
  13832 qg-4ae14248-7b
  13830 qg-feff5b1c-d4
  13820 qg-10c08ef3-0c
  13819 qg-cb0d8bc4-93
  13819 qg-c368347c-b9
  13819 qg-8bd65435-ea
  13819 qg-2990c397-bd
  13818 qg-7b796925-6b
  13817 qg-c4aebfbb-7b
  13817 qg-122a25c0-9c
  13816 qg-30360910-77
  13814 qg-97969e2f-b9
  13814 qg-83e0932d-7a
  13814 qg-5ec846ad-c2
  13813 qg-36271b6d-97
  13812 qg-3be17ae0-c5
  13808 qg-69552fe6-88
  13808 qg-137bf9c9-9d
  13807 qg-9af1cb40-3f
  13806 qg-bab83275-5c
  13805 qg-4d1957a2-7f
  13802 qg-8fe7fb1b-3b
  13801 qg-eecf06ec-17
  13800 qg-af34ac65-dd
  13800 qg-83fe93fa-3f
  13800 qg-66e66b6e-6a
  13799 qg-265a827f-aa
  13795 qg-59513191-c3
  13795 qg-103d267d-26
  13794 qg-3b2ff0fe-3f
  13794 qg-3aa49939-a8
  13791 qg-9c85b855-bb
  13784 qg-f64811d2-4b
  13778 qg-fdb48344-82
  13775 qg-8cbb51de-5e
  13774 qg-bd43420b-a9
  13774 qg-7bd5b526-54
  13770 qg-e4778e69-8f
  13769 qg-b2ed3c2f-4c
  13763 qg-54b920a6-8e
  13761 qg-f45e01da-60
  13759 qg-f95b2a6f-74
  13757 qg-e4f29484-0a
  13753 qg-5515d120-e8
  13752 qg-f3fff380-f0
  13752 qg-e3dd4e16-98
  13751 qg-b5d4bded-f3
  13748 qg-dbcc915a-90
  13746 qg-df7e187a-83
  13744 qg-c3afd50e-41
  13744 qg-adec81f6-50
  13744 qg-874fee56-00
  13744 qg-595aa19a-7e
  13744 qg-0f459ae0-84
  13741 qg-c90eb60e-51
  13740 qg-ff9475bc-1c
  13740 qg-f2a71092-25
  13740 qg-22599fd1-75
  13739 qg-d068fba5-ad
  13739 qg-7492bb7d-c7
  13739 qg-546bed7c-91
  13739 qg-444f66b1-3c
  13738 qg-db665e8e-52
  13738 qg-40d89ead-8e
  13737 qg-a0bf0fa4-bf
  13737 qg-6ff1014c-24
  13736 qg-eff3812e-b1
  13736 qg-e29d90b3-39
  13736 qg-66a332af-22
  13736 qg-15e67ba9-87
  13734 qg-f9eb38d0-68
  13734 qg-e6560ca2-0f
  13732 qg-c155c29d-e6
  13731 qg-11aeea40-b0
  13730 qg-9702727b-1a
  13725 qg-ddd4c341-05
  13718 qg-cc26d659-55
  13715 qg-1b608741-91
  13708 qg-a237007e-e2
  13707 qg-208f986a-a2
  13604 qg-7a833dc0-8e
  13395 qg-28431b7f-dc
  13315 qg-d118def9-33

Just looking at the offending IP "10.8.255.254", it's the default gateway for
those bridges:

./ip_netns_exec_qrouter-5c09c2c9-aa8a-430e-9cfe-15329987141d_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-86439ab9-1f 
./ip_netns_exec_qrouter-5c332f62-90b2-4692-861a-a9dcf02c34f8_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-12f59f63-fe 
./ip_netns_exec_qrouter-5ddf4e51-77e6-4cc9-9803-a6e97582ee77_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-bba31e3c-ef 
./ip_netns_exec_qrouter-5f49f3e1-d314-4fca-8d8b-6aad553ce408_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-3665ec12-75 
./ip_netns_exec_qrouter-5fb64034-3c0a-4e86-b389-5d2ec63d02dd_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ea157c4b-07 
./ip_netns_exec_qrouter-6bed7f94-3133-45e0-898e-eca61b933b39_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-c15b0f85-28 
./ip_netns_exec_qrouter-6c8bc503-aea8-403a-a2fe-661130f19199_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-da6e9966-97 
./ip_netns_exec_qrouter-6c44fd76-7691-4525-941c-78a4388a2b32_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-beaefb67-91 
./ip_netns_exec_qrouter-6c57b7f2-9d10-488a-bf8a-02c9a1f2f346_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-2c3dc165-0f 
./ip_netns_exec_qrouter-6c43978e-50af-4375-bd3b-241c44d243d6_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-64529548-00 
./ip_netns_exec_qrouter-6e4f9544-2390-4a2c-81cd-60b501152c28_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-5e776593-0b 
./ip_netns_exec_qrouter-6ed8fe47-b9c7-4051-a43c-0075aafb9f60_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ab65ab61-5f 
./ip_netns_exec_qrouter-7b041ff9-dddc-4d11-bafe-82240b9c1f92_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-0c278b9f-54 
./ip_netns_exec_qrouter-7c0ad7ac-702a-4f10-ab9c-ca19f9b3a851_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ed5e6f9a-69 
./ip_netns_exec_qrouter-7cbb333c-4e5a-4e9f-a2af-53d5cb79677b_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-4df186a5-58 
./ip_netns_exec_qrouter-7e989393-9ee5-4aa6-ab8d-20d8d4c57e3f_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-d87d0217-a9 
./ip_netns_exec_qrouter-8a64d973-03da-4300-8057-689b798ee72e_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-3b981cbc-04 
./ip_netns_exec_qrouter-8ad70288-6ccd-4b03-8bf2-7ffb62e96b7a_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-638d415b-64 
./ip_netns_exec_qrouter-8b0727e2-5ec1-4601-aff7-bf6b9b747b0b_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ded1538c-32 
./ip_netns_exec_qrouter-8d72f7c6-4de9-4a40-879a-bed62a9c75b7_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-7f45ac2c-e2 
./ip_netns_exec_qrouter-60a4e9c7-5d00-4fc2-aed9-bddb86da9cf5_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-a03bd6ba-31 
./ip_netns_exec_qrouter-60aa8692-0a28-4bcc-9832-4391d341df7a_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-224c0b73-cc 
./ip_netns_exec_qrouter-64d96bfe-5cdd-4049-ac05-b4a029e65555_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-d6df8bc2-15 
./ip_netns_exec_qrouter-69f4cc16-d6a5-4e08-9177-200b8d9cdbb0_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-4826353f-c1 
./ip_netns_exec_qrouter-71a059c8-e2f5-49af-85f8-c5ca7b862b5d_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-6f95d44f-4c 
./ip_netns_exec_qrouter-74fbd52c-3d34-4ec1-bf9d-b19cca055695_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-e6a598f3-a6 
./ip_netns_exec_qrouter-75ac2846-b1e8-46fa-b9a3-782accaa2d1c_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-8230930b-95 
./ip_netns_exec_qrouter-79f00d6a-3196-4fc3-aa57-8685a5d54f22_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-dfbaa6d0-ec 
./ip_netns_exec_qrouter-79f2c073-f64d-4c27-8946-6fb83ee34d4f_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-b99bcb4c-5f 
./ip_netns_exec_qrouter-80f535dc-5c8d-4085-a61e-4a85e559e190_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-7333f82e-d0 
./ip_netns_exec_qrouter-87f12d93-149d-41d4-b8b7-df7a14bd0e0f_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-72792747-70 
./ip_netns_exec_qrouter-607fdbe5-3fcb-4a48-bdaa-6ec7726684e6_ip_route_show_table_all:2:default via 10.8.255.254 dev qg-6b748d98-f7 
./ip_netns_exec_qrouter-638e3b23-8ad7-48ce-a060-b9b12ef1cac3_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-fff75d9b-a7 
./ip_netns_exec_qrouter-750e5071-462e-4815-9a4e-008a38ab4b6f_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-3e7bca60-96 
./ip_netns_exec_qrouter-6327b790-2793-40de-8d34-8d86c7fe2167_ip_route_show_table_all:2:default via 10.8.255.254 dev qg-7ecf1a19-d8 
./ip_netns_exec_qrouter-7329d473-49a6-4a08-b98b-c3726493e088_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-64e636e6-31 
./ip_netns_exec_qrouter-61146b1e-2fed-4027-8d89-163f0d4e0692_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-0aab2c63-db 
./ip_netns_exec_qrouter-77460b71-cd0b-4d0e-9a1f-fee5e989a7f3_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-ff2cfc2c-f1 
./ip_netns_exec_qrouter-78362e05-d861-47ea-8442-506cc9d18f4b_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-051df8cf-4b 
./ip_netns_exec_qrouter-747135c4-4517-4ccd-bd01-6e1427cb5c7d_ip_route_show_table_all:1:default via 10.8.255.254 dev qg-e87567ea-75 

Did the default gateway go down? Why are those interfaces generating so many
ARP requests?
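
For reference, the per-bridge aggregation above boils down to one-liners along
these lines (the perf_arp_process.txt line format is assumed from the samples
shown above; output file names are arbitrary):

$ grep -o 'for bridge qg-[0-9a-f-]*' perf_arp_process.txt | awk '{print $3}' | sort | uniq -c | sort -rn > arps_sort_bridge.txt
$ grep -o 'on IP [0-9.]*' perf_arp_process.txt | awk '{print $3}' | sort | uniq -c | sort -rn > arps_per_ip.txt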

Comment 36 Eelco Chaudron 2018-03-26 13:07:12 UTC
Created attachment 1413149 [details]
RAW analysis of perf data

Comment 37 Eelco Chaudron 2018-03-29 08:02:34 UTC
Adding some clarification: the IP address is the SOURCE IP of the ARP packets. So, in this case, the router seems to be generating a lot of ARP packets (or they are being looped).

Comment 38 Ladislav Jozsa 2018-03-29 13:30:44 UTC
Thanks Eelco, we're looking into the issue together with IT/Network team and they have opened a support case with Juniper.

Comment 41 Eelco Chaudron 2018-05-07 09:44:49 UTC
I quickly looked at the traces supplied, and this time I see only +/- 3K packets for 10.8.255.254 out of the 337k, so +/- 1% (taken in 30 seconds). 

There is still a lot of ARP traffic going on; most of it is gratuitous ARPs.
For example, with the following filter, arp.src.proto_ipv4==172.31.0.1, I see 4360 packets in 31 seconds. I assume I am seeing multiple tunnel endpoints. However, I also see two MACs trying to use the same IP here (open the trace with Wireshark and it will tell you which ones).

Is this some kind of redundancy at the OSP level? It still does not look like an OVS problem, but could this be the effect of an earlier problem causing some redundancy mechanism to kick in and starve the kernel due to excessive ARPs?

Eduard, please further analyze the traces and see if something is wrong at a higher level than OVS, i.e. why we see so many (gratuitous) ARP packets.
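
If it helps, the same checks can also be run from the command line with tshark instead of the Wireshark GUI (the trace file name below is a placeholder):

  # Top ARP sender IPs in the trace:
  tshark -r trace.pcap -Y arp -T fields -e arp.src.proto_ipv4 | sort | uniq -c | sort -rn | head

  # IPs advertised by more than one MAC (the duplicate-IP symptom mentioned above):
  tshark -r trace.pcap -Y arp -T fields -e arp.src.proto_ipv4 -e arp.src.hw_mac \
    | sort -u | awk '{print $1}' | sort | uniq -c | awk '$1 > 1'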

Comment 43 Pablo Caruana 2018-05-12 15:41:32 UTC
(In reply to Eelco Chaudron from comment #41)
> I quickly looked at the traces supplied, and this time I see only +/- 3K
> packets for 10.8.255.254 out of the 337k, so +/- 1% (taken in 30 seconds). 
> 
> There is still a lot of ARP traffic going on; most of it is gratuitous ARPs.
> For example, with the following filter, arp.src.proto_ipv4==172.31.0.1, I see
> 4360 packets in 31 seconds. I assume I am seeing multiple tunnel endpoints.
> However, I also see two MACs trying to use the same IP here (open the trace
> with Wireshark and it will tell you which ones).
> 
> Is this some kind of redundancy at the OSP level? It still does not look
> like an OVS problem, but could this be the effect of an earlier problem
> causing some redundancy mechanism to kick in and starve the kernel due to
> excessive ARPs?
> 
> Eduard, please further analyze the traces and see if something is wrong at a
> higher level than OVS, i.e. why we see so many (gratuitous) ARP packets.

The 172.31.0.1 looks like it was the default gateway for the
ip_netns_exec_qdhcp-23d8767d-0f7c-4fb9-b182-08bde4026ccc namespace
(ip_route_show_table_all: default via 172.31.0.1 dev tap400998b6-0f).

Trying to get some extra details about it.

Comment 44 Eelco Chaudron 2018-05-14 07:33:38 UTC
This IP was just an example; there are a lot of IPs in the traces that show potentially duplicate MACs sharing IPs. In addition, was it ever researched why you see a lot of duplicate IPv6 addresses? See the first part of comment #36. Maybe there is also some kind of loop issue there.

Comment 45 Eelco Chaudron 2018-05-14 13:05:50 UTC
The following BZ 1558336 looks similar. The actual symptom seems different, but they see the same kind of ARP behavior.

Comment 48 Eelco Chaudron 2018-06-29 14:29:57 UTC
Closing BZ for now due to lack of response. Please reopen if needed when additional information is available.

Comment 49 Red Hat Bugzilla 2023-09-15 01:26:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

