Bug 1917074

Summary: sosreport (ethtool -e) causes ovs-dpdk bonds to flap [rhel-7.9.z]
Product: Red Hat Enterprise Linux 7 Reporter: nacurry
Component: sos Assignee: Jan Jansky <jjansky>
Status: CLOSED ERRATA QA Contact: Maros Kopec <makopec>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.7 CC: agk, alonare, bmr, cory.bannister, cww, fhallal, fkrska, jreznik, mheler, plambri, pmoravec, sbradley, theute
Target Milestone: rc Keywords: Triaged, ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: sos-3.9-5.el7_9.2 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-02 11:59:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description nacurry 2021-01-17 01:45:07 UTC
Description of problem:
Running sosreport causes OVS memory usage on the system to balloon to over 45GB, after which OVS crashes due to OOM.

- Specifically `ethtool -e` run against the Broadcom controllers[1]

[1]
02:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
02:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
02:00.2 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
02:00.3 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
04:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 01)
04:00.1 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 01)
82:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 01)
82:00.1 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 01)

- 04:00.1 and 82:00.1 are members of a Linux Bond.
- The OVS-DPDK bond is on 04:00.0 and 82:00.0
  - mode: balance-tcp (tech preview)
  - support-multi-driver is not enabled
  - firmware not in line with recommendation for installed dpdk packages

- The customer is unable to provide any counterexamples to balance-tcp + support-multi-driver disabled + out-of-date firmware, so it is unclear whether all of these conditions are required for replication.

Version-Release number of selected component (if applicable):
Packages:
- sos-3.7-7.el7_7.noarch
- ethtool-4.8-10.el7.x86_64
- dpdk-18.11.2-1.el7.x86_64
- openvswitch-2.9.0-114.el7fdp.x86_64

Firmware:
- BCM
  - 5719-v1.46 NCSI v1.3.16.0
- Intel
  - 5.60 0x80002dac 1.1618.0
  - 4.61 0x80002bb1 1.3377.0


How reproducible:
Every time

Steps to Reproduce:
1. Set up ovs-dpdk on intel x710s
   - (maybe) needs to be balance-tcp
   - (maybe) needs to have userspace and kernel driver loaded on same card without enabling support-multi-driver
2. Run sosreport


Actual results:
Hangs and memory ballooning in OVS cause ports to flap when running ethtool -e against BCM5719 devices

Expected results:
Doesn't disrupt the network.  Either succeeds or fails gracefully.

Additional info:
Potentially related BZ https://bugzilla.redhat.com/show_bug.cgi?id=1744317#c126

Comment 3 Pavel Moravec 2021-01-18 09:26:54 UTC
(In reply to nacurry from comment #0)
> Steps to Reproduce:
> 1. Set up ovs-dpdk on intel x710s
>    - (maybe) needs to be balance-tcp
>    - (maybe) needs to have userspace and kernel driver loaded on same card
> without enabling support-multi-driver


So the request is to stop collecting "ethtool -e" in that setup, am I right? We can predicate calling the command, but could you please provide a diagnostic command that detects such a setup?

(For example, similar issues happen on bnx2x NICs, so when the output of "ethtool -i %DEV" contains the string "bnx2x", we skip calling "ethtool -e %DEV" [1]. Please provide a similar command or condition.)

[1] https://github.com/sosreport/sos/pull/2200/files

Comment 4 mheler 2021-01-18 16:40:37 UTC
"ethtool -i %DEV" contains "tg3" string

should be enough to match network cards that are seeing this issue with sos
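
A minimal shell sketch of the condition discussed in comments 3 and 4: read the driver name from "ethtool -i %DEV" and skip "ethtool -e %DEV" when the driver is bnx2x or tg3. This is only an illustration. sos itself is written in Python and the actual change is the one in the upstream PR. The function names and the EEPROM_SKIP_DRIVERS variable are hypothetical, and the sketch compares the "driver:" field exactly rather than searching the whole output for a substring.

```shell
# Sketch only: function and variable names are hypothetical, not from sos.

# Drivers for which this report says an EEPROM dump should be skipped.
EEPROM_SKIP_DRIVERS="bnx2x tg3"

# Extract the driver name from `ethtool -i` output supplied on stdin.
driver_from_info() {
    sed -n 's/^driver: *//p' | head -n 1
}

# Return 0 (true) when the given driver name is on the skip list.
should_skip_eeprom() {
    local driver="$1" blocked
    for blocked in $EEPROM_SKIP_DRIVERS; do
        if [ "$driver" = "$blocked" ]; then
            return 0
        fi
    done
    return 1
}

# Run `ethtool -e` for one device unless its driver is on the skip list.
collect_eeprom() {
    local dev="$1" driver
    driver="$(ethtool -i "$dev" 2>/dev/null | driver_from_info)"
    if should_skip_eeprom "$driver"; then
        echo "skipped command 'ethtool -e $dev' (driver: $driver)" >&2
        return 0
    fi
    ethtool -e "$dev"
}
```

With these definitions, "collect_eeprom eth0" would dump the EEPROM only when eth0 is not driven by bnx2x or tg3.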

Comment 5 Pavel Moravec 2021-01-18 21:51:39 UTC
Upstream PR proposed.

Leaving it to jjansky to decide about inclusion in RHEL7. In RHEL8, it should be contained in 8.5 by default. If a sooner fix is required, let's clone the BZ to RHEL8 (but I don't want to promise anything).

Comment 7 Chris Williams 2021-01-20 19:13:28 UTC
We are looking at getting an erratum pushed out for this as soon as possible; it will alleviate the reported issue by no longer running ethtool -e by default.
Additional investigation will also be needed to determine why this long-standing diagnostic tool has recently become disruptive in certain environments.

Comment 11 Maros Kopec 2021-01-25 14:56:56 UTC
I tested this manually by wrapping the ethtool binary on RHEL-7.9 x86_64, so that the reported driver name can be faked

SETUP

# cat /root/fakebin/ethtool
#!/bin/bash

# /usr/sbin/original-ethtool $@ | sed 's/driver: .*/driver: bnx2x/'
/usr/sbin/original-ethtool "$@" | sed 's/driver: .*/driver: tg3/'

# chmod u+x /root/fakebin/ethtool
# mv /usr/sbin/ethtool /usr/sbin/original-ethtool
# ln -s /root/fakebin/ethtool /usr/sbin/ethtool


+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

OLD
# rpm -qa sos
sos-3.9-5.el7_9.1.noarch

# sosreport --list-plugins | grep 'networking\.'
 networking.traceroute     off             collect a traceroute to www.example.com
 networking.namespace_pattern                 Specific namespaces pattern to be collected, namespaces pattern should be separated by whitespace as for example "eth* ens2"
 networking.namespaces     0               Number of namespaces to collect, 0 for unlimited. Incompatible with the namespace_pattern plugin option
 networking.ethtool_namespaces on              Define if ethtool commands should be collected for namespaces

With bnx2x driver
# sosreport -o networking --batch
...
[plugin:networking] skipped command 'ethtool -e eth0': 
...
Your sosreport has been generated and saved in:
  /var/tmp/sosreport-localhost-2021-01-25-dzeixss.tar.xz


With tg3 driver
# sosreport -o networking --batch
...
Your sosreport has been generated and saved in:
  /var/tmp/sosreport-localhost-2021-01-25-ixtpwuf.tar.xz

# tar tf /var/tmp/sosreport-localhost-2021-01-25-ixtpwuf.tar.xz| grep ethtool_-e
sosreport-localhost-2021-01-25-ixtpwuf/sos_commands/networking/ethtool_-e_eth0
sosreport-localhost-2021-01-25-ixtpwuf/sos_commands/networking/ethtool_-e_lo

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

NEW

# rpm -qa sos
sos-3.9-5.el7_9.noarch

We can see that the new eepromdump option is off by default, so 'ethtool -e' is no longer collected unless requested
# sosreport --list-plugins | grep 'networking\.'
 networking.traceroute     off             collect a traceroute to www.example.com
 networking.namespace_pattern                 Specific namespaces pattern to be collected, namespaces pattern should be separated by whitespace as for example "eth* ens2"
 networking.namespaces     0               Number of namespaces to collect, 0 for unlimited. Incompatible with the namespace_pattern plugin option
 networking.ethtool_namespaces on              Define if ethtool commands should be collected for namespaces
 networking.eepromdump     off             collect 'ethtool -e' for all devices 


With bnx2x driver
# sosreport -o networking --batch
...
Your sosreport has been generated and saved in:
  /var/tmp/sosreport-localhost-2021-01-25-qmqodau.tar.xz

# tar tf /var/tmp/sosreport-localhost-2021-01-25-qmqodau.tar.xz | grep ethtool_-e

With tg3 driver
# sosreport -o networking --batch
...
Your sosreport has been generated and saved in:
  /var/tmp/sosreport-localhost-2021-01-25-jwqopke.tar.xz

# tar tf /var/tmp/sosreport-localhost-2021-01-25-jwqopke.tar.xz| grep ethtool_-e
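
The manual "tar tf ... | grep ethtool_-e" checks above could be wrapped in a small helper for repeated runs. This is a sketch under the assumption that EEPROM dumps always land in files whose names contain "ethtool_-e", as in the listings above. The function name is hypothetical.

```shell
# Sketch only: the function name is hypothetical.

# Return 0 when the given sosreport archive contains no 'ethtool -e' output.
archive_has_no_eeprom_dump() {
    local archive="$1"
    # tar detects the compression on its own; grep -q only sets the status.
    ! tar tf "$archive" | grep -q 'ethtool_-e'
}
```

For example, "archive_has_no_eeprom_dump /var/tmp/sosreport-localhost-2021-01-25-jwqopke.tar.xz && echo OK" would report OK for the NEW runs above. To verify the opposite case, the dump should be collected again when the option is enabled explicitly, assuming the usual "-k networking.eepromdump=on" plugin option syntax.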

Comment 15 errata-xmlrpc 2021-02-02 11:59:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sos bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0333