Bug 1917074 - sosreport (ethtool -e) causes ovs-dpdk bonds to flap [rhel-7.9.z]
Summary: sosreport (ethtool -e) causes ovs-dpdk bonds to flap [rhel-7.9.z]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sos
Version: 7.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jan Jansky
QA Contact: Maros Kopec
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-17 01:45 UTC by nacurry
Modified: 2024-06-13 23:56 UTC (History)

Fixed In Version: sos-3.9-5.el7_9.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-02 11:59:00 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | sosreport sos pull 2376 | 0 | None | closed | [networking] Collect 'ethtool -e <device>' conditionally only | 2021-02-19 09:52:08 UTC
Github | sosreport sos pull 2380 | 0 | None | closed | [networking] Collect 'ethtool -e <device>' conditionally only | 2021-02-19 09:52:08 UTC

Internal Links: 1918923

Description nacurry 2021-01-17 01:45:07 UTC
Description of problem:
Running sosreport causes system OVS memory usage to balloon to over 45GB and crash due to OOM.

- Specifically `ethtool -e` run against the Broadcom controllers[1]

[1]
02:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
02:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
02:00.2 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
02:00.3 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
04:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 01)
04:00.1 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 01)
82:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 01)
82:00.1 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 01)

- 04:00.1 and 82:00.1 are members of a Linux Bond.
- The OVS-DPDK bond is on 04:00.0 and 82:00.0
  - mode: balance-tcp (tech preview)
  - support-multi-driver is not enabled
  - firmware not in line with recommendation for installed dpdk packages

- The customer is unable to provide any counterexamples to balance-tcp + support-multi-driver disabled + out-of-date firmware, so it is unclear whether all of these are required for replication.

Version-Release number of selected component (if applicable):
Packages:
- sos-3.7-7.el7_7.noarch
- ethtool-4.8-10.el7.x86_64
- dpdk-18.11.2-1.el7.x86_64
- openvswitch-2.9.0-114.el7fdp.x86_64

Firmware:
- BCM
  - 5719-v1.46 NCSI v1.3.16.0
- Intel
  - 5.60 0x80002dac 1.1618.0
  - 4.61 0x80002bb1 1.3377.0


How reproducible:
Every time

Steps to Reproduce:
1. Set up ovs-dpdk on intel x710s
   - (maybe) needs to be balance-tcp
   - (maybe) needs to have userspace and kernel driver loaded on same card without enabling support-multi-driver
2. Run sosreport


Actual results:
Hangs and memory ballooning in OVS cause ports to flap when running ethtool -e against BCM5719 devices

Expected results:
Doesn't disrupt the network.  Either succeeds or fails gracefully.

Additional info:
Potentially related BZ https://bugzilla.redhat.com/show_bug.cgi?id=1744317#c126

Comment 3 Pavel Moravec 2021-01-18 09:26:54 UTC
(In reply to nacurry from comment #0)
> Steps to Reproduce:
> 1. Set up ovs-dpdk on intel x710s
>    - (maybe) needs to be balance-tcp
>    - (maybe) needs to have userspace and kernel driver loaded on same card
> without enabling support-multi-driver


So the request is to stop collecting "ethtool -e" in that setup, am I right? We can make the call conditional, but could you please provide a diagnostic command that detects such a setup?

(An example: similar issues happen on bnx2x NICs, so when the output of "ethtool -i %DEV" contains the string "bnx2x", we skip calling "ethtool -e %DEV" [1]. Please provide a similar command/condition.)

[1] https://github.com/sosreport/sos/pull/2200/files

Comment 4 mheler 2021-01-18 16:40:37 UTC
"ethtool -i %DEV" contains "tg3" string

should be enough to match network cards that are seeing this issue with sos
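
The driver-based condition discussed in comments 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code from the upstream pull requests: the function names and the `EEPROM_UNSAFE_DRIVERS` set are hypothetical, and the sketch compares the `driver:` field exactly rather than searching the whole output for a substring. It also assumes that a device whose driver cannot be determined should be skipped.

```python
import re

# Hypothetical skip list: the two drivers named in this thread as
# misbehaving when `ethtool -e` is run against them.
EEPROM_UNSAFE_DRIVERS = {"bnx2x", "tg3"}


def driver_from_ethtool_i(output):
    """Return the value of the 'driver:' line of `ethtool -i <dev>` output, or None."""
    match = re.search(r"^driver:\s*(\S+)", output, flags=re.MULTILINE)
    return match.group(1) if match else None


def should_collect_eeprom(ethtool_i_output):
    """Allow `ethtool -e` only when the driver is known and not on the skip list."""
    driver = driver_from_ethtool_i(ethtool_i_output)
    # Assumption: an unknown driver is treated as unsafe, so we skip the dump.
    return driver is not None and driver not in EEPROM_UNSAFE_DRIVERS
```

A caller would feed it the captured text of "ethtool -i %DEV" and run "ethtool -e %DEV" only when it returns True.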

Comment 5 Pavel Moravec 2021-01-18 21:51:39 UTC
Upstream PR proposed.

Leaving it to jjansky to decide about inclusion in RHEL7. In RHEL8, it should be included in 8.5 by default. If a sooner fix is required, let's clone the BZ to RHEL8 (but I don't want to promise anything).

Comment 7 Chris Williams 2021-01-20 19:13:28 UTC
We are looking at getting an erratum pushed out for this ASAP that will alleviate the reported issue by no longer running ethtool -e by default.
Additional investigation will also be needed to determine why this long-standing diagnostic tool has recently become disruptive to certain environments.
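
The idea of not running ethtool -e by default can be illustrated with a small opt-in sketch. Everything here is hypothetical: `ethtool_commands` is not a sos function, the list of other ethtool calls is an arbitrary sample, and the `eepromdump` parameter merely borrows the name of the plugin option that shows up in the verification output later in this thread.

```python
def ethtool_commands(device, eepromdump=False):
    """Build the list of ethtool commands to collect for one device.

    The EEPROM dump is opt-in: it is added only when `eepromdump` is True.
    """
    # Arbitrary sample of read-only queries; not the plugin's real list.
    commands = [
        "ethtool " + device,
        "ethtool -i " + device,
        "ethtool -S " + device,
    ]
    if eepromdump:
        commands.append("ethtool -e " + device)
    return commands
```

With this shape, a default run never touches the EEPROM, and a user who needs the dump has to ask for it explicitly.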

Comment 11 Maros Kopec 2021-01-25 14:56:56 UTC
I tested this manually by wrapping the ethtool binary on RHEL-7.9 x86_64.

SETUP

# cat /root/fakebin/ethtool
#!/bin/bash

# Run the real ethtool and rewrite the reported driver name.
# /usr/sbin/original-ethtool "$@" | sed 's/driver: .*/driver: bnx2x/'
/usr/sbin/original-ethtool "$@" | sed 's/driver: .*/driver: tg3/'

# chmod u+x /root/fakebin/ethtool
# mv /usr/sbin/ethtool /usr/sbin/original-ethtool
# ln -s /root/fakebin/ethtool /usr/sbin/ethtool


+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

OLD
# rpm -qa sos
sos-3.9-5.el7_9.1.noarch

# sosreport --list-plugins | grep 'networking\.'
 networking.traceroute     off             collect a traceroute to www.example.com
 networking.namespace_pattern                 Specific namespaces pattern to be collected, namespaces pattern should be separated by whitespace as for example "eth* ens2"
 networking.namespaces     0               Number of namespaces to collect, 0 for unlimited. Incompatible with the namespace_pattern plugin option
 networking.ethtool_namespaces on              Define if ethtool commands should be collected for namespaces

With bnx2x driver
# sosreport -o networking --batch
...
[plugin:networking] skipped command 'ethtool -e eth0': 
...
Your sosreport has been generated and saved in:
  /var/tmp/sosreport-localhost-2021-01-25-dzeixss.tar.xz


With tg3 driver
# sosreport -o networking --batch
...
Your sosreport has been generated and saved in:
  /var/tmp/sosreport-localhost-2021-01-25-ixtpwuf.tar.xz

# tar tf /var/tmp/sosreport-localhost-2021-01-25-ixtpwuf.tar.xz| grep ethtool_-e
sosreport-localhost-2021-01-25-ixtpwuf/sos_commands/networking/ethtool_-e_eth0
sosreport-localhost-2021-01-25-ixtpwuf/sos_commands/networking/ethtool_-e_lo

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

NEW

# rpm -qa sos
sos-3.9-5.el7_9.noarch

We can see that eepromdump is now ignored by default
# sosreport --list-plugins | grep 'networking\.'
 networking.traceroute     off             collect a traceroute to www.example.com
 networking.namespace_pattern                 Specific namespaces pattern to be collected, namespaces pattern should be separated by whitespace as for example "eth* ens2"
 networking.namespaces     0               Number of namespaces to collect, 0 for unlimited. Incompatible with the namespace_pattern plugin option
 networking.ethtool_namespaces on              Define if ethtool commands should be collected for namespaces
 networking.eepromdump     off             collect 'ethtool -e' for all devices 


With bnx2x driver
# sosreport -o networking --batch
...
Your sosreport has been generated and saved in:
  /var/tmp/sosreport-localhost-2021-01-25-qmqodau.tar.xz

# tar tf /var/tmp/sosreport-localhost-2021-01-25-qmqodau.tar.xz | grep ethtool_-e

With tg3 driver
# sosreport -o networking --batch
...
Your sosreport has been generated and saved in:
  /var/tmp/sosreport-localhost-2021-01-25-jwqopke.tar.xz

# tar tf /var/tmp/sosreport-localhost-2021-01-25-jwqopke.tar.xz| grep ethtool_-e
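
The "tar tf ... | grep ethtool_-e" checks above could also be scripted. The following is only a sketch using Python's standard tarfile module; the helper name `eeprom_dumps_in` is made up for illustration, and it assumes the interpreter was built with lzma support so that .tar.xz archives can be opened.

```python
import tarfile


def eeprom_dumps_in(archive_path):
    """Return the names of archive members that contain 'ethtool_-e'."""
    # "r:*" lets tarfile detect the compression (xz for these archives).
    with tarfile.open(archive_path, "r:*") as tar:
        return [name for name in tar.getnames() if "ethtool_-e" in name]
```

An empty list corresponds to the empty grep output seen with the NEW package, while the OLD package with the tg3 wrapper would yield the ethtool_-e_eth0 and ethtool_-e_lo entries.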

Comment 15 errata-xmlrpc 2021-02-02 11:59:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sos bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0333

