Bug 1869724

Summary: sosreport running 'ethtool -e' is causing bnx2x NICs to pause
Product: Red Hat Enterprise Linux 8 Reporter: suresh kumar <surkumar>
Component: sos Assignee: Pavel Moravec <pmoravec>
Status: CLOSED ERRATA QA Contact: Miroslav Hradílek <mhradile>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 8.3 CC: agk, bmr, mhradile, plambri, ptalbert, sbradley
Target Milestone: rc Keywords: OtherQA
Target Release: 8.0 Flags: pm-rhel: mirror+
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: sos-3.9.1-6.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-04 01:58:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description suresh kumar 2020-08-18 13:57:41 UTC
Description of problem:

[1]
A customer observed their application stalling for ~4 seconds while sosreport was running.

The issue was traced to the 'ethtool -e' command. The strace below shows the SIOCETHTOOL ioctl that reads the EEPROM returning only after 3.444488 seconds.

+++
26621 1595999211.585459 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 3<UDP:[21070872]> <0.000016>
26621 1595999211.585748 ioctl(3<UDP:[21070872]>, SIOCETHTOOL, 0x7fffffffe530) = 0 <0.000026>
26621 1595999211.586002 mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff750a000 <0.000012>
26621 1595999211.587348 ioctl(3<UDP:[21070872]>, SIOCETHTOOL, 0x7fffffffe530 <unfinished ...>
26621 1595999215.032016 <... ioctl resumed> ) = 0 <3.444488>  <<<<<
+++
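
To spot such stalls in a longer trace, the per-call durations that strace prints in angle brackets at the end of each completed call (the -T option) can be filtered. This is a minimal sketch, not part of sos; the function name and the 1-second threshold are illustrative assumptions.

```python
import re

# strace -T appends the time spent in each call as "<seconds>", e.g. "<3.444488>".
DURATION_RE = re.compile(r"<(\d+\.\d+)>")


def slow_calls(trace_lines, threshold=1.0):
    """Return (seconds, line) pairs for calls that took at least `threshold` seconds."""
    hits = []
    for line in trace_lines:
        matches = DURATION_RE.findall(line)
        if not matches:
            continue  # e.g. "<unfinished ...>" lines carry no duration
        seconds = float(matches[-1])  # the duration is the last such token on the line
        if seconds >= threshold:
            hits.append((seconds, line.strip()))
    return hits


# Lines taken from the strace excerpt above.
sample = [
    "26621 1595999211.585748 ioctl(3<UDP:[21070872]>, SIOCETHTOOL, 0x7fffffffe530) = 0 <0.000026>",
    "26621 1595999211.587348 ioctl(3<UDP:[21070872]>, SIOCETHTOOL, 0x7fffffffe530 <unfinished ...>",
    "26621 1595999215.032016 <... ioctl resumed> ) = 0 <3.444488>",
]
for seconds, line in slow_calls(sample):
    print(f"{seconds:.6f}s  {line}")
```

With the sample lines, only the resumed ioctl is reported.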

NIC version:
+++
driver: bnx2x
version: 1.713.36-0 storm 7.13.1.0
firmware-version: mbi 7.15.64 bc 7.14.62
+++



Version-Release number of selected component (if applicable):

sos version >= 3.7
sosreport has run 'ethtool -e' since version 3.7, when the following commit added it:

+++
$ git show 8b989aeb
commit 8b989aebc9c152430fc57f918a8e90210a792a9f
Author: Patrick Talbert <ptalbert>
Date:   Thu Dec 6 13:14:38 2018 +0100

    [networking] Extend ethtool command set
    
    Update the list of ethtool commands to include:
    
    ethtool -e (EEPROM dump)
    ethtool -P (permanent MAC address)
    ethtool -l (channel/queue settings)
    ethtool --phy-statistics
    ethtool --show-priv-flags
    ethtool --show-eee
    
    All of the above are helpful in understanding the state of modern NICs.
    And -P is nice to have as otherwise there is no reliable way to see the
    permanent MAC of team ports.
...
+++



How reproducible:

Always. Run sosreport on a system with a bnx2x NIC.

Below is the test result from system dell-pem630-01:

+++
System Information
        Manufacturer: Dell Inc.
        Product Name: PowerEdge M630
        Version: Not Specified
        Serial Number: 1V8QT52
        UUID: 4c4c4544-0056-3810-8051-b1c04f543532


+++
# ethtool -i em1
driver: bnx2x                                      <------------------
version: 1.713.36-0 storm 7.13.1.0
firmware-version: FFV7.12.19 bc 7.12.5
expansion-rom-version: 
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
+++


The EEPROM dump is slow:
+++
# time ethtool -e em1
...
0x1fff80:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x1fff90:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x1fffa0:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x1fffb0:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x1fffc0:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x1fffd0:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x1fffe0:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x1ffff0:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

real	0m12.674s                       <-----------------
user	0m0.090s
sys	0m1.294s
+++
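
The size of the dump can be read off the final offset shown above: each output line carries 16 bytes, so the last line at 0x1ffff0 ends at 0x200000. A short worked calculation (the variable names are illustrative):

```python
BYTES_PER_LINE = 16      # each line of the dump above shows 16 bytes
last_offset = 0x1ffff0   # offset of the final line in the dump above

eeprom_bytes = last_offset + BYTES_PER_LINE
eeprom_mib = eeprom_bytes // (1024 * 1024)
print(hex(eeprom_bytes), eeprom_mib, "MiB")  # 0x200000 2 MiB
```

That is 2 MiB of EEPROM data, which appears consistent with the 2101248-byte buffer (2 MiB plus 4096 bytes) allocated just before the slow ioctl in the strace from the customer system.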



Actual results:
sosreport takes a long time to complete, and bnx2x NICs pause while it runs.



Expected results:
sosreport should not affect NIC operation.



Additional info:
[1]
In another instance (https://bugzilla.redhat.com/show_bug.cgi?id=1846708), we observed sosreport breaking iDRAC connectivity. That issue was also traced to the "ethtool -e" command.


[2]
Dell has an advisory regarding this:
+++
https://www.dell.com/support/manuals/in/en/inbsdt1/red-hat-entps-lx-v7.0/rhel_7.7_rn/reading-eeprom-from-a-broadcom-device-via-ethtool-results-in-soft-lockup?guid=guid-986ca2a9-c9f9-4345-8762-02d286cc0d1f&lang=en-us
+++

[3]

We also have a case where abrtd triggers sosreport and the CPUs frequently hit soft lockups.


So it is better for sosreport not to run 'ethtool -e' on bnx2x NICs, as it is not production safe.

Comment 1 suresh kumar 2020-08-18 14:07:06 UTC
I have submitted an upstream patch that skips 'ethtool -e' for bnx2x NICs; it has been accepted:
https://github.com/sosreport/sos/commit/34c77d6902ee1df403dc3836b4092d413fb95350

+++
$ git show 34c77d69
commit 34c77d6902ee1df403dc3836b4092d413fb95350
Author: suresh2514 <suresh2514>
Date:   Fri Aug 14 22:59:34 2020 +0530

    [networking] remove 'ethtool -e' option for bnx2x NICs
    
    Running EEPROM dump (ethtool -e) can result in bnx2x driver NICs to
    pause for few seconds and is not recommended in production environment.
    
    Resolves: #2188
    Resolves: #2200
    
    Signed-off-by: suresh2514 <suresh2514>
    Signed-off-by: Jake Hunsaker <jhunsake>

diff --git a/sos/report/plugins/networking.py b/sos/report/plugins/networking.py
index ba9c0fb1..397549a5 100644
--- a/sos/report/plugins/networking.py
+++ b/sos/report/plugins/networking.py
@@ -198,7 +198,6 @@ class Networking(Plugin):
                 "ethtool -a " + eth,
                 "ethtool -c " + eth,
                 "ethtool -g " + eth,
-                "ethtool -e " + eth,
                 "ethtool -P " + eth,
                 "ethtool -l " + eth,
                 "ethtool --phy-statistics " + eth,
@@ -206,6 +205,17 @@ class Networking(Plugin):
                 "ethtool --show-eee " + eth
             ], tags=eth)
 
+            # skip EEPROM collection for 'bnx2x' NICs as this command
+            # can pause the NIC and is not production safe.
+            bnx_output = {
+                "cmd": "ethtool -i %s" % eth,
+                "output": "bnx2x"
+            }
+            bnx_pred = SoSPredicate(self,
+                                    cmd_outputs=bnx_output,
+                                    required={'cmd_outputs': 'none'})
+            self.add_cmd_output("ethtool -e %s" % eth, pred=bnx_pred)
+
         # Collect information about bridges (some data already collected via
         # "ip .." commands)
         self.add_cmd_output([
+++
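
The effect of the predicate can be illustrated outside sos. This is a minimal standalone sketch, assuming the check amounts to looking for the string "bnx2x" in the 'ethtool -i' output and collecting 'ethtool -e' only when it is absent. The function name is hypothetical and is not part of the sos API.

```python
def should_collect_eeprom(ethtool_i_output, blocked="bnx2x"):
    """Return True when the EEPROM dump may be collected for this device.

    Mirrors required={'cmd_outputs': 'none'} from the patch: collect only if
    the blocked string does not appear in the 'ethtool -i' output.
    (Illustrative helper, not sos code.)
    """
    return blocked not in ethtool_i_output


bnx2x_info = "driver: bnx2x\nversion: 1.713.36-0 storm 7.13.1.0\n"
# Made-up example of output for a NIC using a different driver.
other_info = "driver: e1000e\nversion: 3.2.6-k\n"

print(should_collect_eeprom(bnx2x_info))  # False
print(should_collect_eeprom(other_info))  # True
```

Because this is a substring match on the whole output, "bnx2x" anywhere in the 'ethtool -i' output would suppress the dump, not only in the driver field.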



Test result for above patch:

+++
 Setting up archive ...
 Setting up plugins ...
...
[plugin:networking] skipped command 'ethtool -e em2':    <--------------------- bnx2x NIC
[plugin:networking] skipped command 'ethtool -e em1':    <--------------------- bnx2x NICs
 Running plugins. Please wait ...

  Starting 1/1   networking      [Running: networking]

  Finished running plugins

Creating compressed archive...

Your sosreport has been generated and saved in:
	/var/tmp/sosreport-dell-pem630-01-2020-08-14-ixpdmsw.tar.xz

 Size	1.24MiB
 Owner	root
 md5	a2c236193997733cc383ebdf2bac478f

Please send this file to your support representative.


real	0m3.718s   <-------  Without this patch,  it was taking 12s to complete sosreport.
user	0m2.028s
sys	0m0.896s
+++

Comment 2 Pavel Moravec 2020-08-18 14:30:36 UTC
I can add it to RHEL 8.3.0 but we are limited on QE capacity. If/Once a candidate package is available, could you verify it, please?

Comment 3 suresh kumar 2020-08-18 16:06:09 UTC
(In reply to Pavel Moravec from comment #2)
> I can add it to RHEL 8.3.0 but we are limited on QE capacity. If/Once a
> candidate package is available, could you verify it, please?

sure

regards

Comment 4 Pavel Moravec 2020-08-19 09:10:15 UTC
Hello,
could you please verify the fix against the build below? Thanks in advance.


A yum repository for the build of sos-3.9.1-6.el8 (task 30820540) is available at:

http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/6.el8/

You can install the rpms locally by putting this .repo file in your /etc/yum.repos.d/ directory:

http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/6.el8/sos-3.9.1-6.el8.repo

RPMs and build logs can be found in the following locations:
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/6.el8/noarch/

The full list of available rpms is:
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/6.el8/noarch/sos-3.9.1-6.el8.src.rpm
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/6.el8/noarch/sos-3.9.1-6.el8.noarch.rpm
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/6.el8/noarch/sos-audit-3.9.1-6.el8.noarch.rpm

The repository will be available for the next 60 days. Scratch build output will be deleted
earlier, based on the Brew scratch build retention policy.

Comment 18 errata-xmlrpc 2020-11-04 01:58:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sos bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4534