Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2055457

Summary:	kernel NULL pointer dereference while calling dma_pool_alloc from the mlx5_core module [rhel-7.9.z]
Product:	Red Hat Enterprise Linux 7	Reporter:	suresh kumar <surkumar>
Component:	kernel	Assignee:	William Zhao <wizhao>
kernel sub component:	Networking	QA Contact:	Tianhao <tizhao>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	unspecified	CC:	arawal, jiji, kpfleming, kzhang, mleitner, nmurray, sukulkar, tizhao
Version:	7.9	Keywords:	OtherQA, ZStream
Target Milestone:	rc	Flags:	pm-rhel: mirror+
Target Release:	7.9
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-3.10.0-1160.66.1.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-05-18 16:15:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2069951

Description suresh kumar 2022-02-17 02:32:43 UTC

Description of problem:

While rebooting, the system panicked with a NULL pointer dereference in dma_pool_alloc, called from the mlx5_core module.

   
[  755.537074] qla2xxx [0000:37:00.1]-fffa:2: Adapter shutdown
[  755.537140] qla2xxx [0000:37:00.1]-00af:2: Performing ISP error recovery - ha=ffff89243d855000.
[  755.537594] qla2xxx [0000:37:00.1]-fffe:2: Adapter shutdown successfully.
[  755.537596] qla2xxx [0000:37:00.0]-fffa:0: Adapter shutdown
[  755.537685] qla2xxx [0000:37:00.0]-00af:0: Performing ISP error recovery - ha=ffff8905b2f99000.
[  755.538510] qla2xxx [0000:37:00.0]-fffe:0: Adapter shutdown successfully.
[  755.549084] mlx5_core 0000:12:00.1: Shutdown was called            <---------------------------------
[  755.613877] bond0: link status definitely down for interface ens1f1, disabling it
[  756.404455] mlx5_core 0000:12:00.0: Shutdown was called           <---------------------------------
[  756.462899] bond0: link status definitely down for interface ens1f0, disabling it
[  756.462909] bond0: now running without any active interface!
[  757.164336] mlx5_core 0000:12:00.0: mlx5_cmd_check:745:(pid 12649): ACCESS_REG(0x805) op_mod(0x1) failed, status bad system state(0x4), syndrome (0x192deb)
[  757.303748] dlm: closing connection to node 6
[  757.303759] dlm: closing connection to node 5
[  757.303766] dlm: closing connection to node 4
[  757.303773] dlm: closing connection to node 3
[  757.303780] dlm: closing connection to node 2
[  757.303788] dlm: closing connection to node 1
[  757.305729] dlm: data: no userland control daemon, stopping lockspace
[  757.305743] dlm: data1: no userland control daemon, stopping lockspace
[  757.305757] dlm: clvmd: no userland control daemon, stopping lockspace
[  757.305769] dlm: dlm user daemon left 3 lockspaces
[  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280
[  758.102527] PGD 8000001f970b9067 PUD 1f9b841067 PMD 0 
[  758.164249] Oops: 0000 [#1] SMP 
[  758.202963] Modules linked in: gfs2 dlm bonding falcon_lsm_serviceable(PE) falcon_nf_netcontain(PE) falcon_kal(E) falcon_lsm_pinned_12904(E) sunrpc dm_service_time skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ses enclosure dm_multipath ipmi_si mei_me ipmi_devintf sg lpc_ich hpilo joydev mei wmi hpwdt ipmi_msghandler acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 qla2xxx i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm crct10dif_pclmul crct10dif_common crc32c_intel serio_raw smartpqi nvme_fc nvme_fabrics nvme_core tg3 scsi_transport_sas scsi_transport_fc scsi_tgt mlxfw devlink
[  759.051524]  ptp drm_panel_orientation_quirks pps_core dm_mirror dm_region_hash dm_log dm_mod
[  759.138374] CPU: 1 PID: 12649 Comm: amsd Kdump: loaded Tainted: P            E  ------------   3.10.0-1160.53.1.el7.x86_64 #1
[  759.274327] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 01/23/2021
[  759.376813] task: ffff8924108f2100 ti: ffff89240e1a0000 task.ti: ffff89240e1a0000
[  759.466751] RIP: 0010:[<ffffffff8ee11acb>]  [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280
[  759.567160] RSP: 0018:ffff89240e1a3968  EFLAGS: 00010046
[  759.630955] RAX: 0000000000000246 RBX: ffff89243d874100 RCX: 0000000000001000
[  759.716709] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff89243d874090
[  759.802465] RBP: ffff89240e1a39c0 R08: 000000000001f080 R09: ffff8905ffc03c00
[  759.888220] R10: ffffffffc04680d4 R11: ffffffff8edde9fd R12: 00000000000080d0
[  759.973976] R13: ffff89243d874090 R14: ffff89243d874080 R15: 0000000000000000
[  760.059732] FS:  00007fa2fbc6c8c0(0000) GS:ffff892440040000(0000) knlGS:0000000000000000
[  760.156991] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  760.226021] CR2: 0000000000000000 CR3: 0000001f8e2fc000 CR4: 00000000007607e0
[  760.311773] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  760.397529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  760.483283] PKRU: 55555554
[  760.515707] Call Trace:
[  760.527371] sd 0:0:1:27: rejecting I/O to offline device
[  760.527421] sd 0:0:1:72: rejecting I/O to offline device
[  760.672598]  [<ffffffff8ee2920c>] ? kmem_cache_alloc_trace+0x3c/0x200
[  760.750018]  [<ffffffffc04680f3>] mlx5_alloc_cmd_msg+0xd3/0x2a0 [mlx5_core]
[  760.833716]  [<ffffffffc046ad62>] cmd_exec+0x112/0x860 [mlx5_core]
[  760.907999]  [<ffffffffc046b4fb>] mlx5_cmd_exec+0x2b/0x50 [mlx5_core]
[  760.985428]  [<ffffffffc0475434>] mlx5_core_access_reg+0xe4/0x130 [mlx5_core]
[  761.071222]  [<ffffffffc04a7348>] mlx5e_get_fec_caps+0xa8/0x100 [mlx5_core]
[  761.154933]  [<ffffffffc04992bf>] get_fec_supported_advertised+0x3f/0x150 [mlx5_core]
[  761.249101]  [<ffffffffc049ab36>] mlx5e_get_link_ksettings+0x3a6/0x530 [mlx5_core]
[  761.340112]  [<ffffffff8f25db46>] __ethtool_get_link_ksettings+0xa6/0x210
[  761.421701]  [<ffffffff8f277208>] speed_show+0x78/0xb0
[  761.483417]  [<ffffffff8f0b70e3>] dev_attr_show+0x23/0x60
[  761.548270]  [<ffffffff8f3875f2>] ? mutex_lock+0x12/0x2f
[  761.612076]  [<ffffffff8eedbedf>] sysfs_kf_seq_show+0xcf/0x1f0
[  761.682160]  [<ffffffff8eeda596>] kernfs_seq_show+0x26/0x30
[  761.749101]  [<ffffffff8ee76d10>] seq_read+0x130/0x450
[  761.810813]  [<ffffffff8eedaef5>] kernfs_fop_read+0x105/0x170
[  761.879845]  [<ffffffff8ee4e3ff>] vfs_read+0x9f/0x170
[  761.940510]  [<ffffffff8ee4f27f>] SyS_read+0x7f/0xf0
[  762.000131]  [<ffffffff8f395f92>] system_call_fastpath+0x25/0x2a
[  762.072293] Code: 4c 89 f6 48 89 df 48 89 45 b0 e8 d1 4b 19 00 8b 53 24 48 8b 45 b0 49 89 d7 4c 03 7b 10 83 43 20 01 48 03 53 18 48 89 c6 4c 89 ef <41> 8b 0f 89 4b 24 48 8b 4d b8 48 89 11 e8 e3 9c 57 00 41 81 e4 
[  762.299642] RIP  [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280
[  762.371817]  RSP <ffff89240e1a3968>
[  762.413652] CR2: 0000000000000000


Version-Release number of selected component (if applicable):

kernel 3.10.0-1160.53.1.el7.x86_64


How reproducible:

  Customer is able to reproduce it with:
     pcs stonith fence <host>

Actual results:
  System panic while rebooting

Expected results:
  No panic

Additional info:
  Submitted an upstream patch that adds a netdevice-present check to speed_show in net-sysfs.
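
  For context, speed_show is the handler behind the per-interface "speed" attribute in sysfs, so the call trace above (SyS_read down to speed_show) is what an ordinary userspace read of that file looks like from the kernel side. Below is a minimal userspace sketch of such a read. The helper name read_attr_long is hypothetical and this is not the code amsd runs; it only illustrates the access pattern.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Read one decimal value from a sysfs-style attribute file.
 * Returns 0 and stores the value in *out on success, or a negative
 * errno value if the file cannot be opened, read, or parsed.
 * (read_attr_long is a hypothetical helper, not an existing API.)
 */
static int read_attr_long(const char *path, long *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -errno;

    errno = 0;
    int matched = fscanf(f, "%ld", out);
    int saved = errno;   /* keep a read error reported by the kernel, if any */
    fclose(f);

    if (matched != 1)
        return saved ? -saved : -EINVAL;
    return 0;
}
```

  Calling read_attr_long("/sys/class/net/ens1f0/speed", &speed) in a monitoring loop would exercise the same path. On a kernel with the present check, a read against a detached device would be expected to return an error instead of reaching the driver, although the exact error code is an assumption here.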

Comment 3 suresh kumar 2022-02-17 02:40:54 UTC

[1]

The system was reading /sys/devices/pci0000:11/0000:11:00.0/0000:12:00.0/net/ens1f0/speed and crashed because the device had already been removed.

+++
crash> mount |grep ffff8924f3b81000
ffff8944bfb28300 ffff8924f3b81000 sysfs  sysfs     /sys      

crash> dentry.d_name.name,d_parent 0xffff892419b10f00
  d_name.name = 0xffff892419b10f38 "speed"
  d_parent = 0xffff89243cbc4cc0
crash> dentry.d_name.name,d_parent 0xffff89243cbc4cc0
  d_name.name = 0xffff89243cbc4cf8 "ens1f0"
  d_parent = 0xffff89443cecbd40
crash> dentry.d_name.name,d_parent 0xffff89443cecbd40
  d_name.name = 0xffff89443cecbd78 "net"
  d_parent = 0xffff89443cc1f740
crash> dentry.d_name.name,d_parent 0xffff89443cc1f740
  d_name.name = 0xffff89443cc1f778 "0000:12:00.0"
  d_parent = 0xffff8924f37eae40
+++
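
The dentry walk above rebuilds the sysfs path by following d_parent from the "speed" dentry upward and reading d_name.name at each step. A rough userspace sketch of that walk follows. struct fake_dentry and fake_dentry_path are hypothetical stand-ins, not kernel types, and the chain here ends at a NULL parent, whereas a real root dentry points to itself.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the two struct dentry fields inspected above. */
struct fake_dentry {
    const char *name;                  /* models d_name.name */
    const struct fake_dentry *parent;  /* models d_parent; NULL ends the walk */
};

/*
 * Build "a/b/c" into buf by following parent links from the leaf upward.
 * The buffer is filled from the end backwards, then shifted to the front.
 * Returns 0 on success, -1 if buf is too small.
 */
static int fake_dentry_path(const struct fake_dentry *leaf, char *buf, size_t len)
{
    assert(leaf != NULL && buf != NULL);

    if (len == 0)
        return -1;

    size_t pos = len;
    buf[--pos] = '\0';

    for (const struct fake_dentry *d = leaf; d != NULL; d = d->parent) {
        size_t n = strlen(d->name);
        size_t sep = (d != leaf) ? 1 : 0;  /* '/' between components */

        if (n + sep > pos)
            return -1;
        if (sep)
            buf[--pos] = '/';
        pos -= n;
        memcpy(buf + pos, d->name, n);
    }

    memmove(buf, buf + pos, len - pos);    /* includes the trailing NUL */
    return 0;
}
```

Fed the four names from the crash output, the helper yields the tail of the path quoted in this comment: 0000:12:00.0/net/ens1f0/speed.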


I have submitted the patch below upstream:


   net-sysfs: add check for netdevice being present to speed_show
    
    When bringing down the netdevice or system shutdown, a panic can be
    triggered while accessing the sysfs path because the device is already
    removed.
    
        [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
        [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
        ...
        [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
        [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280
    
        crash> bt
        ...
        PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
        ...
         #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
            [exception RIP: dma_pool_alloc+0x1ab]
            RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
            RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
            RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
            RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
            R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
            R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
        #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
        #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
        #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
        #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
        #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
        #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
        #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
        #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
        #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
        #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
        #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
        #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
        #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
        #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
        #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
        #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92
    
        crash> net_device.state ffff89443b0c0000
          state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)
    
    To prevent this scenario, we also make sure that the netdevice is present.
    
    Signed-off-by: suresh kumar <suresh2514>

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 53ea262ecafd..fbddf966206b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -213,7 +213,7 @@ static ssize_t speed_show(struct device *dev,
        if (!rtnl_trylock())
                return restart_syscall();
 
-       if (netif_running(netdev)) {
+       if (netif_running(netdev) && netif_device_present(netdev)) {
                struct ethtool_link_ksettings cmd;
 
                if (!__ethtool_get_link_ksettings(netdev, &cmd))



Provided a test kernel to the customer, who confirmed that the system no longer panics during fence testing.
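
To make the effect of the one-line change concrete, here is a small userspace model of the two checks applied to the dumped net_device.state value 0x5. The bit positions mirror the kernel's enum netdev_state_t (START = bit 0, PRESENT = bit 1, NOCARRIER = bit 2). The helper names are hypothetical, and the real netif_running() and netif_device_present() use test_bit() on dev->state.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Bit positions mirroring the first entries of the kernel's
 * enum netdev_state_t; copied here so the sketch builds in userspace.
 */
enum {
    LINK_STATE_START     = 0,
    LINK_STATE_PRESENT   = 1,
    LINK_STATE_NOCARRIER = 2,
};

static bool state_bit(unsigned long state, int bit)
{
    assert(bit >= 0 && bit < 8 * (int)sizeof(state));
    return (state >> bit) & 1UL;
}

/* Models netif_running(): the START bit is set. */
static bool model_running(unsigned long state)
{
    return state_bit(state, LINK_STATE_START);
}

/* Models netif_device_present(): the PRESENT bit is set. */
static bool model_present(unsigned long state)
{
    return state_bit(state, LINK_STATE_PRESENT);
}

/* Condition guarding the ethtool call in speed_show before the patch. */
static bool old_check(unsigned long state)
{
    return model_running(state);
}

/* Condition guarding the ethtool call in speed_show after the patch. */
static bool new_check(unsigned long state)
{
    return model_running(state) && model_present(state);
}
```

With state 0x5 (START and NOCARRIER set, PRESENT clear), old_check passes and the read continues into the driver, while new_check fails and the driver is never called.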


[2]
A similar issue was reported earlier at https://bugzilla.redhat.com/show_bug.cgi?id=1845694#c9, but no further information was provided and that Bugzilla was closed.

Comment 4 Marcelo Ricardo Leitner 2022-02-17 14:25:48 UTC

(In reply to suresh kumar from comment #3)
> I have submitted below patch to upstream:
> 
> 
>    net-sysfs: add check for netdevice being present to speed_show

Thanks Suresh.

That's:
https://lore.kernel.org/netdev/20220217015518.62719-1-sureshks%40redhat.com/T/

Comment 8 Tianhao 2022-03-28 09:06:10 UTC

Hi suresh,

Could the customer help test this bug?

Regards,
Tianhao

Comment 9 suresh kumar 2022-03-29 01:12:11 UTC

Hi Tianhao,

Yes. They also helped test our earlier test kernel.

Comment 10 Tianhao 2022-03-29 02:08:16 UTC

Based on comment #9, set OtherQA and qa_ack+.

Comment 11 William Zhao 2022-04-11 18:49:07 UTC

Hi Tianhao,

This is currently set to OtherQA and I haven't seen much movement for a while. I was wondering if there is any action needed on my side.

Comment 20 Tianhao 2022-05-09 06:20:33 UTC

The tier2 nic functional tests pass on kernel-3.10.0-1160.66.1.el7 on mlx5_core driver.

Test includes:
scaling: pass
setup topo via NetworkManager and reboot: mostly pass; the vlan over bridge topo failed, but it also fails on RHEL-7.9

related job:
https://beaker.engineering.redhat.com/jobs/6093585
https://beaker.engineering.redhat.com/jobs/6093586

No regressions were found in testing. Based on the tier1 test results on the dt kernel and the tier2 test results on the candidate kernel, set VERIFIED.

Comment 21 Tianhao 2022-05-09 06:21:46 UTC

(In reply to Tianhao from comment #20)
> https://beaker.engineering.redhat.com/jobs/6093585
> https://beaker.engineering.redhat.com/jobs/6093586
The related job should be:
https://beaker.engineering.redhat.com/jobs/6600152

Comment 25 errata-xmlrpc 2022-05-18 16:15:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4642