Bug 2055457
| Summary: | kernel NULL pointer dereference while calling dma_pool_alloc from the mlx5_core module [rhel-7.9.z] | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | suresh kumar <surkumar> |
| Component: | kernel | Assignee: | William Zhao <wizhao> |
| kernel sub component: | Networking | QA Contact: | Tianhao <tizhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | unspecified | CC: | arawal, jiji, kpfleming, kzhang, mleitner, nmurray, sukulkar, tizhao |
| Version: | 7.9 | Keywords: | OtherQA, ZStream |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 7.9 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | kernel-3.10.0-1160.66.1.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-05-18 16:15:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2069951 | ||
[1]
The system was trying to read /sys/devices/pci0000:11/0000:11:00.0/0000:12:00.0/net/ens1f0/speed and crashed because the device had already been removed.
+++
crash> mount |grep ffff8924f3b81000
ffff8944bfb28300 ffff8924f3b81000 sysfs sysfs /sys
crash> dentry.d_name.name,d_parent 0xffff892419b10f00
d_name.name = 0xffff892419b10f38 "speed"
d_parent = 0xffff89243cbc4cc0
crash> dentry.d_name.name,d_parent 0xffff89243cbc4cc0
d_name.name = 0xffff89243cbc4cf8 "ens1f0"
d_parent = 0xffff89443cecbd40
crash> dentry.d_name.name,d_parent 0xffff89443cecbd40
d_name.name = 0xffff89443cecbd78 "net"
d_parent = 0xffff89443cc1f740
crash> dentry.d_name.name,d_parent 0xffff89443cc1f740
d_name.name = 0xffff89443cc1f778 "0000:12:00.0"
d_parent = 0xffff8924f37eae40
+++
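The crash output above follows d_parent from the "speed" dentry upward, which is how the sysfs path in the first sentence was recovered. A minimal userspace sketch of that walk is shown here for illustration only. struct sketch_dentry and sketch_dentry_path() are hypothetical names, not kernel APIs, and the sketch assumes the walk stops at the "0000:12:00.0" entry, so it reproduces only the tail of the path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct dentry: name and parent only. */
struct sketch_dentry {
	const char *name;
	const struct sketch_dentry *parent;
};

/*
 * Build "a/b/c" for a leaf by following parent pointers and filling the
 * buffer from its end. Returns a pointer into buf, or NULL if buf is too
 * small.
 */
static inline char *sketch_dentry_path(const struct sketch_dentry *d,
				       char *buf, size_t len)
{
	char *p;

	assert(buf != NULL);
	if (len == 0)
		return NULL;

	p = buf + len;
	*--p = '\0';

	for (; d; d = d->parent) {
		size_t n = strlen(d->name);
		/* One extra byte for the '/' separator unless this is the top. */
		size_t need = n + (d->parent ? 1 : 0);

		if ((size_t)(p - buf) < need)
			return NULL;
		p -= n;
		memcpy(p, d->name, n);
		if (d->parent)
			*--p = '/';
	}
	return p;
}
```

Chaining four entries named "speed", "ens1f0", "net" and "0000:12:00.0" in that parent order and calling sketch_dentry_path() on the leaf yields the relative path 0000:12:00.0/net/ens1f0/speed.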
I have submitted below patch to upstream:
net-sysfs: add check for netdevice being present to speed_show
When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.
[ 755.549084] mlx5_core 0000:12:00.1: Shutdown was called
[ 756.404455] mlx5_core 0000:12:00.0: Shutdown was called
...
[ 757.937260] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280
crash> bt
...
PID: 12649 TASK: ffff8924108f2100 CPU: 1 COMMAND: "amsd"
...
#9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
[exception RIP: dma_pool_alloc+0x1ab]
RIP: ffffffff8ee11acb RSP: ffff89240e1a3968 RFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff89243d874100 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff89243d874090
RBP: ffff89240e1a39c0 R8: 000000000001f080 R9: ffff8905ffc03c00
R10: ffffffffc04680d4 R11: ffffffff8edde9fd R12: 00000000000080d0
R13: ffff89243d874090 R14: ffff89243d874080 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
#11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
#12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
#13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
#14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
#15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
#16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
#17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
#18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
#19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
#20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
#21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
#22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
#23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
#24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
#25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
#26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92
crash> net_device.state ffff89443b0c0000
state = 0x5 (__LINK_STATE_START| __LINK_STATE_NOCARRIER)
To prevent this scenario, we also make sure that the netdevice is present.
Signed-off-by: suresh kumar <suresh2514>
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 53ea262ecafd..fbddf966206b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -213,7 +213,7 @@ static ssize_t speed_show(struct device *dev,
 	if (!rtnl_trylock())
 		return restart_syscall();
 
-	if (netif_running(netdev)) {
+	if (netif_running(netdev) && netif_device_present(netdev)) {
 		struct ethtool_link_ksettings cmd;
 
 		if (!__ethtool_get_link_ksettings(netdev, &cmd))
Provided a test kernel to the customer, who confirmed that the system no longer panics during fence testing.
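To illustrate why the extra check helps, here is a small userspace sketch that decodes the state value 0x5 reported by crash. It assumes the bit numbering of the kernel's enum netdev_state_t (START = 0, PRESENT = 1, NOCARRIER = 2). All sketch_* names are hypothetical stand-ins, not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit numbers of the first three entries of enum netdev_state_t. */
enum sketch_netdev_state {
	SKETCH_LINK_STATE_START = 0,     /* __LINK_STATE_START */
	SKETCH_LINK_STATE_PRESENT = 1,   /* __LINK_STATE_PRESENT */
	SKETCH_LINK_STATE_NOCARRIER = 2, /* __LINK_STATE_NOCARRIER */
};

static inline bool sketch_test_bit(int nr, unsigned long state)
{
	return (state >> nr) & 1UL;
}

/* Stand-in for netif_running(): the START bit is set. */
static inline bool sketch_netif_running(unsigned long state)
{
	return sketch_test_bit(SKETCH_LINK_STATE_START, state);
}

/* Stand-in for netif_device_present(): the PRESENT bit is set. */
static inline bool sketch_netif_device_present(unsigned long state)
{
	return sketch_test_bit(SKETCH_LINK_STATE_PRESENT, state);
}

/* Guard in speed_show() before the patch. */
static inline bool sketch_old_guard(unsigned long state)
{
	return sketch_netif_running(state);
}

/* Guard in speed_show() after the patch. */
static inline bool sketch_new_guard(unsigned long state)
{
	return sketch_netif_running(state) &&
	       sketch_netif_device_present(state);
}

/* The value from the vmcore: running, but no longer present. */
static inline void sketch_check_reported_state(void)
{
	assert(sketch_netif_running(0x5UL));
	assert(!sketch_netif_device_present(0x5UL));
}
```

With state = 0x5 the old guard is true, so speed_show() proceeds into the driver's ethtool callback; the new guard is false because the PRESENT bit is clear, so the driver is not called.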
[2]
A similar issue was reported earlier at https://bugzilla.redhat.com/show_bug.cgi?id=1845694#c9, but no further information was provided and that bug was closed.
(In reply to suresh kumar from comment #3)
> I have submitted below patch to upstream:
>
> net-sysfs: add check for netdevice being present to speed_show

Thanks Suresh. That's:
https://lore.kernel.org/netdev/20220217015518.62719-1-sureshks%40redhat.com/T/

Hi Suresh,
Could the customer help test the bug?
Regards,
Tianhao

Hi Tianhao,
Yes. They also helped earlier by testing our test kernel.

Based on comment #9, set OtherQA and qa_ack+.

Hi Tianhao,
This is currently set to OtherQA and I haven't seen much movement for a while. I was wondering if there is any action needed on my side.

The tier2 NIC functional tests pass on kernel-3.10.0-1160.66.1.el7 with the mlx5_core driver.
Tests include:
scaling: pass
setup topo via NetworkManager and reboot: mostly pass, vlan over bridge topo also failed on RHEL-7.9
related jobs:
https://beaker.engineering.redhat.com/jobs/6093585
https://beaker.engineering.redhat.com/jobs/6093586

No regressions were found in testing. Based on the tier1 test results on the dt kernel and the tier2 test results on the candidate kernel, set VERIFIED.

(In reply to Tianhao from comment #20)
> https://beaker.engineering.redhat.com/jobs/6093585
> https://beaker.engineering.redhat.com/jobs/6093586
These should be:
https://beaker.engineering.redhat.com/jobs/6600152

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:4642
Description of problem:
While rebooting, the system triggered a panic.

[ 755.537074] qla2xxx [0000:37:00.1]-fffa:2: Adapter shutdown
[ 755.537140] qla2xxx [0000:37:00.1]-00af:2: Performing ISP error recovery - ha=ffff89243d855000.
[ 755.537594] qla2xxx [0000:37:00.1]-fffe:2: Adapter shutdown successfully.
[ 755.537596] qla2xxx [0000:37:00.0]-fffa:0: Adapter shutdown
[ 755.537685] qla2xxx [0000:37:00.0]-00af:0: Performing ISP error recovery - ha=ffff8905b2f99000.
[ 755.538510] qla2xxx [0000:37:00.0]-fffe:0: Adapter shutdown successfully.
[ 755.549084] mlx5_core 0000:12:00.1: Shutdown was called <---------------------------------
[ 755.613877] bond0: link status definitely down for interface ens1f1, disabling it
[ 756.404455] mlx5_core 0000:12:00.0: Shutdown was called <---------------------------------
[ 756.462899] bond0: link status definitely down for interface ens1f0, disabling it
[ 756.462909] bond0: now running without any active interface!
[ 757.164336] mlx5_core 0000:12:00.0: mlx5_cmd_check:745:(pid 12649): ACCESS_REG(0x805) op_mod(0x1) failed, status bad system state(0x4), syndrome (0x192deb)
[ 757.303748] dlm: closing connection to node 6
[ 757.303759] dlm: closing connection to node 5
[ 757.303766] dlm: closing connection to node 4
[ 757.303773] dlm: closing connection to node 3
[ 757.303780] dlm: closing connection to node 2
[ 757.303788] dlm: closing connection to node 1
[ 757.305729] dlm: data: no userland control daemon, stopping lockspace
[ 757.305743] dlm: data1: no userland control daemon, stopping lockspace
[ 757.305757] dlm: clvmd: no userland control daemon, stopping lockspace
[ 757.305769] dlm: dlm user daemon left 3 lockspaces
[ 757.937260] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280
[ 758.102527] PGD 8000001f970b9067 PUD 1f9b841067 PMD 0
[ 758.164249] Oops: 0000 [#1] SMP
[ 758.202963] Modules linked in: gfs2 dlm bonding falcon_lsm_serviceable(PE) falcon_nf_netcontain(PE) falcon_kal(E) falcon_lsm_pinned_12904(E) sunrpc dm_service_time skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ses enclosure dm_multipath ipmi_si mei_me ipmi_devintf sg lpc_ich hpilo joydev mei wmi hpwdt ipmi_msghandler acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 qla2xxx i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm crct10dif_pclmul crct10dif_common crc32c_intel serio_raw smartpqi nvme_fc nvme_fabrics nvme_core tg3 scsi_transport_sas scsi_transport_fc scsi_tgt mlxfw devlink
[ 759.051524] ptp drm_panel_orientation_quirks pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 759.138374] CPU: 1 PID: 12649 Comm: amsd Kdump: loaded Tainted: P E ------------ 3.10.0-1160.53.1.el7.x86_64 #1
[ 759.274327] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 01/23/2021
[ 759.376813] task: ffff8924108f2100 ti: ffff89240e1a0000 task.ti: ffff89240e1a0000
[ 759.466751] RIP: 0010:[<ffffffff8ee11acb>] [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280
[ 759.567160] RSP: 0018:ffff89240e1a3968 EFLAGS: 00010046
[ 759.630955] RAX: 0000000000000246 RBX: ffff89243d874100 RCX: 0000000000001000
[ 759.716709] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff89243d874090
[ 759.802465] RBP: ffff89240e1a39c0 R08: 000000000001f080 R09: ffff8905ffc03c00
[ 759.888220] R10: ffffffffc04680d4 R11: ffffffff8edde9fd R12: 00000000000080d0
[ 759.973976] R13: ffff89243d874090 R14: ffff89243d874080 R15: 0000000000000000
[ 760.059732] FS: 00007fa2fbc6c8c0(0000) GS:ffff892440040000(0000) knlGS:0000000000000000
[ 760.156991] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 760.226021] CR2: 0000000000000000 CR3: 0000001f8e2fc000 CR4: 00000000007607e0
[ 760.311773] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 760.397529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 760.483283] PKRU: 55555554
[ 760.515707] Call Trace:
[ 760.527371] sd 0:0:1:27: rejecting I/O to offline device
[ 760.527421] sd 0:0:1:72: rejecting I/O to offline device
[ 760.672598] [<ffffffff8ee2920c>] ? kmem_cache_alloc_trace+0x3c/0x200
[ 760.750018] [<ffffffffc04680f3>] mlx5_alloc_cmd_msg+0xd3/0x2a0 [mlx5_core]
[ 760.833716] [<ffffffffc046ad62>] cmd_exec+0x112/0x860 [mlx5_core]
[ 760.907999] [<ffffffffc046b4fb>] mlx5_cmd_exec+0x2b/0x50 [mlx5_core]
[ 760.985428] [<ffffffffc0475434>] mlx5_core_access_reg+0xe4/0x130 [mlx5_core]
[ 761.071222] [<ffffffffc04a7348>] mlx5e_get_fec_caps+0xa8/0x100 [mlx5_core]
[ 761.154933] [<ffffffffc04992bf>] get_fec_supported_advertised+0x3f/0x150 [mlx5_core]
[ 761.249101] [<ffffffffc049ab36>] mlx5e_get_link_ksettings+0x3a6/0x530 [mlx5_core]
[ 761.340112] [<ffffffff8f25db46>] __ethtool_get_link_ksettings+0xa6/0x210
[ 761.421701] [<ffffffff8f277208>] speed_show+0x78/0xb0
[ 761.483417] [<ffffffff8f0b70e3>] dev_attr_show+0x23/0x60
[ 761.548270] [<ffffffff8f3875f2>] ? mutex_lock+0x12/0x2f
[ 761.612076] [<ffffffff8eedbedf>] sysfs_kf_seq_show+0xcf/0x1f0
[ 761.682160] [<ffffffff8eeda596>] kernfs_seq_show+0x26/0x30
[ 761.749101] [<ffffffff8ee76d10>] seq_read+0x130/0x450
[ 761.810813] [<ffffffff8eedaef5>] kernfs_fop_read+0x105/0x170
[ 761.879845] [<ffffffff8ee4e3ff>] vfs_read+0x9f/0x170
[ 761.940510] [<ffffffff8ee4f27f>] SyS_read+0x7f/0xf0
[ 762.000131] [<ffffffff8f395f92>] system_call_fastpath+0x25/0x2a
[ 762.072293] Code: 4c 89 f6 48 89 df 48 89 45 b0 e8 d1 4b 19 00 8b 53 24 48 8b 45 b0 49 89 d7 4c 03 7b 10 83 43 20 01 48 03 53 18 48 89 c6 4c 89 ef <41> 8b 0f 89 4b 24 48 8b 4d b8 48 89 11 e8 e3 9c 57 00 41 81 e4
[ 762.299642] RIP [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280
[ 762.371817] RSP <ffff89240e1a3968>
[ 762.413652] CR2: 0000000000000000

Version-Release number of selected component (if applicable):
kernel 3.10.0-1160.53.1.el7.x86_64

How reproducible:
Customer is able to reproduce it with:
pcs stonith fence <host>

Actual results:
System panics while rebooting

Expected results:
No panic

Additional info:
Provided an upstream patch that adds a check for the netdevice being present to net-sysfs speed_show