Bug 2055457
Summary: | kernel NULL pointer dereference while calling dma_pool_alloc from the mlx5_core module [rhel-7.9.z] | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | suresh kumar <surkumar> |
Component: | kernel | Assignee: | William Zhao <wizhao> |
kernel sub component: | Networking | QA Contact: | Tianhao <tizhao> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | unspecified | CC: | arawal, jiji, kpfleming, kzhang, mleitner, nmurray, sukulkar, tizhao |
Version: | 7.9 | Keywords: | OtherQA, ZStream |
Target Milestone: | rc | ||
Target Release: | 7.9 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-3.10.0-1160.66.1.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-05-18 16:15:32 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2069951 |
Description
suresh kumar
2022-02-17 02:32:43 UTC
[1] The system was trying to access /sys/devices/pci0000:11/0000:11:00.0/0000:12:00.0/net/ens1f0/speed and crashed because the device was already removed

+++
crash> mount |grep ffff8924f3b81000
ffff8944bfb28300 ffff8924f3b81000 sysfs sysfs /sys

crash> dentry.d_name.name,d_parent 0xffff892419b10f00
  d_name.name = 0xffff892419b10f38 "speed"
  d_parent = 0xffff89243cbc4cc0
crash> dentry.d_name.name,d_parent 0xffff89243cbc4cc0
  d_name.name = 0xffff89243cbc4cf8 "ens1f0"
  d_parent = 0xffff89443cecbd40
crash> dentry.d_name.name,d_parent 0xffff89443cecbd40
  d_name.name = 0xffff89443cecbd78 "net"
  d_parent = 0xffff89443cc1f740
crash> dentry.d_name.name,d_parent 0xffff89443cc1f740
  d_name.name = 0xffff89443cc1f778 "0000:12:00.0"
  d_parent = 0xffff8924f37eae40
+++

I have submitted below patch to upstream:

net-sysfs: add check for netdevice being present to speed_show

When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1  COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5 (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.
Signed-off-by: suresh kumar <suresh2514>

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 53ea262ecafd..fbddf966206b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -213,7 +213,7 @@ static ssize_t speed_show(struct device *dev,
 	if (!rtnl_trylock())
 		return restart_syscall();
 
-	if (netif_running(netdev)) {
+	if (netif_running(netdev) && netif_device_present(netdev)) {
 		struct ethtool_link_ksettings cmd;
 
 		if (!__ethtool_get_link_ksettings(netdev, &cmd))

Provided a test kernel to the customer and confirmed that the system does not panic during fence testing.

[2] A similar issue was reported earlier at https://bugzilla.redhat.com/show_bug.cgi?id=1845694#c9, but no further information was provided and that bugzilla was closed.

(In reply to suresh kumar from comment #3)
> I have submitted below patch to upstream:
>
>
> net-sysfs: add check for netdevice being present to speed_show

Thanks Suresh. That's:
https://lore.kernel.org/netdev/20220217015518.62719-1-sureshks%40redhat.com/T/

Hi suresh,

Could the customer help test the bug?

Regards,
Tianhao

Hi Tianhao,

Yes. They also helped earlier with testing our test kernel.

Based on comment #9, set OtherQA and qa_ack+.

Hi Tianhao,

This is currently set to OtherQA and I haven't seen much movement for a while. I was wondering if there is any action needed on my side.

The tier2 nic functional tests pass on kernel-3.10.0-1160.66.1.el7 with the mlx5_core driver.

Tests include:
scaling: pass
setup topo via NetworkManager and reboot: mostly pass; the vlan over bridge topo also failed on RHEL-7.9

Related jobs:
https://beaker.engineering.redhat.com/jobs/6093585
https://beaker.engineering.redhat.com/jobs/6093586

No regressions were found in testing. Based on the tier1 test results on the dt kernel and the tier2 test results on the candidate kernel, set VERIFIED.
(In reply to Tianhao from comment #20)
> https://beaker.engineering.redhat.com/jobs/6093585
> https://beaker.engineering.redhat.com/jobs/6093586

The job link here should be:
https://beaker.engineering.redhat.com/jobs/6600152

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: kernel security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4642
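
For readers following the analysis in the description, here is a minimal userspace sketch of why the extra netif_device_present() check changes the outcome for the state captured in the vmcore (state = 0x5). It is an illustration under stated assumptions, not kernel code: struct fake_netdev, the fake_* helpers and the two guard functions are hypothetical names, and the bit positions are modelled on the kernel's enum netdev_state_t (START = bit 0, PRESENT = bit 1, NOCARRIER = bit 2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit positions modelled on the kernel's enum netdev_state_t. */
enum link_state_bit {
	LINK_STATE_START = 0,     /* device has been brought up */
	LINK_STATE_PRESENT = 1,   /* device is still attached */
	LINK_STATE_NOCARRIER = 2, /* no carrier on the link */
};

/* Hypothetical stand-in for struct net_device; only the state word matters. */
struct fake_netdev {
	unsigned long state;
};

static bool fake_test_bit(const struct fake_netdev *dev, enum link_state_bit bit)
{
	assert(dev != NULL);
	return ((dev->state >> bit) & 1UL) != 0;
}

/* Userspace analogue of netif_running(): checks the START bit. */
static bool fake_netif_running(const struct fake_netdev *dev)
{
	return fake_test_bit(dev, LINK_STATE_START);
}

/* Userspace analogue of netif_device_present(): checks the PRESENT bit. */
static bool fake_netif_device_present(const struct fake_netdev *dev)
{
	return fake_test_bit(dev, LINK_STATE_PRESENT);
}

/* Condition used by speed_show() before the patch. */
static bool speed_guard_before_patch(const struct fake_netdev *dev)
{
	return fake_netif_running(dev);
}

/* Condition used by speed_show() after the patch. */
static bool speed_guard_after_patch(const struct fake_netdev *dev)
{
	return fake_netif_running(dev) && fake_netif_device_present(dev);
}
```

With state = 0x5 the START and NOCARRIER bits are set but PRESENT is clear, so the pre-patch guard would still call into the driver while the patched guard would skip the call. A device with state = 0x3 (START and PRESENT) passes both guards.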