Bug 1391299

Summary: [LLNL 7.4 Bug] Crash in Infiniband rdmavt layer when kernel consumer exhausts queue pairs
Product: Red Hat Enterprise Linux 7
Reporter: Jim Foraker <foraker1>
Component: kernel
Assignee: Jonathan Toppins <jtoppins>
kernel sub component: Infiniband
QA Contact: Mike Stowell <mstowell>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: dhoward, jshortt, mstowell, rdma-dev-team, tgummels
Version: 7.3
Keywords: ZStream
Target Milestone: rc
Target Release: 7.4
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: kernel-3.10.0-549.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1417191 (view as bug list)
Environment:
Last Closed: 2017-08-02 04:25:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1353018, 1381646, 1417191, 1446211
Attachments:
kernel module reproducer (flags: none)
enhanced kernel module to accept module params - makes automation easier (flags: none)

Description Jim Foraker 2016-11-03 00:57:52 UTC
Created attachment 1216823 [details]
kernel module reproducer

Description of problem:

Several of our nodes have experienced crashes similar to the following:

2016-10-31 14:48:04 [246684.429255] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
2016-10-31 14:48:04 [246684.438112] IP: [<ffffffffa09ac5cc>] rvt_create_qp+0x3fc/0xa60 [rdmavt]
2016-10-31 14:48:04 [246684.445605] PGD 1ffc7e4067 PUD 2021a63067 PMD 0 
2016-10-31 14:48:04 [246684.450883] Oops: 0002 [#1] SMP 
2016-10-31 14:48:04 [246684.454598] Modules linked in: lmv(OE) fld(OE) mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) fid(OE) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 nfsv4 dns_resolver ko2iblnd(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) nfsv3 nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp coretemp intel_rapl iosf_mbi hfi1 kvm irqbypass iTCO_wdt mei_me rdmavt ipmi_devintf iTCO_vendor_support mei sb_edac sg lpc_ich pcspkr shpchp i2c_i801 edac_core ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace ip_tables ext4 mbcache jbd2 dm_service_time sd_mod crc_t10dif crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i iw_cxgb3 ib_core cxgb3 mdio libcxgbi qla4xxx iscsi_boot_sysfs crct10dif_pclmul crct10dif_common crc32_pclmul 8021q crc32c_intel mgag200 garp ghash_clmulni_intel stp drm_kms_helper llc syscopyarea sysfillrect mrp dm_multipath sysimgblt aesni_intel fb_sys_fops lrw igb gf128mul glue_helper ttm ablk_helper ahci dca cryptd libahci ptp drm pps_core libata i2c_algo_bit i2c_core mxm_wmi fjes wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate
2016-10-31 14:48:04 [246684.581974] CPU: 18 PID: 134712 Comm: kworker/u384:1 Tainted: P           OE  ------------   3.10.0-510.0.0.2chaos.ch6.x86_64 #1
2016-10-31 14:48:04 [246684.594977] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-10-31 14:48:04 [246684.607403] Workqueue: rdma_cm cma_work_handler [rdma_cm]
2016-10-31 14:48:04 [246684.613531] task: ffff880e47a05e20 ti: ffff88201db14000 task.ti: ffff88201db14000
2016-10-31 14:48:04 [246684.621979] RIP: 0010:[<ffffffffa09ac5cc>]  [<ffffffffa09ac5cc>] rvt_create_qp+0x3fc/0xa60 [rdmavt]
2016-10-31 14:48:04 [246684.632182] RSP: 0018:ffff88201db17bf0  EFLAGS: 00010246
2016-10-31 14:48:04 [246684.638205] RAX: 0000000000000000 RBX: ffff88103a2c0000 RCX: 0000000000000000
2016-10-31 14:48:04 [246684.646263] RDX: 0000000000008dd4 RSI: 0000000000000000 RDI: 0000000000000028
2016-10-31 14:48:04 [246684.654320] RBP: ffff88201db17c80 R08: ffff880178ee8000 R09: ffff88031e961800
2016-10-31 14:48:04 [246684.662378] R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffff4
2016-10-31 14:48:04 [246684.670436] R13: ffff88103a2c099c R14: ffff88031e9c1000 R15: 00000000000006e8
2016-10-31 14:48:04 [246684.678495] FS:  0000000000000000(0000) GS:ffff88203de00000(0000) knlGS:0000000000000000
2016-10-31 14:48:04 [246684.687619] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-10-31 14:48:04 [246684.694127] CR2: 0000000000000028 CR3: 000000202faa2000 CR4: 00000000003407e0
2016-10-31 14:48:04 [246684.702185] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2016-10-31 14:48:04 [246684.710243] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2016-10-31 14:48:04 [246684.718300] Stack:
2016-10-31 14:48:04 [246684.720637]  8000000000000163 000080d21db17fd8 ffff88031e34d308 00000000b54cba81
2016-10-31 14:48:04 [246684.729026]  ffff88201db17c28 ffffffff811c4732 ffffc90200000102 ffff880178ee8030
2016-10-31 14:48:04 [246684.737415]  0000000000000020 ffffc90203907000 0000000000000000 00000000ffffffff
2016-10-31 14:48:04 [246684.745803] Call Trace:
2016-10-31 14:48:04 [246684.748630]  [<ffffffff811c4732>] ? find_vmap_area+0x42/0x70
2016-10-31 14:48:04 [246684.755041]  [<ffffffffa06a5a3f>] ib_create_qp+0x3f/0x250 [ib_core]
2016-10-31 14:48:04 [246684.762131]  [<ffffffffa09ef5a4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
2016-10-31 14:48:04 [246684.769325]  [<ffffffffa0bb93d3>] kiblnd_create_conn+0xc83/0x1a70 [ko2iblnd]
2016-10-31 14:48:04 [246684.777290]  [<ffffffffa0bc9b39>] kiblnd_active_connect+0x79/0x540 [ko2iblnd]
2016-10-31 14:48:04 [246684.785343]  [<ffffffff810cb7f5>] ? sched_clock_cpu+0xa5/0xe0
2016-10-31 14:48:04 [246684.791852]  [<ffffffffa0bcb0e0>] kiblnd_cm_callback+0x10e0/0x1260 [ko2iblnd]
2016-10-31 14:48:04 [246684.799911]  [<ffffffffa09f346c>] cma_work_handler+0x6c/0xa0 [rdma_cm]
2016-10-31 14:48:04 [246684.807294]  [<ffffffff810aab2b>] process_one_work+0x18b/0x4d0
2016-10-31 14:48:04 [246684.813897]  [<ffffffff810aba66>] worker_thread+0x126/0x430
2016-10-31 14:48:04 [246684.820209]  [<ffffffff810ab940>] ? rescuer_thread+0x4b0/0x4b0
2016-10-31 14:48:04 [246684.826814]  [<ffffffff810b34cf>] kthread+0xcf/0xe0
2016-10-31 14:48:04 [246684.832353]  [<ffffffff810b3400>] ? kthread_create_on_node+0x140/0x140
2016-10-31 14:48:04 [246684.839735]  [<ffffffff816acfd8>] ret_from_fork+0x58/0x90
2016-10-31 14:48:04 [246684.845854]  [<ffffffff810b3400>] ? kthread_create_on_node+0x140/0x140
2016-10-31 14:48:04 [246684.853234] Code: 49 8d 75 20 ba 08 00 00 00 4c 89 ff e8 5e 67 98 e0 85 c0 0f 84 f9 01 00 00 49 c7 c4 f2 ff ff ff 49 8b 86 10 01 00 00 48 8d 78 28 <f0> 83 68 28 01 0f 94 c2 84 d2 74 05 e8 a3 d6 ff ff 41 8b 96 a0 
2016-10-31 14:48:04 [246684.875054] RIP  [<ffffffffa09ac5cc>] rvt_create_qp+0x3fc/0xa60 [rdmavt]
2016-10-31 14:48:04 [246684.882638]  RSP <ffff88201db17bf0>
2016-10-31 14:48:04 [246684.886623] CR2: 0000000000000028
2016-10-31 14:48:04 [246684.893660] ---[ end trace d73e3a2bbac48f14 ]---

When rvt_create_qp() runs out of queue pairs to allocate, it attempts to put a reference on qp->ip, but that pointer is NULL when the request comes from kernel space.  This matches the oops above: RAX is 0 and the faulting address (CR2) is 0x28, a small offset from a NULL base.  In our case the exhaustion appears to be caused by Lustre (ko2iblnd) churning through queue pairs, but any in-kernel verbs consumer should be able to trigger it, including IPoIB on a sufficiently large fabric.

I have posted a patch to linux-rdma, which contains a simple workaround for the issue:

"IB/rdmavt: Only put mmap_info ref if it exists"
http://marc.info/?l=linux-rdma&m=147803367001588&w=2

To verify the issue and the fix, I created and attached a reproducer that simply spawns thousands of queue pairs from within the kernel.

Version-Release number of selected component (if applicable):

3.10.0-510.el7

How reproducible:

With reproducer, always.

Steps to Reproduce:
1. Download the reproducer, manyqp.c
2. Set TEST_ADDR to an IP address on the IB device under test, and MAX_CONNS to a value larger than the number of queue pairs configured for the device.
3. Compile as a kernel module and install.

Actual results:

Generates a NULL pointer dereference with the same address and at the same offset into rvt_create_qp() as above.

Expected results:

The kernel should not crash with a NULL pointer dereference.  Once the QP pool is exhausted, subsequent attempts to create QPs should fail with ENOMEM (-12).

Comment 5 Jonathan Toppins 2017-01-06 18:45:55 UTC
Created attachment 1238084 [details]
enhanced kernel module to accept module params - makes automation easier

Comment 7 Jonathan Toppins 2017-01-19 14:23:17 UTC
Posted:
http://patchwork.usersys.redhat.com/patch/162825/

Comment 8 Rafael Aquini 2017-01-26 14:56:40 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 11 Rafael Aquini 2017-01-27 14:05:39 UTC
Patch(es) available on kernel-3.10.0-549.el7

Comment 15 errata-xmlrpc 2017-08-02 04:25:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842