Red Hat Bugzilla – Bug 1391299
[LLNL 7.4 Bug] Crash in Infiniband rdmavt layer when kernel consumer exhausts queue pairs
Last modified: 2017-08-02 00:25:56 EDT
Created attachment 1216823
kernel module reproducer

Description of problem:
Several of our nodes have experienced crashes similar to the following:

2016-10-31 14:48:04 [246684.429255] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
2016-10-31 14:48:04 [246684.438112] IP: [<ffffffffa09ac5cc>] rvt_create_qp+0x3fc/0xa60 [rdmavt]
2016-10-31 14:48:04 [246684.445605] PGD 1ffc7e4067 PUD 2021a63067 PMD 0
2016-10-31 14:48:04 [246684.450883] Oops: 0002 [#1] SMP
2016-10-31 14:48:04 [246684.454598] Modules linked in: lmv(OE) fld(OE) mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) fid(OE) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 nfsv4 dns_resolver ko2iblnd(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) nfsv3 nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp coretemp intel_rapl iosf_mbi hfi1 kvm irqbypass iTCO_wdt mei_me rdmavt ipmi_devintf iTCO_vendor_support mei sb_edac sg lpc_ich pcspkr shpchp i2c_i801 edac_core ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace ip_tables ext4 mbcache jbd2 dm_service_time sd_mod crc_t10dif crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i iw_cxgb3 ib_core cxgb3 mdio libcxgbi qla4xxx iscsi_boot_sysfs crct10dif_pclmul crct10dif_common crc32_pclmul 8021q crc32c_intel mgag200 garp ghash_clmulni_intel stp drm_kms_helper llc syscopyarea sysfillrect mrp dm_multipath sysimgblt aesni_intel fb_sys_fops lrw igb gf128mul glue_helper ttm ablk_helper ahci dca cryptd libahci ptp drm pps_core libata i2c_algo_bit i2c_core mxm_wmi fjes wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate
2016-10-31 14:48:04 [246684.581974] CPU: 18 PID: 134712 Comm: kworker/u384:1 Tainted: P OE ------------ 3.10.0-510.0.0.2chaos.ch6.x86_64 #1
2016-10-31 14:48:04 [246684.594977] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-10-31 14:48:04 [246684.607403] Workqueue: rdma_cm cma_work_handler [rdma_cm]
2016-10-31 14:48:04 [246684.613531] task: ffff880e47a05e20 ti: ffff88201db14000 task.ti: ffff88201db14000
2016-10-31 14:48:04 [246684.621979] RIP: 0010:[<ffffffffa09ac5cc>] [<ffffffffa09ac5cc>] rvt_create_qp+0x3fc/0xa60 [rdmavt]
2016-10-31 14:48:04 [246684.632182] RSP: 0018:ffff88201db17bf0 EFLAGS: 00010246
2016-10-31 14:48:04 [246684.638205] RAX: 0000000000000000 RBX: ffff88103a2c0000 RCX: 0000000000000000
2016-10-31 14:48:04 [246684.646263] RDX: 0000000000008dd4 RSI: 0000000000000000 RDI: 0000000000000028
2016-10-31 14:48:04 [246684.654320] RBP: ffff88201db17c80 R08: ffff880178ee8000 R09: ffff88031e961800
2016-10-31 14:48:04 [246684.662378] R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffff4
2016-10-31 14:48:04 [246684.670436] R13: ffff88103a2c099c R14: ffff88031e9c1000 R15: 00000000000006e8
2016-10-31 14:48:04 [246684.678495] FS: 0000000000000000(0000) GS:ffff88203de00000(0000) knlGS:0000000000000000
2016-10-31 14:48:04 [246684.687619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-10-31 14:48:04 [246684.694127] CR2: 0000000000000028 CR3: 000000202faa2000 CR4: 00000000003407e0
2016-10-31 14:48:04 [246684.702185] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2016-10-31 14:48:04 [246684.710243] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2016-10-31 14:48:04 [246684.718300] Stack:
2016-10-31 14:48:04 [246684.720637] 8000000000000163 000080d21db17fd8 ffff88031e34d308 00000000b54cba81
2016-10-31 14:48:04 [246684.729026] ffff88201db17c28 ffffffff811c4732 ffffc90200000102 ffff880178ee8030
2016-10-31 14:48:04 [246684.737415] 0000000000000020 ffffc90203907000 0000000000000000 00000000ffffffff
2016-10-31 14:48:04 [246684.745803] Call Trace:
2016-10-31 14:48:04 [246684.748630] [<ffffffff811c4732>] ? find_vmap_area+0x42/0x70
2016-10-31 14:48:04 [246684.755041] [<ffffffffa06a5a3f>] ib_create_qp+0x3f/0x250 [ib_core]
2016-10-31 14:48:04 [246684.762131] [<ffffffffa09ef5a4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
2016-10-31 14:48:04 [246684.769325] [<ffffffffa0bb93d3>] kiblnd_create_conn+0xc83/0x1a70 [ko2iblnd]
2016-10-31 14:48:04 [246684.777290] [<ffffffffa0bc9b39>] kiblnd_active_connect+0x79/0x540 [ko2iblnd]
2016-10-31 14:48:04 [246684.785343] [<ffffffff810cb7f5>] ? sched_clock_cpu+0xa5/0xe0
2016-10-31 14:48:04 [246684.791852] [<ffffffffa0bcb0e0>] kiblnd_cm_callback+0x10e0/0x1260 [ko2iblnd]
2016-10-31 14:48:04 [246684.799911] [<ffffffffa09f346c>] cma_work_handler+0x6c/0xa0 [rdma_cm]
2016-10-31 14:48:04 [246684.807294] [<ffffffff810aab2b>] process_one_work+0x18b/0x4d0
2016-10-31 14:48:04 [246684.813897] [<ffffffff810aba66>] worker_thread+0x126/0x430
2016-10-31 14:48:04 [246684.820209] [<ffffffff810ab940>] ? rescuer_thread+0x4b0/0x4b0
2016-10-31 14:48:04 [246684.826814] [<ffffffff810b34cf>] kthread+0xcf/0xe0
2016-10-31 14:48:04 [246684.832353] [<ffffffff810b3400>] ? kthread_create_on_node+0x140/0x140
2016-10-31 14:48:04 [246684.839735] [<ffffffff816acfd8>] ret_from_fork+0x58/0x90
2016-10-31 14:48:04 [246684.845854] [<ffffffff810b3400>] ? kthread_create_on_node+0x140/0x140
2016-10-31 14:48:04 [246684.853234] Code: 49 8d 75 20 ba 08 00 00 00 4c 89 ff e8 5e 67 98 e0 85 c0 0f 84 f9 01 00 00 49 c7 c4 f2 ff ff ff 49 8b 86 10 01 00 00 48 8d 78 28 <f0> 83 68 28 01 0f 94 c2 84 d2 74 05 e8 a3 d6 ff ff 41 8b 96 a0
2016-10-31 14:48:04 [246684.875054] RIP [<ffffffffa09ac5cc>] rvt_create_qp+0x3fc/0xa60 [rdmavt]
2016-10-31 14:48:04 [246684.882638] RSP <ffff88201db17bf0>
2016-10-31 14:48:04 [246684.886623] CR2: 0000000000000028
2016-10-31 14:48:04 [246684.893660] ---[ end trace d73e3a2bbac48f14 ]---

When rvt_create_qp() runs out of queue pairs to allocate, it will attempt to put a reference to qp->ip, but this structure is NULL if the request comes from kernel space.
In our case, it appears that this is being caused by Lustre (ko2iblnd) churning through queue pairs, but it should be triggerable by any in-kernel verbs consumer, including IPoIB on a sufficiently large fabric.

I have posted a patch to linux-rdma that contains a simple workaround for the issue:
"IB/rdmavt: Only put mmap_info ref if it exists"
http://marc.info/?l=linux-rdma&m=147803367001588&w=2

To verify the issue and the fix, I created a reproducer that simply creates thousands of queue pairs from within the kernel, which I've attached.

Version-Release number of selected component (if applicable):
3.10.0-510.el7

How reproducible:
With reproducer, always.

Steps to Reproduce:
1. Download the reproducer, manyqp.c
2. Modify TEST_ADDR to be an IP address on the IB device to be tested, and MAX_CONNS to be larger than the number of queue pairs configured for the device.
3. Compile as a kernel module and install.

Actual results:
Generates a NULL pointer dereference with the same address and at the same offset into rvt_create_qp() as above.

Expected results:
No NULL pointer dereference. Once the QP pool is exhausted, subsequent attempts to create QPs should fail with ENOMEM (-12).
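For readers who want to see the failure mode in isolation, here is a minimal user-space sketch of the error path described in this report. It is not the rdmavt code and not the posted patch. The names struct mmap_info, struct qp_pool, mmap_info_put() and create_qp() are hypothetical stand-ins, and the pool limit is an invented parameter. It only illustrates the idea in the patch title: drop the mmap_info reference on the failure path only when the pointer is set, so that a kernel-style consumer (no mmap_info) gets -ENOMEM instead of a crash.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-QP mmap bookkeeping; in this model only
 * "user-space" requests allocate one. */
struct mmap_info {
    int refcount;
};

struct qp {
    struct mmap_info *ip;   /* stays NULL for a kernel-style consumer */
};

/* Hypothetical pool with a fixed number of QPs. */
struct qp_pool {
    int max_qps;
    int in_use;
};

static void mmap_info_put(struct mmap_info *ip)
{
    assert(ip->refcount > 0);
    if (--ip->refcount == 0)
        free(ip);
}

/* Returns 0 on success or a negative errno, following kernel convention.
 * *out is only written on success. */
static int create_qp(struct qp_pool *pool, int from_user, struct qp **out)
{
    struct qp *qp = calloc(1, sizeof(*qp));

    if (!qp)
        return -ENOMEM;

    if (from_user) {
        qp->ip = calloc(1, sizeof(*qp->ip));
        if (!qp->ip) {
            free(qp);
            return -ENOMEM;
        }
        qp->ip->refcount = 1;
    }

    if (pool->in_use >= pool->max_qps) {
        /* Pool exhausted. Without the NULL check, a request with
         * qp->ip == NULL would dereference a NULL pointer here. */
        if (qp->ip)
            mmap_info_put(qp->ip);
        free(qp);
        return -ENOMEM;
    }

    pool->in_use++;
    *out = qp;
    return 0;
}
```

With a pool of two QPs, a third create_qp() call with from_user set to 0 returns -ENOMEM (-12 on Linux) and leaves the output pointer untouched, which matches the expected results listed above.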
Created attachment 1238084
enhanced kernel module to accept module params - makes automation easier
Posted: http://patchwork.usersys.redhat.com/patch/162825/
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing
Patch(es) available on kernel-3.10.0-549.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1842