| Summary: | warning and hang in ib DMA code when doing NFSoRDMA to server | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jeff Layton <jlayton> | ||||
| Component: | kernel | Assignee: | Sachin Prabhu <sprabhu> | ||||
| Status: | CLOSED WORKSFORME | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||
| Severity: | unspecified | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 7.0 | CC: | chuck.lever, dledford, honli, infiniband-qe, kheib, nfs-maint, shirleym.pdx, sprabhu, steved, swhiteho, zguo | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2016-11-28 08:33:02 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1171868, 1255429, 1289204 | ||||||
| Attachments: |
Description
Jeff Layton
2013-12-09 11:39:10 UTC
FWIW, same exact problem on -60.el6 as well... I'll grab this for now, but it looks like this is happening down in the bowels of the page mgmt code... cc'ing Chuck since he's been looking at the NFSoRDMA code too...

Nothing the client does should cause a server crash (not a predictive statement, but rather a statement of design goal). So IMO this is a server bug. In fact, I think this is probably closer to the silicon than it is to NFS. What HBA are you using? Can you also add info about IB software stack version? If you can reproduce this on upstream kernels and there is a known working configuration, try to bisect the server's kernel. Bear with me. I'm not at all familiar with this code yet, and it's quite complex...

Agreed, this is definitely a server bug. I just posted the details about the reproducer since it is reproducible. Note that I'm doing <PAGE_SIZE writes here. I need to go back and try to do page-sized DIOs from the client and see if that works around the bug. If it does, then that may tell us something about it. It does seem like something in the bowels of the rdma code, but it's also possible that the RPC layer is passing down a bogus pagelist or something.

I think both the client and server are using Mellanox cards in this case. On the software stack, I'm not sure. This is RHEL7 so it's fairly close to current Fedora. If you know the packages you want versions of, let me know. Doug may be able to tell us more.

If I do the same dd command and use bs=4096, then there is no oops on the server. That's probably significant... ...and yeah, confirmed. This IB network is:
FDR fabric using Mellanox 36 port 56GBit/s SwitchX model SX6036
These hosts are multiply connected with different fabrics, so maybe I'll give a go with a different fabric/card driver when I get a few mins.
One thing that looks suspect here:
```c
static void frmr_unmap_dma(struct svcxprt_rdma *xprt,
                           struct svc_rdma_fastreg_mr *frmr)
{
        int page_no;

        for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
                dma_addr_t addr = frmr->page_list->page_list[page_no];

                if (ib_dma_mapping_error(frmr->mr->device, addr))
                        continue;
                atomic_dec(&xprt->sc_dma_used);
                ib_dma_unmap_page(frmr->mr->device, addr, PAGE_SIZE,
                                  frmr->direction);
        }
}
```
...the hardcoded PAGE_SIZE in there looks suspicious. I wonder if we always know that this is a page length or not?
So dma_ops_domain_unmap has this:

```c
        pte += PM_LEVEL_INDEX(0, address);

        WARN_ON(!*pte);

        *pte = 0ULL;
```

...so maybe this means that there was a double unmapping, or that the pte ID'ed in that function was wrong...
So if I add a "count=1" to the dd command then it all works fine. Same up to about count=3, but once I hit count=4 it fails.

My suspicion at this point is that the bug is due to something like this: We get a 512b READ request on the server. Server then maps the corresponding page on the client and starts writing to it. During this we get another 512b request from the client and the server does the same thing, but it's mapping the same page. Eventually the first READ request is finished and the server goes to remove the mapping, but the other requests aren't done yet and the DMA mapping is ripped out from under them.

AFAICT, svc_rdma_get_frmr/svc_rdma_put_frmr don't do any refcounting so there doesn't seem to be any way to know whether more than one request is referencing it. Adding that refcounting doesn't look *too* hard, fortunately...

I can reproduce the same problem on f20 as well. That kernel however gives some interesting warnings before the WARN pops. Pretty much the same problem though:

[ 4545.538763] ------------[ cut here ]------------
[ 4545.538810] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x00000000111df000 flags=0x0020]
[ 4545.538816] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x00000000111df0e0 flags=0x0020]
[ 4545.538820] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x00000000111df040 flags=0x0020]
[ 4545.538824] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x00000000111df0c0 flags=0x0020]
[ 4545.538828] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x00000000111df080 flags=0x0020]
[ 4545.597252] WARNING: CPU: 0 PID: 0 at drivers/iommu/amd_iommu.c:2694 dma_ops_domain_unmap.part.11+0x62/0x70()
[ 4545.607307] Modules linked in: svcrdma cfg80211 rfkill ib_iser libiscsi scsi_transport_iscsi ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs kvm_amd ipmi_devintf mlx5_ib kvm ib_umad bnx2 ipmi_si mlx5_core amd64_edac_mod dcdbas microcode rdma_cm ipmi_msghandler edac_core sp5100_tco shpchp acpi_power_meter serio_raw i2c_piix4 k10temp ib_cm edac_mce_amd acpi_cpufreq iw_cm ib_addr ib_sa ib_mad ib_core nfsd auth_rpcgss nfs_acl lockd sunrpc mgag200 i2c_algo_bit drm_kms_helper ata_generic ttm pata_acpi drm i2c_core pata_atiixp
[ 4545.656796] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.6-300.fc20.x86_64 #1
[ 4545.664210] Hardware name: Dell Inc. PowerEdge R415/0GXH08, BIOS 2.0.2 10/22/2012
[ 4545.671799] 0000000000000009 ffff88081ee03d18 ffffffff816630c1 0000000000000000
[ 4545.679358] ffff88081ee03d50 ffffffff810691dd ffff880812612ef8 0000000000000000
[ 4545.686916] 0000000000001000 00000000111df000 ffff880810bc4c00 ffff88081ee03d60
[ 4545.694474] Call Trace:
[ 4545.696955] <IRQ> [<ffffffff816630c1>] dump_stack+0x45/0x56
[ 4545.702824] [<ffffffff810691dd>] warn_slowpath_common+0x7d/0xa0
[ 4545.708922] [<ffffffff810692ba>] warn_slowpath_null+0x1a/0x20
[ 4545.714843] [<ffffffff81538922>] dma_ops_domain_unmap.part.11+0x62/0x70
[ 4545.721645] [<ffffffff81539e03>] __unmap_single.isra.19+0xa3/0x110
[ 4545.728005] [<ffffffff8153a591>] unmap_page+0x51/0x70
[ 4545.733225] [<ffffffffa0359f3b>] svc_rdma_unmap_dma+0xdb/0xf0 [svcrdma]
[ 4545.740026] [<ffffffffa035b17e>] dto_tasklet_func+0xee/0x1c0 [svcrdma]
[ 4545.746741] [<ffffffff81010000>] ? xen_smp_intr_free+0xf0/0x1d0
[ 4545.752836] [<ffffffff8106e26e>] tasklet_action+0x6e/0x110
[ 4545.758489] [<ffffffff8106e747>] __do_softirq+0xf7/0x240
[ 4545.763969] [<ffffffff816739dc>] call_softirq+0x1c/0x30
[ 4545.769362] [<ffffffff810146a5>] do_softirq+0x55/0x90
[ 4545.774575] [<ffffffff8106ea25>] irq_exit+0xb5/0xc0
[ 4545.779615] [<ffffffff816742d6>] do_IRQ+0x56/0xc0
[ 4545.784478] [<ffffffff81669fad>] common_interrupt+0x6d/0x6d
[ 4545.790219] <EOI> [<ffffffff8104e486>] ? native_safe_halt+0x6/0x10
[ 4545.796698] [<ffffffff8101b0bf>] default_idle+0x1f/0xc0
[ 4545.802088] [<ffffffff8101b996>] arch_cpu_idle+0x26/0x30
[ 4545.811583] [<ffffffff810b9845>] cpu_startup_entry+0xe5/0x280
[ 4545.821498] [<ffffffff816593f7>] rest_init+0x77/0x80
[ 4545.830620] [<ffffffff81d1cef7>] start_kernel+0x40a/0x415
[ 4545.840161] [<ffffffff81d1c8f6>] ? repair_env_string+0x5c/0x5c
[ 4545.850048] [<ffffffff81d1c120>] ? early_idt_handlers+0x120/0x120
[ 4545.860116] [<ffffffff81d1c5de>] x86_64_start_reservations+0x2a/0x2c
[ 4545.870421] [<ffffffff81d1c6e8>] x86_64_start_kernel+0x108/0x117
[ 4545.880309] ---[ end trace 24fce75e1a28ae2f ]---

...and I can't reproduce this against RHEL6, so it seems likely that something crept in after 2.6.32 to cause this regression. I'll try bisecting to track it down.

Still inconclusive. I can reproduce the problem only with certain IB drivers, apparently. I checked out a 3rd machine and am unable to reproduce the hang on it. At this point, I think I need help from someone who understands IB better than I do... The one where I can reproduce the bug has this card:
01:00.0 InfiniBand: Mellanox Technologies MT27600 [Connect-IB]
...the other one has this:
01:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0)
Apparently, RHEL6 doesn't have drivers for the Connect-IB card as it's not detected when I do a RHEL6 install on it.
I suspect the reason I can't reproduce it on those other cards is that they may not support FRMR?
Ok, finally started doing some debugging. The difference is that the two hosts are using completely different codepaths. The problematic one is using the "fastreg" codepath, whereas the other is not. Now to examine the differences and see if I can determine why it's problematic.

Not directly related to this bug, but the client also doesn't seem to cope well with the server crashing and coming back:

RPC: xprt_rdma_close: closing
RPC: disconnected transport ffff880821c25000
RPC: rpcrdma_conn_upcall: disconnected: 172.31.0.204:20049 (ep 0xffff880821c255e0 event 0xa)
RPC: rpcrdma_conn_upcall: disconnected
rpcrdma: connection to 172.31.0.204:20049 closed (-103)
RPC: rpcrdma_ep_disconnect: after wait, disconnected
RPC: wake_up_first(ffff880821c25190 "xprt_sending")
RPC: rpcrdma_conn_upcall: unexpected CM event 15

...that event 15 in the last line is (I think) RDMA_CM_EVENT_TIMEWAIT_EXIT, which doesn't seem to be handled properly by rpcrdma_conn_upcall.

So, my theory on this bug is that we map a page in order to RDMA WRITE it to another client and then it is somehow unmapped more than once.
There is some raciness involved here too, and turning up rpc_debug tends to throw off the timing. So, I did a Q&D patch that converts the dprintk() macro to use trace_printk() to lower the overhead, and added some debugging around ib_dma_map_page and ib_dma_unmap_page (basically I just wrapped those functions and added some dprintks to the wrappers).
I then ran the test from a client that had mounted it:
```
# dd if=/file/on/nfsordma of=/dev/null bs=512 count=4 iflag=direct
```
...that eventually gave me the typical WARN pop on the host that uses FRMR:
[ 175.144493] ------------[ cut here ]------------
[ 175.144540] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x0000000010a35000 flags=0x0020]
[ 175.144546] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x0000000010a350c0 flags=0x0020]
[ 175.144550] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x0000000010a35080 flags=0x0020]
[ 175.144553] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x0000000010a35040 flags=0x0020]
[ 175.144557] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x0000000010a350e0 flags=0x0020]
[ 175.202993] WARNING: CPU: 0 PID: 0 at drivers/iommu/amd_iommu.c:2694 dma_ops_domain_unmap.part.11+0x62/0x70()
[ 175.213048] Modules linked in: svcrdma(OF) cfg80211 rfkill ib_iser libiscsi scsi_transport_iscsi ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm mlx5_ib ib_ucm ib_uverbs ib_umad mlx5_core kvm_amd amd64_edac_mod rdma_cm edac_core kvm sp5100_tco ipmi_devintf dcdbas bnx2 microcode serio_raw k10temp edac_mce_amd ib_cm i2c_piix4 shpchp iw_cm ib_addr ib_sa ipmi_si ipmi_msghandler ib_mad acpi_power_meter ib_core acpi_cpufreq nfsd(OF) auth_rpcgss(OF) nfs_acl lockd(OF) sunrpc(OF) mgag200 i2c_algo_bit drm_kms_helper ata_generic ttm pata_acpi drm i2c_core pata_atiixp
[ 175.264433] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GF O 3.12.7-300.fc20.x86_64 #1
[ 175.272817] Hardware name: Dell Inc. PowerEdge R415/0GXH08, BIOS 2.0.2 10/22/2012
[ 175.280410] 0000000000000009 ffff88081ee03ce0 ffffffff8166799b 0000000000000000
[ 175.287981] ffff88081ee03d18 ffffffff8106921d ffff8807f58a41a8 0000000000000000
[ 175.295550] 0000000000001000 0000000010a35000 ffff880813687800 ffff88081ee03d28
[ 175.303120] Call Trace:
[ 175.305606] <IRQ> [<ffffffff8166799b>] dump_stack+0x45/0x56
[ 175.311483] [<ffffffff8106921d>] warn_slowpath_common+0x7d/0xa0
[ 175.317582] [<ffffffff810692fa>] warn_slowpath_null+0x1a/0x20
[ 175.323505] [<ffffffff8153dcc2>] dma_ops_domain_unmap.part.11+0x62/0x70
[ 175.330307] [<ffffffff8153f1a3>] __unmap_single.isra.19+0xa3/0x110
[ 175.336670] [<ffffffff8153f931>] unmap_page+0x51/0x70
[ 175.341893] [<ffffffffa0437049>] svc_rdma_unmap_page+0x89/0xe0 [svcrdma]
[ 175.348784] [<ffffffffa043712c>] svc_rdma_unmap_dma+0x8c/0xb0 [svcrdma]
[ 175.355588] [<ffffffffa04382fe>] dto_tasklet_func+0xee/0x1c0 [svcrdma]
[ 175.362305] [<ffffffff81010000>] ? xen_smp_intr_free+0xf0/0x1d0
[ 175.368403] [<ffffffff8106e2ae>] tasklet_action+0x6e/0x110
[ 175.374059] [<ffffffff8106e787>] __do_softirq+0xf7/0x240
[ 175.379542] [<ffffffff8167821c>] call_softirq+0x1c/0x30
[ 175.384935] [<ffffffff810146a5>] do_softirq+0x55/0x90
[ 175.390153] [<ffffffff8106ea65>] irq_exit+0xb5/0xc0
[ 175.395194] [<ffffffff81678b16>] do_IRQ+0x56/0xc0
[ 175.400060] [<ffffffff8166e7ed>] common_interrupt+0x6d/0x6d
[ 175.405800] <EOI> [<ffffffff8104e486>] ? native_safe_halt+0x6/0x10
[ 175.412287] [<ffffffff8101b0bf>] default_idle+0x1f/0xc0
[ 175.421714] [<ffffffff8101b996>] arch_cpu_idle+0x26/0x30
[ 175.431200] [<ffffffff810b9905>] cpu_startup_entry+0xe5/0x280
[ 175.441111] [<ffffffff8165dcf7>] rest_init+0x77/0x80
[ 175.450215] [<ffffffff81d1def7>] start_kernel+0x40a/0x415
[ 175.459661] [<ffffffff81d1d8f6>] ? repair_env_string+0x5c/0x5c
[ 175.469468] [<ffffffff81d1d120>] ? early_idt_handlers+0x120/0x120
[ 175.479496] [<ffffffff81d1d5de>] x86_64_start_reservations+0x2a/0x2c
[ 175.489745] [<ffffffff81d1d6e8>] x86_64_start_kernel+0x108/0x117
[ 175.499582] ---[ end trace c61a3ae8cf3c7e75 ]---
...and then 1s later, an oops in svc_rdma_xdr_encode_error():
[ 175.508036] general protection fault: 0000 [#1] SMP
[ 175.865830] Modules linked in: svcrdma(OF) cfg80211 rfkill ib_iser libiscsi scsi_transport_iscsi ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm mlx5_ib ib_ucm ib_uverbs ib_umad mlx5_core kvm_amd amd64_edac_mod rdma_cm edac_core kvm sp5100_tco ipmi_devintf dcdbas bnx2 microcode serio_raw k10temp edac_mce_amd ib_cm i2c_piix4 shpchp iw_cm ib_addr ib_sa ipmi_si ipmi_msghandler ib_mad acpi_power_meter ib_core acpi_cpufreq nfsd(OF) auth_rpcgss(OF) nfs_acl lockd(OF) sunrpc(OF) mgag200 i2c_algo_bit drm_kms_helper ata_generic ttm pata_acpi drm i2c_core pata_atiixp
[ 176.496096] CPU: 2 PID: 1246 Comm: nfsd Tainted: GF W O 3.12.7-300.fc20.x86_64 #1
[ 176.630502] Hardware name: Dell Inc. PowerEdge R415/0GXH08, BIOS 2.0.2 10/22/2012
[ 176.762797] task: ffff8807f5cd0840 ti: ffff8807f5dee000 task.ti: ffff8807f5dee000
[ 176.893238] RIP: 0010:[<ffffffffa0438c75>] [<ffffffffa0438c75>] svc_rdma_xdr_encode_error+0x5/0x50 [svcrdma]
[ 177.026320] RSP: 0018:ffff8807f5defd80 EFLAGS: 00010286
[ 177.151415] RAX: ffff880000000000 RBX: dead000000200200 RCX: ffff8807f5cc9000
[ 177.276665] RDX: 0000000000000001 RSI: dead000000200200 RDI: ffff8807f809e400
[ 177.398001] RBP: ffff8807f5defe08 R08: 0000000000000001 R09: 0000000000000000
[ 177.517886] R10: 0000000000004afe R11: ffff8808134c2488 R12: ffff8807f809e400
[ 177.636805] R13: ffffea001fd73240 R14: 0000000000000001 R15: ffff88080f915800
[ 177.754753] FS: 00007fc84e776700(0000) GS:ffff88081ee40000(0000) knlGS:0000000000000000
[ 177.875485] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 177.993077] CR2: 00007fc840002e20 CR3: 00000007fda10000 CR4: 00000000000007e0
[ 178.112235] Stack:
[ 178.224480] ffffffffa0438715 005ff00000000000 ffffea001fd73240 ffff88081ee579a0
[ 178.345609] 0000000000000000 ffff8807f5defde0 0000000000000297 ffffea001fd73240
[ 178.468151] ffffea002046e7c0 ffff8807f66dc000 00000000000000e4 ffff88080f915800
[ 178.590292] Call Trace:
[ 178.705934] [<ffffffffa0438715>] ? svc_rdma_send_error+0x55/0x1f0 [svcrdma]
[ 178.829480] [<ffffffffa043b41a>] svc_rdma_recvfrom+0x4ca/0x10b0 [svcrdma]
[ 178.952854] [<ffffffffa00c60d0>] svc_recv+0x490/0x5d0 [sunrpc]
[ 179.075332] [<ffffffff81077fdb>] ? recalc_sigpending+0x1b/0x50
[ 179.197764] [<ffffffffa010570d>] nfsd+0xad/0x130 [nfsd]
[ 179.318980] [<ffffffffa0105660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[ 179.441501] [<ffffffff8108b110>] kthread+0xc0/0xd0
[ 179.562652] [<ffffffff8108b050>] ? insert_kthread_work+0x40/0x40
[ 179.686362] [<ffffffff816768bc>] ret_from_fork+0x7c/0xb0
[ 179.809329] [<ffffffff8108b050>] ? insert_kthread_work+0x40/0x40
[ 179.933761] Code: c3 48 8b 35 4e 5a 00 00 48 89 fa 31 c0 48 c7 c7 4b 8c 43 a0 e8 3d 47 ce e0 e9 52 ff ff ff 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <8b> 06 55 83 fa 01 48 89 e5 0f c8 89 01 8b 46 04 0f c8 89 41 04
[ 180.206129] RIP [<ffffffffa0438c75>] svc_rdma_xdr_encode_error+0x5/0x50 [svcrdma]
[ 180.338961] RSP <ffff8807f5defd80>
[ 180.466419] ---[ end trace c61a3ae8cf3c7e76 ]---
...looking at the original warning, we see this value on the stack:
0000000010a35000
...which is the address shown in the first line of the WARN messages. If I then go back through my debug output, I see this:
nfsd-1240 [001] .... 137.007907: svc_rdma_map_page: svc_rdma_map_page: dev=0xffff880808660000 addr=0x10a35000 page=0xffffea002046e7c0 offset=0 size=4096 dir=2
<idle>-0 [000] ..s. 175.144489: svc_rdma_unmap_page: svc_rdma_unmap_page: dev=0xffff880808660000 addr=0x10a35000 size=4096 dir=2
...so the mapping and unmapping seem to have a 1:1 correspondence there, so it doesn't seem to be that the RPC layer is unmapping more often than it should. Does FRMR do some sort of "automatic" unmapping of the pages?
I think I need to fix my debugging to prove this point one way or the other...
Created attachment 864576
warning messages from v3.13.3 kernel
Updated the box where I'm seeing the panic to 3.13.3-201.fc20.x86_64. I still see the WARN messages, but I didn't see things hang. Note that this output has some extra debug messages. I added some wrappers around ib_dma_map_page and ib_dma_unmap_page called svc_rdma_map_page and svc_rdma_unmap_page, and put some dprintks in there to show the arguments and results.
One thing looks strange, though again I don't really understand this code that well. Here is the mapping phase for some of the pages:

[ 149.426913] svc_rdma_map_page: dev=0xffff880811340000 addr=0x11b4b000 page=0xffffea0000dce800 offset=0 size=4096 dir=1
[ 149.426917] svc_rdma_map_page: dev=0xffff880811340000 addr=0x11b4c800 page=0xffffea0002e52a40 offset=2048 size=2048 dir=1
[ 149.426921] svc_rdma_map_page: dev=0xffff880811340000 addr=0x11b4d000 page=0xffffea001fcc6480 offset=0 size=512 dir=1
[ 149.426956] svc_rdma_map_page: dev=0xffff880811340000 addr=0x11b4f000 page=0xffffea0000dc8600 offset=0 size=64 dir=1

...and here is the unmapping:

[ 149.426971] svc_rdma_unmap_page: dev=0xffff880811340000 addr=0x11b4f000 size=64 dir=1
[ 149.426974] svc_rdma_unmap_page: dev=0xffff880811340000 addr=0x11b4b000 size=4096 dir=1
[ 149.426977] svc_rdma_unmap_page: dev=0xffff880811340000 addr=0x11b4c800 size=4096 dir=1
[ 149.426980] svc_rdma_unmap_page: dev=0xffff880811340000 addr=0x11b4d000 size=4096 dir=1

...the last one triggers the WARN. The thing I notice is that the size seems to be different in the unmapping and the mapping. This doesn't always seem to trigger the issue, but maybe the unmapping request is treading into an area that wasn't originally mapped?

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.

The comment above is incorrect. The correct version is below. I'm sorry for any inconvenience.

---------------------------------------------------------------

This request was NOT resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you need to escalate this bug.

Likely a DUP of Bug 1272171. May also be a DUP of Bug 1262728.