Bug 1428973

Summary: PANIC: "kernel BUG at fs/ceph/addr.c:91!"
Product: Red Hat Enterprise Linux 7
Reporter: Vikhyat Umrao <vumrao>
Component: kernel
Assignee: Jeff Layton <jlayton>
Kernel sub component: CephFS
QA Contact: Vasu Kulkarni <vakulkar>
Status: CLOSED ERRATA
Docs Contact: John Wilkins <jowilkin>
Severity: high
Priority: high
CC: cww, idryomov, jlayton, kdreyer, tmuthami, vumrao, zlang, zyan
Version: 7.3
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-3.10.0-647.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-02 05:47:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1298243
Attachments (all flags: none):
  ceph: take cap references in write_begin, release them in write_end
  patch: get cap references before splicing
  Updated patch
  Follow-on patch to fix up releasepage warnings
  patch: get caps when handling splice

Comment 4 Jeff Layton 2017-03-03 18:17:36 UTC
Still looking over this code. We obviously crashed because i_wr_ref is zero here. I don't see where cap refs get acquired in this call chain, however. I'll need to go over the splice code in RHEL7 to see if I can discern where those references should be taken in this path.

Comment 6 Jeff Layton 2017-03-06 13:05:06 UTC
i_wr_ref is a counter that just shows how many CEPH_CAP_FILE_WR references are currently outstanding. Here, we're doing a splice to a file opened for buffered writes, but that's more or less the same thing as write_begin/write_end.

It also isn't taken in write_end at all, AFAICT, but that's where the page gets dirtied. What I'm not sure of is why we aren't crashing all over the place.

Zheng, I think you added this BUG_ON a couple of years ago in 5dda377cf0a6b. Should we be taking cap refs in write_end? Or, does write_end->set_page_dirty->ceph_set_page_dirty depend on us holding caps elsewhere, and we somehow lost them here?

It sort of seems like the right fix would be to take Fw cap refs in write_begin, and release them in write_end.

Thoughts?

Comment 10 Vikhyat Umrao 2017-03-06 20:58:15 UTC
Description of problem:
PANIC: "kernel BUG at fs/ceph/addr.c:91!"

Version-Release number of selected component (if applicable):

crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-514.6.1.el7.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/227843688/crash/vmcore  [PARTIAL DUMP]
        CPUS: 96
        DATE: Wed Mar  1 05:16:22 2017
      UPTIME: 00:30:58
LOAD AVERAGE: 0.08, 0.26, 0.24
       TASKS: 960
    NODENAME: cephfsnode
     RELEASE: 3.10.0-514.6.1.el7.x86_64
     VERSION: #1 SMP Sat Dec 10 11:15:38 EST 2016
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 512 GB
       PANIC: "kernel BUG at fs/ceph/addr.c:91!"

crash> bt
PID: 7872   TASK: ffff881fbe3c5e20  CPU: 9   COMMAND: "pool"
 #0 [ffff881f7f23f8d0] machine_kexec at ffffffff81059c7b
 #1 [ffff881f7f23f930] __crash_kexec at ffffffff811052d2
 #2 [ffff881f7f23fa00] crash_kexec at ffffffff811053c0
 #3 [ffff881f7f23fa18] oops_end at ffffffff8168f088
 #4 [ffff881f7f23fa40] die at ffffffff8102e93b
 #5 [ffff881f7f23fa70] do_trap at ffffffff8168e740
 #6 [ffff881f7f23fac0] do_invalid_op at ffffffff8102b144
 #7 [ffff881f7f23fb70] invalid_op at ffffffff8169805e
    [exception RIP: ceph_set_page_dirty+448]
    RIP: ffffffffa1391b80  RSP: ffff881f7f23fc20  RFLAGS: 00010246
    RAX: 0000000000140014  RBX: ffffea00f86a76c0  RCX: 0000000000000011
    RDX: 0000000000000000  RSI: 0000000000000011  RDI: ffff881f94f7ef50
    RBP: ffff881f7f23fc80   R8: 0000000000000011   R9: ffffea00f86a76c0
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff881f94f7f280
    R13: ffff881f94f7f3d0  R14: ffff881f94f7ef50  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff881f7f23fc48] ceph_update_writeable_page at ffffffffa139362e [ceph]
 #9 [ffff881f7f23fce8] pagecache_write_end at ffffffff81180386
#10 [ffff881f7f23fd38] pipe_to_file at ffffffff8122d4d6
#11 [ffff881f7f23fdb0] splice_from_pipe_feed at ffffffff8122cd1e
#12 [ffff881f7f23fde8] generic_file_splice_write_actor at ffffffff8122d70f
#13 [ffff881f7f23fe20] splice_write_to_file at ffffffff8122d589
#14 [ffff881f7f23fea8] generic_file_splice_write at ffffffff8122ee14
#15 [ffff881f7f23fec0] do_splice_from at ffffffff8122dafd
#16 [ffff881f7f23ff00] sys_splice at ffffffff8122f3da
#17 [ffff881f7f23ff80] system_call_fastpath at ffffffff816967c9
    RIP: 00007fe5e01ec043  RSP: 00007fe5d11a07e0  RFLAGS: 00010202
    RAX: 0000000000000113  RBX: ffffffff816967c9  RCX: 000000000248cc08
    RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000011
    RBP: 00007fe5d11a0898   R8: 0000000000000011   R9: 0000000000000004
    R10: 00007fe5d11a0898  R11: 0000000000000293  R12: 00007fe5d11a08a8
    R13: 0000000000000011  R14: 0000000000000000  R15: 0000000000000010
    ORIG_RAX: 0000000000000113  CS: 0033  SS: 002b


crash> dis -lr ffffffffa1391b80 | tail
0xffffffffa1391b59 <ceph_set_page_dirty+409>:	mov    0x10(%rdi),%r8
0xffffffffa1391b5d <ceph_set_page_dirty+413>:	mov    0x0(%r13),%rdx
0xffffffffa1391b61 <ceph_set_page_dirty+417>:	mov    %rdi,%rcx
0xffffffffa1391b64 <ceph_set_page_dirty+420>:	mov    $0xffffffffa13b4308,%rsi
0xffffffffa1391b6b <ceph_set_page_dirty+427>:	mov    $0xffffffffa13c2df0,%rdi
0xffffffffa1391b72 <ceph_set_page_dirty+434>:	xor    %eax,%eax
0xffffffffa1391b74 <ceph_set_page_dirty+436>:	callq  0xffffffff8133c1b0 <__dynamic_pr_debug>
0xffffffffa1391b79 <ceph_set_page_dirty+441>:	jmpq   0xffffffffa13919fa
0xffffffffa1391b7e <ceph_set_page_dirty+446>:	ud2    
0xffffffffa1391b80 <ceph_set_page_dirty+448>:	ud2 

Information about ceph module:

crash> mod | grep -e NAME -e ceph
     MODULE       NAME                     SIZE  OBJECT FILE
ffffffffa12adc40  libceph                287066  (not loaded)  [CONFIG_KALLSYMS]
ffffffffa13c6320  ceph                   327687  (not loaded)  [CONFIG_KALLSYMS]

crash> px ((struct module *)0xffffffffa13c6320)->name
$1 = "ceph\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"

crash> px ((struct module *)0xffffffffa13c6320)->version
$2 = 0x0

crash> px ((struct module *)0xffffffffa13c6320)->srcversion
$3 = 0xffff885fbdab39c0 "35DD4ACCD12B2332E7CF233"

Comment 11 Yan, Zheng 2017-03-07 01:43:34 UTC
We can't guarantee the atomicity of the write syscall if we take Fw caps in write_begin. I think the simplest fix is to use default_file_splice_{read,write} instead of generic_file_splice_{read,write}.

Comment 12 Jeff Layton 2017-03-07 12:12:20 UTC
Created attachment 1260772 [details]
ceph: take cap references in write_begin, release them in write_end

This is the patch that I was considering. It's not well tested yet, so not quite ready to send. Zheng, I don't quite understand how this would break write atomicity. Either we do the whole write or none at all here. Can you explain?

Comment 13 Yan, Zheng 2017-03-07 12:45:59 UTC
For example: we write 8k of data (two pages). The first write_begin() succeeds in getting Fw caps, then the second write_begin() waits (the MDS revokes Fw caps). If another client reads the data at this time, it sees partially updated data.

Comment 14 Jeff Layton 2017-03-07 14:22:50 UTC
Ok, I think I get it now.

What we really need to do here is take the caps in a new ceph_splice_write operation that can just be a wrapper around generic_file_splice_write. I suspect that this is not a problem in mainline kernels, since we switched to write_iter there and that unified the two codepaths.

Comment 15 Yan, Zheng 2017-03-07 14:42:59 UTC
Yes. Implement our own ceph_splice_write(), which tries to get Fwb caps. If it gets Fwb caps, call generic_file_splice_write(); if it only gets Fw caps, call default_file_splice_write(). Besides, we also need to implement our own splice_read callback.

Comment 16 Jeff Layton 2017-03-08 16:05:23 UTC
Created attachment 1261341 [details]
patch: get cap references before splicing

Ok, this patch should do the right thing. It builds but I don't have a working test rig at the moment, so testing will have to wait a bit. If anyone else can test it sooner, then that would be great too.

Note that this is a RHEL7 one-off patch, as I don't think this is necessary with the current upstream code.

Comment 17 Yan, Zheng 2017-03-09 02:33:30 UTC
Comment on attachment 1261341 [details]
patch: get cap references before splicing

If ceph_splice_write() fails to get CEPH_CAP_FILE_BUFFER and CEPH_CAP_FILE_LAZYIO, it should release the CEPH_CAP_FILE_WR reference and call default_file_splice_write(). ceph_splice_read() needs a similar change too.

Comment 18 Jeff Layton 2017-03-09 11:40:26 UTC
Why? We're splicing from a pipe buffer into a ceph file here. BUFFER would be nice to have but I don't see why it's required. The file we're splicing into could even be open with O_DIRECT in which case we have no real need for BUFFER anyway...

Comment 20 Jeff Layton 2017-03-10 18:26:15 UTC
Maybe I just don't understand quite why we need special handling for the LAZYIO case. Zheng, can you explain?

Comment 21 Yan, Zheng 2017-03-13 02:04:35 UTC
If a program opens a file with LAZYIO, it tells us it does not need cache coherency. But that code is not well tested and may not work as expected.

Comment 25 Jeff Layton 2017-03-13 21:43:45 UTC
It turns out that I can reproduce this at will with xfstest generic/095. I have a revised patch that seems to fix that bug, but now I'm seeing the WARN_ONs in ceph_releasepage pop. I'll have a look at those soon.

Comment 27 Jeff Layton 2017-03-13 21:54:15 UTC
Created attachment 1262656 [details]
Updated patch

Revised patch, closer to what Zheng was suggesting. This seems to fix the BUG_ON for me, and the testcase completes, but a bunch of WARN_ONs pop:

[  263.971683] WARNING: at fs/ceph/addr.c:175 ceph_releasepage+0x60/0x80 [ceph]()
[  263.972819] Modules linked in: ceph(OE) libceph dns_resolver ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev sg parport_pc i2c_piix4 pcspkr virtio_balloon parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_console virtio_net virtio_blk crct10dif_pclmul crct10dif_common ata_piix qxl crc32c_intel serio_raw drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm floppy drm libata virtio_pci virtio_ring virtio i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  263.987974] CPU: 1 PID: 4473 Comm: fio Tainted: G        W  OE  ------------   3.10.0-514.10.2.el7.x86_64 #1
[  263.989345] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
[  263.990592]  0000000000000000 00000000d1c64ac1 ffff880050e13990 ffffffff816864ef
[  263.991773]  ffff880050e139c8 ffffffff81085940 ffffea0001424840 ffff880068e09ad0
[  263.992946]  0000000000000074 ffffea0001424840 0000000000000000 ffff880050e139d8
[  263.994122] Call Trace:
[  263.994649]  [<ffffffff816864ef>] dump_stack+0x19/0x1b
[  263.995504]  [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
[  263.996422]  [<ffffffff81085a8a>] warn_slowpath_null+0x1a/0x20
[  263.997320]  [<ffffffffa058cf40>] ceph_releasepage+0x60/0x80 [ceph]
[  263.998302]  [<ffffffff81180662>] try_to_release_page+0x32/0x50
[  263.999204]  [<ffffffff81190769>] invalidate_inode_pages2_range+0x2f9/0x430
[  264.000196]  [<ffffffff8117fed9>] ? filemap_fdatawait_range+0x159/0x1b0
[  264.001154]  [<ffffffff8118d44e>] ? do_writepages+0x1e/0x40
[  264.001999]  [<ffffffff81182035>] ? __filemap_fdatawrite_range+0x65/0x80
[  264.002964]  [<ffffffffa05899fd>] ceph_direct_read_write+0xa2d/0xbc0 [ceph]
[  264.003963]  [<ffffffffa05946db>] ? try_get_cap_refs+0x17b/0x5a0 [ceph]
[  264.004917]  [<ffffffffa0590b36>] ? ceph_pool_perm_check+0x76/0x6b0 [ceph]
[  264.005904]  [<ffffffffa0598697>] ? ceph_get_caps+0x107/0x320 [ceph]
[  264.006832]  [<ffffffff8108d912>] ? current_fs_time+0x12/0x60
[  264.007681]  [<ffffffff81219d8f>] ? file_update_time+0x5f/0xf0
[  264.008551]  [<ffffffffa0589dcc>] ceph_aio_write+0x23c/0xc30 [ceph]
[  264.011785]  [<ffffffff810cf178>] ? check_preempt_wakeup+0x148/0x250
[  264.012836]  [<ffffffff810c1b59>] ? ttwu_do_wakeup+0x19/0xd0
[  264.013712]  [<ffffffff811fdabd>] do_sync_write+0x8d/0xd0
[  264.014525]  [<ffffffff811fe32d>] vfs_write+0xbd/0x1e0
[  264.015309]  [<ffffffff811ff002>] SyS_pwrite64+0x92/0xc0
[  264.016360]  [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
[  264.020142] ---[ end trace b3ea1ef63fa583af ]---

I think we'll want to fix that as well. It may be an unrelated bug but I'll need to look at it more closely to know for sure.

The customer is welcome to test the patch as well, but it may not be in its final version yet.

Comment 28 Jeff Layton 2017-03-14 11:54:22 UTC
Created attachment 1262881 [details]
Follow-on patch to fix up releasepage warnings

I think that we'll also want something like this patch, which fixes the warnings that pop up in releasepage. This one is also a rhel7 custom patch. We could also fix this by backporting the whole series of fscache integration patches (and their related fixes), but that's a much bigger job and I'm not sure we want to go there in rhel7.

Comment 29 Jeff Layton 2017-03-14 11:59:46 UTC
Zheng, do these patches look reasonable? Do you know of any plans to backport the fscache integration patches to rhel7?

Comment 30 Jeff Layton 2017-03-14 12:16:05 UTC
What I don't quite get is why LAZYIO requires a cap.

It's a flag set on the client's inode that allows the client to change its caching rules. If this is true:

     fi->fmode & CEPH_FILE_MODE_LAZY

It should just mean that we can have "lazy" cache coherency -- we can basically have Fbc behavior even when we can't get those caps.

So, I think the check in my patch may be wrong. If LAZY is set on the file, then we should be able to use the pagecache if either BUFFER/CACHE or LAZYIO is set. In practical terms, that means that I think we want this in those patches:

    if (!(got & want))
            /* put caps and call default_file_splice_write */

Right?

Comment 31 Jeff Layton 2017-03-14 13:21:14 UTC
Created attachment 1262944 [details]
patch: get caps when handling splice

This is what I think we probably need. Zheng, let me know what you think.

Comment 32 Yan, Zheng 2017-03-14 13:45:16 UTC
Comment on attachment 1262944 [details]
patch: get caps when handling splice

LGTM

Comment 33 Jeff Layton 2017-03-14 15:05:40 UTC
Thanks Zheng!

I'm doing an xfstests run with these two patches on a rhel7 kernel now. So far it looks good, but we won't know for several hours.

Vikhyat, if the customer here has a good way to reproduce the panic, then testing a kernel with these two patches would probably be a good start. It should fix the problem for them.

Comment 52 Rafael Aquini 2017-04-10 16:51:04 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 54 Rafael Aquini 2017-04-11 15:08:04 UTC
Patch(es) available on kernel-3.10.0-647.el7

Comment 56 Zorro Lang 2017-07-15 12:18:43 UTC
Please verify this bug ASAP; it blocks the 7.4 kernel ERRATA.

Thanks,
Zorro

Comment 57 Yan, Zheng 2017-07-15 13:13:26 UTC
I think we can close the bug

Comment 58 Zorro Lang 2017-07-15 13:29:42 UTC
(In reply to Yan, Zheng from comment #57)
> I think we can close the bug

Sorry, I don't know much about this bug. But anyway, ON_QA is not a proper status for any 7.4 kernel bug now, so would you or someone else please deal with this bug properly?

Other ON_QA ceph bugs are Bug 1378186, Bug 1407016, Bug 1408170, Bug 1427556 and Bug 1429008. If you can help deal with these bugs, that would be really helpful! They all block the 7.4 kernel ERRATA, and 7.4 is going to GA now.

Comment 59 Yan, Zheng 2017-07-15 14:48:12 UTC
I'm a developer, not QA. I'm not sure what I'm supposed to do with this bug.

Comment 60 Zorro Lang 2017-07-15 15:39:36 UTC
(In reply to Yan, Zheng from comment #59)
> I'm developer, not QA. I'm not sure what am I supposed to do with this bug.

Oh, sorry, I didn't notice that :) OK, I'll have to keep waiting for ceph-qe's response, or set SanityOnly for all these bugs myself.

Comment 61 Vasu Kulkarni 2017-07-18 20:58:11 UTC
I picked up the new 7.4 RC 1.2 candidate for verification. I don't see any issues in the sanity run; closing this as SanityOnly.

Comment 63 errata-xmlrpc 2017-08-02 05:47:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842