Bug 1428973
Summary: | PANIC: "kernel BUG at fs/ceph/addr.c:91!" | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Vikhyat Umrao <vumrao> |
Component: | kernel | Assignee: | Jeff Layton <jlayton> |
kernel sub component: | CephFS | QA Contact: | Vasu Kulkarni <vakulkar> |
Status: | CLOSED ERRATA | Docs Contact: | John Wilkins <jowilkin> |
Severity: | high | ||
Priority: | high | CC: | cww, idryomov, jlayton, kdreyer, tmuthami, vumrao, zlang, zyan |
Version: | 7.3 | ||
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-3.10.0-647.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-08-02 05:47:36 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1298243 | ||
Attachments: |
Comment 4
Jeff Layton
2017-03-03 18:17:36 UTC
i_wr_ref is a counter that just shows how many CEPH_CAP_FILE_WR references are currently outstanding. Here, we're doing a splice to a file opened for buffered writes, but that's more or less the same thing as write_begin/write_end. It also isn't taken in write_end at all, AFAICT, but that's where the page gets dirtied. What I'm not sure of is why we aren't crashing all over the place.

Zheng, I think you added this BUG_ON a couple of years ago in 5dda377cf0a6b. Should we be taking cap refs in write_end? Or does write_end->set_page_dirty->ceph_set_page_dirty depend on us holding caps elsewhere, and we somehow lost them here? It sort of seems like the right fix would be to take Fw cap refs in write_begin, and release them in write_end. Thoughts?

Description of problem:
PANIC: "kernel BUG at fs/ceph/addr.c:91!"

Version-Release number of selected component (if applicable):

```
crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-514.6.1.el7.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/227843688/crash/vmcore  [PARTIAL DUMP]
        CPUS: 96
        DATE: Wed Mar  1 05:16:22 2017
      UPTIME: 00:30:58
LOAD AVERAGE: 0.08, 0.26, 0.24
       TASKS: 960
    NODENAME: cephfsnode
     RELEASE: 3.10.0-514.6.1.el7.x86_64
     VERSION: #1 SMP Sat Dec 10 11:15:38 EST 2016
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 512 GB
       PANIC: "kernel BUG at fs/ceph/addr.c:91!"
```
```
crash> bt
PID: 7872  TASK: ffff881fbe3c5e20  CPU: 9  COMMAND: "pool"
 #0 [ffff881f7f23f8d0] machine_kexec at ffffffff81059c7b
 #1 [ffff881f7f23f930] __crash_kexec at ffffffff811052d2
 #2 [ffff881f7f23fa00] crash_kexec at ffffffff811053c0
 #3 [ffff881f7f23fa18] oops_end at ffffffff8168f088
 #4 [ffff881f7f23fa40] die at ffffffff8102e93b
 #5 [ffff881f7f23fa70] do_trap at ffffffff8168e740
 #6 [ffff881f7f23fac0] do_invalid_op at ffffffff8102b144
 #7 [ffff881f7f23fb70] invalid_op at ffffffff8169805e
    [exception RIP: ceph_set_page_dirty+448]
    RIP: ffffffffa1391b80  RSP: ffff881f7f23fc20  RFLAGS: 00010246
    RAX: 0000000000140014  RBX: ffffea00f86a76c0  RCX: 0000000000000011
    RDX: 0000000000000000  RSI: 0000000000000011  RDI: ffff881f94f7ef50
    RBP: ffff881f7f23fc80   R8: 0000000000000011   R9: ffffea00f86a76c0
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff881f94f7f280
    R13: ffff881f94f7f3d0  R14: ffff881f94f7ef50  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff881f7f23fc48] ceph_update_writeable_page at ffffffffa139362e [ceph]
 #9 [ffff881f7f23fce8] pagecache_write_end at ffffffff81180386
#10 [ffff881f7f23fd38] pipe_to_file at ffffffff8122d4d6
#11 [ffff881f7f23fdb0] splice_from_pipe_feed at ffffffff8122cd1e
#12 [ffff881f7f23fde8] generic_file_splice_write_actor at ffffffff8122d70f
#13 [ffff881f7f23fe20] splice_write_to_file at ffffffff8122d589
#14 [ffff881f7f23fea8] generic_file_splice_write at ffffffff8122ee14
#15 [ffff881f7f23fec0] do_splice_from at ffffffff8122dafd
#16 [ffff881f7f23ff00] sys_splice at ffffffff8122f3da
#17 [ffff881f7f23ff80] system_call_fastpath at ffffffff816967c9
    RIP: 00007fe5e01ec043  RSP: 00007fe5d11a07e0  RFLAGS: 00010202
    RAX: 0000000000000113  RBX: ffffffff816967c9  RCX: 000000000248cc08
    RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000011
    RBP: 00007fe5d11a0898   R8: 0000000000000011   R9: 0000000000000004
    R10: 00007fe5d11a0898  R11: 0000000000000293  R12: 00007fe5d11a08a8
    R13: 0000000000000011  R14: 0000000000000000  R15: 0000000000000010
    ORIG_RAX: 0000000000000113  CS: 0033  SS: 002b
```

```
crash> dis -lr ffffffffa1391b80 | tail
0xffffffffa1391b59 <ceph_set_page_dirty+409>: mov    0x10(%rdi),%r8
0xffffffffa1391b5d <ceph_set_page_dirty+413>: mov    0x0(%r13),%rdx
0xffffffffa1391b61 <ceph_set_page_dirty+417>: mov    %rdi,%rcx
0xffffffffa1391b64 <ceph_set_page_dirty+420>: mov    $0xffffffffa13b4308,%rsi
0xffffffffa1391b6b <ceph_set_page_dirty+427>: mov    $0xffffffffa13c2df0,%rdi
0xffffffffa1391b72 <ceph_set_page_dirty+434>: xor    %eax,%eax
0xffffffffa1391b74 <ceph_set_page_dirty+436>: callq  0xffffffff8133c1b0 <__dynamic_pr_debug>
0xffffffffa1391b79 <ceph_set_page_dirty+441>: jmpq   0xffffffffa13919fa
0xffffffffa1391b7e <ceph_set_page_dirty+446>: ud2
0xffffffffa1391b80 <ceph_set_page_dirty+448>: ud2
```

Information about the ceph module:

```
crash> mod | grep -e NAME -e ceph
     MODULE       NAME      SIZE  OBJECT FILE
ffffffffa12adc40  libceph  287066  (not loaded)  [CONFIG_KALLSYMS]
ffffffffa13c6320  ceph     327687  (not loaded)  [CONFIG_KALLSYMS]
crash> px ((struct module *)0xffffffffa13c6320)->name
$1 = "ceph\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> px ((struct module *)0xffffffffa13c6320)->version
$2 = 0x0
crash> px ((struct module *)0xffffffffa13c6320)->srcversion
$3 = 0xffff885fbdab39c0 "35DD4ACCD12B2332E7CF233"
```

We can't guarantee the atomicity of the write syscall if we take Fw caps in write_begin. I think the simplest fix is to use default_file_splice_{read,write} instead of generic_file_splice_{read,write}.

Created attachment 1260772 [details]
ceph: take cap references in write_begin, release them in write_end
This is the patch that I was considering. It's not well tested yet, so not quite ready to send.

Zheng, I don't quite understand how this would break write atomicity. Either we do the whole write or none at all here. Can you explain?
For example: we write 8k of data (two pages). The first write_begin() succeeds in getting Fw caps; the second write_begin() waits (the MDS revokes Fw caps). If another client reads the data at this time, it sees partially updated data.

Ok, I think I get it now. What we really need to do here is take the caps in a new ceph_splice_write operation, which can just be a wrapper around generic_file_splice_write. I suspect that this is not a problem in mainline kernels, since we switched to write_iter there and that unified the two codepaths.

Yes. Implement our ceph_splice_write(), which tries getting Fwb caps. If it gets Fwb caps, call generic_file_splice_write(). If it only gets Fw caps, call default_file_splice_write(). Besides, we also need to implement our own splice_read callback.

Created attachment 1261341 [details]
patch: get cap references before splicing
Ok, this patch should do the right thing. It builds but I don't have a working test rig at the moment, so testing will have to wait a bit. If anyone else can test it sooner, then that would be great too.
Note that this is a RHEL7 one-off patch, as I don't think this is necessary with the current upstream code.
Comment on attachment 1261341 [details]
patch: get cap references before splicing
If ceph_splice_write() fails to get CEPH_CAP_FILE_BUFFER and CEPH_CAP_FILE_LAZYIO, it should release the CEPH_CAP_FILE_WR reference and call default_file_splice_write(). ceph_splice_read() requires a similar change too.
Why? We're splicing from a pipe buffer into a ceph file here. BUFFER would be nice to have, but I don't see why it's required. The file we're splicing into could even be open with O_DIRECT, in which case we have no real need for BUFFER anyway... Maybe I just don't understand quite why we need special handling for the LAZYIO case. Zheng, can you explain?

If a program opens a file with LAZYIO, it tells us that it does not need cache coherence. But that code is not properly tested and may not work as expected.

It turns out that I can reproduce this at will with xfstest generic/095. I have a revised patch that seems to fix that bug, but now I'm seeing the WARN_ONs in ceph_releasepage pop. I'll have a look at those soon.

Created attachment 1262656 [details]
Updated patch
Revised patch, closer to what Zheng was suggesting. This seems to fix the BUG_ON for me, and the testcase completes, but a bunch of WARN_ONs pop:
```
[  263.971683] WARNING: at fs/ceph/addr.c:175 ceph_releasepage+0x60/0x80 [ceph]()
[  263.972819] Modules linked in: ceph(OE) libceph dns_resolver ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev sg parport_pc i2c_piix4 pcspkr virtio_balloon parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_console virtio_net virtio_blk crct10dif_pclmul crct10dif_common ata_piix qxl crc32c_intel serio_raw drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm floppy drm libata virtio_pci virtio_ring virtio i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  263.987974] CPU: 1 PID: 4473 Comm: fio Tainted: G        W  OE  ------------   3.10.0-514.10.2.el7.x86_64 #1
[  263.989345] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
[  263.990592]  0000000000000000 00000000d1c64ac1 ffff880050e13990 ffffffff816864ef
[  263.991773]  ffff880050e139c8 ffffffff81085940 ffffea0001424840 ffff880068e09ad0
[  263.992946]  0000000000000074 ffffea0001424840 0000000000000000 ffff880050e139d8
[  263.994122] Call Trace:
[  263.994649]  [<ffffffff816864ef>] dump_stack+0x19/0x1b
[  263.995504]  [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
[  263.996422]  [<ffffffff81085a8a>] warn_slowpath_null+0x1a/0x20
[  263.997320]  [<ffffffffa058cf40>] ceph_releasepage+0x60/0x80 [ceph]
[  263.998302]  [<ffffffff81180662>] try_to_release_page+0x32/0x50
[  263.999204]  [<ffffffff81190769>] invalidate_inode_pages2_range+0x2f9/0x430
[  264.000196]  [<ffffffff8117fed9>] ? filemap_fdatawait_range+0x159/0x1b0
[  264.001154]  [<ffffffff8118d44e>] ? do_writepages+0x1e/0x40
[  264.001999]  [<ffffffff81182035>] ? __filemap_fdatawrite_range+0x65/0x80
[  264.002964]  [<ffffffffa05899fd>] ceph_direct_read_write+0xa2d/0xbc0 [ceph]
[  264.003963]  [<ffffffffa05946db>] ? try_get_cap_refs+0x17b/0x5a0 [ceph]
[  264.004917]  [<ffffffffa0590b36>] ? ceph_pool_perm_check+0x76/0x6b0 [ceph]
[  264.005904]  [<ffffffffa0598697>] ? ceph_get_caps+0x107/0x320 [ceph]
[  264.006832]  [<ffffffff8108d912>] ? current_fs_time+0x12/0x60
[  264.007681]  [<ffffffff81219d8f>] ? file_update_time+0x5f/0xf0
[  264.008551]  [<ffffffffa0589dcc>] ceph_aio_write+0x23c/0xc30 [ceph]
[  264.011785]  [<ffffffff810cf178>] ? check_preempt_wakeup+0x148/0x250
[  264.012836]  [<ffffffff810c1b59>] ? ttwu_do_wakeup+0x19/0xd0
[  264.013712]  [<ffffffff811fdabd>] do_sync_write+0x8d/0xd0
[  264.014525]  [<ffffffff811fe32d>] vfs_write+0xbd/0x1e0
[  264.015309]  [<ffffffff811ff002>] SyS_pwrite64+0x92/0xc0
[  264.016360]  [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
[  264.020142] ---[ end trace b3ea1ef63fa583af ]---
```
I think we'll want to fix that as well. It may be an unrelated bug but I'll need to look at it more closely to know for sure.
They're welcome to test the patch as well, but it may not be in its final version yet.
Created attachment 1262881 [details]
Follow-on patch to fix up releasepage warnings
I think that we'll also want something like this patch, which fixes the warnings that pop up in releasepage. This one is also a rhel7 custom patch. We could also fix this by backporting the whole series of fscache integration patches (and their related fixes), but that's a much bigger job and I'm not sure we want to go there in rhel7.
Zheng, do these patches look reasonable? Do you know of any plans to backport the fscache integration patches to rhel7?

What I don't quite get is why LAZYIO requires a cap. It's a flag set on the client's inode that allows the client to change its caching rules. If this is true:

    fi->fmode & CEPH_FILE_MODE_LAZY

...it should just mean that we can have "lazy" cache coherency -- we can basically have Fbc behavior even when we can't get those caps. So, I think the check in my patch may be wrong. If LAZY is set on the file, then we should be able to use the pagecache if either BUFFER/CACHE or LAZYIO is set. In practical terms, that means that I think we want this in those patches:

    if (!(got & want))
            /* put caps and call default_file_splice_write */

Right?

Created attachment 1262944 [details]
patch: get caps when handling splice
This is what I think we probably need. Zheng, let me know what you think.
Comment on attachment 1262944 [details]
patch: get caps when handling splice
LGTM
Thanks Zheng! I'm doing an xfstests run with these two patches on a rhel7 kernel now. So far it looks good, but we won't know for several hours.

Vikhyat, if the customer here has a good way to reproduce the panic, then testing a kernel with these two patches would probably be a good start. It should fix the problem for them.

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing.

Patch(es) available on kernel-3.10.0-647.el7.

Please verify this bug ASAP; it blocks the 7.4 kernel ERRATA. Thanks, Zorro

I think we can close the bug.

(In reply to Yan, Zheng from comment #57)
> I think we can close the bug

Sorry, I don't know much about this bug. But anyway, ON_QA is not a proper status for all 7.4 kernel bugs now, so would you/someone else please deal with this bug properly? Other ON_QA ceph bugs are Bug 1378186, Bug 1407016, Bug 1408170, Bug 1427556 and Bug 1429008. If you can help to deal with these bugs, that would be really helpful! They all block the 7.4 kernel ERRATA, and 7.4 is going to GA now.

I'm a developer, not QA. I'm not sure what I am supposed to do with this bug.

(In reply to Yan, Zheng from comment #59)
> I'm a developer, not QA. I'm not sure what I am supposed to do with this bug.

Oh, sorry, I didn't notice that :) OK, I have to keep waiting for ceph-qe's response. Or I have to set SanityOnly for all these bugs by myself.

I picked up the new 7.4 RC 1.2 candidate for verification. I don't see any issues in the sanity run, so closing this as SanityOnly.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842