Bug 1254310
Summary: SLUB allocator breaks fuse_direct_IO (SLAB works)

| Field | Value | Field | Value |
|---|---|---|---|
| Product | [Fedora] Fedora | Reporter | Jillian Morgan <penguin.wrangler> |
| Component | kernel | Assignee | Miklos Szeredi <mszeredi> |
| Status | CLOSED UPSTREAM | QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| Severity | urgent | Priority | unspecified |
| Version | 22 | CC | gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, maik, mchehab, mszeredi |
| Hardware | x86_64 | OS | Linux |
| Whiteboard | fuse gluster ovirt | Doc Type | Bug Fix |
| Last Closed | 2016-05-08 01:55:06 UTC | Type | Bug |
Description
Jillian Morgan
2015-08-17 16:41:10 UTC
This message is a reminder that Fedora 21 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 21. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '21'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 21 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Confirming that this still occurs on a fresh Fedora 22 install with:
kernel-4.2.5-201.fc22.x86_64
glusterfs-fuse-3.7.5-1.fc22.x86_64
fuse-2.9.4-3.fc22.x86_64
fuse-libs-2.9.4-3.fc22.x86_64

Found a workaround (not a fix): I recompiled kernel-4.2.5-201.fc22.x86_64 to use the older SLAB allocator instead of the default SLUB allocator. Problem avoided: no more crashes when using glusterfs (fuse).

Now... what the -bleep- is wrong with SLUB?

While using SLAB is a workaround (at least it seems to be working so far; knock on wood), I am uncertain what performance impact it will have on my virtualization cluster. :-( And since there is no run-time or boot-time way to switch between allocators, I will have to compile my own customized kernel from here on out. Not a big deal, but a nuisance, and I have to take extra care never to boot into a distro-built kernel by mistake and have everything come crashing down.

I am trying to get some traction on this bug, which has been open for 6 months with no responses. I have tried to remove some variables from the equation to see which factors might be contributing to this kernel BUG.

First test: I have replicated the issue on a host that does NOT run glusterfsd, and thus only consumes a VM image from a separate server. This eliminates any potential conflict from having both the glusterfs server and client on the same node. Also, the original hosts used when this bug was first reported were Supermicro Avoton Atom C2750/58. This new replication of the fault is on an older Dell PE2950 (Xeon E54xx), so the specific hardware does not seem to be a factor in the bug.

Reproduction steps:
- Fresh install of Fedora Server 22, minimal package set, with online updates.
- rpm -Uvh http://resources.ovirt.org/pub/yum-repo/ovirt-release36.rpm
- Add this node as a new host via oVirt WebAdmin.
- Start a VM on this new node, using a disk image that resides on a glusterfs storage domain.
- Boom!

kernel-4.3.4-200.fc22.x86_64
glusterfs-fuse-3.7.6-1.fc22.x86_64
fuse-2.9.4-3.fc22.x86_64
fuse-libs-2.9.4-3.fc22.x86_64

[ 316.458148] ------------[ cut here ]------------
[ 316.459052] kernel BUG at mm/slub.c:3517!
[ 316.459052] invalid opcode: 0000 [#1] SMP
[ 316.459052] Modules linked in: vhost_net vhost macvtap macvlan ebt_arp ebtable_nat tun nfsv3 nfs fscache fuse ebtable_filter ebtables ip6table_filter ip6_tables scsi_transport_iscsi xt_physdev br_netfilter nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack dm_service_time nf_conntrack coretemp kvm_intel iTCO_wdt ipmi_ssif iTCO_vendor_support gpio_ich kvm ipmi_devintf dcdbas bnx2 lpc_ich ipmi_si i5000_edac edac_core ipmi_msghandler i5k_amb shpchp fjes acpi_cpufreq tpm_tis tpm nfsd 8021q auth_rpcgss garp mrp bridge nfs_acl lockd stp grace llc sunrpc bonding dm_multipath amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper ttm drm ata_generic serio_raw pata_acpi megaraid_sas
[ 316.515263] CPU: 2 PID: 3055 Comm: qemu-system-x86 Not tainted 4.3.4-200.fc22.x86_64 #1
[ 316.515263] Hardware name: Dell Inc. PowerEdge 2950/0M332H, BIOS 2.7.0 10/30/2010
[ 316.515263] task: ffff88041cbbb980 ti: ffff880418e94000 task.ti: ffff880418e94000
[ 316.515263] RIP: 0010:[<ffffffff81203edc>] [<ffffffff81203edc>] kfree+0x12c/0x130
[ 316.515263] RSP: 0018:ffff880418e97cc8 EFLAGS: 00010246
[ 316.515263] RAX: 003ffff800000000 RBX: ffff88002a43fea0 RCX: dead000000000200
[ 316.515263] RDX: 000077ff80000000 RSI: ffff88041cbbb980 RDI: ffff88002a43fea0
[ 316.515263] RBP: ffff880418e97ce0 R08: ffff880418e97ca8 R09: ffffea0000a90fc0
[ 316.515263] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000006e30400
[ 316.515263] R13: ffffffffa054c60e R14: ffff88042b32e400 R15: ffff880418e97dc8
[ 316.515263] FS: 00007f1c4e3ff700(0000) GS:ffff88043fc80000(0000) knlGS:0000000000000000
[ 316.515263] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 316.515263] CR2: 0000000000000000 CR3: 0000000418c96000 CR4: 00000000000026e0
[ 316.515263] Stack:
[ 316.515263] ffff88002a43fea0 0000000006e30400 ffff880418e97e60 ffff880418e97d68
[ 316.515263] ffffffffa054c60e 0000000000007c00 ffff88041c944c00 0000000280000000
[ 316.515263] 0000000000007c00 0000000006e38000 0000000000000000 0000000000000000
[ 316.515263] Call Trace:
[ 316.515263] [<ffffffffa054c60e>] fuse_direct_IO+0x1ee/0x310 [fuse]
[ 316.515263] [<ffffffff811a791b>] generic_file_read_iter+0x47b/0x5c0
[ 316.515263] [<ffffffffa054910c>] fuse_file_read_iter+0x4c/0x70 [fuse]
[ 316.515263] [<ffffffff81223346>] __vfs_read+0xc6/0x100
[ 316.515263] [<ffffffff81223d73>] vfs_read+0x83/0x130
[ 316.515263] [<ffffffff81224c85>] SyS_pread64+0x95/0xb0
[ 316.515263] [<ffffffff8178182e>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 316.515263] Code: 2a 49 8b 01 31 f6 f6 c4 40 74 04 41 8b 71 68 4c 89 cf e8 58 a2 fa ff eb a0 4c 89 d1 48 89 da 4c 89 ce e8 78 fa ff ff eb 90 0f 0b <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 54
[ 316.515263] RIP [<ffffffff81203edc>] kfree+0x12c/0x130
[ 316.515263] RSP <ffff880418e97cc8>
[ 316.904456] ---[ end trace a63508bc8d44e7be ]---

Second test: Same as above, but with a fresh install of CentOS 7.2 (1511). Result: no bug triggered. The VM runs just fine.
kernel-3.10.0-327.4.5.el7.x86_64
glusterfs-fuse-3.7.6-1.el7.x86_64
fuse-2.9.2-6.el7.x86_64
fuse-libs-2.9.2-6.el7.x86_64

Created attachment 1137049
proposed patch #1
Created attachment 1137050
proposed patch #2
Could you please test with these two patches?
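The call trace in the oops bottoms out in SyS_pread64 → fuse_file_read_iter → generic_file_read_iter → fuse_direct_IO. In kernels of this era, generic_file_read_iter only hands a read to the filesystem's ->direct_IO method when the file was opened with O_DIRECT, so the crash is on the O_DIRECT read path (QEMU opens disk images with O_DIRECT when configured with cache=none; that this oVirt setup used cache=none is an assumption, not something stated in the report). Below is a minimal C sketch, not a confirmed reproducer, for exercising that same syscall path on a file inside the glusterfs FUSE mount when testing patched kernels. The helper name `direct_pread` and the 4096-byte alignment are hypothetical choices for illustration; the sketch does not replicate QEMU's request sizes or its asynchronous I/O.

```c
#define _GNU_SOURCE /* for O_DIRECT */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed alignment: O_DIRECT generally requires the buffer, offset and
 * length to be aligned to the logical block size; 4096 covers common cases. */
#define DIRECT_ALIGN 4096

/* Hypothetical helper: open `path` with O_DIRECT and issue a single
 * pread() of `len` bytes at `offset` (the pread64 syscall at the bottom
 * of the call trace). Returns pread()'s result, or -1 with errno set. */
ssize_t direct_pread(const char *path, off_t offset, size_t len)
{
    void *buf = NULL;
    ssize_t n;
    int fd, rc, saved;

    /* Reject unaligned requests up front instead of letting the kernel fail them. */
    if (offset < 0 || len == 0 || len % DIRECT_ALIGN != 0 ||
        (size_t)offset % DIRECT_ALIGN != 0) {
        errno = EINVAL;
        return -1;
    }

    /* posix_memalign() returns an error code rather than setting errno. */
    rc = posix_memalign(&buf, DIRECT_ALIGN, len);
    if (rc != 0) {
        errno = rc;
        return -1;
    }
    assert((uintptr_t)buf % DIRECT_ALIGN == 0);

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        saved = errno;
        free(buf);
        errno = saved;
        return -1;
    }

    /* On a FUSE mount where the daemon does not force direct_io on the file,
     * this read is dispatched to fuse_direct_IO, as in the trace above. */
    n = pread(fd, buf, len, offset);
    saved = errno;
    close(fd);
    free(buf);
    errno = saved;
    return n;
}
```

For example, `direct_pread(path, 0, 65536)` on an image inside the glusterfs mount should return the number of bytes read on a fixed kernel. The report does not establish whether a single synchronous read like this is enough to trigger the BUG, so QEMU's own I/O pattern may still be needed to reproduce it.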
Miklos, those patches look promising. I will endeavour to test them ASAP; if not today, then by the end of the week. In the interest of not introducing any additional variables into the tests at this point, I will switch my current in-production kernel (kernel-4.2.5-201.fc22.x86_64 recompiled to use SLAB) back to the default (broken) SLUB allocator with your two patches applied, and test that. Whether that works or not, I will then apply the patches against the latest kernel-4.4.4-200.fc22 and test that as well. Thank you for your work on this. I am very pleased to see this bug finally get some attention.

Miklos, Yahoo! The two patches above have allowed me to return to the SLUB allocator without fuse crashing. VMs started up with no problem, just as they should. This is with kernel 4.2.5. With only one test node running VMs on the patched kernel, and only for a few minutes so far, I am tentatively calling this one a success. I'll try the patches on 4.4.4 shortly, but I expect that to work as well. What are the odds that the Fedora kernel team will incorporate these patches without waiting for them to hit mainline/stable upstream first?

Confirmed fixed on all nodes of my production cluster with the FUSE patches included in kernel-4.4.8-200.fc22.x86_64.
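The SLAB workaround required a custom build because the slab allocator is chosen when the kernel is compiled (exactly one of CONFIG_SLAB, CONFIG_SLUB or CONFIG_SLOB is set in the kernel configuration), and the report notes the risk of booting a distro-built SLUB kernel by mistake. Below is a minimal C sketch for checking which allocator a kernel configuration selects. It assumes the running kernel's config is installed as /boot/config-<release>, as Fedora kernel packages typically do; the enum and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Slab allocators selectable at kernel build time in 4.x kernels. */
enum slab_allocator { ALLOC_UNKNOWN, ALLOC_SLAB, ALLOC_SLUB, ALLOC_SLOB };

/* True if `line` is exactly "<opt>=y", optionally followed by a newline.
 * This keeps e.g. CONFIG_SLUB_DEBUG=y from matching CONFIG_SLUB. */
static int option_enabled(const char *line, const char *opt)
{
    size_t n = strlen(opt);
    return strncmp(line, opt, n) == 0 && line[n] == '=' && line[n + 1] == 'y' &&
           (line[n + 2] == '\0' || line[n + 2] == '\n');
}

/* Classify one .config line; "# CONFIG_X is not set" lines are ignored. */
enum slab_allocator classify_config_line(const char *line)
{
    if (option_enabled(line, "CONFIG_SLAB")) return ALLOC_SLAB;
    if (option_enabled(line, "CONFIG_SLUB")) return ALLOC_SLUB;
    if (option_enabled(line, "CONFIG_SLOB")) return ALLOC_SLOB;
    return ALLOC_UNKNOWN;
}

/* Scan a config file and return the first allocator it enables. */
enum slab_allocator allocator_from_config(FILE *f)
{
    char line[512];
    while (fgets(line, sizeof line, f) != NULL) {
        enum slab_allocator a = classify_config_line(line);
        if (a != ALLOC_UNKNOWN)
            return a;
    }
    return ALLOC_UNKNOWN;
}

/* Check the running kernel, assuming its config is at /boot/config-<release>. */
enum slab_allocator running_kernel_allocator(void)
{
    struct utsname u;
    char path[300];
    FILE *f;
    enum slab_allocator a;

    if (uname(&u) != 0)
        return ALLOC_UNKNOWN;
    snprintf(path, sizeof path, "/boot/config-%s", u.release);
    f = fopen(path, "r");
    if (f == NULL)
        return ALLOC_UNKNOWN;
    a = allocator_from_config(f);
    fclose(f);
    return a;
}
```

Assuming the config file is present, `running_kernel_allocator()` would return ALLOC_SLAB on the recompiled workaround kernel and ALLOC_SLUB on a stock Fedora kernel, which uses SLUB by default.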