Bug 487672
Summary: | slab corruption with dlm and clvmd on ppc64 | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Bryn M. Reeves <bmr> | ||||
Component: | kernel | Assignee: | David Teigland <teigland> | ||||
Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 5.3 | CC: | ccaulfie, cjt, dhoward, emcnabb, jpirko, tao, teigland | ||||
Target Milestone: | rc | Keywords: | ZStream | ||||
Target Release: | --- | ||||||
Hardware: | ppc64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2009-09-02 08:12:52 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 491677 | ||||||
Attachments: |
|
Description
Bryn M. Reeves
2009-02-27 12:37:59 UTC
Looking at the words around the corrupted list pointer in the core from comment #0 we find: c0000000f93280f0: 0000000000000000 565f766730315130 ........V_vg01Q0 c0000000f9328100: 32000000ed9e0000 c0000000ed9e0000 2............... Which appears to be an LVM2 lock string. There's a VG named vg01 present on these hosts: $ grep vg01 lvmdump-*/vgs lvmdump-1/vgs: vg01 wz--n- 4.00M 1 7 0 16.52G 176.00M 0IQtLj-uPpD-y82v-FsFV-xplc-oNv7-0ovNi3 lvmdump-2/vgs: vg01 wz--n- 4.00M 1 7 0 16.52G 176.00M 0IQtLj-uPpD-y82v-FsFV-xplc-oNv7-0ovNi3 So the string in comment #2 ("V_vg01Q02\0") appears as though it may have come from: LVM2.*/lib/locking/cluster_locking.c: int lock_resource(struct cmd_context *cmd, const char *resource, uint32_t flags) #endif { [...] switch (flags & LCK_SCOPE_MASK) { case LCK_VG: /* If the VG name is empty then lock the unused PVs */ if (!*resource) /* FIXME Deprecated */ dm_snprintf(lockname, sizeof(lockname), "P_orphans"); else if (*resource == '#') dm_snprintf(lockname, sizeof(lockname), "P_%s", resource + 1); else dm_snprintf(lockname, sizeof(lockname), "V_%s", resource); I'm not sure about the "Q02" - is it possible that's part of the resource string? I'm having trouble getting sane backtraces from the core that showed the assumed lock strings. E.g. I see this backtrace in the log: NIP [C0000000001BE174] .list_del+0x44/0xb0 LR [C0000000001BE170] .list_del+0x40/0xb0 Call Trace: [C00000000F45FAC0] [C0000000001BE170] .list_del+0x40/0xb0 (unreliable) [C00000000F45FB40] [C0000000000E7570] .free_block+0xdc/0x1d4 [C00000000F45FC00] [C0000000000E772C] .drain_array+0xc4/0x144 [C00000000F45FCB0] [C0000000000E9930] .cache_reap+0x118/0x2d0 [C00000000F45FD50] [C0000000000815E8] .run_workqueue+0xdc/0x168 [C00000000F45FDF0] [C000000000082354] .worker_thread+0x12c/0x19c [C00000000F45FEE0] [C000000000086E74] .kthread+0x128/0x178 [C00000000F45FF90] [C000000000026FFC] .kernel_thread+0x4c/0x68 Which seems completely consistent with e.g. LR/NIP in the register dumps but looking at that via the bt command shows: #0 [c00000000f45fac0] .gen_pool_free at c0000000001be170 #1 [c00000000f45fb40] .__drain_alien_cache at c0000000000e7570 #2 [c00000000f45fc00] .drain_freelist at c0000000000e772c #3 [c00000000f45fcb0] .s_show at c0000000000e9930 #4 [c00000000f45fd50] .cleanup_workqueue_thread at c0000000000815e8 #5 [c00000000f45fdf0] .workqueue_cpu_callback at c000000000082354 #6 [c00000000f45fee0] .bit_waitqueue at c000000000086e74 #7 [c00000000f45ff90] .kernel_thread at c000000000026ffc The addresses are correct but the resolved symbol names are not. Same thing if I disassemble e.g. list_del: /usr/src/debug/kernel-2.6.18/linux-2.6.18.ppc64/lib/list_debug.c: 61 0xc0000000001bdb9c <.list_del>: beq- cr7,0xc0000000001bdc0c <list_del+112> 0xc0000000001bdba0 <.list_del+4>: ld r9,0(r31) 0xc0000000001bdba4 <.list_del+8>: addi r9,r9,1 0xc0000000001bdba8 <.list_del+12>: cmpd cr7,r9,r11 0xc0000000001bdbac <.list_del+16>: beq- cr7,0xc0000000001bdc0c <list_del+112> /usr/src/debug/kernel-2.6.18/linux-2.6.18.ppc64/lib/list_debug.c: 63 0xc0000000001bdbb0 <.list_del+20>: bl 0xc0000000001c4c44 <__pci_bus_find_cap+20> /usr/src/debug/kernel-2.6.18/linux-2.6.18.ppc64/lib/list_debug.c: 61 0xc0000000001bdbb4 <.list_del+24>: nop 0xc0000000001bdbb8 <.list_del+28>: mr r4,r29 There are apparent calls to PCI related symbols all through the routine which cannot be correct. I think the disassembly is correct (it looks to be doing more or less what I'd expect) but the symbol resolution is borked. I'm going to try to get some assistance on this from the crash folks. Mystery in comment #4 solved.. seems to be a CAS bug. CAS uses a fairly crude heuristic to determine the kernel version of a given vmcore image and it seems that in this case it is failing because the vmcore in question has rather a lot of things that look like kernel ident strings in pagecache: strings 20090120002426-vmcore | grep '2\.6\.18' [ pages and pages of output ] I'll report this to the CAS maintainer. Getting proper backtraces now: crash> bt PID: 28 TASK: c0000000ffc149c0 CPU: 2 COMMAND: "events/2" R0: c0000000001be170 R1: c00000000f45fac0 R2: c0000000005f1bf0 R3: 0000000000000058 R4: 8000000000001032 R5: 0000000000000000 R6: 0000000000000000 R7: 0000000000000000 R8: 0000000000000002 R9: c000000000680200 R10: c0000000ff19bc08 R11: 0000000000000000 R12: 0000000000004000 R13: c0000000004d5280 R14: 0000000000000000 R15: c0000000003e3048 R16: 4000000001c00000 R17: c0000000003e1a58 R18: 0000000000000000 R19: 0000000000638000 R20: c0000000004953e8 R21: 00000000020953e8 R22: 0000000002095658 R23: 000000000000000b R24: 0000000000000000 R25: 0000000000000000 R26: c0000000efb32900 R27: c0000000ed9e0358 R28: c0000000fc725398 R29: c0000000f9328100 R30: c000000000532918 R31: c0000000ed9e0000 NIP: c0000000001be174 MSR: 8000000000021032 OR3: c00000000006c9a8 CTR: 80000000001c5840 LR: c0000000001be170 XER: 0000000000000006 CCR: 0000000028000082 MQ: 0000000000000000 DAR: 0000000000000000 DSISR: 0000000000000000 Syscall Result: 0000000000000000 NIP [c0000000001be174] .list_del #0 [c00000000f45fac0] .list_del at c0000000001be170 #1 [c00000000f45fb40] .free_block at c0000000000e7570 #2 [c00000000f45fc00] .drain_array at c0000000000e772c #3 [c00000000f45fcb0] .cache_reap at c0000000000e9930 #4 [c00000000f45fd50] .run_workqueue at c0000000000815e8 #5 [c00000000f45fdf0] .worker_thread at c000000000082354 #6 [c00000000f45fee0] .kthread at c000000000086e74 #7 [c00000000f45ff90] .kernel_thread at c000000000026ffc Do we know whether this is a regression between 5.2 and 5.3? The area in question is probably the 32/64 compat code in fs/dlm/user.c. This code is "highly delicate" and was changed in 5.3 by this patch: http://post-office.corp.redhat.com/archives/rhkernel-list/2008-August/msg00847.html which was for bug 458760 -- the compat_input() change is the scary one. Looking at the patch mentioned in comment 7, - compat_input(kbuf, k32buf, count - sizeof(struct dlm_write_request32)); + compat_input(kbuf, k32buf, count + 1); But in compat_input() I don't think we account for the extra "1": + memcpy(kb->i.lspace.name, kb32->i.lspace.name, count - + offsetof(struct dlm_write_request32, i.lspace.name)); and + memcpy(kb->i.lock.name, kb32->i.lock.name, count - + offsetof(struct dlm_write_request32, i.lock.name)); meaning, I think, that we memcpy 1 extra byte. I'm beginning to think we might have two problems here; the one that we were seeing with kernels <107.el5 and possibly something slightly different in the later ones relating to this patch. The cluster where this is being seen triggers the problem very reliably so I'll get a build submitted with that change backed out for testing. (In reply to comment #7) > Do we know whether this is a regression between 5.2 and 5.3? The problem happened before going to 5.3. The test cluster wasn't updated to RHEL5.3 until 23 Feb. (All vmcores before that time, including the first, were from 5.2) > The problem happened before going to 5.3. The test cluster wasn't updated to > RHEL5.3 until 23 Feb. (All vmcores before that time, including the first, were But the recent tests have produced somewhat different footprints - as I mentioned in comment #9 it looks like we might have multiple issues here. I'm building a kernel now with the change Dave pointed out reverted for testing - will update again when that's ready. How is compat_input() not accounting for the new nil byte? The nil is added with kzalloc(), and it's accounted for with the parameters passed to compat_input() so that compat_input() doesn't have to account for it. Aren't all the occurances with the LVM2 lock name before the corruption from vmcore's prior to using 2.6.18-128.0.1? Wouldn't rolling that change back, reintroduce the problem it solves? (In reply to comment #13) > Wouldn't rolling that change back, reintroduce the problem it solves? That didn't read like I intended... of course rolling it back will reintroduce the problem it solves, but is that what we want to do in this case, when it could be one of the multiple problems encountered? Right now I think we're just trying to get some clarity. If the test kernel changes the footprint of the crash (or causes it to revert to the patterns we'd seen in earlier cores) then I'd suggest we fork off a new bugzilla to track the problem introduced by that patch and use this BZ to focus on the original problem that was being seen on this cluster. (In reply to comment #8) > ... I think, that we memcpy 1 extra byte. I understand why the patch didn't work as expected: sizeof(dlm_write_request32) = 0x68 sizeof(dlm_lock_params32) = 0x58 sizeof(dlm_write_request) = 0x78 sizeof(dlm_lock_params) = 0x68 but offsetof(struct dlm_write_request32, i.lock.name) = 0x64 offsetof(struct dlm_write_request, i.lock.name) = 0x78 memcpy(kb->i.lock.name, kb32->i.lock.name, count - offsetof(struct dlm_write_request32, i.lock.name)); In the case of the recent debug kernel vmcore the count passed to device_write is 0x6f and i.lock.namelen=7. In compat_input() this memcpy(kb->i.lock.name, kb32->i.lock.name, count - offsetof(struct dlm_write_request32, i.lock.name)); copies 0xc, not 0x8 bytes. We memcpy'ed 4 extra bytes and didn't effectively nil terminate name either. dlm_lock_params32 needs 4 bytes of padding before name before that patch would work. An alternative view is that libdlm:dlm/lib/libdlm.c in ls_lock_v6() and ls_lock_v5() is miscalculating len when it assumes offsetof(struct dlm_write_request, i.lock.name) = sizeof(dlm_write_request) and does this: req->i.lock.namelen = namelen; memcpy(req->i.lock.name, name, namelen); ... len = sizeof(struct dlm_write_request_v5) + namelen; ... status = write(lsinfo->fd, req, len); Is it possible to fix libdlm and leave the original dlm kernel patch in place? Perhaps this needs a two pronged fix. To keep the slab from getting corrupted: device_write() needs to be changed to: - kbuf = kmalloc(count + 1 + (sizeof(struct dlm_write_request) - - sizeof(struct dlm_write_request32)), GFP_KERNEL); + kbuf = kmalloc(sizeof(struct dlm_write_request) + 1 + + (count - offsetof(struct dlm_write_request32, i.lock.name)) At the same time, libdlm needs to be fixed to provide a proper size to the write() system call in ls_lock_v5() and ls_lock_v6(). There, the 'len" should be calculated as: len = offsetof(struct dlm_write_request, i.lock.name) + namelen NOTE: any libdlm change needs to consider the "count < sizeof(struct dlm_write_request32)" condition in device_write(). If the vgname is 1 byte, and the lock name is V_?, device_write() will return EINVAL. Curtis, thanks for the detailed analysis. I'll step through all your calculations myself to verify, and then post kernel/libdlm patches that we can build for you to test. Given the unobvious way that offsetof works, I'd like to avoid using it at all. I think we should be able to code things in such a way that the particular offset doesn't matter anywhere. It seems to me that the simplest approach to all this is to calculate the name length right away and pass that around to use when copying, like the following diff. Is there something I'm missing that would not work with this? diff --git a/fs/dlm/user.c b/fs/dlm/user.c index 302cf0c..e219e53 100644 --- a/fs/dlm/user.c +++ b/fs/dlm/user.c @@ -84,7 +84,7 @@ struct dlm_lock_result32 { static void compat_input(struct dlm_write_request *kb, struct dlm_write_request32 *kb32, - size_t count) + int namelen) { kb->version[0] = kb32->version[0]; kb->version[1] = kb32->version[1]; @@ -96,8 +96,7 @@ static void compat_input(struct dlm_write_request *kb, kb->cmd == DLM_USER_REMOVE_LOCKSPACE) { kb->i.lspace.flags = kb32->i.lspace.flags; kb->i.lspace.minor = kb32->i.lspace.minor; - memcpy(kb->i.lspace.name, kb32->i.lspace.name, count - - offsetof(struct dlm_write_request32, i.lspace.name)); + memcpy(kb->i.lspace.name, kb32->i.lspace.name, namelen); } else if (kb->cmd == DLM_USER_PURGE) { kb->i.purge.nodeid = kb32->i.purge.nodeid; kb->i.purge.pid = kb32->i.purge.pid; @@ -115,8 +114,7 @@ static void compat_input(struct dlm_write_request *kb, kb->i.lock.bastaddr = (void *)(long)kb32->i.lock.bastaddr; kb->i.lock.lksb = (void *)(long)kb32->i.lock.lksb; memcpy(kb->i.lock.lvb, kb32->i.lock.lvb, DLM_USER_LVB_LEN); - memcpy(kb->i.lock.name, kb32->i.lock.name, count - - offsetof(struct dlm_write_request32, i.lock.name)); + memcpy(kb->i.lock.name, kb32->i.lock.name, namelen); } } @@ -525,15 +523,25 @@ static ssize_t device_write(struct file *file, const char #ifdef CONFIG_COMPAT if (!kbuf->is64bit) { struct dlm_write_request32 *k32buf; + int namelen = 0; + + if (count > sizeof(struct dlm_write_request32)) + namelen = count - sizeof(struct dlm_write_request32); + k32buf = (struct dlm_write_request32 *)kbuf; - kbuf = kmalloc(count + 1 + (sizeof(struct dlm_write_request) - - sizeof(struct dlm_write_request32)), GFP_KERNEL); - if (!kbuf) + + /* add 1 after namelen so that the name string is terminated */ + kbuf = kzalloc(sizeof(struct dlm_write_request) + namelen + 1, + GFP_KERNEL); + if (!kbuf) { + kfree(k32buf); return -ENOMEM; + } if (proc) set_bit(DLM_PROC_FLAGS_COMPAT, &proc->flags); - compat_input(kbuf, k32buf, count + 1); + + compat_input(kbuf, k32buf, namelen); kfree(k32buf); } #endif (In reply to comment #21) In general I like the idea of using your calculation of namelen to get the job done. That change coupled with the kmalloc() change to kzalloc() effectively nil terminates the lock name. Now, any garbage that must be passed because count = sizeof(struct dlm_write_request32) + namelen gets washed away. Yes, I like the idea of passing around namelen. it's far tidier and less prone to error. Created attachment 333877 [details] patch to test Here is the patch from comment 21 that we can build and test. Tests with this patch have been a success. (cluster nodes that were getting the slab corruption predictably, are no longer seeing any problems and can mount their gfs filesystems.) Yes, I'll begin all the steps to get it posted upstream and internally. patch from comment 24 posted to rhkernel http://post-office.corp.redhat.com/archives/rhkernel-list/2009-March/msg00072.html This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. in kernel-2.6.18-134.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html |