Escalated to Bugzilla from IssueTracker
These changes made by errata-xmlrpc.
Bugzilla comment added: An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html
Bugzilla status changed from 'RELEASE_PENDING' to 'CLOSED'
Bugzilla resolution changed from '' to 'ERRATA'
https://bugzilla.redhat.com/show_bug.cgi?id=301451
This Bugzilla update was from a 'redhat.com' email address.
Internal Status set to 'Waiting on Support'
This event sent from IssueTracker by bbraswel [Support Engineering Group] issue 174156
PLEASE NOTE: This is a follow on to bz 301451. Customer installed the errata and the system still crashed. Bill Braswell
OK, thanks for the report. The original bug was opened by a vendor, and they couldn't reproduce it with -92, so we marked it as fixed. It looks like it can still happen, though. Chris Lalancette
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Mostly notes for myself, but here is what I see so far:

OK, given the crash, we know we died in arch/i386/mm/highmem-xen.c:43. This is in kmap_atomic, where we are trying to do a new kmap. Basically what is happening is that the entry we are trying to use for the kmap is already in use:

    if (!pte_none(*(kmap_pte-idx)))
        BUG();

Looking at the assembly, we see:

    /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/arch/i386/mm/highmem-xen.c: 43
    0xc0419017 <__kmap_atomic+496>: ud2a

Which is exactly where we crashed. However, we could only have gotten here via a jump much further up, which is actually this bit of code:

    include/asm/mach-xen/asm/pgtable-3level.h: 145
    0xc0418ea8 <__kmap_atomic+129>: cmpl $0x0,(%eax)
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/arch/i386/mm/highmem-xen.c: 42
    0xc0418eab <__kmap_atomic+132>: mov 0x4(%eax),%edx
    include/asm/mach-xen/asm/pgtable-3level.h: 145
    0xc0418eae <__kmap_atomic+135>: jne 0xc0419017 <__kmap_atomic+496>
    0xc0418eb4 <__kmap_atomic+141>: test %edx,%edx
    0xc0418eb6 <__kmap_atomic+143>: jne 0xc0419017 <__kmap_atomic+496>

And looking at the state of the registers, we can see that there is a value in %eax, which means that first jne fires. This all lines up.
Next, we look at where we came from in the stack trace:

    #3 [c073cb3c] __kmap_atomic at c0419017
    #4 [c073cb64] kmap_atomic at c0419057
    #5 [c073cb70] __sync_single at c04e7cd8

The important bit is __sync_single, which is in arch/i386/kernel/swiotlb.c:

    if (PageHighMem(buffer.page)) {
        size_t len, bytes;
        char *dev, *host, *kmp;
        len = size;
        while (len != 0) {
            if (((bytes = len) + buffer.offset) > PAGE_SIZE)
                bytes = PAGE_SIZE - buffer.offset;
            kmp = kmap_atomic(buffer.page, KM_SWIOTLB);
            dev = dma_addr + size - len;
            host = kmp + buffer.offset;
            if (dir == DMA_FROM_DEVICE) {
                if (__copy_to_user_inatomic(host, dev, bytes))
                    /* inaccessible */;
            } else
                memcpy(dev, host, bytes);
            kunmap_atomic(kmp, KM_SWIOTLB);
            len -= bytes;
            buffer.page++;
            buffer.offset = 0;
        }

OK, so we see that we are trying to kmap_atomic(buffer.page, KM_SWIOTLB). Nowhere else in the kernel uses that particular kmap slot, so the problem originates around here. Basically, we have two possibilities. One is that somehow we got pre-empted, and another CPU has come in here and used that slot out from under us. The second is that there is some corruption while computing which slot we should be looking at for the PTE. I'm leaning towards the latter at the moment, but we'll have to investigate both possibilities. Chris Lalancette
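As background for the "which slot" question above, here is a minimal sketch of how an atomic kmap slot index is usually derived on i386. It assumes the common layout where the index is the kmap type plus KM_TYPE_NR times the CPU number; the constant value and the function name below are hypothetical stand-ins, not taken from the 2.6.18 Xen tree.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's KM_TYPE_NR (number of kmap types). */
enum { MODEL_KM_TYPE_NR = 16 };

/*
 * Assumed layout: each CPU owns a contiguous block of MODEL_KM_TYPE_NR
 * slots, and the kmap type selects one slot inside that block.
 */
static size_t model_kmap_slot(unsigned int type, unsigned int cpu)
{
    return (size_t)type + (size_t)MODEL_KM_TYPE_NR * cpu;
}
```

If that layout holds, the same kmap type on two different CPUs never lands on the same slot, so a collision would have to come from a second user on the same CPU or from a bad index.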
OK, I think I found the problem. I had to look deeper into the stack trace on CPU0 to find it. If you do a "bt -T" in crash, you'll see *all* of the symbols on the stack, including the ones previous to the hard-IRQ (which you don't see with a normal "bt" command). In any case, when we do that, we see a lot more of the stack (pasted below, but edited for brevity):

    crash> bt -T
    PID: 4154 TASK: c095f000 CPU: 0 COMMAND: "dd"
    <snip>
    [c073c928] machine_crash_shutdown at c0414159
    [c073c9ac] __kmap_atomic at c0419017
    [c073c9d4] machine_kexec at c054be55
    [c073c9e0] crash_kexec at c043d5b7
    <snip>
    [c073cfcc] handle_IRQ_event at c0447267
    [c073cfe4] __do_IRQ at c0447315
    [c073cffc] do_IRQ at c0406e6e
    --- <hard IRQ> ---
    bt: invalid stack address for this task: fffffffe (valid range: ebb94000 - ebb95000)
    [ebb94700] _spin_lock_irqsave at c0609568
    [ebb94708] map_single at c04e81b3
    [ebb94744] swiotlb_map_sg at c04e8a6d
    <snip>
    [ebb94f9c] sys_write at c046e3d1
    [ebb94fb8] syscall_call at c0405413
    [ebb94fd8] L6 at c040007b

What you see here at the top of the stack is the kmap_atomic BUG that caused us to crash. You can also see that we got here because of an interrupt (do_IRQ). But, now you look at what was happening *previous* to the interrupt, and you see that we were already in map_single; we got here via a syscall to write. So it seems like what was happening was that some process was trying to do I/O, which was going through the swiotlb bounce buffering. However, exactly at the wrong time, an interrupt came in, and now we *again* go into the swiotlb code during the interrupt, and our problem crops up.

The solution here seems to be to disable interrupts on the local processor before we do the kmap_atomic. Indeed, I found upstream xen-3.1-testing.hg c/s 13346, which seems to do exactly this. I've now built a test kernel with this patch applied; it's available at http://people.redhat.com/clalance/bz452175.
Can you have the customer try this test kernel out, and report back results? Thanks, Chris Lalancette
Oh, I forgot to add: if the kernel doesn't work (that is, if it crashes again), please collect another core and upload it again so I can take a look at it. Thanks, Chris Lalancette
Created attachment 323858 [details] Backport of upstream xen-3.1-testing.hg c/s 13346
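For readers without the attachment, the race diagnosed in this bug and the effect of masking local interrupts can be modelled in userspace. This is only a sketch, not the upstream patch: every name in it (cpu_model, model_sync_single, model_irq_window, model_run) is hypothetical, the interrupt is simulated by an explicit call, and local_irq_save()/local_irq_restore() are mentioned in comments only as the kind of primitive the fix is assumed to use.

```c
#include <stdbool.h>

/* Hypothetical userspace model of one CPU; none of these are kernel APIs. */
struct cpu_model {
    bool patched;       /* true: mask interrupts around the mapping */
    bool irqs_enabled;  /* models the local interrupt flag */
    bool irq_pending;   /* one interrupt is waiting to be delivered */
    bool slot_busy;     /* models a non-empty KM_SWIOTLB pte */
    int  bug_hits;      /* times the "slot already in use" check tripped */
};

static void model_sync_single(struct cpu_model *cpu);

/* Deliver the pending interrupt if allowed; its handler also bounces a buffer. */
static void model_irq_window(struct cpu_model *cpu)
{
    if (cpu->irq_pending && cpu->irqs_enabled) {
        cpu->irq_pending = false;
        model_sync_single(cpu);
    }
}

/* Models one pass through the highmem branch of __sync_single. */
static void model_sync_single(struct cpu_model *cpu)
{
    bool saved = cpu->irqs_enabled;

    if (cpu->patched)
        cpu->irqs_enabled = false;   /* stands in for local_irq_save() */

    if (cpu->slot_busy) {
        cpu->bug_hits++;             /* stands in for the BUG() in kmap_atomic */
    } else {
        cpu->slot_busy = true;       /* stands in for kmap_atomic() */
        model_irq_window(cpu);       /* an interrupt may arrive mid-copy */
        cpu->slot_busy = false;      /* stands in for kunmap_atomic() */
    }

    cpu->irqs_enabled = saved;       /* stands in for local_irq_restore() */
    model_irq_window(cpu);           /* a deferred interrupt runs only now */
}

/* Run one write with an interrupt pending; return how often the check tripped. */
static int model_run(bool patched)
{
    struct cpu_model cpu = {
        .patched = patched,
        .irqs_enabled = true,
        .irq_pending = true,
        .slot_busy = false,
        .bug_hits = 0,
    };

    model_sync_single(&cpu);
    return cpu.bug_hits;
}
```

In this model, model_run(false) trips the check once because the simulated interrupt re-enters while the slot is still mapped, while model_run(true) does not, because the interrupt is deferred until the slot has been released.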
in kernel-2.6.18-125.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 452740 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html