Bug 452175 - kernel BUG at arch/i386/mm/highmem-xen.c:43! with errata/RHBA-2008-0314 installed
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Chris Lalancette
QA Contact: Martin Jenner
Duplicates: 452740
Depends On:
Blocks: 448753
Reported: 2008-06-19 16:38 EDT by Issue Tracker
Modified: 2010-10-22 22:05 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-01-20 14:48:11 EST

Attachments
Backport of upstream xen-3.1-testing.hg c/s 13346 (931 bytes, patch)
2008-11-18 03:01 EST, Chris Lalancette

Description Issue Tracker 2008-06-19 16:38:24 EDT
Escalated to Bugzilla from IssueTracker
Comment 1 Issue Tracker 2008-06-19 16:38:26 EDT
These changes made by errata-xmlrpc@redhat.com.
Bugzilla comment added:
 
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


Bugzilla status changed from 'RELEASE_PENDING' to 'CLOSED'
Bugzilla resolution changed from '' to 'ERRATA'

https://bugzilla.redhat.com/show_bug.cgi?id=301451
This Bugzilla update was from a 'redhat.com' email address. Internal
Status set to 'Waiting on Support'
This event sent from IssueTracker by bbraswel  [Support Engineering Group]
 issue 174156
Comment 3 Bill Braswell 2008-06-19 16:41:00 EDT
PLEASE NOTE:

This is a follow-on to bz 301451.  The customer installed the errata and the
system still crashed.


Bill Braswell
Comment 4 Chris Lalancette 2008-06-20 02:37:41 EDT
OK, thanks for the report.  The original bug was opened by a vendor, and they
couldn't reproduce it with -92, so we marked it as fixed.  It looks like it
can still happen, though.

Chris Lalancette
Comment 5 RHEL Product and Program Management 2008-06-20 14:12:07 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 24 Chris Lalancette 2008-11-14 12:17:07 EST
Mostly notes for myself, but here is what I see so far:

OK, given the crash, we know we died in arch/i386/mm/highmem-xen.c:43.  This is in kmap_atomic, where we are trying to do a new kmap.  Basically what is happening is that the entry we are trying to use for the kmap is already in use:

	if (!pte_none(*(kmap_pte-idx)))
		BUG();

Looking at the assembly, we see:

/usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/arch/i386/mm/highmem-xen.c: 43
0xc0419017 <__kmap_atomic+496>: ud2a   

Which is exactly where we crashed.  However, we could only have gotten here via a jump much further up, which is actually this bit of code:

include/asm/mach-xen/asm/pgtable-3level.h: 145
0xc0418ea8 <__kmap_atomic+129>: cmpl   $0x0,(%eax)
/usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/arch/i386/mm/highmem-xen.c: 42
0xc0418eab <__kmap_atomic+132>: mov    0x4(%eax),%edx
include/asm/mach-xen/asm/pgtable-3level.h: 145
0xc0418eae <__kmap_atomic+135>: jne    0xc0419017 <__kmap_atomic+496>
0xc0418eb4 <__kmap_atomic+141>: test   %edx,%edx
0xc0418eb6 <__kmap_atomic+143>: jne    0xc0419017 <__kmap_atomic+496>

And looking at the state of the registers, we can see that there is a value in %eax, which means that first jne fires.  This all lines up.
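
As a minimal user-space sketch of the check those instructions implement: on a PAE (3-level) kernel a PTE is two 32-bit words, and the slot counts as free only when both are zero. The struct and function names here (pae_pte, pae_pte_none, slot_check_would_bug) are hypothetical stand-ins for illustration, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PAE page-table entry: two 32-bit halves. */
struct pae_pte {
    uint32_t pte_low;   /* offset 0: the word tested by "cmpl $0x0,(%eax)" */
    uint32_t pte_high;  /* offset 4: the word loaded by "mov 0x4(%eax),%edx" */
};

/* An entry is empty only when both halves are zero. */
static bool pae_pte_none(const struct pae_pte *pte)
{
    return pte->pte_low == 0 && pte->pte_high == 0;
}

/*
 * Mirrors the test at highmem-xen.c:43. A true result corresponds to
 * either of the two "jne" branches being taken, i.e. BUG() firing.
 */
static bool slot_check_would_bug(const struct pae_pte *pte)
{
    return !pae_pte_none(pte);
}
```

Either half being nonzero is enough to reach the ud2a, which is why the disassembly contains two separate jne instructions to the same target.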

Next, we look at where we came from in the stack trace:

 #3 [c073cb3c] __kmap_atomic at c0419017
 #4 [c073cb64] kmap_atomic at c0419057
 #5 [c073cb70] __sync_single at c04e7cd8

The important bit is __sync_single, which is in arch/i386/kernel/swiotlb.c:

	if (PageHighMem(buffer.page)) {
		size_t len, bytes;
		char *dev, *host, *kmp;
		len = size;
		while (len != 0) {
			if (((bytes = len) + buffer.offset) > PAGE_SIZE)
				bytes = PAGE_SIZE - buffer.offset;
			kmp  = kmap_atomic(buffer.page, KM_SWIOTLB);
			dev  = dma_addr + size - len;
			host = kmp + buffer.offset;
			if (dir == DMA_FROM_DEVICE) {
				if (__copy_to_user_inatomic(host, dev, bytes))
					/* inaccessible */;
			} else
				memcpy(dev, host, bytes);
			kunmap_atomic(kmp, KM_SWIOTLB);
			len -= bytes;
			buffer.page++;
			buffer.offset = 0;
		}

OK, so we see that we are trying to kmap_atomic(buffer.page, KM_SWIOTLB).  Nowhere else in the kernel uses that particular kmap slot, so the problem originates around here.
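
To make it concrete how often that slot is taken and released, here is a small user-space sketch of just the chunking arithmetic from the loop above. It assumes a 4096-byte page; SKETCH_PAGE_SIZE and sync_chunks are hypothetical names for illustration, and no mapping or copying is done.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/*
 * Walks a buffer of `size` bytes starting `offset` bytes into its first
 * page, the same way the loop in __sync_single does. Each iteration
 * stands for one kmap_atomic()/kunmap_atomic() pair on KM_SWIOTLB.
 * Records up to max_chunks per-iteration byte counts in `chunks`
 * (may be NULL) and returns the total number of iterations.
 */
static size_t sync_chunks(size_t offset, size_t size,
                          size_t *chunks, size_t max_chunks)
{
    size_t len = size;
    size_t n = 0;

    assert(offset < SKETCH_PAGE_SIZE);
    while (len != 0) {
        size_t bytes = len;

        /* Never copy past the end of the current page. */
        if (bytes + offset > SKETCH_PAGE_SIZE)
            bytes = SKETCH_PAGE_SIZE - offset;
        if (chunks != NULL && n < max_chunks)
            chunks[n] = bytes;
        n++;
        len -= bytes;
        offset = 0;  /* later pages are copied from their start */
    }
    return n;
}
```

A 6144-byte buffer starting 3072 bytes into a page is therefore handled in three pieces, so the KM_SWIOTLB slot is mapped and unmapped three times for a single sync.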

Basically, we have two possibilities.  One is that somehow we got pre-empted, and another CPU has come in here and used that slot out from under us.  The second is that there is some corruption while computing which slot we should be looking at for the PTE.  I'm leaning towards the latter at the moment, but we'll have to investigate both possibilities.

Chris Lalancette
Comment 25 Chris Lalancette 2008-11-17 08:36:09 EST
OK, I think I found the problem.  I had to look deeper into the stack trace on CPU0 to find it.  If you do a "bt -T" in crash, you'll see *all* of the symbols on the stack, including the ones previous to the hard-IRQ (which you don't see with a normal "bt" command).  In any case, when we do that, we see a lot more of the stack (pasted below, but edited for brevity):

crash> bt -T
PID: 4154   TASK: c095f000  CPU: 0   COMMAND: "dd"
<snip>
  [c073c928] machine_crash_shutdown at c0414159
  [c073c9ac] __kmap_atomic at c0419017
  [c073c9d4] machine_kexec at c054be55
  [c073c9e0] crash_kexec at c043d5b7
<snip>
  [c073cfcc] handle_IRQ_event at c0447267
  [c073cfe4] __do_IRQ at c0447315
  [c073cffc] do_IRQ at c0406e6e
--- <hard IRQ> ---
bt: invalid stack address for this task: fffffffe
    (valid range: ebb94000 - ebb95000)
  [ebb94700] _spin_lock_irqsave at c0609568
  [ebb94708] map_single at c04e81b3
  [ebb94744] swiotlb_map_sg at c04e8a6d
<snip>
  [ebb94f9c] sys_write at c046e3d1
  [ebb94fb8] syscall_call at c0405413
  [ebb94fd8] L6 at c040007b

What you see here at the top of the stack is the kmap_atomic BUG that caused us to crash.  You can also see that we got here because of an interrupt (do_IRQ).  But if you look at what was happening *before* the interrupt, you see that we were already in map_single; we got there via a syscall to write.  So it seems that some process was trying to do I/O, which was going through the swiotlb bounce buffering.  However, at exactly the wrong time, an interrupt came in, and we went into the swiotlb code *again* during the interrupt, while the KM_SWIOTLB slot was still in use by the interrupted code.  That is when our problem crops up.

The solution here seems to be to disable interrupts on the local processor before we do the kmap_atomic.  Indeed, I found upstream xen-3.1-testing.hg c/s 13346, which seems to do exactly this.  I've now built a test kernel with this patch applied; it's available at http://people.redhat.com/clalance/bz452175.  Can you have the customer try this test kernel out, and report back results?
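
The re-entrancy and the effect of masking interrupts can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: every name in it (sim_kmap, sim_sync_single, sim_run and so on) is hypothetical, the "interrupt" is just a flag checked at one point, and it does not reproduce the actual upstream patch. The comments mention local_irq_save()/local_irq_restore() only as the usual kernel way to mask local interrupts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static const void *swiotlb_slot;  /* stand-in for this CPU's KM_SWIOTLB PTE */
static bool irqs_enabled = true;  /* stand-in for the local interrupt flag */
static bool irq_pending;          /* a device interrupt is waiting */
static bool fix_applied;          /* mask interrupts around the mapping? */
static int bug_hits;              /* times the BUG() condition was met */

static void sim_sync_single(const void *page);

/* Like kmap_atomic(): the slot must be empty before it is used. */
static void sim_kmap(const void *page)
{
    if (swiotlb_slot != NULL)
        bug_hits++;               /* the real kernel would BUG() here */
    swiotlb_slot = page;
}

static void sim_kunmap(void)
{
    swiotlb_slot = NULL;
}

/* Deliver a pending interrupt, whose handler also bounces a buffer. */
static void sim_irq_window(void)
{
    static const int irq_page = 0;

    if (irq_pending && irqs_enabled) {
        irq_pending = false;
        sim_sync_single(&irq_page);
    }
}

static void sim_sync_single(const void *page)
{
    bool saved = irqs_enabled;

    if (fix_applied)
        irqs_enabled = false;     /* in a kernel: local_irq_save() */
    sim_kmap(page);
    sim_irq_window();             /* the copy happens while mapped */
    sim_kunmap();
    if (fix_applied) {
        irqs_enabled = saved;     /* in a kernel: local_irq_restore() */
        sim_irq_window();         /* deferred interrupt runs only now */
    }
}

/* Run one write-path sync with an interrupt pending; return BUG count. */
static int sim_run(bool with_fix)
{
    static const int task_page = 0;

    swiotlb_slot = NULL;
    irqs_enabled = true;
    irq_pending = true;
    fix_applied = with_fix;
    bug_hits = 0;
    sim_sync_single(&task_page);
    return bug_hits;
}
```

Without the masking, the simulated handler finds the slot still occupied by the interrupted write path; with it, the handler runs only after the slot has been released.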

Thanks,
Chris Lalancette
Comment 26 Chris Lalancette 2008-11-17 10:50:53 EST
Oh, I forgot to add: if the kernel doesn't work (that is, if it crashes again), please collect another core and upload it again so I can take a look at it.

Thanks,
Chris Lalancette
Comment 29 Chris Lalancette 2008-11-18 03:01:13 EST
Created attachment 323858
Backport of upstream xen-3.1-testing.hg c/s 13346
Comment 37 Don Zickus 2008-12-02 17:19:08 EST
in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 39 Bill Burns 2008-12-09 16:28:53 EST
*** Bug 452740 has been marked as a duplicate of this bug. ***
Comment 42 errata-xmlrpc 2009-01-20 14:48:11 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
