459728 – kernel BUG at arch/i386/mm/hypervisor.c:196!

Summary: kernel BUG at arch/i386/mm/hypervisor.c:196!

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	4.5
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Andrew Jones
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	458302

Reported:	2008-08-21 16:56 UTC by jeffb69ma
Modified:	2018-11-14 12:09 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-07-28 15:05:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
XenSource	756	0	None	None	None	Never

Description jeffb69ma 2008-08-21 16:56:04 UTC

Description of problem:

A RHEL4u5 domU guest running the 2.6.9-55.ELxenU kernel crashed.


The relevant part of the syslog is shown below.

Jul 23 22:35:30 asprnt01 kernel: ------------[ cut here ]------------
Jul 23 22:35:30 asprnt01 kernel: kernel BUG at arch/i386/mm/hypervisor.c:196!
Jul 23 22:35:30 asprnt01 kernel: invalid operand: 0000 [#1]
Jul 23 22:35:30 asprnt01 kernel: SMP
Jul 23 22:35:30 asprnt01 kernel: Modules linked in: autofs4 sunrpc dm_mirror dm_multipath dm_mod xennet ext3 jbd xenblk sd_mod scsi_mod
Jul 23 22:35:30 asprnt01 kernel: CPU: 1
Jul 23 22:35:30 asprnt01 kernel: EIP: 0061:[<c0114c6e>] Not tainted VLI
Jul 23 22:35:30 asprnt01 kernel: EFLAGS: 00010282 (2.6.9-55.ELxenU)
Jul 23 22:35:30 asprnt01 kernel: EIP is at xen_pgd_pin+0x46/0x54
Jul 23 22:35:30 asprnt01 kernel: eax: ffffffea ebx: e9462e2c ecx: 00000001 edx: 00000000
Jul 23 22:35:30 asprnt01 kernel: esi: 00007ff0 edi: ec012840 ebp: ec012840 esp: e9462e2c
Jul 23 22:35:30 asprnt01 kernel: ds: 007b es: 007b ss: 0068
Jul 23 22:35:30 asprnt01 kernel: Process getloadavg.pl (pid: 32097, threadinfo=e9462000 task=d27001b0)
Jul 23 22:35:30 asprnt01 kernel: Stack: 00000002 0009fb52 dc032018 1c032000 00380640 ed1a6080 c011246b 1c032000
Jul 23 22:35:30 asprnt01 kernel: dc032000 00000061 80000000 ed1a60c4 c01124ff dc032000 00000001 e9462e78
Jul 23 22:35:30 asprnt01 kernel: c0161d57 ed1a6080 d27001b0 dd058a40 e9462eb4 00000080 c0158417 ed4a9580
Jul 23 22:35:30 asprnt01 kernel: Call Trace:
Jul 23 22:35:30 asprnt01 kernel: [<c011246b>] __pgd_pin+0x2d/0x41
Jul 23 22:35:30 asprnt01 kernel: [<c01124ff>] mm_pin+0x21/0x2e
Jul 23 22:35:30 asprnt01 kernel: [<c0161d57>] exec_mmap+0xf8/0x1eb
Jul 23 22:35:30 asprnt01 kernel: [<c0158417>] vfs_read+0xcf/0xd8
Jul 23 22:35:30 asprnt01 kernel: [<c0161ef8>] flush_old_exec+0x46/0x228
Jul 23 22:35:31 asprnt01 kernel: [<c017e42c>] load_elf_binary+0x361/0xce5
Jul 23 22:35:31 asprnt01 kernel: [<c017e610>] load_elf_binary+0x545/0xce5
Jul 23 22:35:31 asprnt01 kernel: [<c0146050>] kmap_high+0x2d/0x1fb
Jul 23 22:35:31 asprnt01 kernel: [<c0146214>] kmap_high+0x1f1/0x1fb
Jul 23 22:35:31 asprnt01 kernel: [<c0146231>] kunmap_high+0x13/0x95
Jul 23 22:35:31 asprnt01 kernel: [<c0146296>] kunmap_high+0x78/0x95
Jul 23 22:35:31 asprnt01 kernel: [<c0161763>] copy_strings+0x22f/0x23a
Jul 23 22:35:31 asprnt01 kernel: [<c017e0cb>] load_elf_binary+0x0/0xce5
Jul 23 22:35:31 asprnt01 kernel: [<c016295e>] search_binary_handler+0xb4/0x229
Jul 23 22:35:31 asprnt01 kernel: [<c0162c4b>] do_execve+0x178/0x210
Jul 23 22:35:31 asprnt01 kernel: [<c0105d79>] sys_execve+0x2c/0x8e
Jul 23 22:35:31 asprnt01 kernel: [<c010737f>] syscall_call+0x7/0xb
Jul 23 22:35:31 asprnt01 kernel: Code: 00 75 0d a1 a0 80 35 c0 8b 04 90 25 ff ff ff 7f 89 44 24 04 89 e3 b9 01 00 00 00 31 d2 be f0 7f 00 00 e8 d6 c6 fe ff 85 c0 79 08 <0f> 0b c4 00 93 2d 27 c0 83 c4 10 5b 5e c3 56 53 83 ec 10 8b 54
Jul 23 22:35:31 asprnt01 kernel: <0>Fatal exception: panic in 5 seconds
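
The register dump hints at what went wrong. In the Code: bytes, the faulting instruction (0f 0b, ud2) is preceded by a call (e8 ...), test %eax,%eax (85 c0) and jns (79 08), so the BUG fires when the call returns a negative value. Assuming eax still holds that return value at the time of the trap, it can be read as a signed 32-bit error code. A minimal Python sketch of that decoding (to_signed32 is a hypothetical helper, not part of any tool mentioned in this report):

```python
import errno


def to_signed32(value: int) -> int:
    """Interpret a 32-bit register value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


eax = 0xFFFFFFEA  # eax from the register dump above
ret = to_signed32(eax)

# Map the negated return value to its errno name.
name = errno.errorcode.get(-ret, "unknown")
print(ret, name)  # prints "-22 EINVAL"
```

Under that assumption, the call made from xen_pgd_pin returned -EINVAL, which is consistent with the hypervisor refusing the pin request.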


Version-Release number of selected component (if applicable):

RHEL4u5 / 2.6.9-55.ELxenU #1 SMP

How reproducible:

Random crash, could not reproduce.


Additional info:

This seems to be a known bug in the Xen community, but not at Red Hat:

http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=756

Comment 3 Bill Burns 2008-10-20 12:23:59 UTC

Is this a 64-bit dom0 and a 32-bit domU?

Comment 4 jeffb69ma 2008-10-20 12:50:11 UTC

32-bit dom0 and domU.

Comment 5 Chris Lalancette 2008-11-17 08:22:47 UTC

OK.  Well, do you have your guests set up to dump core automatically when they crash?  If not, please configure it in /etc/xen/xend-config.sxp; then the next time it crashes, we should at least get a core.  Also, if it happens again, please attach the full output of "xm dmesg" from the dom0; that will give us a bit more information about why this is happening.

Thanks,
Chris Lalancette
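
For reference, a sketch of the setting this refers to, assuming the dom0's xend-config.sxp uses the stock enable-dump option (it normally ships commented out as "no"; check the sample file on the dom0 before relying on this):

```
# /etc/xen/xend-config.sxp on the dom0
# Write a core dump of a guest when it crashes.
(enable-dump yes)
```

xend has to be restarted for the change to take effect.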

Comment 16 Andrew Jones 2009-07-15 21:53:57 UTC

This bug is still waiting for a reproducer to be identified. Also, there's an outstanding question of whether or not the crash has been seen on later releases, at least 4.8.  The upstream bug doesn't show any recent reports of it either.

Just to capture some other comments here:

From the xm dmesg log I see a lot of the following:

(XEN) mm.c:649:d27 Error getting mfn 8846 (pfn 5555555555555555) from L1 entry 0000000008846063 for dom27
(XEN) printk: 18834 messages suppressed.

There are also several other mfn errors with more sane pfns and many 'Non-privileged (27) attempt to map I/O space' errors.
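
The L1 entry in the message above can be decoded by hand, which helps separate the sane part of the message from the bogus part. A minimal Python sketch, assuming the standard x86 page-table entry layout (flags in the low 12 bits, frame number above them); decode_l1_entry and the flag table are hypothetical names used only for illustration:

```python
# Low flag bits of an x86 page-table entry.
PTE_FLAGS = {
    0x001: "present",
    0x002: "writable",
    0x004: "user",
    0x008: "write-through",
    0x010: "cache-disable",
    0x020: "accessed",
    0x040: "dirty",
}


def decode_l1_entry(entry: int):
    """Split an L1 entry into (frame number, list of set flag names)."""
    frame = (entry >> 12) & ((1 << 40) - 1)  # frame number bits 12..51
    flags = [name for bit, name in PTE_FLAGS.items() if entry & bit]
    return frame, flags


frame, flags = decode_l1_entry(0x0000000008846063)
print(hex(frame), flags)
# prints: 0x8846 ['present', 'writable', 'accessed', 'dirty']
```

The frame number matches the mfn 8846 in the message and the flags are those of an ordinary present, writable page, so the entry itself looks unremarkable; it is the pfn of 5555555555555555 reported for that mfn that is implausible.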

Comment 17 Chris Lalancette 2009-07-16 09:19:27 UTC

(In reply to comment #16)
> This bug is still waiting for a reproducer to be identified. Also, there's an
> outstanding question of whether or not the crash has been seen on later
> releases, at least 4.8.  The upstream bug doesn't show any recent reports of it
> either.
> 
> Just to capture some other comments here:
> 
> From the xm dmesg log I see a lot of the following:
> 
> (XEN) mm.c:649:d27 Error getting mfn 8846 (pfn 5555555555555555) from L1 entry
> 0000000008846063 for dom27
> (XEN) printk: 18834 messages suppressed.

Right, this is the crux of the issue.  The domain gave completely bogus PFNs to the hypervisor, and the hypervisor rejected the request.  The domain then hit BUG_ON(), since it did not expect the hypercall to fail.  What we don't know is why the domain handed down those bogus PFNs to begin with.  That's something only a reproducer (or a valid crash file) could tell us.  And like you said, this may have been fixed already, since there have been quite a few fixes in this area since 4.5.

> 
> There are also several other mfn errors with more sane pfns and many
> 'Non-privileged (27) attempt to map I/O space' errors.  

These you can ignore, for the most part.

Chris Lalancette

Comment 18 Paolo Bonzini 2010-06-23 15:52:08 UTC

This looks very much like a dup of bug 513537.  Unfortunately that one only includes xend.log and not "xm dmesg", but the bogus PFNs are exactly the symptom of that bug.

*** This bug has been marked as a duplicate of bug 513537 ***

Comment 20 Paolo Bonzini 2011-06-27 11:28:22 UTC

I marked this as a duplicate because, without a reliable reproducer, I was hoping that the fix for bug 513537 would fix this as well.  The alternative at the time was CLOSED/INSUFFICIENT_DATA.

Having looked at the patch that fixed the xm save bug, I see that I was indeed in error.  I'm reopening this.

Comment 21 Paolo Bonzini 2011-06-27 11:29:33 UTC

That said, as in comment 16, this was only reported on ancient versions.  Does the customer have a reproducer?  If not, this is going to be closed again in a few months.
