Bug 457879 - xen: 32 bit guest on 64 bit host oops in xen_set_pud()
Summary: xen: 32 bit guest on 64 bit host oops in xen_set_pud()
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel-xen
Version: 9
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Xen Maintainance List
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 467698
Reported: 2008-08-05 06:35 UTC by Jan ONDREJ
Modified: 2009-12-14 20:41 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-09-16 23:19:33 UTC
Type: ---
Embargoed:


Attachments
Kernel OOPS (30.83 KB, application/octet-stream)
2008-08-05 06:35 UTC, Jan ONDREJ
Before crash (16.00 KB, application/octet-stream)
2008-08-06 19:02 UTC, Jan ONDREJ
After crash (16.00 KB, application/octet-stream)
2008-08-06 19:03 UTC, Jan ONDREJ

Description Jan ONDREJ 2008-08-05 06:35:29 UTC
Created attachment 313430 [details]
Kernel OOPS

Description of problem:
The DomU kernel crashed after a restart of Apache. Multiple oopses were displayed on the virtual console.

After this hang, I am unable to start my machine. It always hangs after Apache starts.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.25.3-2.fc9.i686
httpd-2.2.8-3.i386

How reproducible:
Unknown, always for me today, worked 

Steps to Reproduce:

  
Actual results:
See attached oops.

Expected results:
No oops.

Additional info:

Comment 1 Mark McLoughlin 2008-08-06 17:37:26 UTC
Pasting the oops here for convenience:

kernel BUG at arch/x86/xen/multicalls.c:103!
invalid opcode: 0000 [#1] SMP
Modules linked in: nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_multipath dm_mod pcspkr xen_netfront xen_blkfront ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd

Pid: 1370, comm: httpd Not tainted (2.6.25.3-2.fc9.i686.xen #1)
EIP: 0061:[<c0404043>] EFLAGS: 00010002 CPU: 0
EIP is at xen_mc_flush+0x163/0x16f
EAX: 00000001 EBX: c1403054 ECX: 00000000 EDX: c1403054
ESI: c1403074 EDI: 00000000 EBP: dcc50d68 ESP: dcc50d50
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process httpd (pid: 1370, ti=dcc50000 task=d7c6ee90 task.ti=dcc50000)
Stack: c1403054 00000001 00000000 c1403854 c91b9008 c1403054 dcc50d84 c0404815
       13f26001 00000001 c91b9008 13f26001 c0c721c0 dcc50da4 c0471964 c0c721c0
       00000000 00000000 c91b9008 c0c721f8 00000001 dcc50e4c c047346e 0006e550
Call Trace:
 [<c0404815>] ? xen_set_pud+0xb6/0xcd
 [<c0471964>] ? __pmd_alloc+0x8b/0xb4
 [<c047346e>] ? handle_mm_fault+0xa3/0xa2a
 [<c047282b>] ? unmap_vmas+0x146/0x611
 [<c0637413>] ? do_page_fault+0x3ca/0x8d8
 [<c047535a>] ? free_pgtables+0x7e/0x94
 [<c04f7a7f>] ? prio_tree_insert+0x18c/0x1ff
 [<c046fac0>] ? vma_prio_tree_insert+0x1a/0x2e
 [<c04769a1>] ? vma_link+0xa1/0xbe
 [<c0477c55>] ? mmap_region+0x34d/0x40b
 [<c045c3d4>] ? audit_syscall_exit+0x2b1/0x2cc
 [<c040e224>] ? do_syscall_trace+0x69/0x16d
 [<c0637049>] ? do_page_fault+0x0/0x8d8
 [<c0635c0a>] ? error_code+0x72/0x78
 =======================
Code: e8 8b 84 fa 04 0a 00 00 ff 94 fa 00 0a 00 00 47 8b 5d e8 3b bb 08 0b 00 00 72 e3 c7 83 08 0b 00 00 00 00 00 00 83 7d ec 00 74 04 <0f> 0b eb fe 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 57 89 d7 56 89 c6


I've not seen this before and can't reproduce it with the default Apache config.

Jeremy, have you come across this before?

Jan, are there any messages on the console from the hypervisor when the oops occurs?
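
The "Code:" line at the end of the oops above marks the byte at the faulting EIP with angle brackets. A minimal Python sketch of pulling out those bytes follows; the helper names are hypothetical and not part of any kernel tooling. On x86, the bytes 0f 0b encode the ud2 instruction, which the kernel's BUG() macro uses, and that matches the "kernel BUG at arch/x86/xen/multicalls.c:103!" and "invalid opcode" lines.

```python
import re

# The "Code:" line from the oops above, copied verbatim.
CODE_LINE = ("Code: e8 8b 84 fa 04 0a 00 00 ff 94 fa 00 0a 00 00 47 8b 5d e8 "
             "3b bb 08 0b 00 00 72 e3 c7 83 08 0b 00 00 00 00 00 00 83 7d ec "
             "00 74 04 <0f> 0b eb fe 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 57 89 "
             "d7 56 89 c6")


def faulting_bytes(code_line, count=2):
    """Return `count` bytes starting at the <..>-marked byte, or None."""
    tokens = code_line.split()[1:]  # drop the leading "Code:" label
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"<[0-9a-f]{2}>", tok):
            raw = [t.strip("<>") for t in tokens[i:i + count]]
            return bytes(int(b, 16) for b in raw)
    return None  # no marked byte found


def is_ud2(code_line):
    """True if the faulting instruction starts with 0f 0b (x86 ud2)."""
    return faulting_bytes(code_line) == b"\x0f\x0b"
```

This only confirms that the crash is a deliberate BUG() trap in xen_mc_flush; it says nothing about why the hypercall failed, which is what the hypervisor log is needed for.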

Comment 2 Jan ONDREJ 2008-08-06 18:04:59 UTC
(In reply to comment #1)

> Jan, are there any messages on the console from the hypervisor when the oops
> occurs?

My hypervisor is still running, but I can't see anything interesting in the current dmesg, only normal network initialization.

Today I can't reproduce this. It's curious that my domU ran for approx. 2 days without problems, and then crashed after an Apache config update and a restart of the service.

After that I was unable to boot it until I ran "chkconfig httpd off". Then I was able to start Apache normally by typing "service httpd start".

Today it works with normal startup (chkconfig httpd on), but again with the modified config.

Now I tried reverting my config to the backup. It hangs again. These lines have been added:
<Directory /usr/share/nagios/html/pnp4nagios>
  Allow from .XXXXXX.sk .XXXXX.XXXXX.sk 158.XXX.XXX.
</Directory>
I think it has nothing to do with these particular lines, but with something else in memory.

My machine is not critical, so I can do more tests if required.
It's just a monitoring server, which needs to run most of the time.

Comment 3 Jeremy Fitzhardinge 2008-08-06 18:24:57 UTC
What version of Xen is it, and is it a 32 or 64-bit hypervisor?

There's an old Xen bug which prevents a 32-bit guest running on a 64-bit hypervisor from changing its own top-level pagetable entries, causing set_pud to fail.  It was fixed some time around Feb-March, I think.

Unfortunately the stack trace is a bit unclear here, so I'm not sure what's really going on in this case.  Aside from the Xen bug, I haven't seen anything like this before.

Comment 4 Jeremy Fitzhardinge 2008-08-06 18:29:02 UTC
BTW, if/when it crashes again, look at "xm dmesg" to see Xen's console log.  There should be something there to indicate why it decided to fail the hypercall.

Comment 5 Jan ONDREJ 2008-08-06 19:02:06 UTC
(In reply to comment #4)
> BTW, if/when it crashes again, look at "xm dmesg" to see Xen's console log. 
> There should be something there to indicate why it decided to fail the
> hypercall.

Attaching my "xm dmesg" output. I can't tell exactly what is new.
d1 is from before the last crash, d2 from after it.

It is a 32-bit guest on a 64-bit hypervisor. The problem you mentioned also happened to me some months ago.

One more piece of information: if I run "chkconfig httpd off", boot the system normally, then run "chkconfig httpd on" and "reboot", the server works. I mean that it hangs only on the first boot; after a reboot it works.
There must be something special in memory when it fails.
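
One way to see what is new between two "xm dmesg" captures is to list the lines of the later capture that do not occur in the earlier one. The Python sketch below does that for two saved text dumps; the function name and the sample lines are hypothetical and do not come from the attachments.

```python
def new_lines(before_text, after_text):
    """Return lines that appear in `after_text` but not in `before_text`.

    Order follows the later capture. A message that repeats verbatim and was
    already present in the earlier capture is not reported.
    """
    seen = set(before_text.splitlines())
    return [line for line in after_text.splitlines() if line not in seen]


# Illustrative placeholder captures, not the real attachment contents.
BEFORE = "(XEN) line A\n(XEN) line B\n"
AFTER = "(XEN) line A\n(XEN) line B\n(XEN) line C\n"
```

Here new_lines(BEFORE, AFTER) yields only the third line. A set lookup is used rather than a positional diff because the hypervisor log is a fixed-size buffer, so older lines may have dropped off the front of the later capture.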

Comment 6 Jan ONDREJ 2008-08-06 19:02:48 UTC
Created attachment 313627 [details]
Before crash

Comment 7 Jan ONDREJ 2008-08-06 19:03:13 UTC
Created attachment 313628 [details]
After crash

Comment 8 Jeremy Fitzhardinge 2008-08-06 19:30:53 UTC
(Please set the type on dumps to text, or paste them inline)

(XEN) mm.c:694:d28 Bad L3 flags 6

OK, that's the signature of the Xen bug I mentioned.  The fix is to update xen.

The bug depends on where things get mapped in the process address space.  It may be that address randomization is causing the non-deterministic results for you.
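
For anyone checking their own hypervisor log for the same signature, a rough Python sketch follows. It assumes guest-triggered Xen messages look like the single line quoted above, "(XEN) file:line:dN message", with N read as the domain ID; the regular expression and function name are illustrative assumptions, not a Xen interface.

```python
import re

# Assumed layout, inferred from the one example line quoted above:
#   (XEN) <file>:<line>:d<domid> <message>
XEN_GUEST_LINE = re.compile(
    r"^\(XEN\)\s+(?P<file>[\w.]+):(?P<line>\d+):d(?P<domid>\d+)\s+(?P<msg>.*)$"
)


def find_bad_l3_flags(log_text):
    """Return (domid, message) pairs for every 'Bad L3 flags' log line."""
    hits = []
    for raw in log_text.splitlines():
        m = XEN_GUEST_LINE.match(raw.strip())
        if m and m.group("msg").startswith("Bad L3 flags"):
            hits.append((int(m.group("domid")), m.group("msg")))
    return hits


# The first line is the one quoted above; the second is a made-up filler line.
SAMPLE = "(XEN) mm.c:694:d28 Bad L3 flags 6\n(XEN) some unrelated message\n"
```

Feeding it the saved text of "xm dmesg" should list which domains hit the failure, if the assumed line layout holds for the Xen version in use.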

Comment 9 Jan ONDREJ 2008-08-06 19:42:42 UTC
My xen is already updated. My system has uptime 20 days and is updated daily.

[root@vs2 ~]# rpm -q xen kernel-xen
xen-3.1.2-2.fc8
kernel-xen-2.6.21.7-2.fc8
kernel-xen-2.6.21.7-3.fc8
[root@vs2 ~]# uname -a
Linux vs2.XXXX.sk 2.6.21.7-3.fc8xen #1 SMP Thu Mar 20 14:58:12 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@vs2 ~]# cat /var/log/yum.log | grep xen
Feb 14 05:15:13 Installed: kernel-xen - 2.6.21-2957.fc8.x86_64
Feb 20 06:47:18 Installed: kernel-xen - 2.6.21.7-2.fc8.x86_64
Feb 29 05:57:17 Updated: xen-libs - 3.1.2-2.fc8.x86_64
Feb 29 05:57:23 Updated: xen - 3.1.2-2.fc8.x86_64
Mar 27 05:51:01 Installed: kernel-xen - 2.6.21.7-3.fc8.x86_64
[root@vs2 ~]# 

Do you think I need another reboot?

Comment 10 Jeremy Fitzhardinge 2008-08-06 20:26:34 UTC
The bug fix was committed to xen-unstable in:
changeset:   17061:9d29141a5e52
user:        Keir Fraser <keir.fraser>
date:        Mon Feb 18 13:50:25 2008 +0000
files:       xen/arch/x86/mm.c

So I think the F8 Xen package is out of date and needs updating.  I don't know whether RH are likely to do that.

A workaround might be to run a 64-bit kernel in your guest.  You'd just need to update the kernel; all the 32-bit usermode code should run fine in compat mode.

Comment 11 Mark McLoughlin 2008-08-07 15:51:35 UTC
Thanks for the pointer Jeremy

I've kicked off a build of kernel-xen-2.6-2.6.21.7-4.fc8 with xen-3.1.4, which contains the fix
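
Comment 9 shows xen-3.1.2 installed, and the build above carries xen-3.1.4 with the fix. A small Python sketch of that comparison follows, taking 3.1.4 as the threshold purely on the basis of the statement above; whether 3.1.3 also has the fix is not stated in this report. The function names are hypothetical, the tuple comparison is a simplification rather than full RPM version ordering, and the installed xen package version may not match the running hypervisor.

```python
import re

# Threshold taken from the statement above that xen-3.1.4 contains the fix.
FIXED_XEN = (3, 1, 4)


def xen_version(package):
    """Parse the x.y.z version out of a name like 'xen-3.1.2-2.fc8'."""
    m = re.match(r"xen-(\d+)\.(\d+)\.(\d+)", package)
    if m is None:
        raise ValueError(f"not a xen package name: {package!r}")
    return tuple(int(part) for part in m.groups())


def needs_update(package):
    """True if the package version is older than the assumed fixed version."""
    return xen_version(package) < FIXED_XEN
```

With the package name from comment 9, needs_update("xen-3.1.2-2.fc8") is true.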

Comment 12 Jan ONDREJ 2008-08-08 06:44:21 UTC
After a reboot the order of my guests changed, and now my previously bad machine does not hang (even with the current stable kernel). If you want, I can test this new kernel, but I am unable to reproduce the previous bug.

This new kernel works on a second Xen server. There was a problem with "Error: (9, 'Bad file descriptor')" after the first reboot, but I think this sometimes happened with the older kernel too. Maybe I caused it by starting one of my guests multiple times. After the second reboot the server works well.

Comment 13 Fedora Update System 2008-08-08 06:53:39 UTC
kernel-xen-2.6-2.6.21.7-5.fc8 has been submitted as an update for Fedora 8

Comment 14 Mark McLoughlin 2008-08-08 07:00:53 UTC
Jan: I've pushed to updates-testing; please test and bump the karma here in order to get it pushed to stable updates:

  https://admin.fedoraproject.org/updates/F8/pending/kernel-xen-2.6-2.6.21.7-5.fc8

Orion: if you've still got 32-on-64 guests, maybe you could give it a shot too?

Comment 15 Fedora Update System 2008-08-12 18:27:34 UTC
kernel-xen-2.6-2.6.21.7-5.fc8 has been pushed to the Fedora 8 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel-xen-2.6'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-7240

Comment 16 Jan ONDREJ 2008-08-13 08:12:55 UTC
This update works for me on 2 machines. Although I was unable to reproduce the previous problem even with the older kernel, I can confirm at least that this update did not add any bugs for me. :)

Bodhi is down, so I can't add a +1 karma point.

Comment 17 Fedora Update System 2008-09-16 23:19:29 UTC
kernel-xen-2.6-2.6.21.7-5.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.

