Red Hat Bugzilla – Bug 457879
xen: 32 bit guest on 64 bit host oops in xen_set_pud()
Last modified: 2009-12-14 15:41:00 EST
Created attachment 313430
Description of problem:
The DomU kernel crashed after a restart of Apache. Multiple oopses were displayed on the virtual console.
After this hang, I am unable to start my machine. It always hangs after Apache starts.
Version-Release number of selected component (if applicable):
Unknown, always for me today, worked
Steps to Reproduce:
See attached oops.
Pasting the oops here for convenience:
kernel BUG at arch/x86/xen/multicalls.c:103!
invalid opcode: 0000 [#1] SMP
Modules linked in: nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_multipath dm_mod pcspkr xen_netfront xen_blkfront ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 1370, comm: httpd Not tainted (18.104.22.168-2.fc9.i686.xen #1)
EIP: 0061:[<c0404043>] EFLAGS: 00010002 CPU: 0
EIP is at xen_mc_flush+0x163/0x16f
EAX: 00000001 EBX: c1403054 ECX: 00000000 EDX: c1403054
ESI: c1403074 EDI: 00000000 EBP: dcc50d68 ESP: dcc50d50
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process httpd (pid: 1370, ti=dcc50000 task=d7c6ee90 task.ti=dcc50000)
Stack: c1403054 00000001 00000000 c1403854 c91b9008 c1403054 dcc50d84 c0404815
13f26001 00000001 c91b9008 13f26001 c0c721c0 dcc50da4 c0471964 c0c721c0
00000000 00000000 c91b9008 c0c721f8 00000001 dcc50e4c c047346e 0006e550
[<c0404815>] ? xen_set_pud+0xb6/0xcd
[<c0471964>] ? __pmd_alloc+0x8b/0xb4
[<c047346e>] ? handle_mm_fault+0xa3/0xa2a
[<c047282b>] ? unmap_vmas+0x146/0x611
[<c0637413>] ? do_page_fault+0x3ca/0x8d8
[<c047535a>] ? free_pgtables+0x7e/0x94
[<c04f7a7f>] ? prio_tree_insert+0x18c/0x1ff
[<c046fac0>] ? vma_prio_tree_insert+0x1a/0x2e
[<c04769a1>] ? vma_link+0xa1/0xbe
[<c0477c55>] ? mmap_region+0x34d/0x40b
[<c045c3d4>] ? audit_syscall_exit+0x2b1/0x2cc
[<c040e224>] ? do_syscall_trace+0x69/0x16d
[<c0637049>] ? do_page_fault+0x0/0x8d8
[<c0635c0a>] ? error_code+0x72/0x78
Code: e8 8b 84 fa 04 0a 00 00 ff 94 fa 00 0a 00 00 47 8b 5d e8 3b bb 08 0b 00 00 72 e3 c7 83 08 0b 00 00 00 00 00 00 83 7d ec 00 74 04 <0f> 0b eb fe 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 57 89 d7 56 89 c6
I've not seen this before and can't reproduce it with the default Apache config.
Jeremy, have you come across this before?
Jan, are there any messages on the console from the hypervisor when the oops occurs?
(In reply to comment #1)
> Jan, are there any messages on the console from the hypervisor when the oops
> occurs?
My hypervisor is still running, but I can't see anything interesting in the current dmesg. Only normal network initialization.
Today I can't reproduce this. Curiously, my domU had been running for approx. 2 days without problems; then, after an Apache config update and a restart of the service, it crashed.
After that it would not boot until I ran "chkconfig httpd off". With httpd disabled at boot, I was able to start Apache normally by typing "service httpd start".
Today it boots with the normal startup (chkconfig httpd on), again with the modified config.
Now I tried reverting my config to the backup, and it hangs again. These are the lines that had been added:
Allow from .XXXXXX.sk .XXXXX.XXXXX.sk 158.XXX.XXX.
I think it has nothing to do with these particular lines, but with something else in memory.
My machine is not critical, so I can do more tests if required.
It's just a monitoring server, which needs to run most of the time.
What version of Xen is it, and is it a 32 or 64-bit hypervisor?
There's an old Xen bug which prevents a 32-bit guest running on a 64-bit hypervisor from changing its own top-level pagetable entries, causing set_pud to fail. It was fixed some time around Feb-March, I think.
Unfortunately the stack trace is a bit unclear here, so I'm not sure what's really going on in this case. Aside from the Xen bug, I haven't seen anything like this before.
BTW, if/when it crashes again, look at "xm dmesg" to see Xen's console log. There should be something there to indicate why it decided to fail the hypercall.
(In reply to comment #4)
> BTW, if/when it crashes again, look at "xm dmesg" to see Xen's console log.
> There should be something there to indicate why it decided to fail the
> hypercall.
Attaching my "xm dmesg" output. I can't tell exactly what is new.
d1 is from before the last crash, d2 from after it.
It is a 32-bit guest on a 64-bit hypervisor. The problem you mentioned also hit me some months ago.
One more piece of information: if I run "chkconfig httpd off", boot the system normally, then run "chkconfig httpd on" and "reboot", the server works. What I mean is that it hangs only on the first boot; after a reboot it works.
There must be something special in memory when it fails.
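For anyone comparing two "xm dmesg" captures like d1 and d2, here is a minimal Python sketch that lists the lines present in the later capture but not in the earlier one. The function name and the placeholder log lines are hypothetical; only the "Bad L3 flags" line is taken from this report. A set-based comparison is used instead of a positional diff on the assumption that the console buffer may have scrolled between captures.

```python
from typing import List


def new_lines(before: str, after: str) -> List[str]:
    """Return lines that occur in `after` but not in `before`, in order."""
    seen = set(before.splitlines())
    return [line for line in after.splitlines() if line not in seen]


# Placeholder captures; only the last line of d2 comes from this bug report.
d1 = "(XEN) placeholder line A\n(XEN) placeholder line B\n"
d2 = (
    "(XEN) placeholder line A\n"
    "(XEN) placeholder line B\n"
    "(XEN) mm.c:694:d28 Bad L3 flags 6\n"
)

for line in new_lines(d1, d2):
    print(line)
```

With real captures, the two strings would be read from the saved d1 and d2 files.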
Created attachment 313627
Created attachment 313628
(Please set the type on dumps to text, or paste them inline)
(XEN) mm.c:694:d28 Bad L3 flags 6
OK, that's the signature of the Xen bug I mentioned. The fix is to update Xen.
The bug depends on where things get mapped in the process address space. It may be that address randomization is causing the non-deterministic results for you.
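To spot this signature in a longer log, a small Python sketch like the following could pull out the relevant fields. It rests on assumptions not confirmed in this thread: that the "d28" prefix is the guest's domain ID, and that the flags value is printed in hexadecimal. The bit names are the standard low x86 page-table entry bits; the function and dictionary names are hypothetical.

```python
import re

# Standard low x86 page-table entry flag bits (architectural, not Xen-specific).
X86_PTE_FLAGS = {
    0x001: "PRESENT",
    0x002: "RW",
    0x004: "USER",
    0x008: "PWT",
    0x010: "PCD",
    0x020: "ACCESSED",
    0x040: "DIRTY",
}

# Matches lines shaped like: (XEN) mm.c:694:d28 Bad L3 flags 6
SIGNATURE = re.compile(
    r"\(XEN\) mm\.c:(?P<line>\d+):d(?P<domid>\d+) "
    r"Bad L3 flags (?P<flags>[0-9a-fA-F]+)"
)


def parse_bad_l3(log_line):
    """Return a dict of parsed fields, or None if the line does not match."""
    m = SIGNATURE.search(log_line)
    if m is None:
        return None
    flags = int(m.group("flags"), 16)  # assumption: value is hexadecimal
    names = [name for bit, name in X86_PTE_FLAGS.items() if flags & bit]
    return {
        "source_line": int(m.group("line")),
        "domid": int(m.group("domid")),  # assumption: dN is the domain ID
        "flags": flags,
        "flag_names": names,
    }


print(parse_bad_l3("(XEN) mm.c:694:d28 Bad L3 flags 6"))
```

Under those assumptions, a value of 6 decodes to the RW and USER bits being reported for the L3 entry.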
My xen is already updated. My system has uptime 20 days and is updated daily.
[root@vs2 ~]# rpm -q xen kernel-xen
[root@vs2 ~]# uname -a
Linux vs2.XXXX.sk 22.214.171.124-3.fc8xen #1 SMP Thu Mar 20 14:58:12 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@vs2 ~]# cat /var/log/yum.log | grep xen
Feb 14 05:15:13 Installed: kernel-xen - 2.6.21-2957.fc8.x86_64
Feb 20 06:47:18 Installed: kernel-xen - 126.96.36.199-2.fc8.x86_64
Feb 29 05:57:17 Updated: xen-libs - 3.1.2-2.fc8.x86_64
Feb 29 05:57:23 Updated: xen - 3.1.2-2.fc8.x86_64
Mar 27 05:51:01 Installed: kernel-xen - 188.8.131.52-3.fc8.x86_64
Do you think I need another reboot?
The bug fix was committed to xen-unstable in:
user: Keir Fraser <email@example.com>
date: Mon Feb 18 13:50:25 2008 +0000
So I think the F8 Xen package is out of date and needs updating. I don't know whether RH are likely to do that.
A workaround might be to run a 64-bit kernel in your guest. You'd just need to update the kernel; all the 32-bit usermode code should run fine in compat mode.
Thanks for the pointer, Jeremy.
I've kicked off a build of kernel-xen-2.6-184.108.40.206-4.fc8 with xen-3.1.4, which contains the fix.
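As a rough way to check a host against this, the Python sketch below compares the upstream part of an installed xen package version with 3.1.4. The threshold is taken only from the statement above that the 3.1.4 build contains the fix; whether any earlier package carries a backport is not known from this thread, so a False result means "not known to be fixed". Function names are hypothetical.

```python
import re
from typing import Tuple

# Assumption from this thread: a build based on xen-3.1.4 contains the fix.
FIXED_IN = (3, 1, 4)


def upstream_version(package_version: str) -> Tuple[int, ...]:
    """Extract the leading dotted version, e.g. '3.1.2-2.fc8' -> (3, 1, 2)."""
    m = re.match(r"\d+(?:\.\d+)*", package_version)
    if m is None:
        raise ValueError("no leading version in %r" % package_version)
    return tuple(int(part) for part in m.group(0).split("."))


def has_fix(package_version: str) -> bool:
    """True if the upstream version is at least FIXED_IN."""
    return upstream_version(package_version) >= FIXED_IN


# Version string as it appears in the yum.log excerpt in this report.
print(has_fix("3.1.2-2.fc8.x86_64"))
```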
After a reboot the order of my guests changed, and now my previously bad machine does not hang (even with the current stable kernel). If you want, I can test this new kernel, but I am unable to reproduce the earlier bug.
This new kernel works on a second Xen server. There was a problem with "Error: (9, 'Bad file descriptor')" after the first reboot, but I think this sometimes happened with the older kernel too. Maybe I caused it by starting one of my guests multiple times. After the second reboot the server works well.
kernel-xen-2.6-220.127.116.11-5.fc8 has been submitted as an update for Fedora 8
Jan: I've pushed to updates-testing; please test and bump the karma here in order to get it pushed to stable updates:
Orion: if you've still got 32-on-64 guests, maybe you could give it a shot too?
kernel-xen-2.6-18.104.22.168-5.fc8 has been pushed to the Fedora 8 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update kernel-xen-2.6'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-7240
This update works for me on 2 machines. Although I was unable to reproduce the previous problem with the older kernel either, I can at least confirm that this update did not add any bugs for me. :)
Bodhi is down, so I can't add a +1 karma point.
kernel-xen-2.6-22.214.171.124-5.fc8 has been pushed to the Fedora 8 stable repository. If problems still persist, please make note of it in this bug report.