Red Hat Bugzilla – Bug 467698
xen: 32 bit guest on 64 bit host oops in xen_set_pud()
Last modified: 2011-07-18 11:54:27 EDT
+++ This bug was initially created as a clone of Bug #457879 +++
Created an attachment (id=313430)
Description of problem:
The DomU kernel crashed after a restart of apache. Multiple oopses were displayed on the virtual console.
After this hang, I am unable to start my machine. It always hangs after apache starts.
Version-Release number of selected component (if applicable):
Unknown, always for me today, worked
Steps to Reproduce:
See attached oops.
--- Additional comment from email@example.com on 2008-08-06 13:37:26 EDT ---
Pasting the oops here for convenience:
kernel BUG at arch/x86/xen/multicalls.c:103!
invalid opcode: 0000 [#1] SMP
Modules linked in: nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_multipath dm_mod pcspkr xen_netfront xen_blkfront ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 1370, comm: httpd Not tainted (184.108.40.206-2.fc9.i686.xen #1)
EIP: 0061:[<c0404043>] EFLAGS: 00010002 CPU: 0
EIP is at xen_mc_flush+0x163/0x16f
EAX: 00000001 EBX: c1403054 ECX: 00000000 EDX: c1403054
ESI: c1403074 EDI: 00000000 EBP: dcc50d68 ESP: dcc50d50
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process httpd (pid: 1370, ti=dcc50000 task=d7c6ee90 task.ti=dcc50000)
Stack: c1403054 00000001 00000000 c1403854 c91b9008 c1403054 dcc50d84 c0404815
13f26001 00000001 c91b9008 13f26001 c0c721c0 dcc50da4 c0471964 c0c721c0
00000000 00000000 c91b9008 c0c721f8 00000001 dcc50e4c c047346e 0006e550
[<c0404815>] ? xen_set_pud+0xb6/0xcd
[<c0471964>] ? __pmd_alloc+0x8b/0xb4
[<c047346e>] ? handle_mm_fault+0xa3/0xa2a
[<c047282b>] ? unmap_vmas+0x146/0x611
[<c0637413>] ? do_page_fault+0x3ca/0x8d8
[<c047535a>] ? free_pgtables+0x7e/0x94
[<c04f7a7f>] ? prio_tree_insert+0x18c/0x1ff
[<c046fac0>] ? vma_prio_tree_insert+0x1a/0x2e
[<c04769a1>] ? vma_link+0xa1/0xbe
[<c0477c55>] ? mmap_region+0x34d/0x40b
[<c045c3d4>] ? audit_syscall_exit+0x2b1/0x2cc
[<c040e224>] ? do_syscall_trace+0x69/0x16d
[<c0637049>] ? do_page_fault+0x0/0x8d8
[<c0635c0a>] ? error_code+0x72/0x78
Code: e8 8b 84 fa 04 0a 00 00 ff 94 fa 00 0a 00 00 47 8b 5d e8 3b bb 08 0b 00 00 72 e3 c7 83 08 0b 00 00 00 00 00 00 83 7d ec 00 74 04 <0f> 0b eb fe 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 57 89 d7 56 89 c6
I've not seen this before and can't reproduce with the default apache config
Jeremy, have you come across this before?
Jan, are there any messages on the console from the hypervisor when the oops occurs?
--- Additional comment from firstname.lastname@example.org on 2008-08-06 14:04:59 EDT ---
(In reply to comment #1)
> Jan, are there any messages on the console from the hypervisor when the oops
My hypervisor is still running, but I can't see any interesting things in current dmesg. Only normal network initialization.
Today I can't reproduce this. It's curious that my domU had been running for approximately 2 days without problems, and then crashed after an apache config update and a restart of the service.
After that I was unable to boot it until I ran "chkconfig httpd off". With that set, I was able to start apache normally by typing "service httpd start".
Today it works with normal startup (chkconfig httpd on), but with modified config again.
Now I tried to revert my config back to the backup. It hangs again. These lines had been added:
Allow from .XXXXXX.sk .XXXXX.XXXXX.sk 158.XXX.XXX.
I think it has nothing to do with these particular lines, but rather with something else in memory.
My machine is not critical, so I can do more tests if required.
It's just a monitoring server, which needs to run most of the time.
--- Additional comment from email@example.com on 2008-08-06 14:24:57 EDT ---
What version of Xen is it, and is it a 32 or 64-bit hypervisor?
There's an old Xen bug which prevents a 32-bit guest running on a 64-bit hypervisor from changing its own top-level pagetable entries, causing set_pud to fail. It was fixed some time around Feb-March, I think.
Unfortunately the stack trace is a bit unclear here, so I'm not sure what's really going on in this case. Aside from the Xen bug, I haven't seen anything like this before.
--- Additional comment from firstname.lastname@example.org on 2008-08-06 14:29:02 EDT ---
BTW, if/when it crashes again, look at "xm dmesg" to see Xen's console log. There should be something there to indicate why it decided to fail the hypercall.
--- Additional comment from email@example.com on 2008-08-06 15:02:06 EDT ---
(In reply to comment #4)
> BTW, if/when it crashes again, look at "xm dmesg" to see Xen's console log.
> There should be something there to indicate why it decided to fail the
Attaching my "xm dmesg" output. I can't tell exactly what is new.
d1 is before last crash, d2 after last crash.
It is a 32-bit guest on a 64-bit hypervisor. The problem you mentioned also happened to me some months ago.
One more piece of information: if I run "chkconfig httpd off", boot the system normally, then run "chkconfig httpd on" and "reboot", the server works. In other words, it hangs only on the first boot; after a reboot it works.
There must be something special in memory, when it fails.
--- Additional comment from firstname.lastname@example.org on 2008-08-06 15:02:48 EDT ---
Created an attachment (id=313627)
--- Additional comment from email@example.com on 2008-08-06 15:03:13 EDT ---
Created an attachment (id=313628)
--- Additional comment from firstname.lastname@example.org on 2008-08-06 15:30:53 EDT ---
(Please set the type on dumps to text, or paste them inline)
(XEN) mm.c:694:d28 Bad L3 flags 6
OK, that's the signature of the Xen bug I mentioned. The fix is to update xen.
The bug depends on where things get mapped in the process address space. It may be that address randomization is causing the non-deterministic results for you.
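For anyone checking a saved "xm dmesg" capture for this signature, here is a minimal sketch in Python. It only matches the message format quoted above; the function name find_bad_l3 is made up for illustration, and reading the "d28" field as a domain ID and the trailing value as hexadecimal are assumptions, not something confirmed in this report.

```python
import re

# Matches the hypervisor message quoted in this bug, for example:
#   (XEN) mm.c:694:d28 Bad L3 flags 6
# Assumptions: "d<N>" is the domain ID and the trailing value is hexadecimal.
BAD_L3_RE = re.compile(
    r"\(XEN\)\s+mm\.c:\d+:d(\d+)\s+Bad L3 flags\s+([0-9a-fA-F]+)"
)


def find_bad_l3(log_text):
    """Return a list of (domain_id, flags) for each 'Bad L3 flags' line."""
    hits = []
    for line in log_text.splitlines():
        match = BAD_L3_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2), 16)))
    return hits


sample = "(XEN) mm.c:694:d28 Bad L3 flags 6\n(XEN) some unrelated line\n"
print(find_bad_l3(sample))
```

To use it on a real capture, save the output of "xm dmesg" to a file and pass the file contents to find_bad_l3; an empty list means the signature was not found in that capture.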
--- Additional comment from email@example.com on 2008-08-06 15:42:42 EDT ---
My xen is already updated. My system has uptime 20 days and is updated daily.
[root@vs2 ~]# rpm -q xen kernel-xen
[root@vs2 ~]# uname -a
Linux vs2.XXXX.sk 220.127.116.11-3.fc8xen #1 SMP Thu Mar 20 14:58:12 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@vs2 ~]# cat /var/log/yum.log | grep xen
Feb 14 05:15:13 Installed: kernel-xen - 2.6.21-2957.fc8.x86_64
Feb 20 06:47:18 Installed: kernel-xen - 18.104.22.168-2.fc8.x86_64
Feb 29 05:57:17 Updated: xen-libs - 3.1.2-2.fc8.x86_64
Feb 29 05:57:23 Updated: xen - 3.1.2-2.fc8.x86_64
Mar 27 05:51:01 Installed: kernel-xen - 22.214.171.124-3.fc8.x86_64
Do you think I need another reboot?
--- Additional comment from firstname.lastname@example.org on 2008-08-06 16:26:34 EDT ---
The bug fix was committed to xen-unstable in:
user: Keir Fraser <email@example.com>
date: Mon Feb 18 13:50:25 2008 +0000
So I think the F8 Xen package is out of date and needs updating. I don't know whether RH are likely to do that.
A workaround might be to run a 64-bit kernel in your guest. You'd just need to update the kernel; all the 32-bit usermode code should run fine in compat mode.
--- Additional comment from firstname.lastname@example.org on 2008-08-07 11:51:35 EDT ---
Thanks for the pointer Jeremy
I've kicked off a build of kernel-xen-2.6-126.96.36.199-4.fc8 with xen-3.1.4, which contains the fix
--- Additional comment from email@example.com on 2008-08-08 02:44:21 EDT ---
After a reboot the order of my guests changed, and now my previously bad machine does not hang (even with the current stable kernel). If you want, I can test this new kernel, but I am unable to reproduce the previous bug.
This new kernel works on a second xen server. There was a problem with "Error: (9, 'Bad file descriptor')" after the first reboot, but I think this sometimes happened with the older kernel as well. Maybe I caused it myself by starting one of my guests multiple times. After the second reboot the server works well.
--- Additional comment from firstname.lastname@example.org on 2008-08-08 02:53:39 EDT ---
kernel-xen-2.6-188.8.131.52-5.fc8 has been submitted as an update for Fedora 8
--- Additional comment from email@example.com on 2008-08-08 03:00:53 EDT ---
Jan: I've pushed to updates-testing; please test and bump the karma here in order to get it pushed to stable updates:
Orion: if you've still got 32-on-64 guests, maybe you could give it a shot too?
--- Additional comment from firstname.lastname@example.org on 2008-08-12 14:27:34 EDT ---
kernel-xen-2.6-184.108.40.206-5.fc8 has been pushed to the Fedora 8 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update kernel-xen-2.6'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-7240
--- Additional comment from email@example.com on 2008-08-13 04:12:55 EDT ---
This update works for me on 2 machines. Although I was unable to reproduce the previous problem even with the older kernel, I can at least confirm that this update did not add any bugs for me. :)
Bodhi is down, so I can't add a +1 karma point.
--- Additional comment from firstname.lastname@example.org on 2008-09-16 19:19:29 EDT ---
kernel-xen-2.6-220.127.116.11-5.fc8 has been pushed to the Fedora 8 stable repository. If problems still persist, please make note of it in this bug report.
Additional notes: this is a problem in the RHEL-5 hypervisor as well, when trying to install a F10 i386 PV guest on an x86_64 RHEL-5 HV. As Jeremy pointed out, the upstream xen-unstable c/s was 17061, and the upstream xen-3.1-testing.hg c/s was 15653. I'll attach a backport to the BZ, which seems to fix the problem for me.
Created attachment 320860 [details]
Backport of upstream xen-3.1-testing c/s 15653, to fix F-10 32-on-64 crash
Is there a public version of a working xen for 5.2? The ones here: http://fedorapeople.org/~crobinso/rhel5/install_f10/ don't work for me.
Hm, I'm not sure if you posted in the right bug, but those packages you mentioned are the preview packages for 5.3. So if they don't work, please let us know why.
Here are the messages. I post here because of xen_set_pud. Happy to open new bug if needed.
Checking if this processor honours the WP bit even in supervisor mode...Ok.
1 multicall(s) failed: cpu 0
Pid: 0, comm: swapper Not tainted 18.104.22.168-58.fc10.i686.PAE #1
[<c06d1213>] ? printk+0xf/0x14
call 1/1: op=1 arg=[c2b96854] result=-22
------------[ cut here ]------------
kernel BUG at arch/x86/xen/multicalls.c:104!
invalid opcode: 0000 [#1] SMP
Modules linked in:
Pid: 0, comm: swapper Not tainted (22.214.171.124-58.fc10.i686.PAE #1)
EIP: e019:[<c0404a97>] EFLAGS: 00010002 CPU: 0
EIP is at xen_mc_flush+0x17b/0x187
EAX: c2b96054 EBX: 00000000 ECX: ffffffff EDX: c2b96054
ESI: 00000001 EDI: 00000001 EBP: c0846ef4 ESP: c0846ee0
DS: e021 ES: e021 FS: 00d8 GS: 0000 SS: e021
Process swapper (pid: 0, ti=c0846000 task=c0808344 task.ti=c0846000)
Stack: c2b96054 00000000 00000001 7373d001 00000000 c0846f00 c0405332 c0833000
c0846f24 c04058d2 737b4000 00000000 7373d001 00000000 c0833000 7373d001
00000000 c0846f38 c040590e c0833000 c0834000 00000000 c0846f50 c041f9a0
[<c0405332>] ? xen_mc_issue+0x14/0x48
[<c04058d2>] ? xen_set_pud_hyper+0x39/0x41
[<c040590e>] ? xen_set_pud+0x34/0x39
[<c041f9a0>] ? zap_low_mappings+0x2f/0x47
[<c08609a6>] ? mem_init+0x2c7/0x2cf
[<c084b7e9>] ? start_kernel+0x246/0x2f0
[<c084b091>] ? i386_start_kernel+0x80/0x88
[<c08511e2>] ? xen_start_kernel+0x7dd/0x7e5
Code: 8b 55 ec 8b 84 da 04 0a 00 00 ff 94 da 00 0a 00 00 43 8b 45 ec 3b 98 08 0b 00 00 72 e3 85 ff c7 80 08 0b 00 00 00 00 00 00 74 04 <0f> 0b eb fe 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 57 89 d7 56 89 c6
EIP: [<c0404a97>] xen_mc_flush+0x17b/0x187 SS:ESP e021:c0846ee0
---[ end trace 4eaa2a86a8e2da22 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
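The "call 1/1: op=1 arg=[c2b96854] result=-22" line in the log above carries the hypercall's return value. Below is a small Python sketch for decoding such a line. The helper name decode_multicall is hypothetical, and it assumes the negative result follows the usual Linux errno numbering, which this report does not state explicitly.

```python
import errno
import re

# Matches the guest kernel's multicall failure line, for example:
#   call 1/1: op=1 arg=[c2b96854] result=-22
MULTICALL_RE = re.compile(
    r"call (\d+)/(\d+): op=(\d+) arg=\[([0-9a-fA-F]+)\] result=(-?\d+)"
)


def decode_multicall(line):
    """Parse one multicall failure line; return None if it does not match."""
    match = MULTICALL_RE.search(line)
    if match is None:
        return None
    result = int(match.group(5))
    # Assumption: a negative result is a negated Linux-style errno value.
    if result < 0:
        name = errno.errorcode.get(-result, "unknown")
    else:
        name = "ok"
    return {
        "index": int(match.group(1)),
        "total": int(match.group(2)),
        "op": int(match.group(3)),
        "arg": int(match.group(4), 16),
        "result": result,
        "errno_name": name,
    }


line = "call 1/1: op=1 arg=[c2b96854] result=-22"
print(decode_multicall(line)["errno_name"])
```

Under that assumption, -22 decodes to EINVAL, which would fit the hypervisor rejecting the page table update as invalid.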
Oh, I see. Well, there are two problems:
1. Those packages are only the userspace portion, while this ends up being a hypervisor bug. The hypervisor is packaged into the kernel, so you would need updated kernel-xen packages.
2. Regardless, this patch isn't in the latest kernel-xen packages. It still needs to go through internal review and testing first.
Thanks for the testing, though.
*** Bug 471276 has been marked as a duplicate of this bug. ***
I've uploaded a test kernel that contains this fix (along with several others)
to this location:
Could the original reporter try out the test kernels there, and report back if
it fixes the problem?
# rpm -ivh kernel-xen-2.6.18-128.el5virttest3.x86_64.rpm
error: Failed dependencies:
ecryptfs-utils < 44 conflicts with kernel-xen-2.6.18-128.el5virttest3.x86_64
# rpm -q ecryptfs-utils
Sigh. Can you temporarily just remove ecryptfs-utils (assuming you aren't using encrypted partitions)? The newer ecryptfs-utils will be shipped as part of 5.3, but hasn't been released yet.
Okay, removed ecryptfs-utils, didn't quite realize it was optional.
Looking good for me, I'm able to start a 32-bit fedora rawhide install, which wasn't even able to boot before. Also able to install 32-bit fedora 10 guest.
Yeah, ecryptfs-utils is optional unless you are using encrypted partitions, in which case it is mandatory. But I guess you are not doing that :). In any case, that is great news; it also seemed to fix the problem in my testing. I'll get this ready to go into the next RHEL release.
Thanks for the testing,
I am starting to see the following:
xen_net: Memory squeeze in netback driver.
and networking stops working in the guests. This may not be related to this new kernel; it may just be that I am overloading the machine now (I am adding new guests), but I thought I'd mention it here before filing a new issue if necessary.
OK, yeah. There's another open bug about this (BZ 454285); one of the patches in this kernel seems to be exacerbating the problem, though, since I also saw it on one of my loaded machines. It needs to be debugged further.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
I was seeing this bug on CentOS 5.3 x86_64 dom0; I could not start i386 F10 or F11 installation using virt-install. The graphical VNC console would never show up. When running "xm console <dom>" I saw a domU kernel crash.
After upgrading the x86_64 dom0 kernel+xen to -159.el5 the problem is fixed. I can now successfully install i386 Fedora 10 and Fedora 11 guests/domUs.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.