467698 – xen: 32 bit guest on 64 bit host oops in xen_set_pud()

Summary: xen: 32 bit guest on 64 bit host oops in xen_set_pud()

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Chris Lalancette
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	471276
Depends On:	457879
Blocks:	718066

Reported:	2008-10-20 11:02 UTC by Chris Lalancette
Modified:	2011-07-18 15:54 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-02 08:40:54 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Backport of upstream xen-3.1-testing c/s 15653, to fix F-10 32-on-64 crash (3.16 KB, patch) 2008-10-20 11:07 UTC, Chris Lalancette

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1243	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update	2009-09-01 08:53:34 UTC

Description Chris Lalancette 2008-10-20 11:02:22 UTC

+++ This bug was initially created as a clone of Bug #457879 +++

Created an attachment (id=313430)
Kernel OOPS

Description of problem:
DomU kernel crashed after a restart of Apache. Multiple oopses were displayed on the virtual console.

After this hang, I am unable to start my machine. It always hangs after Apache starts.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.25.3-2.fc9.i686
httpd-2.2.8-3.i386

How reproducible:
Unknown, always for me today, worked 

Steps to Reproduce:

  
Actual results:
See attached oops.

Expected results:
No oops.

Additional info:

--- Additional comment from markmc on 2008-08-06 13:37:26 EDT ---

Pasting the oops here for convenience:

kernel BUG at arch/x86/xen/multicalls.c:103!
invalid opcode: 0000 [#1] SMP
Modules linked in: nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_multipath dm_mod pcspkr xen_netfront xen_blkfront ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd

Pid: 1370, comm: httpd Not tainted (2.6.25.3-2.fc9.i686.xen #1)
EIP: 0061:[<c0404043>] EFLAGS: 00010002 CPU: 0
EIP is at xen_mc_flush+0x163/0x16f
EAX: 00000001 EBX: c1403054 ECX: 00000000 EDX: c1403054
ESI: c1403074 EDI: 00000000 EBP: dcc50d68 ESP: dcc50d50
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process httpd (pid: 1370, ti=dcc50000 task=d7c6ee90 task.ti=dcc50000)
Stack: c1403054 00000001 00000000 c1403854 c91b9008 c1403054 dcc50d84 c0404815
       13f26001 00000001 c91b9008 13f26001 c0c721c0 dcc50da4 c0471964 c0c721c0
       00000000 00000000 c91b9008 c0c721f8 00000001 dcc50e4c c047346e 0006e550
Call Trace:
 [<c0404815>] ? xen_set_pud+0xb6/0xcd
 [<c0471964>] ? __pmd_alloc+0x8b/0xb4
 [<c047346e>] ? handle_mm_fault+0xa3/0xa2a
 [<c047282b>] ? unmap_vmas+0x146/0x611
 [<c0637413>] ? do_page_fault+0x3ca/0x8d8
 [<c047535a>] ? free_pgtables+0x7e/0x94
 [<c04f7a7f>] ? prio_tree_insert+0x18c/0x1ff
 [<c046fac0>] ? vma_prio_tree_insert+0x1a/0x2e
 [<c04769a1>] ? vma_link+0xa1/0xbe
 [<c0477c55>] ? mmap_region+0x34d/0x40b
 [<c045c3d4>] ? audit_syscall_exit+0x2b1/0x2cc
 [<c040e224>] ? do_syscall_trace+0x69/0x16d
 [<c0637049>] ? do_page_fault+0x0/0x8d8
 [<c0635c0a>] ? error_code+0x72/0x78
 =======================
Code: e8 8b 84 fa 04 0a 00 00 ff 94 fa 00 0a 00 00 47 8b 5d e8 3b bb 08 0b 00 00 72 e3 c7 83 08 0b 00 00 00 00 00 00 83 7d ec 00 74 04 <0f> 0b eb fe 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 57 89 d7 56 89 c6


I've not seen this before and can't reproduce it with the default Apache config.

Jeremy, have you come across this before?

Jan, are there any messages on the console from the hypervisor when the oops occurs?

--- Additional comment from ondrejj on 2008-08-06 14:04:59 EDT ---

(In reply to comment #1)

> Jan, are there any messages on the console from the hypervisor when the oops
> occurs?

My hypervisor is still running, but I can't see anything interesting in the current dmesg, only normal network initialization.

Today I can't reproduce this. Curiously, my domU had been running for approx. 2 days without problems; then, after an Apache config update and a restart of the service, my domU crashed.

After that I was unable to boot it until I ran "chkconfig httpd off". Then I was able to start Apache normally by typing "service httpd start".

Today it works with normal startup (chkconfig httpd on), but with the modified config again.

Now I tried to revert my config back to the backup. It hangs again. These lines had been added:
<Directory /usr/share/nagios/html/pnp4nagios>
  Allow from .XXXXXX.sk .XXXXX.XXXXX.sk 158.XXX.XXX.
</Directory>
I think it has nothing to do with these lines, but with something else in memory.

My machine is not critical, so I can do more tests if required.
It's just a monitoring server, which needs to run most of the time.

--- Additional comment from jeremy on 2008-08-06 14:24:57 EDT ---

What version of Xen is it, and is it a 32 or 64-bit hypervisor?

There's an old Xen bug which prevents a 32-bit guest running on a 64-bit hypervisor from changing its own top-level pagetable entries, causing set_pud to fail.  It was fixed some time around Feb-March, I think.

Unfortunately the stack trace is a bit unclear here, so I'm not sure what's really going on in this case.  Aside from the Xen bug, I haven't seen anything like this before.

--- Additional comment from jeremy on 2008-08-06 14:29:02 EDT ---

BTW, if/when it crashes again, look at "xm dmesg" to see Xen's console log.  There should be something there to indicate why it decided to fail the hypercall.

--- Additional comment from ondrejj on 2008-08-06 15:02:06 EDT ---

(In reply to comment #4)
> BTW, if/when it crashes again, look at "xm dmesg" to see Xen's console log. 
> There should be something there to indicate why it decided to fail the
> hypercall.

Attaching my "xm dmesg" output. I can't tell exactly what is new.
d1 is from before the last crash, d2 from after it.

It is a 32-bit guest on a 64-bit hypervisor. The problem you mentioned also happened to me some months ago.

One more piece of information: if I run "chkconfig httpd off", boot the system normally, then run "chkconfig httpd on" and "reboot", the server works. In other words, it hangs only on the first boot; after a reboot it works.
There must be something special in memory when it fails.
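
One way to see what is new between two "xm dmesg" captures is to diff them line by line. The following is only a minimal sketch: the helper name new_lines is made up for illustration, and it assumes both captures are plain text (such as the d1 and d2 files mentioned above) whose older lines have not yet scrolled out of Xen's console buffer.

```python
import difflib


def new_lines(before, after):
    """Return the lines that the diff marks as added in `after`.

    `before` and `after` are the full text of two console captures.
    """
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff prefixes added lines with "+ "; strip that two-character prefix.
    return [line[2:] for line in diff if line.startswith("+ ")]
```

With the two captures read into strings, new_lines(d1_text, d2_text) would list only the messages Xen logged between them, which is where a line such as "(XEN) mm.c:694:d28 Bad L3 flags 6" would show up.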

--- Additional comment from ondrejj on 2008-08-06 15:02:48 EDT ---

Created an attachment (id=313627)
Before crash

--- Additional comment from ondrejj on 2008-08-06 15:03:13 EDT ---

Created an attachment (id=313628)
After crash

--- Additional comment from jeremy on 2008-08-06 15:30:53 EDT ---

(Please set the type on dumps to text, or paste them inline)

(XEN) mm.c:694:d28 Bad L3 flags 6

OK, that's the signature of the Xen bug I mentioned.  The fix is to update xen.

The bug depends on where things get mapped in the process address space.  It may be that address randomization is causing the non-deterministic results for you.
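
For reference, the value in the "Bad L3 flags" message can be read as a bitmask of x86 page-table entry flags. The sketch below is only an illustration: the helper names are made up here, the bit names follow the usual Linux _PAGE_* naming, and it assumes that Xen prints the value in hex and that it shows the flag bits Xen objected to in the L3 entry.

```python
import re

# Low-order x86 page-table entry flag bits (names follow Linux _PAGE_*).
X86_PTE_FLAGS = {
    0x01: "PRESENT",
    0x02: "RW",
    0x04: "USER",
    0x08: "PWT",
    0x10: "PCD",
    0x20: "ACCESSED",
}


def parse_bad_l3_flags(line):
    """Extract the flags value from a 'Bad L3 flags N' line, or None.

    Assumption: the value is printed in hex without a 0x prefix.
    """
    match = re.search(r"Bad L3 flags ([0-9a-fA-F]+)", line)
    return int(match.group(1), 16) if match else None


def decode_pte_flags(value):
    """Return the names of the known flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(X86_PTE_FLAGS.items()) if value & bit]
```

Under those assumptions, the line "(XEN) mm.c:694:d28 Bad L3 flags 6" decodes to the RW and USER bits.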

--- Additional comment from ondrejj on 2008-08-06 15:42:42 EDT ---

My xen is already updated. My system has uptime 20 days and is updated daily.

[root@vs2 ~]# rpm -q xen kernel-xen
xen-3.1.2-2.fc8
kernel-xen-2.6.21.7-2.fc8
kernel-xen-2.6.21.7-3.fc8
[root@vs2 ~]# uname -a
Linux vs2.XXXX.sk 2.6.21.7-3.fc8xen #1 SMP Thu Mar 20 14:58:12 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@vs2 ~]# cat /var/log/yum.log | grep xen
Feb 14 05:15:13 Installed: kernel-xen - 2.6.21-2957.fc8.x86_64
Feb 20 06:47:18 Installed: kernel-xen - 2.6.21.7-2.fc8.x86_64
Feb 29 05:57:17 Updated: xen-libs - 3.1.2-2.fc8.x86_64
Feb 29 05:57:23 Updated: xen - 3.1.2-2.fc8.x86_64
Mar 27 05:51:01 Installed: kernel-xen - 2.6.21.7-3.fc8.x86_64
[root@vs2 ~]# 

Do you think I need another reboot?

--- Additional comment from jeremy on 2008-08-06 16:26:34 EDT ---

The bug fix was committed to xen-unstable in:
changeset:   17061:9d29141a5e52
user:        Keir Fraser <keir.fraser>
date:        Mon Feb 18 13:50:25 2008 +0000
files:       xen/arch/x86/mm.c

So I think the F8 Xen package is out of date and needs updating.  I don't know whether RH are likely to do that.

A workaround might be to run a 64-bit kernel in your guest.  You'd just need to update the kernel; all the 32-bit usermode code should run fine in compat mode.

--- Additional comment from markmc on 2008-08-07 11:51:35 EDT ---

Thanks for the pointer, Jeremy.

I've kicked off a build of kernel-xen-2.6-2.6.21.7-4.fc8 with xen-3.1.4, which contains the fix

--- Additional comment from ondrejj on 2008-08-08 02:44:21 EDT ---

After a reboot the order of my guests changed, and now my previously bad machine does not hang (even with the current stable kernel). If you want, I can test this new kernel, but I am unable to reproduce the previous bug.

This new kernel works on my second Xen server. There was a problem with "Error: (9, 'Bad file descriptor')" after the first reboot, but I think this sometimes happened with the older kernel too. Maybe I caused it myself by starting one of my guests multiple times. After the second reboot the server works well.

--- Additional comment from updates on 2008-08-08 02:53:39 EDT ---

kernel-xen-2.6-2.6.21.7-5.fc8 has been submitted as an update for Fedora 8

--- Additional comment from markmc on 2008-08-08 03:00:53 EDT ---

Jan: I've pushed to updates-testing; please test and bump the karma here in order to get it pushed to stable updates:

  https://admin.fedoraproject.org/updates/F8/pending/kernel-xen-2.6-2.6.21.7-5.fc8

Orion: if you've still got 32-on-64 guests, maybe you could give it a shot too?

--- Additional comment from updates on 2008-08-12 14:27:34 EDT ---

kernel-xen-2.6-2.6.21.7-5.fc8 has been pushed to the Fedora 8 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel-xen-2.6'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-7240

--- Additional comment from ondrejj on 2008-08-13 04:12:55 EDT ---

This update works for me on 2 machines. Although I was unable to reproduce the previous problem even with the older kernel, I can at least confirm that this update did not add any bugs for me. :)

Bodhi is down, so I can't add a +1 karma point.

--- Additional comment from updates on 2008-09-16 19:19:29 EDT ---

kernel-xen-2.6-2.6.21.7-5.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 1 Chris Lalancette 2008-10-20 11:03:58 UTC

Additional notes:  this is a problem in the RHEL-5 hypervisor as well, when trying to install a F10 i386 PV guest on an x86_64 RHEL-5 HV.  As Jeremy pointed out, the upstream xen-unstable c/s was 17061, and the upstream xen-3.1-testing.hg c/s was 15653.  I'll attach a backport to the BZ, which seems to fix the problem for me.

Chris Lalancette

Comment 2 Chris Lalancette 2008-10-20 11:07:19 UTC

Created attachment 320860 [details]
Backport of upstream xen-3.1-testing c/s 15653, to fix F-10 32-on-64 crash

Comment 4 Orion Poplawski 2008-10-29 21:54:32 UTC

Is there a public version of a working xen for 5.2?  The ones here: http://fedorapeople.org/~crobinso/rhel5/install_f10/ don't work for me.

Comment 5 Chris Lalancette 2008-10-30 07:27:11 UTC

Hm, I'm not sure if you posted in the right bug, but those packages you mentioned are the preview packages for 5.3.  So if they don't work, please let us know why.

Chris Lalancette

Comment 6 Orion Poplawski 2008-10-30 16:06:51 UTC

Here are the messages.  I am posting here because of xen_set_pud.  Happy to open a new bug if needed.

Checking if this processor honours the WP bit even in supervisor mode...Ok.      
1 multicall(s) failed: cpu 0
Pid: 0, comm: swapper Not tainted 2.6.27.4-58.fc10.i686.PAE #1 
 [<c06d1213>] ? printk+0xf/0x14
 [<c04049d7>] xen_mc_flush+0xbb/0x187 
 [<c0405332>] xen_mc_issue+0x14/0x48 
 [<c04058d2>] xen_set_pud_hyper+0x39/0x41
 [<c040590e>] xen_set_pud+0x34/0x39 
 [<c041f9a0>] zap_low_mappings+0x2f/0x47   
 [<c08609a6>] mem_init+0x2c7/0x2cf 
 [<c084b7e9>] start_kernel+0x246/0x2f0
 [<c084b091>] i386_start_kernel+0x80/0x88
 [<c08511e2>] xen_start_kernel+0x7dd/0x7e5 
 =======================                                                         
  call  1/1: op=1 arg=[c2b96854] result=-22
------------[ cut here ]------------ 
kernel BUG at arch/x86/xen/multicalls.c:104! 
invalid opcode: 0000 [#1] SMP        
Modules linked in:

Pid: 0, comm: swapper Not tainted (2.6.27.4-58.fc10.i686.PAE #1)
EIP: e019:[<c0404a97>] EFLAGS: 00010002 CPU: 0                  
EIP is at xen_mc_flush+0x17b/0x187                              
EAX: c2b96054 EBX: 00000000 ECX: ffffffff EDX: c2b96054         
ESI: 00000001 EDI: 00000001 EBP: c0846ef4 ESP: c0846ee0         
 DS: e021 ES: e021 FS: 00d8 GS: 0000 SS: e021                   
Process swapper (pid: 0, ti=c0846000 task=c0808344 task.ti=c0846000)
Stack: c2b96054 00000000 00000001 7373d001 00000000 c0846f00 c0405332 c0833000 
       c0846f24 c04058d2 737b4000 00000000 7373d001 00000000 c0833000 7373d001 
       00000000 c0846f38 c040590e c0833000 c0834000 00000000 c0846f50 c041f9a0 
Call Trace:                                                                    
 [<c0405332>] ? xen_mc_issue+0x14/0x48                                         
 [<c04058d2>] ? xen_set_pud_hyper+0x39/0x41                                    
 [<c040590e>] ? xen_set_pud+0x34/0x39                                          
 [<c041f9a0>] ? zap_low_mappings+0x2f/0x47                                     
 [<c08609a6>] ? mem_init+0x2c7/0x2cf                                           
 [<c084b7e9>] ? start_kernel+0x246/0x2f0                                       
 [<c084b091>] ? i386_start_kernel+0x80/0x88                                    
 [<c08511e2>] ? xen_start_kernel+0x7dd/0x7e5                                   
 =======================                                                       
Code: 8b 55 ec 8b 84 da 04 0a 00 00 ff 94 da 00 0a 00 00 43 8b 45 ec 3b 98 08 0b 00 00 72 e3 85 ff c7 80 08 0b 00 00 00 00 00 00 74 04 <0f> 0b eb fe 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 57 89 d7 56 89 c6
EIP: [<c0404a97>] xen_mc_flush+0x17b/0x187 SS:ESP e021:c0846ee0 
---[ end trace 4eaa2a86a8e2da22 ]--- 
Kernel panic - not syncing: Attempted to kill the idle task!
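
The "call  1/1: op=1 arg=[c2b96854] result=-22" line in the log above carries the failing hypercall number and its return code. The sketch below shows one way to decode it; the function name and the regular expression are illustrative only, the hypercall table is deliberately partial (numbers as defined in Xen's public xen.h header), and it assumes the guest prints the negative errno value as the result.

```python
import errno
import re

# Partial table of Xen hypercall numbers; only the MMU-related ones are listed.
HYPERCALL_NAMES = {1: "mmu_update", 26: "mmuext_op"}

CALL_RE = re.compile(
    r"call\s+(\d+)/(\d+):\s+op=(\d+)\s+arg=\[([0-9a-fA-F]+)\]\s+result=(-?\d+)"
)


def decode_multicall_line(line):
    """Decode one 'call N/M: op=... result=...' line, or return None."""
    match = CALL_RE.search(line)
    if match is None:
        return None
    index, total, op, arg, result = match.groups()
    op, result = int(op), int(result)
    return {
        "index": int(index),
        "total": int(total),
        "op": op,
        "op_name": HYPERCALL_NAMES.get(op, "unknown"),
        "arg": int(arg, 16),
        "result": result,
        # Negative results are treated as -errno values.
        "error": errno.errorcode.get(-result, "unknown") if result < 0 else None,
    }
```

Under those assumptions, the line from this log reads as an mmu_update hypercall that was rejected with EINVAL, consistent with the hypervisor refusing the top-level page table update done by xen_set_pud.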

Comment 7 Chris Lalancette 2008-10-31 10:48:57 UTC

Oh, I see.  Well, there are two problems:

1.  Those packages are only the userspace portion, while this ends up being a hypervisor bug.  The hypervisor is packaged into the kernel, so you would need updated kernel-xen packages.

2.  Regardless, this patch isn't in the latest kernel-xen packages.  It still needs to go through internal review and testing first.

Thanks for the testing, though.

Chris Lalancette

Comment 8 Mark McLoughlin 2008-11-12 21:44:28 UTC

*** Bug 471276 has been marked as a duplicate of this bug. ***

Comment 9 Chris Lalancette 2009-01-15 14:50:26 UTC

I've uploaded a test kernel that contains this fix (along with several others)
to this location:

http://people.redhat.com/clalance/virttest

Could the original reporter try out the test kernels there, and report back if
it fixes the problem?

Thanks,
Chris Lalancette

Comment 10 Orion Poplawski 2009-01-15 15:33:19 UTC

# rpm -ivh kernel-xen-2.6.18-128.el5virttest3.x86_64.rpm
error: Failed dependencies:
        ecryptfs-utils < 44 conflicts with kernel-xen-2.6.18-128.el5virttest3.x86_64
# rpm -q ecryptfs-utils
ecryptfs-utils-41-1.el5

Comment 11 Chris Lalancette 2009-01-15 15:49:19 UTC

Sigh.  Can you temporarily just remove ecryptfs-utils (assuming you aren't using encrypted partitions)?  The newer ecryptfs-utils will be shipped as part of 5.3, but hasn't been released yet.

Chris Lalancette

Comment 12 Orion Poplawski 2009-01-15 17:30:16 UTC

Okay, removed ecryptfs-utils, didn't quite realize it was optional.

Looking good for me, I'm able to start a 32-bit fedora rawhide install, which wasn't even able to boot before.  Also able to install 32-bit fedora 10 guest.

Comment 13 Chris Lalancette 2009-01-16 08:11:32 UTC

Yeah, ecryptfs-utils is optional unless you are using encrypted partitions, in which case it is mandatory.  But I guess you are not doing that :).  In any case, that is great news; it also seemed to fix the problem in my testing.  I'll get this ready to go into the next RHEL release.

Thanks for the testing,
Chris Lalancette

Comment 14 Orion Poplawski 2009-01-16 19:33:14 UTC

I am starting to see the following:

xen_net: Memory squeeze in netback driver.

and networking stops working in the guests.  This may not be related to this new kernel, just that I am overloading the machine now (I am adding new guests), but I thought I'd mention it here before filing a new issue if necessary.

Comment 15 Chris Lalancette 2009-01-16 21:05:18 UTC

OK, yeah.  There's another open bug about this (BZ 454285); one of the patches in this kernel seems to be exacerbating the problem, though, since I also saw it on one of my loaded machines.  It needs to be debugged further.

Chris Lalancette

Comment 16 Don Zickus 2009-04-20 17:11:03 UTC

in kernel-2.6.18-140.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 18 Pasi Karkkainen 2009-07-27 22:44:44 UTC

I was seeing this bug on a CentOS 5.3 x86_64 dom0; I could not start an i386 F10 or F11 installation using virt-install. The graphical VNC console would never show up. When running "xm console <dom>" I saw a domU kernel crash.

After upgrading the x86_64 dom0 kernel+xen to -159.el5 the problem is fixed. I can now successfully install i386 Fedora 10 and Fedora 11 guests/domUs.

Comment 20 errata-xmlrpc 2009-09-02 08:40:54 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
