Bug 499276

Summary: RHEL5.3.z i386 HVM PAE Kernel soft lockup/panic under x86-64 Xen hypervisor
Product: Red Hat Enterprise Linux 5 Reporter: Qian Cai <qcai>
Component: xen Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED DUPLICATE QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 5.3 CC: mgahagan, mjenner, riel, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-05-05 22:33:04 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
RHTS XML to schedule the test none

Description Qian Cai 2009-05-05 22:12:34 UTC
Created attachment 342550 [details]
RHTS XML to schedule the test

Description of problem:
This is the RHEL5.3.z candidate kernel, which is due out tomorrow (7
Mar.).

The PAE kernel in an IA-32 HVM guest fails to boot under the x86-64 Xen
hypervisor. The guest either enters an endless soft lockup loop or hits
a kernel panic. Both occurred while starting udev, when using
kernel-xen-2.6.18-128.1.10.el5. I have verified that the problem does
not occur with the non-PAE kernel in an HVM guest. The problem appears
to be intermittent: I ran the test 4 times, and 2 runs failed.

The corresponding job ids are,

Failed,
http://wright.rhts.bos.redhat.com/cgi-bin/rhts/jobs.cgi?id=56888
http://wright.rhts.bos.redhat.com/cgi-bin/rhts/jobs.cgi?id=57092 [1]

Passed,
http://wright.rhts.bos.redhat.com/cgi-bin/rhts/jobs.cgi?id=57091
http://wright.rhts.bos.redhat.com/cgi-bin/rhts/jobs.cgi?id=57093

soft lockup - CPU#1 stuck for 10s! [udev_run_hotplu:715]

Pid: 715, comm:      udev_run_hotplu
EIP: 0060:[<c0415911>] CPU: 1
EIP is at smp_call_function+0x99/0xc3
 EFLAGS: 00000297    Not tainted  (2.6.18-128.1.10.el5PAE #1)
EAX: 00000000 EBX: 00000000 ECX: 00000001 EDX: 000000fb
ESI: 00000001 EDI: 00000000 EBP: c0415ae0 DS: 007b ES: 007b
CR0: 8005003b CR2: 00320032 CR3: 1fe1bc00 CR4: 000006f0
 [<c0415ae0>] stop_this_cpu+0x0/0x33
 [<c041594e>] smp_send_stop+0x13/0x1c
 [<c04243ff>] panic+0x4c/0x16d
 [<c04064eb>] die+0x25d/0x291
 [<c0406b22>] do_bounds+0x0/0x63
 [<c0406b7c>] do_bounds+0x5a/0x63
 [<c0405a89>] error_code+0x39/0x40
 [<c0460cbe>] __handle_mm_fault+0x2a3/0xb7b
 [<c060dbe2>] wait_for_completion+0x32/0x8f
 [<c04634a1>] vma_merge+0x14e/0x15f
 [<c061037b>] do_page_fault+0x2d2/0x600
 [<c04e91f5>] copy_from_user+0x31/0x5d
 [<c06100a9>] do_page_fault+0x0/0x600
 [<c0405a89>] error_code+0x39/0x40
 =======================
BUG: soft lockup - CPU#1 stuck for 10s! [udev_run_hotplu:715]

Pid: 715, comm:      udev_run_hotplu
EIP: 0060:[<c0415911>] CPU: 1
EIP is at smp_call_function+0x99/0xc3
 EFLAGS: 00000297    Not tainted  (2.6.18-128.1.10.el5PAE #1)
EAX: 00000000 EBX: 00000000 ECX: 00000001 EDX: 000000fb
ESI: 00000001 EDI: 00000000 EBP: c0415ae0 DS: 007b ES: 007b
CR0: 8005003b CR2: 00320032 CR3: 1fe1bc00 CR4: 000006f0
 [<c0415ae0>] stop_this_cpu+0x0/0x33
 [<c041594e>] smp_send_stop+0x13/0x1c
 [<c04243ff>] panic+0x4c/0x16d
 [<c04064eb>] die+0x25d/0x291
 [<c0406b22>] do_bounds+0x0/0x63
 [<c0406b7c>] do_bounds+0x5a/0x63
 [<c0405a89>] error_code+0x39/0x40
 [<c0460cbe>] __handle_mm_fault+0x2a3/0xb7b
 [<c060dbe2>] wait_for_completion+0x32/0x8f
 [<c04634a1>] vma_merge+0x14e/0x15f
 [<c061037b>] do_page_fault+0x2d2/0x600
 [<c04e91f5>] copy_from_user+0x31/0x5d
 [<c06100a9>] do_page_fault+0x0/0x600
 [<c0405a89>] error_code+0x39/0x40
 =======================
BUG: soft lockup - CPU#1 stuck for 10s! [udev_run_hotplu:715]

Pid: 715, comm:      udev_run_hotplu
EIP: 0060:[<c0415911>] CPU: 1
EIP is at smp_call_function+0x99/0xc3
 EFLAGS: 00000297    Not tainted  (2.6.18-128.1.10.el5PAE #1)
EAX: 00000000 EBX: 00000000 ECX: 00000001 EDX: 000000fb
ESI: 00000001 EDI: 00000000 EBP: c0415ae0 DS: 007b ES: 007b
CR0: 8005003b CR2: 00320032 CR3: 1fe1bc00 CR4: 000006f0
 [<c0415ae0>] stop_this_cpu+0x0/0x33
 [<c041594e>] smp_send_stop+0x13/0x1c
 [<c04243ff>] panic+0x4c/0x16d
 [<c04064eb>] die+0x25d/0x291
 [<c0406b22>] do_bounds+0x0/0x63
 [<c0406b7c>] do_bounds+0x5a/0x63
 [<c0405a89>] error_code+0x39/0x40
 [<c0460cbe>] __handle_mm_fault+0x2a3/0xb7b
 [<c060dbe2>] wait_for_completion+0x32/0x8f
 [<c04634a1>] vma_merge+0x14e/0x15f
 [<c061037b>] do_page_fault+0x2d2/0x600
 [<c04e91f5>] copy_from_user+0x31/0x5d
 [<c06100a9>] do_page_fault+0x0/0x600
 [<c0405a89>] error_code+0x39/0x40
 =======================
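
As a cross-check on the register dumps above, the CR4 value (000006f0) can be decoded to confirm that the guest really is running with PAE paging enabled. The following is a minimal Python sketch, not part of any kernel or Xen tooling: the helper name decode_cr4 and the flag table are illustrative assumptions, and only the low architectural CR4 bits are listed.

```python
# Low architectural CR4 bits on x86 (subset; higher bits omitted on purpose).
CR4_FLAGS = {
    0: "VME", 1: "PVI", 2: "TSD", 3: "DE", 4: "PSE", 5: "PAE",
    6: "MCE", 7: "PGE", 8: "PCE", 9: "OSFXSR", 10: "OSXMMEXCPT",
}

def decode_cr4(value):
    """Return the names of the CR4 flags set in `value`, lowest bit first."""
    return [name for bit, name in sorted(CR4_FLAGS.items()) if value & (1 << bit)]

# CR4 value copied from the soft lockup register dump in this report.
flags = decode_cr4(0x000006F0)
print(flags)
print("PAE enabled:", "PAE" in flags)
```

With the value from the trace, the PAE bit (bit 5) is set, which is consistent with the 2.6.18-128.1.10.el5PAE kernel named in the dump.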

=====================================================================================
I also got a panic when using the previously released 5.3.z kernel (2.6.18-128.1.6.el5PAE) for the guest.

Starting udev: BUG: unable to handle kernel paging request at virtual address ffffff86
 printing eip:
*pde = 00000000
Oops: 0002 [#1]
SMP
last sysfs file: /devices/pnp0/00:00/id
Modules linked in: dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata
sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    1
EIP:    0060:[<c0605fff>]    Not tainted VLI
EFLAGS: 00010203   (2.6.18-128.1.6.el5PAE #1)
EIP is at unix_attach_fds+0x2d/0x42
eax: c07de4e0   ebx: 00000001   ecx: c07de4e8   edx: df536c80
esi: df536c80   edi: dfbec6c0   ebp: 00000001   esp: de886f3c
ds: 007b   es: 007b   ss: 0068
Process modprobe (pid: 581, ti=de886000 task=de885aa0 task.ti=de886000)
Stack: c060698f ffffff9f 00000001 dfbec6c0 c06069f5 c05ab1f2 00000000 00000003
       00256fd8 00000000 de886000 c05ab2eb de886f7c 00000000 c05ab519 de886f7c
       de886000 00000003 c05abf12 00000001 00000001 00000000 00000001 00000001
Call Trace:
 [<c060698f>] unix_create1+0xda/0xe8
 [<c06069f5>] unix_create+0x58/0x63
 [<c05ab1f2>] __sock_create+0x133/0x213
 [<c05ab2eb>] sock_create+0xb/0xe
 [<c05ab519>] sys_socket+0x18/0x38
 [<c05abf12>] sys_socketcall+0x6c/0x19e
 [<c0404ead>] sysenter_past_esp+0x56/0x79
 =======================
Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 4b 85 db 79 ef c7 47 78 19 77 60 c\0 31 d2 5b 89 d0 5e 5f c3
EIP: [<c0605fff>] unix_attach_fds+0x2d/0x42 SS:ESP 0068:de886f3c
 <0>Kernel panic - not syncing: Fatal exception
 BUG: warning at arch/i386/kernel/smp.c:550/smp_call_function() (Not tainted)
 [<c0415ae0>] stop_this_cpu+0x0/0x33
 [<c04158cf>] smp_call_function+0x57/0xc3
 [<c0424e65>] printk+0x18/0x8e
 [<c041594e>] smp_send_stop+0x13/0x1c
 [<c04243ff>] panic+0x4c/0x16d
 [<c04064eb>] die+0x25d/0x291
 [<c0610081>] do_page_fault+0x0/0x600
 [<c06105a4>] do_page_fault+0x523/0x600
 [<c0610081>] do_page_fault+0x0/0x600
 [<c0405a89>] error_code+0x39/0x40
 [<c0605fff>] unix_attach_fds+0x2d/0x42
 [<c060698f>] unix_create1+0xda/0xe8
 [<c06069f5>] unix_create+0x58/0x63
 [<c05ab1f2>] __sock_create+0x133/0x213
 [<c05ab2eb>] sock_create+0xb/0xe
 [<c05ab519>] sys_socket+0x18/0x38
 [<c05abf12>] sys_socketcall+0x6c/0x19e
 [<c0404ead>] sysenter_past_esp+0x56/0x79
 =======================
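
For readers triaging the panic above, the "Oops: 0002" value is the x86 page-fault error code, and its low three bits describe the faulting access. The following is a minimal Python sketch of that decoding; the function name decode_page_fault_code is a hypothetical helper for illustration, and only the three low bits are interpreted.

```python
def decode_page_fault_code(code):
    """Interpret the low three bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = fault taken in kernel mode, 1 = in user mode
        "mode": "user" if code & 0x4 else "kernel",
    }

# Error code copied from the "Oops: 0002 [#1]" line in this report.
info = decode_page_fault_code(0x0002)
print(info)
```

For 0002 this gives a kernel-mode write to a page that is not present, which matches the "*pde = 00000000" line and the faulting address ffffff86 reported at the top of the oops.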

[1]
You can see that,

confirm_kernel/kernel-PAE-2.6.18-128.1.10.el5 failed,
http://wright.rhts.bos.redhat.com/cgi-bin/rhts/test_log.cgi?id=7991288

This was because I logged in to the machine and manually reset the
guest, which was stuck in the soft lockup loop/panic. I then tried the
non-PAE kernel, and it worked.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-128.1.10.el5
kernel-PAE-2.6.18-128.1.10.el5

How reproducible:
Intermittent. 2 out of 4 jobs failed so far.

Steps to Reproduce:
Schedule the test in RHTS with the attached XML, like this:
# submit_job -S rhts.redhat.com x86_64_Intel_SMP.xml

Actual results:
The i386 HVM guest intermittently fails to boot the kernel-PAE kernel (soft lockup loop or panic while starting udev).

Expected results:
The i386 HVM guest should boot the kernel-PAE kernel.

Comment 1 Rik van Riel 2009-05-05 22:17:24 UTC
Looks like a duplicate of bug 449346 which is scheduled to be fixed in RHEL 5.4.

Comment 2 Qian Cai 2009-05-05 22:30:40 UTC
Thanks Rik van Riel. Looks like a similar issue. I'll close this out.

Comment 3 Qian Cai 2009-05-05 22:33:04 UTC

*** This bug has been marked as a duplicate of bug 449346 ***