Bug 821663

Summary:

Kernel panic when booting guests (rhel5.8x64, rhel5.7x64, rhel4.9x64) with -smp 65 on a rhel6.3 host

Product:

Red Hat Enterprise Linux 6

Reporter:

FuXiangChun <xfu>

Component:

qemu-kvm

Assignee:

Gleb Natapov <gleb>

Status:

CLOSED WONTFIX

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

6.3

CC:

acathrow, areis, bsarathy, chayang, dyasny, flang, juzhang, knoel, michen, mkenneth, qzhang, shu, sluo, virt-maint

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2012-06-14 07:13:52 UTC

Type:

Bug

Attachments:

Description	Flags
full log	none

Description FuXiangChun 2012-05-15 10:01:54 UTC

Description of problem:
Booting rhel5.8x64, rhel5.7x64 and rhel4.9x64 guests with -smp >64 causes a guest kernel panic. With an smp value <=64 the guests work well.

Version-Release number of selected component (if applicable):
# rpm -qa | grep qemu
qemu-kvm-0.12.1.2-2.292.el6.x86_64

# uname -r
2.6.32-270.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. /usr/libexec/qemu-kvm -M rhel6.3.0 -cpu host --enable-kvm -m 512G -smp 64,maxcpus=161 -name rhel6.3 -uuid ddcbfb49-3411-1701-3c36-6bdbc00bedbc -rtc base=utc,clock=host,driftfix=slew -drive file=/home/images/RHEL-Server-5.7-64-virtio.qcow2,if=none,id=virtio,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,drive=virtio,id=drive-virtio0-0-0,bootindex=1 -netdev tap,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=86:12:50:a4:35:75 -spice port=5911,disable-ticketing -vga qxl -device sga -chardev socket,id=serial0,path=/var/test3,server,nowait -device isa-serial,chardev=serial0 -balloon virtio -monitor unix:/tmp/monitor3,server,nowait -monitor stdio
  
Actual results:
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff8008d4ca>] sd_degenerate+0x31/0x44
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: 
CPU 0 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-300.el5 #1
RIP: 0010:[<ffffffff8008d4ca>]  [<ffffffff8008d4ca>] sd_degenerate+0x31/0x44
RSP: 0000:ffff81801fc17be0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff810001c02be0 RCX: 000000000000004d
RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000000
RBP: ffff81801fc17bf0 R08: 0000000000000040 R09: 0000000000000000
R10: ffff81801fc17e40 R11: 000000d000000000 R12: ffff810001c02be0
R13: ffff810001c02700 R14: ffff810001c01420 R15: 0000000000000027
FS:  0000000000000000(0000) GS:ffffffff8042f000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff81801fc16000, task ffff81801fc057a0)
Stack:  ffff810001c02a40 ffff810001c02a40 ffff81801fc17c30 ffffffff8008eac6
 0000004d000000d0 0000000000000040 00000000000000ff ffff810001c02a40
 00000000000000ff 0000000000000040 ffff81801fc17e30 ffffffff800917d0
Call Trace:
 [<ffffffff8008eac6>] cpu_attach_domain+0x4b/0xce
 [<ffffffff800917d0>] __build_sched_domains+0xd42/0x13d3
 [<ffffffff80155d7f>] __next_cpu+0x19/0x28
 [<ffffffff80091f57>] arch_init_sched_domains+0x2e/0x35
 [<ffffffff8047e049>] sched_init_smp+0x1e/0xa5
 [<ffffffff8046b9e8>] init+0x183/0x2f7
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8018835c>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff8046b865>] init+0x0/0x2f7
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 48 3b 00 75 08 31 d2 f6 c1 70 0f 94 c2 59 5b c9 89 d0 c3 55 
RIP  [<ffffffff8008d4ca>] sd_degenerate+0x31/0x44
 RSP <ffff81801fc17be0>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception


Expected results:
The guest boots successfully.

Additional info:
The rhel5.8x86, rhel5.7x86 and rhel4.9x86 guests do not hit this issue.
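
When comparing several console logs, it can help to reduce each oops to its faulting symbol and call trace. The following is a minimal Python sketch, assuming the oops text uses the `[<address>] symbol+0xoff/0xlen` frame format seen under "Actual results". The names `FRAME_RE` and `summarize_oops` are hypothetical helpers, not part of any kernel or Red Hat tool.

```python
import re

# Matches frames such as " [<ffffffff8008eac6>] cpu_attach_domain+0x4b/0xce".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(text):
    """Return (faulting_symbol, call_trace_symbols) for one oops dump."""
    faulting = None
    trace = []
    in_trace = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Call Trace:"):
            in_trace = True
            continue
        match = FRAME_RE.search(stripped)
        if in_trace:
            if match:
                trace.append(match.group(2))
            else:
                # The first line that is not a frame ends the call trace.
                in_trace = False
        elif match and faulting is None:
            # The first frame before the call trace is the faulting RIP.
            faulting = match.group(2)
    return faulting, trace
```

Applied to the oops in this report, the faulting symbol is the one on the RIP line and the trace list starts at the first entry after "Call Trace:".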

Comment 1 Gleb Natapov 2012-06-10 08:14:48 UTC

Check our product limits before testing: https://home.corp.redhat.com/wiki/enterprise-linux-product-limits. We do not support more than 64 CPUs with any of these products, and the x86 variants do not even try to initialize all available vCPUs.

Your "reproduce" command line has "smp 64,maxcpus=161". Does this mean you hit this issue with 64 CPUs, or is the command line incorrect? Regardless, I do not hit this issue (which looks like a kernel bug) with many more than 64 vCPUs and rhel5. What host CPU do you have? Please attach the full console log.
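
A quick way to catch this kind of mismatch before filing is to read the boot vCPU count out of the command line and compare it with the limit. The following is a minimal Python sketch, assuming the `-smp n[,maxcpus=m]` form used in this report and the limit of 64 stated in this comment. The names `SUPPORTED_VCPUS`, `parse_smp` and `exceeds_limit` are hypothetical, and other `-smp` fields such as `cores=` are simply ignored.

```python
import shlex

# Limit taken from this comment; the product limits page is authoritative.
SUPPORTED_VCPUS = 64


def parse_smp(command_line):
    """Return (boot_cpus, maxcpus) from the first -smp option, or None."""
    args = shlex.split(command_line)
    for i, arg in enumerate(args[:-1]):
        if arg != "-smp":
            continue
        boot = None
        maxcpus = None
        for field in args[i + 1].split(","):
            key, sep, value = field.partition("=")
            if not sep:
                boot = int(key)  # bare leading number, as in "-smp 65"
            elif key == "maxcpus":
                maxcpus = int(value)
        return boot, maxcpus
    return None


def exceeds_limit(command_line, limit=SUPPORTED_VCPUS):
    """True if the boot vCPU count is above the supported limit."""
    smp = parse_smp(command_line)
    return smp is not None and smp[0] is not None and smp[0] > limit
```

With the command line as originally posted, the check reports 64 boot vCPUs, which is within the limit and is why the reproducer looked inconsistent with the summary.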

Comment 2 FuXiangChun 2012-06-11 10:22:39 UTC

My "reproduce" command line should be "smp 65,maxcpus=161".
I first reproduced it on an AMD 6172 (48 cores).
I have just reproduced it on a local host as well:
AMD Phenom(tm) 9600B Quad-Core Processor, 4 cores.

This is the full console log.

Booting 'Red Hat Enterprise Linux Server (2.6.18-274.el5)'

root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
kernel /vmlinuz-2.6.18-274.el5 ro root=/dev/VolGroup00/LogVol00 crashkernel=128
M@32M console=tty0 console=ttyS0,115200n8 rhgb quiet
   [Linux-bzImage, setup=0x1e00, size=0x20029c]
initrd /initrd-2.6.18-274.el5.img
   [Linux-initrd @ 0x37c9d000, 0x3520f5 bytes]

WARNING calibrate_APIC_clock: the APIC timer calibration may be wrong.
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff8008cecd>] sd_degenerate+0x31/0x44
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: 
CPU 0 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-274.el5 #1
RIP: 0010:[<ffffffff8008cecd>]  [<ffffffff8008cecd>] sd_degenerate+0x31/0x44
RSP: 0000:ffff81007ffbfbe0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff81000a755e60 RCX: 000000000000004d
RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000000
RBP: ffff81007ffbfbf0 R08: 0000000000000040 R09: 0000000000000000
R10: ffff81007ffbfe40 R11: 000000d000000000 R12: ffff81000a755e60
R13: ffff81000a755980 R14: ffff81000a7546a0 R15: 0000000000000027
FS:  0000000000000000(0000) GS:ffffffff8042a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff81007ffbe000, task ffff81007ffad7a0)
Stack:  ffff81000a755cc0 ffff81000a755cc0 ffff81007ffbfc30 ffffffff8008e4c9
 0000004d000000d0 0000000000000040 00000000000000ff ffff81000a755cc0
 00000000000000ff 0000000000000040 ffff81007ffbfe30 ffffffff800911d3
Call Trace:
 [<ffffffff8008e4c9>] cpu_attach_domain+0x4b/0xce
 [<ffffffff800911d3>] __build_sched_domains+0xd42/0x13d3
 [<ffffffff8009195a>] arch_init_sched_domains+0x2e/0x35
 [<ffffffff8047810a>] sched_init_smp+0x1e/0xa5
 [<ffffffff804659e8>] init+0x183/0x2f7
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8018681a>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff80465865>] init+0x0/0x2f7
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 48 3b 00 75 08 31 d2 f6 c1 70 0f 94 c2 59 5b c9 89 d0 c3 55 
RIP  [<ffffffff8008cecd>] sd_degenerate+0x31/0x44
 RSP <ffff81007ffbfbe0>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception
 
(In reply to comment #1)
> Check our product limits before testing:
> https://home.corp.redhat.com/wiki/enterprise-linux-product-limits. We do not
> support more than 64 CPUs with any of these products, and the x86 variants do
> not even try to initialize all available vCPUs.
> 
> Your "reproduce" command line has "smp 64,maxcpus=161". Does this mean you
> hit this issue with 64 CPUs, or is the command line incorrect? Regardless, I
> do not hit this issue (which looks like a kernel bug) with many more than 64
> vCPUs and rhel5. What host CPU do you have? Please attach the full console log.

You can log in to my test host:
 ip:10.66.9.97 
 user/password:redhat/redhat
 image path:/home/

Comment 3 Gleb Natapov 2012-06-11 11:18:18 UTC

(In reply to comment #2)
> My "reproduce" command line should be "smp 65,maxcpus=161".
> I first reproduced it on an AMD 6172 (48 cores).
> I have just reproduced it on a local host as well:
> AMD Phenom(tm) 9600B Quad-Core Processor, 4 cores.
> 
Do you see the same on Intel?

> This is full console log.
This is not the full console log. To get the full console log, remove "rhgb quiet" from the kernel command line.
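
As an illustration of that edit, the following minimal Python sketch drops the two tokens from a kernel command line string and leaves everything else in order. The function name `verbose_cmdline` is hypothetical; on a real guest the change is made by editing the kernel line in the boot loader.

```python
def verbose_cmdline(cmdline, drop=("rhgb", "quiet")):
    """Return the kernel command line without the tokens listed in drop."""
    return " ".join(tok for tok in cmdline.split() if tok not in drop)
```

Only exact tokens are removed, so options such as `console=ttyS0,115200n8` are left untouched.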

Comment 4 FuXiangChun 2012-06-12 11:52:42 UTC

I have attached the full console log. I don't hit this issue on Intel.

Comment 5 FuXiangChun 2012-06-12 11:53:30 UTC

Created attachment 591169 [details]
full log

Comment 6 Gleb Natapov 2012-06-14 07:13:52 UTC

I am closing the bug since this is not a supported configuration. It looks like a guest kernel limitation.