Bug 727552

Summary: host crashes when overcommitted guest quitting
Product: Red Hat Enterprise Linux 6
Reporter: Xiaoqing Wei <xwei>
Component: kernel
Assignee: Andrew Jones <drjones>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 6.2
CC: drjones, juzhang, michen, qcai, shuang, tburke
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-09-12 15:28:43 UTC
Bug Blocks: 523117    
Attachments (description: flags):
host crash analyze info: none
kdump part0: none
kdump part1: none
foreach bt > foreach_bt.txt: none

Description Xiaoqing Wei 2011-08-02 12:50:28 UTC
Created attachment 516318 [details]
host crash analyze info

Description of problem:
Install one overcommitted guest (or several, which makes the crash easier to reproduce) and quit it when the guest installation is finishing. The host may crash.

Version-Release number of selected component (if applicable):
2.6.32-174.el6.x86_64

How reproducible:
2 / 25   2.6.32-174.el6.x86_64
2 / 20+  2.6.32-171.el6.x86_64
0 / 20   2.6.32-170.el6.x86_64
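
A rough way to read these counts: if the per-run crash rate on -170 were the same as the 2/25 seen on -174 (an assumption, not something this report establishes), a clean streak of 20 runs would still not be unusual. The sketch below computes that chance; the function name is illustrative and the runs are assumed to be independent.

```python
def prob_no_failures(failures: int, runs: int, clean_runs: int) -> float:
    """Chance of `clean_runs` consecutive passes, assuming independent runs
    with a per-run failure rate of failures/runs (an assumption)."""
    rate = failures / runs
    return (1.0 - rate) ** clean_runs


# Rate observed on 2.6.32-174 (2 crashes in 25 runs), applied to the
# 20 clean runs reported for 2.6.32-170.
p = prob_no_failures(2, 25, 20)
print(f"P(0 crashes in 20 runs at 2/25 per run) = {p:.3f}")
```

Under those assumptions the chance comes out a little under 20%, so 0/20 on -170 hints that the problem appeared in -171 but does not prove it.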

Steps to Reproduce:
1. Install an overcommitted guest (e.g. host = 32G, guest = 33G).
2. Quit the guest by typing "q" in the monitor.
  
Actual results:
The host crashes.

Expected results:
The guest quits and the host keeps working normally.

Additional info:
host info :
qemu-kvm-0.12.1.2-2.175.el6.x86_64
32G

12 cpus :
processor	: 11
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 8
model name	: Six-Core AMD Opteron(tm) Processor 2427

Comment 2 Xiaoqing Wei 2011-08-02 12:55:31 UTC
cmd to start guest: 

qemu-kvm -name rhel61_32_ins -monitor stdio \
    -chardev socket,id=serial_id_20110802-084213-3dmc,path=/tmp/serial-20110802-084213-3dmc,server,nowait \
    -device isa-serial,chardev=serial_id_20110802-084213-3dmc \
    -drive file=/images/RHEL-Server-6.1-32.qcow2,index=0,if=none,id=drive-ide0-0-0,media=disk,cache=none,format=qcow2,aio=native \
    -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
    -device e1000,netdev=idQVRARF,mac=9a:a4:46:90:43:4e,id=ndev00idQVRARF,bus=pci.0,addr=0x3 \
    -netdev tap,id=idQVRARF,vhost=on,ifname=t0-083dmc,script=/qemu-ifup-switch,downscript=no \
    -m 33792 -smp 12,cores=1,threads=1,sockets=12 \
    -drive file=/RHEL6.1-Server-i386.iso,index=1,if=none,id=drive-ide0-0-1,media=cdrom,readonly=on,format=raw \
    -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 \
    -drive file=/rhel61-32/ks.iso,index=2,if=none,id=drive-ide0-1-0,media=cdrom,readonly=on,format=raw \
    -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
    -cpu cpu64-rhel6,+sse2,+x2apic \
    -kernel /rhel61-32/vmlinuz -initrd /rhel61-32/initrd.img \
    -vnc :10 -rtc base=utc,clock=host,driftfix=none -M rhel6.1.0 \
    -boot order=cdn,once=n,menu=off -usbdevice tablet \
    -no-kvm-pit-reinjection \
    --append 'ks=cdrom nicdelay=60 console=ttyS0,115200 console=tty0' \
    -enable-kvm
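
To make the overcommit in this command explicit, here is a minimal sketch that pulls the `-m` value out of a qemu-kvm command line and compares it with the host's 32G. The function name and the shortened command string are illustrative, and the sketch assumes `-m` is given as a plain number of megabytes, as it is above; it does not handle size suffixes.

```python
import shlex


def guest_memory_mib(cmdline: str) -> int:
    """Return the value following -m, assuming a plain number of megabytes."""
    args = shlex.split(cmdline)
    idx = args.index("-m")
    return int(args[idx + 1])


HOST_MEMORY_MIB = 32 * 1024  # host memory reported as 32G in the description

# Shortened, illustrative version of the command; only -m matters here.
guest = guest_memory_mib("qemu-kvm -name rhel61_32_ins -m 33792 -smp 12 -enable-kvm")
print(f"guest={guest} MiB, host={HOST_MEMORY_MIB} MiB, "
      f"overcommitted by {guest - HOST_MEMORY_MIB} MiB")
```

With `-m 33792` the guest is configured with 33G, which is 1024 MiB more than the host has.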

Comment 3 Xiaoqing Wei 2011-08-02 12:55:51 UTC
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-174.el6.x86_64/vmlinux
    DUMPFILE: analized.vmcore  [PARTIAL DUMP]
        CPUS: 12
        DATE: Tue Aug  2 08:56:12 2011
      UPTIME: 23:23:05
LOAD AVERAGE: 1.05, 1.07, 0.65
       TASKS: 355
    NODENAME: amd-2427-32-2.englab.nay.redhat.com
     RELEASE: 2.6.32-174.el6.x86_64
     VERSION: #1 SMP Thu Jul 28 00:31:11 EDT 2011
     MACHINE: x86_64  (2199 Mhz)
      MEMORY: 32 GB
       PANIC: "kernel BUG at mm/mmap.c:2346!"
         PID: 22872
     COMMAND: "qemu"
        TASK: ffff8804133d2080  [THREAD_INFO: ffff8804023fc000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash>

Comment 4 Xiaoqing Wei 2011-08-16 08:15:12 UTC
Hit the same problem when using kernel 2.6.32-184:

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-184.el6.x86_64/vmlinux
    DUMPFILE: kdump_analyzing/kern-184  [PARTIAL DUMP]
        CPUS: 2
        DATE: Tue Aug 16 00:54:20 2011
      UPTIME: 17:42:08
LOAD AVERAGE: 1.24, 0.90, 0.84
       TASKS: 155
    NODENAME: amd-5400b-4-3.englab.nay.redhat.com
     RELEASE: 2.6.32-184.el6.x86_64
     VERSION: #1 SMP Tue Aug 9 12:20:06 EDT 2011
     MACHINE: x86_64  (2805 Mhz)
      MEMORY: 3.9 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 1405
     COMMAND: "rpciod/1"
        TASK: ffff880118fd6100  [THREAD_INFO: ffff880117b4e000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)
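
For comparing this summary with the one in comment 3, a small sketch that splits the "KEY: value" lines of a crash summary into a dictionary and reports which fields differ. The parser is a simplification written for this example, not part of the crash utility, and only the PANIC and COMMAND lines from the two comments are fed in.

```python
def parse_crash_summary(text: str) -> dict:
    """Split 'KEY: value' lines into a dict, cutting at the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# PANIC and COMMAND lines copied from comment 3 and comment 4.
first = parse_crash_summary('''
       PANIC: "kernel BUG at mm/mmap.c:2346!"
     COMMAND: "qemu"
''')
second = parse_crash_summary('''
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
     COMMAND: "rpciod/1"
''')

for key in ("PANIC", "COMMAND"):
    same = first[key] == second[key]
    print(f"{key}: {'same' if same else 'differs'}")
```

Both fields differ between the two dumps, so the summaries alone do not show that the two crashes share a cause.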

Comment 5 Xiaoqing Wei 2011-08-16 08:20:47 UTC
Created attachment 518417 [details]
kdump part0

Comment 6 Xiaoqing Wei 2011-08-16 08:23:02 UTC
Created attachment 518419 [details]
kdump part1

Comment 7 Qian Cai 2011-08-17 03:11:20 UTC
Where is the crash bt output?

Comment 8 Xiaoqing Wei 2011-08-17 04:44:00 UTC
Created attachment 518590 [details]
foreach bt > foreach_bt.txt

Comment 9 Xiaoqing Wei 2011-08-17 04:44:30 UTC
(In reply to comment #7)
> Where is the crash bt output?

Hi Cai Qian:

Attached the foreach bt output.

FYI:
The bt output and other details are in the folder.
Download the attached "kdump part" files (part0, part1), then run:
cat ..part0 part1 > kdump.tb2
tar xjf kdump.tb2 
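
The same two steps can be sketched in Python for anyone without a shell at hand. The function names are illustrative, and the caller supplies the paths of the downloaded parts in order; the sketch assumes the joined file is a bzip2-compressed tarball, as the `tar xjf` step implies.

```python
import shutil
import tarfile


def join_parts(parts, joined):
    """Concatenate the part files, in the order given, into one archive."""
    with open(joined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def extract_bz2_tar(archive, dest):
    """Unpack a bzip2-compressed tarball into dest (like `tar xjf`)."""
    with tarfile.open(archive, "r:bz2") as tar:
        # Only extract archives from a source you trust.
        tar.extractall(dest)
```

Call `join_parts([first_part, second_part], "kdump.tb2")` with the paths of the two downloaded attachments, then `extract_bz2_tar("kdump.tb2", "kdump")`.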



Best Regards,
Xiaoqing.

Comment 10 Qian Cai 2011-08-17 05:02:37 UTC
I am interested in the panic in the comment #3.
PANIC: "kernel BUG at mm/mmap.c:2346!"

The panic in the comment #4 looks like a known issue - bug 730756, as I read from the log-m.txt from the attachment.

<1>BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
<1>IP: [<ffffffffa03b17f1>] __br_deliver+0x61/0x100 [bridge]

Are you able to reproduce the first panic using the latest kernel?

Comment 11 Xiaoqing Wei 2011-08-17 05:18:10 UTC
(In reply to comment #10)
Hi Cai Qian:

> I am interested in the panic in the comment #3.
> PANIC: "kernel BUG at mm/mmap.c:2346!"
> 
> The panic in the comment #4 looks like a known issue - bug 730756, as I read
> from the log-m.txt from the attachment.
> 

Actually, I am not really sure that the two crashes above are the same issue; their outputs are very different, although both were triggered while running the same job.


> <1>BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
> <1>IP: [<ffffffffa03b17f1>] __br_deliver+0x61/0x100 [bridge]
> 
> Are you able to reproduce the first panic using the latest kernel?

It's not always reproducible; I will keep trying to reproduce it :)

Thanks and Best Regards,
Xiaoqing.

Comment 12 Qian Cai 2011-08-17 05:21:06 UTC
> > I am interested in the panic in the comment #3.
> > PANIC: "kernel BUG at mm/mmap.c:2346!"
In addition, do you have vmcore available or logs for this panic to have a look?

Comment 14 Qian Cai 2011-08-17 06:32:57 UTC
Feel free to re-open if you can reproduce the kernel BUG at mm/mmap.c:2346! panic on the latest kernel.

*** This bug has been marked as a duplicate of bug 724037 ***

Comment 15 Suqin Huang 2011-08-22 05:32:17 UTC
re-open and assign to Andrew according to Dor's comment

> ok, we will re-open it if mm/mmap.c:2346! crash is reproduced.

Please assign Andrew for it

Comment 16 Andrew Jones 2011-08-29 15:36:08 UTC
This bug appears to be reporting crashes that have been reported in two other bugs. The one that we were to focus on (comment 3) looks very much like a dup of bug 724037. So why was this reopened? Does it reproduce with 2.6.32-182?

Comment 17 Andrew Jones 2011-09-12 13:38:15 UTC
Suqin,

See comment 16, why has this bug been reopened? Is there still an issue with latest RHEL6 builds? I believe the issue (comment 3) that this bug was reporting has been resolved and this bug was correctly duped.

Drew

Comment 18 Suqin Huang 2011-09-12 14:55:12 UTC
(In reply to comment #17)
> Suqin,
> 
> See comment 16, why has this bug been reopened? Is there still an issue with
> latest RHEL6 builds? I believe the issue (comment 3) that this bug was
> reporting has been resolved and this bug was correctly duped.
> 
> Drew

I re-open it according to Dor's comment in the email: "Please assign Andrew for it"

Comment 19 Andrew Jones 2011-09-12 15:28:43 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > Suqin,
> > 
> > See comment 16, why has this bug been reopened? Is there still an issue with
> > latest RHEL6 builds? I believe the issue (comment 3) that this bug was
> > reporting has been resolved and this bug was correctly duped.
> > 
> > Drew
> 
> I re-open it according to Dor's comment in the email: "Please assign Andrew for
> it"

You should only have done that if I hadn't already dealt with it :-) I'm redupping.

*** This bug has been marked as a duplicate of bug 724037 ***