Bug 492688 - kvm-84 BUG: soft lockup - CPU#0 stuck 100% load on one cpu
Summary: kvm-84 BUG: soft lockup - CPU#0 stuck 100% load on one cpu
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.4
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Eduardo Habkost
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-03-28 10:49 UTC by Gerrit Slomma
Modified: 2009-12-14 21:20 UTC (History)
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-07-24 20:29:46 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg from virtual machine (10.65 KB, application/octet-stream)
2009-03-28 10:49 UTC, Gerrit Slomma
no flags
/var/log/messages of virtual machine (1.58 MB, application/octet-stream)
2009-03-28 10:51 UTC, Gerrit Slomma
no flags

Description Gerrit Slomma 2009-03-28 10:49:45 UTC
Created attachment 337100 [details]
dmesg from virtual machine

Description of problem:

Virtual machine gets stuck after migration. One CPU is at 100% load while the other one idles. The virtual machine can be pinged, but ssh login is not always possible.
If the virtual machine is stopped on the migration target and continued on the migration source, it recovers after some time without load.

Version-Release number of selected component (if applicable):

kvm-84-1.el5.x86_64.rpm
kvm-kmod-84-1.el5.x86_64.rpm
qemu-0.9.1-11.el5.x86_64.rpm
qemu-img-0.9.1-11.el5.x86_64.rpm

kvm and kvm-kmod compiled from sourceforge sources and installed from built rpm, qemu and qemu-img are from EPEL-repository

How reproducible:

Start a kvm virtual machine on host A, then start a kvm virtual machine with the same parameters on host B in incoming mode. Migrate the virtual machine from host A to host B. Watch the kvm process go to 100% after the migration finishes.
You may have to wait up to 5 seconds or run a command such as ls in the virtual machine. After stopping the virtual machine on host B and continuing it on host A, the virtual machine recovers.

Steps to Reproduce:
1. start kvm-virtual-machine on host A

kvm -hda /dev/disk/by-path/ip-192.168.1.1:3260-iscsi-rr010:01-lun-5 -smp 2 -m 1024 -boot c -net nic,macaddr=00:16:3e:69:93:f5,model=rtl8139 -net tap,ifname=vnet0 -k en-us -monitor unix:/etc/kvm/rr019v2/run/monitor,server,nowait -pidfile /etc/kvm/rr019v2/run/pid -vnc 127.0.0.1:0

2. start kvm-virtual-machine in incoming mode on host B 

kvm -hda /dev/disk/by-path/ip-192.168.1.1:3260-iscsi-rr010:01-lun-5 -smp 2 -m 1024 -boot c -net nic,macaddr=00:16:3e:69:93:f5,model=rtl8139 -net tap,ifname=vnet0 -k de -monitor unix:/etc/kvm/rr019v2/run/monitor,server,nowait -pidfile /etc/kvm/rr019v2/run/pid -S -incoming tcp:192.168.1.102:4444 -vnc 127.0.0.1:0

3. wait until the kvm virtual machine is up, e.g. until you can log in via ssh
4. migrate the kvm-virtual-machine from host A to host B

nc -U /etc/kvm/rr019v2/run/monitor
(qemu) migrate tcp:192.168.1.102:4444
(qemu) info migrate
info migrate
Migration status: completed
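A hypothetical helper (not part of the report) that automates step 4: poll the qemu monitor until the migration finishes instead of typing "info migrate" by hand. The socket path would be the one used in the steps above; only parse_migration_status can be exercised standalone, polling needs a live monitor socket.

```shell
# Extract the status word from "Migration status: <status>" monitor output.
parse_migration_status() {
  sed -n 's/.*Migration status: \([a-z]*\).*/\1/p' | head -n 1
}

# wait_for_migration SOCKET [TRIES] - poll the monitor once per second until
# the migration reports completed or failed, or TRIES seconds have passed.
wait_for_migration() {
  local socket="$1" tries="${2:-60}" status
  while [ "$tries" -gt 0 ]; do
    status=$(echo "info migrate" | nc -U "$socket" | parse_migration_status)
    [ "$status" = "completed" ] && { echo completed; return 0; }
    [ "$status" = "failed" ]    && { echo failed;    return 1; }
    tries=$((tries - 1))
    sleep 1
  done
  echo timeout
  return 1
}

# Usage (against a live guest):
#   wait_for_migration /etc/kvm/rr019v2/run/monitor
```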

> on host B:
rr016# top -d 1 -p 30193
30193 root      15   0 1161m 1.0g 1916 S 96.0 27.2  10:49.70 kvm

> in kvm-virtual-machine
rr019v2# dmesg
(... lots of ...)

BUG: soft lockup - CPU#0 stuck for 10s! [bash:1747]
CPU 0:
Modules linked in: ipv6 xfrm_nalgo crypto_api dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac lp floppy i2c_piix4 8139too virtio_pci parport_pc i2c_core ide_cd 8139cp virtio_ring parport serio_raw mii virtio cdrom pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1747, comm: bash Not tainted 2.6.18-128.el5 #1
RIP: 0010:[<ffffffff80022a1c>]  [<ffffffff80022a1c>] flush_tlb_others+0x8c/0xbc
RSP: 0018:ffff81003c68dc38  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff810001575200 RCX: ffff810001575208
RDX: 0000000000000018 RSI: 00000000000000ff RDI: ffff810001575200
RBP: 00000000b71ebafe R08: 0000000000000003 R09: 000000000000003e
R10: ffff81003c68dbd8 R11: 00000000b71ebafe R12: ffff810001575200
R13: 0000000000000000 R14: ffffffff8000c30d R15: ffff81003c68dca8
FS:  0000000000000000(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006b2578 CR3: 000000003c68a000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff80022a29>] flush_tlb_others+0x99/0xbc
 [<ffffffff80075c88>] flush_tlb_mm+0xca/0xd5
 [<ffffffff80039ae2>] exit_mmap+0xad/0xf3
 [<ffffffff8003bc07>] mmput+0x30/0x83
 [<ffffffff8002be3a>] flush_old_exec+0x7b4/0xb08
 [<ffffffff8000b464>] vfs_read+0x13c/0x171
 [<ffffffff80018211>] load_elf_binary+0x478/0x181a
 [<ffffffff800e12a8>] get_arg_page+0x3c/0x95
 [<ffffffff800178ee>] copy_strings+0x1ef/0x200
 [<ffffffff8003f2a5>] search_binary_handler+0xbb/0x26d
 [<ffffffff8003e83a>] do_execve+0x16a/0x1f7
 [<ffffffff8005492d>] sys_execve+0x36/0x4c
 [<ffffffff8005d4d3>] stub_execve+0x67/0xb0

5. on host B stop the virtual machine
rr016# nc -U /etc/kvm/rr019v2/run/monitor
(qemu) quit

6. on host A continue the virtual machine
rr017# nc -U /etc/kvm/rr019v2/run/monitor
(qemu) c

7. on host A check the process
rr017# top -d 1 -p 4573
 4573 root      15   0 1161m 1.0g 934m R  0.0 35.3   1:19.96 kvm

8. log into the vm via ssh and fetch the dmesg and messages

  
Actual results:

The kvm virtual machine gets stuck; a logged-in ssh session is slow to respond or gets stuck too. Ping is possible. Opening a new ssh session is sometimes possible, sometimes not. After stopping the virtual machine on the target host and continuing it on the source host, everything is fine.

Expected results:

The kvm virtual machine migrates flawlessly. Interaction is possible.

Additional info:

This applies to migrations between all of my tested hosts.

rr016# grep "model name" /proc/cpuinfo
model name      : Intel(R) Core(TM)2 Duo CPU     T8300  @ 2.40GHz
model name      : Intel(R) Core(TM)2 Duo CPU     T8300  @ 2.40GHz
rr016# grep "MemTotal" /proc/meminfo
MemTotal:      3977960 kB
rr016# brctl show
bridge name     bridge id               STP enabled     interfaces
sw0             8000.002186521ea8       no              vnet0
                                                        eth0

rr017# grep "model name" /proc/cpuinfo
model name      : Intel(R) Core(TM)2 Duo CPU     E6550  @ 2.33GHz
model name      : Intel(R) Core(TM)2 Duo CPU     E6550  @ 2.33GHz
rr017# grep "MemTotal" /proc/meminfo
MemTotal:      3062956 kB
rr017# brctl show
bridge name     bridge id               STP enabled     interfaces
sw0             8000.001e3728a1c2       no              vnet0
                                                        eth0

rr019# grep "model name" /proc/cpuinfo
model name      : Quad-Core AMD Opteron(tm) Processor 2344 HE
model name      : Quad-Core AMD Opteron(tm) Processor 2344 HE
model name      : Quad-Core AMD Opteron(tm) Processor 2344 HE
model name      : Quad-Core AMD Opteron(tm) Processor 2344 HE
model name      : Quad-Core AMD Opteron(tm) Processor 2344 HE
model name      : Quad-Core AMD Opteron(tm) Processor 2344 HE
model name      : Quad-Core AMD Opteron(tm) Processor 2344 HE
model name      : Quad-Core AMD Opteron(tm) Processor 2344 HE
rr019# grep "MemTotal" /proc/meminfo
MemTotal:      4047848 kB
rr019# brctl show
bridge name     bridge id               STP enabled     interfaces
sw0             8000.00e08176899e       no              vnet0
                                                        eth0

Comment 1 Gerrit Slomma 2009-03-28 10:51:51 UTC
Created attachment 337101 [details]
/var/log/messages of virtual machine

Comment 2 Gerrit Slomma 2009-03-28 10:53:59 UTC
From the virtual machine
rr019v2# grep "model name" /proc/cpuinfo
model name      : QEMU Virtual CPU version 0.9.1
model name      : QEMU Virtual CPU version 0.9.1
rr019v2# grep "MemTotal" /proc/meminfo
MemTotal:      1026536 kB
rr019v2# free
             total       used       free     shared    buffers     cached
Mem:       1026536      87380     939156          0      27056      32352
-/+ buffers/cache:      27972     998564
Swap:       524280          0     524280

All operations were performed as root on the hosts.

Comment 3 Gerrit Slomma 2009-03-28 10:57:33 UTC
rr016# uname -a
Linux rr016 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

rr017# uname -a
Linux rr017 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

rr019# uname -a
Linux rr019 2.6.18-92.1.22.el5 #1 SMP Tue Dec 16 11:57:43 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

selinux disabled on all hosts and on the virtual machine, iptables shut off too to avoid side-effects.

Comment 4 Gerrit Slomma 2009-03-28 11:49:46 UTC
Also tried on rr016 with the 2.6.18-128.el5 kernel, but to no avail.

rr016# uname -a
Linux rr016 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

# modinfo kvm
filename:       /lib/modules/2.6.18-128.el5/extra/kvm.ko
license:        GPL
author:         Qumranet
version:        kvm-84
srcversion:     D964574B5665D21B64CD65A
depends:
vermagic:       2.6.18-128.el5 SMP mod_unload gcc-4.1
parm:           oos_shadow:bool
parm:           msi2intx:bool

# modinfo kvm_intel
filename:       /lib/modules/2.6.18-128.el5/extra/kvm-intel.ko
license:        GPL
author:         Qumranet
version:        kvm-84
srcversion:     4829C8B5FA311860FEA4B9A
depends:        kvm
vermagic:       2.6.18-128.el5 SMP mod_unload gcc-4.1
parm:           bypass_guest_pf:bool
parm:           enable_vpid:bool
parm:           flexpriority_enabled:bool
parm:           enable_ept:bool
parm:           emulate_invalid_guest_state:bool

Comment 5 Gerrit Slomma 2009-03-28 12:34:53 UTC
Problem exists with -smp 1 for the virtual machine too.

[root@rr019v2 ~]# dmesg
BUG: soft lockup - CPU#0 stuck for 10s! [hald-addon-stor:1568]
CPU 0:
Modules linked in: ipv6 xfrm_nalgo crypto_api dm_mirror dm_multipath scsi_dh
video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac lp
floppy i2c_piix4 ide_cd pcspkr 8139too cdrom i2c_core 8139cp parport_pc mii
virtio_pci parport serio_raw virtio_ring virtio dm_raid45 dm_message
dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata sd_mod scsi_mod ext3
jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1568, comm: hald-addon-stor Not tainted 2.6.18-128.el5 #1
RIP: 0010:[<ffffffff8000ec28>]  [<ffffffff8000ec28>] ide_do_request+0x30f/0x78d
RSP: 0018:ffffffff80425d78  EFLAGS: 00000246
RAX: 0000000000204108 RBX: ffff81003fd18480 RCX: ffff81003fd18480
RDX: ffff810000000000 RSI: ffff81003fd18480 RDI: 000000000000000f
RBP: ffffffff80425cf0 R08: 000000003ff98000 R09: 0000000000000000
R10: ffff81003fd18480 R11: 0000000000000110 R12: ffffffff8005dc8e
R13: ffffffff804cb918 R14: ffffffff800774da R15: ffffffff80425cf0
FS:  00002b7ca95d36e0(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aded67c7000 CR3: 000000003aed4000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff8000eba8>] ide_do_request+0x28f/0x78d
 [<ffffffff88206103>] :ide_cd:cdrom_decode_status+0x31c/0x347
 [<ffffffff882066fe>] :ide_cd:cdrom_pc_intr+0x27/0x21c
 [<ffffffff8000d4f5>] ide_intr+0x1af/0x1df
 [<ffffffff80010a46>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800b7ade>] __do_IRQ+0xa4/0x103
 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff80059cd6>] ide_outsw+0x0/0x9
 [<ffffffff80011f84>] __do_softirq+0x51/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cada>] do_softirq+0x2c/0x85
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff80059cd6>] ide_outsw+0x0/0x9
 [<ffffffff80059cde>] ide_outsw+0x8/0x9
 [<ffffffff801cd618>] atapi_output_bytes+0x23/0x5e
 [<ffffffff882066d7>] :ide_cd:cdrom_pc_intr+0x0/0x21c
 [<ffffffff88206d19>] :ide_cd:cdrom_transfer_packet_command+0xb0/0xdb
 [<ffffffff88206d91>] :ide_cd:cdrom_do_pc_continuation+0x0/0x2b
 [<ffffffff882055ad>] :ide_cd:cdrom_start_packet_command+0x14f/0x15b
 [<ffffffff8000eec4>] ide_do_request+0x5ab/0x78d
 [<ffffffff8013cb30>] elv_insert+0xd6/0x1f7
 [<ffffffff800414bc>] ide_do_drive_cmd+0xc0/0x116
 [<ffffffff882035f7>] :ide_cd:cdrom_queue_packet_command+0x46/0xe2
 [<ffffffff80059cde>] ide_outsw+0x8/0x9
 [<ffffffff801cc987>] ide_init_drive_cmd+0x10/0x24
 [<ffffffff88203b0f>] :ide_cd:cdrom_check_status+0x62/0x71
 [<ffffffff8013e034>] blk_end_sync_rq+0x0/0x2e
 [<ffffffff88203b3a>] :ide_cd:ide_cdrom_check_media_change_real+0x1c/0x37
 [<ffffffff881d7076>] :cdrom:media_changed+0x44/0x74
 [<ffffffff800df8d7>] check_disk_change+0x1f/0x50
 [<ffffffff881db33b>] :cdrom:cdrom_open+0x8ef/0x93c
 [<ffffffff8000cbb6>] do_lookup+0x65/0x1e6
 [<ffffffff8000d0d4>] dput+0x2c/0x114
 [<ffffffff8000a3be>] __link_path_walk+0xdf8/0xf42
 [<ffffffff8002c77e>] mntput_no_expire+0x19/0x89
 [<ffffffff8000e881>] link_path_walk+0xd3/0xe5
 [<ffffffff80063db6>] do_nanosleep+0x47/0x70
 [<ffffffff8000d0d4>] dput+0x2c/0x114
 [<ffffffff80057987>] kobject_get+0x12/0x17
 [<ffffffff80140caf>] get_disk+0x3f/0x81
 [<ffffffff8005a659>] exact_lock+0xc/0x14
 [<ffffffff801b8f11>] kobj_lookup+0x132/0x19b
 [<ffffffff88203e8d>] :ide_cd:idecd_open+0x9f/0xd0
 [<ffffffff800dff49>] do_open+0xa2/0x30f
 [<ffffffff800e040a>] blkdev_open+0x0/0x4f
 [<ffffffff800e042d>] blkdev_open+0x23/0x4f
 [<ffffffff8001e4f2>] __dentry_open+0xd9/0x1dc
 [<ffffffff80026f1f>] do_filp_open+0x2a/0x38
 [<ffffffff80063db6>] do_nanosleep+0x47/0x70
 [<ffffffff8000d0d4>] dput+0x2c/0x114
 [<ffffffff800198ab>] do_sys_open+0x44/0xbe
 [<ffffffff8005d116>] system_call+0x7e/0x83

but the CPU does not go up to 100% and the machine remains usable.
Migrating back from B to A also works; the lockup is reported once, and the
virtual machine is still usable.
Maybe there is a problem with the threads and they should be pinned to a
specific CPU?
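A sketch of the pinning idea above (hypothetical, untested as a fix): round-robin the qemu-kvm threads over the host CPUs with taskset. next_cpu is pure arithmetic; pin_threads needs root and a live kvm PID.

```shell
# next_cpu CURRENT NCPUS -> next CPU index, wrapping around at NCPUS
next_cpu() {
  echo $(( ($1 + 1) % $2 ))
}

# pin_threads PID NCPUS - pin each thread of the given process to one CPU
pin_threads() {
  local cpu=0 tid
  for tid in /proc/"$1"/task/*; do
    taskset -pc "$cpu" "${tid##*/}"   # pin this thread to CPU $cpu
    cpu=$(next_cpu "$cpu" "$2")
  done
}

# Usage (root, against a live guest):
#   pin_threads "$(cat /etc/kvm/rr019v2/run/pid)" 2
```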

Comment 6 Gerrit Slomma 2009-03-28 21:07:03 UTC
Okay, it works at least with CentOS 5.2 and the 2.6.18-92.el5 i686 kernel in the virtual machine:
rr019v3# uname -a
Linux rr019v3 2.6.18-92.el5 #1 SMP Tue Jun 10 18:49:47 EDT 2008 i686 i686 i386 GNU/Linux

Migrating back and forth works without problems.
2.6.18-128.1.1.el5 x86_64 from Red Hat does not work for the virtual machine.
noapic, nmi_watchdog=1, or pci=noirqpoll does not help with -128 or -128.1.1

Comment 7 Gerrit Slomma 2009-03-30 07:27:03 UTC
Kernel 2.6.18-128.el5.i686 from RHEL for the virtual machine also works without problems (new install) on a 2.6.18-128.el5.x86_64 host.

Comment 8 Gerrit Slomma 2009-03-31 20:21:41 UTC
I have tested the following:

the command stated above for the virtual machine, plus additionally

-no-kvm-irqchip     => doesn't work either, same as without this parameter
-no-kvm-pit     => doesn't work either, same as without this parameter
-no-kvm-pit-reinjection     => doesn't work either, same as without this parameter
-no-kvm     => works like i686, without any error messages in dmesg, back and forth between my hosts
-tdf     => doesn't work either, same as without this parameter
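A sketch of how the variants above can be driven (assumption: same disk path and MAC as in the reproduction steps). Actually launching requires the iSCSI disk, so this only prints the command line for each candidate flag.

```shell
# Base command from the reproduction steps; each run appends one flag.
base="kvm -hda /dev/disk/by-path/ip-192.168.1.1:3260-iscsi-rr010:01-lun-5 \
-smp 2 -m 1024 -boot c -net nic,macaddr=00:16:3e:69:93:f5,model=rtl8139 \
-net tap,ifname=vnet0"

# Print one launch command per candidate flag to isolate the KVM feature
# involved; replace echo with eval to actually start the guest.
for flag in -no-kvm-irqchip -no-kvm-pit -no-kvm-pit-reinjection -no-kvm -tdf; do
  echo "$base $flag"
done
```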

So, does anyone have any ideas?

Comment 9 Gerrit Slomma 2009-04-21 21:38:32 UTC
The problem still exists - or persists - with kvm-85, which was released today.

Comment 10 newellista 2009-05-08 21:37:25 UTC
I am getting the same results.  My system config is:

# uname -a
Linux h39 2.6.18-128.1.6.el5 #1 SMP Wed Apr 1 09:10:25 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

# modinfo kvm
filename:       /lib/modules/2.6.18-128.1.6.el5/extra/kvm/kvm.ko
license:        GPL
author:         Qumranet
version:        kvm-85
srcversion:     F75E598A4C8C6749972C7BB
depends:        
vermagic:       2.6.18-128.1.6.el5 SMP mod_unload gcc-4.1
parm:           oos_shadow:bool

# modinfo kvm_intel
filename:       /lib/modules/2.6.18-128.1.6.el5/extra/kvm/kvm-intel.ko
license:        GPL
author:         Qumranet
version:        kvm-85
srcversion:     C039A0000B33711A2AA2703
depends:        kvm
vermagic:       2.6.18-128.1.6.el5 SMP mod_unload gcc-4.1
parm:           bypass_guest_pf:bool
parm:           vpid:bool
parm:           flexpriority:bool
parm:           ept:bool
parm:           emulate_invalid_guest_state:bool

I am using libvirtd to manage the system.

Comment 11 Gerrit Slomma 2009-07-24 20:28:22 UTC
It seems the problem is gone with the RHEL 5.4 beta and kvm from the Virtualization channel.

on both hosts:

Installed Packages
etherboot-zroms-kvm.x86_64       5.4.4-10.el5              installed
kernel.x86_64                    2.6.18-128.1.14.el5       installed
kernel.x86_64                    2.6.18-128.1.16.el5       installed
kernel.x86_64                    2.6.18-155.el5            installed
kmod-kvm.x86_64                  83-80.el5                 installed
kvm.x86_64                       83-80.el5                 installed
libvirt.x86_64                   0.6.3-11.el5              installed
redhat-release.x86_64            5Server-5.4.0.2           installed

on vm:

Installed Packages
kernel.x86_64                    2.6.18-128.1.6.el5        installed

disk-image is on iscsi as before.

rr017# virsh migrate --live rr019v4 qemu+tcp://192.168.1.20/system tcp:192.168.1.20:4444

rr019v4# dmesg
(empty)
rr019v4# cat /proc/cpuinfo |grep processor
processor       : 0
processor       : 1

no process is going up to 100%

and back

virsh migrate --live rr019v4 qemu+tcp://192.168.1.17/system tcp:192.168.1.17:4444

rr019v4# dmesg
(empty)
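A hypothetical check (guest name as above) for verifying a migration the way this comment does: scan the guest's dmesg for the soft-lockup signature instead of eyeballing it.

```shell
# Count "BUG: soft lockup" lines on stdin; prints 0 when the log is clean.
# (grep -c exits nonzero on zero matches, hence the || true.)
count_soft_lockups() {
  grep -c 'BUG: soft lockup' || true
}

# Usage on the migrated guest (needs ssh access, not run here):
#   ssh root@rr019v4 dmesg | count_soft_lockups
```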

I would close this bug with reason "NEXTRELEASE" if I were able to do so.

