Bug 620512

Summary: RHEL5 KVM guest crashes sometimes
Product: Red Hat Enterprise Linux 5 Reporter: Rik van Riel <riel>
Component: kernelAssignee: Karen Noel <knoel>
Status: CLOSED WONTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 5.5CC: dgibson, gcosta
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-11 09:01:56 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Rik van Riel 2010-08-02 18:17:07 UTC
Description of problem:

Running a RHEL5 SMP guest on Fedora 12 KVM results in occasional hangs, but only with one of my virtual machines.

Version-Release number of selected component (if applicable):

2.6.18-164.11.1.el5

How reproducible:

I'm really not sure what triggers the bug.  Sometimes it happens after a few days, sometimes things run fine for a few weeks.

Additional info:

The call trace that prints on the guest serial console when it hangs:

BUG: soft lockup - CPU#2 stuck for 11s! [events/2:13]
CPU 2:
Modules linked in: lockd sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy ata_piix libata scsi_mod virtio_net ide_cd cdrom i2c_piix4 i2c_core pcspkr serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 13, comm: events/2 Not tainted 2.6.18-164.11.1.el5 #1
RIP: 0010:[<ffffffff800daec1>]  [<ffffffff800daec1>] drain_array+0xb5/0xc0
RSP: 0018:ffff810037f43dd8  EFLAGS: 00000283
RAX: ffff810037ce6018 RBX: 000000000000001e RCX: 0000000000000000
RDX: 00000000000000f0 RSI: ffff810037ce61f8 RDI: ffff810037ce6108
RBP: 0000000000000001 R08: ffff810037ce6018 R09: ffff81007efd9540
R10: ffff810037fee860 R11: ffff810074b90330 R12: ffff810000ba2218
R13: 0000000000000286 R14: 0000000000000286 R15: ffff810000ba21e0
FS:  0000000000000000(0000) GS:ffff810037c8ee40(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002afe44fcd1b0 CR3: 0000000076c38000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff800db8d4>] cache_reap+0x0/0x217
 [<ffffffff800db9ab>] cache_reap+0xd7/0x217
 [<ffffffff8004d8ed>] run_workqueue+0x94/0xe4
 [<ffffffff8004a12f>] worker_thread+0x0/0x122
 [<ffffffff8004a21f>] worker_thread+0xf0/0x122
 [<ffffffff8008c86c>] default_wake_function+0x0/0xe
 [<ffffffff80032950>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80032852>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Kernel panic - not syncing: softlockup: hung tasks
BUG: soft lockup - CPU#1 stuck for 11s! [swapper:0]
CPU 1:
Modules linked in: lockd sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy ata_piix libata scsi_mod virtio_net ide_cd cdrom i2c_piix4 i2c_core pcspkr serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-164.11.1.el5 #1
RIP: 0010:[<ffffffff8006b31d>]  [<ffffffff8006b31d>] default_idle+0x29/0x50
RSP: 0018:ffff810037c9fef0  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff80302698
RBP: ffff810037c802f0 R08: ffff810037c9e000 R09: 000000000000003f
R10: ffff81000277c008 R11: 0000000000000282 R12: 0000000004346c16
R13: 000281ff2cf28e8b R14: ffff810037fef0c0 R15: ffff810037c80100
FS:  0000000000000000(0000) GS:ffff81007ffbc7c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fffadf6bff8 CR3: 0000000076aaf000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff8004947b>] cpu_idle+0x95/0xb8
 [<ffffffff80077474>] start_secondary+0x498/0x4a7

BUG: soft lockup - CPU#0 stuck for 11s! [swapper:0]
Kernel panic - not syncing: softlockup: hung tasks
CPU 0:
Modules linked in: lockd sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy ata_piix libata scsi_mod virtio_net ide_cd cdrom i2c_piix4 i2c_core pcspkr serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-164.11.1.el5 #1
RIP: 0010:[<ffffffff80064b50>]  [<ffffffff80064b50>] _spin_unlock_irqrestore+0x8/0x9
RSP: 0018:ffffffff8043dee0  EFLAGS: 00000246
RAX: ffff81007f3eaea8 RBX: ffff81007f3eac00 RCX: 0000000000000001
RDX: ffff81007f3eaea8 RSI: 0000000000000246 RDI: ffff81007f3eaea0
RBP: ffffffff8043de60 R08: ffff81005e875600 R09: ffff81007f2e9800
R10: ffff810072285ba0 R11: ffffffff8808b53d R12: ffffffff8005dc8e
R13: ffff81007f3eae90 R14: ffffffff80077d08 R15: ffffffff8043de60
FS:  0000000000000000(0000) GS:ffffffff803c1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aff17c5ee34 CR3: 00000000748c6000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff8808b107>] :virtio_pci:vp_interrupt+0xa9/0xb8
 [<ffffffff80010b81>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800ba2c5>] __do_IRQ+0xa4/0x103
 [<ffffffff8006c9bf>] do_IRQ+0xe7/0xf5
 [<ffffffff8006b2f4>] default_idle+0x0/0x50
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff8006b31d>] default_idle+0x29/0x50
 [<ffffffff8004947b>] cpu_idle+0x95/0xb8
 [<ffffffff803fd7fd>] start_kernel+0x220/0x225
   [<ffffffff803fd22f>] _sinittext+0x22f/0x236

Kernel panic - not syncing: softlockup: hung tasks
 cat: /dev/pts/1: Input/output error

Comment 1 Glauber Costa 2010-08-02 18:29:31 UTC
Ok, my first wild hunch on this, is that it is interrupt-related, and not directly kvmclock. On RHEL5-64, due to lack of clocksources impl, we still call kvmclock only when we get timer interrupts.

Are you seeing any other strange stuff around it? Does it happen, by any chance, with other clocks ?

Comment 2 Rik van Riel 2010-08-03 13:19:01 UTC
I got another hang overnight.  And yes, I can believe this is related to interrupts, or a memory crunch, or something along those lines.

BUG: soft lockup - CPU#1 stuck for 10s! [swapper:0]
CPU 1:
Modules linked in: lockd sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy virtio_net ata_piix libata scsi_mod i2c_piix4 i2c_core ide_cd serio_raw cdrom pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-164.11.1.el5 #1
RIP: 0010:[<ffffffff8006b31d>]  [<ffffffff8006b31d>] default_idle+0x29/0x50
RSP: 0018:ffff810037c9fef0  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff80302698
RBP: ffff810037c802f0 R08: ffff810037c9e000 R09: 000000000000003f
R10: ffff81000277c008 R11: ffff81006f2b2d30 R12: 0000000005f2c07d
R13: 00025e7d77b5531f R14: ffff8100776fb040 R15: ffff810037c80100
FS:  0000000000000000(0000) GS:ffff81007ffbc7c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b56f3f17e34 CR3: 00000000761da000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff8004947b>] cpu_idle+0x95/0xb8
 [<ffffffff80077474>] start_secondary+0x498/0x4a7

BUG: soft lockup - CPU#0 stuck for 15s! [kjournald:1219]
Kernel panic - not syncing: softlockup: hung tasks
 CPU 0:
Modules linked in: lockd sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy virtio_net ata_piix libata scsi_mod i2c_piix4 i2c_core ide_cd serio_raw cdrom pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1219, comm: kjournald Not tainted 2.6.18-164.11.1.el5 #1
RIP: 0010:[<ffffffff800122e6>]  [<ffffffff800122e6>] __do_softirq+0x51/0x133
RSP: 0000:ffffffff8043df60  EFLAGS: 00000206
RAX: 0000000000000022 RBX: 0000000000000022 RCX: 7ffffffffffffffe
RDX: ffff8100784bffd8 RSI: 0000000000000046 RDI: ffff810077972860
RBP: ffffffff8043dee0 R08: ffffffff8005e2fc R09: ffff810081f42000
R10: ffffffff8043df98 R11: ffffffff8808b53d R12: ffffffff8005dc8e
R13: 0000000000000046 R14: ffffffff80077d08 R15: ffffffff8043dee0
FS:  0000000000000000(0000) GS:ffffffff803c1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b56f3d13000 CR3: 0000000078bce000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb3c>] do_softirq+0x2c/0x85
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8808b53d>] :virtio_pci:vp_notify+0x0/0x1c
 [<ffffffff8005aa91>] generic_unplug_device+0x31/0x32
 [<ffffffff880a5c2c>] :dm_mod:dm_table_unplug_all+0x3f/0x83
 [<ffffffff880a4955>] :dm_mod:dm_request+0x11d/0x124
 [<ffffffff880a3d80>] :dm_mod:dm_unplug_all+0x1d/0x28
 [<ffffffff800153c1>] sync_buffer+0x36/0x3f
 [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
 [<ffffffff8001538b>] sync_buffer+0x0/0x3f
 [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
 [<ffffffff88033da4>] :jbd:journal_commit_transaction+0x942/0x106a
 [<ffffffff8003dad1>] lock_timer_base+0x1b/0x3c
 [<ffffffff8803758c>] :jbd:kjournald+0xc1/0x213
 [<ffffffff800a00b7>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
 [<ffffffff880374cb>] :jbd:kjournald+0x0/0x213
 [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032950>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032852>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Kernel panic - not syncing: softlockup: hung tasks
BUG: soft lockup - CPU#2 stuck for 13s! [swapper:0]
CPU 2:
Modules linked in: lockd sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy virtio_net ata_piix libata scsi_mod i2c_piix4 i2c_core ide_cd serio_raw cdrom pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-164.11.1.el5 #1
RIP: 0010:[<ffffffff8006b31d>]  [<ffffffff8006b31d>] default_idle+0x29/0x50
RSP: 0018:ffff810002757ef0  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff80302698
RBP: ffff810037c8f270 R08: ffff810002756000 R09: 000000000000003e
R10: ffff81000277c038 R11: ffff81006c971c90 R12: 000000001ae5412b
R13: 00025e7d6f25c33b R14: ffff810037fee860 R15: ffff810037c8f080
FS:  0000000000000000(0000) GS:ffff810037c8ee40(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b56f3c3775d CR3: 000000006bc7d000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff8004947b>] cpu_idle+0x95/0xb8
 [<ffffffff80077474>] start_secondary+0x498/0x4a7

Kernel panic - not syncing: softlockup: hung tasks
 <3>BUG: scheduling while atomic: kjournald/0x10010100/1219

Call Trace:
 <IRQ>  [<ffffffff8006243d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff80091d2d>] printk+0x52/0xbd
 [<ffffffff800914dd>] __call_console_drivers+0x5b/0x69
 [<ffffffff8008df39>] __cond_resched+0x1c/0x44
 [<ffffffff800630cb>] cond_resched+0x37/0x42
 [<ffffffff801ac263>] do_unblank_screen+0x16/0x133
 [<ffffffff80081cba>] bust_spinlocks+0x1c/0x46
 [<ffffffff8009131f>] panic+0x88/0x1eb
 [<ffffffff8009e654>] kernel_text_address+0x1a/0x26
 [<ffffffff8006b921>] dump_trace+0x206/0x22f
 [<ffffffff8006b97e>] show_trace+0x34/0x47
 [<ffffffff800b9f8c>] softlockup_tick+0x15b/0x167
 [<ffffffff8001538b>] sync_buffer+0x0/0x3f
 [<ffffffff80097a6a>] update_process_times+0x4e/0x78
 [<ffffffff800776cb>] smp_local_timer_interrupt+0x2f/0x66
 [<ffffffff8001538b>] sync_buffer+0x0/0x3f
 [<ffffffff80077d08>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 [<ffffffff8808b53d>] :virtio_pci:vp_notify+0x0/0x1c
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff800122e6>] __do_softirq+0x51/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb3c>] do_softirq+0x2c/0x85
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8808b53d>] :virtio_pci:vp_notify+0x0/0x1c
 [<ffffffff8005aa91>] generic_unplug_device+0x31/0x32
 [<ffffffff880a5c2c>] :dm_mod:dm_table_unplug_all+0x3f/0x83
 [<ffffffff880a4955>] :dm_mod:dm_request+0x11d/0x124
 [<ffffffff880a3d80>] :dm_mod:dm_unplug_all+0x1d/0x28
 [<ffffffff800153c1>] sync_buffer+0x36/0x3f
 [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
 [<ffffffff8001538b>] sync_buffer+0x0/0x3f
 [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
 [<ffffffff88033da4>] :jbd:journal_commit_transaction+0x942/0x106a
 [<ffffffff8003dad1>] lock_timer_base+0x1b/0x3c
 [<ffffffff8803758c>] :jbd:kjournald+0xc1/0x213
 [<ffffffff800a00b7>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
 [<ffffffff880374cb>] :jbd:kjournald+0x0/0x213
 [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032950>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032852>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Comment 3 Avi Kivity 2011-08-11 09:01:56 UTC
Likely a Fedora 12 bug; unsupported.

Comment 5 David Gibson 2013-11-05 07:36:48 UTC
Could this be a duplicate of bug 750237?