Bug 587265

Summary: [abrt] crash in kernel: BUG: soft lockup - CPU#1 stuck for 4096s! [sync_supers:18]
Product: Red Hat Enterprise Linux 6
Reporter: Andrew Hecox <ahecox>
Component: kernel
Assignee: Andrew Jones <drjones>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: low
Version: 6.0
CC: drjones, mgahagan, nenad
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: abrt_hash:501133792
Doc Type: Bug Fix
Last Closed: 2010-11-16 09:07:37 UTC

Attachments: sos from domU

Description Andrew Hecox 2010-04-29 13:53:26 UTC
abrt 1.0.7 detected a crash.

architecture: x86_64
cmdline: not_applicable
component: kernel
executable: kernel
kernel: 2.6.32-19.el6.x86_64
package: kernel
reason: BUG: soft lockup - CPU#1 stuck for 4096s! [sync_supers:18]
release: Red Hat Enterprise Linux release 6.0 Beta (Santiago)

kerneloops
-----
BUG: soft lockup - CPU#1 stuck for 4096s! [sync_supers:18]
Modules linked in: autofs4(U) sunrpc(U) ip6t_REJECT(U) nf_conntrack_ipv6(U) ip6table_filter(U) ip6_tables(U) ipv6(U) dm_mirror(U) dm_region_hash(U) dm_log(U) joydev(U) xen_netfront(U) ext4(U) mbcache(U) jbd2(U) xen_blkfront(U) dm_mod(U) [last unloaded: scsi_wait_scan]
CPU 1:
Modules linked in: autofs4(U) sunrpc(U) ip6t_REJECT(U) nf_conntrack_ipv6(U) ip6table_filter(U) ip6_tables(U) ipv6(U) dm_mirror(U) dm_region_hash(U) dm_log(U) joydev(U) xen_netfront(U) ext4(U) mbcache(U) jbd2(U) xen_blkfront(U) dm_mod(U) [last unloaded: scsi_wait_scan]
Pid: 18, comm: sync_supers Not tainted 2.6.32-19.el6.x86_64 #1 
RIP: e030:[<ffffffff8100922a>]  [<ffffffff8100922a>] hypercall_page+0x22a/0x1010
RSP: e02b:ffff88007d13fd50  EFLAGS: 00000246
RAX: 0000000000030001 RBX: ffff880003795400 RCX: ffffffff8100922a
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88007d13fd68 R08: ffff88007d13e000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: ffff88007d13d540
R13: ffff88007d337540 R14: 0000000000000001 R15: ffff880004324100
FS:  00007fbb108657c0(0000) GS:ffff88000430e000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f1c9d489000 CR3: 0000000003528000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
Call Trace:
[<ffffffff8100f19d>] ? xen_force_evtchn_callback+0xd/0x10
[<ffffffff8100f9c2>] check_events+0x12/0x20
[<ffffffff8100f969>] ? xen_irq_enable_direct_end+0x0/0x7
[<ffffffff810566e3>] ? finish_task_switch+0x53/0xd0
[<ffffffff814c04ce>] thread_return+0x4e/0x740
[<ffffffff811224e0>] ? bdi_sync_supers+0x0/0x60
[<ffffffff8112251b>] bdi_sync_supers+0x3b/0x60
[<ffffffff8108d8a6>] kthread+0x96/0xa0
[<ffffffff810141ca>] child_rip+0xa/0x20
[<ffffffff81013391>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff81013b1d>] ? retint_restore_args+0x5/0x6
[<ffffffff810141c0>] ? child_rip+0x0/0x20

Comment 1 Andrew Hecox 2010-04-29 14:03:22 UTC
Created attachment 410117 [details]
sos from domU

Comment 3 RHEL Program Management 2010-04-29 15:25:39 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release. Product Management has requested further
review of this request by Red Hat Engineering for potential inclusion in a Red
Hat Enterprise Linux major release. This request is not yet committed for
inclusion.

Comment 4 Andrew Jones 2010-04-30 09:00:00 UTC
There's an ongoing discussion about this upstream on xen-devel:

http://lists.xensource.com/archives/html/xen-devel/2010-03/msg01561.html

The machine in that thread and the machine this bug is reported against are both AMD:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 5
model name	: AMD Opteron(tm) Processor 250
stepping	: 10
cpu MHz		: 2393.180
cache size	: 1024 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good
bogomips	: 4786.36
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts fid vid ttp


This machine has 2 vcpus and the one in the xen-devel thread has 3. While I don't have much confidence it will help, it would probably be worth testing the guest with only 1 vcpu. Can you try running this guest again with a single vcpu? (A quick way to confirm the change from inside the guest is sketched below.)
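
Not from this report, just a minimal sketch assuming a Linux domU with the usual /proc/cpuinfo layout:

#!/usr/bin/env python
# Illustrative only (not from this report): count the vcpus the guest
# actually sees, so a "1 vcpu" domU config change can be verified.

def online_vcpus(cpuinfo="/proc/cpuinfo"):
    # Each online cpu shows up as one "processor : N" stanza.
    count = 0
    with open(cpuinfo) as f:
        for line in f:
            if line.startswith("processor"):
                count += 1
    return count

if __name__ == "__main__":
    print("guest sees %d vcpu(s)" % online_vcpus())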

Comment 5 Andrew Jones 2010-04-30 09:16:47 UTC
Oops, I think I shot from the hip a bit too fast in my last comment. Now that I'm looking closer, the upstream issue doesn't appear to be related, so sorry for the noise. This bug does look like something I've seen before, though: bug 550724. I say that because I poked at the sos report and the dmesg output shows all tasks getting stuck, i.e. locked up in D state (a quick way to spot such tasks is sketched below). I'll look closer before I make my next comment :-)  Also, is it possible for me to get access to this machine?
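
Just a sketch, not output from the sos report: one way to list tasks stuck in D state on a live system, assuming a standard /proc layout (ps or a sysrq-t dump shows the same thing):

#!/usr/bin/env python
# Illustrative only: walk /proc and list tasks stuck in uninterruptible
# sleep (D state), the symptom visible in the dmesg from the sos report.
import os

def d_state_tasks(proc="/proc"):
    stuck = []
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue
        try:
            with open(os.path.join(proc, pid, "stat")) as f:
                data = f.read()
        except (IOError, OSError):
            continue  # task exited while we were scanning
        # comm is wrapped in parentheses and may contain spaces; the
        # single-letter state field follows the closing parenthesis.
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data[data.rindex(")") + 1:].split()[0]
        if state == "D":
            stuck.append((pid, comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print("%s  %s" % (pid, comm))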

Comment 6 Andrew Hecox 2010-05-02 21:40:12 UTC
Yeah, access is not a problem. Should I switch back to 2 vCPUs?

I think this might have happened right after the post-install NTP sync, if that helps.

Comment 9 RHEL Program Management 2010-07-15 14:50:47 UTC
This issue was proposed at a time when only blocker issues are being
considered for the current Red Hat Enterprise Linux release, and it has
been denied for the current release.

** If you would still like this issue considered for the current
release, ask your support representative to file it as a blocker on
your behalf. Otherwise, ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 10 Andrew Jones 2010-08-02 15:55:36 UTC
Resetting this to 6.1. We need a reliable reproducer to work on it.

Comment 11 Andrew Jones 2010-09-20 08:43:18 UTC
This is likely a dup of bug 550724. Are you still seeing this problem? If so, please try running with irqbalance turned off and see whether the lockup goes away so we can dup it (a quick check that irqbalance really is stopped is sketched below).
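
Illustrative only, assuming the usual /proc layout: a pre-flight check that no irqbalance process is still running before the re-test.

#!/usr/bin/env python
# Illustrative pre-flight check (not from this report): confirm that no
# irqbalance process is running before re-testing for the soft lockup.
import os

def irqbalance_pids(proc="/proc"):
    pids = []
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue
        try:
            with open(os.path.join(proc, pid, "stat")) as f:
                data = f.read()
        except (IOError, OSError):
            continue
        comm = data[data.index("(") + 1:data.rindex(")")]
        if comm == "irqbalance":
            pids.append(pid)
    return pids

if __name__ == "__main__":
    running = irqbalance_pids()
    if running:
        print("irqbalance still running (pid %s)" % ", ".join(running))
    else:
        print("irqbalance is not running")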

Thanks,
Drew

Comment 13 Andrew Jones 2010-11-16 09:07:37 UTC
Closing this as a dup of bug 550724. If the latest kernels still show the problem, the bug should be reopened with more information.

*** This bug has been marked as a duplicate of bug 550724 ***