Bug 755360

Summary:	Soft lockup for kernel-2.6.41.1-1.fc15 on Xen 4.0.1, but only on HP hardware
Product:	[Fedora] Fedora	Reporter:	Edward Z. Yang <ezyang>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED NOTABUG	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	15	CC:	gansalmon, itamar, jonathan, kernel-maint, ketuzsezr, madhu.chinakonda
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-11-21 18:09:13 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Edward Z. Yang 2011-11-20 22:10:59 UTC

Today, we upgraded to kernel-2.6.41.1-1.fc15 on our servers, and were surprised to find some (though not all) of them started soft-locking. We tracked down the difference, and it appears to be the case that only guests which were running on a Xen 4.0.1 hypervisor (dom0 running Debian Squeeze) on HP ProLiant DL380 G4. We think this may be due to some of the large VFS changes that occurred in 3.1.

This might be hard to reproduce. Please let us know any information you would like us to acquire.

Here is what the soft lockup looks like:

[  564.032009] BUG: soft lockup - CPU#0 stuck for 23s! [php-cgi:6769]
[  564.032009] Stack:
[  564.032009] Call Trace:
[  564.032009] Code: e5 66 66 66 66 90 48 c7 c0 60 01 01 00 65 48 03 04 25 a0 d3 00 00 ba 00 00 01 00 f0 0f c1 10 0f b7 ca c1 ea 10 39 d1 74 07 f3 90 <0f> b7 08 eb f5 5d c3 55 48 89 e5 66 66 66 66 90 48 c7 c0 60 01 
[  564.033011] BUG: soft lockup - CPU#1 stuck for 23s! [php-cgi:6775]
[  564.033011] Stack:
[  564.033011] Call Trace:
[  564.033011] Code: e5 66 66 66 66 90 48 c7 c0 60 01 01 00 65 48 03 04 25 a0 d3 00 00 ba 00 00 01 00 f0 0f c1 10 0f b7 ca c1 ea 10 39 d1 74 07 f3 90 <0f> b7 08 eb f5 5d c3 55 48 89 e5 66 66 66 66 90 48 c7 c0 60 01 
[  585.367401] SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) 
[  591.140322] SysRq : Show State
[  595.320360] SysRq : Show backtrace of all active CPUs
[  598.394010] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 0} (detected by 1, t=60002 jiffies)

Here are some representative traces:

[ 1648.032009] BUG: soft lockup - CPU#0 stuck for 22s! [php-cgi:6769]
[ 1648.032009] Modules linked in: capi kernelcapi microcode xen_netfront openafs(P) ipv6 xen_blkfront [last unloaded: scsi_wait_scan]
[ 1648.032009] CPU 0 
[ 1648.032009] Modules linked in: capi kernelcapi microcode xen_netfront openafs(P) ipv6 xen_blkfront [last unloaded: scsi_wait_scan]
[ 1648.032009] 
[ 1648.032009] Pid: 6769, comm: php-cgi Tainted: P            2.6.41.1-1.fc15.x86_64 #1  
[ 1648.032009] RIP: e030:[<ffffffff8113f770>]  [<ffffffff8113f770>] vfsmount_lock_local_lock+0x2e/0x35
[ 1648.032009] RSP: e02b:ffff8800b43e1d48  EFLAGS: 00000297
[ 1648.032009] RAX: ffff8800fff76160 RBX: ffff8800b4385000 RCX: 000000000000d26e
[ 1648.032009] RDX: 000000000000d26f RSI: ffff8800b4385000 RDI: 00000000ffffff9c
[ 1648.032009] RBP: ffff8800b43e1d48 R08: ffff8800b43e1e00 R09: ffff880004339110
[ 1648.032009] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
[ 1648.032009] R13: ffff8800b43e1e48 R14: ffff8800b43e1f28 R15: ffff8800b43e1e00
[ 1648.032009] FS:  00007fa4af3517e0(0000) GS:ffff8800fff66000(0000) knlGS:0000000000000000
[ 1648.032009] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1648.032009] CR2: 0000003fb1a32810 CR3: 00000000a9543000 CR4: 0000000000000660
[ 1648.032009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1648.032009] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1648.032009] Process php-cgi (pid: 6769, threadinfo ffff8800b43e0000, task ffff8800a94bdcc0)
[ 1648.032009] Stack:
[ 1648.032009]  ffff8800b43e1da8 ffffffff81133d06 0000000000000000 0000000000000000
[ 1648.032009]  ffff8800bbf32300 0000000000000001 ffff8800b43e1d88 ffff8800b43e1e48
[ 1648.032009]  ffff8800b43e1f28 0000000000000001 ffff8800b43e1f28 ffff8800b4385000
[ 1648.032009] Call Trace:
[ 1648.032009]  [<ffffffff81133d06>] path_init+0xcf/0x2a2
[ 1648.032009]  [<ffffffff8113582a>] path_openat+0x8b/0x310
[ 1648.032009]  [<ffffffff810fc548>] ? handle_mm_fault+0x1c8/0x1db
[ 1648.032009]  [<ffffffff81135bac>] do_filp_open+0x38/0x86
[ 1648.032009]  [<ffffffff81043ffb>] ? should_resched+0xe/0x2d
[ 1648.032009]  [<ffffffff8149b2cd>] ? _cond_resched+0xe/0x22
[ 1648.032009]  [<ffffffff8124ceb1>] ? __strncpy_from_user+0x1e/0x48
[ 1648.032009]  [<ffffffff8113f164>] ? alloc_fd+0x72/0x11d
[ 1648.032009]  [<ffffffff81129010>] do_sys_open+0x6e/0x100
[ 1648.032009]  [<ffffffff810a37c8>] ? audit_syscall_entry+0x145/0x171
[ 1648.032009]  [<ffffffff811290c2>] sys_open+0x20/0x22
[ 1648.032009]  [<ffffffff814a3102>] system_call_fastpath+0x16/0x1b
[ 1648.032009] Code: e5 66 66 66 66 90 48 c7 c0 60 01 01 00 65 48 03 04 25 a0 d3 00 00 ba 00 00 01 00 f0 0f c1 10 0f b7 ca c1 ea 10 39 d1 74 07 f3 90 <0f> b7 08 eb f5 5d c3 55 48 89 e5 66 66 66 66 90 48 c7 c0 60 01 
[ 1648.032009] Call Trace:
[ 1648.032009]  [<ffffffff81133d06>] path_init+0xcf/0x2a2
[ 1648.032009]  [<ffffffff8113582a>] path_openat+0x8b/0x310
[ 1648.032009]  [<ffffffff810fc548>] ? handle_mm_fault+0x1c8/0x1db
[ 1648.032009]  [<ffffffff81135bac>] do_filp_open+0x38/0x86
[ 1648.032009]  [<ffffffff81043ffb>] ? should_resched+0xe/0x2d
[ 1648.032009]  [<ffffffff8149b2cd>] ? _cond_resched+0xe/0x22
[ 1648.032009]  [<ffffffff8124ceb1>] ? __strncpy_from_user+0x1e/0x48
[ 1648.032009]  [<ffffffff8113f164>] ? alloc_fd+0x72/0x11d
[ 1648.032009]  [<ffffffff81129010>] do_sys_open+0x6e/0x100
[ 1648.032009]  [<ffffffff810a37c8>] ? audit_syscall_entry+0x145/0x171
[ 1648.032009]  [<ffffffff811290c2>] sys_open+0x20/0x22
[ 1648.032009]  [<ffffffff814a3102>] system_call_fastpath+0x16/0x1b

One of our team members thinks http://lists.xen.org/archives/html/xen-devel/2011-10/msg01830.html might be related.

Comment 1 Edward Z. Yang 2011-11-20 22:38:56 UTC

For comparison, we did NOT get soft locks on our Supermicro X8DT3. Hardware is the only difference; we setup our hypervisor and servers identically.

Comment 2 Edward Z. Yang 2011-11-21 18:09:13 UTC

These locks appear to be stemming from an OpenAFS module which compiled with the new kernel headers, but were actually buggy due to changes in how RCU locking works.

We applied these patches and the soft locks went away:

http://gerrit.openafs.org/6088
http://git.openafs.org/?p=openafs.git;a=commit;h=cbaefa266d433af3b9a082a360e23a42f161d80f

Not of interest to Fedora since you guys don't package OpenAFS.