Bug 888380

Summary:	Almost all CPU time spent in _raw_spin_lock_irqsave
Product:	[Fedora] Fedora	Reporter:	r3obh <Robert.Harley>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED DUPLICATE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	17	CC:	gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-01-03 18:33:00 UTC	Type:	Bug

Description r3obh 2012-12-18 15:32:53 UTC

Description of problem:
'top' shows nearly 100% system time on all CPUs, and some commands such as 'ps auxw' or 'cat /proc/1234/cmdline' hang in an unkillable state.  Running 'perf top' shows this:

Samples: 10M of event 'cycles', Event count (approx.): 1034786006755
 82.37%  [kernel]                      [k] _raw_spin_lock_irqsave
  8.60%  libjvm.so                     [.] SpinPause
  2.97%  libjvm.so                     [.] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
  2.32%  [kernel]                      [k] compact_zone
  1.35%  [kernel]                      [k] migrate_pages
  0.67%  [kernel]                      [k] compact_checklock_irqsave.isra.15
  0.54%  [kernel]                      [k] __zone_watermark_ok
  0.19%  [kernel]                      [k] isolate_migratepages_range
  0.17%  [kernel]                      [k] _raw_spin_unlock_irqrestore
[...]

Version-Release number of selected component (if applicable):
kernel-3.6.9-2.fc17.x86_64

How reproducible:
Not very.  Our server runs Java processes that use UJMP for multi-threaded matrix calculations.  It gets into this state every half hour or so, and then many processes freeze (e.g., 'pstree -a' hangs but 'pstree -p' does not, memtester hangs failing to mlock, etc.) until the Java jobs are killed.

If it were easy to reproduce, it would make a nice DoS attack.

The processes are big: each allocates around 20GB of virtual memory, with 10GB resident according to top, and runs five or six threads.  The machine has plenty of RAM, disk, and CPU.
  
Actual results:
All CPUs (24 with dual hexa-core processors and hyperthreading) are showing 99.7%sy (or similar) in 'top' and many processes freeze in kernel calls.

Expected results:
Not hanging!

Comment 1 r3obh 2012-12-19 14:41:04 UTC

We noticed khugepaged taking up a good bit of CPU when the machine is in a bad state, and found similar bugs reported recently, e.g., bug 879801.

Running "sync && /sbin/sysctl vm.drop_caches=3" cures the bad state(!), but only temporarily: the machine locks up again after a while.

I grepped some potentially relevant lines from /proc/vmstat while in a bad state.  These are 10 seconds apart:

compact_blocks_moved 558689782497
compact_pages_moved 249367966
compact_pagemigrate_failed 45361973
compact_stall 1006717
compact_fail 922930
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157130
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797

compact_blocks_moved 558713450361
compact_pages_moved 249368093
compact_pagemigrate_failed 45361995
compact_stall 1006719
compact_fail 922932
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157132
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
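
For readers who want the change over the 10-second window rather than the raw totals, here is a minimal sketch that diffs the two excerpts above.  The helper names (parse_vmstat, vmstat_delta) are made up for illustration and are not part of any tool; the numbers are copied from the two snapshots.

```python
# Illustrative only: diff the two /proc/vmstat excerpts quoted above.
# parse_vmstat and vmstat_delta are hypothetical helper names.

FIRST = """\
compact_blocks_moved 558689782497
compact_pages_moved 249367966
compact_pagemigrate_failed 45361973
compact_stall 1006717
compact_fail 922930
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157130
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
"""

SECOND = """\
compact_blocks_moved 558713450361
compact_pages_moved 249368093
compact_pagemigrate_failed 45361995
compact_stall 1006719
compact_fail 922932
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157132
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
"""


def parse_vmstat(text):
    """Turn 'name value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def vmstat_delta(before, after):
    """Per-counter increase between two snapshots (counters present in both)."""
    return {name: after[name] - before[name] for name in before if name in after}


deltas = vmstat_delta(parse_vmstat(FIRST), parse_vmstat(SECOND))
for name, delta in deltas.items():
    print(f"{name:28s} {delta:>12d}")
```

In this window compact_blocks_moved grows by 23667864 while compact_pages_moved grows by only 127, and the 2 new compact_stall events are matched by 2 new compact_fail events with no new compact_success.  That seems consistent with compaction doing a lot of work for very little result, though this is only a reading of the counters and not a diagnosis.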

Comment 2 r3obh 2012-12-19 17:05:08 UTC

Typo... try https://bugzilla.redhat.com/show_bug.cgi?id=879801

Comment 3 Josh Boyer 2013-01-03 18:33:00 UTC

Closing this bug as a dup for now.

*** This bug has been marked as a duplicate of bug 879801 ***