888380 – Almost all CPU time spent in _raw_spin_lock_irqsave

Status:	CLOSED DUPLICATE of bug 879801
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance

Reported:	2012-12-18 15:32 UTC by r3obh
Modified:	2013-01-03 18:33 UTC
CC List:	5 users
Last Closed:	2013-01-03 18:33:00 UTC
Type:	Bug

Description r3obh 2012-12-18 15:32:53 UTC

Description of problem:
'top' shows nearly 100% system time on all CPUs, and some commands such as 'ps auxw' or 'cat /proc/1234/cmdline' hang in an unkillable state.  Running 'perf top' shows this:

Samples: 10M of event 'cycles', Event count (approx.): 1034786006755
 82.37%  [kernel]                      [k] _raw_spin_lock_irqsave
  8.60%  libjvm.so                     [.] SpinPause
  2.97%  libjvm.so                     [.] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
  2.32%  [kernel]                      [k] compact_zone
  1.35%  [kernel]                      [k] migrate_pages
  0.67%  [kernel]                      [k] compact_checklock_irqsave.isra.15
  0.54%  [kernel]                      [k] __zone_watermark_ok
  0.19%  [kernel]                      [k] isolate_migratepages_range
  0.17%  [kernel]                      [k] _raw_spin_unlock_irqrestore
[...]
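
To put the listing above in proportion, here is a minimal Python sketch that sums the sampled percentages by address space. It assumes the 'perf top' line layout shown above (percentage, object, a [k] or [.] marker, symbol); the names PERF_TOP, LINE and totals_by_space are made up for this illustration, and the object column is re-spaced for brevity.

```python
import re

# Lines copied from the 'perf top' listing above.
PERF_TOP = """\
 82.37%  [kernel]   [k] _raw_spin_lock_irqsave
  8.60%  libjvm.so  [.] SpinPause
  2.97%  libjvm.so  [.] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
  2.32%  [kernel]   [k] compact_zone
  1.35%  [kernel]   [k] migrate_pages
  0.67%  [kernel]   [k] compact_checklock_irqsave.isra.15
  0.54%  [kernel]   [k] __zone_watermark_ok
  0.19%  [kernel]   [k] isolate_migratepages_range
  0.17%  [kernel]   [k] _raw_spin_unlock_irqrestore
"""

# Assumed layout: "<pct>%  <object>  [<k or .>] <symbol>"
LINE = re.compile(r"^\s*([\d.]+)%\s+(\S+)\s+\[(.)\]\s+(.+)$")


def totals_by_space(text):
    """Sum sample percentages for kernel ([k]) and user ([.]) symbols."""
    totals = {"kernel": 0.0, "user": 0.0}
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip headers, blank lines and "[...]"
        space = "kernel" if match.group(3) == "k" else "user"
        totals[space] += float(match.group(1))
    return totals


if __name__ == "__main__":
    for space, pct in totals_by_space(PERF_TOP).items():
        print(f"{space:6} {pct:6.2f}%")
```

Of the symbols listed, the kernel ones add up to roughly 87.6% of samples and the two libjvm.so ones to roughly 11.6%; the entries elided by "[...]" are not counted.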

Version-Release number of selected component (if applicable):
kernel-3.6.9-2.fc17.x86_64

How reproducible:
Not very.  Our server runs Java processes that use UJMP to do multi-threaded matrix calculations.  It gets into this state every half an hour or so, and then many processes freeze (e.g., 'pstree -a' hangs but 'pstree -p' does not, memtester hangs failing to mlock, etc.) until the Java jobs are killed.

If it were easy to reproduce, it would make a nice DoS attack.

The processes are big: each allocates around 20GB of virtual memory, with about 10GB resident according to top, and runs five or six threads.  The machine has plenty of RAM, disk, and CPU, though.

Actual results:
All CPUs (24 with dual hexa-core processors and hyperthreading) are showing 99.7%sy (or similar) in 'top' and many processes freeze in kernel calls.

Expected results:
Not hanging!

Comment 1 r3obh 2012-12-19 14:41:04 UTC

We noticed khugepaged taking up a good bit of CPU when the machine is in a bad state and found similar bugs reported recently, e.g., bug 879801.

Running "sync && /sbin/sysctl vm.drop_caches=3" cures the bad state(!), but only temporarily: the machine locks up again after a while.

I grepped some potentially relevant lines from /proc/vmstat while in a bad state.  These are 10 seconds apart:

compact_blocks_moved 558689782497
compact_pages_moved 249367966
compact_pagemigrate_failed 45361973
compact_stall 1006717
compact_fail 922930
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157130
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797

compact_blocks_moved 558713450361
compact_pages_moved 249368093
compact_pagemigrate_failed 45361995
compact_stall 1006719
compact_fail 922932
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157132
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
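
The two snapshots above are easier to compare as differences. The following is a minimal Python sketch that parses both and prints how much each counter grew over the 10 seconds; the helper names (parse_vmstat, counter_deltas) and the constants FIRST and SECOND are made up for this illustration, and the values are pasted from this comment rather than read live from /proc/vmstat.

```python
# First snapshot, copied from the comment above.
FIRST = """\
compact_blocks_moved 558689782497
compact_pages_moved 249367966
compact_pagemigrate_failed 45361973
compact_stall 1006717
compact_fail 922930
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157130
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
"""

# Second snapshot, taken 10 seconds later.
SECOND = """\
compact_blocks_moved 558713450361
compact_pages_moved 249368093
compact_pagemigrate_failed 45361995
compact_stall 1006719
compact_fail 922932
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157132
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
"""


def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat layout) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before, after):
    """Return after - before for every counter present in both snapshots."""
    return {name: after[name] - before[name] for name in before if name in after}


if __name__ == "__main__":
    deltas = counter_deltas(parse_vmstat(FIRST), parse_vmstat(SECOND))
    for name, delta in deltas.items():
        print(f"{name:28} {delta:>12}")
```

Over the interval, compact_blocks_moved grows by about 23.7 million while compact_pages_moved grows by only 127, and compact_stall and compact_fail each grow by 2 with no change in compact_success.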

Comment 2 r3obh 2012-12-19 17:05:08 UTC

Typo... try https://bugzilla.redhat.com/show_bug.cgi?id=879801

Comment 3 Josh Boyer 2013-01-03 18:33:00 UTC

Closing this bug as a dup for now.

*** This bug has been marked as a duplicate of bug 879801 ***
