Bug 888380 - Almost all CPU time spent in _raw_spin_lock_irqsave
Status: CLOSED DUPLICATE of bug 879801
Product: Fedora
Classification: Fedora
Component: kernel
Version: 17
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-12-18 10:32 EST by r3obh
Modified: 2013-01-03 13:33 EST
Doc Type: Bug Fix
Last Closed: 2013-01-03 13:33:00 EST
Type: Bug
Description r3obh 2012-12-18 10:32:53 EST
Description of problem:
'top' shows nearly 100% system time on all CPUs, and some commands, such as 'ps auxw' or 'cat /proc/1234/cmdline', hang in an unkillable state.  Running 'perf top' shows this:

Samples: 10M of event 'cycles', Event count (approx.): 1034786006755                                                       
 82.37%  [kernel]                      [k] _raw_spin_lock_irqsave
  8.60%  libjvm.so                     [.] SpinPause
  2.97%  libjvm.so                     [.] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
  2.32%  [kernel]                      [k] compact_zone
  1.35%  [kernel]                      [k] migrate_pages
  0.67%  [kernel]                      [k] compact_checklock_irqsave.isra.15
  0.54%  [kernel]                      [k] __zone_watermark_ok
  0.19%  [kernel]                      [k] isolate_migratepages_range
  0.17%  [kernel]                      [k] _raw_spin_unlock_irqrestore
[...]
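
To see at a glance how the listed samples split between the kernel and the JVM, the 'perf top' lines can be summed per object. This is only a rough sketch: the function name and the regular expression are illustrative assumptions, and because the listing above is truncated the totals cover only the symbols shown.

```python
import re

# The lines quoted from 'perf top' above (the listing is truncated).
PERF_TOP = """\
 82.37%  [kernel]                      [k] _raw_spin_lock_irqsave
  8.60%  libjvm.so                     [.] SpinPause
  2.97%  libjvm.so                     [.] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
  2.32%  [kernel]                      [k] compact_zone
  1.35%  [kernel]                      [k] migrate_pages
  0.67%  [kernel]                      [k] compact_checklock_irqsave.isra.15
  0.54%  [kernel]                      [k] __zone_watermark_ok
  0.19%  [kernel]                      [k] isolate_migratepages_range
  0.17%  [kernel]                      [k] _raw_spin_unlock_irqrestore
"""

# percentage, object name, one-character symbol type in brackets, symbol
LINE = re.compile(r"^\s*([\d.]+)%\s+(\S+)\s+\[(.)\]\s+(.+)$")


def share_by_object(text):
    """Sum the sample percentages per object (hypothetical helper)."""
    totals = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip headers, blank lines and the '[...]' marker
        pct, obj = float(match.group(1)), match.group(2)
        totals[obj] = totals.get(obj, 0.0) + pct
    return {obj: round(total, 2) for obj, total in totals.items()}


if __name__ == "__main__":
    for obj, pct in share_by_object(PERF_TOP).items():
        print(f"{obj:12s} {pct:6.2f}%")
```

For the lines shown, the kernel symbols add up to 87.61% and the libjvm.so symbols to 11.57%.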

Version-Release number of selected component (if applicable):
kernel-3.6.9-2.fc17.x86_64

How reproducible:
Not very.  Our server runs Java processes that use UJMP for multi-threaded matrix calculations. It gets into this state every half hour or so, and many processes then freeze until the Java jobs are killed (e.g., 'pstree -a' hangs but 'pstree -p' does not, and memtester hangs while failing to mlock).

If it were easy to reproduce, it would make a nice DoS attack.

The processes are big: each allocates around 20GB of virtual memory (10GB resident according to top) and runs five or six threads, but the machine has plenty of RAM, disk, and CPU.
  
Actual results:
All 24 CPUs (dual hexa-core processors with hyperthreading) show 99.7%sy (or similar) in 'top', and many processes freeze in kernel calls.

Expected results:
Not hanging!
Comment 1 r3obh 2012-12-19 09:41:04 EST
We noticed khugepaged taking up a good bit of CPU when the machine is in a bad state and found similar bugs reported recently, e.g., 879801 (https://bugzilla.redhat.com/show_bug.cgi?id=).

Running "sync && /sbin/sysctl vm.drop_caches=3" cures the bad state(!), but only temporarily: it locks up again after a while.
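
Since khugepaged shows up in the bad state, it may be useful to record the transparent hugepage settings along with the other data. The sketch below assumes the settings live under /sys/kernel/mm/transparent_hugepage, with the active choice shown in square brackets (e.g. "[always] madvise never"); the path may differ on some distribution kernels, and the helper names are made up for illustration.

```python
from pathlib import Path

# Assumed location of the THP controls; not confirmed for every kernel build.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def selected(option_line):
    """Return the bracketed choice from a line like '[always] madvise never'."""
    for word in option_line.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None  # nothing was bracketed


def thp_settings(base=THP_DIR):
    """Read the 'enabled' and 'defrag' settings if the files exist."""
    result = {}
    for name in ("enabled", "defrag"):
        path = base / name
        if path.is_file():
            result[name] = selected(path.read_text())
    return result


if __name__ == "__main__":
    print(thp_settings())
```

The script only reads the settings and changes nothing, so it can be run without root while the machine is in the bad state.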

I grepped some potentially relevant lines from /proc/vmstat while in a bad state.  The two samples below were taken 10 seconds apart:

compact_blocks_moved 558689782497
compact_pages_moved 249367966
compact_pagemigrate_failed 45361973
compact_stall 1006717
compact_fail 922930
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157130
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797

compact_blocks_moved 558713450361
compact_pages_moved 249368093
compact_pagemigrate_failed 45361995
compact_stall 1006719
compact_fail 922932
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157132
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
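
The change between the two samples is easier to read as per-counter differences. A small sketch of that subtraction follows, using the values quoted above; the function names are illustrative only, and the same parser should work on the "name value" lines of /proc/vmstat.

```python
FIRST = """\
compact_blocks_moved 558689782497
compact_pages_moved 249367966
compact_pagemigrate_failed 45361973
compact_stall 1006717
compact_fail 922930
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157130
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
"""

SECOND = """\
compact_blocks_moved 558713450361
compact_pages_moved 249368093
compact_pagemigrate_failed 45361995
compact_stall 1006719
compact_fail 922932
compact_success 82339
thp_fault_alloc 1592298
thp_fault_fallback 1157132
thp_collapse_alloc 25907
thp_collapse_alloc_failed 3520
thp_split 4797
"""


def parse_vmstat(text):
    """Turn 'name value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def deltas(before, after):
    """Difference (after - before) for every counter present in both samples."""
    return {name: after[name] - before[name] for name in before if name in after}


if __name__ == "__main__":
    diff = deltas(parse_vmstat(FIRST), parse_vmstat(SECOND))
    for name, change in diff.items():
        print(f"{name:28s} {change:+d}")
```

Over the 10 seconds, compact_blocks_moved rises by 23667864 while compact_pages_moved rises by only 127; compact_stall, compact_fail and thp_fault_fallback each rise by 2, compact_pagemigrate_failed by 22, and compact_success does not change.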
Comment 2 r3obh 2012-12-19 12:05:08 EST
Typo... try https://bugzilla.redhat.com/show_bug.cgi?id=879801
Comment 3 Josh Boyer 2013-01-03 13:33:00 EST
Closing this bug as a dup for now.

*** This bug has been marked as a duplicate of bug 879801 ***
