Red Hat Bugzilla – Bug 888380
Almost all CPU time spent in _raw_spin_lock_irqsave
Last modified: 2013-01-03 13:33:00 EST
Description of problem:
Top shows nearly 100% system time on all CPUs, and some commands such as 'ps auxw' or 'cat /proc/1234/cmdline' hang in an unkillable state. Running 'perf top' shows this:
Samples: 10M of event 'cycles', Event count (approx.): 1034786006755
82.37% [kernel] [k] _raw_spin_lock_irqsave
8.60% libjvm.so [.] SpinPause
2.97% libjvm.so [.] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
2.32% [kernel] [k] compact_zone
1.35% [kernel] [k] migrate_pages
0.67% [kernel] [k] compact_checklock_irqsave.isra.15
0.54% [kernel] [k] __zone_watermark_ok
0.19% [kernel] [k] isolate_migratepages_range
0.17% [kernel] [k] _raw_spin_unlock_irqrestore
Version-Release number of selected component (if applicable):
Not very. Our server runs Java processes using UJMP for multi-threaded matrix calculations. It gets into this state every half an hour or so, and then many processes freeze (e.g., 'pstree -a' hangs but 'pstree -p' does not, memtester hangs failing to mlock, etc.) until the Java jobs are killed.
If it were easy to reproduce, it would make a nice DoS attack.
The processes are big, allocating around 20GB of virtual memory, 10GB resident according to top, five or six threads each, but the machine has plenty of RAM, disk, and CPU.
All CPUs (24 with dual hexa-core processors and hyperthreading) are showing 99.7%sy (or similar) in 'top' and many processes freeze in kernel calls.
We noticed khugepaged taking up a good bit of CPU when the machine is in a bad state and found similar bugs reported recently, e.g., bug 879801.
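Since khugepaged is part of the transparent hugepage (THP) machinery, it may help to record the active THP mode alongside a report like this one. A minimal sketch, assuming the mainline sysfs location /sys/kernel/mm/transparent_hugepage (some distribution kernels expose it under a different path); the helper names are illustrative only:

```python
import re
from pathlib import Path

# Assumed mainline sysfs location; adjust if your kernel uses another path.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def selected_option(text):
    """Return the bracketed choice from a line like 'always madvise [never]'."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def thp_settings(base=THP_DIR):
    """Read the active 'enabled' and 'defrag' modes; None where unreadable."""
    settings = {}
    for name in ("enabled", "defrag"):
        try:
            settings[name] = selected_option((Path(base) / name).read_text())
        except OSError:
            # File missing or not readable (e.g. THP not built into the kernel).
            settings[name] = None
    return settings
```

Calling thp_settings() returns a small dict that can be pasted into the report; a value of None just means the file could not be read at the assumed path.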
Running "sync && /sbin/sysctl vm.drop_caches=3" cures the bad state(!), but only temporarily: it locks up again after a while.
I grepped some potentially relevant lines from /proc/vmstat while in a bad state. These are 10 seconds apart:
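For anyone wanting to collect comparable samples, here is a minimal sketch that reads /proc/vmstat twice and reports which counters moved in between. The prefixes "compact_" and "thp_" are an assumption about which counters are relevant here; the exact counter names vary by kernel version, and the function names are illustrative:

```python
import time

# Assumed prefixes of interest; not every kernel exposes the same counters.
PREFIXES = ("compact_", "thp_")


def parse_vmstat(text, prefixes=PREFIXES):
    """Parse 'name value' lines, keeping only counters with matching prefixes."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith(prefixes) and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def delta(before, after):
    """Return the change for every counter present in both samples that moved."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def watch(interval=10, path="/proc/vmstat"):
    """Take two samples 'interval' seconds apart and return the differences."""
    with open(path) as f:
        first = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_vmstat(f.read())
    return delta(first, second)
```

Running watch() while the machine is in the bad state and again while it is healthy gives two sets of deltas to compare.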
Typo... try https://bugzilla.redhat.com/show_bug.cgi?id=879801
Closing this bug as a dup for now.
*** This bug has been marked as a duplicate of bug 879801 ***