Description of problem:
top shows nearly 100% system time on all CPUs, and some commands such as 'ps auxw' or 'cat /proc/1234/cmdline' hang in an unkillable state. Running 'perf top' shows this:

    Samples: 10M of event 'cycles', Event count (approx.): 1034786006755
     82.37%  [kernel]   [k] _raw_spin_lock_irqsave
      8.60%  libjvm.so  [.] SpinPause
      2.97%  libjvm.so  [.] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
      2.32%  [kernel]   [k] compact_zone
      1.35%  [kernel]   [k] migrate_pages
      0.67%  [kernel]   [k] compact_checklock_irqsave.isra.15
      0.54%  [kernel]   [k] __zone_watermark_ok
      0.19%  [kernel]   [k] isolate_migratepages_range
      0.17%  [kernel]   [k] _raw_spin_unlock_irqrestore
    [...]

Version-Release number of selected component (if applicable):
kernel-3.6.9-2.fc17.x86_64

How reproducible:
Not very. Our server is running Java processes with UJMP, doing multi-threaded matrix calculations. It gets into this state every half an hour or so, and then many processes freeze (e.g., 'pstree -a' freezes but 'pstree -p' does not, memtester hangs failing to mlock, etc.) until the Java jobs are killed. If it were easy to reproduce, it would make a nice DoS attack. The processes are big, each allocating around 20GB of virtual memory with 10GB resident according to top and five or six threads, but the machine has plenty of RAM, disk, and CPU.

Actual results:
All CPUs (24, with dual hexa-core processors and hyperthreading) show 99.7%sy (or similar) in 'top', and many processes freeze in kernel calls.

Expected results:
Not hanging!
We noticed khugepaged taking up a good bit of CPU when the machine is in a bad state and found similar bugs reported recently, e.g., <a href="https://bugzilla.redhat.com/show_bug.cgi?id=">879801</a>. Running "sync && /sbin/sysctl vm.drop_caches=3" cures the bad state(!), temporarily though... it locks up again after a while.

I grepped some potentially relevant lines from /proc/vmstat while in a bad state. These are 10 seconds apart:

    counter                       first sample    10 seconds later
    compact_blocks_moved          558689782497    558713450361
    compact_pages_moved           249367966       249368093
    compact_pagemigrate_failed    45361973        45361995
    compact_stall                 1006717         1006719
    compact_fail                  922930          922932
    compact_success               82339           82339
    thp_fault_alloc               1592298         1592298
    thp_fault_fallback            1157130         1157132
    thp_collapse_alloc            25907           25907
    thp_collapse_alloc_failed     3520            3520
    thp_split                     4797            4797
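For anyone who wants to collect the same numbers, here is a rough Python sketch of how the two samples could be taken and diffed. It is not the command that produced the figures above (those were grepped by hand). The helper names parse_vmstat, vmstat_delta, and watch are made up for illustration, and the script only assumes that /proc/vmstat holds one "name value" pair per line.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a plain "name integer" pair.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def vmstat_delta(before, after, prefixes=("compact_", "thp_")):
    """Return after - before for counters whose names start with a prefix."""
    return {
        name: after[name] - before[name]
        for name in sorted(after)
        if name in before and name.startswith(prefixes)
    }


def watch(interval=10, path="/proc/vmstat"):
    """Hypothetical helper: take two samples `interval` seconds apart and print the deltas."""
    with open(path) as f:
        first = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_vmstat(f.read())
    for name, diff in vmstat_delta(first, second).items():
        print(f"{name:32s} {diff:+d}")
```

Calling watch() on the affected machine prints one delta per counter. Applied to the two samples quoted above, compact_blocks_moved grows by 23667864 in 10 seconds, while compact_success and thp_fault_alloc do not move at all and compact_stall, compact_fail, and thp_fault_fallback each grow by 2.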
Typo... try https://bugzilla.redhat.com/show_bug.cgi?id=879801
Closing this bug as a dup for now. *** This bug has been marked as a duplicate of bug 879801 ***