Bug 870573

Summary: Abnormally high ksoftirqd CPU usage on CentOS 6.3
Product: Red Hat Enterprise Linux 6
Reporter: kbergk
Component: kernel
Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED CURRENTRELEASE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 6.3
CC: abhalla, ajb, ank, david, deatrich, dgregor, fredrik, iain.t.morris, jonathansteffan, mihai, orion, pasteur, prarit, roland.friedwagner, sam, shawn.siefkas, simon.d.matthews, suren, toracat
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-04-13 12:02:01 UTC
Type: Bug
Attachments:
Description                                  Flags
cat /proc/cpuinfo                            none
lspci                                        none
Simple test of 4 recent kernels              none
Simple test on 3.0.48-1.el6.elrepo.x86_64    none

Description kbergk 2012-10-26 22:41:52 UTC
Created attachment 634086 [details]
cat /proc/cpuinfo

Description of problem:
Seeing higher than usual CPU usage by ksoftirqd in CentOS 6.3 as of kernel 2.6.32-220.13.1.el6.x86_64.  This is on a server with dual Intel Xeon L5420 processors and 5400 chipset.


Version-Release number of selected component (if applicable):
CentOS 6.3, kernels 2.6.32-220.13.1.el6.x86_64 and 2.6.32-279.9.1.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install kernel 2.6.32-220.13.1.el6.x86_64 on system with Xeon L5420 processor(s).
2. Monitor CPU usage of ksoftirqd.
  
Actual results:
After 10 minutes of heavy CPU load on kernel 2.6.32-220.13.1.el6.x86_64, CPU time of ksoftirqd:
root 4 2 13 20:28 ? 00:00:53 [ksoftirqd/0]
root 9 2 8 20:28 ? 00:00:34 [ksoftirqd/1]
root 13 2 0 20:28 ? 00:00:00 [ksoftirqd/2]
root 17 2 0 20:28 ? 00:00:02 [ksoftirqd/3]
root 21 2 9 20:28 ? 00:00:39 [ksoftirqd/4]
root 25 2 11 20:28 ? 00:00:44 [ksoftirqd/5]
root 29 2 0 20:28 ? 00:00:02 [ksoftirqd/6]
root 33 2 0 20:28 ? 00:00:01 [ksoftirqd/7]


Expected results:
After 10 minutes of heavy CPU load on kernel 2.6.32-220.4.1.el6.x86_64, CPU time of ksoftirqd:
root 4 2 0 19:22 ? 00:00:00 [ksoftirqd/0]
root 9 2 0 19:22 ? 00:00:00 [ksoftirqd/1]
root 13 2 0 19:22 ? 00:00:00 [ksoftirqd/2]
root 17 2 0 19:22 ? 00:00:00 [ksoftirqd/3]
root 21 2 0 19:22 ? 00:00:00 [ksoftirqd/4]
root 25 2 0 19:22 ? 00:00:00 [ksoftirqd/5]
root 29 2 0 19:22 ? 00:00:00 [ksoftirqd/6]
root 33 2 0 19:22 ? 00:00:00 [ksoftirqd/7]
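
For anyone who wants to compare kernels numerically, here is a minimal Python sketch that turns `ps -ef` lines like the ones above into CPU seconds per ksoftirqd thread. The function name and the embedded sample constant are made up for illustration; the only assumption about ps is that the TIME column uses its usual [DD-]HH:MM:SS format.

```python
import re

# Matches the TIME column and the CPU index in a `ps -ef` line such as:
#   root 4 2 13 20:28 ? 00:00:53 [ksoftirqd/0]
# The optional "DD-" prefix covers threads that have accumulated days of CPU time.
KSOFTIRQD_RE = re.compile(
    r"(?:(\d+)-)?(\d+):(\d{2}):(\d{2})\s+\[ksoftirqd/(\d+)\]"
)

# Sample copied from the "Actual results" listing in this report.
AFFECTED_SAMPLE = """\
root 4 2 13 20:28 ? 00:00:53 [ksoftirqd/0]
root 9 2 8 20:28 ? 00:00:34 [ksoftirqd/1]
root 13 2 0 20:28 ? 00:00:00 [ksoftirqd/2]
root 17 2 0 20:28 ? 00:00:02 [ksoftirqd/3]
root 21 2 9 20:28 ? 00:00:39 [ksoftirqd/4]
root 25 2 11 20:28 ? 00:00:44 [ksoftirqd/5]
root 29 2 0 20:28 ? 00:00:02 [ksoftirqd/6]
root 33 2 0 20:28 ? 00:00:01 [ksoftirqd/7]
"""


def ksoftirqd_cpu_seconds(ps_output):
    """Return {cpu_index: accumulated CPU seconds} for each ksoftirqd thread."""
    result = {}
    for line in ps_output.splitlines():
        match = KSOFTIRQD_RE.search(line)
        if not match:
            continue  # not a ksoftirqd line
        days, hours, minutes, seconds, cpu = match.groups()
        total = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if days:
            total += int(days) * 86400
        result[int(cpu)] = total
    return result
```

On a live system the input could come from something like `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`; summing the returned values gives a single number per kernel to compare.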

Additional info:
Link to CentOS bug with possibly useful information: http://bugs.centos.org/view.php?id=5813

I also found that this issue is not present in a much newer mainline kernel, 3.6.3-1.el6.elrepo.x86_64.

Comment 1 kbergk 2012-10-26 22:42:17 UTC
Created attachment 634087 [details]
lspci

Comment 3 kbergk 2012-10-26 22:50:32 UTC
Created attachment 634088 [details]
Simple test of 4 recent kernels

I briefly tested the latest mainline kernel (3.6.3-1 as of this writing) and don't see this ksoftirqd issue with it. I ran some quick tests on various kernels by booting each one and running 6 cpuburn threads for 10 minutes. The attachment has the CPU time of the ksoftirqd processes and the output of /proc/interrupts for the following kernels:
2.6.32-220.4.1.el6.x86_64 (unaffected)
2.6.32-220.13.1.el6.x86_64 (affected)
2.6.32-279.9.1.el6.x86_64 (affected)
3.6.3-1.el6.elrepo.x86_64 (unaffected)
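
If it helps anyone repeating this kind of test, below is a rough Python sketch for diffing two /proc/interrupts snapshots taken before and after a load run. It is not what was used to produce the attachment; the function names are hypothetical, and the parser assumes the usual layout of a CPU header row followed by "label: count count ... description" rows.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: [count_cpu0, count_cpu1, ...]}."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        per_cpu = []
        # Rows such as ERR: carry fewer counters than there are CPUs,
        # so stop at the first non-numeric token.
        for token in rest.split()[:ncpu]:
            if not token.isdigit():
                break
            per_cpu.append(int(token))
        counts[label.strip()] = per_cpu
    return counts


def interrupt_deltas(before, after):
    """Return {irq_label: increase in the summed count} between two snapshots."""
    return {
        irq: sum(after[irq]) - sum(before.get(irq, []))
        for irq in after
    }
```

Reading /proc/interrupts once before starting the load and once after it finishes, then sorting the deltas, shows which interrupt sources grew the most during the run.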

Comment 4 Akemi Yagi 2012-10-26 23:11:59 UTC
ELRepo also offers a "long-term" kernel (now named kernel-lt). The current version is kernel-lt-3.0.48. Could you try this one so that the target can be narrowed down?

Comment 5 kbergk 2012-10-27 00:26:28 UTC
Created attachment 634099 [details]
Simple test on 3.0.48-1.el6.elrepo.x86_64

It appears that the latest long-term ELRepo kernel, 3.0.48-1.el6.elrepo.x86_64, is not affected.

Comment 6 simon.d.matthews 2012-11-02 02:40:45 UTC
I see the same problem. 

The affected machines are all virtual machines, running CentOS 6.3 on an AMD host. The kernel is 2.6.32-279.1.1.el6.centos.plus.x86_64, but other kernel versions were affected. 


The problem happens when the machines are under high load.

Comment 7 Alan Bartlett 2012-11-02 02:54:03 UTC
(In reply to comment #6)
> I see the same problem. 
> 
> The affected machines are all virtual machines, running CentOS 6.3 on an AMD
> host. The kernel is 2.6.32-279.1.1.el6.centos.plus.x86_64, but other kernel
> versions were affected. 
> 
> 
> The problem happens when the machines are under high load.

Could you please confirm whether the current long-term support kernel from the ELRepo Project (kernel-lt-3.0.50.el6.elrepo, as of the date of this comment) resolves the issue for you?

Comment 8 simon.d.matthews 2012-11-02 03:02:54 UTC
(In reply to comment #6)
> I see the same problem. 
> 
> The affected machines are all virtual machines, running CentOS 6.3 on an AMD
> host. The kernel is 2.6.32-279.1.1.el6.centos.plus.x86_64, but other kernel
> versions were affected. 
> 
> 
> The problem happens when the machines are under high load.

I also see it on another VM running 2.6.32-279.11.1.el6.x86_64. High network I/O load seems to trigger the problem with ksoftirqd.

Comment 9 simon.d.matthews 2012-11-02 03:03:50 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > I see the same problem. 
> > 
> > The affected machines are all virtual machines, running CentOS 6.3 on an AMD
> > host. The kernel is 2.6.32-279.1.1.el6.centos.plus.x86_64, but other kernel
> > versions were affected. 
> > 
> > 
> > The problem happens when the machines are under high load.
> 
> Could you please confirm whether the current long-term support kernel from
> the ELRepo Project (kernel-lt-3.0.50.el6.elrepo, as of the date of this
> comment) resolves the issue for you?

Difficult. It is a core production machine and I am travelling next week.

Comment 10 Orion Poplawski 2012-11-02 13:01:52 UTC
FWIW - I see this on physical hardware as well as VMs.

Comment 11 kbergk 2012-11-02 16:20:19 UTC
Virt guest? While I do see this issue in my KVM guests, it's also very much apparent on my bare metal physical hosts.  I just want to be clear that this is not limited to VMs. Thanks.

Comment 12 Fredrik Jonsson 2012-11-15 16:29:54 UTC
Perhaps a clue... I have a lot of mixed hardware and only observe this problem on the Intel systems. AMD systems hum along just fine.

Comment 13 Andreas Kasenides 2012-12-03 15:03:40 UTC
A bare-metal machine with an AMD processor has this problem with
Linux xxx.xx.cy 2.6.32-279.14.1.el6.x86_64 and CentOS 6.3.

/proc/cpuinfo:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 37
model name      : AMD Opteron(tm) Processor 250
stepping        : 1
cpu MHz         : 1800.000
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good pni
bogomips        : 3607.82
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

The problem goes away with nohz=off on the kernel command line.
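
To confirm that a booted system actually picked up the nohz=off workaround, the kernel command line can be inspected. The following is a small Python sketch under my own assumptions: the helper names are invented, and keeping the last occurrence of a repeated parameter is a simplification rather than documented kernel behaviour.

```python
def kernel_param(cmdline, name):
    """Look up `name` on a kernel command line.

    Returns the value string for name=value, True for a bare flag,
    or None when the parameter is absent.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            # Assumption: if the parameter is repeated, keep the last one.
            value = val if sep else True
    return value


def nohz_disabled(cmdline):
    """True when the command line carries nohz=off."""
    return kernel_param(cmdline, "nohz") == "off"
```

On a running machine the input would be the contents of /proc/cmdline, for example `nohz_disabled(open("/proc/cmdline").read())`.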

Comment 14 Orion Poplawski 2012-12-03 15:33:30 UTC
In my experience, nohz=off is not without problems as well - it seems that load average calculations are incorrect with that setting.  I've been running the -lt kernels from elrepo with good results.

Comment 15 RHEL Program Management 2012-12-17 06:49:33 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 16 RHEL Program Management 2013-10-14 04:42:19 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 18 Orion Poplawski 2015-04-10 20:00:45 UTC
https://access.redhat.com/solutions/302623 indicates this was fixed in 6.4.  Time to close this?

Comment 19 Prarit Bhargava 2015-04-13 12:02:01 UTC
(In reply to Orion Poplawski from comment #18)
> https://access.redhat.com/solutions/302623 indicates this was fixed in 6.4. 
> Time to close this?

I think so -- I'm closing as current release.

P.