Bug 755570

Summary: irqbalance fills /var/log/messages with the same message every 10 seconds
Product: [Fedora] Fedora Reporter: Edgar Hoch <edgar.hoch>
Component: irqbalanceAssignee: Neil Horman <nhorman>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low Docs Contact:
Priority: unspecified    
Version: 16CC: anton, bugzilla.redhat, mhlavink, mkelly, nhorman, whome9988
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-01-04 19:36:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
Messages
none
Interrupts
none
patch to correct log spam.
none
Messages
none
Interrupts none

Description Edgar Hoch 2011-11-21 14:01:48 UTC
Description of problem:

On a new installed system /var/log/messages is filled by useless messages like this:

/usr/sbin/irqbalance: Load average increasing, re-enabling all cpus for irq balancing

The message apears every 10 seconds.


Version-Release number of selected component (if applicable):
irqbalance-1.0.2-1.fc16.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Install Fedora 16
  
Actual results:
/var/log/messages is filled with the message above.

Expected results:
/var/log/messages contains only relevant messages.

Comment 2 Neil Horman 2011-12-12 11:52:40 UTC
It might be the crappy programmer that wrote that code ;). I designed that code to exit power save mode and log a message about it if/when we had a cpu more than a standard deviation away from the average load, and more than one irq assigned to it.  I'm guessing that this system just has a lot of interrupt sources, one of which probably high volume, leading to a forced imbalance.  Can we get a copy of:

1)/var/log/messages
2)/proc/interrupts

I think the easiest solution is to gate that syslog on info.num_over and having at least one object in powersave mode, but I'd like to be sure nothing else is going on first.  Thanks!

Comment 3 Mike 2011-12-22 21:11:19 UTC
Created attachment 549240 [details]
Messages

I am also experiencing this issue on a new install of FC 16 (running in a VM).  I attached part of my /var/log/messages and /proc/interrupts

Hope that helps... in the meantime, is there a way to suppress these messages?

Comment 4 Mike 2011-12-22 21:11:59 UTC
Created attachment 549241 [details]
Interrupts

I am also experiencing this issue on a new install of FC 16 (running in a VM).  I attached part of my /var/log/messages and /proc/interrupts

Hope that helps... in the meantime, is there a way to suppress these messages?

Comment 5 Neil Horman 2011-12-22 21:17:08 UTC
Unfortunately no, but at this point, it seems pretty clear whats going on.  Anton, do you want to write this fix, or shall I?

Comment 6 Neil Horman 2011-12-23 00:33:53 UTC
Created attachment 549266 [details]
patch to correct log spam.

Heres a patch that ensures that at least one cpu is in powersave mode before logging that message.  It should fix the problem.  I'll have a test build for you shortly.

Comment 7 Neil Horman 2011-12-23 00:45:02 UTC
http://koji.fedoraproject.org/koji/taskinfo?taskID=3602402

Theres a test build with the above patch for you.  Please test it out and confirm that it solves the problem for you.  When you confirm I'll get it checked in upstream and for fedora

Comment 8 Mike 2011-12-27 15:49:28 UTC
Created attachment 549714 [details]
Messages

Comment 9 Mike 2011-12-27 15:49:58 UTC
Created attachment 549715 [details]
Interrupts

Comment 10 Mike 2011-12-27 15:52:05 UTC
I applied the patch this morning.  While the message does appears with a lower frequency, it does still strike me as a bit noisy.

Comment 11 Neil Horman 2011-12-28 03:34:55 UTC
please provide the new var/log/messages.  This suggests something else is wrong, like you're constantly turning a cpu off from irq balancing then re-enabling it

Comment 12 Matt 2011-12-28 04:08:11 UTC
http://matt.seilcampbells.com/messages

Matt

Comment 13 Neil Horman 2011-12-28 15:27:53 UTC
ok, something isn't right here.  You're getting re-enable logs on successive balance iterations.  You should have at least 1 iteration between these logs.  I'll take a closer look

Oh!  I also remembered a workaround for you.  Edit the systemd unit file for irqbalance to pass --powerthresh=X to the daemon, where X is the number of CPU's in your system.  That will prevent any cpu from going into powersave mode and prevent the error from triggering.

Comment 14 Matt 2011-12-29 06:40:10 UTC
Thanks Neil,

Preventing CPUs from going to sleep?  Is that something I want to do?  I have 24 CPUs in my home server and I would like the power savings when nobody's at home using the services they provide.

Thanks for looking into the code.  Good luck!

Matt

P.S.  This is a fresh and generic Fedora install, just running the base OS, with file services turned on and the VI hypervisor for the various guests.

Comment 15 Neil Horman 2011-12-29 14:18:51 UTC
It doesn't prevent cpus from going to sleep, it prevents cpus from having irqs assigned to them so they can stay asleep longer.  By raising the threshold, you lower the likelyhood that this will occur.  If cpus don't go into powersave mode in irqbalance, then this printout shouldn't occur.  Although looking at it, the threshold should already default to ULONG_MAX which suggests that something else is going on here.

Comment 16 Neil Horman 2011-12-29 14:40:48 UTC
so I've spent the last few hours looking at this, and for the life of me I cant see whats wrong with the code.  By all rights, you shouldn't ever be able to get the message you're seeing without having an intervening message about a cpu entering powersave mode, but you don't.  Is this system by any chance a virtual machine (vmware perhaps?).  Does the same problem occur on your bare metal system, or possibly on a kvm virt guest.  I'd like to be able to reproduce, but I need to figure out how

Comment 17 Mike 2011-12-30 13:22:06 UTC
Sorry for the delay.  I was OoO; spending some time with the family.

I attached updated logs from after the patch on 2011-12-27.  The show as comment number 8 and 9 or attachment 549714 [details] and 549715 for me.

Yes, my system is a VM, running on VMWare with 4 CPUs assigned to it.  I'm not sure if this will help or not but I also have another VM running on the same physical box with 2 CPUS assigned to it.  This  other VM is also running FC 16 but it does not exhibit this behavior.  Perhaps there is a difference between 2 and 4 core?

Comment 18 Mike 2011-12-30 15:13:13 UTC
Neil, per your recommendation, I added the powerthresh argument.  

Approximately 1 minute after restarting, I received notification of 'cpu 0 entering powersave mode'
After an additional 30 seconds, I received the same message for cpu 1... 
20 more additional seconds, cpu 2...  
and finally, 10 more seconds, cpu 3.


Thanks again for your assistance.

Mike

Comment 19 Mike 2011-12-30 15:15:40 UTC
I should have also added that after 1 hour, I have not received the original '/usr/sbin/irqbalance: Load average increasing, re-enabling all cpus for irq
balancing' message.

Comment 20 Madison Kelly 2011-12-30 18:36:31 UTC
To lend my voice to this bug; I am seeing this on my test systems as well. I've installed the RPM from koji. Once installed, the messages went away (at least, ten minutes in I've yet to see one, where before the screen would be full of messages by now).

Cheers

Comment 21 Neil Horman 2011-12-30 18:54:15 UTC
Thank you for your input, digimer, I presume you are on a bare metal system?  i.e. not a virt guest?

Mike, thank you for your files, I'm starting to feel like maybe the vmware hypervisor isn't reporting the right stats to /proc/stat.  I'll look at your new data in a few days (need to spend time with some more relatives :)).

Comment 22 Madison Kelly 2011-12-30 19:27:12 UTC
I am indeed on bare iron. Modest quad-core AMD Athlon II X4 600e processors ASUS M4A785T-M/CSM mainboards, in case it matters.

Comment 23 Neil Horman 2012-01-03 17:53:06 UTC
Mike, ok, you're comment 18 has me really confused, given that setting the powerthresh argument as I suggested should prevent any cpus from getting marked as going into powersave mode, but you've noted that they all are, which should be impossible, since at least one cpu needs to be responsible for handling all the irq traffic of a system.  Its almost as though, despite setting affinity for irqs in the guest, cpu load isn't reflecting the change.

I'm starting to wonder if you shouldn't just disable irqbalance on the guest.  Some quick googling seems to have support for that notion, given that, affinity for an irq to a virtual cpu isn't really sensical (since its possible for the vcpu to actually be any of a set of cpus).  Maybe thats the answer here, in a guest environment, we should just not run irqbalance.

Comment 24 Mike 2012-01-04 19:24:44 UTC
Sorry for the delay... I guess all of my CPUs were asleep and didn't see your message :)

I understand your message - re: comment 23.  Disabling irqbalance may, in fact, be the best option.

Comment 25 Neil Horman 2012-01-04 19:36:32 UTC
concur.  lets go with that option, and I'll look into ways of detecting irqbalance running on a virtual guest so that we don't do more harm than good.

Closing as wontfix since theres really no good way for irqbalance to handle this case other than to not run.

Comment 26 Michal Hlavinka 2012-06-29 07:56:39 UTC
Hi, could you explain me the status of this bug?  I can see it's closed wontfix, but I understand it is wontfix for virtual machines.

I'm seeing these messages in logs since I've updated my server (bare metal, CPU Intel Atom 330) from Fedora 16 to Fedora 17.