Hide Forgot
Description of problem: On a new installed system /var/log/messages is filled by useless messages like this: /usr/sbin/irqbalance: Load average increasing, re-enabling all cpus for irq balancing The message apears every 10 seconds. Version-Release number of selected component (if applicable): irqbalance-1.0.2-1.fc16.x86_64 How reproducible: Always. Steps to Reproduce: 1. Install Fedora 16 Actual results: /var/log/messages is filled with the message above. Expected results: /var/log/messages contains only relevant messages.
It might be the crappy programmer that wrote that code ;). I designed that code to exit power save mode and log a message about it if/when we had a cpu more than a standard deviation away from the average load, and more than one irq assigned to it. I'm guessing that this system just has a lot of interrupt sources, one of which probably high volume, leading to a forced imbalance. Can we get a copy of: 1)/var/log/messages 2)/proc/interrupts I think the easiest solution is to gate that syslog on info.num_over and having at least one object in powersave mode, but I'd like to be sure nothing else is going on first. Thanks!
Created attachment 549240 [details] Messages I am also experiencing this issue on a new install of FC 16 (running in a VM). I attached part of my /var/log/messages and /proc/interrupts Hope that helps... in the meantime, is there a way to suppress these messages?
Created attachment 549241 [details] Interrupts I am also experiencing this issue on a new install of FC 16 (running in a VM). I attached part of my /var/log/messages and /proc/interrupts Hope that helps... in the meantime, is there a way to suppress these messages?
Unfortunately no, but at this point, it seems pretty clear whats going on. Anton, do you want to write this fix, or shall I?
Created attachment 549266 [details] patch to correct log spam. Heres a patch that ensures that at least one cpu is in powersave mode before logging that message. It should fix the problem. I'll have a test build for you shortly.
http://koji.fedoraproject.org/koji/taskinfo?taskID=3602402 Theres a test build with the above patch for you. Please test it out and confirm that it solves the problem for you. When you confirm I'll get it checked in upstream and for fedora
Created attachment 549714 [details] Messages
Created attachment 549715 [details] Interrupts
I applied the patch this morning. While the message does appears with a lower frequency, it does still strike me as a bit noisy.
please provide the new var/log/messages. This suggests something else is wrong, like you're constantly turning a cpu off from irq balancing then re-enabling it
http://matt.seilcampbells.com/messages Matt
ok, something isn't right here. You're getting re-enable logs on successive balance iterations. You should have at least 1 iteration between these logs. I'll take a closer look Oh! I also remembered a workaround for you. Edit the systemd unit file for irqbalance to pass --powerthresh=X to the daemon, where X is the number of CPU's in your system. That will prevent any cpu from going into powersave mode and prevent the error from triggering.
Thanks Neil, Preventing CPUs from going to sleep? Is that something I want to do? I have 24 CPUs in my home server and I would like the power savings when nobody's at home using the services they provide. Thanks for looking into the code. Good luck! Matt P.S. This is a fresh and generic Fedora install, just running the base OS, with file services turned on and the VI hypervisor for the various guests.
It doesn't prevent cpus from going to sleep, it prevents cpus from having irqs assigned to them so they can stay asleep longer. By raising the threshold, you lower the likelyhood that this will occur. If cpus don't go into powersave mode in irqbalance, then this printout shouldn't occur. Although looking at it, the threshold should already default to ULONG_MAX which suggests that something else is going on here.
so I've spent the last few hours looking at this, and for the life of me I cant see whats wrong with the code. By all rights, you shouldn't ever be able to get the message you're seeing without having an intervening message about a cpu entering powersave mode, but you don't. Is this system by any chance a virtual machine (vmware perhaps?). Does the same problem occur on your bare metal system, or possibly on a kvm virt guest. I'd like to be able to reproduce, but I need to figure out how
Sorry for the delay. I was OoO; spending some time with the family. I attached updated logs from after the patch on 2011-12-27. The show as comment number 8 and 9 or attachment 549714 [details] and 549715 for me. Yes, my system is a VM, running on VMWare with 4 CPUs assigned to it. I'm not sure if this will help or not but I also have another VM running on the same physical box with 2 CPUS assigned to it. This other VM is also running FC 16 but it does not exhibit this behavior. Perhaps there is a difference between 2 and 4 core?
Neil, per your recommendation, I added the powerthresh argument. Approximately 1 minute after restarting, I received notification of 'cpu 0 entering powersave mode' After an additional 30 seconds, I received the same message for cpu 1... 20 more additional seconds, cpu 2... and finally, 10 more seconds, cpu 3. Thanks again for your assistance. Mike
I should have also added that after 1 hour, I have not received the original '/usr/sbin/irqbalance: Load average increasing, re-enabling all cpus for irq balancing' message.
To lend my voice to this bug; I am seeing this on my test systems as well. I've installed the RPM from koji. Once installed, the messages went away (at least, ten minutes in I've yet to see one, where before the screen would be full of messages by now). Cheers
Thank you for your input, digimer, I presume you are on a bare metal system? i.e. not a virt guest? Mike, thank you for your files, I'm starting to feel like maybe the vmware hypervisor isn't reporting the right stats to /proc/stat. I'll look at your new data in a few days (need to spend time with some more relatives :)).
I am indeed on bare iron. Modest quad-core AMD Athlon II X4 600e processors ASUS M4A785T-M/CSM mainboards, in case it matters.
Mike, ok, you're comment 18 has me really confused, given that setting the powerthresh argument as I suggested should prevent any cpus from getting marked as going into powersave mode, but you've noted that they all are, which should be impossible, since at least one cpu needs to be responsible for handling all the irq traffic of a system. Its almost as though, despite setting affinity for irqs in the guest, cpu load isn't reflecting the change. I'm starting to wonder if you shouldn't just disable irqbalance on the guest. Some quick googling seems to have support for that notion, given that, affinity for an irq to a virtual cpu isn't really sensical (since its possible for the vcpu to actually be any of a set of cpus). Maybe thats the answer here, in a guest environment, we should just not run irqbalance.
Sorry for the delay... I guess all of my CPUs were asleep and didn't see your message :) I understand your message - re: comment 23. Disabling irqbalance may, in fact, be the best option.
concur. lets go with that option, and I'll look into ways of detecting irqbalance running on a virtual guest so that we don't do more harm than good. Closing as wontfix since theres really no good way for irqbalance to handle this case other than to not run.
Hi, could you explain me the status of this bug? I can see it's closed wontfix, but I understand it is wontfix for virtual machines. I'm seeing these messages in logs since I've updated my server (bare metal, CPU Intel Atom 330) from Fedora 16 to Fedora 17.