Bug 837049 - irqbalance 1.0.3 distributes network interrupts across cpus resulting in packet drops
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: irqbalance
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Petr Holasek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-07-02 15:29 UTC by Andrew J. Schorr
Modified: 2016-10-04 04:08 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-09-17 23:54:42 UTC
Type: Bug
Embargoed:


Attachments
Patch irqbalance to add --affinity-helper option to allow a user-space program to supply affinity hints (4.54 KB, patch)
2012-07-03 02:02 UTC, Andrew J. Schorr
log file showing output from "irqbalance --powerthresh=1 --debug" (2.75 MB, text/plain)
2012-07-05 15:38 UTC, Andrew J. Schorr

Description Andrew J. Schorr 2012-07-02 15:29:10 UTC
Description of problem: After upgrading a system from Fedora 14 to Fedora 16, we noticed a lot of network packets being dropped.  In Fedora 14, the old irqbalance 0.56 sent most of the interrupts to CPU 0.  But in Fedora 16, irqbalance is smarter: it distributes them across the cpus in the system.  This seems like a good idea, but it results in packet drops with Intel Xeon cpus.  This seems to be related to the cpu being in a sleep C-state, which triggers DMA stalls.  Sending all the interrupts to CPU 0 seems to work better.  Since the network drivers cannot be trusted to set affinity_hint properly, there needs to be an easy way to configure irqbalance to do the right thing.


Version-Release number of selected component (if applicable):
irqbalance-1.0.3-3.fc16.x86_64


How reproducible: Run an application that receives lots of multicast UDP traffic on an Intel Xeon system (e.g. E5-2690 or X5690 cpu).


Steps to Reproduce:
1. Subscribe to high-speed multicast UDP data services and observe packet drops at the application level or using "ethtool -S"
Actual results: Packets are dropped.


Expected results: Packets should not be dropped.


Additional info:

Comment 2 Neil Horman 2012-07-02 17:40:11 UTC
If you want to force all interrupts to a single cpu, you've found yourself in a situation where balancing is something to avoid rather than something to do.  So don't do it.  You can choose _not_ to balance irqs in several ways:

1) Turn off irqbalance entirely.  If you don't want to balance irqs, don't do it.  Turn the daemon off and manually set the affinity for all the irqs via /proc/irq/<X>/smp_affinity.

2) Ban the offending irqs from balancing with the --banirq option.  This needs to be pulled into Fedora, but that can easily be done.  Then you can just set the affinity manually for the irqs you want bound to cpu0.

3) Use the powerthresh option in irqbalance to tell irqbalance not to balance irqs to cpus that have gone into a deep C-state.  See the irqbalance man page for details on the option, but setting a power_thresh of 1 should make it very quickly avoid using cpus that have gone to sleep.
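For example (a rough sketch of the three approaches; irq 45 is purely illustrative, and the writes to /proc need root):

    # option 1: stop the daemon and pin the irq by hand
    systemctl stop irqbalance.service
    echo 1 > /proc/irq/45/smp_affinity    # bitmask: bit 0 set = cpu0 only

    # option 2: keep the daemon but exclude the irq, then pin it yourself
    irqbalance --banirq=45
    echo 1 > /proc/irq/45/smp_affinity

    # option 3: keep balancing, but stop handing irqs to cpus in deep C-states
    irqbalance --powerthresh=1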

Comment 3 Neil Horman 2012-07-02 18:00:06 UTC
Having read the email thread, I feel I should add here that F14's irqbalance was quite broken.  It left irqs from ethernet devices alone because it couldn't determine the device they belonged to (i.e. it couldn't properly parse the MSI-mangled names in /proc/interrupts).  As such, irqbalance did no balancing on them, which is why they were all left to whatever the system defaulted them to.  From F15 forward, the new irqbalance daemon and later kernels use sysfs to gather this info, so irq information is parsed consistently and balancing happens properly.  That's why I say that you should do one of the three things above, as it will explicitly produce behavior that is compatible with what you experienced in F14 by accident.

Comment 4 Andrew J. Schorr 2012-07-02 18:18:39 UTC
Hi Neil,

Thanks for the feedback.  I am aware that the old irqbalance was broken.  I'm not an expert on this stuff, but I was guessing that disabling irqbalance entirely was not a good plan.  If in fact irqbalance is only hurting the system, then why run it at all?  I have been assuming that irqbalance does good things in most
cases...

I have no particular desire to send everything to cpu 0.  It's just that I have observed many packet drops when the network interrupts are distributed across cpus.  Thus, I was planning to implement option #2 above.  However, this is rather ugly.  As best I can tell, I need to configure a startup service that will wait until the network drivers have been loaded, parse /proc/interrupts to determine which interrupts are being used for network interfaces, then kill and restart irqbalance with --banirq for all the network interrupts, and then set smp_affinity for those network irqs.  It does not seem reasonable to ask the average user to do this.

I already tried patching the igb driver to add an AFFINITY option to control
setting the affinity_hint, but the Intel folks informed me that they will not
accept such a feature.  Their goal is to eliminate all driver options.  So I concluded that the fix needs to be on the irqbalance side.

As for powerthresh, I did read the man page, but I'm afraid I don't understand
the explanation given there.  I can try setting power_thresh to 1 to see if that
helps, but I would never have guessed this from the description in the man page.

Thanks,
Andy

Comment 5 Neil Horman 2012-07-02 19:18:31 UTC
You are correct, typically disabling irqbalance is not a good plan, nor is it in your case.  You probably want (as you noted) to implement option 2 or 3.

For the record, it's not that irqbalance only hurts systems in general; it's hurting your system in particular.  Typically cpus that don't implement deep c-states, or don't activate them, don't suffer from this sort of issue.  You are an outlier case, although that may be changing.

If you don't want to send everything to cpu0, then I think the first step here is to decide exactly what it is you need to do.  From reading your emails it sounds like you've got irqbalance distributing irqs to as many cpus as possible to spread irq load as much as possible (which is the designed behavior), but because you have some cpus that are going into deep c-states, the wakeup latency they incur when they have traffic to process is causing drops when the traffic volume spikes suddenly.  If that's the case, then what you want is to either:

a) Avoid moving irqs to cpus that you expect to be in deep c-states often.

or 

b) Move irqs away from cpus whose lack of load makes them susceptible to going into a deep c-state.

(a) requires that you have foreknowledge of which cpus are going to be less loaded.  This is usually easy to determine in very finely tuned systems where affinity is often set for processes and the system environment is very stable.  It's usually accomplished by just disabling irqbalance and setting a manual irq affinity for each device irq.

(b) is better suited to a more general purpose system where traffic and applications are a bit more fluid.  This sounds to me more like your situation.  The easiest way to do this is to set powerthresh to 1.  As the documentation indicates, the powerthresh option specifies a threshold, in units of cpus, beyond which irqbalance attempts to stop balancing irqs to a cpu.  Irqbalance, for the purposes of trying to evenly distribute irq load across the system, will calculate the weighted average load per irq, and use that value to approximate the load added to a cpu by assigning an irq there.  Powerthresh specifies the number of cpus whose load falls more than a single standard deviation below that computed average load.  If more than that threshold of cpus has a low enough load, irqbalance will remove one of those cpus from its list of available cpus, and rebalance irqs among the remaining available cpus.  So by setting powerthresh to 1, every time a cpu has its load fall more than a standard deviation below the average cpu load, it will become ineligible to handle irqs, so as to prevent it from waking up needlessly.

That really sounds like the approach you want, as it will allow you to naturally  constrict your irq set to the number of cpus that will stay awake for the offered load at a given time. Note that if the load increases, the cpus that were formerly marked ineligible will be re-activated during the next rebalance cycle.

You can, of course, also use --banirq as noted.  It's not anywhere near as hard as you make it out, although it isn't as easy as it could be either.  Typically across successive reboots a given device will be allocated the same irq vectors, so you can usually just write a static --banirq set and it will work, and a script that parses /sys/class/net/<netdev>/device/msi_irqs will let you quickly assign static affinity to the needed interrupts.
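For instance, a rough sketch of such a script (em1 is just an example interface name, and this assumes the kernel exposes the per-device msi_irqs directory, where each entry is named after an irq number):

    # pin every MSI/MSI-X irq of em1 to cpu0
    for f in /sys/class/net/em1/device/msi_irqs/*; do
        echo 1 > /proc/irq/$(basename "$f")/smp_affinity
    done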

But you are correct, --banirq isn't as useful as it could be.  I've been thinking about adding a --bandev option to irqbalance that would let you specify a sysfs path; irqbalance could then automatically find the irqs for that device and avoid balancing them without you having to specify each irq individually.  Please open an issue here:
https://code.google.com/p/irqbalance
if you would like to see that and I can prioritize it.

Note that if you happen to know which cpus will typically be powering down, you can also investigate the IRQBALANCE_BANNED_CPUS environment variable.  But I really think powerthresh is what you want to look at first.
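For example (the mask value is illustrative, and this assumes the variable takes the usual hex cpumask, so f0 would keep irqs off cpus 4-7):

    # typically set in /etc/sysconfig/irqbalance on Fedora
    IRQBALANCE_BANNED_CPUS=f0 irqbalance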

Comment 6 Andrew J. Schorr 2012-07-02 20:06:31 UTC
Thank you, this is helpful.  I will try --powerthresh=1.  I think that --bandev could be helpful, but I'm wondering if an even better solution might be to add an argument --affinity-helper=<program>.  I think the problem I'm encountering is that the device driver does not set the affinity_hint values as needed to avoid performance problems.  So if irqbalance could provide a way to set this from user space, I think that would also solve the problem.  In other words, I would like to provide a helper script that gets a chance to provide (or override) the affinity_hint values.  If I could set /proc/irq/*/affinity_hint from user space, I would just do that.  But it's not possible.  Plus, it is somewhat thorny to construct the needed bitmask, and it would be better to have a dynamic mechanism that works as drivers are loaded and irqs are configured.

Does this make any sense?

Comment 7 Neil Horman 2012-07-02 20:23:32 UTC
Well, the affinity_hint files for devices in /proc are read-only, so it would have to be an override sort of mechanism, in which some user-space tool just wrote into smp_affinity despite what affinity_hint says.  Regardless, however, I'm hesitant to export functionality like that from irqbalance, mostly because I have difficulty seeing the benefit.  Not that I don't want to give users control over this sort of thing, mind you, but I think you already have it, or almost have it.  If you have a specific need to balance a subset of irqs according to a site-specific set of constraints, then the solution is to use --banirq (or possibly --bandev to exclude an irq set) in irqbalance, and then just run another script to manage those banned irqs independently.  There's really no reason for a custom script and irqbalance to interact, and since smp_affinity is just a proc file, marshaling data to write into that file is pretty straightforward work in a script.
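For instance, the marshaling can be as small as this (irq and cpu here are hypothetical shell variables, and cpus above 31 would need the comma-separated mask format):

    # build a one-bit hex mask for $cpu and write it for $irq
    printf '%x' $((1 << cpu)) > /proc/irq/$irq/smp_affinity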

Comment 8 Andrew J. Schorr 2012-07-02 20:38:37 UTC
I see your point.  I understand that it can be accomplished using --banirq or --bandev.  But those methods still require a fair bit of additional infrastructure from the user.  I will need to run a task to configure the smp_affinity values for the banned irqs, and I will need to make sure that happens after any relevant driver is loaded.  If instead of banning, irqbalance could call out to a user-space program to learn the desired affinity, then the user would not have to build any infrastructure to make sure that he configures the smp_affinity at the appropriate time.  It would be a pretty trivial patch to the end of classify.c:add_one_irq_to_db.  After the assign_affinity_hint section,
one could simply popen a user-space program with the irq number as an argument (and possibly the device name if that is available), and if it returned a non-empty string, use cpulist_parse to set new->affinity_hint.  Because this provides the hook, it's much easier for the user to control the policy.  If I
were to put together such a patch, is there any chance it would be accepted?  It seems like a much cleaner solution to me with fewer moving parts.

Thanks,
Andy

Comment 9 Andrew J. Schorr 2012-07-03 02:02:47 UTC
Created attachment 595856 [details]
Patch irqbalance to add --affinity-helper option to allow a user-space program to supply affinity hints

This is a sample patch to allow a user-space program to provide an affinity_hint.
Here is an example of a helper script I have used to pin igb and sfc interrupts
to cpu 0:

sh-4.2$ cat /var/tmp/affinity.sh 
#!/bin/sh

irq="$1"
devpath="$2"

uef="$devpath/uevent"
[ -s "$uef" ] && egrep -q "^DRIVER=(sfc|igb)$" $uef && echo 0

And irqbalance is invoked as follows:

    irqbalance --affinity-helper=/var/tmp/affinity.sh

Is there any chance of incorporating this feature into irqbalance?  It seems
to solve my problem in a fairly elegant fashion.

Regards,
Andy

Comment 10 Andrew J. Schorr 2012-07-03 02:23:19 UTC
Note: I see now that the helper script could simply test for the presence of a $devpath/net directory instead of checking the driver inside the uevent file.   So
to send all network irqs to CPU 0:

[ -d "$devpath/net" ] && echo 0

Or, to limit it to certain drivers:

[ -d "$devpath/net" -a -s "$uef" ] && egrep -q "^DRIVER=(sfc|igb)$" $uef && echo 0

Regards,
Andy

Comment 11 Neil Horman 2012-07-03 14:16:51 UTC
I never said it was hard to add the code you're asking for to irqbalance.  As you've noted, it's easy.  The hard part isn't what you're describing, it's what goes in the script in the first place.  Your example is the trivial case.  What if you wanted to have the script assign all irqs assigned to em1 to a unique core on numa node 1, rotating through all of those rather than spilling to other nodes (that's not a made-up example, mind you, it's something that irqbalance actually does)?  Determining that is far less trivial.


Add to that the fact that your script still has the same number of "moving parts", as you term it.  While scripting independently requires you to ban irqs from irqbalance to keep irqbalance and your script from fighting over what an irq's affinity should be, you still need to set the --hintpolicy option of irqbalance when using your approach above.  Setting it to anything other than "exact" would produce unexpected results, and once you set an exact policy, any irq that has a hint will have it applied, not just the ones that your script picks up on.  So significant coordination is still required using your method.

Also, note that your example is incorrect.  To assign an irq to cpu0 you need to "echo 1" rather than "echo 0", as the zeroth bit of the mask represents cpu0.  "echo 0" implies an empty bitmask, and would tell irqbalance not to use any affinity hint for that irq.  It's not relevant to this conversation of course, but since you're interested in scripting stuff like this, you should know that.

Regardless, however, my point remains.  There is coordination required between whatever custom script you have and irqbalance, whether it's run independently or forked from irqbalance.  And that coordination is trivial compared to what you can reasonably be expected to need to do when determining what your affinity mask should be.

That said, I do agree that it would be nice if we could have something that provided a mechanism to allow a user to manually do affinity assignment for an irq that was banned.  What if we added an option to specify a callout script for any interrupt excluded via --banirq or --bandev?  That would create a canonical place to indicate which irqs were handled by irqbalance, and which were handled externally to it.  It lets a user take advantage of irqbalance's new irq detection, and the script can just echo whatever affinity it wants into /proc/irq/<interrupt>/smp_affinity directly, without having to worry about irqbalance's behavior (because we've explicitly identified the irq as banned by irqbalance).  Does that sound reasonable?

Comment 12 Andrew J. Schorr 2012-07-03 14:39:50 UTC
Hi Neil,

I agree that my usage case is quite simple.  If somebody has a more complicated need, they are free to build a more sophisticated script to set the affinity.  In my case, I need to pin the network interrupts to a single CPU.  I think this will
be a common need that will affect other users.  I see dropped packets with both igb and sfc drivers when interrupts are distributed across multiple cpus.

I think you and I mean different things by "moving parts".  In order for me to
solve my problem (which I think may be a typical problem for other users as
they migrate to the new version of irqbalance), I simply needed to edit the /etc/sysconfig/irqbalance file to supply a command-line option to irqbalance
with a shell script to set the affinity appropriately.  Thus, I needed to edit
only 2 files (after having patched the daemon), and the problem was solved.

Without the --affinity-helper option to irqbalance, I would have had to create and install and activate a systemd or LSB init script that runs after
the network device drivers have been loaded.  This script will need to scan /proc/interrupts, identify network drivers, and then patch the /etc/sysconfig/irqbalance configuration file to add --bandev or --banirq options for each relevant interrupt.  It will then have to call systemctl to restart irqbalance.
Then, it must set the smp_affinity values for each irq.  While this is certainly feasible, I personally find it rather complicated to install new init scripts or systemd units into the system, and the script to do this is
far more complicated than in my approach.

Also, please note that my example is correct.  I "echo 0" because the patch
to irqbalance is using cpulist_parse (not cpumask_parse_user) to read in
the cpu mask.  I personally find the list format much friendlier and less error-prone.  I imagine that is why smp_affinity_list has been added to more recent kernels.
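For example, here is the same cpu set in the two notations (irq 45 is illustrative, and smp_affinity_list only exists on newer kernels):

    # list format: cpus 0-3 and 8
    echo 0-3,8 > /proc/irq/45/smp_affinity_list
    # equivalent hex bitmask
    echo 10f > /proc/irq/45/smp_affinity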

Finally, if you add a --bandev feature with a callout capability, that
should be good enough.  It is less flexible than my approach, and a bit
more painful (since it requires identification of all the potential network device names prior to starting irqbalance), but it should be workable.  I happen
to think my patch is simple and elegant and solves a general problem -- how to
provide or override affinity_hint values for drivers that do not do this properly.  But --bandev with a callout feature should be good enough, if a
bit uglier and harder to maintain from my perspective.  If --bandev took
a regular expression as an argument, that would be even better.

Thanks,
Andy

Comment 13 Andrew J. Schorr 2012-07-03 14:41:23 UTC
I should also note that I am continuing to see dropped packets on a system
where irqbalance is running with --powerthresh=1.

Regards,
Andy

Comment 14 Neil Horman 2012-07-03 16:39:34 UTC
>I agree that my usage case is quite simple.  If somebody has a more complicated
>need, they are free to build a more sophisticated script to set the affinity
That's exactly my point.  You're satisfied with the interface you've come up with because what you need to do is trivial.  It does absolutely _nothing_ for anyone else who wants to do anything more complex.  I'm not going to add an interface that is useful to only a small subset of people.  While it probably does solve the needs of people who need to do something exceedingly simple, it doesn't do much of anything for people who need even a marginally more complex solution.

>Without the --affinity-helper option to irqbalance, I would have had to create
>and install and activate a systemd or LSB init script that runs after
>the network device drivers have been loaded.
You're making this far more difficult than it needs to be.  A one-line udev rule that runs an alternate affinity script on a module load event would be sufficient to do what you want.  You don't need to do any sort of irqbalance sysfs patching, because you should know which devices/interrupts you want to manage independently.
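For example, something along these lines (the rule file name, the script path, and the choice to match net-interface add events rather than the module itself are all assumptions, not a tested recipe):

    # /etc/udev/rules.d/99-pin-net-irqs.rules
    ACTION=="add", SUBSYSTEM=="net", RUN+="/usr/local/sbin/pin-net-irqs.sh %k"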

>Also, please note that my example is correct.  I "echo 0" because the patch
>to irqbalance is using cpulist_parse (not cpumask_parse_user) to read in
>the cpu mask.  I personally find the list format much friendlier and less error-
>prone.
My fault, I missed the use of cpulist_parse.


Regardless, I still don't like this approach, as its behavior is sensitive to other options in irqbalance, which will lead to confusion.  I'm going to implement a callout for excluded irqs that allows you to set affinity manually.

As for the powerthresh issue, it should be helping.  Can you please run irqbalance manually with the following options:

irqbalance --powerthresh=1 --debug

and attach the output?  Thank you.

Comment 15 Andrew J. Schorr 2012-07-03 17:14:42 UTC
Hi Neil,

I'm afraid I don't understand why my framework cannot be used in a more complicated situation.  In what way is the --affinity-helper approach lacking?
That script can do anything it wants to do.  It can be simple or complicated
as necessary.  It is particularly nice that a very simple script solves a standard problem that many users may encounter.

I don't know what "irqbalance sysfs patching" refers to, but I don't intend to
do that.  And I don't think most people want to be bothered with hacking on udev rules.  I am managing 50 Linux systems, and my simple 1-line affinity_helper script fixes the problem on all of them in one fell swoop.  I don't want to have to build customized solutions for each system based on the names and types of network devices on those boxes.  What can a udev-triggered script do that my affinity_helper script couldn't do?

You are of course the maintainer, so you may do as you please.  I think your
approach will be more unwieldy for most users, but so be it.

I will attempt to gather the --powerthresh=1 debug output on Thursday,
but I'm not sure I can get to it.  I have other serious problems with
which to contend.

Regards,
Andy

P.S. The default subset hintpolicy works fine for me.  It is not hard to change that value if the default is not ideal.  This is much easier than adding udev rules.

Comment 16 Neil Horman 2012-07-03 19:53:18 UTC
Sorry, I meant to say sysconfig rather than sysfs.

Regardless, we're apparently just talking past one another at this point.  All I'm trying to say is that using your approach, things become more complex, and do so more quickly than they need to.  Yes, you get a one-line fix for your 50 systems, but anybody with a similar yet slightly more complex problem will have a hard time working with your solution.  What if someone wants to set a specific affinity for a set of irqs but let irqbalance manage the others?  That would imply needing to set hintpolicy=exact.  But if they have another driver that sets an affinity hint from the driver, then the behavior of that interrupt changes, potentially against the administrator's wishes.

As for the use of udev, the only reason I mentioned it was because you asserted that to manage banned interrupts you needed to create a systemd unit file, etc., to run the script appropriately, which is certainly not the case.  You can do any number of things to run an additional script, some of which are easier than others (and we don't need to debate the nuances of which is easier than which).  You are correct, however, that no one really wants to do any of them, which is why this is normally not a problem.  For the small subset of people that do want some finer-grained control... well, that's what we're talking about here :)

The bottom line is, I really just don't like the idea of overriding affinity_hint in irqbalance.  I feel like it's going to lead to confusion, as irqs will be restricted to running on a subset of cores without any external visibility as to why.  It's confusing for outside observers, and even for administrators, when the script doing the override is complex.  I much prefer an irqbalance that either manages interrupts itself, or leaves it to another program to do so.

That said, what about this for an idea: a should_ban callout.  For each irq that gets probed, we can execute an optionally provided script, whose exit code informs irqbalance whether it should manage the irq or leave it alone.  A solution like this would be more flexible than --bandev, as you could do regex matching on the passed-in device string and/or irq vector in any way you choose, and it would give you a chance to either optionally set the affinity statically for each irq, or hand the vector off to a custom program of your choosing that can manage that irq according to whatever policy you choose.  This also provides a single point of delineation to determine whether irqbalance is actually managing that irq or not.  Your specific situation would be covered in much the same way, in that you could write a script that matches on your relevant net driver, does an "echo 1 > /proc/irq/<irq>/smp_affinity", and returns 1 to let irqbalance know that that interrupt should be left alone.  That sounds like a good all-around solution to me.  What do you think?
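A sketch of such a callout for the situation above (the argument order and the exit-code convention are just the proposal as described here, not a settled interface):

    #!/bin/sh
    # should_ban callout: <script> <sysfs devpath> <irq>
    devpath="$1"
    irq="$2"

    if [ -d "$devpath/net" ]; then
        echo 1 > /proc/irq/$irq/smp_affinity   # pin network irqs to cpu0 ourselves
        exit 1                                 # tell irqbalance to leave this irq alone
    fi
    exit 0                                     # let irqbalance manage everything else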

Comment 17 Andrew J. Schorr 2012-07-03 20:04:31 UTC
That sounds great.  I hope the arguments will include the devpath that I am using in my current script.

Thanks,
Andy

Comment 18 Neil Horman 2012-07-03 20:09:12 UTC
Yes, I figure the arguments can include the full path to the device in sysfs and the irq value in question.

Comment 19 Andrew J. Schorr 2012-07-03 22:14:46 UTC
That sounds good.  I do have one followup question about irqbalance's algorithm.
Consider two approaches: A. some IRQs are pegged to certain CPUs by the
affinity_hint setting; and B. those IRQs are banned using --banirq or some
other mechanism.  Is there any difference in how irqbalance will distribute
the remaining IRQs between these 2 cases?  In other words, does knowledge
of the affinity_hint for the non-managed IRQs impact its decisions about how
to allocate other interrupts?

Thanks,
Andy

Comment 20 Andrew J. Schorr 2012-07-05 15:38:35 UTC
Created attachment 596438 [details]
log file showing output from "irqbalance --powerthresh=1 --debug"

As requested, this attachment contains a log file of output
from "irqbalance --powerthresh=1 --debug".  During this time period,
several thousand network packets were dropped.  A similar system
with the interrupts pinned to CPU 0 did not drop any packets.

Regards,
Andy

Comment 21 Neil Horman 2012-07-05 19:38:10 UTC
http://code.google.com/p/irqbalance/issues/detail?id=33&can=1


I've added the banscript option upstream if you would like to backport it, Petr.

Andrew, looking at your output above, it seems that irqbalance is deciding that none of your cpus ever have a load that drops sufficiently to be considered for powersaving mode. Is it possible for a cpu to enter and leave a deep c-state very quickly (i.e. so quickly that the periodic poll from irqbalance misses it)?

Comment 22 Andrew J. Schorr 2012-07-05 19:47:50 UTC
Hi Neil,

Yes, I think so.  Here is some output from turbostat on that system:

 cr CPU    %c0  GHz  TSC    %c1    %c3    %c6  %pc3  %pc6
           5.96 2.74 3.46  26.19  39.01  28.83  1.07  0.00
   0   0  60.75 2.72 3.46  33.74   5.50   0.01  1.07  0.00
   0   6   0.03 2.36 3.46  94.45   5.50   0.01  1.07  0.00
   1   1   2.39 2.79 3.46   4.41  86.07   7.13  1.07  0.00
   1   7   0.06 2.76 3.46   6.74  86.07   7.13  1.07  0.00
   2   2   0.71 2.97 3.46   3.38  68.99  26.93  1.07  0.00
   2   8   0.08 1.85 3.46   4.01  68.99  26.93  1.07  0.00
   8   3   2.77 2.89 3.46  34.38  26.59  36.25  1.07  0.00
   8   9   0.11 1.81 3.46  37.05  26.59  36.25  1.07  0.00
   9   4   3.56 2.98 3.46  42.40  39.12  14.93  1.07  0.00
   9  10   0.05 1.88 3.46  45.91  39.12  14.93  1.07  0.00
  10   5   0.83 2.48 3.46   3.61   7.81  87.76  1.07  0.00
  10  11   0.23 1.94 3.46   4.21   7.81  87.76  1.07  0.00

By sending all the network interrupts to CPU0, I can avoid most of the C6 states
on that CPU.

If I don't pin the network interrupts to CPU 0, the only other option is to
mess around with the cpuidle C-state configuration to eliminate C6 states
on all cpus.  That has bad implications for power usage, so it's not a great
solution.  According to cpupower, the C6 states have a latency of 200 microseconds, and C3 is 20 microseconds.  I'm not sure whether C3 causes packet
drops, but C6 definitely seems to be a problem.  Kernels 3.4 and above
have a disable flag under /sys/devices/system/cpu/cpu*/cpuidle/state*
that allows one to disable that state.  Unfortunately, it currently acts
globally, impacting all cores, but a patch is circulating to make this
operate on a per-core basis.  Still, the winning solution seems to be
to concentrate the interrupts on a single CPU that will be active enough
to avoid these stalls.
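For reference, the per-state knob mentioned above looks roughly like this (a sketch assuming a 3.4+ kernel; which stateN corresponds to C6 varies by cpu and driver, so check the name files first):

    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable   # if state3 turns out to be C6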

Thanks,
Andy

Comment 23 Andrew J. Schorr 2012-07-05 19:50:16 UTC
FYI, here is the thread to patch the kernel to disable certain C states on a per-cpu basis.

https://lkml.org/lkml/2012/6/24/25

Regards,
Andy

Comment 24 Andrew J. Schorr 2012-07-05 19:51:34 UTC
Neil, when you get a chance, do you happen to know the answer to my question in Comment #19 above?  I want to make sure I understand any performance impact of your patch vs. mine.

Thanks,
Andy

Comment 25 Petr Holasek 2012-07-05 20:09:20 UTC
(In reply to comment #21)
> http://code.google.com/p/irqbalance/issues/detail?id=33&can=1
> 
> 
> i've added the banscript option upstream if you would like to backport it
> Petr.
> 

Hi Neil,

thanks, I'll backport it.

Petr H

Comment 26 Neil Horman 2012-07-05 20:33:02 UTC
Thank you Petr.

Andrew, in answer to your question, both approaches should result in the same behavior.  Irqbalance only takes affinity hinting into consideration for interrupts that it actually manages.  That said, irqbalance does balance irqs based on load, as read from /proc/stat.  For each cpu, it determines the load applied to that cpu resulting from work done in irq and softirq context.  So setting an affinity hint of cpu0 is exactly the same to irqbalance as setting a static affinity of cpu0 on an unmanaged irq, because the irq will produce the same load on that cpu, and irqbalance can't tell the difference.
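Roughly, the load it reads looks like this (an illustration of the data source only, not the daemon's exact bookkeeping):

    # fields 7 and 8 of each cpuN line are jiffies spent in irq and softirq context
    awk '/^cpu[0-9]/ { print $1, $7 + $8 }' /proc/stat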

Comment 27 Fedora Update System 2012-08-23 14:12:31 UTC
irqbalance-1.0.3-6.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/irqbalance-1.0.3-6.fc18

Comment 28 Fedora Update System 2012-08-23 15:38:01 UTC
Package irqbalance-1.0.3-6.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing irqbalance-1.0.3-6.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-12570/irqbalance-1.0.3-6.fc18
then log in and leave karma (feedback).

Comment 29 Fedora Update System 2012-08-29 15:06:54 UTC
irqbalance-1.0.3-7.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/irqbalance-1.0.3-7.fc18

Comment 30 Fedora Update System 2012-08-29 15:19:02 UTC
irqbalance-1.0.3-6.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/irqbalance-1.0.3-6.fc17

Comment 31 Fedora Update System 2012-09-03 11:26:42 UTC
irqbalance-1.0.3-8.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/irqbalance-1.0.3-8.fc18

Comment 32 Fedora Update System 2012-09-17 23:54:42 UTC
irqbalance-1.0.3-8.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 33 Fedora Update System 2012-09-19 02:59:43 UTC
irqbalance-1.0.3-6.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

