Bug 1590937
| Summary: | Failed to set smp_affinity for IRQ | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Luiz Capitulino <lcapitulino> |
| Component: | tuned | Assignee: | Ondřej Lysoněk <olysonek> |
| Status: | CLOSED ERRATA | QA Contact: | Tereza Cerna <tcerna> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.5 | CC: | jeder, jskarvad, lcapitulino, olysonek, pezhang, siliu, tcerna |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tuned-2.10.0-1.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1609148 | Environment: | |
| Last Closed: | 2018-10-30 10:50:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1240765, 1394932, 1609148 | | |
Description
Luiz Capitulino
2018-06-13 16:51:11 UTC
I haven't investigated this yet, but I think it's possible that those errors always happened and we're only seeing them now because tuned now prints errors to stderr.

I have some important findings:

o It is defirqaffinity.py that is failing. The reason for the failure seems to be that the IRQ cpumask is fixed to one or more CPUs and hence can't be changed

o Since defirqaffinity.py is used by cpu-partitioning and the real-time profiles, I can also reproduce the issue with the host real-time profile

o This issue seems to be particularly serious for the host real-time profile, since the profile script will stop short on error, leaving a lot of tunings unapplied. For this reason I'm setting priority to urgent

o One thing I still don't understand is that some of the IRQ masks that fail to be changed differ between profile activation and deactivation. I'd expect them to be the same.

I think the idea below for "remove" and "add" operations in defirqaffinity.py might fix all issues. Jaroslav or Ondrej, could one of you work on this and send me a test script?

remove
-------

Only remove an isolated CPU from an IRQ cpumask if that CPU bit actually appears in the IRQ cpumask. Otherwise, if the IRQ cpumask just contains housekeeping CPUs for example, skip it

The algorithm would be something like:

For each IRQ:
    for each isolated CPU
        if isolated CPU in IRQ cpumask
            save original IRQ cpumask to a file
            remove this CPU from the IRQ cpumask, exit with an error if this fails

add
---

Go over the list saved by "remove" and restore the IRQs according to that list. Emit a warning on failure

I'll look into it.

(In reply to Luiz Capitulino from comment #3)
> o This issue seems to be particularly serious for the host real-time
> profile, since the profile script will stop short on error, leaving a lot of
> tunings unapplied.
> For this reason I'm setting priority to urgent

To fix/workaround this in Tuned 2.10, we probably can drop the call to "die" here:
https://github.com/redhat-performance/tuned/blob/91c89db361e75d29fbee529906828723bb2a28be/profiles/realtime-virtual-host/script.sh#L48

> o One thing I still don't understand is that some of the IRQ masks that fail
> to be changed differ between profile activation and deactivation. I'd expect
> them to be the same.

The explanation might be that you don't get an IOError when you try to set an affinity which is already set. Or that you can set the affinity to certain values, but not others.

> I think the idea below for "remove" and "add" operations in
> defirqaffinity.py might fix all issues. Jaroslav or Ondrej, could one of you
> work on this and send me a test script?
>
> remove
> -------
>
> Only remove an isolated CPU from an IRQ cpumask if that CPU bit actually
> appears in the IRQ cpumask. Otherwise, if the IRQ cpumask just contains
> housekeeping CPUs for example, skip it
>
> The algorithm would be something like:
>
> For each IRQ:
>     for each isolated CPU
>         if isolated CPU in IRQ cpumask
>             save original IRQ cpumask to a file
>             remove this CPU from the IRQ cpumask, exit with an error if this fails
>
> add
> ---
>
> Go over the list saved by "remove" and restore the IRQs according to that
> list. Emit a warning on failure

I think this might work.

Given that what you're asking for is basically a new feature (the defirqaffinity.py script doesn't implement 'save and restore'), and given that the scheduler plugin already implements changing the affinity of IRQs *with* 'save and restore' (as part of the isolated_cores option of the scheduler plugin), I propose we implement the following feature instead: we add a new option "irqs_affinity" to the scheduler plugin which will change the affinity of IRQs to the given value. We then use this option in tuned.conf of the relevant profiles and drop the call to defirqaffinity.
I think this solution will be simpler to implement, because the necessary infrastructure is already present in the scheduler plugin. I also think it's a better solution.

Luiz, Jardo, is the above proposal OK with you? Jardo, do you think we can include the feature in tuned 2.10? If we can't include it in 2.10, I think it should be fine: I *think* the error is not new, and I already proposed that we don't print messages with log level ERROR to stderr just yet, so the messages will get hidden in the log again :) : https://github.com/redhat-performance/tuned/pull/108

(In reply to Ondřej Lysoněk from comment #5)
> (In reply to Luiz Capitulino from comment #3)
> > o This issue seems to be particularly serious for the host real-time
> > profile, since the profile script will stop short on error, leaving a lot of
> > tunings unapplied. For this reason I'm setting priority to urgent
>
> To fix/workaround this in Tuned 2.10, we probably can drop the call to "die"
> here:
> https://github.com/redhat-performance/tuned/blob/91c89db361e75d29fbee529906828723bb2a28be/profiles/realtime-virtual-host/script.sh#L48

Yes, that's a workaround. But I think we should do this as a last resort, since this prints lots of errors and will certainly confuse users. Not to mention real errors will be hidden in the mess.

> > o One thing I still don't understand is that some of the IRQ masks that fail
> > to be changed differ between profile activation and deactivation. I'd expect
> > them to be the same.
>
> The explanation might be that you don't get an IOError when you try to set
> an affinity which is already set. Or that you can set the affinity to
> certain values, but not others.

It's the last one. The script does get an error. The kernel returns -EIO if the cpumask can't be changed for an IRQ.
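As an illustration of the -EIO behaviour described above, here is a minimal Python sketch of writing a cpumask to an IRQ and telling "the kernel refused this mask" apart from other failures. It is not the code from defirqaffinity.py or from the scheduler plugin. The function names `format_cpumask` and `set_irq_affinity` and the `proc_irq` parameter are hypothetical and exist only for this example. The sketch assumes the usual `/proc/irq/<N>/smp_affinity` interface, which takes a hex bitmask written as comma-separated 32-bit groups.

```python
import errno
import os


def format_cpumask(mask):
    """Format an integer cpumask as comma-separated 32-bit hex groups."""
    groups = []
    while True:
        groups.append("%08x" % (mask & 0xFFFFFFFF))
        mask >>= 32
        if mask == 0:
            break
    # Most significant group first, e.g. "00000001,00000000" for CPU 32.
    return ",".join(reversed(groups))


def set_irq_affinity(irq, mask, proc_irq="/proc/irq"):
    """Write a new cpumask for one IRQ (hypothetical helper).

    Returns True on success and False when the write fails with EIO,
    which is how a mask that can't be changed shows up to the caller.
    Any other error is re-raised so that real problems are not hidden.
    """
    path = os.path.join(proc_irq, str(irq), "smp_affinity")
    try:
        with open(path, "w") as f:
            f.write(format_cpumask(mask))
    except OSError as e:
        if e.errno == errno.EIO:
            return False
        raise
    return True
```

On a real system this needs root and changes live IRQ routing, so the `proc_irq` argument is there to let the helper be pointed at a scratch directory for testing. Returning False instead of raising lets a caller skip fixed IRQs without aborting the rest of the profile.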
> Given that what you're asking for is basically a new feature (the
> defirqaffinity.py script doesn't implement 'save and restore'), and given
> that the scheduler plugin already implements changing the affinity of IRQs
> *with* 'save and restore' (as part of the isolated_cores option of the
> scheduler plugin), I propose we implement the following feature instead: we
> add a new option "irqs_affinity" to the scheduler plugin which will change
> the affinity of IRQs to the given value. We then use this option in
> tuned.conf of the relevant profiles and drop the call to defirqaffinity. I
> think this solution will be simpler to implement, because the necessary
> infrastructure is already present in the scheduler plugin. I also think it's
> a better solution.
>
> Luiz, Jardo, is the above proposal OK with you? Jardo, do you think we can
> include the feature in tuned 2.10? If we can't include it in 2.10, I think
> it should be fine: I *think* the error is not new, and I already proposed
> that we don't print messages with log level ERROR to stderr just yet, so the
> messages will get hidden in the log again :) :
> https://github.com/redhat-performance/tuned/pull/108

This all sounds good, but there are a few important points. First and most important, I do think we should get this fixed in 2.10 since the workaround is pretty bad (it works, but it's bad). Second, it is possible that the algorithms I suggested won't work as expected; in that case we should try the easier and quicker way of implementing it first, which I guess would be defirqaffinity.py? Then, if it works we could use this for 2.10 and think about deprecating it in 2.11.

Resetting needinfo, since this affects the real-time host profile too.

Sorry for the delay, I think it's worth fixing for 2.10.

This bug is fixed in tuned-2.10.0-1.el7.

Guys, thank you so much for working on this for this release!
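For readers following the "remove" and "add" idea proposed earlier in this thread, the mask arithmetic can be sketched in Python as below. This is only an illustration of the proposal, not the fix that shipped in tuned-2.10.0-1.el7. The function names `remove_isolated` and `restore`, and the use of an in-memory dict instead of a saved file, are assumptions made for the example. The handling of an IRQ whose mask would become empty is also an assumption, since the thread does not cover that case.

```python
def remove_isolated(irq_masks, isolated_mask):
    """Drop isolated CPUs from each IRQ cpumask (hypothetical helper).

    irq_masks maps IRQ number -> integer cpumask; isolated_mask is the
    integer cpumask of isolated CPUs. Returns (new_masks, saved), where
    saved holds the original mask of every IRQ that was changed.
    """
    new_masks = {}
    saved = {}
    for irq, mask in irq_masks.items():
        if mask & isolated_mask == 0:
            # Only housekeeping CPUs in this mask: skip it, as proposed.
            new_masks[irq] = mask
            continue
        reduced = mask & ~isolated_mask
        if reduced == 0:
            # Assumption: an empty cpumask is not a valid target, so an
            # IRQ bound only to isolated CPUs is left untouched here.
            new_masks[irq] = mask
            continue
        saved[irq] = mask
        new_masks[irq] = reduced
    return new_masks, saved


def restore(irq_masks, saved):
    """Put back the original mask for every IRQ recorded in saved."""
    restored = dict(irq_masks)
    restored.update(saved)
    return restored
```

With IRQ masks `{1: 0b1111, 2: 0b0011, 3: 0b1100}` and isolated CPUs `0b1100`, only IRQ 1 is changed and recorded in the saved dict; IRQ 2 is skipped because it contains no isolated CPU. Restoring from the saved dict returns the original masks, which is the save and restore behaviour the thread asks for.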
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3172