Bug 2227536
| Summary: | irqbalance: Cannot change IRQ 76 affinity: Input/output error | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Frank Liang <xiliang> | ||||
| Component: | irqbalance | Assignee: | ltao | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Jiri Dluhos <jdluhos> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 9.3 | CC: | andavis, jeder, jshortt, libhe, linl, lixc17, pkenyon, qzhang, siddharth.yelandur, vkuznets, ymao | ||||
| Target Milestone: | rc | Keywords: | Regression | ||||
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2023-07-31 05:54:04 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
Hi Frank,
Thanks for reporting the issue. However I think it is expected due to upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered permanent"). In short words, if a IRQ number failed to set affinity due to !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be kicked out of schedule list in case of re-set affinity next time, and at the same time throw out a error message to report the failing details.
Previous version of irqbalance will just fail with no error report, so users don't know if a irq number is affinity set successfully or not, this bug is addressed in bz2184735, and the patches I mentioned previously is meant to fix this bz.
Thanks,
Tao Liu
(In reply to ltao from comment #1) > Hi Frank, > > Thanks for reporting the issue. However I think it is expected due to > upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and > eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered > permanent"). In short words, if a IRQ number failed to set affinity due to > !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be > kicked out of schedule list in case of re-set affinity next time, and at the > same time throw out a error message to report the failing details. > > Previous version of irqbalance will just fail with no error report, so users > don't know if a irq number is affinity set successfully or not, this bug is > addressed in bz2184735, and the patches I mentioned previously is meant to > fix this bz. > > Thanks, > Tao Liu Thanks, ltao I think it is good to throw out fail messages accordingly. Now just feel that there are too many lines(128 similar lines in attachment) in journal logs. Is it possible to suppress them or save details in other place? Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 99 affinity: Input/output error Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 99 affinity is now unmanaged (In reply to Frank Liang from comment #2) > (In reply to ltao from comment #1) > > Hi Frank, > > > > Thanks for reporting the issue. However I think it is expected due to > > upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and > > eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered > > permanent"). In short words, if a IRQ number failed to set affinity due to > > !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be > > kicked out of schedule list in case of re-set affinity next time, and at the > > same time throw out a error message to report the failing details. > > > > Previous version of irqbalance will just fail with no error report, so users > > don't know if a irq number is affinity set successfully or not, this bug is > > addressed in bz2184735, and the patches I mentioned previously is meant to > > fix this bz. > > > > Thanks, > > Tao Liu > > Thanks, ltao > > I think it is good to throw out fail messages accordingly. > Now just feel that there are too many lines(128 similar lines in attachment) > in journal logs. Is it possible to suppress them or save details in other > place? > > Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal > /usr/sbin/irqbalance[1041]: Cannot change IRQ 99 affinity: Input/output error > Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal > /usr/sbin/irqbalance[1041]: IRQ 99 affinity is now unmanaged Hi Frank, I see from your log, it seems to me you are doing some testing. The test script recongnized the error log throwed out by irqbalance. And it contains 4 parts: 1) Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal kdumpctl[1446]: kdump: kexec: loaded kdump kernel Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal kdumpctl[1446]: kdump: Starting kdump: [OK] Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Finished Crash recovery kernel arming. Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Startup finished in 1.409s (kernel) + 2.546s (initrd) + 12.764s ( userspace) = 16.720s. Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 108 affinity: Input/output error Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 108 affinity is now unmanaged ...snip... This part is the real log outputted by irqbalance, 64 items in total. 2) 2023-07-30 09:36:09,607:INFO:error found! 2023-07-30 09:36:09,608:INFO:Checking:Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 76 affinity: Input/output error ...snip... This part is the test script finishes testing, and it detected the error log so it output all the error info it found. 3) 2023-07-30 09:36:15,239:INFO:Below items are unknow! ['Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 76 affinity: Input/output error', 'Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: ...snip... This part is the same error log as 2, but report in array format. 4) AssertionError: New error in journalctl -b 0|grep -Ev "os_tests.tests.test_general_check.TestGeneralCheck.test_check_journalctl_error"| grep -Ev test_check|grep -Ev "UpdateGSErrors" log This part is also the same error log as 2. I suppose the system you are testing is a large system which contains hundreds of IRQs, among which, 64 IRQs failed to set affinity, it is acceptable to me. And also irqbalance only reported them once, it is also expected. I think suppressing the error logs is better to be handled in the test script side, it doesn't need to repeat outputting the error logs 3 times. Thanks, Tao Liu Ok, pls feel free to close it as not a bug. Thanks OK, I will close the bz as not a bug. Hi Tao Liu, I understand that the fix was added to let users know if an irq number is affinity set successfully or not, but in 9.4 I am seeing that these are logged way too many times into /var/log/messages. Is there any way of suppressing this as asked before. Thanks, Siddharth Hi Tao Liu, Also, as I understand these messages are for information only, please let me know if otherwise. With these messages too I saw the affinity was set for all the IRQ's and no issue was observed. Thanks, Siddharth (In reply to Siddharth.Yelandur from comment #6) > Hi Tao Liu, > > I understand that the fix was added to let users know if an irq number is > affinity set successfully or not, but in 9.4 I am seeing that these are > logged way too many times into /var/log/messages. There shouldn't be "too many times" for "Cannot change IRQ X affinity: Input/output error", it will only show once for each IRQ at the start of irqbalance service. > Is there any way of suppressing this as asked before. Why need we suppress the info? 1st, the info are saved in systemd journal, ordinary users would not inspect it regularly. 2nd, If you do want to inspect the log file, it is very easy to get the info filtered out by cmd like "grep, sed, awk", is this a big problem for logs? Frankly I didn't see a strong motivations for the change, if you do, please let me know. Thanks, Tao Liu > > Thanks, > Siddharth (In reply to Siddharth.Yelandur from comment #7) > Hi Tao Liu, > > Also, as I understand these messages are for information only, please let me > know if otherwise. With these messages too I saw the affinity was set for > all the IRQ's and no issue was observed. For the EIO errors, it means the users cannot set customed affinity to the IRQ, so it will not be rescheduled by irqbalance next time, thus these IRQs' affinity will be left as it is, which is unlikely satisfy your affinity rules. For EAGAIN EBUSY EINVAL ERANGE ENOMEM ENOSPC errors, they can be rescheduled next time, to it can work as you expected. > > Thanks, > Siddharth Just paste a link here for referencing purpose[1], in case other people have a similar request to suppress the logs, this link provide an approach to suppress it from systemd side. [1]: https://github.com/Irqbalance/irqbalance/pull/332#issuecomment-2457641689 (In reply to ltao from comment #9) > (In reply to Siddharth.Yelandur from comment #7) > > Hi Tao Liu, > > > > Also, as I understand these messages are for information only, please let me > > know if otherwise. With these messages too I saw the affinity was set for > > all the IRQ's and no issue was observed. > > For the EIO errors, it means the users cannot set customed affinity to the > IRQ, so it will not be rescheduled by irqbalance next time, thus these IRQs' > affinity will be left as it is, which is unlikely satisfy your affinity > rules. For EAGAIN EBUSY EINVAL ERANGE ENOMEM ENOSPC errors, they can be > rescheduled next time, to it can work as you expected. > > > > > Thanks, > > Siddharth Hi, Tao Liu Thanks your information, I encountered the same symptom lately. But one more question about in which situations the driver returns -EIO to the irqbalance service. Does it require the driver vendor to investigate this? After checking the kernel code, the -EIO return value indicates that the irq_data->common.state_use_accessors has been set to IRQD_NO_BALANCING or is not set to IRQD_AFFINITY_MANAGED (IRQD_AFFINITY_MANAGED means affinity is auto-managed by the kernel). Consequently, the irq cannot be set from the user space. Hoping your reply, thanks! (In reply to xiaochun lee from comment #11) > (In reply to ltao from comment #9) > > (In reply to Siddharth.Yelandur from comment #7) > > > Hi Tao Liu, > > > > > > Also, as I understand these messages are for information only, please let me > > > know if otherwise. With these messages too I saw the affinity was set for > > > all the IRQ's and no issue was observed. > > > > For the EIO errors, it means the users cannot set customed affinity to the > > IRQ, so it will not be rescheduled by irqbalance next time, thus these IRQs' > > affinity will be left as it is, which is unlikely satisfy your affinity > > rules. For EAGAIN EBUSY EINVAL ERANGE ENOMEM ENOSPC errors, they can be > > rescheduled next time, to it can work as you expected. > > > > > > > > Thanks, > > > Siddharth > > Hi, Tao Liu > Thanks your information, I encountered the same symptom lately. But one more > question about in which situations the driver returns -EIO to the irqbalance > service. Does it require the driver vendor to investigate this? After > checking the kernel code, the -EIO return value indicates that the > irq_data->common.state_use_accessors has been set to IRQD_NO_BALANCING or is > not set to IRQD_AFFINITY_MANAGED (IRQD_AFFINITY_MANAGED means affinity is > auto-managed by the kernel). Consequently, the irq cannot be set from the > user space. Hoping your reply, thanks! Just curious, is a specific IRQ unmigratable really matters? The only case to me is, you want to shutdown a CPU, so you have to migrate all tasks or IRQs out of it, and if the IRQ is unmigratable, then it matters, is this your case? I cannot tell why a specific IRQ is unmigratable, I'm not an expert to kernel's IRQ subsystem. I guess it is due to hardware constraint, so vendor's suggestion will be helpful. Hi, Tao Liu Thanks your reply. Yes, all the IRQ numbers for which irqbalance reports errors such as "Cannot change IRQ 174 affinity: Input/output error" and "IRQ 174 affinity is now unmanaged" originate from a RAID controller. At present, we're merely curious as to why only the IRQs of the RAID adapter cannot be migrated. In any case, we plan to verify this with the adapter vendor. Thanks! |
Created attachment 1980710 [details] os_tests.tests.test_general_check.TestGeneralCheck.test_check_journalctl_error.debug Description of problem: Found several errors from irqbalanceIn RHEL-9.3.0-20230730.21(irqbalance-1.9.2-2.el9.aarch64) test. Below is details: Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Startup finished in 1.409s (kernel) + 2.546s (initrd) + 12.764s (userspace) = 16.720s. Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 108 affinity: Input/output error Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 108 affinity is now unmanaged Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 83 affinity: Input/output error Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 83 affinity is now unmanaged Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 126 affinity: Input/output error Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 126 affinity is now unmanaged Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 116 affinity: Input/output error Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 116 affinity is now unmanaged Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 91 affinity: Input/output error Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 91 affinity is now unmanaged Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 106 affinity: Input/output error Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 106 affinity is now unmanaged RHEL Version: RHEL-9.3(5.14.0-345.el9.aarch64) How reproducible: 100% Steps to Reproduce: 1. launch aws i4g.8xlarge instance 2. check journal log Actual results: many errors from irqbalance Expected results: no such errors from /irqbalance Additional info: - this is a regression issue