Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 2227536

Summary: irqbalance: Cannot change IRQ 76 affinity: Input/output error
Product: Red Hat Enterprise Linux 9 Reporter: Frank Liang <xiliang>
Component: irqbalanceAssignee: ltao
Status: CLOSED NOTABUG QA Contact: Jiri Dluhos <jdluhos>
Severity: medium Docs Contact:
Priority: medium    
Version: 9.3CC: andavis, jeder, jshortt, libhe, linl, lixc17, pkenyon, qzhang, siddharth.yelandur, vkuznets, ymao
Target Milestone: rcKeywords: Regression
Target Release: ---Flags: pm-rhel: mirror+
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-31 05:54:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
os_tests.tests.test_general_check.TestGeneralCheck.test_check_journalctl_error.debug none

Description Frank Liang 2023-07-30 14:56:46 UTC
Created attachment 1980710 [details]
os_tests.tests.test_general_check.TestGeneralCheck.test_check_journalctl_error.debug

Description of problem:
Found several errors from irqbalanceIn RHEL-9.3.0-20230730.21(irqbalance-1.9.2-2.el9.aarch64) test.

Below is details:

Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Startup finished in 1.409s (kernel) + 2.546s (initrd) + 12.764s (userspace) = 16.720s.
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 108 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 108 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 83 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 83 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 126 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 126 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 116 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 116 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 91 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 91 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 106 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 106 affinity is now unmanaged

RHEL Version:
RHEL-9.3(5.14.0-345.el9.aarch64)

How reproducible:
100%

Steps to Reproduce:
1. launch aws i4g.8xlarge instance
2. check journal log

Actual results:
many errors from irqbalance

Expected results:
no such errors from /irqbalance

Additional info:
- this is a regression issue

Comment 1 ltao 2023-07-31 01:49:45 UTC
Hi Frank,

Thanks for reporting the issue. However I think it is expected due to upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered permanent"). In short words, if a IRQ number failed to set affinity due to !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be kicked out of schedule list in case of re-set affinity next time, and at the same time throw out a error message to report the failing details.

Previous version of irqbalance will just fail with no error report, so users don't know if a irq number is affinity set successfully or not, this bug is addressed in bz2184735, and the patches I mentioned previously is meant to fix this bz.

Thanks,
Tao Liu

Comment 2 Frank Liang 2023-07-31 02:47:34 UTC
(In reply to ltao from comment #1)
> Hi Frank,
> 
> Thanks for reporting the issue. However I think it is expected due to
> upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and
> eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered
> permanent"). In short words, if a IRQ number failed to set affinity due to
> !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be
> kicked out of schedule list in case of re-set affinity next time, and at the
> same time throw out a error message to report the failing details.
> 
> Previous version of irqbalance will just fail with no error report, so users
> don't know if a irq number is affinity set successfully or not, this bug is
> addressed in bz2184735, and the patches I mentioned previously is meant to
> fix this bz.
> 
> Thanks,
> Tao Liu

Thanks, ltao

I think it is good to throw out fail messages accordingly.
Now just feel that there are too many lines(128 similar lines in attachment) in journal logs. Is it possible to suppress them or save details in other place?

Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 99 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 99 affinity is now unmanaged

Comment 3 ltao 2023-07-31 03:33:05 UTC
(In reply to Frank Liang from comment #2)
> (In reply to ltao from comment #1)
> > Hi Frank,
> > 
> > Thanks for reporting the issue. However I think it is expected due to
> > upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and
> > eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered
> > permanent"). In short words, if a IRQ number failed to set affinity due to
> > !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be
> > kicked out of schedule list in case of re-set affinity next time, and at the
> > same time throw out a error message to report the failing details.
> > 
> > Previous version of irqbalance will just fail with no error report, so users
> > don't know if a irq number is affinity set successfully or not, this bug is
> > addressed in bz2184735, and the patches I mentioned previously is meant to
> > fix this bz.
> > 
> > Thanks,
> > Tao Liu
> 
> Thanks, ltao
> 
> I think it is good to throw out fail messages accordingly.
> Now just feel that there are too many lines(128 similar lines in attachment)
> in journal logs. Is it possible to suppress them or save details in other
> place?
> 
> Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal
> /usr/sbin/irqbalance[1041]: Cannot change IRQ 99 affinity: Input/output error
> Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal
> /usr/sbin/irqbalance[1041]: IRQ 99 affinity is now unmanaged

Hi Frank,

I see from your log, it seems to me you are doing some testing. The test script recongnized the error log throwed out by irqbalance. And it contains 4 parts:

1)
Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal kdumpctl[1446]: kdump: kexec: loaded kdump kernel
Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal kdumpctl[1446]: kdump: Starting kdump: [OK]
Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Finished Crash recovery kernel arming.
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Startup finished in 1.409s (kernel) + 2.546s (initrd) + 12.764s (
userspace) = 16.720s.
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 108 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 108 affinity is now unmanaged
...snip...

This part is the real log outputted by irqbalance, 64 items in total.

2)
2023-07-30 09:36:09,607:INFO:error found!
2023-07-30 09:36:09,608:INFO:Checking:Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 76 affinity: Input/output error
...snip...

This part is the test script finishes testing, and it detected the error log so it output all the error info it found.

3)
2023-07-30 09:36:15,239:INFO:Below items are unknow!
['Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 76 affinity: Input/output error', 'Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: ...snip...

This part is the same error log as 2, but report in array format.

4)
AssertionError: New error in journalctl -b 0|grep -Ev "os_tests.tests.test_general_check.TestGeneralCheck.test_check_journalctl_error"|
grep -Ev test_check|grep -Ev "UpdateGSErrors" log

This part is also the same error log as 2.

I suppose the system you are testing is a large system which contains hundreds of IRQs, among which, 64 IRQs failed to set affinity, it is acceptable to me. And also irqbalance only reported them once, it is also expected. I think suppressing the error logs is better to be handled in the test script side, it doesn't need to repeat outputting the error logs 3 times.

Thanks,
Tao Liu

Comment 4 Frank Liang 2023-07-31 05:41:08 UTC
Ok, pls feel free to close it as not a bug. Thanks

Comment 5 ltao 2023-07-31 05:54:04 UTC
OK, I will close the bz as not a bug.

Comment 6 Siddharth.Yelandur@stratus.com 2024-10-03 11:11:09 UTC
Hi Tao Liu,

I understand that the fix was added to let users know if an irq number is affinity set successfully or not, but in 9.4 I am seeing that these are logged way too many times into /var/log/messages.
Is there any way of suppressing this as asked before.

Thanks,
Siddharth

Comment 7 Siddharth.Yelandur@stratus.com 2024-10-03 11:14:30 UTC
Hi Tao Liu,

Also, as I understand these messages are for information only, please let me know if otherwise. With these messages too I saw the affinity was set for all the IRQ's and no issue was observed.

Thanks,
Siddharth

Comment 8 ltao 2024-10-06 22:26:43 UTC
(In reply to Siddharth.Yelandur from comment #6)
> Hi Tao Liu,
> 
> I understand that the fix was added to let users know if an irq number is
> affinity set successfully or not, but in 9.4 I am seeing that these are
> logged way too many times into /var/log/messages.

There shouldn't be "too many times" for "Cannot change IRQ X affinity: Input/output error", it will only show once for each IRQ at the start of irqbalance service.

> Is there any way of suppressing this as asked before.

Why need we suppress the info? 1st, the info are saved in systemd journal, ordinary users would not inspect it regularly. 2nd, If you do want to inspect the log file, it is very easy to get the info filtered out by cmd like "grep, sed, awk", is this a big problem for logs? Frankly I didn't see a strong motivations for the change, if you do, please let me know.

Thanks,
Tao Liu

> 
> Thanks,
> Siddharth

Comment 9 ltao 2024-10-06 22:36:09 UTC
(In reply to Siddharth.Yelandur from comment #7)
> Hi Tao Liu,
> 
> Also, as I understand these messages are for information only, please let me
> know if otherwise. With these messages too I saw the affinity was set for
> all the IRQ's and no issue was observed.

For the EIO errors, it means the users cannot set customed affinity to the IRQ, so it will not be rescheduled by irqbalance next time, thus these IRQs' affinity will be left as it is, which is unlikely satisfy your affinity rules. For EAGAIN EBUSY EINVAL ERANGE ENOMEM ENOSPC errors, they can be rescheduled next time, to it can work as you expected.

> 
> Thanks,
> Siddharth

Comment 10 ltao 2024-11-05 21:28:07 UTC
Just paste a link here for referencing purpose[1], in case other people have a similar request to suppress the logs, this link provide an approach to suppress it from systemd side.

[1]: https://github.com/Irqbalance/irqbalance/pull/332#issuecomment-2457641689

Comment 11 xiaochun lee 2024-12-20 08:04:01 UTC
(In reply to ltao from comment #9)
> (In reply to Siddharth.Yelandur from comment #7)
> > Hi Tao Liu,
> > 
> > Also, as I understand these messages are for information only, please let me
> > know if otherwise. With these messages too I saw the affinity was set for
> > all the IRQ's and no issue was observed.
> 
> For the EIO errors, it means the users cannot set customed affinity to the
> IRQ, so it will not be rescheduled by irqbalance next time, thus these IRQs'
> affinity will be left as it is, which is unlikely satisfy your affinity
> rules. For EAGAIN EBUSY EINVAL ERANGE ENOMEM ENOSPC errors, they can be
> rescheduled next time, to it can work as you expected.
> 
> > 
> > Thanks,
> > Siddharth

Hi, Tao Liu
Thanks your information, I encountered the same symptom lately. But one more question about in which situations the driver returns -EIO to the irqbalance service. Does it require the driver vendor to investigate this? After checking the kernel code, the -EIO return value indicates that the irq_data->common.state_use_accessors has been set to IRQD_NO_BALANCING or is not set to IRQD_AFFINITY_MANAGED (IRQD_AFFINITY_MANAGED means affinity is auto-managed by the kernel). Consequently, the irq cannot be set from the user space. Hoping your reply, thanks!

Comment 12 ltao 2024-12-24 05:15:35 UTC
(In reply to xiaochun lee from comment #11)
> (In reply to ltao from comment #9)
> > (In reply to Siddharth.Yelandur from comment #7)
> > > Hi Tao Liu,
> > > 
> > > Also, as I understand these messages are for information only, please let me
> > > know if otherwise. With these messages too I saw the affinity was set for
> > > all the IRQ's and no issue was observed.
> > 
> > For the EIO errors, it means the users cannot set customed affinity to the
> > IRQ, so it will not be rescheduled by irqbalance next time, thus these IRQs'
> > affinity will be left as it is, which is unlikely satisfy your affinity
> > rules. For EAGAIN EBUSY EINVAL ERANGE ENOMEM ENOSPC errors, they can be
> > rescheduled next time, to it can work as you expected.
> > 
> > > 
> > > Thanks,
> > > Siddharth
> 
> Hi, Tao Liu
> Thanks your information, I encountered the same symptom lately. But one more
> question about in which situations the driver returns -EIO to the irqbalance
> service. Does it require the driver vendor to investigate this? After
> checking the kernel code, the -EIO return value indicates that the
> irq_data->common.state_use_accessors has been set to IRQD_NO_BALANCING or is
> not set to IRQD_AFFINITY_MANAGED (IRQD_AFFINITY_MANAGED means affinity is
> auto-managed by the kernel). Consequently, the irq cannot be set from the
> user space. Hoping your reply, thanks!

Just curious, is a specific IRQ unmigratable really matters? The only case to me is, you want to shutdown a CPU, so you have to migrate all tasks or IRQs out of it, and if the IRQ is unmigratable, then it matters, is this your case? I cannot tell why a specific IRQ is unmigratable, I'm not an expert to kernel's IRQ subsystem. I guess it is due to hardware constraint, so vendor's suggestion will be helpful.

Comment 13 xiaochun lee 2024-12-24 07:49:28 UTC
Hi, Tao Liu
Thanks your reply. 
Yes, all the IRQ numbers for which irqbalance reports errors such as "Cannot change IRQ 174 affinity: Input/output error" and "IRQ 174 affinity is now unmanaged" originate from a RAID controller. At present, we're merely curious as to why only the IRQs of the RAID adapter cannot be migrated. In any case, we plan to verify this with the adapter vendor. Thanks!