Bug 2227536

Summary: irqbalance: Cannot change IRQ 76 affinity: Input/output error
Product: Red Hat Enterprise Linux 9 Reporter: Frank Liang <xiliang>
Component: irqbalanceAssignee: ltao
Status: CLOSED NOTABUG QA Contact: Jiri Dluhos <jdluhos>
Severity: medium Docs Contact:
Priority: medium    
Version: 9.3CC: andavis, jeder, jshortt, libhe, linl, qzhang, vkuznets, ymao
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-31 05:54:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
os_tests.tests.test_general_check.TestGeneralCheck.test_check_journalctl_error.debug none

Description Frank Liang 2023-07-30 14:56:46 UTC
Created attachment 1980710 [details]
os_tests.tests.test_general_check.TestGeneralCheck.test_check_journalctl_error.debug

Description of problem:
Found several errors from irqbalanceIn RHEL-9.3.0-20230730.21(irqbalance-1.9.2-2.el9.aarch64) test.

Below is details:

Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Startup finished in 1.409s (kernel) + 2.546s (initrd) + 12.764s (userspace) = 16.720s.
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 108 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 108 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 83 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 83 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 126 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 126 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 116 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 116 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 91 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 91 affinity is now unmanaged
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 106 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 106 affinity is now unmanaged

RHEL Version:
RHEL-9.3(5.14.0-345.el9.aarch64)

How reproducible:
100%

Steps to Reproduce:
1. launch aws i4g.8xlarge instance
2. check journal log

Actual results:
many errors from irqbalance

Expected results:
no such errors from /irqbalance

Additional info:
- this is a regression issue

Comment 1 ltao 2023-07-31 01:49:45 UTC
Hi Frank,

Thanks for reporting the issue. However I think it is expected due to upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered permanent"). In short words, if a IRQ number failed to set affinity due to !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be kicked out of schedule list in case of re-set affinity next time, and at the same time throw out a error message to report the failing details.

Previous version of irqbalance will just fail with no error report, so users don't know if a irq number is affinity set successfully or not, this bug is addressed in bz2184735, and the patches I mentioned previously is meant to fix this bz.

Thanks,
Tao Liu

Comment 2 Frank Liang 2023-07-31 02:47:34 UTC
(In reply to ltao from comment #1)
> Hi Frank,
> 
> Thanks for reporting the issue. However I think it is expected due to
> upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and
> eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered
> permanent"). In short words, if a IRQ number failed to set affinity due to
> !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be
> kicked out of schedule list in case of re-set affinity next time, and at the
> same time throw out a error message to report the failing details.
> 
> Previous version of irqbalance will just fail with no error report, so users
> don't know if a irq number is affinity set successfully or not, this bug is
> addressed in bz2184735, and the patches I mentioned previously is meant to
> fix this bz.
> 
> Thanks,
> Tao Liu

Thanks, ltao

I think it is good to throw out fail messages accordingly.
Now just feel that there are too many lines(128 similar lines in attachment) in journal logs. Is it possible to suppress them or save details in other place?

Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 99 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 99 affinity is now unmanaged

Comment 3 ltao 2023-07-31 03:33:05 UTC
(In reply to Frank Liang from comment #2)
> (In reply to ltao from comment #1)
> > Hi Frank,
> > 
> > Thanks for reporting the issue. However I think it is expected due to
> > upstream patch: 9a1fd29a82c97 ("activate_mapping: report error reason") and
> > eee7917ef52726 ("activate_mapping: only blacklist irq if error is considered
> > permanent"). In short words, if a IRQ number failed to set affinity due to
> > !(ENOSPC && EAGAIN && EBUSY && EINVAL && ERANGE && ENOMEM) error, it will be
> > kicked out of schedule list in case of re-set affinity next time, and at the
> > same time throw out a error message to report the failing details.
> > 
> > Previous version of irqbalance will just fail with no error report, so users
> > don't know if a irq number is affinity set successfully or not, this bug is
> > addressed in bz2184735, and the patches I mentioned previously is meant to
> > fix this bz.
> > 
> > Thanks,
> > Tao Liu
> 
> Thanks, ltao
> 
> I think it is good to throw out fail messages accordingly.
> Now just feel that there are too many lines(128 similar lines in attachment)
> in journal logs. Is it possible to suppress them or save details in other
> place?
> 
> Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal
> /usr/sbin/irqbalance[1041]: Cannot change IRQ 99 affinity: Input/output error
> Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal
> /usr/sbin/irqbalance[1041]: IRQ 99 affinity is now unmanaged

Hi Frank,

I see from your log, it seems to me you are doing some testing. The test script recongnized the error log throwed out by irqbalance. And it contains 4 parts:

1)
Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal kdumpctl[1446]: kdump: kexec: loaded kdump kernel
Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal kdumpctl[1446]: kdump: Starting kdump: [OK]
Jul 30 09:27:04 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Finished Crash recovery kernel arming.
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal systemd[1]: Startup finished in 1.409s (kernel) + 2.546s (initrd) + 12.764s (
userspace) = 16.720s.
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 108 affinity: Input/output error
Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: IRQ 108 affinity is now unmanaged
...snip...

This part is the real log outputted by irqbalance, 64 items in total.

2)
2023-07-30 09:36:09,607:INFO:error found!
2023-07-30 09:36:09,608:INFO:Checking:Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 76 affinity: Input/output error
...snip...

This part is the test script finishes testing, and it detected the error log so it output all the error info it found.

3)
2023-07-30 09:36:15,239:INFO:Below items are unknow!
['Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: Cannot change IRQ 76 affinity: Input/output error', 'Jul 30 09:27:05 ip-10-22-1-117.us-west-2.compute.internal /usr/sbin/irqbalance[1041]: ...snip...

This part is the same error log as 2, but report in array format.

4)
AssertionError: New error in journalctl -b 0|grep -Ev "os_tests.tests.test_general_check.TestGeneralCheck.test_check_journalctl_error"|
grep -Ev test_check|grep -Ev "UpdateGSErrors" log

This part is also the same error log as 2.

I suppose the system you are testing is a large system which contains hundreds of IRQs, among which, 64 IRQs failed to set affinity, it is acceptable to me. And also irqbalance only reported them once, it is also expected. I think suppressing the error logs is better to be handled in the test script side, it doesn't need to repeat outputting the error logs 3 times.

Thanks,
Tao Liu

Comment 4 Frank Liang 2023-07-31 05:41:08 UTC
Ok, pls feel free to close it as not a bug. Thanks

Comment 5 ltao 2023-07-31 05:54:04 UTC
OK, I will close the bz as not a bug.