Bug 2016540

Summary: RHEL9 traceback in fv_cpu_pinning test on some aarch64 systems
Product: Red Hat Enterprise Linux 9
Component: tuna
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: John Kacur <jkacur>
Assignee: John Kacur <jkacur>
QA Contact: Qiao Zhao <qzhao>
CC: bhu, mstowell, qzhao, rt-maint
Target Milestone: rc
Target Release: ---
Keywords: Triaged
Flags: pm-rhel: mirror+
Fixed In Version: tuna-0.16-3.el9
Clones: 2018285 (view as bug list)
Last Closed: 2022-05-17 15:55:01 UTC
Type: Bug
Bug Blocks: 2018285, 2020013

Description John Kacur 2021-10-21 21:03:32 UTC
Description of problem: traceback in tuna on some aarch64 systems when trying to isolate CPUs on RHEL 9

[root@ampere-hr330a-09 ~]# tuna --cpus 31 --isolate
Traceback (most recent call last):
  File "/usr/bin/tuna", line 763, in <module>
    main()
  File "/usr/bin/tuna", line 601, in main
    tuna.isolate_cpus(cpu_list, get_nr_cpus())
  File "/usr/lib/python3.9/site-packages/tuna/tuna.py", line 370, in isolate_cpus
    raise err
  File "/usr/lib/python3.9/site-packages/tuna/tuna.py", line 363, in isolate_cpus
    os.sched_setaffinity(pid, affinity)
OSError: [Errno 16] Device or resource busy

How reproducible:
Not on every machine, but on the machines where it happens, every time.

Comment 1 John Kacur 2021-10-21 21:05:19 UTC
I did some more investigation and added a line to print the pid in tuna:
./tuna-cmd.py -c 31 -i
pid = 991
Traceback (most recent call last):
  File "/root/src/tuna/./tuna-cmd.py", line 762, in <module>
    main()
  File "/root/src/tuna/./tuna-cmd.py", line 601, in main
    tuna.isolate_cpus(cpu_list, get_nr_cpus())
  File "/root/src/tuna/tuna/tuna.py", line 371, in isolate_cpus
    raise err
  File "/root/src/tuna/tuna/tuna.py", line 363, in isolate_cpus
    os.sched_setaffinity(pid, affinity)
OSError: [Errno 16] Device or resource busy

and I looked at pid 991:
cat /proc/991/stat
991 (cppc_fie) S 2 0 0 0 -1 2129984 0 0 0 0 1635 43557 0 0 -101 0 1 0 1595 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 22 0 6 0 0 0 0 0 0 0 0 0 0 0

This is the flags field, 2129984; check it against PF_NO_SETAFFINITY (0x04000000):
python
Python 3.9.7 (default, Sep 9 2021, 00:00:00)
[GCC 11.2.1 20210728 (Red Hat 11.2.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 2129984 & 0x04000000 and True or False
False
The flag is not set, which means setaffinity should be allowed.
ps ax | grep 991
991 ? S 7:37 [cppc_fie]

Wondering if there is something about cppc_fie in the kernel that prevents setting affinity.
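The flag check from the Python session above can be wrapped in a small helper. This is a sketch of the same check, not tuna code; the function name is ours. It parses the flags field (field 9) out of a /proc/<pid>/stat line and tests the PF_NO_SETAFFINITY bit (0x04000000, from include/linux/sched.h):

```python
PF_NO_SETAFFINITY = 0x04000000  # kernel flag: task may not change its affinity

def setaffinity_blocked_by_flags(stat_line: str) -> bool:
    """Return True if the PF_NO_SETAFFINITY flag is set in a /proc/<pid>/stat line."""
    # The command name (field 2) is parenthesized and may contain spaces,
    # so split once after the closing ')'; the remainder starts at field 3
    # (state), making flags (field 9) the 7th element of the remainder.
    rest = stat_line.rsplit(')', 1)[1].split()
    flags = int(rest[6])
    return bool(flags & PF_NO_SETAFFINITY)
```

Running it on the stat line of pid 991 above returns False, matching the interactive check: the kernel is not forbidding setaffinity via this flag.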

Comment 2 John Kacur 2021-10-21 21:05:44 UTC
cppc_fie uses SCHED_DEADLINE

If admission control is on, you cannot restrict the CPUs of a SCHED_DEADLINE task to a smaller set.

Try the following:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us

If you are able to run
tuna --cpus 31 --isolate
successfully after that, then we know what the problem is.
However, shutting off admission control is probably not a good work around for you.
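If you do want to script the temporary workaround, it can be done safely by restoring the old value afterwards. This is a sketch under our own naming, not part of tuna; it needs root, since it writes to /proc/sys/kernel/sched_rt_runtime_us (writing -1 disables RT throttling and SCHED_DEADLINE admission control):

```python
from contextlib import contextmanager

RT_RUNTIME = "/proc/sys/kernel/sched_rt_runtime_us"

@contextmanager
def admission_control_disabled(path=RT_RUNTIME):
    """Temporarily write -1 to sched_rt_runtime_us, restoring the saved value on exit."""
    with open(path) as f:
        saved = f.read().strip()
    try:
        with open(path, "w") as f:
            f.write("-1\n")  # disable admission control
        yield
    finally:
        with open(path, "w") as f:
            f.write(saved + "\n")  # restore, even if the body raised

# Usage (as root):
#   import subprocess
#   with admission_control_disabled():
#       subprocess.run(["tuna", "--cpus", "31", "--isolate"], check=True)
```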

Comment 4 John Kacur 2021-10-21 21:07:57 UTC
Note this probably manifests itself in rhel-9 because of the way the environment is set up.
However, it could potentially happen in rhel-8.x too, so any changes should be backported there as well.

Comment 5 John Kacur 2021-10-28 17:44:31 UTC
The fix I added prints a warning if setaffinity triggers an EBUSY error and continues.
This can occur if a pid is attached to a device using SCHED_DEADLINE and admission control is on.

The user can then do one of the following:
1. Simply ignore the one pid when isolating the CPU,
or, if the user is worried it might be impacting performance (such as realtime latency),
2. Reboot using isolcpus to isolate the CPU,
or
3. Turn off admission control, rerun the tuna isolate, and then turn admission control back on.

Comment 17 errata-xmlrpc 2022-05-17 15:55:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: tuna), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:3955