Bug 2128472
| Summary: | stalld service fails: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Bill Zvonar <bzvonar> |
| Component: | stalld | Assignee: | Leah Leshchinsky <lleshchi> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.6 | CC: | bhu, bzvonar, daolivei, jkacur, kcarcia, lgoncalv, mcornea, mstowell, williams |
| Target Milestone: | rc | Keywords: | FutureFeature, Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-11-16 21:03:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2120800 | ||
| Bug Blocks: | |||
Description
Bill Zvonar
2022-09-20 17:26:25 UTC
I don't think stalld creates the /sys/kernel/debug/sched/features directory. It is root-only access, at least on my RHEL 8.6 system. What kernel version is being used? Can you confirm if that file exists on your system, and if so, what are the permissions?

@mcornea can you respond to Beth's question?

(In reply to Beth Uptagrafft from comment #1)
> I don't think stalld creates the /sys/kernel/debug/sched/features directory.
> It is root-only access, at least on my RHEL 8.6 system. What kernel version
> is being used? Can you confirm if that file exists on your system, and if
> so, what are the permissions?

[root@sno core]# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="411.86.202209140028-0"
VERSION_ID="4.11"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 411.86.202209140028-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.11/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.11"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.11"
OPENSHIFT_VERSION="4.11"
RHEL_VERSION="8.6"
OSTREE_VERSION="411.86.202209140028-0"
[root@sno core]# uname -r
4.18.0-372.26.1.rt7.183.el8_6.x86_64
[root@sno core]# ls -l /sys/kernel/debug/sched/features
ls: cannot access '/sys/kernel/debug/sched/features': No such file or directory

FWIW I am seeing the same HRTICK error with an older stalld release (stalld-1.15-2.el8_4.x86_64) on OCP 4.10, but it is not preventing the service from starting:
[root@sno core]# systemctl status stalld
● stalld.service - Stall Monitor
Loaded: loaded (/etc/systemd/system/stalld.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2022-09-21 11:14:51 UTC; 4min 50s ago
Process: 3313452 ExecStopPost=/usr/local/bin/throttlectl.sh on (code=exited, status=0/SUCCESS)
Process: 3314494 ExecStartPre=/usr/local/bin/throttlectl.sh off (code=exited, status=0/SUCCESS)
Main PID: 3314500 (stalld)
Tasks: 1 (limit: 402810)
Memory: 532.0K
CPU: 110ms
CGroup: /system.slice/stalld.service
└─3314500 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched/features doesn't exist
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched_features exists
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched/debug doesn't exist
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /proc/sched_debug exists
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: boosted pid 0 using SCHED_DEADLINE
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: using SCHED_DEADLINE for boosting
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: initial config_buffer_size set to 1966080
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: detected new task format
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: single threaded mode
while on OCP 4.11 with stalld-1.17-1.el8_6.x86_64 the service fails to start:
[root@sno core]# systemctl status stalld
● stalld.service - Stall Monitor
Loaded: loaded (/usr/lib/systemd/system/stalld.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2022-09-21 11:18:14 UTC; 15s ago
Process: 184942 ExecStopPost=/usr/bin/throttlectl on (code=exited, status=0/SUCCESS)
Process: 184940 ExecStart=/usr/bin/stalld --systemd $CLIST $AGGR $BP $BR $BD $THRESH $LOGGING $FG $PF $IT $IP (code=exited, status=1/FAILURE)
Process: 184934 ExecStartPre=/usr/bin/throttlectl off (code=exited, status=0/SUCCESS)
Main PID: 184940 (code=exited, status=1/FAILURE)
CPU: 23ms
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: /sys/kernel/debug/sched/features doesn't exist
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: /sys/kernel/debug/sched_features exists
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Main process exited, code=exited, status=1/FAILURE
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: stalld can't enable HRTICK. stalld cannot run in this mode. Exiting..
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Failed with result 'exit-code'.
Looks like this check was introduced in v1.17 by https://gitlab.com/rt-linux-tools/stalld/-/commit/27922ea36bbcf853c078dc3942610072231b7ea3

So, to explain the situation. To be able to limit the "interference" (noise) that the boosted thread adds to the "busy-loop" thread, we use SCHED_DEADLINE. The granularity of the SCHED_DEADLINE throttling mechanism is 1 ms by default. However, by setting HRTICK, we reduce this granularity to the microseconds range by using a high-resolution timer.

Before 1.17, stalld was not checking the error of setting HRTICK, so the daemon was operating in a not-so-precise way. These error messages were being ignored. Now stalld checks for this error and fails if the correct behavior cannot be achieved.

We need to think better about an upstream solution for this inconvenience. Ignoring this error *is not* the right way to go.

For a downstream solution in case of an emergency, the options I see are:
1) Find a setup that allows stalld to work properly (having permission to write to /sys/kernel/debug/sched_features or /sys/kernel/debug/sched/features)
2) Do a patch in the .rpm to ignore this error.

In my opinion stalld is doing the right thing by failing if the HRTICK is not available. Otherwise the user is under the impression that everything is working correctly, which could lead to a very difficult debugging problem if they then notice that they are not getting the expected performance. I think the downstream solution is the first one that Daniel lists: OpenShift needs to find a way to allow stalld to read and write to the sys features file.

Upstream release containing the fix: https://gitlab.com/rt-linux-tools/stalld/-/releases/v1.17.1

This bz was requested for rhel-8.6z. Could the customer tell us which zstream releases they are requesting this for?
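
For anyone who wants to check the HRTICK state by hand on an affected node, here is a minimal sketch. It is not taken from stalld; the function names `hrtick_state` and `find_features_file` are hypothetical helpers made up for illustration. It relies on the kernel listing enabled scheduler features by name and disabled ones with a NO_ prefix, and it assumes debugfs is mounted at /sys/kernel/debug. The two candidate paths are the ones stalld probes in the logs above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of stalld.

# Print "enabled", "disabled" or "unknown" for HRTICK, given the
# space-separated contents of the sched features file as $1.
hrtick_state() {
    local token
    for token in $1; do
        case "$token" in
            HRTICK)    echo enabled;  return 0 ;;
            NO_HRTICK) echo disabled; return 0 ;;
        esac
    done
    echo unknown
    return 0
}

# Print the first path among the arguments that exists.
# Returns non-zero if none of them exists.
find_features_file() {
    local path
    for path in "$@"; do
        if [ -e "$path" ]; then
            echo "$path"
            return 0
        fi
    done
    return 1
}

# Example use (run as root on the node):
#   f=$(find_features_file /sys/kernel/debug/sched/features \
#                          /sys/kernel/debug/sched_features) || exit 1
#   ls -l "$f"
#   hrtick_state "$(cat "$f")"
#   echo HRTICK > "$f"   # attempts the write that stalld reports as failing
```

If the final write fails with "Operation not permitted" even as root, that matches the failure stalld logs, and it points at the environment's access to the file rather than at stalld itself.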