Bug 2128472

Summary: stalld service fails: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Product: Red Hat Enterprise Linux 8 Reporter: Bill Zvonar <bzvonar>
Component: stalld Assignee: Leah Leshchinsky <lleshchi>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 8.6 CC: bhu, bzvonar, daolivei, jkacur, kcarcia, lgoncalv, mcornea, mstowell, williams
Target Milestone: rc Keywords: FutureFeature, Triaged
Target Release: --- Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-11-16 21:03:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2120800    
Bug Blocks:    

Description Bill Zvonar 2022-09-20 17:26:25 UTC
Description of problem:
stalld service fails: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted

Version-Release number of selected component (if applicable):
v1.17 (OCP 4.11.5)

How reproducible:
Always on nodes with SecureBoot enabled

Steps to Reproduce:
1. Deploy SNO with SecureBoot enabled
2. On the deployed node check stalld status

Actual results:
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: Starting Stall Monitor...
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: Started Stall Monitor.
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[36993]: /sys/kernel/debug/sched/features doesn't exist
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[36993]: /sys/kernel/debug/sched_features exists
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[36993]: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[36993]: stalld can't enable HRTICK. stalld cannot run in this mode. Exiting..
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Failed with result 'exit-code'.


Expected results:
stalld service runs without failures

Additional info:

Comment 1 Beth Uptagrafft 2022-09-20 17:41:21 UTC
I don't think stalld creates the /sys/kernel/debug/sched/features directory.  It is root-only access, at least on my RHEL 8.6 system. What kernel version is being used?   Can you confirm if that file exists on your system, and if so, what are the permissions?

Comment 2 Bill Zvonar 2022-09-20 17:44:46 UTC
@mcornea can you respond to Beth's question?

Comment 8 Marius Cornea 2022-09-21 10:30:08 UTC
(In reply to Beth Uptagrafft from comment #1)
> I don't think stalld creates the /sys/kernel/debug/sched/features directory.
> It is root-only access, at least on my RHEL 8.6 system. What kernel version
> is being used?   Can you confirm if that file exists on your system, and if
> so, what are the permissions?

[root@sno core]# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="411.86.202209140028-0"
VERSION_ID="4.11"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 411.86.202209140028-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.11/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.11"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.11"
OPENSHIFT_VERSION="4.11"
RHEL_VERSION="8.6"
OSTREE_VERSION="411.86.202209140028-0"

[root@sno core]# uname -r
4.18.0-372.26.1.rt7.183.el8_6.x86_64

[root@sno core]# ls -l /sys/kernel/debug/sched/features
ls: cannot access '/sys/kernel/debug/sched/features': No such file or directory

Comment 10 Marius Cornea 2022-09-21 11:21:42 UTC
FWIW I am seeing the same HRTICK error with an older stalld release (stalld-1.15-2.el8_4.x86_64) on OCP 4.10, but there it does not prevent the service from starting:

[root@sno core]# systemctl status stalld
● stalld.service - Stall Monitor
   Loaded: loaded (/etc/systemd/system/stalld.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-09-21 11:14:51 UTC; 4min 50s ago
  Process: 3313452 ExecStopPost=/usr/local/bin/throttlectl.sh on (code=exited, status=0/SUCCESS)
  Process: 3314494 ExecStartPre=/usr/local/bin/throttlectl.sh off (code=exited, status=0/SUCCESS)
 Main PID: 3314500 (stalld)
    Tasks: 1 (limit: 402810)
   Memory: 532.0K
      CPU: 110ms
   CGroup: /system.slice/stalld.service
           └─3314500 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid

Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched/features doesn't exist
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched_features exists
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched/debug doesn't exist
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /proc/sched_debug exists
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: boosted pid 0 using SCHED_DEADLINE
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: using SCHED_DEADLINE for boosting
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: initial config_buffer_size set to 1966080
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: detected new task format
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: single threaded mode


while on OCP 4.11 with stalld-1.17-1.el8_6.x86_64

[root@sno core]# systemctl status stalld
● stalld.service - Stall Monitor
   Loaded: loaded (/usr/lib/systemd/system/stalld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2022-09-21 11:18:14 UTC; 15s ago
  Process: 184942 ExecStopPost=/usr/bin/throttlectl on (code=exited, status=0/SUCCESS)
  Process: 184940 ExecStart=/usr/bin/stalld --systemd $CLIST $AGGR $BP $BR $BD $THRESH $LOGGING $FG $PF $IT $IP (code=exited, status=1/FAILURE)
  Process: 184934 ExecStartPre=/usr/bin/throttlectl off (code=exited, status=0/SUCCESS)
 Main PID: 184940 (code=exited, status=1/FAILURE)
      CPU: 23ms
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: /sys/kernel/debug/sched/features doesn't exist
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: /sys/kernel/debug/sched_features exists
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Main process exited, code=exited, status=1/FAILURE
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: stalld can't enable HRTICK. stalld cannot run in this mode. Exiting..
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Failed with result 'exit-code'.

Comment 11 Marius Cornea 2022-09-21 12:02:29 UTC
Looks like this check was introduced in v1.17 by https://gitlab.com/rt-linux-tools/stalld/-/commit/27922ea36bbcf853c078dc3942610072231b7ea3

Comment 12 Daniel Bristot de Oliveira 2022-09-21 12:34:44 UTC
So, to explain the situation.

To be able to limit the "interference" (noise) that the boosted thread adds to the "busy-loop" thread, we use SCHED_DEADLINE.

The granularity of the SCHED_DEADLINE throttling mechanism is 1 ms by default. However, by setting HRTICK, we reduce this granularity to the microseconds range by using a high-resolution timer.
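
To make the granularity point concrete, here is a rough arithmetic sketch. It assumes the -r and -p values on the stalld command line shown in comment 10 are in nanoseconds, and it uses a deliberately simplified model in which a runtime budget is only enforced at multiples of the throttling granularity. The function name and the 1 us HRTICK figure are illustrative assumptions, not taken from stalld or the kernel.

```python
# Simplified model (an assumption, not stalld or kernel code): a runtime
# budget is only enforced at multiples of the throttling granularity, so the
# boosted task may run until the next enforcement point after its budget ends.

def worst_case_runtime_ns(budget_ns: int, granularity_ns: int) -> int:
    """Round the budget up to the next multiple of the enforcement granularity."""
    if budget_ns <= 0 or granularity_ns <= 0:
        raise ValueError("budget and granularity must be positive")
    steps = -(-budget_ns // granularity_ns)  # ceiling division
    return steps * granularity_ns


BOOST_RUNTIME_NS = 10_000        # -r 10000 (assumed to be nanoseconds)
BOOST_PERIOD_NS = 1_000_000_000  # -p 1000000000 (assumed to be nanoseconds)
TICK_NS = 1_000_000              # 1 ms default granularity
HRTICK_NS = 1_000                # assumed 1 us granularity, for illustration only

without_hrtick = worst_case_runtime_ns(BOOST_RUNTIME_NS, TICK_NS)
with_hrtick = worst_case_runtime_ns(BOOST_RUNTIME_NS, HRTICK_NS)

print(f"budget per period:      {BOOST_RUNTIME_NS} ns of {BOOST_PERIOD_NS} ns")
print(f"worst case, 1 ms tick:  {without_hrtick} ns")
print(f"worst case, with HRTICK: {with_hrtick} ns")
```

Under this model a 10 us budget can stretch to a full millisecond without HRTICK, which is the kind of extra noise for the busy-loop thread that the throttling is meant to avoid.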

Before 1.17, stalld did not check for errors when setting HRTICK, so the daemon was operating in a not-so-precise way. These error messages were simply being ignored.

Now stalld checks for this error and fails if the correct behavior cannot be achieved.

We need to think better about an upstream solution for this inconvenience. Ignoring this error *is not* the right way to go.

For a downstream solution in case of an emergency, the options I see are:

1) Find a setup that allows stalld to work properly (having permission to write to /sys/kernel/debug/sched[_|/]features)
2) Do a patch in the .rpm to ignore this error.
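
The strict behavior described above can be sketched as follows. This is a minimal Python illustration, not stalld's actual C code: the function and exception names are hypothetical, and the two paths are the ones named in the log messages. It tries each candidate file in order, writes HRTICK to the first one that exists, and raises instead of continuing when the write fails.

```python
import os

# Candidate locations, in the order the stalld log messages probe them.
SCHED_FEATURES_PATHS = (
    "/sys/kernel/debug/sched/features",
    "/sys/kernel/debug/sched_features",
)


class HrtickError(RuntimeError):
    """Raised when HRTICK cannot be enabled (hypothetical name)."""


def enable_hrtick(paths=SCHED_FEATURES_PATHS) -> str:
    """Write HRTICK to the first existing features file and return its path."""
    for path in paths:
        if not os.path.exists(path):
            continue
        try:
            with open(path, "w") as features:
                features.write("HRTICK")
        except OSError as exc:
            # Strict mode: report the failure instead of silently continuing.
            raise HrtickError(
                f"could not open {path} to set HRTICK: {exc.strerror}"
            ) from exc
        return path
    raise HrtickError("no sched features file found")
```

A caller that wants the pre-1.17 behavior would catch HrtickError and carry on. A caller that wants the 1.17 behavior lets it propagate and exits non-zero, which is what systemd reports as status=1/FAILURE.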

Comment 13 John Kacur 2022-09-21 13:41:02 UTC
In my opinion stalld is doing the right thing by failing if the HRTICK is not available.
Otherwise the user is under the impression that everything is working correctly, which could lead to a very difficult debugging problem
if they then notice that they are not getting the expected performance.

I think the downstream solution is the first one that Daniel lists: OpenShift needs to find a way to allow stalld to read and write to the sys features file.
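
When looking for such a setup, it may help to see exactly which error the kernel returns when the features file is opened for writing. The sketch below is a hypothetical diagnostic, not part of stalld. The hint texts are assumptions about likely causes, and the link between EPERM and kernel lockdown under Secure Boot in particular is a guess based on this report being reproducible only with SecureBoot enabled.

```python
import errno
import os


def probe_write_access(path: str) -> str:
    """Return "ok" if path opens for writing, else the symbolic errno name."""
    try:
        fd = os.open(path, os.O_WRONLY)  # no O_TRUNC or O_CREAT: probe only
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


# Assumed interpretations, to be confirmed on the affected node.
HINTS = {
    "ok": "writable; stalld should be able to set HRTICK",
    "ENOENT": "file missing; debugfs may not be mounted, or the kernel uses the other path",
    "EACCES": "denied by file mode or policy; check ownership, mode, and SELinux",
    "EPERM": "refused outright; kernel lockdown under Secure Boot is one possible cause",
}

if __name__ == "__main__":
    for candidate in ("/sys/kernel/debug/sched/features",
                      "/sys/kernel/debug/sched_features"):
        result = probe_write_access(candidate)
        print(f"{candidate}: {result} ({HINTS.get(result, 'unrecognized error')})")
```

The log in this report shows "Operation not permitted", which is the message for EPERM rather than EACCES, so changing file modes alone may not be enough.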

Comment 15 Daniel Bristot de Oliveira 2022-10-14 09:42:16 UTC
Upstream release containing the fix:

https://gitlab.com/rt-linux-tools/stalld/-/releases/v1.17.1

Comment 16 John Kacur 2022-10-18 13:59:26 UTC
This bz was requested for rhel-8.6z. Could the customer tell us which zstream releases they are requesting this for?