Bug 2128472 - stalld service fails: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Summary: stalld service fails: could not open /sys/kernel/debug/sched_features to set ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: stalld
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Leah Leshchinsky
QA Contact:
URL:
Whiteboard:
Depends On: 2120800
Blocks:
Reported: 2022-09-20 17:26 UTC by Bill Zvonar
Modified: 2022-11-16 21:04 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-16 21:03:53 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-134447 0 None None None 2022-09-20 19:57:57 UTC

Description Bill Zvonar 2022-09-20 17:26:25 UTC
Description of problem:
stalld service fails: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted

Version-Release number of selected component (if applicable):
v1.17 (OCP 4.11.5)

How reproducible:
Always on nodes with SecureBoot enabled

Steps to Reproduce:
1. Deploy SNO with SecureBoot enabled
2. On the deployed node check stalld status

Actual results:
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: Starting Stall Monitor...
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: Started Stall Monitor.
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[36993]: /sys/kernel/debug/sched/features doesn't exist
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[36993]: /sys/kernel/debug/sched_features exists
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[36993]: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[36993]: stalld can't enable HRTICK. stalld cannot run in this mode. Exiting..
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 13:51:37 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Failed with result 'exit-code'.


Expected results:
stalld service runs without failures

Additional info:

Comment 1 Beth Uptagrafft 2022-09-20 17:41:21 UTC
I don't think stalld creates the /sys/kernel/debug/sched/features directory.  It is root-only access, at least on my RHEL 8.6 system. What kernel version is being used?   Can you confirm if that file exists on your system, and if so, what are the permissions?

Comment 2 Bill Zvonar 2022-09-20 17:44:46 UTC
@mcornea can you respond to Beth's question?

Comment 8 Marius Cornea 2022-09-21 10:30:08 UTC
(In reply to Beth Uptagrafft from comment #1)
> I don't think stalld creates the /sys/kernel/debug/sched/features directory.
> It is root-only access, at least on my RHEL 8.6 system. What kernel version
> is being used?   Can you confirm if that file exists on your system, and if
> so, what are the permissions?

[root@sno core]# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="411.86.202209140028-0"
VERSION_ID="4.11"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 411.86.202209140028-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.11/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.11"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.11"
OPENSHIFT_VERSION="4.11"
RHEL_VERSION="8.6"
OSTREE_VERSION="411.86.202209140028-0"

[root@sno core]# uname -r
4.18.0-372.26.1.rt7.183.el8_6.x86_64

[root@sno core]# ls -l /sys/kernel/debug/sched/features
ls: cannot access '/sys/kernel/debug/sched/features': No such file or directory
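
A minimal sketch of how both candidate paths could be checked in one pass, answering the existence and permission questions from comment #1. This is not stalld's code; the function name `probe_sched_features` and the `debugfs_root` parameter are hypothetical and exist only so the check can be pointed at a different directory.

```python
import errno
import os
import stat

# Both locations named in the stalld log lines in this report.
CANDIDATES = ("sched/features", "sched_features")


def probe_sched_features(debugfs_root="/sys/kernel/debug"):
    """For each candidate path, report whether it exists, its mode string,
    and the errno name (if any) raised when opening it read-write."""
    report = []
    for rel in CANDIDATES:
        path = os.path.join(debugfs_root, rel)
        entry = {"path": path, "exists": False, "mode": None, "open_error": None}
        try:
            entry["mode"] = stat.filemode(os.stat(path).st_mode)
            entry["exists"] = True
        except FileNotFoundError:
            # Path is simply absent, as with sched/features above.
            report.append(entry)
            continue
        except OSError as exc:
            # Some other failure while inspecting the path.
            entry["open_error"] = errno.errorcode.get(exc.errno, str(exc.errno))
            report.append(entry)
            continue
        try:
            # Opening read-write is what setting a feature would require.
            os.close(os.open(path, os.O_RDWR))
        except OSError as exc:
            entry["open_error"] = errno.errorcode.get(exc.errno, str(exc.errno))
        report.append(entry)
    return report
```

Calling `probe_sched_features()` as root on the affected node and printing each entry would show which file exists and whether opening it fails with an error such as EPERM.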

Comment 10 Marius Cornea 2022-09-21 11:21:42 UTC
FWIW I am seeing the same HRTICK error with an older stalld release (stalld-1.15-2.el8_4.x86_64) on OCP 4.10, but it does not prevent the service from starting:

[root@sno core]# systemctl status stalld
● stalld.service - Stall Monitor
   Loaded: loaded (/etc/systemd/system/stalld.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-09-21 11:14:51 UTC; 4min 50s ago
  Process: 3313452 ExecStopPost=/usr/local/bin/throttlectl.sh on (code=exited, status=0/SUCCESS)
  Process: 3314494 ExecStartPre=/usr/local/bin/throttlectl.sh off (code=exited, status=0/SUCCESS)
 Main PID: 3314500 (stalld)
    Tasks: 1 (limit: 402810)
   Memory: 532.0K
      CPU: 110ms
   CGroup: /system.slice/stalld.service
           └─3314500 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid

Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched/features doesn't exist
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched_features exists
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /sys/kernel/debug/sched/debug doesn't exist
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: /proc/sched_debug exists
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: boosted pid 0 using SCHED_DEADLINE
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: using SCHED_DEADLINE for boosting
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: initial config_buffer_size set to 1966080
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: detected new task format
Sep 21 11:14:51 sno.kni-qe-1.lab.eng.rdu2.redhat.com stalld[3314500]: single threaded mode


while on OCP 4.11 with stalld-1.17-1.el8_6.x86_64 the same error causes the service to fail:

[root@sno core]# systemctl status stalld
● stalld.service - Stall Monitor
   Loaded: loaded (/usr/lib/systemd/system/stalld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2022-09-21 11:18:14 UTC; 15s ago
  Process: 184942 ExecStopPost=/usr/bin/throttlectl on (code=exited, status=0/SUCCESS)
  Process: 184940 ExecStart=/usr/bin/stalld --systemd $CLIST $AGGR $BP $BR $BD $THRESH $LOGGING $FG $PF $IT $IP (code=exited, status=1/FAILURE)
  Process: 184934 ExecStartPre=/usr/bin/throttlectl off (code=exited, status=0/SUCCESS)
 Main PID: 184940 (code=exited, status=1/FAILURE)
      CPU: 23ms
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: /sys/kernel/debug/sched/features doesn't exist
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: /sys/kernel/debug/sched_features exists
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Main process exited, code=exited, status=1/FAILURE
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: could not open /sys/kernel/debug/sched_features to set HRTICK: Operation not permitted
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com stalld[38109]: stalld can't enable HRTICK. stalld cannot run in this mode. Exiting..
Sep 21 09:57:24 sno.kni-qe-12.lab.eng.rdu2.redhat.com systemd[1]: stalld.service: Failed with result 'exit-code'.

Comment 11 Marius Cornea 2022-09-21 12:02:29 UTC
Looks like this check was introduced in v1.17 by https://gitlab.com/rt-linux-tools/stalld/-/commit/27922ea36bbcf853c078dc3942610072231b7ea3

Comment 12 Daniel Bristot de Oliveira 2022-09-21 12:34:44 UTC
So, to explain the situation.

To be able to limit the "interference" (noise) that the boosted thread adds to the "busy-loop" thread, we use SCHED_DEADLINE.

The granularity of the SCHED_DEADLINE throttling mechanism is 1 ms by default. However, by setting HRTICK, we reduce this granularity to the microseconds range by using a high-resolution timer.
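
To put rough numbers on that granularity, here is a simplified model, not the kernel's actual accounting. It assumes a task is only throttled at the next multiple of the enforcement granularity, that the `-r 10000` value from the stalld command line in comment #10 is a runtime in nanoseconds, and that "microseconds range" means about 1 µs. The function name is hypothetical.

```python
def worst_case_overrun_ns(runtime_ns, granularity_ns):
    """Simplified model: the budget is only enforced at multiples of
    granularity_ns, so the task may run until the next multiple."""
    # Integer ceiling division, avoiding floating point.
    ticks = -(-runtime_ns // granularity_ns)
    enforced_ns = ticks * granularity_ns
    return enforced_ns - runtime_ns


RUNTIME_NS = 10_000          # assumed meaning of "-r 10000" (10 us)
TICK_NS = 1_000_000          # 1 ms default granularity, as stated above
HRTICK_NS = 1_000            # assumed ~1 us granularity with HRTICK

overrun_without_hrtick = worst_case_overrun_ns(RUNTIME_NS, TICK_NS)
overrun_with_hrtick = worst_case_overrun_ns(RUNTIME_NS, HRTICK_NS)
```

Under these assumptions, a 10 µs budget could stretch to a full millisecond without HRTICK (an overrun of 990 µs), while with microsecond granularity the same budget is enforced with no overrun in this model.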

Before 1.17, stalld did not check for errors when setting HRTICK, so the daemon could end up operating less precisely than intended. These error messages were simply being ignored.

Now stalld checks for this error and fails if the correct behavior cannot be achieved.

We need to think better about an upstream solution for this inconvenience. Ignoring this error *is not* the right way to go.

For a downstream solution in case of an emergency, the options I see are:

1) Find a setup that allows stalld to work properly (having permission to write to /sys/kernel/debug/sched_features or /sys/kernel/debug/sched/features)
2) Patch the .rpm to ignore this error.
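
The difference between the two behaviors can be sketched as follows. This is an illustration only, not stalld's implementation (stalld is written in C); `enable_hrtick`, `HrtickError`, the `strict` flag, and the `debugfs_root` parameter are hypothetical names. The only kernel interface assumed is that writing the string HRTICK to the sched features file enables the feature.

```python
import os

# Candidate files, in the order the stalld log messages check them.
CANDIDATES = ("sched/features", "sched_features")


class HrtickError(RuntimeError):
    """Raised in strict mode when HRTICK cannot be enabled."""


def enable_hrtick(debugfs_root="/sys/kernel/debug", strict=True):
    """Try to enable HRTICK. Returns True on success.

    strict=True models the 1.17 behavior described in this comment (fail).
    strict=False models the pre-1.17 behavior (carry on without HRTICK).
    """
    last_error = None
    for rel in CANDIDATES:
        path = os.path.join(debugfs_root, rel)
        if not os.path.exists(path):
            continue
        try:
            with open(path) as f:
                # Features are whitespace-separated; disabled ones carry a NO_ prefix.
                if "HRTICK" in f.read().split():
                    return True
            with open(path, "w") as f:
                f.write("HRTICK")
            return True
        except OSError as exc:
            # For example EPERM, as in the log lines in this report.
            last_error = exc
    if strict:
        raise HrtickError(f"cannot enable HRTICK: {last_error or 'no features file found'}")
    return False
```

With `strict=False` the caller gets `False` back and keeps running with coarser throttling, which is the silent degradation that the 1.17 check was meant to surface.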

Comment 13 John Kacur 2022-09-21 13:41:02 UTC
In my opinion stalld is doing the right thing by failing if the HRTICK is not available.
Otherwise the user is under the impression that everything is working correctly, which could lead to a very difficult debugging problem
if they then notice that they are not getting the expected performance.

I think the downstream solution is the first one that Daniel lists: OpenShift needs to find a way to allow stalld to read and write to the sched features file.

Comment 15 Daniel Bristot de Oliveira 2022-10-14 09:42:16 UTC
Upstream release containing the fix:

https://gitlab.com/rt-linux-tools/stalld/-/releases/v1.17.1

Comment 16 John Kacur 2022-10-18 13:59:26 UTC
This bz was requested for rhel-8.6z. Could the customer tell us which zstream releases they are requesting this for?

