Bug 1972701

Summary:	Stalld not running as fifo
Product:	OpenShift Container Platform	Reporter:	browsell
Component:	Performance Addon Operator	Assignee:	Martin Sivák <msivak>
Status:	CLOSED ERRATA	QA Contact:	Gowrishankar Rajaiyan <grajaiya>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.8	CC:	aos-bugs, fsimonce, grajaiya, keyoung, msivak, shajmakh
Target Milestone:	---
Target Release:	4.8.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Last Closed:	2022-08-26 14:52:10 UTC	Type:	Bug
Bug Depends On:	1973237
Bug Blocks:	1970940

Comment 1 Martin Sivák 2021-06-16 12:55:18 UTC

That is weird. I see NTO should be using this to start stalld:

ExecStart=/usr/bin/chrt -f 10 /usr/local/bin/stalld --systemd $CLIST $AGGR $BP $BR $BD $THRESH $LOGGING $FG $PF

And that should be setting FIFO:10.

Can you double check the stalld systemd unit that is present on the node?

Comment 3 Martin Sivák 2021-06-16 13:37:38 UTC

Oh.. I think I know what happened. RHCOS 8.4 includes stalld and systemd picked up the unit shipped with it instead of the unit that NTO creates.

Jirka: Where do you install the systemd unit that you install via NTO?
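
For anyone following the question above, here is a minimal sketch of how a unit name resolves to a single file when several copies exist. The directory order is the commonly documented default for system units and is an assumption here; `systemd-analyze unit-paths` prints the real list on a given node. The helper name `winning_unit` is hypothetical, and the sketch ignores drop-in directories.

```python
from pathlib import Path

# Assumed default precedence for system units, highest priority first.
DEFAULT_SEARCH_PATH = (
    "/etc/systemd/system",
    "/run/systemd/system",
    "/usr/local/lib/systemd/system",
    "/usr/lib/systemd/system",
)


def winning_unit(name, search_path=DEFAULT_SEARCH_PATH):
    """Return the first existing unit file for `name`, or None if absent."""
    for directory in search_path:
        candidate = Path(directory) / name
        if candidate.exists():
            # The first match shadows any copy found later in the path.
            return candidate
    return None


if __name__ == "__main__":
    print(winning_unit("stalld.service"))
```

With this ordering, a copy under /etc would shadow the one shipped in /usr/lib, so the OS-shipped unit is only used when no higher-priority copy is present.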

Comment 4 Jiří Mencák 2021-06-16 14:07:32 UTC

(In reply to Martin Sivák from comment #1)
> That is weird. I see NTO should be using this to start stalld:
> 
> ExecStart=/usr/bin/chrt -f 10 /usr/local/bin/stalld --systemd $CLIST $AGGR
> $BP $BR $BD $THRESH $LOGGING $FG $PF

Where do you see this, Martine?
sh-4.4# grep ExecStart= /usr/lib/systemd/system/stalld.service 
ExecStart=/usr/bin/stalld --systemd $CLIST $AGGR $BP $BR $BD $THRESH $LOGGING $FG $PF

NTO no longer ships the stalld unit files as of:
https://github.com/openshift/cluster-node-tuning-operator/pull/226

The CoreOS-shipped stalld.service file is now used and that one seems to be missing the "/usr/bin/chrt -f 10" as pointed out by Brent.
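
A rough way to spot this difference automatically is to parse the ExecStart line and check for the chrt wrapper. The sketch below is illustrative only: the function name `fifo_priority` is hypothetical, and it assumes the priority directly follows the `-f` (or `--fifo`) flag, as it does in the lines quoted in this bug.

```python
import shlex


def fifo_priority(exec_start):
    """Return the FIFO priority requested through a chrt wrapper, or None."""
    # Accept either the bare command or a full "ExecStart=..." unit line.
    _, _, command = exec_start.rpartition("ExecStart=")
    argv = shlex.split(command)
    if not argv or not argv[0].endswith("chrt"):
        return None  # the service binary is started directly, without chrt
    for flag in ("-f", "--fifo"):
        if flag in argv:
            idx = argv.index(flag)
            # Assumption: the priority is the token right after the flag.
            if idx + 1 < len(argv) and argv[idx + 1].isdigit():
                return int(argv[idx + 1])
    return None


if __name__ == "__main__":
    nto = "ExecStart=/usr/bin/chrt -f 10 /usr/local/bin/stalld --systemd $CLIST"
    coreos = "ExecStart=/usr/bin/stalld --systemd $CLIST"
    print(fifo_priority(nto), fifo_priority(coreos))
```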

Comment 5 Jiří Mencák 2021-06-16 14:09:55 UTC

(In reply to Martin Sivák from comment #3)
> Jirka: Where do you install the systemd unit that you install via NTO?

Again, NTO no longer installs any systemd stalld unit files; it relies on the CoreOS-provided ones.

Comment 6 Martin Sivák 2021-06-16 14:25:47 UTC

I found it here: https://github.com/openshift/cluster-node-tuning-operator/blob/master/pkg/tuned/host_payload.go#L80

Comment 12 Jiří Mencák 2021-06-18 06:44:15 UTC

Fixed in 4.9.0-0.nightly-2021-06-18-002931 and above.  The next OCP 4.8 nightly should also have the fix as
https://github.com/openshift/cluster-node-tuning-operator/pull/237 merged a while ago.

Comment 13 Martin Sivák 2021-06-18 08:50:07 UTC

Thanks Jirka!

Comment 14 Shereen Haj Makhoul 2021-06-24 08:57:03 UTC

Verifying the bug fix on:

oc version 
Client Version: 4.8.0-0.nightly-2021-06-22-192915
Server Version: 4.8.0-0.nightly-2021-06-22-192915
Kubernetes Version: v1.21.0-rc.0+120883f

oc get csv 
NAME                                DISPLAY                      VERSION   REPLACES   PHASE
performance-addon-operator.v4.8.0   Performance Addon Operator   4.8.0                Succeeded

Verify that stalld now runs as SCHED_FIFO:

ps -ef | grep stalld
root        7294       1  0 14:16 ?        00:00:00 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid


systemctl status stalld
# Write a pidfile
# ex: PF=--pidfile /run/stalld.pid
Environment=PF="--pidfile /run/stalld.pid"

ExecStartPre=/usr/local/bin/throttlectl.sh off
ExecStart=/usr/bin/chrt -f 10 /usr/local/bin/stalld --systemd $CLIST $AGGR $BP $BR $BD $THRESH $LOGGING >
ExecStopPost=/usr/local/bin/throttlectl.sh on
Restart=always
User=root

As shown above, the stalld binary from NTO is now used and is started under the FIFO scheduler (the chrt flag for FIFO is -f) with priority 10.

Comment 15 Shereen Haj Makhoul 2021-06-24 11:00:47 UTC

Following up on comment 14:

Retrieving the scheduling attributes of the stalld PID, we get:
chrt -ap 7294
pid 7294's current scheduling policy: SCHED_FIFO
pid 7294's current scheduling priority: 10

This confirms that the scheduling policy is SCHED_FIFO with priority 10.
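
The same check could be scripted with Python's standard `os` scheduling calls on Linux. This is only a sketch: the helper name `scheduling_attributes` is hypothetical, and unlike `chrt -a` it inspects a single thread ID rather than every thread of the process.

```python
import os

# Map the Linux policy constants exposed by the os module to readable names.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}


def scheduling_attributes(pid):
    """Return (policy_name, priority) for `pid`; 0 means the calling process."""
    # Strip the reset-on-fork flag, which the kernel may OR into the policy.
    policy = os.sched_getscheduler(pid) & ~os.SCHED_RESET_ON_FORK
    priority = os.sched_getparam(pid).sched_priority
    return POLICY_NAMES.get(policy, "unknown(%d)" % policy), priority


if __name__ == "__main__":
    # On the node, pass the stalld PID (7294 in the output above) instead of 0.
    print(scheduling_attributes(0))
```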

Comment 16 Shereen Haj Makhoul 2021-06-29 14:01:18 UTC

PR link: https://github.com/openshift-kni/performance-addon-operators/pull/674