Description of problem:
The stalld service runs at a higher scheduler priority than the ksoftirqd and rcu{b,c} kernel threads, which can freeze the system when load-intensive processes such as oslat are running.

Version-Release number of selected component (if applicable):
Client Version: 4.6.0-0.nightly-2020-07-25-091217
Server Version: 4.7.0-0.ci-2020-12-08-050547
Kubernetes Version: v1.19.2-1007+ad738ba548b6d6-dirty

How reproducible:
Always

Steps to Reproduce:
1. Run the oslat tool on the node to create pressure on the real-time kernel environment.

Actual results:
The host freezes until oslat finishes running.

Expected results:
The host should continue to work as expected.

Additional info:
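One way to confirm the priority inversion described above is to compare the real-time scheduling attributes of stalld against the kernel threads it can starve. This is a sketch using standard procps `ps` output fields; thread names may vary slightly by kernel version:

```shell
# Show scheduling class and real-time priority of stalld and the kernel
# threads affected by it. CLS: TS=SCHED_OTHER, FF=SCHED_FIFO, RR=SCHED_RR;
# RTPRIO ranges 1-99, higher preempts lower. If stalld's RTPRIO is above
# that of ksoftirqd/rcuc, those threads can be starved under load.
ps -eLo tid,cls,rtprio,comm --sort=-rtprio | \
    awk 'NR == 1 || $4 ~ /^(stalld|ksoftirqd|rcuc|rcub)/'
```

On an affected node, stalld should appear with a higher RTPRIO than the ksoftirqd and rcuc/rcub entries; on a node where no such threads run, the command prints only the header.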
Version:
--------
[root@dell-r730-009 cnf-internal-deploy]# oc version
Client Version: 4.7.0-0.nightly-2020-12-04-013308
Server Version: 4.7.0-0.nightly-2020-12-21-131655
Kubernetes Version: v1.20.0+87544c5

Version of Performance Operator:
---------------------------------
rh-osbs/openshift4-performance-addon-operator-bundle-registry-container-rhel8:v4.7.0-285
Index image v4.7: registry-proxy.engineering.redhat.com/rh-osbs/iib:34633

[root@dell-r730-009 cnf-internal-deploy]# oc logs pods/performance-operator-767ddb449-lknqr -n openshift-performance-addon-operator
I1228 07:50:49.908341       1 main.go:72] Operator Version:
I1228 07:50:49.908453       1 main.go:73] Git Commit:
I1228 07:50:49.908461       1 main.go:74] Build Date: 2020-12-22T13:00:05+0000
I1228 07:50:49.908475       1 main.go:75] Go Version: go1.13.15
I1228 07:50:49.908481       1 main.go:76] Go OS/Arch: linux/amd64
I1228 07:50:50.962708       1 request.go:621] Throttling request took 1.036451752s, request: GET:https://172.30.0.1:443/apis/tuned.openshift.io/v1?timeout=32s

Deployed the performance profile and checked the tuned profile pod; it contains the relevant information:

[main]
summary=Openshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)
include=openshift-node,cpu-partitioning

[variables]
isolated_cores=2-15
not_isolated_cores_expanded=${f:cpulist_invert:${isolated_cores_expanded}}

[cpu]
force_latency=cstate.id:1|3 # latency-performance (override)
governor=performance # latency-performance
energy_perf_bias=performance # latency-performance
min_perf_pct=100 # latency-performance

[service]
service.stalld=start,enable

[vm]
transparent_hugepages=never # network-latency

[scheduler]
group.ksoftirqd=0:f:11:*:ksoftirqd.*
group.rcuc=0:f:11:*:rcuc.*

Marking it verified.
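For reference, the `group.*` values in the tuned `[scheduler]` section above use the scheduler plugin's colon-separated format, `rule_prio:policy:rt_prio:affinity:regex` (field naming per the tuned scheduler plugin; the decoder below is only an illustrative sketch, not part of tuned):

```shell
# Decode a tuned scheduler group value of the form
# <rule_prio>:<policy>:<rt_prio>:<affinity>:<regex>
# Policy letters: f=SCHED_FIFO, r=SCHED_RR, o=SCHED_OTHER.
decode_group() {
    IFS=: read -r rule policy prio affinity regex <<EOF
$1
EOF
    case "$policy" in
        f) policy=SCHED_FIFO ;;
        r) policy=SCHED_RR ;;
        o) policy=SCHED_OTHER ;;
    esac
    echo "rule_prio=$rule policy=$policy rt_prio=$prio affinity=$affinity regex=$regex"
}

decode_group "0:f:11:*:ksoftirqd.*"
# prints: rule_prio=0 policy=SCHED_FIFO rt_prio=11 affinity=* regex=ksoftirqd.*
```

So both lines pin threads matching `ksoftirqd.*` and `rcuc.*` to SCHED_FIFO at real-time priority 11, which is the fix that keeps them above stalld and prevents the freeze described in this bug.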
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633