1949027 – [4.6.z] Re-enable stalld

Summary: [4.6.z] Re-enable stalld

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Performance Addon Operator
Sub Component:
Version:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Martin Sivák
QA Contact:	Gowrishankar Rajaiyan
Docs Contact:
URL:
Whiteboard:
Depends On:	1949739
Blocks:

Reported:	2021-04-13 09:31 UTC by elevin
Modified:	2021-07-09 05:53 UTC
CC List:	4 users
Fixed In Version:	performance-addon-operator.v4.6.4-5
Doc Type:	Bug Fix
Doc Text:	Cause: The stalld service was disabled by default for performance profiles because of a bug in the HRTICK kernel subsystem that caused the system to hang.
Consequence: Stalld is required to prevent starvation of operating system threads. With it disabled by default, achievable latency suffers.
Fix: Kernel versions 4.18.0-193.49.1.el8_2.x86_64 (non-real-time) and 4.18.0-193.51.1.rt13.101.el8_2.x86_64 (real-time) contain a fix for the HRTICK bug, which allows stalld to be re-enabled by default for all performance profiles.
Result: With stalld re-enabled by default, tuning the system via the Performance Addon Operator again delivers the required latency results.
Important note: The kernel version on every node must be 4.18.0-193.49.1.el8_2.x86_64 (non-real-time) or 4.18.0-193.51.1.rt13.101.el8_2.x86_64 (real-time), or higher. This means that the underlying RHCOS version must be updated BEFORE PAO itself.
Clone Of:
Environment:
Last Closed:	2021-06-22 11:22:12 UTC
Target Upstream Version:
Embargoed:
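
The kernel requirement stated in the Doc Text above can be checked on a node before PAO is upgraded. The following is a minimal sketch, not part of the fix itself. The function names `kernel_at_least` and `min_kernel_for` are hypothetical, the comparison relies on GNU `sort -V` (which approximates RPM version ordering but is not identical to it), and treating any release string that contains `.rt` as a real-time kernel is an assumption based on the version strings quoted above.

```shell
# Minimum kernel versions, copied from the Doc Text field.
MIN_KERNEL="4.18.0-193.49.1.el8_2.x86_64"
MIN_RT_KERNEL="4.18.0-193.51.1.rt13.101.el8_2.x86_64"

# kernel_at_least CURRENT MINIMUM
# Succeeds if CURRENT sorts at or after MINIMUM under `sort -V` ordering.
kernel_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# min_kernel_for RELEASE
# Prints the real-time minimum if RELEASE looks like an RT kernel
# (assumption: RT release strings contain ".rt"), else the standard minimum.
min_kernel_for() {
    case "$1" in
        *.rt*) printf '%s\n' "$MIN_RT_KERNEL" ;;
        *)     printf '%s\n' "$MIN_KERNEL" ;;
    esac
}

# Example usage on a node (not executed here):
#   current="$(uname -r)"
#   if kernel_at_least "$current" "$(min_kernel_for "$current")"; then
#       echo "kernel is new enough for stalld"
#   else
#       echo "update RHCOS before upgrading PAO"
#   fi
```

Because the two helpers only compare strings, they can be exercised without access to a cluster. Only the commented usage depends on `uname -r` being run on the node itself.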

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift-kni performance-addon-operators pull 621	None	open	[release-4.6] Bug 1949027: Re-enable stalld	2021-04-27 13:24:23 UTC
Github	openshift-kni performance-addon-operators pull 676	None	open	e2e: Reactivate stalld test	2021-07-09 05:52:58 UTC
Red Hat Product Errata	RHBA-2021:2482	None	None	None	2021-06-22 11:22:18 UTC

Description elevin 2021-04-13 09:31:42 UTC

Description of problem:

Performance test (cnf-tests-container) "[It] [test_id:35363][crit:high][vendor:cnf-qe][level:acceptance] stalld daemon is running on the host" fails on OCP v4.6.Z.

 

Version-Release number of selected component (if applicable):

v4.6.24
cnf-tests-container-v4.6.3-1

How reproducible:
Always



Actual results:

[rfe_id:27368][performance]
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:45
  Pre boot tuning adjusted by tuned 
  /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:118
    [test_id:35363][crit:high][vendor:cnf-qe][level:acceptance] stalld daemon is running on the host [It]
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:179
    Unexpected error:
        <*exec.ExitError | 0xc0009578a0>: {
            ProcessState: {
                pid: 248,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 199418},
                    Stime: {Sec: 0, Usec: 46329},
                    Maxrss: 54312,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 15454,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 183,
                    Nivcsw: 29,
                },
            },
            Stderr: [99, 111, 109, 109, 97, 110, 100, 32, 116, 101, 114, 109, 105, 110, 97, 116, 101, 100, 32, 119, 105, 116, 104, 32, 101, 120, 105, 116, 32, 99, 111, 100, 101, 32, 49, 10],
        }
        exit status 1
    occurred
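
The `Stderr` field in the dump above is printed as a list of decimal byte values rather than as text. A small helper can turn it back into a readable message. This is a sketch only, and `decode_bytes` is a hypothetical name, not part of the test suite.

```shell
# decode_bytes LIST
# LIST is a string of decimal byte values, optionally wrapped in [ ] and
# separated by commas and/or spaces. Prints the corresponding characters.
decode_bytes() {
    for b in $(printf '%s' "$*" | tr -d '[]' | tr ',' ' '); do
        # %b interprets \0NNN as an octal escape, so convert each value to octal.
        printf '%b' "\\0$(printf '%03o' "$b")"
    done
}

# The Stderr value from the failure above:
decode_bytes "[99, 111, 109, 109, 97, 110, 100, 32, 116, 101, 114, 109, 105, 110, 97, 116, 101, 100, 32, 119, 105, 116, 104, 32, 101, 120, 105, 116, 32, 99, 111, 100, 101, 32, 49, 10]"
# prints: command terminated with exit code 1
```

The decoded message only restates the exit status. It does not say which command failed, so the node itself has to be inspected to see whether stalld is running.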

Comment 1 Martin Sivák 2021-04-13 09:55:43 UTC

The bug is actually in the test suite, which does not know that stalld was disabled intentionally because of kernel bugs. This will resolve itself once we re-enable stalld, which is going to happen soon.

Comment 2 Martin Sivák 2021-04-26 13:32:35 UTC

RHEL 8.2.z released with the fixed kernel: kernel-4.18.0-193.49.1.el8_2

It seems we are waiting for an OCP release that uses RHCOS 46.82.202104151840-0 (https://releases-rhcos-art.cloud.privileged.psi.redhat.com/contents.html?stream=releases%2Frhcos-4.6&release=46.82.202104151840-0) or newer.

Comment 5 Shereen Haj Makhoul 2021-06-18 13:51:23 UTC

Verify the bug fix 
===================

OCP version
===========
[root@ocp-edge41 ]# oc version
Client Version: 4.6.34
Server Version: 4.6.34
Kubernetes Version: v1.19.0+c3e2e69
[root@ocp-edge41 performance]# 

PAO:
====
performance-addon-operator.v4.6.4-5


Steps to verify:
================
- Log in to the worker node:
ssh core@<node ip>

- Then check whether stalld is running:
[core@ocp45-worker-0 ~]$ pidof stalld
11481
[core@ocp45-worker-0 ~]$ systemctl status stalld
● stalld.service - Stall Monitor
   Loaded: loaded (/etc/systemd/system/stalld.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-06-18 12:51:43 UTC; 6min ago
 Main PID: 11481 (stalld)
    Tasks: 1 (limit: 205249)
   Memory: 4.4M
      CPU: 10.101s
   CGroup: /system.slice/stalld.service
           └─11481 /usr/local/bin/stalld -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run>

- Also check the tuned profile:
oc get tuned -A
oc project openshift-cluster-node-tuning-operator
oc get tuned
oc describe tuned/openshift-node-performance-performance

The last command produced the following output:

.
.
.
[service]
service.stalld=start,enable
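
The manual check of the tuned profile above could be scripted roughly as follows. This is a sketch under assumptions: `stalld_enabled_in_profile` is a hypothetical helper, and it looks only for a `service.stalld=` line that lists both `start` and `enable`, as in the output shown. It does not validate the rest of the profile.

```shell
# stalld_enabled_in_profile
# Reads profile text (for example `oc describe tuned/...` output) on stdin.
# Succeeds only if a service.stalld line lists both "start" and "enable".
stalld_enabled_in_profile() {
    awk -F= '
        /^[[:space:]]*service\.stalld[[:space:]]*=/ {
            n = split($2, opts, ",")
            for (i = 1; i <= n; i++) {
                gsub(/[[:space:]]/, "", opts[i])   # tolerate stray whitespace
                seen[opts[i]] = 1
            }
        }
        END { exit !(("start" in seen) && ("enable" in seen)) }
    '
}

# Example usage against a cluster (not executed here):
#   oc describe tuned/openshift-node-performance-performance \
#       -n openshift-cluster-node-tuning-operator \
#     | stalld_enabled_in_profile && echo "stalld enabled in profile"
```

Passing the namespace with `-n` avoids the `oc project` switch used in the manual steps. The profile name is the one from this verification and will differ for other performance profiles.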

Comment 7 errata-xmlrpc 2021-06-22 11:22:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.35 low-latency extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2482
