Bug 1949027

Summary: [4.6.z] Re-enable stalld
Product: OpenShift Container Platform
Reporter: elevin
Component: Performance Addon Operator
Assignee: Martin Sivák <msivak>
Status: CLOSED ERRATA
QA Contact: Gowrishankar Rajaiyan <grajaiya>
Severity: medium
Docs Contact:
Priority: high
Version: 4.6.z
CC: aos-bugs, mniranja, shajmakh, yquinn
Target Milestone: ---
Target Release: 4.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: performance-addon-operator.v4.6.4-5
Doc Type: Bug Fix
Doc Text:
Cause: The stalld service was disabled by default for performance profiles due to a bug in the HRTICK kernel subsystem that caused the system to hang.
Consequence: stalld is required to prevent starvation of operating system threads. With it disabled by default, achievable latency suffers.
Fix: Kernel versions 4.18.0-193.49.1.el8_2.x86_64 (non-real-time) and 4.18.0-193.51.1.rt13.101.el8_2.x86_64 (real-time) contain a fix for the HRTICK bug, which allows stalld to be re-enabled by default for all performance profiles.
Result: With stalld re-enabled by default, systems tuned through the Performance Addon Operator again achieve the required latency results.
Important Note: All nodes must run kernel version 4.18.0-193.49.1.el8_2.x86_64 (non-real-time) or 4.18.0-193.51.1.rt13.101.el8_2.x86_64 (real-time), or higher. This means that the underlying RHCOS version must be updated BEFORE the Performance Addon Operator (PAO) itself.
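The kernel requirement stated in the Doc Text can be checked mechanically before upgrading PAO. The following is a minimal sketch and not part of the operator: the function names (`numeric_prefix`, `is_realtime`, `kernel_has_hrtick_fix`) are hypothetical, and the comparison only looks at the leading numeric fields of a `uname -r` style string, which is a simplification of full RPM version comparison (the `rtNN.NNN` suffix is ignored).

```python
import re

# Minimum kernel releases taken from the Doc Text of this bug.
MIN_STANDARD = "4.18.0-193.49.1.el8_2.x86_64"
MIN_REALTIME = "4.18.0-193.51.1.rt13.101.el8_2.x86_64"


def numeric_prefix(release):
    """Return the leading integer fields of a kernel release as a tuple.

    Fields are separated by '.' or '-'; parsing stops at the first
    non-numeric field (for example 'rt13' or 'el8_2').
    """
    fields = []
    for field in re.split(r"[.-]", release):
        if not field.isdigit():
            break
        fields.append(int(field))
    return tuple(fields)


def is_realtime(release):
    """Assumption: real-time kernels carry an '.rtNN.' marker in the release."""
    return re.search(r"\.rt\d+\.", release) is not None


def kernel_has_hrtick_fix(release):
    """Compare a release string against the matching minimum from the Doc Text."""
    minimum = MIN_REALTIME if is_realtime(release) else MIN_STANDARD
    return numeric_prefix(release) >= numeric_prefix(minimum)


if __name__ == "__main__":
    for release in (MIN_STANDARD, MIN_REALTIME, "4.18.0-193.47.1.el8_2.x86_64"):
        print(release, kernel_has_hrtick_fix(release))
```

On a node, the input would typically come from `uname -r`; the third release string in the loop is an invented older kernel used only to show a failing case.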
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-22 11:22:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1949739    
Bug Blocks:    

Description elevin 2021-04-13 09:31:42 UTC
Description of problem:

Performance test (cnf-tests-container) "[It] [test_id:35363][crit:high][vendor:cnf-qe][level:acceptance] stalld daemon is running on the host" fails on OCP v4.6.Z.

 

Version-Release number of selected component (if applicable):

v4.6.24
cnf-tests-container-v4.6.3-1

How reproducible:
Always



Actual results:

[rfe_id:27368][performance]
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:45
  Pre boot tuning adjusted by tuned 
  /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:118
    [test_id:35363][crit:high][vendor:cnf-qe][level:acceptance] stalld daemon is running on the host [It]
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:179
    Unexpected error:
        <*exec.ExitError | 0xc0009578a0>: {
            ProcessState: {
                pid: 248,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 199418},
                    Stime: {Sec: 0, Usec: 46329},
                    Maxrss: 54312,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 15454,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 183,
                    Nivcsw: 29,
                },
            },
            Stderr: [99, 111, 109, 109, 97, 110, 100, 32, 116, 101, 114, 109, 105, 110, 97, 116, 101, 100, 32, 119, 105, 116, 104, 32, 101, 120, 105, 116, 32, 99, 111, 100, 101, 32, 49, 10],
        }
        exit status 1
    occurred
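
The `Stderr` field in the failure output above is printed as a list of byte values, which hides the actual message. The short sketch below decodes that list and also interprets `status: 256`, assuming it is a raw Linux wait status in which the exit code sits in bits 8 to 15. The variable names are illustrative only.

```python
# Byte values copied verbatim from the Stderr field of the test failure.
stderr_bytes = [
    99, 111, 109, 109, 97, 110, 100, 32, 116, 101, 114, 109, 105, 110,
    97, 116, 101, 100, 32, 119, 105, 116, 104, 32, 101, 120, 105, 116,
    32, 99, 111, 100, 101, 32, 49, 10,
]

# Decode the raw bytes into readable text.
message = bytes(stderr_bytes).decode("utf-8")

# Assumption: 'status' is a raw wait status, so the exit code is the
# second-lowest byte. This agrees with the "exit status 1" line in the log.
raw_status = 256
exit_code = (raw_status >> 8) & 0xFF

print(repr(message))  # 'command terminated with exit code 1\n'
print(exit_code)      # 1
```

The decoded text only reports that the remote command exited with code 1; it carries no further detail about why stalld was not found.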

Comment 1 Martin Sivák 2021-04-13 09:55:43 UTC
The bug is actually in the test suite, which does not know that stalld was disabled intentionally because of kernel bugs. This will resolve itself once we re-enable stalld, which is going to happen soon.

Comment 2 Martin Sivák 2021-04-26 13:32:35 UTC
RHEL 8.2.z released with the fixed kernel: kernel-4.18.0-193.49.1.el8_2

It seems we are waiting for an OCP release that uses RHCOS https://releases-rhcos-art.cloud.privileged.psi.redhat.com/contents.html?stream=releases%2Frhcos-4.6&release=46.82.202104151840-0 or newer.

Comment 5 Shereen Haj Makhoul 2021-06-18 13:51:23 UTC
Verify the bug fix 
===================

OCP version
===========
[root@ocp-edge41 ]# oc version
Client Version: 4.6.34
Server Version: 4.6.34
Kubernetes Version: v1.19.0+c3e2e69
[root@ocp-edge41 performance]# 

PAO:
====
performance-addon-operator.v4.6.4-5


Steps to verify:
================
- Log in to the worker node:
ssh core@<node ip>

- Then check whether stalld is running:
[core@ocp45-worker-0 ~]$ pidof stalld
11481
[core@ocp45-worker-0 ~]$ systemctl status stalld
● stalld.service - Stall Monitor
   Loaded: loaded (/etc/systemd/system/stalld.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-06-18 12:51:43 UTC; 6min ago
 Main PID: 11481 (stalld)
    Tasks: 1 (limit: 205249)
   Memory: 4.4M
      CPU: 10.101s
   CGroup: /system.slice/stalld.service
           └─11481 /usr/local/bin/stalld -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run>

- Also check the tuned profile:
oc get tuned -A
oc project openshift-cluster-node-tuning-operator
oc get tuned
oc describe tuned/openshift-node-performance-performance

The last command produced the following output:

.
.
.
[service]
service.stalld=start,enable
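
The manual check of the `[service]` section can also be scripted. This is a rough sketch under several assumptions: the raw INI-style profile text has already been extracted from the `oc describe` output, the sample profile below is invented for illustration (only the `[service]` lines come from this report), and `stalld_enabled` is a hypothetical helper name. It uses Python's standard `configparser` in a lenient mode, which may not handle every construct a real tuned profile can contain.

```python
import configparser


def stalld_enabled(profile_text):
    """Return True if [service] sets service.stalld to both start and enable."""
    parser = configparser.ConfigParser(
        strict=False,         # tolerate repeated sections or keys
        interpolation=None,   # treat '%' and '$' literally
        allow_no_value=True,  # tolerate keys without '='
    )
    parser.read_string(profile_text)
    if not parser.has_option("service", "service.stalld"):
        return False
    value = parser.get("service", "service.stalld") or ""
    actions = {action.strip() for action in value.split(",")}
    return {"start", "enable"} <= actions


# Hypothetical profile text; only the [service] lines match the report.
sample_profile = """\
[main]
summary=Example profile for illustration

[service]
service.stalld=start,enable
"""

if __name__ == "__main__":
    print(stalld_enabled(sample_profile))
```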

Comment 7 errata-xmlrpc 2021-06-22 11:22:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.35 low-latency extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2482