Description of problem:
Performance test (cnf-tests-container) "[It] [test_id:35363][crit:high][vendor:cnf-qe][level:acceptance] stalld daemon is running on the host" fails on OCP v4.6.Z.

Version-Release number of selected component (if applicable):
v4.6.24
cnf-tests-container-v4.6.3-1

How reproducible:
Always

Actual results:
[rfe_id:27368][performance]
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:45
  Pre boot tuning adjusted by tuned
  /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:118
    [test_id:35363][crit:high][vendor:cnf-qe][level:acceptance] stalld daemon is running on the host [It]
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:179

    Unexpected error:
        <*exec.ExitError | 0xc0009578a0>: {
            ProcessState: {
                pid: 248,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 199418},
                    Stime: {Sec: 0, Usec: 46329},
                    Maxrss: 54312,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 15454,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 183,
                    Nivcsw: 29,
                },
            },
            Stderr: [99, 111, 109, 109, 97, 110, 100, 32, 116, 101, 114, 109, 105, 110, 97, 116, 101, 100, 32, 119, 105, 116, 104, 32, 101, 120, 105, 116, 32, 99, 111, 100, 101, 32, 49, 10],
        }
        exit status 1
    occurred
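
The Stderr field in the failure above is printed as raw byte values, which hides the actual message. A minimal Python sketch for decoding it, and for reading the exit code out of the reported status value, is below. It assumes the bytes are UTF-8 text and that `status: 256` is a raw POSIX wait status of a normally exited process; the helper names are made up for this illustration and are not part of the test suite.

```python
# Byte values copied verbatim from the Stderr field of the exec.ExitError above.
stderr_bytes = [
    99, 111, 109, 109, 97, 110, 100, 32, 116, 101, 114, 109, 105, 110, 97,
    116, 101, 100, 32, 119, 105, 116, 104, 32, 101, 120, 105, 116, 32, 99,
    111, 100, 101, 32, 49, 10,
]

# ProcessState.status from the report (assumed to be a raw POSIX wait status).
wait_status = 256


def decode_stderr(values):
    """Turn a list of byte values into text (UTF-8 is an assumption)."""
    return bytes(values).decode("utf-8")


def exit_code_from_wait_status(status):
    """Extract the exit code from a raw wait status of a normal exit."""
    return (status >> 8) & 0xFF


if __name__ == "__main__":
    print(repr(decode_stderr(stderr_bytes)))
    print(exit_code_from_wait_status(wait_status))
```

The decoded text reads "command terminated with exit code 1", which matches the "exit status 1" line in the report.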
The bug is actually in the test suite, which does not know that stalld was intentionally disabled because of kernel bugs. This will resolve itself once we re-enable stalld, which is going to happen soon.
RHEL 8.2.z has been released with the fixed kernel: kernel-4.18.0-193.49.1.el8_2

It seems we are waiting for an OCP release that uses the following RHCOS build or newer:
https://releases-rhcos-art.cloud.privileged.psi.redhat.com/contents.html?stream=releases%2Frhcos-4.6&release=46.82.202104151840-0
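
As a rough illustration of the "this build or newer" check, the sketch below compares RHCOS build IDs numerically. The baseline ID is taken from the `release=` parameter of the URL above. The layout `<major>.<minor>.<timestamp>-<respin>` is an assumption inferred from that single ID, and the function names are hypothetical.

```python
import re

# Build ID taken from the release= parameter of the RHCOS URL above.
BASELINE_BUILD = "46.82.202104151840-0"

# Assumed layout: <major>.<minor>.<timestamp>-<respin>, all numeric.
_BUILD_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)-(\d+)")


def parse_build(build_id):
    """Split a build ID into a tuple of integers for ordered comparison."""
    match = _BUILD_RE.fullmatch(build_id)
    if match is None:
        raise ValueError(f"unrecognized build ID: {build_id!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(build_id, baseline=BASELINE_BUILD):
    """Return True if build_id is the baseline build or a newer one."""
    return parse_build(build_id) >= parse_build(baseline)


if __name__ == "__main__":
    # The second and third IDs are made-up examples, not real builds.
    for candidate in (BASELINE_BUILD,
                      "46.82.202103010000-0",
                      "46.82.202105010000-0"):
        print(candidate, is_at_least(candidate))
```

Tuple comparison is used so that each field is compared as a number instead of as a string.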
Verify the bug fix
===================

OCP version
===========
[root@ocp-edge41 ]# oc version
Client Version: 4.6.34
Server Version: 4.6.34
Kubernetes Version: v1.19.0+c3e2e69
[root@ocp-edge41 performance]#

PAO:
====
performance-addon-operator.v4.6.4-5

Steps to verify:
================
- Log in to the worker node:
  ssh core@<node ip>

- Then check whether stalld is running:
  [core@ocp45-worker-0 ~]$ pidof stalld
  11481
  [core@ocp45-worker-0 ~]$ systemctl status stalld
  ● stalld.service - Stall Monitor
     Loaded: loaded (/etc/systemd/system/stalld.service; enabled; vendor preset: disabled)
     Active: active (running) since Fri 2021-06-18 12:51:43 UTC; 6min ago
   Main PID: 11481 (stalld)
      Tasks: 1 (limit: 205249)
     Memory: 4.4M
        CPU: 10.101s
     CGroup: /system.slice/stalld.service
             └─11481 /usr/local/bin/stalld -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run>

- Also check the tuned profile:
  oc get tuned -A
  oc project openshift-cluster-node-tuning-operator
  oc get tuned
  oc describe tuned/openshift-node-performance-performance

  The last command gave the following output:
  .
  .
  .
  [service]
  service.stalld=start,enable
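
The last check above, looking for `service.stalld=start,enable` under `[service]`, could be scripted roughly as follows. This is a sketch that assumes the profile text has already been captured as a string (for example from the `oc describe` output) and that it follows the INI-like layout shown above. The function names are hypothetical.

```python
def stalld_service_setting(profile_text):
    """Return the service.stalld values from the [service] section, or None."""
    section = None
    for raw_line in profile_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()  # entering a new section
            continue
        if section == "service" and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "service.stalld":
                return [item.strip() for item in value.split(",") if item.strip()]
    return None


def stalld_enabled(profile_text):
    """True if the profile both starts and enables the stalld service."""
    setting = stalld_service_setting(profile_text)
    return setting is not None and {"start", "enable"} <= set(setting)


if __name__ == "__main__":
    # Fragment matching the output quoted in the verification steps.
    sample = "[service]\nservice.stalld=start,enable\n"
    print(stalld_enabled(sample))
```

Lines are stripped before matching, so indented output from `oc describe` is handled the same way as an unindented profile.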
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.35 low-latency extras update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2482