Bug 1949027
| Summary: | [4.6.z] Re-enable stalld | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | elevin |
| Component: | Performance Addon Operator | Assignee: | Martin Sivák <msivak> |
| Status: | CLOSED ERRATA | QA Contact: | Gowrishankar Rajaiyan <grajaiya> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.6.z | CC: | aos-bugs, mniranja, shajmakh, yquinn |
| Target Milestone: | --- | ||
| Target Release: | 4.6.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | performance-addon-operator.v4.6.4-5 | Doc Type: | Bug Fix |
| Doc Text: |
Cause:
The stalld service was disabled by default for performance profiles because of a bug in the kernel's HRTICK subsystem that caused the system to hang.
Consequence:
Stalld is required to prevent starvation of operating system threads. Leaving it disabled by default degrades the achievable latency.
Fix:
Kernel versions:
4.18.0-193.49.1.el8_2.x86_64
4.18.0-193.51.1.rt13.101.el8_2.x86_64
(for non real-time and real-time respectively) contain a fix for the HRTICK bug which allows us to re-enable stalld by default for all performance profiles.
Result:
Re-enabling stalld by default restores the latency results expected from tuning the system with the Performance Addon Operator.
Important Note:
The kernel version on every node must be at least (non-real-time and real-time, respectively):
4.18.0-193.49.1.el8_2.x86_64
4.18.0-193.51.1.rt13.101.el8_2.x86_64
This means that the underlying RHCOS version must be updated BEFORE PAO itself.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-06-22 11:22:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1949739 | ||
| Bug Blocks: | |||
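
The kernel requirement in the Doc Text above can be checked mechanically. The following is a minimal sketch, not part of the fix itself: the function names are hypothetical, and it assumes that real-time kernels can be recognized by an `.rtN` field in the release string and that comparing only the leading numeric fields is sufficient (the `rt13.101` patch level is ignored).

```python
import re

# Minimum kernel releases quoted in the Doc Text (non-real-time and real-time).
MIN_KERNEL = "4.18.0-193.49.1.el8_2.x86_64"
MIN_KERNEL_RT = "4.18.0-193.51.1.rt13.101.el8_2.x86_64"


def numeric_prefix(release):
    """Return the leading numeric fields of a kernel release as a tuple of ints.

    Parsing stops at the first field that is not purely numeric,
    such as "el8_2" or "rt13".
    """
    fields = []
    for field in re.split(r"[.-]", release):
        if not re.fullmatch(r"[0-9]+", field):
            break
        fields.append(int(field))
    return tuple(fields)


def is_realtime(release):
    """Assumption: real-time kernels carry an ".rtN" field in the release."""
    return re.search(r"\.rt[0-9]+", release) is not None


def kernel_is_new_enough(release):
    """Compare a release against the minimum for its flavor (RT or non-RT)."""
    minimum = MIN_KERNEL_RT if is_realtime(release) else MIN_KERNEL
    return numeric_prefix(release) >= numeric_prefix(minimum)
```

The release string can come from `uname -r` on a node or from the KERNEL-VERSION column of `oc get nodes -o wide`. A node for which `kernel_is_new_enough` returns `False` would need its RHCOS updated before PAO is upgraded.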
The bug is actually in the test suite, which does not know that stalld was disabled intentionally due to kernel bugs. This will resolve itself once we re-enable stalld, and that is going to happen soon.

RHEL 8.2.z was released with the fixed kernel: kernel-4.18.0-193.49.1.el8_2

It seems we are waiting for an OCP release that uses RHCOS https://releases-rhcos-art.cloud.privileged.psi.redhat.com/contents.html?stream=releases%2Frhcos-4.6&release=46.82.202104151840-0 and newer

Verify the bug fix
===================
OCP version
===========
[root@ocp-edge41 ]# oc version
Client Version: 4.6.34
Server Version: 4.6.34
Kubernetes Version: v1.19.0+c3e2e69
[root@ocp-edge41 performance]#
PAO:
====
performance-addon-operator.v4.6.4-5
Steps to verify:
================
- Log in to the worker node:
ssh core@<node ip>
- Then check whether stalld is running:
[core@ocp45-worker-0 ~]$ pidof stalld
11481
[core@ocp45-worker-0 ~]$ systemctl status stalld
● stalld.service - Stall Monitor
Loaded: loaded (/etc/systemd/system/stalld.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2021-06-18 12:51:43 UTC; 6min ago
Main PID: 11481 (stalld)
Tasks: 1 (limit: 205249)
Memory: 4.4M
CPU: 10.101s
CGroup: /system.slice/stalld.service
└─11481 /usr/local/bin/stalld -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run>
- Also check the tuned profile:
oc get tuned -A
oc project openshift-cluster-node-tuning-operator
oc get tuned
oc describe tuned/openshift-node-performance-performance
The last command produced the following output:
.
.
.
[service]
service.stalld=start,enable
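
The tuned profile check above can also be scripted. This is a minimal sketch under stated assumptions: it takes the text printed by `oc describe tuned/openshift-node-performance-performance` as a plain string, and the function names are hypothetical. It looks for a `service.stalld` entry in the `[service]` section and reports whether both `start` and `enable` are listed.

```python
def stalld_service_actions(profile_text):
    """Return the actions listed for service.stalld in the [service] section.

    Returns an empty set when the entry is missing.
    """
    section = None
    for raw_line in profile_text.splitlines():
        line = raw_line.strip()
        if line.startswith("[") and line.endswith("]"):
            # Track which INI-style section we are currently in.
            section = line[1:-1]
        elif section == "service" and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "service.stalld":
                return {item.strip() for item in value.split(",") if item.strip()}
    return set()


def stalld_enabled_by_profile(profile_text):
    """True when the profile both starts and enables stalld."""
    return {"start", "enable"} <= stalld_service_actions(profile_text)


# Excerpt taken from the verification output in this report.
sample = "[service]\nservice.stalld=start,enable\n"
print(stalld_enabled_by_profile(sample))
```

This only confirms what the profile requests; `pidof stalld` or `systemctl status stalld` on the node is still needed to confirm that the service is actually running.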
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.35 low-latency extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2482
Description of problem:
Performance test (cnf-tests-container) "[It] [test_id:35363][crit:high][vendor:cnf-qe][level:acceptance] stalld daemon is running on the host" fails on OCP v4.6.Z.
Version-Release number of selected component (if applicable):
v4.6.24
cnf-tests-container-v4.6.3-1
How reproducible:
Always
Actual results:
[rfe_id:27368][performance]
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:45
  Pre boot tuning adjusted by tuned
  /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:118
    [test_id:35363][crit:high][vendor:cnf-qe][level:acceptance] stalld daemon is running on the host [It]
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/performance.go:179
    Unexpected error:
        <*exec.ExitError | 0xc0009578a0>: {
            ProcessState: {
                pid: 248,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 199418},
                    Stime: {Sec: 0, Usec: 46329},
                    Maxrss: 54312,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 15454,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 183,
                    Nivcsw: 29,
                },
            },
            Stderr: [99, 111, 109, 109, 97, 110, 100, 32, 116, 101, 114, 109, 105, 110, 97, 116, 101, 100, 32, 119, 105, 116, 104, 32, 101, 120, 105, 116, 32, 99, 111, 100, 101, 32, 49, 10],
        }
        exit status 1
    occurred
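
The `Stderr` field in the error above is printed as a list of byte values, and `status` is a raw wait status, so neither is readable at a glance. The sketch below decodes both. The helper names are hypothetical, and the exit-code extraction assumes the Linux convention that a normally exited process stores its exit code in bits 8 to 15 of the wait status.

```python
# Byte values copied from the Stderr field of the exec.ExitError in this report.
STDERR_BYTES = [
    99, 111, 109, 109, 97, 110, 100, 32, 116, 101, 114, 109, 105, 110, 97,
    116, 101, 100, 32, 119, 105, 116, 104, 32, 101, 120, 105, 116, 32, 99,
    111, 100, 101, 32, 49, 10,
]

# Raw wait status copied from ProcessState.status.
WAIT_STATUS = 256


def decode_stderr(byte_values):
    """Turn a list of byte values into the text the process wrote to stderr."""
    return bytes(byte_values).decode("utf-8")


def exit_code_from_wait_status(status):
    """Extract the exit code of a normally exited process (Linux convention)."""
    return (status >> 8) & 0xFF


print(repr(decode_stderr(STDERR_BYTES)))        # 'command terminated with exit code 1\n'
print(exit_code_from_wait_status(WAIT_STATUS))  # 1
```

Both values agree with the `exit status 1` line in the report: the command the test ran on the host exited non-zero, which is what would be expected if stalld was not running at the time.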