+++ This bug was initially created as a clone of Bug #2017427 +++

Description of problem:
There are cases where the TuneD daemon seems to be stuck during application of a profile (see rhbz#2013940). NTO does not restart the TuneD daemon when profile application is taking too long.

Version-Release number of selected component (if applicable):
All

How reproducible:
Always

Steps to Reproduce:
1. Create a profile that will take too long to get applied by NTO. For example:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:inf}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

Actual results:
Profile application will never be restarted/retried.

Expected results:
Profile application should be restarted/retried.

Additional info:
https://github.com/openshift/cluster-node-tuning-operator/pull/282
Fixed in 4.9.0-0.nightly-2021-10-30-120753 and above. QE, please confirm so we can unblock the 4.8 backport.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-10-30-120753   True        False         88m     Cluster version is 4.9.0-0.nightly-2021-10-30-120753

$ oc get no
NAME                                                          STATUS   ROLES    AGE    VERSION
jmencak-fh99x-master-0.c.openshift-gce-devel.internal         Ready    master   104m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-master-1.c.openshift-gce-devel.internal         Ready    master   105m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-master-2.c.openshift-gce-devel.internal         Ready    master   104m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal   Ready    worker   97m    v1.22.0-rc.0+a44d0f0
jmencak-fh99x-worker-b-hxdms.c.openshift-gce-devel.internal   Ready    worker   97m    v1.22.0-rc.0+a44d0f0

$ oc label no jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal profile=
node/jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal labeled

$ cat stuck.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:72}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

$ oc create -f stuck.yaml
$ oc project openshift-cluster-node-tuning-operator
$ oc get po -o wide|grep worker-a
tuned-kkvr9   1/1   Running   0   101m   10.0.128.3   jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal   <none>   <none>

$ oc logs tuned-kkvr9 | tail -n28
I1102 11:59:12.416986 2274 tuned.go:1229] previous application of TuneD profile failed; change detected, scheduling full restart in 1s
2021-11-02 11:59:12,518 INFO tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 11:59:12,523 INFO tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 11:59:12,523 INFO tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 11:59:12,524 INFO tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 11:59:12,524 INFO tuned.profiles.loader: loading profile: openshift-profile-stuck
E1102 11:59:13.417860 2274 tuned.go:1211] timeout (60) to apply TuneD profile; restarting TuneD daemon
E1102 11:59:13.419970 2274 tuned.go:508] error waiting for tuned: signal: terminated
I1102 11:59:13.420128 2274 tuned.go:441] starting tuned...
2021-11-02 11:59:13,538 INFO tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 11:59:13,543 INFO tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 11:59:13,543 INFO tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 11:59:13,544 INFO tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 11:59:13,544 INFO tuned.profiles.loader: loading profile: openshift-profile-stuck
E1102 12:00:13.420158 2274 tuned.go:1211] timeout (120) to apply TuneD profile; restarting TuneD daemon
E1102 12:00:13.421876 2274 tuned.go:508] error waiting for tuned: signal: terminated
I1102 12:00:13.421965 2274 tuned.go:441] starting tuned...
2021-11-02 12:00:13,532 INFO tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 12:00:13,537 INFO tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 12:00:13,538 INFO tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 12:00:13,538 INFO tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 12:00:13,539 INFO tuned.profiles.loader: loading profile: openshift-profile-stuck
2021-11-02 12:01:25,544 INFO tuned.daemon.controller: starting controller
2021-11-02 12:01:25,544 INFO tuned.daemon.daemon: starting tuning
2021-11-02 12:01:25,546 INFO tuned.daemon.daemon: static tuning from profile 'openshift-profile-stuck' applied
I1102 12:01:25.558914 2274 tuned.go:428] written "/etc/tuned/recommend.d/50-openshift.conf" to set TuneD profile openshift-profile-stuck
I1102 12:01:25.559183 2274 tuned.go:995] updated Profile jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal stalld=<nil>, bootcmdline:
I1102 12:01:25.682873 2274 tuned.go:719] active and recommended profile (openshift-profile-stuck) match; profile change will not trigger profile reload
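
For readers unfamiliar with the pattern, the restart-on-timeout behaviour visible in the log (a "timeout (60)" restart, then a "timeout (120)" restart, then a successful application) can be illustrated with a small supervisor loop. This is only a sketch in Python, not the operator's Go code in tuned.go: the names `run_with_restarts` and `sleeper` are hypothetical, and the doubling of the limit on each retry is an assumption inferred from the two values reported in the log. How the operator actually arms its timer is not visible here, so the sketch's timing will not necessarily match the timestamps above.

```python
import subprocess
import sys


def run_with_restarts(cmd, first_timeout, max_attempts=5):
    """Run cmd; kill and restart it whenever it outlives the current timeout.

    Returns the list of timeouts (in seconds) that expired before a run
    finished. Raises RuntimeError if every attempt times out.
    """
    timeout = first_timeout
    expired = []
    for _ in range(max_attempts):
        try:
            # subprocess.run kills the child itself when the timeout expires,
            # then raises TimeoutExpired.
            subprocess.run(cmd, timeout=timeout)
            return expired
        except subprocess.TimeoutExpired:
            expired.append(timeout)
            # Assumption: the limit doubles per retry (60 -> 120 in the log).
            timeout *= 2
    raise RuntimeError(f"command still timing out after {max_attempts} attempts")


def sleeper(seconds):
    """Hypothetical stand-in for a profile whose application blocks for `seconds`."""
    return [sys.executable, "-c", f"import time; time.sleep({seconds})"]
```

For a quick local check with short durations, `run_with_restarts(sleeper(1.2), first_timeout=0.5)` kills the first two runs and lets the third one finish. A command that blocks forever, like the `sleep:inf` profile in the description, exhausts `max_attempts` and raises instead of hanging.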
Also verified in my environment; the bug is fixed now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.6 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4119