Bug 2017488

Summary: NTO does not restart TuneD daemon when profile application is taking too long
Product: OpenShift Container Platform
Component: Node Tuning Operator
Version: 4.10
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Assignee: Jiří Mencák <jmencak>
QA Contact: liqcui
CC: aos-bugs, dagray
Target Milestone: ---   
Target Release: 4.9.z   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: If docs needed, set a value
Clone Of:
: 2018053
Last Closed: 2021-11-10 21:03:03 UTC
Bug Depends On: 2017427, 2029436    
Bug Blocks: 2018053    

Description OpenShift BugZilla Robot 2021-10-26 15:30:17 UTC
+++ This bug was initially created as a clone of Bug #2017427 +++

Description of problem:
There are cases where the TuneD daemon appears to get stuck while applying a profile (see rhbz#2013940).  NTO does not restart the TuneD daemon when profile application takes too long.

Version-Release number of selected component (if applicable):
All

How reproducible:
Always

Steps to Reproduce:
1. Create a profile that will take too long to get applied by NTO.  For example:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:inf}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

Actual results:
Profile application will never be restarted/retried.

Expected results:
Profile application should be restarted/retried.

Additional info:
https://github.com/openshift/cluster-node-tuning-operator/pull/282
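
The expected restart/retry behaviour can be illustrated with a small sketch. This is not the code from the pull request above. The names run_once and apply_with_restarts, the attempt limit, and the doubling of the timeout are all assumptions made for illustration; the 60-second starting value is borrowed from the "timeout (60)" message in the verification log in Comment 2.

```python
import subprocess


def run_once(cmd, timeout):
    """Run cmd and report whether it finished within `timeout` seconds.

    subprocess.run() kills the child process when the timeout expires,
    which stands in here for restarting a stuck daemon.
    """
    try:
        subprocess.run(cmd, timeout=timeout, check=False)
        return True
    except subprocess.TimeoutExpired:
        return False


def apply_with_restarts(apply, initial_timeout=60, max_attempts=5):
    """Call apply(timeout) until it succeeds, doubling the timeout each time.

    `apply` is any callable that returns True on success and False when the
    timeout was hit. Doubling is an assumed back-off policy, not a statement
    about what NTO does. Returns (attempt_number, timeout_used).
    """
    timeout = initial_timeout
    for attempt in range(1, max_attempts + 1):
        if apply(timeout):
            return attempt, timeout
        timeout *= 2  # assumed: allow a slow profile more time on the next try
    raise TimeoutError(f"gave up after {max_attempts} attempts")
```

For example, apply_with_restarts(lambda t: run_once(["sleep", "72"], t)) would time out on the first 60-second attempt and succeed on the second, 120-second one. With "sleep inf", as in the reproducer above, it would exhaust its attempts and raise TimeoutError.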

Comment 2 Jiří Mencák 2021-11-02 12:05:56 UTC
Fixed in 4.9.0-0.nightly-2021-10-30-120753 and above.  QE, please confirm so we can unblock the 4.8 backport.

$ oc get clusterversion

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-10-30-120753   True        False         88m     Cluster version is 4.9.0-0.nightly-2021-10-30-120753

$ oc get no
NAME                                                          STATUS   ROLES    AGE    VERSION
jmencak-fh99x-master-0.c.openshift-gce-devel.internal         Ready    master   104m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-master-1.c.openshift-gce-devel.internal         Ready    master   105m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-master-2.c.openshift-gce-devel.internal         Ready    master   104m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal   Ready    worker   97m    v1.22.0-rc.0+a44d0f0
jmencak-fh99x-worker-b-hxdms.c.openshift-gce-devel.internal   Ready    worker   97m    v1.22.0-rc.0+a44d0f0

$ oc label no jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal profile=
node/jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal labeled

$ cat stuck.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:72}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

$ oc create -f stuck.yaml

$ oc project openshift-cluster-node-tuning-operator

$ oc get po -o wide|grep worker-a
tuned-kkvr9                                     1/1     Running   0          101m   10.0.128.3    jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal   <none>           <none>

$ oc logs tuned-kkvr9 | tail -n28
I1102 11:59:12.416986    2274 tuned.go:1229] previous application of TuneD profile failed; change detected, scheduling full restart in 1s
2021-11-02 11:59:12,518 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 11:59:12,523 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 11:59:12,523 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 11:59:12,524 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 11:59:12,524 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
E1102 11:59:13.417860    2274 tuned.go:1211] timeout (60) to apply TuneD profile; restarting TuneD daemon
E1102 11:59:13.419970    2274 tuned.go:508] error waiting for tuned: signal: terminated
I1102 11:59:13.420128    2274 tuned.go:441] starting tuned...
2021-11-02 11:59:13,538 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 11:59:13,543 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 11:59:13,543 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 11:59:13,544 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 11:59:13,544 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
E1102 12:00:13.420158    2274 tuned.go:1211] timeout (120) to apply TuneD profile; restarting TuneD daemon
E1102 12:00:13.421876    2274 tuned.go:508] error waiting for tuned: signal: terminated
I1102 12:00:13.421965    2274 tuned.go:441] starting tuned...
2021-11-02 12:00:13,532 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 12:00:13,537 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 12:00:13,538 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 12:00:13,538 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 12:00:13,539 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
2021-11-02 12:01:25,544 INFO     tuned.daemon.controller: starting controller
2021-11-02 12:01:25,544 INFO     tuned.daemon.daemon: starting tuning
2021-11-02 12:01:25,546 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-profile-stuck' applied
I1102 12:01:25.558914    2274 tuned.go:428] written "/etc/tuned/recommend.d/50-openshift.conf" to set TuneD profile openshift-profile-stuck
I1102 12:01:25.559183    2274 tuned.go:995] updated Profile jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal stalld=<nil>, bootcmdline: 
I1102 12:01:25.682873    2274 tuned.go:719] active and recommended profile (openshift-profile-stuck) match; profile change will not trigger profile reload
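
When verifying the fix, it can help to pull just the restart events out of the pod log. The following is a minimal sketch that assumes the message format shown in the log above ("timeout (N) to apply TuneD profile; restarting TuneD daemon"); the names TIMEOUT_RE, restart_timeouts and SAMPLE are made up for this example, and the message text may differ in other NTO versions.

```python
import re

# Matches the operand's restart message as it appears in the log above.
TIMEOUT_RE = re.compile(
    r"timeout \((\d+)\) to apply TuneD profile; restarting TuneD daemon"
)


def restart_timeouts(log_text):
    """Return the timeout values (in log order) of every TuneD restart."""
    return [int(m.group(1)) for m in TIMEOUT_RE.finditer(log_text)]


# Two lines copied from the log above, used as sample input.
SAMPLE = """\
E1102 11:59:13.417860    2274 tuned.go:1211] timeout (60) to apply TuneD profile; restarting TuneD daemon
E1102 12:00:13.420158    2274 tuned.go:1211] timeout (120) to apply TuneD profile; restarting TuneD daemon
"""

print(restart_timeouts(SAMPLE))  # prints [60, 120]
```

An empty result on a node running the stuck profile would suggest that the daemon is not being restarted, which is the behaviour this bug describes.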

Comment 3 liqcui 2021-11-02 13:02:21 UTC
Also verified in my environment; the bug is fixed now.

Comment 6 errata-xmlrpc 2021-11-10 21:03:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4119