Bug 2018053

Summary: NTO does not restart TuneD daemon when profile application is taking too long
Product: OpenShift Container Platform
Reporter: Jiří Mencák <jmencak>
Component: Node Tuning Operator
Assignee: Jiří Mencák <jmencak>
Status: CLOSED ERRATA
QA Contact: liqcui
Severity: high
Priority: high
Version: 4.10
CC: aos-bugs, dagray, openshift-bugzilla-robot, skordas
Target Milestone: ---   
Target Release: 4.8.z   
Hardware: Unspecified   
OS: Unspecified   
Clone Of: 2017488
Clones: 2020518
Last Closed: 2021-11-16 21:22:58 UTC
Bug Depends On: 2017488    
Bug Blocks: 2020518    

Description Jiří Mencák 2021-10-28 05:50:25 UTC
+++ This bug was initially created as a clone of Bug #2017488 +++

+++ This bug was initially created as a clone of Bug #2017427 +++

Description of problem:
There are cases where the TuneD daemon appears to get stuck while applying a profile (see rhbz#2013940).  NTO does not restart the TuneD daemon when profile application takes too long.

Version-Release number of selected component (if applicable):
All

How reproducible:
Always

Steps to Reproduce:
1. Create a profile that will take too long to get applied by NTO.  For example:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:inf}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

Actual results:
Profile application hangs and is never restarted or retried.

Expected results:
Profile application should be restarted/retried.

Additional info:
https://github.com/openshift/cluster-node-tuning-operator/pull/282

Comment 4 Jiří Mencák 2021-11-04 15:53:33 UTC
Fixed in 4.8.0-0.nightly-2021-11-03-171325 and above.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-11-03-171325   True        False         7h37m   Cluster version is 4.8.0-0.nightly-2021-11-03-171325

$ oc project openshift-cluster-node-tuning-operator

$ oc get po -o wide|grep worker-a
tuned-d6s6j                                    1/1     Running   0          7h51m   10.0.128.3    jmencak-hcp9p-worker-a-7rzvf.c.openshift-gce-devel.internal   <none>           <none>

$ oc label no jmencak-hcp9p-worker-a-7rzvf.c.openshift-gce-devel.internal profile=

$ cat stuck.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:72}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

$ oc create -f stuck.yaml

$ oc logs -f tuned-d6s6j | tail -n17
I1104 15:45:46.249348    2398 tuned.go:542] reloading tuned...
I1104 15:45:46.249354    2398 tuned.go:545] sending HUP to PID 3628
2021-11-04 15:45:46,249 INFO     tuned.daemon.daemon: stopping tuning
2021-11-04 15:45:46,266 INFO     tuned.daemon.daemon: terminating Tuned, rolling back all changes
2021-11-04 15:45:46,313 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-04 15:45:46,314 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-04 15:45:46,314 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
E1104 15:46:46.249925    2398 tuned.go:1128] timeout (60) to apply TuneD profile; restarting TuneD daemon
E1104 15:46:56.252435    2398 tuned.go:479] error waiting for tuned: signal: killed
I1104 15:46:56.252578    2398 tuned.go:429] starting tuned...
I1104 15:46:56.268933    2398 tuned.go:917] updated Profile jmencak-hcp9p-worker-a-7rzvf.c.openshift-gce-devel.internal stalld=<nil>, bootcmdline: 
I1104 15:46:56.269286    2398 tuned.go:416] written "/etc/tuned/recommend.d/50-openshift.conf" to set Tuned profile openshift-profile-stuck
2021-11-04 15:46:56,371 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-11-04 15:46:56,377 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-04 15:46:56,377 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-04 15:46:56,378 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-04 15:46:56,379 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
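
The timeout event in the log above can be checked for mechanically. A minimal sketch, assuming the message text stays as shown in this log; count_tuned_timeouts is a hypothetical helper name, not part of NTO or oc:

```shell
# Hypothetical helper: count "timeout ... restarting TuneD daemon" events
# in operand log text read from stdin. The pattern is copied from the log
# line above; whether other versions use the same wording is an assumption.
count_tuned_timeouts() {
    # grep -c prints the number of matching lines. It exits 1 when there
    # are no matches, so "|| true" keeps the helper usable under "set -e".
    grep -c 'timeout ([0-9]*) to apply TuneD profile; restarting TuneD daemon' || true
}
```

For example: oc logs tuned-d6s6j | count_tuned_timeouts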

The 4.8 fix does not implement exponential backoff. Instead, profile application times out after 60 seconds,
NTO restarts the TuneD daemon, and the profile application is retried.
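
The timeout-and-retry behavior described above can be illustrated in shell. This is only a sketch of the general pattern, not the operator's code (NTO implements this in tuned.go). It assumes GNU coreutils timeout(1); apply_with_timeout, ATTEMPTS, and the retry cap are hypothetical and exist only so that the example terminates:

```shell
# Sketch only: run a command under a deadline and retry it immediately,
# with no exponential backoff. Requires GNU coreutils timeout(1).
# apply_with_timeout, ATTEMPTS and max_tries are hypothetical names.
ATTEMPTS=0

apply_with_timeout() {
    local limit="$1" max_tries="$2" rc=0
    shift 2
    ATTEMPTS=0
    while [ "$ATTEMPTS" -lt "$max_tries" ]; do
        ATTEMPTS=$((ATTEMPTS + 1))
        rc=0
        # Send TERM after $limit seconds, then KILL 1 second later if the
        # command is still alive.
        timeout -k 1 "$limit" "$@" || rc=$?
        # 124 = timed out, 137 = had to be killed; any other status is final.
        if [ "$rc" -ne 124 ] && [ "$rc" -ne 137 ]; then
            return "$rc"
        fi
        echo "timeout ($limit) reached on attempt $ATTEMPTS; restarting" >&2
    done
    return "$rc"
}
```

For example, apply_with_timeout 60 3 sleep 72 would hit the 60-second deadline on each of its three attempts, much like the stuck.yaml profile above hits it on every retry.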

QE, please acknowledge the fix.

Comment 7 errata-xmlrpc 2021-11-16 21:22:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.20 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4574