Bug 2017427 - NTO does not restart TuneD daemon when profile application is taking too long
Summary: NTO does not restart TuneD daemon when profile application is taking too long
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Tuning Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jiří Mencák
QA Contact: liqcui
URL:
Whiteboard:
Depends On:
Blocks: 2017488 2029436
Reported: 2021-10-26 13:54 UTC by Jiří Mencák
Modified: 2022-03-10 16:22 UTC
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2029436 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:22:07 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-node-tuning-operator pull 282 0 None open Bug 2017427: tuned: add timeout and restarts 2021-10-26 13:54:28 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:22:31 UTC

Description Jiří Mencák 2021-10-26 13:54:01 UTC
Description of problem:
There are cases where the TuneD daemon appears to be stuck during application of a profile (see rhbz#2013940). NTO does not restart the TuneD daemon when profile application takes too long.

Version-Release number of selected component (if applicable):
All

How reproducible:
Always

Steps to Reproduce:
1. Create a profile that takes too long for NTO to apply. For example:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:inf}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

Actual results:
Profile application will never be restarted/retried.

Expected results:
Profile application should be restarted/retried.

Additional info:
https://github.com/openshift/cluster-node-tuning-operator/pull/282

Comment 3 Jiří Mencák 2021-10-28 06:44:43 UTC
Fixed in 4.10.0-0.nightly-2021-10-27-230233 and above.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-27-230233   True        False         20m     Cluster version is 4.10.0-0.nightly-2021-10-27-230233

$ oc get no
NAME                                                          STATUS   ROLES    AGE   VERSION
jmencak-7r5lg-master-0.c.openshift-gce-devel.internal         Ready    master   35m   v1.22.1+674f31e
jmencak-7r5lg-master-1.c.openshift-gce-devel.internal         Ready    master   35m   v1.22.1+674f31e
jmencak-7r5lg-master-2.c.openshift-gce-devel.internal         Ready    master   35m   v1.22.1+674f31e
jmencak-7r5lg-worker-a-tlq29.c.openshift-gce-devel.internal   Ready    worker   27m   v1.22.1+674f31e
jmencak-7r5lg-worker-b-dd727.c.openshift-gce-devel.internal   Ready    worker   27m   v1.22.1+674f31e

$ oc label no jmencak-7r5lg-worker-a-tlq29.c.openshift-gce-devel.internal profile=
node/jmencak-7r5lg-worker-a-tlq29.c.openshift-gce-devel.internal labeled

$ oc get po -o wide|grep jmencak-7r5lg-worker-a-tlq29.c.openshift-gce-devel.internal
tuned-pnl8x                                     1/1     Running   0          28m   10.0.128.2    jmencak-7r5lg-worker-a-tlq29.c.openshift-gce-devel.internal   <none>           <none>

$ cat stuck.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:72}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

$ oc create -f stuck.yaml

$ oc logs tuned-pnl8x | tail -n 28
I1028 06:37:13.201963    2182 tuned.go:1229] previous application of TuneD profile failed; change detected, scheduling full restart in 1s
2021-10-28 06:37:13,299 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-10-28 06:37:13,303 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-10-28 06:37:13,304 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-10-28 06:37:13,304 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-10-28 06:37:13,305 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
E1028 06:37:14.202848    2182 tuned.go:1211] timeout (60) to apply TuneD profile; restarting TuneD daemon
E1028 06:37:14.205003    2182 tuned.go:508] error waiting for tuned: signal: terminated
I1028 06:37:14.205213    2182 tuned.go:441] starting tuned...
2021-10-28 06:37:14,327 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-10-28 06:37:14,332 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-10-28 06:37:14,333 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-10-28 06:37:14,333 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-10-28 06:37:14,334 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
E1028 06:38:14.205888    2182 tuned.go:1211] timeout (120) to apply TuneD profile; restarting TuneD daemon
E1028 06:38:14.207821    2182 tuned.go:508] error waiting for tuned: signal: terminated
I1028 06:38:14.208077    2182 tuned.go:441] starting tuned...
2021-10-28 06:38:14,339 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-10-28 06:38:14,343 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-10-28 06:38:14,344 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-10-28 06:38:14,344 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-10-28 06:38:14,345 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
2021-10-28 06:39:26,351 INFO     tuned.daemon.controller: starting controller
2021-10-28 06:39:26,351 INFO     tuned.daemon.daemon: starting tuning
2021-10-28 06:39:26,352 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-profile-stuck' applied
I1028 06:39:26.365143    2182 tuned.go:995] updated Profile jmencak-7r5lg-worker-a-tlq29.c.openshift-gce-devel.internal stalld=<nil>, bootcmdline: 
I1028 06:39:26.365402    2182 tuned.go:428] written "/etc/tuned/recommend.d/50-openshift.conf" to set TuneD profile openshift-profile-stuck
I1028 06:39:26.476307    2182 tuned.go:719] active and recommended profile (openshift-profile-stuck) match; profile change will not trigger profile reload
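One plausible reading of the timestamps above (this interpretation and all names below are assumptions, not confirmed by the source): the restart deadline is measured from the profile change and doubles each time it fires (60s, 120s, 240s, ...), giving attempt windows of 60s, 60s, 120s, ... Since `${f:exec:sleep:72}` blocks each attempt for 72s, the first window long enough is the third:

```go
package main

import "fmt"

// successfulAttempt computes, under the assumed doubling-deadline scheme,
// which attempt first gets a window long enough for the blocking variable
// (sleepSecs) to finish, and roughly when the profile is applied.
func successfulAttempt(firstDeadline, sleepSecs int) (attempt, appliedAt int) {
	deadline, prev := firstDeadline, 0
	attempt = 1
	for deadline-prev < sleepSecs { // window too short: TuneD is restarted
		prev = deadline
		deadline *= 2
		attempt++
	}
	return attempt, prev + sleepSecs
}

func main() {
	attempt, at := successfulAttempt(60, 72)
	fmt.Printf("attempt %d succeeds, ~%ds after the profile change\n", attempt, at)
	// prints: attempt 3 succeeds, ~192s after the profile change
}
```

This lines up with the log: the "static tuning ... applied" message at 06:39:26 arrives 72s after the last restart at 06:38:14, on the third attempt.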

$ oc get profile
NAME                                                          TUNED                     APPLIED   DEGRADED   AGE
jmencak-7r5lg-master-0.c.openshift-gce-devel.internal         openshift-control-plane   True      False      48m
jmencak-7r5lg-master-1.c.openshift-gce-devel.internal         openshift-control-plane   True      False      48m
jmencak-7r5lg-master-2.c.openshift-gce-devel.internal         openshift-control-plane   True      False      48m
jmencak-7r5lg-worker-a-tlq29.c.openshift-gce-devel.internal   openshift-profile-stuck   True      False      42m
jmencak-7r5lg-worker-b-dd727.c.openshift-gce-devel.internal   openshift-node            True      False      42m

Comment 6 errata-xmlrpc 2022-03-10 16:22:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

