2017488 – NTO does not restart TuneD daemon when profile application is taking too long

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node Tuning Operator
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.9.z
Assignee:	Jiří Mencák
QA Contact:	liqcui
Docs Contact:
URL:
Whiteboard:
Depends On:	2017427 2029436
Blocks:	2018053

Reported:	2021-10-26 15:30 UTC by OpenShift BugZilla Robot
Modified:	2021-12-06 14:05 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2018053
Environment:
Last Closed:	2021-11-10 21:03:03 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-node-tuning-operator pull 285	0	None	open	[release-4.9] Bug 2017488: tuned: add timeout and restarts	2021-10-29 05:40:18 UTC
Red Hat Product Errata	RHBA-2021:4119	0	None	None	None	2021-11-10 21:03:14 UTC

Description OpenShift BugZilla Robot 2021-10-26 15:30:17 UTC

+++ This bug was initially created as a clone of Bug #2017427 +++

Description of problem:
There are cases where the TuneD daemon appears to get stuck while applying a profile (see rhbz#2013940).  NTO does not restart the TuneD daemon when profile application takes too long.

Version-Release number of selected component (if applicable):
All

How reproducible:
Always

Steps to Reproduce:
1. Create a profile that will take too long to get applied by NTO.  For example:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:inf}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

Actual results:
Profile application is never restarted/retried.

Expected results:
Profile application should be restarted/retried.

Additional info:
https://github.com/openshift/cluster-node-tuning-operator/pull/282

Comment 2 Jiří Mencák 2021-11-02 12:05:56 UTC

Fixed in 4.9.0-0.nightly-2021-10-30-120753 and above.  QE, please confirm so we can unblock the 4.8 backport.

$ oc get clusterversion

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-10-30-120753   True        False         88m     Cluster version is 4.9.0-0.nightly-2021-10-30-120753

$ oc get no
NAME                                                          STATUS   ROLES    AGE    VERSION
jmencak-fh99x-master-0.c.openshift-gce-devel.internal         Ready    master   104m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-master-1.c.openshift-gce-devel.internal         Ready    master   105m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-master-2.c.openshift-gce-devel.internal         Ready    master   104m   v1.22.0-rc.0+a44d0f0
jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal   Ready    worker   97m    v1.22.0-rc.0+a44d0f0
jmencak-fh99x-worker-b-hxdms.c.openshift-gce-devel.internal   Ready    worker   97m    v1.22.0-rc.0+a44d0f0

$ oc label no jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal profile=
node/jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal labeled

$ cat stuck.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:72}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

$ oc create -f stuck.yaml

$ oc project openshift-cluster-node-tuning-operator

$ oc get po -o wide|grep worker-a
tuned-kkvr9                                     1/1     Running   0          101m   10.0.128.3    jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal   <none>           <none>

$ oc logs tuned-kkvr9 | tail -n28
I1102 11:59:12.416986    2274 tuned.go:1229] previous application of TuneD profile failed; change detected, scheduling full restart in 1s
2021-11-02 11:59:12,518 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 11:59:12,523 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 11:59:12,523 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 11:59:12,524 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 11:59:12,524 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
E1102 11:59:13.417860    2274 tuned.go:1211] timeout (60) to apply TuneD profile; restarting TuneD daemon
E1102 11:59:13.419970    2274 tuned.go:508] error waiting for tuned: signal: terminated
I1102 11:59:13.420128    2274 tuned.go:441] starting tuned...
2021-11-02 11:59:13,538 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 11:59:13,543 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 11:59:13,543 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 11:59:13,544 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 11:59:13,544 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
E1102 12:00:13.420158    2274 tuned.go:1211] timeout (120) to apply TuneD profile; restarting TuneD daemon
E1102 12:00:13.421876    2274 tuned.go:508] error waiting for tuned: signal: terminated
I1102 12:00:13.421965    2274 tuned.go:441] starting tuned...
2021-11-02 12:00:13,532 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-11-02 12:00:13,537 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-02 12:00:13,538 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-02 12:00:13,538 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-02 12:00:13,539 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
2021-11-02 12:01:25,544 INFO     tuned.daemon.controller: starting controller
2021-11-02 12:01:25,544 INFO     tuned.daemon.daemon: starting tuning
2021-11-02 12:01:25,546 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-profile-stuck' applied
I1102 12:01:25.558914    2274 tuned.go:428] written "/etc/tuned/recommend.d/50-openshift.conf" to set TuneD profile openshift-profile-stuck
I1102 12:01:25.559183    2274 tuned.go:995] updated Profile jmencak-fh99x-worker-a-kkhfc.c.openshift-gce-devel.internal stalld=<nil>, bootcmdline: 
I1102 12:01:25.682873    2274 tuned.go:719] active and recommended profile (openshift-profile-stuck) match; profile change will not trigger profile reload

Comment 3 liqcui 2021-11-02 13:02:21 UTC

Also verified in my environment; the bug is fixed now.

Comment 6 errata-xmlrpc 2021-11-10 21:03:03 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.6 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4119
