2018053 – NTO does not restart TuneD daemon when profile application is taking too long

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node Tuning Operator
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.z
Assignee:	Jiří Mencák
QA Contact:	liqcui
Docs Contact:
URL:
Whiteboard:
Depends On:	2017488
Blocks:	2020518

Reported:	2021-10-28 05:50 UTC by Jiří Mencák
Modified:	2021-12-06 14:05 UTC
CC List:	4 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:	2017488
Clones:	2020518
Environment:
Last Closed:	2021-11-16 21:22:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-node-tuning-operator pull 286	0	None	open	Bug 2018053: tuned: add timeout and restarts	2021-11-02 13:08:24 UTC
Red Hat Product Errata	RHBA-2021:4574	0	None	None	None	2021-11-16 21:23:07 UTC

Description Jiří Mencák 2021-10-28 05:50:25 UTC

+++ This bug was initially created as a clone of Bug #2017488 +++

+++ This bug was initially created as a clone of Bug #2017427 +++

Description of problem:
There are cases where the TuneD daemon appears to get stuck while applying a profile (see rhbz#2013940). NTO does not restart the TuneD daemon when profile application takes too long.

Version-Release number of selected component (if applicable):
All

How reproducible:
Always

Steps to Reproduce:
1. Create a profile that takes too long for NTO to apply, and label a node with "profile" so that the profile is recommended for it. For example:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:inf}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

Actual results:
Profile application hangs indefinitely; NTO never restarts the TuneD daemon or retries applying the profile.

Expected results:
When profile application takes too long, NTO should restart the TuneD daemon and retry applying the profile.

Additional info:
https://github.com/openshift/cluster-node-tuning-operator/pull/282

Comment 4 Jiří Mencák 2021-11-04 15:53:33 UTC

Fixed in 4.8.0-0.nightly-2021-11-03-171325 and above.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-11-03-171325   True        False         7h37m   Cluster version is 4.8.0-0.nightly-2021-11-03-171325

$ oc project openshift-cluster-node-tuning-operator

$ oc get po -o wide|grep worker-a
tuned-d6s6j                                    1/1     Running   0          7h51m   10.0.128.3    jmencak-hcp9p-worker-a-7rzvf.c.openshift-gce-devel.internal   <none>           <none>

$ oc label no jmencak-hcp9p-worker-a-7rzvf.c.openshift-gce-devel.internal profile=

$ cat stuck.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-stuck
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=OpenShift profile stuck
      [variables]
      v=${f:exec:sleep:72}
    name: openshift-profile-stuck
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-stuck

$ oc create -f stuck.yaml

$ oc logs tuned-d6s6j | tail -n17
I1104 15:45:46.249348    2398 tuned.go:542] reloading tuned...
I1104 15:45:46.249354    2398 tuned.go:545] sending HUP to PID 3628
2021-11-04 15:45:46,249 INFO     tuned.daemon.daemon: stopping tuning
2021-11-04 15:45:46,266 INFO     tuned.daemon.daemon: terminating Tuned, rolling back all changes
2021-11-04 15:45:46,313 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-04 15:45:46,314 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-04 15:45:46,314 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck
E1104 15:46:46.249925    2398 tuned.go:1128] timeout (60) to apply TuneD profile; restarting TuneD daemon
E1104 15:46:56.252435    2398 tuned.go:479] error waiting for tuned: signal: killed
I1104 15:46:56.252578    2398 tuned.go:429] starting tuned...
I1104 15:46:56.268933    2398 tuned.go:917] updated Profile jmencak-hcp9p-worker-a-7rzvf.c.openshift-gce-devel.internal stalld=<nil>, bootcmdline: 
I1104 15:46:56.269286    2398 tuned.go:416] written "/etc/tuned/recommend.d/50-openshift.conf" to set Tuned profile openshift-profile-stuck
2021-11-04 15:46:56,371 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-11-04 15:46:56,377 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-11-04 15:46:56,377 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-11-04 15:46:56,378 INFO     tuned.daemon.daemon: Using 'openshift-profile-stuck' profile
2021-11-04 15:46:56,379 INFO     tuned.profiles.loader: loading profile: openshift-profile-stuck

In 4.8, no exponential backoff was implemented. Profile application simply times out after 60 seconds,
the TuneD daemon is restarted, and the profile is applied again, as the log above shows.
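
For readers unfamiliar with the pattern, the timeout-and-restart behaviour can be sketched roughly as follows. This is an illustration only, not the NTO code from tuned.go. The function names run_with_timeout and stop, and the parameters grace_s and max_restarts, are hypothetical. The sketch also simplifies by treating "the command exited" as "the profile was applied", whereas the real TuneD daemon keeps running after a profile is applied. The 60-second timeout comes from the log above; the 10-second grace period is only an assumption based on the gap between the timeout message and the "signal: killed" message.

```python
import subprocess


def stop(proc, grace_s):
    """Ask the process to terminate; kill it if it is still alive after grace_s."""
    proc.terminate()
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


def run_with_timeout(cmd, timeout_s=60, grace_s=10, max_restarts=3):
    """Run cmd, restarting it whenever it fails to finish within timeout_s.

    Returns the number of restarts that were needed. Raises RuntimeError
    once max_restarts restarts have all timed out as well. There is no
    backoff between attempts, matching the 4.8 behaviour described above.
    The exit status of cmd is deliberately ignored in this sketch.
    """
    for attempt in range(max_restarts + 1):
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=timeout_s)
            return attempt
        except subprocess.TimeoutExpired:
            # The attempt is taking too long: stop it and start over.
            stop(proc, grace_s)
    raise RuntimeError(f"command still timing out after {max_restarts} restarts")
```

With the stuck.yaml profile above (sleep 72 against a 60-second timeout), a loop like this would time out and restart on every attempt, which is consistent with the repeated "loading profile" lines in the log.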

QE, please acknowledge the fix.

Comment 7 errata-xmlrpc 2021-11-16 21:22:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.20 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4574
