Bug 1920525 - NTO does not update when the tuned profile is updated.
Summary: NTO does not update when the tuned profile is updated.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Tuning Operator
Version: 4.6.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.z
Assignee: jmencak
QA Contact: Simon
URL:
Whiteboard:
Depends On: 1919970
Blocks:
Reported: 2021-01-26 13:40 UTC by jmencak
Modified: 2021-03-02 04:48 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: openshift-tuned does not handle failures to apply a Tuned profile. Consequence: When an invalid Tuned profile is created, the openshift-tuned supervisor process may ignore profile updates and fail to reload a valid profile. Fix: Keep state information about Tuned profile application success/failure. Result: openshift-tuned will recover from profile application failures on receiving new valid profiles.
Clone Of: 1919970
Environment:
Last Closed: 2021-03-02 04:48:10 UTC
Target Upstream Version:


Links
System                  ID                                               Private  Priority  Status  Summary                                              Last Updated
Github                  openshift cluster-node-tuning-operator pull 197  0        None      open    Bug 1920525: Recover gracefully after Tuned errors.  2021-02-16 21:14:24 UTC
Red Hat Product Errata  RHBA-2021:0634                                   0        None      None    None                                                 2021-03-02 04:48:28 UTC

Description jmencak 2021-01-26 13:40:09 UTC
+++ This bug was initially created as a clone of Bug #1919970 +++

Description of problem:
When the Tuned profile is updated, the Node Tuning Operator does not apply the changes in the profile.

Version-Release number of selected component (if applicable):

[root@dell-r640-028 performance]# oc version
Client Version: 4.7.0-fc.3
Server Version: 4.7.0-fc.3
Kubernetes Version: v1.20.0+d9c52c

How reproducible:
1. Set up OCP 4.7
2. Install and set up the Performance Addon Operator with the following PerformanceProfile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  additionalKernelArgs:
  - nosmt
  cpu:
    isolated: "2-3"
    reserved: "0-1"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      node: 0
      count: 1
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

3. Create a Tuned profile as shown below. (This profile disables stalld.)

[root@dell-r640-028 performance]# cat disable_stalld.yaml 
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch 
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-example-performanceprofile
      [service]
      service.stalld=stop,disable
    name: performance-patch 
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: performance-patch 

4. Wait for the above profile to be applied.

5. Modify the Tuned profile performance-patch: update the include mentioned in the Tuned profile.

The profile referenced by the include parameter in the above Tuned profile doesn't exist. The Tuned profile is therefore updated to specify the right profile.

[root@dell-r640-028 performance]# oc get Tuned
NAME                                     AGE
default                                  143m
openshift-node-performance-performance   25m
performance-patch                        28m
rendered                                 143m

6. Modify the Tuned profile to specify the right profile.

$ cat disable_stalld.yaml

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-performance
      [service]
      service.stalld=stop,disable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: performance-patch

[root@dell-r640-028 performance]# oc apply -f  disable_stalld.yaml 
tuned.tuned.openshift.io/performance-patch configured

Check for any changes in tuned.

[root@dell-r640-028 performance]# oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-674966bd95-dkltc   1/1     Running   0          66m
tuned-6d9v4                                     1/1     Running   0          146m
tuned-8j54t                                     1/1     Running   0          138m
tuned-8mh25                                     1/1     Running   0          146m
tuned-dp2gp                                     1/1     Running   0          138m
tuned-jqfw9                                     1/1     Running   0          138m
tuned-sgv76                                     1/1     Running   0          146m
[root@dell-r640-028 performance]# oc get mcp
NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-fd44a5696af050011856431fbb3b2c3b       True      False      False      3              3                   3                     0                      147m
worker       rendered-worker-0e4354cac64e3253ee87d7aeb3449782       True      False      False      1              1                   1                     0                      147m
worker-cnf   rendered-worker-cnf-dc8fe15e9eaa459be86d35da3d6c8701   True      False      False      2              2                   2                     0                      33m
[root@dell-r640-028 performance]# oc get mcp
NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-fd44a5696af050011856431fbb3b2c3b       True      False      False      3              3                   3                     0                      147m
worker       rendered-worker-0e4354cac64e3253ee87d7aeb3449782       True      False      False      1              1                   1                     0                      147m
worker-cnf   rendered-worker-cnf-dc8fe15e9eaa459be86d35da3d6c8701   True      False      False      2              2                   2                     0                      33m



Actual results:

Once the Tuned profile is modified, NTO doesn't seem to apply the changes.

Expected results:
NTO should apply the changes in the Tuned profile.

Additional info:
Logs:

I0125 13:06:21.489811    3534 tuned.go:462] sending HUP to PID 5550
2021-01-25 13:06:21,490 INFO     tuned.daemon.daemon: stopping tuning
2021-01-25 13:06:21,511 INFO     tuned.daemon.daemon: terminating Tuned, rolling back all changes
2021-01-25 13:06:21,524 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-01-25 13:06:21,524 INFO     tuned.daemon.daemon: Using 'performance-patch' profile
2021-01-25 13:06:21,525 INFO     tuned.profiles.loader: loading profile: performance-patch
2021-01-25 13:06:21,525 ERROR    tuned.daemon.controller: Failed to reload Tuned: Cannot load profile(s) 'performance-patch': Cannot find profile 'openshift-node-performance-example-performanceprofile' in '['/etc/tuned', '/usr/lib/tuned']'.
I0125 13:09:33.689001    3534 tuned.go:291] extracting Tuned profiles
I0125 13:09:33.848530    3534 tuned.go:325] recommended Tuned profile performance-patch content unchanged
2021-01-25 13:16:13,332 INFO     tuned.daemon.controller: terminating controller
E0125 13:17:13.785882    4556 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Profile: failed to list *v1.Profile: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
E0125 13:17:13.785882    4556 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Tuned: failed to list *v1.Tuned: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
I0125 13:17:15.089012    4556 tuned.go:274] disabling system tuned...
I0125 13:17:15.433528    4556 tuned.go:852] started events processor
I0125 13:17:15.434670    4556 tuned.go:895] started controller
I0125 13:17:15.435280    4556 tuned.go:369] written "/etc/tuned/recommend.d/50-openshift.conf" to set Tuned profile performance-patch
I0125 13:17:15.435352    4556 tuned.go:291] extracting Tuned profiles
I0125 13:17:15.675014    4556 tuned.go:325] recommended Tuned profile performance-patch content changed
I0125 13:17:16.594504    4556 tuned.go:595] active profile () != recommended profile (performance-patch)
I0125 13:17:16.594601    4556 tuned.go:382] starting tuned...
2021-01-25 13:17:16,752 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-01-25 13:17:16,762 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-01-25 13:17:16,762 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-01-25 13:17:16,763 INFO     tuned.daemon.daemon: Using 'performance-patch' profile
2021-01-25 13:17:16,764 INFO     tuned.profiles.loader: loading profile: performance-patch
2021-01-25 13:17:16,765 ERROR    tuned.daemon.daemon: Cannot set initial profile. No tunings will be enabled: Cannot load profile(s) 'performance-patch': Cannot find profile 'openshift-node-performance-example-performanceprofile' in '['/etc/tuned', '/usr/lib/tuned']'.
2021-01-25 13:17:16,766 INFO     tuned.daemon.controller: starting controller
I0125 13:37:48.121788    4556 tuned.go:291] extracting Tuned profiles
I0125 13:37:48.282621    4556 tuned.go:325] recommended Tuned profile performance-patch content changed
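
Note on reading the log: the two ERROR lines above name both the profile that failed to load and the include it could not resolve. A minimal Python sketch for pulling those pairs out of a Tuned pod log is given here. The regular expression is an assumption based only on the message format seen in this log, and `find_load_errors` is a hypothetical helper, not part of NTO or TuneD.

```python
import re
from typing import List, Tuple

# Pattern assumed from the ERROR lines in this log; other TuneD versions
# may word the message differently.
LOAD_ERROR = re.compile(
    r"Cannot load profile\(s\) '(?P<profile>[^']+)': "
    r"Cannot find profile '(?P<missing>[^']+)'"
)


def find_load_errors(log_text: str) -> List[Tuple[str, str]]:
    """Return (profile, missing_include) pairs for each load failure found."""
    return [(m["profile"], m["missing"]) for m in LOAD_ERROR.finditer(log_text)]
```

Feeding it the full pod log should yield one pair per failed load, which makes it easy to see whether the daemon ever retried after the profile was corrected.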

--- Additional comment from Niranjan Mallapadi Raghavender on 2021-01-25 13:56:16 UTC ---

A workaround is to delete the NTO pods running on the worker-cnf nodes; the updated Tuned profile then gets applied.

--- Additional comment from Niranjan Mallapadi Raghavender on 2021-01-25 14:03:27 UTC ---



--- Additional comment from Yanir Quinn on 2021-01-25 14:10:30 UTC ---

Another way of dealing with it is deleting the tuned CR and recreating it properly.

--- Additional comment from  on 2021-01-25 14:27:17 UTC ---

From the Tuned Pod logs I can see you're missing the `openshift-node-performance-example-performanceprofile` profile.  It also doesn't show in your `oc get Tuned` output.  Is it created before you instantiate disable_stalld.yaml ?
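
The check described in this comment (does the `include=` target exist among the known profiles?) can be sketched in Python as below. This is only an illustration under assumptions: `missing_includes` is a hypothetical helper, the include value is treated as a comma-separated list of names, and the set of known profiles is taken to be the names shown by `oc get Tuned`, which ignores the built-in profiles under /usr/lib/tuned.

```python
import configparser
from typing import Iterable, List


def missing_includes(profile_data: str, known_profiles: Iterable[str]) -> List[str]:
    """Return names listed in [main] include= that are not in known_profiles."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(profile_data)
    # fallback="" covers both a missing [main] section and a missing option.
    raw = parser.get("main", "include", fallback="")
    # Assumption: the value is a comma-separated list of profile names.
    names = [n.strip() for n in raw.split(",") if n.strip()]
    known = set(known_profiles)
    return [n for n in names if n not in known]
```

Run against the `data` block of the first disable_stalld.yaml and the four names from the `oc get Tuned` listing, this would report the example-performanceprofile name as missing; with the corrected include it reports nothing.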

--- Additional comment from  on 2021-01-25 14:50:56 UTC ---

OK, I think I know what you mean now and this is a known issue.
It is planned to be fixed in 4.8 and the fix is already included here: https://github.com/openshift/cluster-node-tuning-operator/pull/188

--- Additional comment from Niranjan Mallapadi Raghavender on 2021-01-25 15:01:20 UTC ---

> From the Tuned Pod logs I can see you're missing the `openshift-node-performance-example-performanceprofile` profile.  It also doesn't show in your `oc get Tuned` output.  Is it created before you instantiate disable_stalld.yaml ?

Yes, openshift-node-performance-example-performanceprofile is missing, so we modified the Tuned profile to provide the right profile. But after updating the profile, NTO still doesn't get updated.

--- Additional comment from  on 2021-01-25 15:11:10 UTC ---

(In reply to Niranjan Mallapadi Raghavender from comment #6)
> Yes openshift-node-performance-example-performanceprofile is missing , So we
> modified the tuned profile to provide the right profile. But after updating
> the profile. 
> NTO still doesn't get updated.

Understood and thanks for the clarification.  This is a known issue which I was planning
to address in 4.8 with the PR I mentioned above.  It might be worth, however, backporting
part of this PR to address the issue in 4.7 (and maybe even earlier) already.  Thank you.
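
For readers following the fix: the Doc Text of this bug describes it as keeping state about Tuned profile application success/failure. A minimal Python sketch of that idea is given here. It is an illustration under assumptions, not the Go code from the pull request; `ApplyState`, `needs_reload` and `record_result` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ApplyState:
    """Hypothetical record of the last profile application attempt."""
    active_profile: str = ""
    last_apply_failed: bool = False


def needs_reload(state: ApplyState, recommended: str, content_changed: bool) -> bool:
    """Return True when the tuning daemon should be (re)loaded."""
    if state.last_apply_failed:
        # The daemon is not running the recommended profile, so retry
        # even if the profile name itself did not change.
        return True
    return content_changed or state.active_profile != recommended


def record_result(state: ApplyState, recommended: str, succeeded: bool) -> None:
    """Remember the outcome so the next update can act on it."""
    state.last_apply_failed = not succeeded
    if succeeded:
        state.active_profile = recommended
```

With state like this, a failed load of performance-patch leaves `last_apply_failed` set, so the next profile update triggers a reload instead of being skipped because the recommended profile name is unchanged.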

Comment 4 Simon 2021-02-24 17:24:03 UTC
Cluster version: 4.6.0-0.nightly-2021-02-22-141201

Comment 6 errata-xmlrpc 2021-03-02 04:48:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.19 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0634

