Bug 1919970

Summary:

NTO does not update when the tuned profile is updated.

Product:

OpenShift Container Platform

Reporter:

Niranjan Mallapadi Raghavender <mniranja>

Component:

Node Tuning Operator

Assignee:

Jiří Mencák <jmencak>

Status:

CLOSED ERRATA

QA Contact:

Simon <skordas>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

4.7

CC:

grajaiya, kquinn, sejug, yquinn

Target Milestone:

---

Target Release:

4.7.0

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Cause: openshift-tuned does not handle failures to apply a Tuned profile. Consequence: When an invalid Tuned profile is created, the openshift-tuned supervisor process may ignore future profile updates (and fail to apply the updated profile). Fix: Keep state information about Tuned profile application success or failure. Result: openshift-tuned will recover from profile application failures on receiving new valid profiles.

Story Points:

---

Clone Of:

Clones:

1920525

Environment:

Last Closed:

2021-02-24 15:55:53 UTC

Type:

Bug

Bug Blocks:

1920525

Attachments:

Description	Flags
NTO logs from pods running on worker-cnf node.	none

Description Niranjan Mallapadi Raghavender 2021-01-25 13:41:21 UTC

Description of problem:
When a Tuned profile is updated, the Node Tuning Operator does not pick up and apply the changes in the profile.

Version-Release number of selected component (if applicable):

[root@dell-r640-028 performance]# oc version
Client Version: 4.7.0-fc.3
Server Version: 4.7.0-fc.3
Kubernetes Version: v1.20.0+d9c52c

How reproducible:
1. Set up OCP 4.7
2. Install and set up the Performance Addon Operator:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  additionalKernelArgs:
  - nosmt
  cpu:
    isolated: "2-3"
    reserved: "0-1"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      node: 0
      count: 1
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

3. Create a Tuned profile as shown below. (In this profile we are disabling stalld.)

[root@dell-r640-028 performance]# cat disable_stalld.yaml 
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch 
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-example-performanceprofile
      [service]
      service.stalld=stop,disable
    name: performance-patch 
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: performance-patch 

4. Wait until the above profile is applied.

5. Modify the Tuned profile performance-patch: update the include value in the profile.

The profile named in the include parameter above does not exist, so the profile needs to be updated to specify the right Tuned profile.

[root@dell-r640-028 performance]# oc get Tuned
NAME                                     AGE
default                                  143m
openshift-node-performance-performance   25m
performance-patch                        28m
rendered                                 143m

6. Modify the Tuned profile to specify the right profile.

$ cat disable_stalld.yaml

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-performance
      [service]
      service.stalld=stop,disable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: performance-patch

[root@dell-r640-028 performance]# oc apply -f  disable_stalld.yaml 
tuned.tuned.openshift.io/performance-patch configured

Check for any changes in tuned.

[root@dell-r640-028 performance]# oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-674966bd95-dkltc   1/1     Running   0          66m
tuned-6d9v4                                     1/1     Running   0          146m
tuned-8j54t                                     1/1     Running   0          138m
tuned-8mh25                                     1/1     Running   0          146m
tuned-dp2gp                                     1/1     Running   0          138m
tuned-jqfw9                                     1/1     Running   0          138m
tuned-sgv76                                     1/1     Running   0          146m
[root@dell-r640-028 performance]# oc get mcp
NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-fd44a5696af050011856431fbb3b2c3b       True      False      False      3              3                   3                     0                      147m
worker       rendered-worker-0e4354cac64e3253ee87d7aeb3449782       True      False      False      1              1                   1                     0                      147m
worker-cnf   rendered-worker-cnf-dc8fe15e9eaa459be86d35da3d6c8701   True      False      False      2              2                   2                     0                      33m
[root@dell-r640-028 performance]# oc get mcp
NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-fd44a5696af050011856431fbb3b2c3b       True      False      False      3              3                   3                     0                      147m
worker       rendered-worker-0e4354cac64e3253ee87d7aeb3449782       True      False      False      1              1                   1                     0                      147m
worker-cnf   rendered-worker-cnf-dc8fe15e9eaa459be86d35da3d6c8701   True      False      False      2              2                   2                     0                      33m



Actual results:

Once the Tuned profile is modified, NTO doesn't seem to apply the changes.

Expected results:
NTO should update the changes in Tuned profile.

Additional info:
Logs:

I0125 13:06:21.489811    3534 tuned.go:462] sending HUP to PID 5550
2021-01-25 13:06:21,490 INFO     tuned.daemon.daemon: stopping tuning
2021-01-25 13:06:21,511 INFO     tuned.daemon.daemon: terminating Tuned, rolling back all changes
2021-01-25 13:06:21,524 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-01-25 13:06:21,524 INFO     tuned.daemon.daemon: Using 'performance-patch' profile
2021-01-25 13:06:21,525 INFO     tuned.profiles.loader: loading profile: performance-patch
2021-01-25 13:06:21,525 ERROR    tuned.daemon.controller: Failed to reload Tuned: Cannot load profile(s) 'performance-patch': Cannot find profile 'openshift-node-performance-example-performanceprofile' in '['/etc/tuned', '/usr/lib/tuned']'.
I0125 13:09:33.689001    3534 tuned.go:291] extracting Tuned profiles
I0125 13:09:33.848530    3534 tuned.go:325] recommended Tuned profile performance-patch content unchanged
2021-01-25 13:16:13,332 INFO     tuned.daemon.controller: terminating controller
E0125 13:17:13.785882    4556 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Profile: failed to list *v1.Profile: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
E0125 13:17:13.785882    4556 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Tuned: failed to list *v1.Tuned: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
I0125 13:17:15.089012    4556 tuned.go:274] disabling system tuned...
I0125 13:17:15.433528    4556 tuned.go:852] started events processor
I0125 13:17:15.434670    4556 tuned.go:895] started controller
I0125 13:17:15.435280    4556 tuned.go:369] written "/etc/tuned/recommend.d/50-openshift.conf" to set Tuned profile performance-patch
I0125 13:17:15.435352    4556 tuned.go:291] extracting Tuned profiles
I0125 13:17:15.675014    4556 tuned.go:325] recommended Tuned profile performance-patch content changed
I0125 13:17:16.594504    4556 tuned.go:595] active profile () != recommended profile (performance-patch)
I0125 13:17:16.594601    4556 tuned.go:382] starting tuned...
2021-01-25 13:17:16,752 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2021-01-25 13:17:16,762 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-01-25 13:17:16,762 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-01-25 13:17:16,763 INFO     tuned.daemon.daemon: Using 'performance-patch' profile
2021-01-25 13:17:16,764 INFO     tuned.profiles.loader: loading profile: performance-patch
2021-01-25 13:17:16,765 ERROR    tuned.daemon.daemon: Cannot set initial profile. No tunings will be enabled: Cannot load profile(s) 'performance-patch': Cannot find profile 'openshift-node-performance-example-performanceprofile' in '['/etc/tuned', '/usr/lib/tuned']'.
2021-01-25 13:17:16,766 INFO     tuned.daemon.controller: starting controller
I0125 13:37:48.121788    4556 tuned.go:291] extracting Tuned profiles
I0125 13:37:48.282621    4556 tuned.go:325] recommended Tuned profile performance-patch content changed
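
The failure in the log above shows up as the `Cannot find profile` error from the tuned daemon. As a minimal sketch, the missing profile names could be pulled out of a saved pod log like this. The function name `missing_tuned_profiles` is made up for illustration, and the pattern assumes the error wording matches the lines above.

```shell
# Hypothetical helper (not part of NTO): read tuned pod log text on stdin and
# print each profile name that the daemon reported it could not find.
missing_tuned_profiles() {
  sed -n "s/.*Cannot find profile '\([^']*\)'.*/\1/p" | sort -u
}

# Example usage (assumes $pod holds the name of a tuned pod):
#   oc logs "$pod" -n openshift-cluster-node-tuning-operator | missing_tuned_profiles
```

An empty result only means that this particular error text was not found; it does not prove the profile was applied.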

Comment 1 Niranjan Mallapadi Raghavender 2021-01-25 13:56:16 UTC

The workaround is to delete the NTO pods running on the worker-cnf nodes; the updated Tuned profile then gets applied.
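
A rough sketch of that workaround, assuming the per-node pods are the `tuned-*` pods in the `openshift-cluster-node-tuning-operator` namespace (as in the `oc get pods` output above) and that their daemonset recreates them after deletion. The function name `restart_tuned_pods_for_role` is hypothetical.

```shell
# Sketch only: delete the tuned-* pods on every node that carries the given
# role label, so that fresh pods start and pick up the current Tuned profile.
NTO_NS=openshift-cluster-node-tuning-operator

restart_tuned_pods_for_role() (
  role=$1
  # Nodes carrying the label node-role.kubernetes.io/<role>
  for node in $(oc get nodes -l "node-role.kubernetes.io/${role}" -o name); do
    node=${node#node/}
    # Pods scheduled on that node; only the tuned-* pods are deleted
    for pod in $(oc get pods -n "$NTO_NS" -o name --field-selector "spec.nodeName=${node}"); do
      case $pod in
        pod/tuned-*) oc delete -n "$NTO_NS" "$pod" ;;
      esac
    done
  done
)

# Example usage:
#   restart_tuned_pods_for_role worker-cnf
```

The function body runs in a subshell so that it does not overwrite variables such as `$node` or `$pod` in the calling shell.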

Comment 2 Niranjan Mallapadi Raghavender 2021-01-25 14:03:27 UTC

Created attachment 1750525
NTO logs from pods running on worker-cnf node.

Comment 3 Yanir Quinn 2021-01-25 14:10:30 UTC

Another way of dealing with it is deleting the tuned CR and recreating it properly.
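
As a sketch of this second workaround, assuming the corrected manifest is the `disable_stalld.yaml` file from the description; the function name `recreate_tuned_cr` is made up for illustration.

```shell
# Sketch only: delete the Tuned CR and create it again from a corrected manifest.
recreate_tuned_cr() (
  name=$1
  manifest=$2
  ns=openshift-cluster-node-tuning-operator
  # --ignore-not-found keeps the delete from failing if the CR is already gone
  oc delete tuned "$name" -n "$ns" --ignore-not-found &&
    oc create -f "$manifest"
)

# Example usage:
#   recreate_tuned_cr performance-patch disable_stalld.yaml
```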

Comment 4 Jiří Mencák 2021-01-25 14:27:17 UTC

From the Tuned Pod logs I can see you're missing the `openshift-node-performance-example-performanceprofile` profile.  It also doesn't show in your `oc get Tuned` output.  Is it created before you instantiate disable_stalld.yaml ?

Comment 5 Jiří Mencák 2021-01-25 14:50:56 UTC

OK, I think I know what you mean now and this is a known issue.
It is planned to be fixed in 4.8 and the fix is already included here: https://github.com/openshift/cluster-node-tuning-operator/pull/188

Comment 6 Niranjan Mallapadi Raghavender 2021-01-25 15:01:20 UTC

> From the Tuned Pod logs I can see you're missing the `openshift-node-performance-example-performanceprofile` profile.  It also doesn't show in your `oc get Tuned` output.  Is it created before you instantiate disable_stalld.yaml ?

Yes, openshift-node-performance-example-performanceprofile is missing, so we modified the Tuned profile to provide the right profile. But after updating the profile, NTO still doesn't get updated.

Comment 7 Jiří Mencák 2021-01-25 15:11:10 UTC

(In reply to Niranjan Mallapadi Raghavender from comment #6)
> Yes openshift-node-performance-example-performanceprofile is missing , So we
> modified the tuned profile to provide the right profile. But after updating
> the profile. 
> NTO still doesn't get updated.

Understood and thanks for the clarification.  This is a known issue which I was planning
to address in 4.8 with the PR I mentioned above.  It might be worth, however, backporting
part of this PR to address the issue in 4.7 (and maybe even earlier) already.  Thank you.

Comment 9 Simon 2021-01-29 19:06:27 UTC

Cluster version: 4.7.0-0.nightly-2021-01-29-094746

# Get worker node and NTO pod on this node
node=$(oc get nodes | grep -m 1 worker | cut -f 1 -d ' ') && echo $node
pod=$(oc get pods -n openshift-cluster-node-tuning-operator -o wide | grep $node | cut -d ' ' -f 1) && echo $pod

# label the node:
oc label node $node node-role.kubernetes.io/worker-cnf=

# Log in to the web console
# Operators -> Operator Hub -> Performance Addon Operator -> Install

# Adding performance profile:

oc create -f- <<EOF
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: performance
  namespace: openshift-operators
spec:
  additionalKernelArgs:
  - nosmt
  cpu:
    isolated: "1"
    reserved: "0-1"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      node: 0
      count: 1
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
EOF

# New tuned is created
# Create and wait for mcp:

oc create -f- <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-cnf
  labels:
    worker-cnf: ""
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-cnf]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-cnf: ""
EOF

# Check tuned profile on worker-cnf node
oc get profiles $node -n openshift-cluster-node-tuning-operator -o json | jq ".spec.config.tunedProfile"
"openshift-node-performance-performance"

# Check logs
oc logs $pod
2021-01-29 17:38:10,226 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-performance-performance' applied

# Check node - the new openshift-node-performance-performance profile sets vm.stat_interval = 10:
oc debug node/$node -- chroot /host sysctl vm.stat_interval
Starting pod/skordas129-smst4-worker-a-92ll8copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
vm.stat_interval = 10

# Create the performance-patch Tuned, including a nonexistent profile

oc create -f- <<EOF
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch 
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-example-performanceprofile
      [service]
      service.stalld=stop,disable
    name: performance-patch 
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: performance-patch
EOF

# Check the tuned profile once again - new profile
oc get profiles $node -n openshift-cluster-node-tuning-operator -o json | jq ".spec.config.tunedProfile"
"performance-patch"

# Check logs (missing profile as expected)
oc logs $pod
2021-01-29 17:43:21,554 ERROR    tuned.daemon.daemon: Cannot set initial profile. No tunings will be enabled: Cannot load profile(s) 'performance-patch': Cannot find profile 'openshift-node-performance-example-performanceprofile' in '['/etc/tuned', '/usr/lib/tuned']'.
2021-01-29 17:43:21,554 INFO     tuned.daemon.controller: starting controller

# Check node (here is 1 not 10 like previously, because profile openshift-node-performance-performance is not included)
oc debug node/$node -- chroot /host sysctl vm.stat_interval
Starting pod/skordas129-smst4-worker-a-92ll8copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
vm.stat_interval = 1

# Update performance-patch profile, including correct profile:
oc edit tuned performance-patch
include=openshift-node-performance-example-performanceprofile -> include=openshift-node-performance-performance

# Check tuned profile on worker-cnf node
oc get profiles $node -n openshift-cluster-node-tuning-operator -o json | jq ".spec.config.tunedProfile"
"performance-patch"

# Check logs once again
oc logs $pod
2021-01-29 18:56:39,999 INFO     tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
2021-01-29 18:56:39,999 INFO     tuned.plugins.plugin_bootloader: cannot find grub.cfg to patch
2021-01-29 18:56:40,001 INFO     tuned.daemon.daemon: static tuning from profile 'performance-patch' applied

# Check value on node - value was included
oc debug node/$node -- chroot /host sysctl vm.stat_interval
Starting pod/skordas129-smst4-worker-a-92ll8copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
vm.stat_interval = 10
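
The verification above compares `vm.stat_interval` at three points (10, then 1, then 10 again). A small sketch for extracting that value from the command output so it can be compared in a script; the helper name `sysctl_value` is hypothetical and it assumes the `key = value` output format shown above.

```shell
# Hypothetical helper: read sysctl-style output on stdin and print the value
# of the requested key, skipping the extra lines that `oc debug` prints.
sysctl_value() {
  awk -F ' = ' -v k="$1" '$1 == k { print $2 }'
}

# Example usage (assumes $node is set as in the steps above):
#   oc debug node/$node -- chroot /host sysctl vm.stat_interval | sysctl_value vm.stat_interval
```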

Comment 12 errata-xmlrpc 2021-02-24 15:55:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633