Bug 1919970
| Summary: | NTO does not update when the tuned profile is updated. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Niranjan Mallapadi Raghavender <mniranja> | ||||
| Component: | Node Tuning Operator | Assignee: | Jiří Mencák <jmencak> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Simon <skordas> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 4.7 | CC: | grajaiya, kquinn, sejug, yquinn | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.7.0 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: |
Cause:
openshift-tuned does not handle failures to apply a Tuned profile.
Consequence:
When an invalid Tuned profile is created, the openshift-tuned supervisor process may ignore future profile updates (and fail to apply the updated profile).
Fix:
Keep state information about Tuned profile application success or failure.
Result:
openshift-tuned will recover from profile application failures on receiving new valid profiles.
|
Story Points: | --- | ||||
| Clone Of: | |||||||
| : | 1920525 (view as bug list) | Environment: | |||||
| Last Closed: | 2021-02-24 15:55:53 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1920525 | ||||||
| Attachments: |
|
||||||
To work around the issue, delete the NTO pods running on the worker-cnf nodes; the updated Tuned profile then gets applied.

Created attachment 1750525
NTO logs from pods running on worker-cnf node.
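The workaround above could be scripted roughly as follows. This is only a sketch: the function name `restart_tuned_on_worker_cnf` is hypothetical, the namespace and node label are the ones used in this report, and it deletes every pod in the NTO namespace that is scheduled on a worker-cnf node, relying on the pods being recreated automatically.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of NTO: delete the pods in the NTO namespace
# that run on worker-cnf nodes so they are recreated and pick up the
# updated Tuned profile.
restart_tuned_on_worker_cnf() {
    local ns="openshift-cluster-node-tuning-operator"
    local node pod
    # Nodes carrying the worker-cnf role label used in this report.
    for node in $(oc get nodes -l 'node-role.kubernetes.io/worker-cnf' -o name); do
        node="${node#node/}"   # "node/foo" -> "foo"
        # Pods in the NTO namespace scheduled on that node.
        for pod in $(oc get pods -n "$ns" --field-selector "spec.nodeName=${node}" -o name); do
            oc delete -n "$ns" "$pod"
        done
    done
}
```

After sourcing the file, calling `restart_tuned_on_worker_cnf` with a logged-in `oc` session would perform the deletion; review the pod list first on a production cluster.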
Another way of dealing with it is to delete the Tuned CR and recreate it properly.

From the Tuned Pod logs I can see you're missing the `openshift-node-performance-example-performanceprofile` profile. It also doesn't show in your `oc get Tuned` output. Is it created before you instantiate disable_stalld.yaml?

OK, I think I know what you mean now, and this is a known issue. It is planned to be fixed in 4.8 and the fix is already included here: https://github.com/openshift/cluster-node-tuning-operator/pull/188

> From the Tuned Pod logs I can see you're missing the
> `openshift-node-performance-example-performanceprofile` profile. It also
> doesn't show in your `oc get Tuned` output. Is it created before you
> instantiate disable_stalld.yaml ?

Yes, openshift-node-performance-example-performanceprofile is missing, so we modified the Tuned profile to provide the right profile. But after updating the profile, NTO still doesn't get updated.

(In reply to Niranjan Mallapadi Raghavender from comment #6)
> Yes openshift-node-performance-example-performanceprofile is missing , So we
> modified the tuned profile to provide the right profile. But after updating
> the profile.
> NTO still doesn't get updated.

Understood, and thanks for the clarification. This is a known issue which I was planning to address in 4.8 with the PR I mentioned above. It might be worth, however, backporting part of this PR to address the issue in 4.7 (and maybe even earlier) already.

Thank you.

Cluster version: 4.7.0-0.nightly-2021-01-29-094746
# Get worker node and NTO pod on this node
node=$(oc get nodes | grep -m 1 worker | cut -f 1 -d ' ') && echo $node
pod=$(oc get pods -n openshift-cluster-node-tuning-operator -o wide | grep $node | cut -d ' ' -f 1) && echo $pod
# label the node:
oc label node $node node-role.kubernetes.io/worker-cnf=
# Log in to the web console
# Operators -> Operator Hub -> Performance Addon Operator -> Install
# Adding performance profile:
oc create -f- <<EOF
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: performance
  namespace: openshift-operators
spec:
  additionalKernelArgs:
  - nosmt
  cpu:
    isolated: "1"
    reserved: "0-1"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      node: 0
      count: 1
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
EOF
# New tuned is created
# Create and wait for mcp:
oc create -f- <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-cnf
  labels:
    worker-cnf: ""
spec:
  machineConfigSelector:
    matchExpressions:
    - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-cnf]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-cnf: ""
EOF
# Check tuned profile on worker-cnf node
oc get profiles $node -n openshift-cluster-node-tuning-operator -o json | jq ".spec.config.tunedProfile"
"openshift-node-performance-performance"
# Check logs
oc logs $pod
2021-01-29 17:38:10,226 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node-performance-performance' applied
# Check the node: the new openshift-node-performance-performance profile sets vm.stat_interval = 10
oc debug node/$node -- chroot /host sysctl vm.stat_interval
Starting pod/skordas129-smst4-worker-a-92ll8copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
vm.stat_interval = 10
# Create the performance-patch Tuned, which includes a non-existent profile
oc create -f- <<EOF
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-example-performanceprofile
      [service]
      service.stalld=stop,disable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: performance-patch
EOF
# Check the tuned profile once again: the new profile is recommended
oc get profiles $node -n openshift-cluster-node-tuning-operator -o json | jq ".spec.config.tunedProfile"
"performance-patch"
# Check logs (missing profile as expected)
oc logs $pod
2021-01-29 17:43:21,554 ERROR tuned.daemon.daemon: Cannot set initial profile. No tunings will be enabled: Cannot load profile(s) 'performance-patch': Cannot find profile 'openshift-node-performance-example-performanceprofile' in '['/etc/tuned', '/usr/lib/tuned']'.
2021-01-29 17:43:21,554 INFO tuned.daemon.controller: starting controller
# Check the node (the value is now 1, not 10 as before, because the openshift-node-performance-performance profile is not included)
oc debug node/$node -- chroot /host sysctl vm.stat_interval
Starting pod/skordas129-smst4-worker-a-92ll8copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
vm.stat_interval = 1
# Update the performance-patch profile so that it includes the correct profile:
oc edit tuned performance-patch
# change: include=openshift-node-performance-example-performanceprofile -> include=openshift-node-performance-performance
# Check tuned profile on worker-cnf node
oc get profiles $node -n openshift-cluster-node-tuning-operator -o json | jq ".spec.config.tunedProfile"
"performance-patch"
# Check logs once again
oc logs $pod
2021-01-29 18:56:39,999 INFO tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
2021-01-29 18:56:39,999 INFO tuned.plugins.plugin_bootloader: cannot find grub.cfg to patch
2021-01-29 18:56:40,001 INFO tuned.daemon.daemon: static tuning from profile 'performance-patch' applied
# Check the value on the node: the setting from the included profile is applied again
oc debug node/$node -- chroot /host sysctl vm.stat_interval
Starting pod/skordas129-smst4-worker-a-92ll8copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
vm.stat_interval = 10
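The manual `sysctl` checks in the verification steps above could be automated along these lines. This is a minimal sketch; the helper names `sysctl_value` and `check_sysctl` are hypothetical, and the parsing assumes the `key = value` output format shown in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the output of
#   oc debug node/<node> -- chroot /host sysctl <key>
# on stdin and print the value of <key>, skipping the informational
# "Starting pod/..." and "To use host binaries..." lines.
sysctl_value() {
    local key="$1"
    awk -F' = ' -v key="$key" '$1 == key { print $2 }'
}

# Compare the value found on stdin with the expected value.
# Returns non-zero on a mismatch or when the key is not found.
check_sysctl() {
    local key="$1" expected="$2" actual
    actual="$(sysctl_value "$key")"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $key = $actual"
    else
        echo "MISMATCH: $key = ${actual:-<not found>} (expected $expected)"
        return 1
    fi
}
```

For example, `oc debug node/$node -- chroot /host sysctl vm.stat_interval | check_sysctl vm.stat_interval 10` would report whether the value from the included profile is in effect.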
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
Description of problem:
When the Tuned profile is updated, the Node Tuning Operator does not apply the changes in the profile.

Version-Release number of selected component (if applicable):
[root@dell-r640-028 performance]# oc version
Client Version: 4.7.0-fc.3
Server Version: 4.7.0-fc.3
Kubernetes Version: v1.20.0+d9c52c

How reproducible:

1. Set up OCP 4.7.

2. Install and set up the Performance Addon Operator:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  additionalKernelArgs:
  - nosmt
  cpu:
    isolated: "2-3"
    reserved: "0-1"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      node: 0
      count: 1
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

3. Create a Tuned profile as shown below. (In this profile we are disabling stalld.)

[root@dell-r640-028 performance]# cat disable_stalld.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-example-performanceprofile
      [service]
      service.stalld=stop,disable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: performance-patch

4. Wait until the above profile is applied.

5. Modify the Tuned profile performance-patch and update its include line. The profile named in the include parameter above does not exist, so the Tuned profile has to be updated to specify the right profile.

[root@dell-r640-028 performance]# oc get Tuned
NAME                                     AGE
default                                  143m
openshift-node-performance-performance   25m
performance-patch                        28m
rendered                                 143m

6. Modified the Tuned profile to specify the right profile:

$ cat disable_stalld.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-performance
      [service]
      service.stalld=stop,disable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: performance-patch

[root@dell-r640-028 performance]# oc apply -f disable_stalld.yaml
tuned.tuned.openshift.io/performance-patch configured

Check for any changes in tuned:

[root@dell-r640-028 performance]# oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-674966bd95-dkltc   1/1     Running   0          66m
tuned-6d9v4                                     1/1     Running   0          146m
tuned-8j54t                                     1/1     Running   0          138m
tuned-8mh25                                     1/1     Running   0          146m
tuned-dp2gp                                     1/1     Running   0          138m
tuned-jqfw9                                     1/1     Running   0          138m
tuned-sgv76                                     1/1     Running   0          146m

[root@dell-r640-028 performance]# oc get mcp
NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-fd44a5696af050011856431fbb3b2c3b       True      False      False      3              3                   3                     0                      147m
worker       rendered-worker-0e4354cac64e3253ee87d7aeb3449782       True      False      False      1              1                   1                     0                      147m
worker-cnf   rendered-worker-cnf-dc8fe15e9eaa459be86d35da3d6c8701   True      False      False      2              2                   2                     0                      33m

[root@dell-r640-028 performance]# oc get mcp
NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-fd44a5696af050011856431fbb3b2c3b       True      False      False      3              3                   3                     0                      147m
worker       rendered-worker-0e4354cac64e3253ee87d7aeb3449782       True      False      False      1              1                   1                     0                      147m
worker-cnf   rendered-worker-cnf-dc8fe15e9eaa459be86d35da3d6c8701   True      False      False      2              2                   2                     0                      33m

Actual results:
Once the Tuned profile is modified, NTO doesn't seem to apply the changes.

Expected results:
NTO should apply the changes in the Tuned profile.
Additional info:
Logs:
I0125 13:06:21.489811 3534 tuned.go:462] sending HUP to PID 5550
2021-01-25 13:06:21,490 INFO tuned.daemon.daemon: stopping tuning
2021-01-25 13:06:21,511 INFO tuned.daemon.daemon: terminating Tuned, rolling back all changes
2021-01-25 13:06:21,524 INFO tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-01-25 13:06:21,524 INFO tuned.daemon.daemon: Using 'performance-patch' profile
2021-01-25 13:06:21,525 INFO tuned.profiles.loader: loading profile: performance-patch
2021-01-25 13:06:21,525 ERROR tuned.daemon.controller: Failed to reload Tuned: Cannot load profile(s) 'performance-patch': Cannot find profile 'openshift-node-performance-example-performanceprofile' in '['/etc/tuned', '/usr/lib/tuned']'.
I0125 13:09:33.689001 3534 tuned.go:291] extracting Tuned profiles
I0125 13:09:33.848530 3534 tuned.go:325] recommended Tuned profile performance-patch content unchanged
2021-01-25 13:16:13,332 INFO tuned.daemon.controller: terminating controller
E0125 13:17:13.785882 4556 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Profile: failed to list *v1.Profile: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
E0125 13:17:13.785882 4556 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Tuned: failed to list *v1.Tuned: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
I0125 13:17:15.089012 4556 tuned.go:274] disabling system tuned...
I0125 13:17:15.433528 4556 tuned.go:852] started events processor
I0125 13:17:15.434670 4556 tuned.go:895] started controller
I0125 13:17:15.435280 4556 tuned.go:369] written "/etc/tuned/recommend.d/50-openshift.conf" to set Tuned profile performance-patch
I0125 13:17:15.435352 4556 tuned.go:291] extracting Tuned profiles
I0125 13:17:15.675014 4556 tuned.go:325] recommended Tuned profile performance-patch content changed
I0125 13:17:16.594504 4556 tuned.go:595] active profile () != recommended profile (performance-patch)
I0125 13:17:16.594601 4556 tuned.go:382] starting tuned...
2021-01-25 13:17:16,752 INFO tuned.daemon.application: dynamic tuning is globally disabled
2021-01-25 13:17:16,762 INFO tuned.daemon.daemon: using sleep interval of 1 second(s)
2021-01-25 13:17:16,762 INFO tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2021-01-25 13:17:16,763 INFO tuned.daemon.daemon: Using 'performance-patch' profile
2021-01-25 13:17:16,764 INFO tuned.profiles.loader: loading profile: performance-patch
2021-01-25 13:17:16,765 ERROR tuned.daemon.daemon: Cannot set initial profile. No tunings will be enabled: Cannot load profile(s) 'performance-patch': Cannot find profile 'openshift-node-performance-example-performanceprofile' in '['/etc/tuned', '/usr/lib/tuned']'.
2021-01-25 13:17:16,766 INFO tuned.daemon.controller: starting controller
I0125 13:37:48.121788 4556 tuned.go:291] extracting Tuned profiles
I0125 13:37:48.282621 4556 tuned.go:325] recommended Tuned profile performance-patch content changed
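The logs above contain two kinds of outcome lines: a success message ("static tuning from profile ... applied", seen in the verification logs) and a failure message ("Cannot load profile(s) ..."). As a rough illustration of tracking the last application outcome from outside the pod, the sketch below classifies a log stream by whichever of those lines appears last. The function name `last_apply_state` is hypothetical, the patterns are copied from the log lines quoted in this report, and this is not how openshift-tuned itself keeps state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read tuned pod logs on stdin and print the outcome of
# the most recent profile application: "applied", "failed", or "unknown".
last_apply_state() {
    awk '
        # Success line printed by the tuned daemon.
        /static tuning from profile .* applied/ { state = "applied" }
        # Failure line shared by the reload and initial-profile errors.
        /Cannot load profile\(s\)/              { state = "failed" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

A possible use is `oc logs -n openshift-cluster-node-tuning-operator "$pod" | last_apply_state`; a result of `failed` after a profile update would match the situation described in this bug.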