+++ This bug was initially created as a clone of Bug #2036303 +++

Description of problem:
When the Tuned profile is updated, the profile is applied to the node but the Profile still remains DEGRADED.

Version-Release number of selected component (if applicable):
$ omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.12    True        False         38m     Error while reconciling 4.9.12: the cluster operator insights is degraded

How reproducible:

Steps to Reproduce:
1. Install and set up the performance addon operator
[root@bastion1 dk]# oc get performanceprofiles.performance.openshift.io performance -oyaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2021-11-02T10:18:56Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: performance
  resourceVersion: "9172819"
  uid: 931a600a-7e9a-499d-9e08-f99abbdd90ed
spec:
  cpu:
    isolated: 4-39,44-79
    reserved: 0-3,40-43
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      node: 0
      size: 1G
    - count: 32
      node: 1
      size: 1G
  nodeSelector:
    node-role.kubernetes.io/sys: ""
  numa:
    topologyPolicy: restricted

2. Create a Tuned profile
[root@bastion1 smile]# cat tuned_sysctl_socket_buffer_profile.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: sysctl-socket-buffer
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Set rmem_default,rmem_max,wmem_default,wmem_max
      include=openshift-node
      [sysctl]
      net.core.rmem_default = 2097152
      net.core.rmem_max = 2097152
      net.core.wmem_default = 2097152
      net.core.wmem_max = 2097152
    name: openshift-sysctl
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "sys"
    priority: 20
    profile: openshift-sysctl

3. The Tuned profile is degraded
[root@bastion1 dk]# oc get profile -A
NAMESPACE                                NAME                         TUNED                     APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master01.ss2.samsung.local   openshift-control-plane   True      False      65d
openshift-cluster-node-tuning-operator   master02.ss2.samsung.local   openshift-control-plane   True      False      64d
openshift-cluster-node-tuning-operator   master03.ss2.samsung.local   openshift-control-plane   True      False      65d
openshift-cluster-node-tuning-operator   worker01.ss2.samsung.local   openshift-sysctl-oam      True      True       61d
openshift-cluster-node-tuning-operator   worker02.ss2.samsung.local   openshift-sysctl-oam      True      False      61d
openshift-cluster-node-tuning-operator   worker03.ss2.samsung.local   openshift-sysctl-oam      True      True       61d
openshift-cluster-node-tuning-operator   worker04.ss2.samsung.local   openshift-sysctl-oam      True      False      61d
openshift-cluster-node-tuning-operator   worker05.ss2.samsung.local   openshift-sysctl-sys      True      False      61d
openshift-cluster-node-tuning-operator   worker06.ss2.samsung.local   openshift-sysctl-sys      True      True       61d
openshift-cluster-node-tuning-operator   worker07.ss2.samsung.local   openshift-sysctl-sys      True      False      61d
openshift-cluster-node-tuning-operator   worker08.ss2.samsung.local   openshift-sysctl-sys      True      False      61d
openshift-cluster-node-tuning-operator   worker09.ss2.samsung.local   openshift-sysctl-call     True      False      34d
openshift-cluster-node-tuning-operator   worker10.ss2.samsung.local   openshift-sysctl-call     True      True       34d
openshift-cluster-node-tuning-operator   worker11.ss2.samsung.local   openshift-sysctl-call2    True      False      6d20h
openshift-cluster-node-tuning-operator   worker12.ss2.samsung.local   openshift-sysctl-call2    True      False      6d20h

Actual results:
1) Error occurred in tuned profile
--
$ omg get profile worker10.ss2.samsung.local -o yaml
~
status:
  bootcmdline: skew_tick=1 nohz=on rcu_nocbs=4-27,32-55 tuned.non_isolcpus=f000000f intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,4-27,32-55 systemd.cpu_affinity=0,1,2,3,28,29,30,31 default_hugepagesz=1G +
  conditions:
  - lastTransitionTime: '2021-12-29T03:30:22Z'
    message: Tuned profile applied.
    reason: AsExpected
    status: 'True'
    type: Applied
  - lastTransitionTime: '2021-12-29T03:30:22Z'
    message: Tuned daemon issued one or more error message(s) during profile application.
    reason: TunedError
    status: 'True'
    type: Degraded
  tunedProfile: openshift-sysctl-call
--

2) Error log in tuned Pod
--
$ omg logs tuned-zzgm5
~
2021-12-29T03:30:24.027172311Z 2021-12-29 03:30:24,027 INFO tuned.plugins.plugin_cpu: setting new cpu latency 2
2021-12-29T03:30:24.033503757Z 2021-12-29 03:30:24,033 INFO tuned.plugins.plugin_sysctl: reapplying system sysctl
2021-12-29T03:30:24.528353891Z 2021-12-29 03:30:24,528 INFO tuned.plugins.plugin_systemd: setting 'CPUAffinity' to '0 1 2 3 28 29 30 31' in the '/etc/systemd/system.conf'
2021-12-29T03:30:25.007818601Z 2021-12-29 03:30:25,007 INFO tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['start']'
2021-12-29T03:30:25.535868718Z 2021-12-29 03:30:25,535 ERROR tuned.plugins.plugin_script: script '/usr/lib/tuned/cpu-partitioning/script.sh' error output: 'Unit ksm.service does not exist, proceeding anyway.
2021-12-29T03:30:25.535868718Z Unit ksmtuned.service does not exist, proceeding anyway.'
2021-12-29T03:30:25.536893772Z 2021-12-29 03:30:25,536 INFO tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
2021-12-29T03:30:25.537422292Z E1229 03:30:25.537398 16277 tuned.go:776] unable to sync(daemon/) requeued (6)
2021-12-29T03:30:25.537499978Z E1229 03:30:25.537479 16277 tuned.go:776] unable to sync(daemon/) requeued (7)
2021-12-29T03:30:25.537575410Z 2021-12-29 03:30:25,537 INFO tuned.daemon.daemon: static tuning from profile 'openshift-sysctl-call' applied

Expected results:
The Tuned profile DEGRADED status should be False.

Additional info:
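For triage, the nodes reporting DEGRADED=True can be pulled out of the `oc get profile` table in step 3 with a small filter. This is only a sketch: `degraded_profiles` is a hypothetical helper name (not part of oc or the Node Tuning Operator), and it assumes the default tabular output with a header row and whitespace-separated columns.

```shell
# Hypothetical helper: print the NAME of every Profile whose DEGRADED
# column reads "True". Reads the tabular output of `oc get profile`
# (with or without -A) on stdin. Column positions are taken from the
# header row, so the extra NAMESPACE column does not matter.
degraded_profiles() {
  awk '
    NR == 1 {
      # Locate the NAME and DEGRADED columns in the header.
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME")     name_col = i
        if ($i == "DEGRADED") degraded_col = i
      }
      next
    }
    # Print the node name only when both columns were found and the
    # row is marked degraded.
    name_col && degraded_col && $degraded_col == "True" { print $name_col }
  '
}
```

Usage would be something like `oc get profile -A | degraded_profiles`, followed by `oc logs` on the tuned Pod of each node it prints.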
This is fixed upstream by https://github.com/redhat-performance/tuned/pull/331
The latest TuneD shipped via FDP in 4.10 already has the fix. Nevertheless, another fix is needed in 4.10 for the [bootloader] plugin. PR to follow soon.
Fixed on 4.10.0-0.nightly-2022-01-05-181126 and above.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-05-181126   True        False         11h     Cluster version is 4.10.0-0.nightly-2022-01-05-181126

$ oc get no
NAME                                                          STATUS   ROLES    AGE   VERSION
jmencak-jjhml-master-0.c.openshift-gce-devel.internal         Ready    master   12h   v1.22.1+6859754
jmencak-jjhml-master-1.c.openshift-gce-devel.internal         Ready    master   12h   v1.22.1+6859754
jmencak-jjhml-master-2.c.openshift-gce-devel.internal         Ready    master   12h   v1.22.1+6859754
jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal   Ready    worker   12h   v1.22.1+6859754
jmencak-jjhml-worker-b-k54sj.c.openshift-gce-devel.internal   Ready    worker   12h   v1.22.1+6859754

$ oc label no jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal node-role.kubernetes.io/worker-rt=
node/jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal labeled

$ oc create -f- <<EOF
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-cpu-partitioning
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift cpu-partitioning profile
      include=openshift-node,cpu-partitioning
      [variables]
      # {isolated,no_balance}_cores take a list of ranges; e.g. isolated_cores=2,4-7
      isolated_cores=1
      no_balance_cores=1
      [bootloader]
      # set empty values to disable RHEL initrd setting in cpu-partitioning
      initrd_remove_dir=
      initrd_dst_img=
      initrd_add_dir=
    name: openshift-cpu-partitioning
  recommend:
  - match:
    - label: node-role.kubernetes.io/worker-rt
    priority: 20
    profile: openshift-cpu-partitioning
EOF

$ oc get po -o wide|grep worker-a
tuned-hshhh   1/1   Running   0   12h   10.0.128.3   jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal   <none>   <none>

$ oc logs tuned-hshhh | grep ERROR
2022-01-06 08:37:25,761 ERROR tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524
2022-01-06 08:37:26,253 ERROR tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524
2022-01-06 08:37:26,312 ERROR tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524

$ oc get profile
NAME                                                          TUNED                        APPLIED   DEGRADED   AGE
jmencak-jjhml-master-0.c.openshift-gce-devel.internal         openshift-control-plane      True      False      12h
jmencak-jjhml-master-1.c.openshift-gce-devel.internal         openshift-control-plane      True      False      12h
jmencak-jjhml-master-2.c.openshift-gce-devel.internal         openshift-control-plane      True      False      12h
jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal   openshift-cpu-partitioning   True      True       12h
jmencak-jjhml-worker-b-k54sj.c.openshift-gce-devel.internal   openshift-node               True      False      12h

Now, the profile `jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal` is Degraded. However, that is expected on GCP/AWS/... and on VMs where the kernel.nmi_watchdog sysctl cannot be set, so TuneD issues an ERROR in the logs. You will not see this on bare metal, and the profile will not be degraded there.

Looking through the logs, there is no longer an issue with ksm.service.

$ oc logs tuned-hshhh | grep ksm.service
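Since the kernel.nmi_watchdog ERROR is expected on cloud instances and VMs, it can help to filter it out when checking whether anything else is still degrading the profile. A minimal sketch, assuming the TuneD log line format shown above; `unexpected_tuned_errors` is a hypothetical helper name, and the match string is copied from the log excerpt rather than from any documented interface.

```shell
# Hypothetical helper: read tuned Pod logs on stdin and print only the
# ERROR lines that are NOT the known kernel.nmi_watchdog sysctl failure.
# Prints nothing (and still returns 0) when no other errors are present.
# Note: continuation lines of a multi-line error message are not matched.
unexpected_tuned_errors() {
  grep -F ' ERROR ' \
    | grep -vF "Failed to set sysctl parameter 'kernel.nmi_watchdog'" \
    || true
}
```

Usage would be something like `oc logs tuned-hshhh | unexpected_tuned_errors`. Empty output on a VM suggests the Degraded condition comes only from the nmi_watchdog sysctl.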
Verified in my cluster as below:

[ocpadmin@ec2-18-217-45-133 sro]$ oc get no
NAME                                                          STATUS   ROLES    AGE   VERSION
liqcui-gcp4906-pmrrj-master-0.c.openshift-qe.internal         Ready    master   86m   v1.22.1+6859754
liqcui-gcp4906-pmrrj-master-1.c.openshift-qe.internal         Ready    master   86m   v1.22.1+6859754
liqcui-gcp4906-pmrrj-master-2.c.openshift-qe.internal         Ready    master   86m   v1.22.1+6859754
liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal   Ready    worker   72m   v1.22.1+6859754
liqcui-gcp4906-pmrrj-worker-b-7lz6j.c.openshift-qe.internal   Ready    worker   75m   v1.22.1+6859754
liqcui-gcp4906-pmrrj-worker-c-llvnm.c.openshift-qe.internal   Ready    worker   75m   v1.22.1+6859754

[ocpadmin@ec2-18-217-45-133 sro]$ oc label no liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal node-role.kubernetes.io/worker-rt=
node/liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal labeled

[ocpadmin@ec2-18-217-45-133 sro]$ oc create -f- <<EOF
> apiVersion: tuned.openshift.io/v1
> kind: Tuned
> metadata:
>   name: openshift-cpu-partitioning
>   namespace: openshift-cluster-node-tuning-operator
> spec:
>   profile:
>   - data: |
>       [main]
>       summary=Custom OpenShift cpu-partitioning profile
>       include=openshift-node,cpu-partitioning
>       [variables]
>       # {isolated,no_balance}_cores take a list of ranges; e.g. isolated_cores=2,4-7
>       isolated_cores=1
>       no_balance_cores=1
>       [bootloader]
>       # set empty values to disable RHEL initrd setting in cpu-partitioning
>       initrd_remove_dir=
>       initrd_dst_img=
>       initrd_add_dir=
>     name: openshift-cpu-partitioning
>
>   recommend:
>   - match:
>     - label: node-role.kubernetes.io/worker-rt
>     priority: 20
>     profile: openshift-cpu-partitioning
> EOF
tuned.tuned.openshift.io/openshift-cpu-partitioning created

[ocpadmin@ec2-18-217-45-133 sro]$ oc get ns |grep tun
openshift-cluster-node-tuning-operator   Active   92m

[ocpadmin@ec2-18-217-45-133 sro]$ oc get po -n openshift-cluster-node-tuning-operator -o wide|grep liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal
tuned-fnxz8   1/1   Running   0   75m   10.0.128.2   liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal   <none>   <none>

[ocpadmin@ec2-18-217-45-133 sro]$ oc logs tuned-fnxz8 -n openshift-cluster-node-tuning-operator | tail -10
2022-01-06 14:54:25,388 INFO tuned.plugins.plugin_cpu: setting new cpu latency 0
2022-01-06 14:54:25,390 ERROR tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524
2022-01-06 14:54:25,390 INFO tuned.plugins.plugin_sysctl: reapplying system sysctl
2022-01-06 14:54:25,489 INFO tuned.plugins.plugin_systemd: setting 'CPUAffinity' to '0 2 3' in the '/etc/systemd/system.conf'
2022-01-06 14:54:25,508 INFO tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['start']'
2022-01-06 14:54:25,642 INFO tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
2022-01-06 14:54:25,643 INFO tuned.plugins.plugin_bootloader: cannot find grub.cfg to patch
E0106 14:54:25.643783 3470 controller.go:775] unable to sync(daemon/) requeued (4)
E0106 14:54:25.643824 3470 controller.go:775] unable to sync(daemon/) requeued (5)
2022-01-06 14:54:25,643 INFO tuned.daemon.daemon: static tuning from profile 'openshift-cpu-partitioning' applied

[ocpadmin@ec2-18-217-45-133 sro]$ oc get profile -n openshift-cluster-node-tuning-operator
NAME                                                          TUNED                        APPLIED   DEGRADED   AGE
liqcui-gcp4906-pmrrj-master-0.c.openshift-qe.internal         openshift-control-plane      True      False      86m
liqcui-gcp4906-pmrrj-master-1.c.openshift-qe.internal         openshift-control-plane      True      False      86m
liqcui-gcp4906-pmrrj-master-2.c.openshift-qe.internal         openshift-control-plane      True      False      86m
liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal   openshift-cpu-partitioning   True      True       76m
liqcui-gcp4906-pmrrj-worker-b-7lz6j.c.openshift-qe.internal   openshift-node               True      False      78m
liqcui-gcp4906-pmrrj-worker-c-llvnm.c.openshift-qe.internal   openshift-node               True      False      78m

[ocpadmin@ec2-18-217-45-133 sro]$ oc logs tuned-fnxz8 -n openshift-cluster-node-tuning-operator | grep ksm.service
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056