Bug 2037036
Summary: | The tuned profile goes into degraded status and ksm.service is displayed in the log. | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jiří Mencák <jmencak> |
Component: | Node Tuning Operator | Assignee: | Jiří Mencák <jmencak> |
Status: | CLOSED ERRATA | QA Contact: | liqcui |
Severity: | medium | Docs Contact: | |
Priority: | medium | | |
Version: | 4.10 | CC: | aapark, aos-bugs, dagray, liqcui |
Target Milestone: | --- | | |
Target Release: | 4.10.0 | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | 2036303 | Environment: | |
Last Closed: | 2022-03-12 04:40:12 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 2036303 | | |
Description
Jiří Mencák
2022-01-04 17:11:40 UTC
This is fixed upstream by https://github.com/redhat-performance/tuned/pull/331. The latest TuneD shipped via FDP in 4.10 already has the fix. Nevertheless, another fix is needed in 4.10 for the [bootloader] plugin. PR to follow soon.

Fixed in 4.10.0-0.nightly-2022-01-05-181126 and later.

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2022-01-05-181126 True False 11h Cluster version is 4.10.0-0.nightly-2022-01-05-181126
$ oc get no
NAME STATUS ROLES AGE VERSION
jmencak-jjhml-master-0.c.openshift-gce-devel.internal Ready master 12h v1.22.1+6859754
jmencak-jjhml-master-1.c.openshift-gce-devel.internal Ready master 12h v1.22.1+6859754
jmencak-jjhml-master-2.c.openshift-gce-devel.internal Ready master 12h v1.22.1+6859754
jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal Ready worker 12h v1.22.1+6859754
jmencak-jjhml-worker-b-k54sj.c.openshift-gce-devel.internal Ready worker 12h v1.22.1+6859754
$ oc label no jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal node-role.kubernetes.io/worker-rt=
node/jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal labeled
$ oc create -f- <<EOF
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-cpu-partitioning
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift cpu-partitioning profile
      include=openshift-node,cpu-partitioning
      [variables]
      # {isolated,no_balance}_cores take a list of ranges; e.g. isolated_cores=2,4-7
      isolated_cores=1
      no_balance_cores=1
      [bootloader]
      # set empty values to disable RHEL initrd setting in cpu-partitioning
      initrd_remove_dir=
      initrd_dst_img=
      initrd_add_dir=
    name: openshift-cpu-partitioning
  recommend:
  - match:
    - label: node-role.kubernetes.io/worker-rt
    priority: 20
    profile: openshift-cpu-partitioning
EOF
$ oc get po -o wide|grep worker-a
tuned-hshhh 1/1 Running 0 12h 10.0.128.3 jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal <none> <none>
$ oc logs tuned-hshhh | grep ERROR
2022-01-06 08:37:25,761 ERROR tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524
2022-01-06 08:37:26,253 ERROR tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524
2022-01-06 08:37:26,312 ERROR tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524
$ oc get profile
NAME TUNED APPLIED DEGRADED AGE
jmencak-jjhml-master-0.c.openshift-gce-devel.internal openshift-control-plane True False 12h
jmencak-jjhml-master-1.c.openshift-gce-devel.internal openshift-control-plane True False 12h
jmencak-jjhml-master-2.c.openshift-gce-devel.internal openshift-control-plane True False 12h
jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal openshift-cpu-partitioning True True 12h
jmencak-jjhml-worker-b-k54sj.c.openshift-gce-devel.internal openshift-node True False 12h

Now, the profile `jmencak-jjhml-worker-a-8kc2n.c.openshift-gce-devel.internal` is Degraded. However, that is expected on GCP/AWS/... and other VMs where the kernel.nmi_watchdog sysctl cannot be set, so TuneD logs an ERROR. You will not see this on bare metal, and there the profile will not be degraded.

Looking through the logs, there is no longer an issue with ksm.service:

$ oc logs tuned-hshhh | grep ksm.service

Verified in my cluster as below:
[ocpadmin@ec2-18-217-45-133 sro]$ oc get no
NAME STATUS ROLES AGE VERSION
liqcui-gcp4906-pmrrj-master-0.c.openshift-qe.internal Ready master 86m v1.22.1+6859754
liqcui-gcp4906-pmrrj-master-1.c.openshift-qe.internal Ready master 86m v1.22.1+6859754
liqcui-gcp4906-pmrrj-master-2.c.openshift-qe.internal Ready master 86m v1.22.1+6859754
liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal Ready worker 72m v1.22.1+6859754
liqcui-gcp4906-pmrrj-worker-b-7lz6j.c.openshift-qe.internal Ready worker 75m v1.22.1+6859754
liqcui-gcp4906-pmrrj-worker-c-llvnm.c.openshift-qe.internal Ready worker 75m v1.22.1+6859754
[ocpadmin@ec2-18-217-45-133 sro]$ oc label no liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal node-role.kubernetes.io/worker-rt=
node/liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal labeled
[ocpadmin@ec2-18-217-45-133 sro]$ oc create -f- <<EOF
> apiVersion: tuned.openshift.io/v1
> kind: Tuned
> metadata:
>   name: openshift-cpu-partitioning
>   namespace: openshift-cluster-node-tuning-operator
> spec:
>   profile:
>   - data: |
>       [main]
>       summary=Custom OpenShift cpu-partitioning profile
>       include=openshift-node,cpu-partitioning
>       [variables]
>       # {isolated,no_balance}_cores take a list of ranges; e.g. isolated_cores=2,4-7
>       isolated_cores=1
>       no_balance_cores=1
>       [bootloader]
>       # set empty values to disable RHEL initrd setting in cpu-partitioning
>       initrd_remove_dir=
>       initrd_dst_img=
>       initrd_add_dir=
>     name: openshift-cpu-partitioning
>
>   recommend:
>   - match:
>     - label: node-role.kubernetes.io/worker-rt
>     priority: 20
>     profile: openshift-cpu-partitioning
> EOF
tuned.tuned.openshift.io/openshift-cpu-partitioning created
[ocpadmin@ec2-18-217-45-133 sro]$ oc get ns |grep tun
openshift-cluster-node-tuning-operator Active 92m
[ocpadmin@ec2-18-217-45-133 sro]$ oc get po -n openshift-cluster-node-tuning-operator -o wide|grep liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal
tuned-fnxz8 1/1 Running 0 75m 10.0.128.2 liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal <none> <none>
[ocpadmin@ec2-18-217-45-133 sro]$ oc logs tuned-fnxz8 -n openshift-cluster-node-tuning-operator | tail -10
2022-01-06 14:54:25,388 INFO tuned.plugins.plugin_cpu: setting new cpu latency 0
2022-01-06 14:54:25,390 ERROR tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524
2022-01-06 14:54:25,390 INFO tuned.plugins.plugin_sysctl: reapplying system sysctl
2022-01-06 14:54:25,489 INFO tuned.plugins.plugin_systemd: setting 'CPUAffinity' to '0 2 3' in the '/etc/systemd/system.conf'
2022-01-06 14:54:25,508 INFO tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['start']'
2022-01-06 14:54:25,642 INFO tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
2022-01-06 14:54:25,643 INFO tuned.plugins.plugin_bootloader: cannot find grub.cfg to patch
E0106 14:54:25.643783 3470 controller.go:775] unable to sync(daemon/) requeued (4)
E0106 14:54:25.643824 3470 controller.go:775] unable to sync(daemon/) requeued (5)
2022-01-06 14:54:25,643 INFO tuned.daemon.daemon: static tuning from profile 'openshift-cpu-partitioning' applied
[ocpadmin@ec2-18-217-45-133 sro]$ oc get profile -n openshift-cluster-node-tuning-operator
NAME TUNED APPLIED DEGRADED AGE
liqcui-gcp4906-pmrrj-master-0.c.openshift-qe.internal openshift-control-plane True False 86m
liqcui-gcp4906-pmrrj-master-1.c.openshift-qe.internal openshift-control-plane True False 86m
liqcui-gcp4906-pmrrj-master-2.c.openshift-qe.internal openshift-control-plane True False 86m
liqcui-gcp4906-pmrrj-worker-a-vh9d7.c.openshift-qe.internal openshift-cpu-partitioning True True 76m
liqcui-gcp4906-pmrrj-worker-b-7lz6j.c.openshift-qe.internal openshift-node True False 78m
liqcui-gcp4906-pmrrj-worker-c-llvnm.c.openshift-qe.internal openshift-node True False 78m
[ocpadmin@ec2-18-217-45-133 sro]$ oc logs tuned-fnxz8 -n openshift-cluster-node-tuning-operator | grep ksm.service
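The `isolated_cores` comment in the Tuned object says the value is a list of ranges (e.g. `2,4-7`). The TuneD log line `setting 'CPUAffinity' to '0 2 3'` is consistent with CPU 1 being isolated on a node with 4 CPUs; the CPU count is an assumption, since the report does not state it. The shell sketch below expands such a list and computes the remaining (housekeeping) CPUs. The function names `expand_cpu_list` and `housekeeping_cpus` are hypothetical, only the plain `N` and `N-M` forms are handled, and this is not TuneD's own parser.

```shell
#!/usr/bin/env bash
# Sketch only: expand a CPU list such as "2,4-7" and compute the CPUs left
# for housekeeping after the isolated ones are removed. Handles only plain
# "N" and "N-M" items; function names are hypothetical, not part of TuneD.

# expand_cpu_list LIST -> space-separated CPUs, e.g. "2,4-7" -> "2 4 5 6 7"
expand_cpu_list() {
  local item cpu lo hi
  local -a items out=()
  IFS=',' read -ra items <<< "$1"
  for item in "${items[@]}"; do
    item=${item// /}                  # tolerate spaces such as "2, 4-7"
    if [[ $item == *-* ]]; then
      lo=${item%-*}
      hi=${item#*-}
      for ((cpu = lo; cpu <= hi; cpu++)); do
        out+=("$cpu")
      done
    elif [[ -n $item ]]; then
      out+=("$item")
    fi
  done
  echo "${out[*]}"
}

# housekeeping_cpus NCPUS ISOLATED_LIST -> CPUs 0..NCPUS-1 that are not isolated
housekeeping_cpus() {
  local ncpus=$1 cpu
  local -A isolated=()
  local -a out=()
  for cpu in $(expand_cpu_list "$2"); do
    isolated[$cpu]=1
  done
  for ((cpu = 0; cpu < ncpus; cpu++)); do
    [[ -n ${isolated[$cpu]:-} ]] || out+=("$cpu")
  done
  echo "${out[*]}"
}

# Example (4 CPUs assumed, isolated_cores=1 as in the Tuned object):
#   housekeeping_cpus 4 1   # prints: 0 2 3
```

For example, `expand_cpu_list 2,4-7` prints `2 4 5 6 7`, and `housekeeping_cpus 4 1` prints `0 2 3`, which matches the CPUAffinity value logged for the worker-rt node under the 4-CPU assumption.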
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:0056
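Before opening a new report, the two checks used in the transcripts (grepping the TuneD pod log for `ksm.service` and for `ERROR`) can be combined in a small shell helper. This is a sketch, assuming the TuneD log line format shown in this bug (`<date> <time> LEVEL <module>: <message>`). `triage_tuned_log` is a hypothetical name, and treating `kernel.nmi_watchdog` errors as expected reflects only the explanation given in this report for GCP/AWS and other VMs.

```shell
#!/usr/bin/env bash
# Sketch only: summarize a TuneD pod log read from stdin. Assumes TuneD's
# "<date> <time> LEVEL <module>: <message>" format as seen in this bug;
# triage_tuned_log is a hypothetical helper, not an oc or TuneD command.
#   ksm           lines mentioning ksm.service (the symptom reported here)
#   nmi_watchdog  ERROR lines about kernel.nmi_watchdog (expected on VMs
#                 where that sysctl cannot be set, per this report)
#   other_errors  any remaining TuneD ERROR lines
# Operator lines in klog format (e.g. "E0106 ... controller.go:775]") do not
# contain " ERROR " and are therefore not counted.
triage_tuned_log() {
  local line ksm=0 nmi=0 other=0
  # "|| [[ -n $line ]]" also processes a final line with no trailing newline
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == *ksm.service* ]]; then
      ksm=$((ksm + 1))
    fi
    if [[ $line == *" ERROR "* ]]; then
      if [[ $line == *kernel.nmi_watchdog* ]]; then
        nmi=$((nmi + 1))
      else
        other=$((other + 1))
      fi
    fi
  done
  echo "ksm=$ksm nmi_watchdog=$nmi other_errors=$other"
}

# Usage (pod name taken from the verification transcript):
#   oc logs tuned-fnxz8 -n openshift-cluster-node-tuning-operator | triage_tuned_log
```

Counters are incremented with `$((x + 1))` rather than `((x++))` so the helper also works in scripts running under `set -e`, where a post-increment from 0 would return a failing status.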