Bug 2036303

Summary: The tuned profile goes into degraded status and ksm.service is displayed in the log.
Product: OpenShift Container Platform
Reporter: Aaron Park <aapark>
Component: Node Tuning Operator
Assignee: Jiří Mencák <jmencak>
Status: CLOSED ERRATA
QA Contact: liqcui
Severity: high
Priority: high
Version: 4.9
CC: aos-bugs, dagray
Target Release: 4.9.z
Hardware: x86_64
OS: Linux
Clones: 2037036
Last Closed: 2022-01-17 08:07:31 UTC
Type: Bug
Bug Depends On: 2037036

Description Aaron Park 2021-12-31 05:22:11 UTC
Description of problem:
When the Tuned profile is updated, it is applied to the node, but the profile still remains DEGRADED.


Version-Release number of selected component (if applicable):
$ omg get clusterversion
NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.9.12   True       False        38m    Error while reconciling 4.9.12: the cluster operator insights is degraded

How reproducible:


Steps to Reproduce:
1. Install and set up the Performance Addon Operator
[root@bastion1 dk]# oc get performanceprofiles.performance.openshift.io performance -oyaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2021-11-02T10:18:56Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: performance
  resourceVersion: "9172819"
  uid: 931a600a-7e9a-499d-9e08-f99abbdd90ed
spec:
  cpu:
    isolated: 4-39,44-79
    reserved: 0-3,40-43
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      node: 0
      size: 1G
    - count: 32
      node: 1
      size: 1G
  nodeSelector:
    node-role.kubernetes.io/sys: ""
  numa:
    topologyPolicy: restricted

2. Create a Tuned profile
[root@bastion1 smile]# cat tuned_sysctl_socket_buffer_profile.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: sysctl-socket-buffer
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Set rmem_default,rmem_max,wmem_default,wmem_max
      include=openshift-node
      [sysctl]
      net.core.rmem_default = 2097152
      net.core.rmem_max = 2097152
      net.core.wmem_default = 2097152
      net.core.wmem_max = 2097152
    name: openshift-sysctl
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "sys"
    priority: 20
    profile: openshift-sysctl

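For reference, the sysctl values that the profile above is expected to set can be read out of the profile data with a few lines of Python. This is only a sketch: it uses the standard configparser module as a stand-in for TuneD's own profile parser (an assumption, the two are not guaranteed to agree on every profile), and the variable names are made up for illustration.

```python
import configparser

# The "data" block of the Tuned object from step 2, copied verbatim.
profile_data = """\
[main]
summary=Set rmem_default,rmem_max,wmem_default,wmem_max
include=openshift-node
[sysctl]
net.core.rmem_default = 2097152
net.core.rmem_max = 2097152
net.core.wmem_default = 2097152
net.core.wmem_max = 2097152
"""

# interpolation=None keeps any '%' in a value literal instead of expanding it.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(profile_data)

# Map each sysctl key to the value the profile asks for.
expected_sysctls = dict(parser["sysctl"])

for key, value in expected_sysctls.items():
    print(f"{key} = {value}")
```

The resulting keys can then be compared with the output of `sysctl <key>` on the node to confirm that the profile was actually applied.
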
3. The Tuned profile is reported as degraded
[root@bastion1 dk]# oc get profile -A
NAMESPACE                                NAME                         TUNED                     APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master01.ss2.samsung.local   openshift-control-plane   True      False      65d
openshift-cluster-node-tuning-operator   master02.ss2.samsung.local   openshift-control-plane   True      False      64d
openshift-cluster-node-tuning-operator   master03.ss2.samsung.local   openshift-control-plane   True      False      65d
openshift-cluster-node-tuning-operator   worker01.ss2.samsung.local   openshift-sysctl-oam      True      True       61d
openshift-cluster-node-tuning-operator   worker02.ss2.samsung.local   openshift-sysctl-oam      True      False      61d
openshift-cluster-node-tuning-operator   worker03.ss2.samsung.local   openshift-sysctl-oam      True      True       61d
openshift-cluster-node-tuning-operator   worker04.ss2.samsung.local   openshift-sysctl-oam      True      False      61d
openshift-cluster-node-tuning-operator   worker05.ss2.samsung.local   openshift-sysctl-sys      True      False      61d
openshift-cluster-node-tuning-operator   worker06.ss2.samsung.local   openshift-sysctl-sys      True      True       61d
openshift-cluster-node-tuning-operator   worker07.ss2.samsung.local   openshift-sysctl-sys      True      False      61d
openshift-cluster-node-tuning-operator   worker08.ss2.samsung.local   openshift-sysctl-sys      True      False      61d
openshift-cluster-node-tuning-operator   worker09.ss2.samsung.local   openshift-sysctl-call     True      False      34d
openshift-cluster-node-tuning-operator   worker10.ss2.samsung.local   openshift-sysctl-call     True      True       34d
openshift-cluster-node-tuning-operator   worker11.ss2.samsung.local   openshift-sysctl-call2    True      False      6d20h
openshift-cluster-node-tuning-operator   worker12.ss2.samsung.local   openshift-sysctl-call2    True      False      6d20h
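
To pick the degraded nodes out of a listing like the one above, the DEGRADED column can be filtered with a short script. This is a minimal sketch and not part of the original report: the function name degraded_profiles is hypothetical, and it assumes the whitespace-separated column layout shown in the output.

```python
def degraded_profiles(table_text):
    """Return the NAME of every row whose DEGRADED column is 'True'."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    degraded_idx = header.index("DEGRADED")

    names = []
    for row in lines[1:]:
        cols = row.split()
        # Skip rows that are too short to carry a DEGRADED value.
        if len(cols) > degraded_idx and cols[degraded_idx] == "True":
            names.append(cols[name_idx])
    return names


# Three rows copied from the listing above.
sample = """\
NAMESPACE                                NAME                         TUNED                     APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   worker09.ss2.samsung.local   openshift-sysctl-call     True      False      34d
openshift-cluster-node-tuning-operator   worker10.ss2.samsung.local   openshift-sysctl-call     True      True       34d
openshift-cluster-node-tuning-operator   worker11.ss2.samsung.local   openshift-sysctl-call2    True      False      6d20h
"""

print(degraded_profiles(sample))
```

Looking the columns up by header name rather than by fixed position keeps the sketch usable both with and without the NAMESPACE column (that is, with or without -A).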

Actual results:
1) The Tuned profile reports an error
--
$ omg get profile worker10.ss2.samsung.local -o yaml
~
status:
  bootcmdline: skew_tick=1 nohz=on rcu_nocbs=4-27,32-55 tuned.non_isolcpus=f000000f
    intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,4-27,32-55
    systemd.cpu_affinity=0,1,2,3,28,29,30,31 default_hugepagesz=1G +
  conditions:
  - lastTransitionTime: '2021-12-29T03:30:22Z'
    message: Tuned profile applied.
    reason: AsExpected
    status: 'True'
    type: Applied
  - lastTransitionTime: '2021-12-29T03:30:22Z'
    message: Tuned daemon issued one or more error message(s) during profile application.
    reason: TunedError
    status: 'True'
    type: Degraded
  tunedProfile: openshift-sysctl-call
--
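
The same check can be made against the status block of a single Profile object. The sketch below assumes the object has already been loaded into a Python dict (for example from JSON output); the helper name degraded_condition is hypothetical, and the sample dict only mirrors the conditions shown above.

```python
def degraded_condition(profile):
    """Return the Degraded condition if its status is 'True', else None."""
    conditions = profile.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Degraded" and cond.get("status") == "True":
            return cond
    return None


# Mirrors the status of worker10.ss2.samsung.local shown above.
profile = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2021-12-29T03:30:22Z",
                "message": "Tuned profile applied.",
                "reason": "AsExpected",
                "status": "True",
                "type": "Applied",
            },
            {
                "lastTransitionTime": "2021-12-29T03:30:22Z",
                "message": "Tuned daemon issued one or more error message(s) "
                           "during profile application.",
                "reason": "TunedError",
                "status": "True",
                "type": "Degraded",
            },
        ],
        "tunedProfile": "openshift-sysctl-call",
    }
}

cond = degraded_condition(profile)
if cond is not None:
    print(f"{cond['reason']}: {cond['message']}")
```

Note that Applied and Degraded are both 'True' here, which is exactly the combination this bug describes: the profile is applied, yet the node is still flagged.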

2) Error log in the tuned Pod
--
$ omg logs tuned-zzgm5
~
2021-12-29T03:30:24.027172311Z 2021-12-29 03:30:24,027 INFO     tuned.plugins.plugin_cpu: setting new cpu latency 2
2021-12-29T03:30:24.033503757Z 2021-12-29 03:30:24,033 INFO     tuned.plugins.plugin_sysctl: reapplying system sysctl
2021-12-29T03:30:24.528353891Z 2021-12-29 03:30:24,528 INFO     tuned.plugins.plugin_systemd: setting 'CPUAffinity' to '0 1 2 3 28 29 30 31' in the '/etc/systemd/system.conf'
2021-12-29T03:30:25.007818601Z 2021-12-29 03:30:25,007 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['start']'
2021-12-29T03:30:25.535868718Z 2021-12-29 03:30:25,535 ERROR    tuned.plugins.plugin_script: script '/usr/lib/tuned/cpu-partitioning/script.sh' error output: 'Unit ksm.service does not exist, proceeding anyway.
2021-12-29T03:30:25.535868718Z Unit ksmtuned.service does not exist, proceeding anyway.'
2021-12-29T03:30:25.536893772Z 2021-12-29 03:30:25,536 INFO     tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
2021-12-29T03:30:25.537422292Z E1229 03:30:25.537398   16277 tuned.go:776] unable to sync(daemon/) requeued (6)
2021-12-29T03:30:25.537499978Z E1229 03:30:25.537479   16277 tuned.go:776] unable to sync(daemon/) requeued (7)
2021-12-29T03:30:25.537575410Z 2021-12-29 03:30:25,537 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-sysctl-call' applied
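
Since the Degraded condition is raised when the TuneD daemon logs an error during profile application, it helps to extract only the ERROR lines from the pod log. The following is a rough sketch under the assumption that daemon lines follow the "date time,ms LEVEL logger: message" layout seen above; the regular expression and the function name tuned_errors are illustrative only.

```python
import re

# Matches the TuneD daemon part of a line: "<date> <time>,<ms> LEVEL  logger: message".
# The leading container timestamp and operator lines (E1229 ...) do not match.
TUNED_LINE = re.compile(
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} "
    r"(?P<level>[A-Z]+)\s+(?P<logger>[\w.]+): (?P<message>.*)"
)


def tuned_errors(log_text):
    """Return (logger, message) pairs for every ERROR line in the log."""
    errors = []
    for line in log_text.splitlines():
        match = TUNED_LINE.search(line)
        if match and match.group("level") == "ERROR":
            errors.append((match.group("logger"), match.group("message")))
    return errors


# Lines copied from the log above. The second line of the script's error
# output is a continuation line and is not captured by this simple sketch.
sample_log = """\
2021-12-29T03:30:25.007818601Z 2021-12-29 03:30:25,007 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['start']'
2021-12-29T03:30:25.535868718Z 2021-12-29 03:30:25,535 ERROR    tuned.plugins.plugin_script: script '/usr/lib/tuned/cpu-partitioning/script.sh' error output: 'Unit ksm.service does not exist, proceeding anyway.
2021-12-29T03:30:25.535868718Z Unit ksmtuned.service does not exist, proceeding anyway.'
2021-12-29T03:30:25.537422292Z E1229 03:30:25.537398   16277 tuned.go:776] unable to sync(daemon/) requeued (6)
"""

for logger, message in tuned_errors(sample_log):
    print(logger, "->", message)
```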

Expected results:
The DEGRADED status of the Tuned profile should be False.

Additional info:

Comment 2 liqcui 2022-01-10 10:06:19 UTC
Verified Result:

[ocpadmin@ec2-18-217-45-133 ~]$ oc get nodes
NAME                                                         STATUS   ROLES    AGE   VERSION
liqcui-oc4903-x4dvl-master-0.c.openshift-qe.internal         Ready    master   24m   v1.22.3+e790d7f
liqcui-oc4903-x4dvl-master-1.c.openshift-qe.internal         Ready    master   24m   v1.22.3+e790d7f
liqcui-oc4903-x4dvl-master-2.c.openshift-qe.internal         Ready    master   24m   v1.22.3+e790d7f
liqcui-oc4903-x4dvl-worker-a-xmdcs.c.openshift-qe.internal   Ready    worker   13m   v1.22.3+e790d7f
liqcui-oc4903-x4dvl-worker-b-fl7pn.c.openshift-qe.internal   Ready    worker   13m   v1.22.3+e790d7f
liqcui-oc4903-x4dvl-worker-c-2d4zg.c.openshift-qe.internal   Ready    worker   16m   v1.22.3+e790d7f
[ocpadmin@ec2-18-217-45-133 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-01-10-045851   True        False         19m     Cluster version is 4.9.0-0.nightly-2022-01-10-045851
[ocpadmin@ec2-18-217-45-133 ~]$ oc label no liqcui-oc4903-x4dvl-worker-a-xmdcs.c.openshift-qe.internal node-role.kubernetes.io/worker-rt=
node/liqcui-oc4903-x4dvl-worker-a-xmdcs.c.openshift-qe.internal labeled
[ocpadmin@ec2-18-217-45-133 ~]$ oc create -f- <<EOF
> apiVersion: tuned.openshift.io/v1
> kind: Tuned
> metadata:
>   name: openshift-cpu-partitioning
>   namespace: openshift-cluster-node-tuning-operator
> spec:
>   profile:
>   - data: |
>       [main]
>       summary=Custom OpenShift cpu-partitioning profile
>       include=openshift-node,cpu-partitioning
>       [variables]
>       # {isolated,no_balance}_cores take a list of ranges; e.g. isolated_cores=2,4-7
>       isolated_cores=1
>       no_balance_cores=1
>       [bootloader]
>       # set empty values to disable RHEL initrd setting in cpu-partitioning
>       initrd_remove_dir=
>       initrd_dst_img=
>       initrd_add_dir=
>     name: openshift-cpu-partitioning
> 
>   recommend:
>   - match:
>     - label: node-role.kubernetes.io/worker-rt
>     priority: 20
>     profile: openshift-cpu-partitioning
> EOF
tuned.tuned.openshift.io/openshift-cpu-partitioning created
[ocpadmin@ec2-18-217-45-133 ~]$ oc project openshift-cluster-node-tuning-operator
Now using project "openshift-cluster-node-tuning-operator" on server "https://api.liqcui-oc4903.qe.gcp.devcluster.openshift.com:6443".
[ocpadmin@ec2-18-217-45-133 ~]$ oc get po -o wide|grep liqcui-oc4903-x4dvl-worker-a-xmdcs.c.openshift-qe.internal
tuned-fq9jf                                     1/1     Running   0          29m   10.0.128.2    liqcui-oc4903-x4dvl-worker-a-xmdcs.c.openshift-qe.internal   <none>           <none>
[ocpadmin@ec2-18-217-45-133 ~]$ oc get profile
NAME                                                         TUNED                        APPLIED   DEGRADED   AGE
liqcui-oc4903-x4dvl-master-0.c.openshift-qe.internal         openshift-control-plane      True      False      35m
liqcui-oc4903-x4dvl-master-1.c.openshift-qe.internal         openshift-control-plane      True      False      35m
liqcui-oc4903-x4dvl-master-2.c.openshift-qe.internal         openshift-control-plane      True      False      35m
liqcui-oc4903-x4dvl-worker-a-xmdcs.c.openshift-qe.internal   openshift-cpu-partitioning   True      True       29m
liqcui-oc4903-x4dvl-worker-b-fl7pn.c.openshift-qe.internal   openshift-node               True      False      29m
liqcui-oc4903-x4dvl-worker-c-2d4zg.c.openshift-qe.internal   openshift-node               True      False      30m
[ocpadmin@ec2-18-217-45-133 ~]$ oc logs tuned-fq9jf | grep ksm.service
[ocpadmin@ec2-18-217-45-133 ~]$ oc logs tuned-fq9jf | tail -15
2022-01-10 10:00:19,575 INFO     tuned.daemon.daemon: starting tuning
2022-01-10 10:00:19,579 INFO     tuned.plugins.base: instance cpu: assigning devices cpu2, cpu3, cpu1, cpu0
2022-01-10 10:00:19,580 INFO     tuned.plugins.plugin_cpu: We are running on an x86 GenuineIntel platform
2022-01-10 10:00:19,583 WARNING  tuned.plugins.plugin_cpu: your CPU doesn't support MSR_IA32_ENERGY_PERF_BIAS, ignoring CPU energy performance bias
2022-01-10 10:00:19,585 INFO     tuned.plugins.base: instance disk: assigning devices sda
2022-01-10 10:00:19,587 INFO     tuned.plugins.base: instance net: assigning devices ens4
2022-01-10 10:00:19,594 INFO     tuned.plugins.plugin_cpu: setting new cpu latency 0
2022-01-10 10:00:19,597 ERROR    tuned.plugins.plugin_sysctl: Failed to set sysctl parameter 'kernel.nmi_watchdog' to '0': [Errno 524] Unknown error 524
2022-01-10 10:00:19,597 INFO     tuned.plugins.plugin_sysctl: reapplying system sysctl
2022-01-10 10:00:19,711 INFO     tuned.plugins.plugin_systemd: setting 'CPUAffinity' to '0 2 3' in the '/etc/systemd/system.conf'
2022-01-10 10:00:19,741 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['start']'
2022-01-10 10:00:19,881 INFO     tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
E0110 10:00:19.882539    2566 tuned.go:776] unable to sync(daemon/) requeued (4)
E0110 10:00:19.882576    2566 tuned.go:776] unable to sync(daemon/) requeued (5)
2022-01-10 10:00:19,882 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-cpu-partitioning' applied

Comment 5 errata-xmlrpc 2022-01-17 08:07:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0110