Bug 1998247 - Tuned configuration fails and does not recover when profile references a not yet existing performance profile configuration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Tuning Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Jiří Mencák
QA Contact: Simon
URL:
Whiteboard:
Depends On:
Blocks: 1999608
Reported: 2021-08-26 17:03 UTC by Marius Cornea
Modified: 2021-10-18 17:49 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1999608
Environment:
Last Closed: 2021-10-18 17:49:21 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID                                               Private  Priority  Status  Summary  Last Updated
Github                  openshift cluster-node-tuning-operator pull 267  0        None      None    None     2021-08-30 18:32:07 UTC
Red Hat Product Errata  RHSA-2021:3759                                   0        None      None    None     2021-10-18 17:49:35 UTC

Description Marius Cornea 2021-08-26 17:03:33 UTC
Description of problem:

The following Tuned profile, created at 2021-08-26T15:16:49Z, includes a configuration (include=openshift-node-performance-profile) that is only created later by a PerformanceProfile, at 2021-08-26T15:25:04Z. After the PerformanceProfile is created, the Tuned configuration still does not get applied and the performance profile reports a TunedError. I'd expect that once the performance profile is created, the performance-patch Tuned profile that includes it can continue its configuration.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  creationTimestamp: "2021-08-26T15:16:49Z"
  generation: 1
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
  resourceVersion: "25666"
  uid: 99e9a0ec-d9dc-4f7e-a515-6ae5b2b2047b
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-profile
      [bootloader]
      cmdline_crash=nohz_full=2-23,26-47
      [sysctl]
      kernel.timer_migration=1
      [service]
      service.stalld=start,enable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: master
    priority: 19
    profile: performance-patch


apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2021-08-26T15:25:04Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: openshift-node-performance-profile
  resourceVersion: "36276"
  uid: bf81e817-6347-4393-afff-6ee1850e09e8
spec:
  additionalKernelArgs:
  - idle=poll
  cpu:
    isolated: 2-23,26-47
    reserved: 0-1,24-25
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: false
status:
  conditions:
  - lastHeartbeatTime: "2021-08-26T15:43:21Z"
    lastTransitionTime: "2021-08-26T15:43:21Z"
    status: "False"
    type: Available
  - lastHeartbeatTime: "2021-08-26T15:43:21Z"
    lastTransitionTime: "2021-08-26T15:43:21Z"
    status: "False"
    type: Upgradeable
  - lastHeartbeatTime: "2021-08-26T15:43:21Z"
    lastTransitionTime: "2021-08-26T15:43:21Z"
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2021-08-26T15:43:21Z"
    lastTransitionTime: "2021-08-26T15:43:21Z"
    message: |
      Tuned sno.kni-qe-1.lab.eng.rdu2.redhat.com Degraded Reason: TunedError.
      Tuned sno.kni-qe-1.lab.eng.rdu2.redhat.com Degraded Message: Tuned daemon issued one or more error message(s) during profile application..
      Tuned sno.kni-qe-1.lab.eng.rdu2.redhat.com Degraded Reason: TunedError.
      Tuned sno.kni-qe-1.lab.eng.rdu2.redhat.com Degraded Message: Tuned daemon issued one or more error message(s) during profile application..
    reason: TunedProfileDegraded
    status: "True"
    type: Degraded
  runtimeClass: performance-openshift-node-performance-profile
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-openshift-node-performance-profile

Version-Release number of selected component (if applicable):
4.8.5

How reproducible:
100%

Steps to Reproduce:
1. Create a Tuned profile that includes (via include=) configuration generated by a performance profile that does not exist yet
2. Create the performance profile after step 1

Actual results:
Performance profile reports Tuned errors

Expected results:
Tuned configuration retries and succeeds once the performance profile is created

Additional info:

This issue has been observed while testing the DU ZTP flow where the profiles get created by ACM policies and there is no ordering in which resource gets created first.
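
One way to check from the outside whether the expected recovery has happened is to poll the per-node Profile objects until none of them is degraded. The following is only a minimal sketch, not part of the operator or of the fix: the function name wait_for_tuned_recovery and the PROFILES_CMD override are hypothetical, and it assumes that DEGRADED is the fourth column of the `oc get profiles` table, as in the output shown later in this report.

```shell
# Poll until every Profile reports DEGRADED=False, or give up.
#   $1: maximum number of attempts (default 30)
#   $2: seconds to sleep between attempts (default 10)
# PROFILES_CMD is a hypothetical hook (not an oc feature) that lets the
# listing command be swapped out, e.g. for testing without a cluster.
wait_for_tuned_recovery() {
  local attempts="${1:-30}" delay="${2:-10}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # Assumes DEGRADED is the 4th column of the table. The awk script
    # succeeds only if at least one data row was seen and no row is degraded.
    if ${PROFILES_CMD:-oc get profiles -n openshift-cluster-node-tuning-operator} 2>/dev/null |
       awk 'NR > 1 { rows++; if ($4 == "True") bad = 1 }
            END { exit !(rows > 0 && !bad) }'; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (after creating the PerformanceProfile in step 2):
#   wait_for_tuned_recovery 30 10 && echo "Tuned recovered" || echo "still degraded"
```

On an affected 4.8.5 cluster this would be expected to time out, since the Tuned configuration never recovers.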

Comment 1 Jiří Mencák 2021-08-27 07:11:31 UTC
Thank you for the report.  Could you please provide either must-gather, or the output of:

$ oc get profile -n openshift-cluster-node-tuning-operator 

and the logs from the Tuned container on the node that fails to apply the profile?

Comment 2 Jiří Mencák 2021-08-27 08:41:38 UTC
No need for must-gather or the output I asked for.  I have a minimal reproducer for NTO.

Comment 5 Simon 2021-09-02 15:59:35 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-01-193941   True        False         3h45m   Cluster version is 4.9.0-0.nightly-2021-09-01-193941

$ node=$(oc get nodes | grep -m 1 worker | cut -f 1 -d ' ') && echo $node
ip-10-0-136-123.us-east-2.compute.internal
$ pod=$(oc get pods -n openshift-cluster-node-tuning-operator -o wide | grep $node | cut -d ' ' -f 1) && echo $pod
tuned-xsxrv

$ oc get routes -n openshift-console
NAME        HOST/PORT                                                                 PATH   SERVICES    PORT    TERMINATION          WILDCARD
console     console-openshift-console.apps.skordas92b.qe.devcluster.openshift.com            console     https   reencrypt/Redirect   None
downloads   downloads-openshift-console.apps.skordas92b.qe.devcluster.openshift.com          downloads   http    edge/Redirect        None

# Log in to the console
# Install Performance Addon Operator
# Operators -> Operator Hub -> Performance Addon Operator -> Install

$ oc get pods -n openshift-operators 
NAME                                    READY   STATUS    RESTARTS   AGE
performance-operator-7fc5bcb7c9-4m67g   1/1     Running   0          91s

# Create tuned

oc create -f- <<EOF
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-profile
      [bootloader]
      cmdline_crash=nohz_full=2-23,26-47
      [sysctl]
      kernel.timer_migration=1
      [service]
      service.stalld=start,enable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: master
    priority: 19
    profile: performance-patch
EOF


$ oc get tuned -n openshift-cluster-node-tuning-operator 
NAME                AGE
default             4h31m
performance-patch   14s
rendered            4h31m

$ oc get profiles -n openshift-cluster-node-tuning-operator 
NAME                                         TUNED               APPLIED   DEGRADED   AGE
ip-10-0-136-123.us-east-2.compute.internal   openshift-node      True      False      4h24m
ip-10-0-147-0.us-east-2.compute.internal     performance-patch   False     True       4h31m
ip-10-0-161-12.us-east-2.compute.internal    performance-patch   False     True       4h31m
ip-10-0-178-33.us-east-2.compute.internal    openshift-node      True      False      4h24m
ip-10-0-199-56.us-east-2.compute.internal    performance-patch   False     True       4h31m
ip-10-0-204-47.us-east-2.compute.internal    openshift-node      True      False      4h24m


# create Performance profile

oc create -f- <<EOF
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  finalizers:
  - foreground-deletion
  name: openshift-node-performance-profile
spec:
  additionalKernelArgs:
  - idle=poll
  cpu:
    isolated: 2-23,26-47
    reserved: 0-1,24-25
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: false
EOF

$ oc get performanceprofiles.performance.openshift.io -n openshift-operators -o yaml
apiVersion: v1
items:
- apiVersion: performance.openshift.io/v2
  kind: PerformanceProfile
  metadata:
    creationTimestamp: "2021-09-02T15:54:19Z"
    finalizers:
    - foreground-deletion
    generation: 1
    name: openshift-node-performance-profile
    resourceVersion: "105104"
    uid: a227e6c2-8480-49c9-b7d6-619292d2f8eb
  spec:
    additionalKernelArgs:
    - idle=poll
    cpu:
      isolated: 2-23,26-47
      reserved: 0-1,24-25
    globallyDisableIrqLoadBalancing: true
    hugepages:
      defaultHugepagesSize: 1G
      pages:
      - count: 32
        size: 1G
    machineConfigPoolSelector:
      pools.operator.machineconfiguration.openshift.io/master: ""
    nodeSelector:
      node-role.kubernetes.io/master: ""
    numa:
      topologyPolicy: restricted
    realTimeKernel:
      enabled: false
  status:
    conditions:
    - lastHeartbeatTime: "2021-09-02T15:54:20Z"
      lastTransitionTime: "2021-09-02T15:54:20Z"
      status: "True"
      type: Available
    - lastHeartbeatTime: "2021-09-02T15:54:20Z"
      lastTransitionTime: "2021-09-02T15:54:20Z"
      status: "True"
      type: Upgradeable
    - lastHeartbeatTime: "2021-09-02T15:54:20Z"
      lastTransitionTime: "2021-09-02T15:54:20Z"
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2021-09-02T15:54:20Z"
      lastTransitionTime: "2021-09-02T15:54:20Z"
      status: "False"
      type: Degraded
    runtimeClass: performance-openshift-node-performance-profile
    tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-openshift-node-performance-profile
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

No errors are reported when the PerformanceProfile is created after the Tuned profile.

Comment 8 errata-xmlrpc 2021-10-18 17:49:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

