Bug 2021151 - Sometimes the DU node does not get the performance profile configuration applied and MachineConfigPool stays stuck in Updating
Summary: Sometimes the DU node does not get the performance profile configuration appl...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Sebastian Scheinkman
QA Contact: Marius Cornea
URL:
Whiteboard:
Duplicates: 2015305 2016600 2021534 2022665
Depends On:
Blocks: 2066401
 
Reported: 2021-11-08 13:11 UTC by Marius Cornea
Modified: 2022-03-21 16:50 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On SNO deployments, the SR-IOV operator did not pause the machine config pool that the node is assigned to, so the MachineConfigPool stayed stuck in Updating.
Consequence: The machine config could not be updated.
Fix: The SR-IOV operator now pauses the correct machine config pool before running any configuration that requires a reboot.
Result: The SR-IOV operator is fully supported on SNO deployments.
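The pool-selection logic behind the fix can be sketched as follows. This is a hypothetical illustration, not the operator's actual code: on SNO the single node carries both the master and worker roles, and its rendered machine config comes from the master pool, so that is the pool that must be paused before any reboot-requiring change.

```shell
# Hypothetical sketch: pick the machine config pool to pause, given the
# comma-separated roles of the node the SR-IOV config targets.
# On SNO the node is labeled "master,worker"; the master pool renders
# its config, so the master pool is the one that must be paused.
select_pool_to_pause() {
  case "$1" in
    *master*) echo "master" ;;
    *)        echo "worker" ;;
  esac
}

select_pool_to_pause "master,worker"   # on SNO this prints "master"
```

The pause itself corresponds to the MachineConfigPool's real `spec.paused` field, which can be toggled manually with something like `oc patch machineconfigpool master --type merge -p '{"spec":{"paused":true}}'` (and back to `false` afterwards); the operator does the equivalent through the API.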
Clone Of:
Cloned As: 2066401
Environment:
Last Closed: 2022-03-10 16:26:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github k8snetworkplumbingwg sriov-network-operator pull 213 0 None open Better support for openshift single node 2021-12-07 18:27:49 UTC
Github openshift sriov-network-operator pull 607 0 None open Bug 2021151: Sync master 23 12 21 2021-12-23 12:42:22 UTC
Github openshift sriov-network-operator pull 609 0 None open Revert "Bug 2021151: Sync master 23 12 21" 2021-12-27 12:35:38 UTC
Github openshift sriov-network-operator pull 610 0 None open Bug 2021151: Sync master 27 12 21 2021-12-27 14:43:41 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:26:27 UTC

Description Marius Cornea 2021-11-08 13:11:44 UTC
Description of problem:

Sometimes the DU node does not get the performance profile configuration applied and MachineConfigPool stays stuck in Updating

Version-Release number of selected component (if applicable):
OCP 4.9.6
PAO 4.9.0

How reproducible:
Not every time; approximately 1 in 5 attempts.

Steps to Reproduce:

1. Deploy DU node via ZTP process from http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.9
2. Wait for OCP to finish deployment
3. Wait for the policies to get created and applied

Actual results:
Performance profile gets created but its configuration is not applied to the node:

perf profile:

spec:
  additionalKernelArgs:
  - idle=poll
  - rcupdate.rcu_normal_after_boot=0
  cpu:
    isolated: 2-23,26-47
    reserved: 0-1,24-25
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true


ssh core.lab.eng.rdu2.redhat.com -6    'cat /proc/cmdline'
BOOT_IMAGE=(hd2,gpt3)/ostree/rhcos-6837dc5ee75f6f61a4949e5954648bce575363916ef26b0b7002cfbd40a9cb8d/vmlinuz-4.18.0-305.25.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/6837dc5ee75f6f61a4949e5954648bce575363916ef26b0b7002cfbd40a9cb8d/0 ip=ens2f0:dhcp6 root=UUID=b75e1774-5260-42d9-ad5d-de3db9890cdc rw rootflags=prjquota intel_iommu=on iommu=pt


Expected results:
Configuration specified in the performance profile gets applied to the node.

Additional info:

Setup is stuck on:

oc get nodes,mcp
NAME                                        STATUS                     ROLES           AGE     VERSION
node/sno.kni-qe-1.lab.eng.rdu2.redhat.com   Ready,SchedulingDisabled   master,worker   4h22m   v1.22.1+d8c4430

NAME                                                         CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master   rendered-master-90fe2b00c7185b2de24b103db4a32ec4   False     True       False      1              0                   0                     0                      4h21m
machineconfigpool.machineconfiguration.openshift.io/worker   rendered-worker-31197fc6da09ee3f662ba1f19a8f0dda   True      False      False      0              0                   0                     0                      4h21m
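The stuck state above can be spotted mechanically from the `oc get mcp` table: a pool reporting UPDATING=True with zero ready machines. The helper below is a hypothetical diagnostic sketch; the column positions match the table shown above.

```shell
# Hypothetical helper: given `oc get mcp` output on stdin, print pools
# that report UPDATING=True ($4) while READYMACHINECOUNT ($7) is 0 --
# the symptom seen above. NR > 1 skips the header row.
stuck_pools() {
  awk 'NR > 1 && $4 == "True" && $7 == 0 { print $1 }'
}

# Feeding it the captured output from this report flags the master pool:
stuck_pools <<'EOF'
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master   rendered-master-90fe2b00c7185b2de24b103db4a32ec4   False     True       False      1              0                   0                     0                      4h21m
machineconfigpool.machineconfiguration.openshift.io/worker   rendered-worker-31197fc6da09ee3f662ba1f19a8f0dda   True      False      False      0              0                   0                     0                      4h21m
EOF
```

In practice one would pipe live output, e.g. `oc get mcp --no-headers | stuck_pools` (dropping the `NR > 1` guard when headers are suppressed).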

Comment 3 Marius Cornea 2021-11-09 13:51:10 UTC
The issue was reproduced again today.

Comment 12 Ken Young 2021-11-24 13:50:19 UTC
*** Bug 2021534 has been marked as a duplicate of this bug. ***

Comment 13 Ken Young 2021-11-24 13:50:56 UTC
*** Bug 2022665 has been marked as a duplicate of this bug. ***

Comment 18 Ken Young 2021-12-08 21:09:36 UTC
*** Bug 2015305 has been marked as a duplicate of this bug. ***

Comment 21 zhaozhanqi 2022-01-14 10:54:42 UTC
Hi Marius Cornea, could you help verify this bug? Assigning the QA contact to you, thanks.

Comment 22 Marius Cornea 2022-01-26 18:15:29 UTC
Verified on a 4.10 DU node deployed via ZTP process with sriov-network-operator.4.10.0-202201210948

[root@sno core]# grep -Ri 'reqReboot true'  /var/log/pods/openshift-sriov-network-operator*
[root@sno core]#

Comment 24 Angie Wang 2022-03-04 17:55:20 UTC
*** Bug 2016600 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2022-03-10 16:26:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

