Bug 2021151

Summary: Sometimes the DU node does not get the performance profile configuration applied and MachineConfigPool stays stuck in Updating
Product: OpenShift Container Platform Reporter: Marius Cornea <mcornea>
Component: Networking Assignee: Sebastian Scheinkman <sscheink>
Networking sub component: SR-IOV QA Contact: Marius Cornea <mcornea>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: achernet, browsell, imiller, jdelft, jerzhang, sscheink, trozet, yliu1
Version: 4.9   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: On SNO, the SR-IOV operator was not pausing the machine config pool that the node is assigned to, so the MachineConfigPool stayed stuck in Updating.
Consequence: The machine configuration could not be updated.
Fix: The SR-IOV operator now pauses the correct machine config pool before applying any configuration that requires a reboot.
Result: The SR-IOV operator is fully supported on SNO deployments.
Story Points: ---
Clone Of:
: 2066401 (view as bug list) Environment:
Last Closed: 2022-03-10 16:26:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2066401    

Description Marius Cornea 2021-11-08 13:11:44 UTC
Description of problem:

Sometimes the DU node does not get the performance profile configuration applied and MachineConfigPool stays stuck in Updating

Version-Release number of selected component (if applicable):
OCP 4.9.6
PAO 4.9.0

How reproducible:
Not every time; approximately 1 in 5 attempts

Steps to Reproduce:

1. Deploy a DU node via the ZTP process from http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.9
2. Wait for OCP to finish deployment
3. Wait for the policies to get created and applied

Actual results:
The performance profile gets created but its configuration is not applied to the node:

perf profile:

spec:
  additionalKernelArgs:
  - idle=poll
  - rcupdate.rcu_normal_after_boot=0
  cpu:
    isolated: 2-23,26-47
    reserved: 0-1,24-25
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true


ssh core.lab.eng.rdu2.redhat.com -6    'cat /proc/cmdline'
BOOT_IMAGE=(hd2,gpt3)/ostree/rhcos-6837dc5ee75f6f61a4949e5954648bce575363916ef26b0b7002cfbd40a9cb8d/vmlinuz-4.18.0-305.25.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/6837dc5ee75f6f61a4949e5954648bce575363916ef26b0b7002cfbd40a9cb8d/0 ip=ens2f0:dhcp6 root=UUID=b75e1774-5260-42d9-ad5d-de3db9890cdc rw rootflags=prjquota intel_iommu=on iommu=pt
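
The mismatch between the profile and the running kernel can be checked mechanically. The following is a minimal sketch, not part of the original report: the function name missing_kernel_args is hypothetical, and the sample string is a shortened copy of the /proc/cmdline output above (the BOOT_IMAGE, ostree, ip and root entries are omitted for brevity). It only compares the additionalKernelArgs entries; other settings the profile generates, such as CPU isolation or hugepages, are not covered.

```python
def missing_kernel_args(cmdline: str, expected: list[str]) -> list[str]:
    """Return the expected kernel arguments that are absent from a /proc/cmdline string."""
    present = set(cmdline.split())
    return [arg for arg in expected if arg not in present]


# Arguments taken from additionalKernelArgs in the performance profile above.
expected_args = ["idle=poll", "rcupdate.rcu_normal_after_boot=0"]

# Shortened copy of the reported cmdline (assumption: trimmed for readability).
reported_cmdline = (
    "random.trust_cpu=on console=tty0 console=ttyS0,115200n8 "
    "ignition.platform.id=metal rw rootflags=prjquota intel_iommu=on iommu=pt"
)

# Both profile arguments are missing from the reported cmdline.
print(missing_kernel_args(reported_cmdline, expected_args))
```

An empty list would indicate that every additionalKernelArgs entry reached the node.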


Expected results:
The configuration specified in the performance profile gets applied to the node.

Additional info:

Setup is stuck on:

oc get nodes,mcp
NAME                                        STATUS                     ROLES           AGE     VERSION
node/sno.kni-qe-1.lab.eng.rdu2.redhat.com   Ready,SchedulingDisabled   master,worker   4h22m   v1.22.1+d8c4430

NAME                                                         CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master   rendered-master-90fe2b00c7185b2de24b103db4a32ec4   False     True       False      1              0                   0                     0                      4h21m
machineconfigpool.machineconfiguration.openshift.io/worker   rendered-worker-31197fc6da09ee3f662ba1f19a8f0dda   True      False      False      0              0                   0                     0                      4h21m
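
For repeated reproduction attempts, the pool state shown above could be picked out of the table automatically. This is a rough sketch under stated assumptions, not something used in the original report: stuck_pools is a hypothetical helper, it expects the text of the mcp table exactly as printed by `oc get mcp` (header row included), and it assumes every column holds a non-empty value so that whitespace splitting lines up with the header. It reports a single snapshot; deciding that a pool is truly stuck still requires observing the same state over time, as in the 4h21m case above.

```python
def stuck_pools(mcp_table: str) -> list[str]:
    """Return names of pools whose UPDATED column is False and UPDATING column is True."""
    lines = [ln for ln in mcp_table.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    updated_idx = header.index("UPDATED")
    updating_idx = header.index("UPDATING")
    stuck = []
    for row in lines[1:]:
        cols = row.split()
        if cols[updated_idx] == "False" and cols[updating_idx] == "True":
            stuck.append(cols[0])
    return stuck


# Table text copied from the report above (trailing count columns omitted).
sample = """
NAME                                                         CONFIG                                             UPDATED   UPDATING   DEGRADED
machineconfigpool.machineconfiguration.openshift.io/master   rendered-master-90fe2b00c7185b2de24b103db4a32ec4   False     True       False
machineconfigpool.machineconfiguration.openshift.io/worker   rendered-worker-31197fc6da09ee3f662ba1f19a8f0dda   True      False      False
"""

# Only the master pool is in the UPDATED=False / UPDATING=True state.
print(stuck_pools(sample))
```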

Comment 3 Marius Cornea 2021-11-09 13:51:10 UTC
The issue was reproduced again today.

Comment 12 Ken Young 2021-11-24 13:50:19 UTC
*** Bug 2021534 has been marked as a duplicate of this bug. ***

Comment 13 Ken Young 2021-11-24 13:50:56 UTC
*** Bug 2022665 has been marked as a duplicate of this bug. ***

Comment 18 Ken Young 2021-12-08 21:09:36 UTC
*** Bug 2015305 has been marked as a duplicate of this bug. ***

Comment 21 zhaozhanqi 2022-01-14 10:54:42 UTC
Hi Marius Cornea, could you help verify this bug? I have assigned the QA contact to you. Thanks.

Comment 22 Marius Cornea 2022-01-26 18:15:29 UTC
Verified on a 4.10 DU node deployed via the ZTP process with sriov-network-operator.4.10.0-202201210948

[root@sno core]# grep -Ri 'reqReboot true'  /var/log/pods/openshift-sriov-network-operator*
[root@sno core]#

Comment 24 Angie Wang 2022-03-04 17:55:20 UTC
*** Bug 2016600 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2022-03-10 16:26:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056