Bug 2021151 - Sometimes the DU node does not get the performance profile configuration applied and MachineConfigPool stays stuck in Updating
Summary: Sometimes the DU node does not get the performance profile configuration appl...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Sebastian Scheinkman
QA Contact: Marius Cornea
URL:
Whiteboard:
Duplicates: 2015305 2016600 2021534 2022665
Depends On:
Blocks: 2066401
 
Reported: 2021-11-08 13:11 UTC by Marius Cornea
Modified: 2022-03-21 16:50 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On SNO deployments, the SR-IOV operator did not pause the machine config pool that the node is assigned to, so the MachineConfigPool stayed stuck in Updating.
Consequence: The machine config could not be updated.
Fix: The SR-IOV operator now pauses the correct machine config pool before running any configuration that requires a reboot.
Result: The SR-IOV operator is fully supported on SNO deployments.
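The pool-selection logic behind the fix can be sketched as follows. This is a hypothetical illustration, not the operator's actual code: on SNO the single node carries both the master and worker roles, and its rendered machine config comes from the master pool, so that is the pool that must be paused before any reboot-requiring change.

```shell
# Hypothetical sketch: pick the machine config pool to pause, given the
# comma-separated roles of the node the SR-IOV config targets.
# On SNO the node is labeled "master,worker"; the master pool renders
# its config, so the master pool is the one that must be paused.
select_pool_to_pause() {
  case "$1" in
    *master*) echo "master" ;;
    *)        echo "worker" ;;
  esac
}

select_pool_to_pause "master,worker"   # on SNO this prints "master"
```

The pause itself corresponds to the MachineConfigPool's real `spec.paused` field, which can be toggled manually with something like `oc patch machineconfigpool master --type merge -p '{"spec":{"paused":true}}'` (and back to `false` afterwards); the operator does the equivalent through the API.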
Clone Of:
Cloned As: 2066401
Environment:
Last Closed: 2022-03-10 16:26:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github k8snetworkplumbingwg sriov-network-operator pull 213 0 None open Better support for openshift single node 2021-12-07 18:27:49 UTC
Github openshift sriov-network-operator pull 607 0 None open Bug 2021151: Sync master 23 12 21 2021-12-23 12:42:22 UTC
Github openshift sriov-network-operator pull 609 0 None open Revert "Bug 2021151: Sync master 23 12 21" 2021-12-27 12:35:38 UTC
Github openshift sriov-network-operator pull 610 0 None open Bug 2021151: Sync master 27 12 21 2021-12-27 14:43:41 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:26:27 UTC

Description Marius Cornea 2021-11-08 13:11:44 UTC
Description of problem:

Sometimes the DU node does not get the performance profile configuration applied and MachineConfigPool stays stuck in Updating

Version-Release number of selected component (if applicable):
OCP 4.9.6
PAO 4.9.0

How reproducible:
Not every time; approximately 1 in 5 attempts.

Steps to Reproduce:

1. Deploy DU node via ZTP process from http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.9
2. Wait for OCP to finish deployment
3. Wait for the policies to get created and applied

Actual results:
Performance profile gets created but its configuration is not applied to the node:

perf profile:

spec:
  additionalKernelArgs:
  - idle=poll
  - rcupdate.rcu_normal_after_boot=0
  cpu:
    isolated: 2-23,26-47
    reserved: 0-1,24-25
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true


ssh core.lab.eng.rdu2.redhat.com -6    'cat /proc/cmdline'
BOOT_IMAGE=(hd2,gpt3)/ostree/rhcos-6837dc5ee75f6f61a4949e5954648bce575363916ef26b0b7002cfbd40a9cb8d/vmlinuz-4.18.0-305.25.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/6837dc5ee75f6f61a4949e5954648bce575363916ef26b0b7002cfbd40a9cb8d/0 ip=ens2f0:dhcp6 root=UUID=b75e1774-5260-42d9-ad5d-de3db9890cdc rw rootflags=prjquota intel_iommu=on iommu=pt


Expected results:
Configuration specified in the performance profile gets applied to the node.

Additional info:

Setup is stuck on:

oc get nodes,mcp
NAME                                        STATUS                     ROLES           AGE     VERSION
node/sno.kni-qe-1.lab.eng.rdu2.redhat.com   Ready,SchedulingDisabled   master,worker   4h22m   v1.22.1+d8c4430

NAME                                                         CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master   rendered-master-90fe2b00c7185b2de24b103db4a32ec4   False     True       False      1              0                   0                     0                      4h21m
machineconfigpool.machineconfiguration.openshift.io/worker   rendered-worker-31197fc6da09ee3f662ba1f19a8f0dda   True      False      False      0              0                   0                     0                      4h21m
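The stuck state above can be spotted mechanically from the `oc get mcp` table: a pool reporting UPDATING=True with zero ready machines. The helper below is a hypothetical diagnostic sketch; the column positions match the table shown above.

```shell
# Hypothetical helper: given `oc get mcp` output on stdin, print pools
# that report UPDATING=True ($4) while READYMACHINECOUNT ($7) is 0 --
# the symptom seen above. NR > 1 skips the header row.
stuck_pools() {
  awk 'NR > 1 && $4 == "True" && $7 == 0 { print $1 }'
}

# Feeding it the captured output from this report flags the master pool:
stuck_pools <<'EOF'
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master   rendered-master-90fe2b00c7185b2de24b103db4a32ec4   False     True       False      1              0                   0                     0                      4h21m
machineconfigpool.machineconfiguration.openshift.io/worker   rendered-worker-31197fc6da09ee3f662ba1f19a8f0dda   True      False      False      0              0                   0                     0                      4h21m
EOF
```

In practice one would pipe live output, e.g. `oc get mcp --no-headers | stuck_pools` (dropping the `NR > 1` guard when headers are suppressed).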

Comment 3 Marius Cornea 2021-11-09 13:51:10 UTC
The issue was reproduced again today.

Comment 12 Ken Young 2021-11-24 13:50:19 UTC
*** Bug 2021534 has been marked as a duplicate of this bug. ***

Comment 13 Ken Young 2021-11-24 13:50:56 UTC
*** Bug 2022665 has been marked as a duplicate of this bug. ***

Comment 18 Ken Young 2021-12-08 21:09:36 UTC
*** Bug 2015305 has been marked as a duplicate of this bug. ***

Comment 21 zhaozhanqi 2022-01-14 10:54:42 UTC
Hi Marius Cornea, could you help verify this bug? Assigning the QA contact to you, thanks.

Comment 22 Marius Cornea 2022-01-26 18:15:29 UTC
Verified on a 4.10 DU node deployed via ZTP process with sriov-network-operator.4.10.0-202201210948

[root@sno core]# grep -Ri 'reqReboot true'  /var/log/pods/openshift-sriov-network-operator*
[root@sno core]#

Comment 24 Angie Wang 2022-03-04 17:55:20 UTC
*** Bug 2016600 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2022-03-10 16:26:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

