This bug was initially created as a copy of Bug #1921321.

I am copying this bug because:

Description of problem:
When applying an SriovNetworkNodePolicy in conjunction with a MachineConfig that takes a while to apply (such as switching to the rt-kernel), SR-IOV reboots the node in the middle of that process. When the node comes back online, it is left in an intermediate state that it cannot reconcile. IMHO this is a design bug; all node configuration changes should be done through the MCO.

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Very often, with the steps below.

Steps to Reproduce:
This needs a node with an Intel SR-IOV capable NIC. Make sure to update the SriovNetworkNodePolicy with that NIC name (see the manifest sketch at the end of this report), then:
1. oc apply -f reproducer.yaml  # it is expected to fail on missing CRDs
2. Wait for the cluster to settle and sriov-network-operator to become operational.
3. Apply worker-duprofile to the node.
4. oc apply -f reproducer.yaml  # again, to apply the missing CRs
5. Inspect sriov-network-config-daemon and machine-config-daemon on that node to see what is happening.

Actual results:
No kernel-rt on the node.

Expected results:
kernel-rt on the node.

Additional info:
This is the BZ for the MCO part: https://bugzilla.redhat.com/show_bug.cgi?id=1916169
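For reference, a minimal sketch of the two kinds of manifests the reproducer combines. This is not the actual reproducer.yaml; the object names, resourceName, and pfNames value are hypothetical placeholders:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-intel                          # hypothetical name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics                     # hypothetical resource name
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 4
  nicSelector:
    pfNames: ["ens1f0"]                       # replace with the NIC name on your node
  deviceType: netdevice
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-realtime                    # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelType: realtime                        # slow-to-apply change: switches the node to kernel-rt

Applying both at roughly the same time is what exposes the race: the SR-IOV daemon's node reboot lands in the middle of the MCO's kernel switch.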
Verified this bug on 4.7.0-202105211528.p0

# oc logs sriov-network-config-daemon-4qmgg | grep MCP
I0524 09:40:22.613936 690227 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0524 13:27:41.098274 7778 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0526 05:23:46.372124 7205 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0526 05:24:29.259393 7205 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0526 05:24:33.877516 7205 daemon.go:861] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-05-24 03:00:29 +0000 UTC } {NodeDegraded False 2021-05-24 03:00:34 +0000 UTC } {Degraded False 2021-05-24 03:00:34 +0000 UTC } {Updated False 2021-05-26 05:24:33 +0000 UTC } {Updating True 2021-05-26 05:24:33 +0000 UTC All nodes are updating to rendered-worker-b58b27a1b88a1d318d9816e8c2766c8a}], wait...
I0526 05:24:38.859707 7205 daemon.go:861] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-05-24 03:00:29 +0000 UTC } {NodeDegraded False 2021-05-24 03:00:34 +0000 UTC } {Degraded False 2021-05-24 03:00:34 +0000 UTC } {Updated False 2021-05-26 05:24:33 +0000 UTC } {Updating True 2021-05-26 05:24:33 +0000 UTC All nodes are updating to rendered-worker-b58b27a1b88a1d318d9816e8c2766c8a}], wait...
I0526 05:30:22.220643 6474 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0526 05:31:34.684576 6474 daemon.go:579] completeDrain(): resume MCP worker
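The log above shows the fixed behavior: the config daemon resolves the node's pool, waits in drainNode() while the MCP reports Updating=True, and resumes the pool from completeDrain() once the drain finishes. The pool conditions it waits on can be inspected directly with standard oc commands, e.g.:

# overall pool state (UPDATED/UPDATING/DEGRADED columns)
oc get mcp worker

# just the Updating condition the daemon waits on
oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Updating")].status}'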
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.13 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2121
Need to backport one more patch, which fixes the scenario where a custom MCP is created (see the sketch below).
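For context, "custom MCP" here means the node belongs to a pool other than the built-in worker pool. A minimal sketch of such a pool, following the standard custom-pool pattern (the pool name and node label are hypothetical):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-rt                             # hypothetical custom pool
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-rt]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-rt: ""   # nodes labeled into the custom pool

In that case getNodeMachinePool() has to resolve the node to worker-rt rather than worker before pausing/resuming, which is the scenario the additional patch covers.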
Verified this bug on 4.7.0-202106170722
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.29 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3303