1960103 – SR-IOV obliviously reboot the node

Summary: SR-IOV obliviously reboot the node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.z
Assignee:	Peng Liu
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:	1921321
Blocks:	1960263

Reported:	2021-05-13 03:58 UTC by Peng Liu
Modified:	2021-09-08 13:18 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1960263
Environment:
Last Closed:	2021-09-08 13:17:53 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift sriov-network-operator pull 504	None	open	[release-4.7] Bug 1960103: Pause MCP before draining/rebooting node	2021-05-13 04:03:36 UTC
Github	openshift sriov-network-operator pull 521	None	open	[release-4.7] Bug 1960103: Find the MCP based on the owner of node's desired MC	2021-06-17 04:42:24 UTC
Red Hat Product Errata	RHSA-2021:2121	None	None	None	2021-06-01 04:50:42 UTC
Red Hat Product Errata	RHSA-2021:3303	None	None	None	2021-09-08 13:18:17 UTC

Description Peng Liu 2021-05-13 03:58:55 UTC

This bug was initially created as a copy of Bug #1921321

Description of problem:
When applying a SriovNetworkNodePolicy together with a MachineConfig that takes a while to apply (such as switching to the rt-kernel),
SR-IOV reboots the node in the middle of that process.
When the node comes back online, it is left in an intermediate state that it cannot reconcile.

IMHO this is a design bug:
all node configuration changes should be done through MCO.

Version-Release number of selected component (if applicable):
4.7


How reproducible:
Very often, with the steps below.


Steps to Reproduce:
To use the reproducer:
This needs a node with an Intel SR-IOV-capable NIC.
Make sure to update the SriovNetworkNodePolicy with that NIC's name,
then:
1. oc apply -f reproducer.yaml # this is expected to fail on the missing CRDs
2. Wait for the cluster to settle and for sriov-network-operator to become operational.
3. Apply worker-duprofile to the node.
4. oc apply -f reproducer.yaml # run again to apply the missing CRs
5. Inspect sriov-daemon and machine-config-daemon on that node to see what is happening.

Actual results:
no kernel-rt on node

Expected results:
kernel-rt on node

Additional info:
This is the BZ for the MCO part: https://bugzilla.redhat.com/show_bug.cgi?id=1916169

Comment 4 zhaozhanqi 2021-05-26 06:53:00 UTC


Verified this bug on 4.7.0-202105211528.p0


# oc logs sriov-network-config-daemon-4qmgg | grep MCP
I0524 09:40:22.613936  690227 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0524 13:27:41.098274    7778 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0526 05:23:46.372124    7205 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0526 05:24:29.259393    7205 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0526 05:24:33.877516    7205 daemon.go:861] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-05-24 03:00:29 +0000 UTC  } {NodeDegraded False 2021-05-24 03:00:34 +0000 UTC  } {Degraded False 2021-05-24 03:00:34 +0000 UTC  } {Updated False 2021-05-26 05:24:33 +0000 UTC  } {Updating True 2021-05-26 05:24:33 +0000 UTC  All nodes are updating to rendered-worker-b58b27a1b88a1d318d9816e8c2766c8a}], wait...
I0526 05:24:38.859707    7205 daemon.go:861] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-05-24 03:00:29 +0000 UTC  } {NodeDegraded False 2021-05-24 03:00:34 +0000 UTC  } {Degraded False 2021-05-24 03:00:34 +0000 UTC  } {Updated False 2021-05-26 05:24:33 +0000 UTC  } {Updating True 2021-05-26 05:24:33 +0000 UTC  All nodes are updating to rendered-worker-b58b27a1b88a1d318d9816e8c2766c8a}], wait...
I0526 05:30:22.220643    6474 daemon.go:768] getNodeMachinePool(): find node in MCP worker
I0526 05:31:34.684576    6474 daemon.go:579] completeDrain(): resume MCP worker
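
The ordering visible in the log above (wait while the MCP is updating, then drain, then resume the MCP) can be sketched roughly as follows. This is only an illustrative Python sketch, not the operator's code: the function names, the callback parameters, and the exact readiness rule are assumptions made for this example, and the pause step is taken from the title of pull 504 rather than from the log. Note also that the resume line in the log comes from a different process ID (6474 rather than 7205), which suggests the real daemon resumes the pool after it restarts; the sketch collapses that into a single call.

```python
from typing import Callable, Dict, Iterable, List

Condition = Dict[str, str]  # e.g. {"type": "Updating", "status": "True"}


def mcp_is_ready(conditions: Iterable[Condition]) -> bool:
    """Assumed rule: nothing degraded, Updated is True, Updating is not True."""
    status = {c["type"]: c["status"] for c in conditions}
    degraded_types = ("Degraded", "NodeDegraded", "RenderDegraded")
    if any(status.get(t) == "True" for t in degraded_types):
        return False
    return status.get("Updated") == "True" and status.get("Updating") != "True"


def drain_with_mcp_paused(
    get_conditions: Callable[[], List[Condition]],  # hypothetical: read MCP conditions
    set_paused: Callable[[bool], None],             # hypothetical: pause/resume the MCP
    apply_change: Callable[[], None],               # stand-in for drain/configure/reboot
    sleep: Callable[[float], None],
    poll_seconds: float = 5.0,
    max_polls: int = 100,
) -> bool:
    """Wait until the MCP is ready, pause it, apply the change, then resume it."""
    for _ in range(max_polls):
        if mcp_is_ready(get_conditions()):
            break
        sleep(poll_seconds)  # corresponds to the "is not ready ... wait..." log lines
    else:
        return False  # the pool never became ready; nothing was paused or drained

    set_paused(True)
    try:
        apply_change()
    finally:
        set_paused(False)  # always resume so that MCO is not left blocked
    return True
```

Passing the cluster operations in as callbacks keeps the sketch runnable without a cluster; in a real setting they would be replaced by calls to the Kubernetes API.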

Comment 6 errata-xmlrpc 2021-06-01 04:50:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.13 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2121

Comment 7 Peng Liu 2021-06-17 04:42:02 UTC

Need to backport one more patch, which fixes the scenario where a custom MCP is created.
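
For illustration, the approach named in the title of pull 521 ("Find the MCP based on the owner of node's desired MC") might look roughly like the Python sketch below. It is not taken from the patch: the function name and the plain-dict object shapes are assumptions for this example, as is the expectation that the node's desired config is recorded in the machineconfiguration.openshift.io/desiredConfig annotation and that the rendered MachineConfig carries an ownerReference of kind MachineConfigPool.

```python
from typing import Any, Dict, List, Optional

# Assumed annotation that records the node's desired rendered MachineConfig.
DESIRED_CONFIG_ANNOTATION = "machineconfiguration.openshift.io/desiredConfig"


def find_pool_for_node(
    node: Dict[str, Any], machine_configs: List[Dict[str, Any]]
) -> Optional[str]:
    """Return the name of the MCP that owns the node's desired MC, if any."""
    annotations = node.get("metadata", {}).get("annotations", {})
    desired = annotations.get(DESIRED_CONFIG_ANNOTATION)
    if not desired:
        return None
    for mc in machine_configs:
        if mc["metadata"]["name"] != desired:
            continue
        # The owning pool is read from the MC's ownerReferences.
        for owner in mc["metadata"].get("ownerReferences", []):
            if owner.get("kind") == "MachineConfigPool":
                return owner["name"]
    return None
```

Resolving the pool through the desired MachineConfig, rather than assuming the default worker pool, is what would let a node in a custom pool be matched to the pool that actually manages it.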

Comment 9 zhaozhanqi 2021-06-21 01:41:59 UTC

Verified this bug on 4.7.0-202106170722

Comment 12 errata-xmlrpc 2021-09-08 13:17:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.29 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3303
