Bug 1834141 - Apply sriov network policy stuck
Summary: Apply sriov network policy stuck
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1771572
 
Reported: 2020-05-11 07:42 UTC by Sebastian Scheinkman
Modified: 2020-07-13 17:37 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:36:48 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: openshift/sriov-network-operator pull 207 (closed) - Bug 1834141: Fix draining watch (last updated 2020-10-28 04:52:51 UTC)
- Red Hat Product Errata: RHBA-2020:2409 (last updated 2020-07-13 17:37:26 UTC)

Description Sebastian Scheinkman 2020-05-11 07:42:58 UTC
Description of problem:
Failed to apply an SR-IOV network policy on the cluster.

I0510 17:28:29.959345  358401 daemon.go:535] drainNode(): node cnfd0-worker-0.fci1.kni.lab.eng.bos.redhat.com is draining
I0510 17:28:31.760679  358401 daemon.go:532] drainNode(): Check if any other node is draining
...


How reproducible:
100%

Steps to Reproduce:
1. Create an SR-IOV network policy for a node.
2. Watch the logs of the relevant sriov config-daemon pod.
When you see these log lines:
I0510 17:36:04.332160  358401 daemon.go:586] drainNode(): drain complete
I0510 17:36:04.332164  358401 daemon.go:474] annotateNode(): Annotate node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com with: Idle

3. Check the node using "oc get node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com -oyaml | more".

The sriovnetwork.openshift.io/state annotation on that node will still be Draining instead of Idle.
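
To inspect just that annotation, a jsonpath query should also work (a hypothetical one-liner, not taken from the original report):

oc get node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com -o jsonpath='{.metadata.annotations.sriovnetwork\.openshift\.io/state}'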

Comment 1 Federico Simoncelli 2020-05-11 09:08:44 UTC
Is this affecting 4.4 as well?

Comment 2 Peng Liu 2020-05-11 09:21:17 UTC
@Federico It will not affect 4.4. The new draining logic was introduced by https://github.com/openshift/sriov-network-operator/pull/165, which hasn't yet been backported to 4.4.

Comment 3 zenghui.shi 2020-05-11 10:47:07 UTC
Out of curiosity, why are all the nodes in the Draining state? Is it because the node is not set back to Idle (from Draining) after the policy is deleted?

Comment 7 Sebastian Scheinkman 2020-05-11 11:46:22 UTC
(In reply to zenghui.shi from comment #3)
> Out of curiosity why all nodes are in draining states. Is it because the
> node is not set to Idle (from Draining) after deleting policy?

Thanks for the comment.

Not all of the nodes are in the Draining state, only one.

The general flow is:
1. Run an informer to check whether any node is in the Draining state.
2. If no node is in that state, we set the current node to "Draining" and start the drain on the node.
3. When the drain finishes, we change the state on that node back to Idle.
4. Another config-daemon goes back to step 2 and marks its node as Draining.

The issue was that the step that marks the node as Idle again didn't work.
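
For illustration, here is a minimal Go sketch of that flow using client-go. The annotation key is the one from this bug, but the function names, the polling loop, and the elided drain step are assumptions for readability; the real daemon coordinates through a watch/informer, which is what the attached PR ("Fix draining watch") addresses.

// drain_sketch.go - illustrative only, not the operator's actual code.
package drainsketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const stateAnno = "sriovnetwork.openshift.io/state" // annotation key from this bug

// anotherNodeDraining reports whether any node other than `self`
// currently holds the Draining state (step 1 of the flow above).
func anotherNodeDraining(ctx context.Context, c kubernetes.Interface, self string) (bool, error) {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, n := range nodes.Items {
		if n.Name != self && n.Annotations[stateAnno] == "Draining" {
			return true, nil
		}
	}
	return false, nil
}

// annotateNode writes the state annotation back to the API server
// (steps 2 and 3). Per this bug, the final write back to "Idle" was
// the part that did not take effect, leaving the node stuck in "Draining".
func annotateNode(ctx context.Context, c kubernetes.Interface, name, state string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[stateAnno] = state
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

// drainNode ties the steps together: wait for our turn, claim Draining,
// drain, then release by setting the state back to Idle.
func drainNode(ctx context.Context, c kubernetes.Interface, self string) error {
	for { // step 1: wait until no other node is draining
		busy, err := anotherNodeDraining(ctx, c, self)
		if err != nil {
			return err
		}
		if !busy {
			break
		}
		time.Sleep(3 * time.Second) // the real daemon watches instead of polling
	}
	if err := annotateNode(ctx, c, self, "Draining"); err != nil { // step 2
		return err
	}
	// ... cordon the node and evict its pods here (elided) ...
	return annotateNode(ctx, c, self, "Idle") // step 3: release
}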

Comment 9 errata-xmlrpc 2020-07-13 17:36:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

