Bug 1834141 - Apply sriov network policy stuck
Summary: Apply sriov network policy stuck
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1771572
 
Reported: 2020-05-11 07:42 UTC by Sebastian Scheinkman
Modified: 2020-07-13 17:37 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:36:48 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: openshift/sriov-network-operator pull 207 (closed) - Bug 1834141: Fix draining watch (last updated 2020-10-28 04:52:51 UTC)
- Red Hat Product Errata: RHBA-2020:2409 (last updated 2020-07-13 17:37:26 UTC)

Description Sebastian Scheinkman 2020-05-11 07:42:58 UTC
Description of problem:
Failed to apply an SR-IOV network policy on the cluster.

I0510 17:28:29.959345  358401 daemon.go:535] drainNode(): node cnfd0-worker-0.fci1.kni.lab.eng.bos.redhat.com is draining
I0510 17:28:31.760679  358401 daemon.go:532] drainNode(): Check if any other node is draining
...


How reproducible:
100%

Steps to Reproduce:
1. Create an SR-IOV network policy for a node.
2. Watch the logs of the relevant sriov config-daemon pod.
When you see these log lines:
I0510 17:36:04.332160  358401 daemon.go:586] drainNode(): drain complete
I0510 17:36:04.332164  358401 daemon.go:474] annotateNode(): Annotate node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com with: Idle

3. Check the node using "oc get node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com -oyaml | more".

The sriovnetwork.openshift.io/state annotation on that node will still be Draining instead of Idle.
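
To inspect just that annotation, a jsonpath query should also work (a hypothetical one-liner, not taken from the original report):

oc get node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com -o jsonpath='{.metadata.annotations.sriovnetwork\.openshift\.io/state}'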

Comment 1 Federico Simoncelli 2020-05-11 09:08:44 UTC
Is this affecting 4.4 as well?

Comment 2 Peng Liu 2020-05-11 09:21:17 UTC
@Federico It will not affect 4.4. The new draining logic was introduced by https://github.com/openshift/sriov-network-operator/pull/165, which hasn't yet been backported to 4.4.

Comment 3 zenghui.shi 2020-05-11 10:47:07 UTC
Out of curiosity, why are all the nodes in the Draining state? Is it because the node is not set back to Idle (from Draining) after the policy is deleted?

Comment 7 Sebastian Scheinkman 2020-05-11 11:46:22 UTC
(In reply to zenghui.shi from comment #3)
> Out of curiosity why all nodes are in draining states. Is it because the
> node is not set to Idle (from Draining) after deleting policy?

Thanks for the comment.

Not all of the nodes are in the Draining state, only one.

The general flow is:
1. Run an informer to check whether any node is in the Draining state.
2. If no node is in that state, we set the current node to "Draining" and start the drain on the node.
3. When the drain finishes, we change the state on that node back to Idle.
4. Another config-daemon goes back to step 2 and marks its node as Draining.

The issue was that the step that marks the node as Idle again didn't work.
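
For illustration, here is a minimal Go sketch of that flow using client-go. The annotation key is the one from this bug, but the function names, the polling loop, and the elided drain step are assumptions for readability; the real daemon coordinates through a watch/informer, which is what the attached PR ("Fix draining watch") addresses.

// drain_sketch.go - illustrative only, not the operator's actual code.
package drainsketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const stateAnno = "sriovnetwork.openshift.io/state" // annotation key from this bug

// anotherNodeDraining reports whether any node other than `self`
// currently holds the Draining state (step 1 of the flow above).
func anotherNodeDraining(ctx context.Context, c kubernetes.Interface, self string) (bool, error) {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, n := range nodes.Items {
		if n.Name != self && n.Annotations[stateAnno] == "Draining" {
			return true, nil
		}
	}
	return false, nil
}

// annotateNode writes the state annotation back to the API server
// (steps 2 and 3). Per this bug, the final write back to "Idle" was
// the part that did not take effect, leaving the node stuck in "Draining".
func annotateNode(ctx context.Context, c kubernetes.Interface, name, state string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[stateAnno] = state
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

// drainNode ties the steps together: wait for our turn, claim Draining,
// drain, then release by setting the state back to Idle.
func drainNode(ctx context.Context, c kubernetes.Interface, self string) error {
	for { // step 1: wait until no other node is draining
		busy, err := anotherNodeDraining(ctx, c, self)
		if err != nil {
			return err
		}
		if !busy {
			break
		}
		time.Sleep(3 * time.Second) // the real daemon watches instead of polling
	}
	if err := annotateNode(ctx, c, self, "Draining"); err != nil { // step 2
		return err
	}
	// ... cordon the node and evict its pods here (elided) ...
	return annotateNode(ctx, c, self, "Idle") // step 3: release
}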

Comment 9 errata-xmlrpc 2020-07-13 17:36:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

