Bug 1834141

Summary: Applying an SR-IOV network policy gets stuck
Product: OpenShift Container Platform
Component: Networking
Networking sub component: SR-IOV
Version: 4.5
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Sebastian Scheinkman <sscheink>
Assignee: Peng Liu <pliu>
QA Contact: zhaozhanqi <zzhao>
CC: fsimonce, pliu, yadu, zshi
Type: Bug
Last Closed: 2020-07-13 17:36:48 UTC
Bug Blocks: 1771572

Description Sebastian Scheinkman 2020-05-11 07:42:58 UTC
Description of problem:
Failed to apply an SR-IOV network policy on the cluster.

I0510 17:28:29.959345  358401 daemon.go:535] drainNode(): node cnfd0-worker-0.fci1.kni.lab.eng.bos.redhat.com is draining
I0510 17:28:31.760679  358401 daemon.go:532] drainNode(): Check if any other node is draining
...


How reproducible:
100%

Steps to Reproduce:
1. Create an SR-IOV network policy for a node.
2. Watch the logs of the relevant SR-IOV config-daemon until you see the following messages:
I0510 17:36:04.332160  358401 daemon.go:586] drainNode(): drain complete
I0510 17:36:04.332164  358401 daemon.go:474] annotateNode(): Annotate node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com with: Idle

3. Check the node using "oc get node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com -oyaml | more"

The sriovnetwork.openshift.io/state annotation on that node will still be Draining and not Idle.
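
To check just the state directly (assuming it is stored as a node annotation, as the annotateNode() log above suggests), a jsonpath query such as "oc get node cnfd1-worker-0.fci1.kni.lab.eng.bos.redhat.com -o jsonpath='{.metadata.annotations.sriovnetwork\.openshift\.io/state}'" should print the current value.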

Comment 1 Federico Simoncelli 2020-05-11 09:08:44 UTC
Is this affecting 4.4 as well?

Comment 2 Peng Liu 2020-05-11 09:21:17 UTC
@Federico It will not affect 4.4. The new draining logic was introduced by https://github.com/openshift/sriov-network-operator/pull/165, which hasn't yet been backported to 4.4.

Comment 3 zenghui.shi 2020-05-11 10:47:07 UTC
Out of curiosity, why are all the nodes in the Draining state? Is it because the node is not set back to Idle (from Draining) after the policy is deleted?

Comment 7 Sebastian Scheinkman 2020-05-11 11:46:22 UTC
(In reply to zenghui.shi from comment #3)
> Out of curiosity, why are all the nodes in the Draining state? Is it because
> the node is not set back to Idle (from Draining) after the policy is deleted?

Thanks for the comment.

Not all of the nodes are in the Draining state, only one.

The general flow is:
1. Run an informer to check whether any node is currently in the Draining state.
2. If no node is in that state, we annotate the current node as Draining and start draining it.
3. When the drain finishes, we set that node back to Idle.
4. Another config-daemon can then go back to step 2 and mark its own node as Draining.

The issue was in step 3: marking the node as Idle again didn't work.
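
For reference, a minimal sketch of steps 1-3 using client-go (>= 0.18), assuming the state is kept in the sriovnetwork.openshift.io/state node annotation; the helper names anyOtherNodeDraining and setNodeState are illustrative, not the operator's actual drainNode()/annotateNode() code:

package drainsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// stateAnnotation is the node annotation the config-daemon uses to
// coordinate draining ("Idle" / "Draining").
const stateAnnotation = "sriovnetwork.openshift.io/state"

// anyOtherNodeDraining covers steps 1-2: before draining, check whether
// any other node is already annotated as Draining.
func anyOtherNodeDraining(ctx context.Context, c kubernetes.Interface, self string) (bool, error) {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, n := range nodes.Items {
		if n.Name != self && n.Annotations[stateAnnotation] == "Draining" {
			return true, nil
		}
	}
	return false, nil
}

// setNodeState corresponds to the annotateNode() step: set the state
// annotation on the given node to "Draining" or "Idle".
func setNodeState(ctx context.Context, c kubernetes.Interface, name, state string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[stateAnnotation] = state
	// A plain Update returns a conflict error if the Node object changed in
	// between; without retry-on-conflict (or a patch) the annotation can be
	// left on its old value.
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

An unhandled update conflict in step 3 is one way the node could stay annotated Draining even though the daemon already logged the Idle annotation attempt, which would match the symptom described above.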

Comment 9 errata-xmlrpc 2020-07-13 17:36:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409