Bug 1550266

Summary: SDN fails to clear NodeNetworkUnavailable node condition on GCP
Product: OpenShift Container Platform Reporter: Ravi Sankar <rpenta>
Component: Networking Assignee: Ravi Sankar <rpenta>
Status: CLOSED ERRATA QA Contact: Meng Bo <bmeng>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.10.0 CC: aos-bugs, bbennett, hongli
Target Milestone: --- Keywords: NeedsTestCase
Target Release: 3.10.0   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Clearing the NodeNetworkUnavailable condition could sometimes fail on GCP.
Consequence: The node cannot take pod traffic until the NodeNetworkUnavailable condition is removed, which could take longer than expected.
Fix: Fixed the bug in clearing the NodeNetworkUnavailable condition.
Result: The node should be able to handle pod traffic as expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-30 19:10:04 UTC Type: Bug

Description Ravi Sankar 2018-02-28 21:33:06 UTC
Description of problem:
We found a bug in the implementation: clearing the NodeNetworkUnavailable condition could fail when there is a race between the master and the node trying to update the node status. Since we get node events for every kubelet node status update, we will eventually clear this condition.

Proposed https://github.com/openshift/origin/pull/18758 to fix this cleanly.

Version-Release number of selected component (if applicable):
oc v3.10.0-alpha.0+1d01229-4-dirty (also valid for older releases)
kubernetes v1.9.1+a0ce1bc657

How reproducible:
Not always (easy to reproduce with some instrumentation)

Verification/Testing:
Please ensure there is no regression on GCP with this fix.

Comment 1 Ravi Sankar 2018-02-28 21:33:45 UTC
https://github.com/openshift/origin/pull/18758

Comment 2 openshift-github-bot 2018-03-16 15:24:24 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/308bb2e8f4f0a198e92993f9ec7a8f5d8ca7e349
Merge pull request #18758 from pravisankar/fix-clear-nodenetwork

Automatic merge from submit-queue.

Bug 1550266 - Fix clearInitialNodeNetworkUnavailableCondition() in sdn master 

This change fixes these 2 issues:
- Currently, clearing the NodeNetworkUnavailable node condition only works
if we succeed in updating the node status during the first iteration.
Subsequent retries will not work because:
  1. knode != node
  2. node.Status is updated in memory
  3. UpdateNodeStatus(knode)
Step (3) has no effect because step (2) updates node.Status but not knode.Status.

- The Node object passed to this method is a pointer to an item in the informer
cache and it should not be modified directly.

 Avoid NodeNetworkUnavailable condition check for every node status update

- We know that the kubelet sets the NodeNetworkUnavailable condition when the node is
created/registered with the API server.
- So we only need to call clearInitialNodeNetworkUnavailableCondition()
the first time we see the node, and not during subsequent node status update events.
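
One way to picture "only the first time" is the sketch below. It is an illustration under assumptions, not the merged change: sdnMaster, its seen and clearCalls maps, handleAddOrUpdate, and handleDelete are hypothetical names, and the real handler may track first-seen nodes differently.

```go
package main

import "fmt"

// sdnMaster is a hypothetical, stripped-down stand-in for the SDN master.
type sdnMaster struct {
	seen       map[string]bool // nodes whose initial condition was handled
	clearCalls map[string]int  // how often the clear routine ran per node
}

func newSDNMaster() *sdnMaster {
	return &sdnMaster{seen: map[string]bool{}, clearCalls: map[string]int{}}
}

// clearInitialCondition stands in for
// clearInitialNodeNetworkUnavailableCondition(); here it only counts calls.
func (m *sdnMaster) clearInitialCondition(nodeName string) {
	m.clearCalls[nodeName]++
}

// handleAddOrUpdate is invoked for the node add event and for every later
// node status update. Only the first event per node triggers the clear.
func (m *sdnMaster) handleAddOrUpdate(nodeName string) {
	if m.seen[nodeName] {
		return // a later status update: skip the condition check
	}
	m.seen[nodeName] = true
	m.clearInitialCondition(nodeName)
}

// handleDelete forgets the node so that a re-registered node is handled again.
func (m *sdnMaster) handleDelete(nodeName string) {
	delete(m.seen, nodeName)
}

func main() {
	m := newSDNMaster()
	m.handleAddOrUpdate("node-1") // node registered
	for i := 0; i < 3; i++ {
		m.handleAddOrUpdate("node-1") // periodic kubelet status updates
	}
	fmt.Println("clear calls for node-1:", m.clearCalls["node-1"])
}
```

The trade-off in this sketch is that the master must forget a node when it is deleted, otherwise a node that registers again under the same name would never have its condition cleared.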

Comment 4 Hongan Li 2018-05-31 07:57:53 UTC
No issues found during regression testing on GCP with v3.10.0-0.54.0.

OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
kernel: Linux qe-310-crio-master-etcd-1 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 6 errata-xmlrpc 2018-07-30 19:10:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816