Bug 1550266 - SDN fails to clear NodeNetworkUnavailable node condition on GCP
Summary: SDN fails to clear NodeNetworkUnavailable node condition on GCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Ravi Sankar
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-28 21:33 UTC by Ravi Sankar
Modified: 2018-07-30 19:10 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Clearing the NodeNetworkUnavailable condition could sometimes fail on GCP.
Consequence: The node could not take pod traffic until the NodeNetworkUnavailable condition was eventually removed.
Fix: Fixed the bug in clearing the NodeNetworkUnavailable condition.
Result: The node should be able to handle pod traffic as expected.
Clone Of:
Environment:
Last Closed: 2018-07-30 19:10:04 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:10:30 UTC

Description Ravi Sankar 2018-02-28 21:33:06 UTC
Description of problem:
We found a bug in the implementation where clearing the NodeNetworkUnavailable condition could fail when there is a race between the master and the node trying to update the node status. Since we get node events for every kubelet node status update, we will eventually clear this condition.

Proposed https://github.com/openshift/origin/pull/18758 to fix this cleanly.

Version-Release number of selected component (if applicable):
oc v3.10.0-alpha.0+1d01229-4-dirty (also valid for older releases)
kubernetes v1.9.1+a0ce1bc657

How reproducible:
Not Always (easy with some instrumentation)

Verification/Testing:
Please ensure there is no regression on GCP with this fix.

Comment 1 Ravi Sankar 2018-02-28 21:33:45 UTC
https://github.com/openshift/origin/pull/18758

Comment 2 openshift-github-bot 2018-03-16 15:24:24 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/308bb2e8f4f0a198e92993f9ec7a8f5d8ca7e349
Merge pull request #18758 from pravisankar/fix-clear-nodenetwork

Automatic merge from submit-queue.

Bug 1550266 - Fix clearInitialNodeNetworkUnavailableCondition() in sdn master 

This change fixes these 2 issues:
- Currently, clearing NodeNetworkUnavailable node condition only works
if we are successful in updating the node status during the first iteration.
Subsequent retries will not work because:
  1. knode != node
  2. node.Status is updated in memory
  3. UpdateNodeStatus(knode)
(3) will have no effect as in step (2) node.Status is updated but not knode.Status

- The Node object passed to this method is a pointer to an item in the informer
cache, and it should not be modified directly.
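The two issues above can be reproduced with a small self-contained model. This is only a sketch under stated assumptions, not the code from the pull request: `Node`, `fakeAPI`, `clearBuggy`, `clearFixed`, and `scenario` are hypothetical names, and the fake server only imitates the resourceVersion conflict check that a real API server performs.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a pared-down, hypothetical stand-in for the Kubernetes Node object.
type Node struct {
	ResourceVersion    int
	NetworkUnavailable bool
}

var errConflict = errors.New("conflict: stale resourceVersion")

// fakeAPI imitates optimistic concurrency: updates with a stale version are rejected.
type fakeAPI struct{ stored Node }

// Get returns a fresh copy of the server-side object.
func (a *fakeAPI) Get() *Node { n := a.stored; return &n }

// UpdateStatus stores n only if its version matches the server's current version.
func (a *fakeAPI) UpdateStatus(n *Node) error {
	if n.ResourceVersion != a.stored.ResourceVersion {
		return errConflict
	}
	a.stored = *n
	a.stored.ResourceVersion++
	return nil
}

// clearBuggy mirrors the failure mode described in the commit message.
func clearBuggy(api *fakeAPI, node *Node, retries int) error {
	knode := node
	for i := 0; i < retries; i++ {
		if i > 0 {
			knode = api.Get() // (1) from here on, knode != node
		}
		node.NetworkUnavailable = false // (2) mutates the cached object, not knode
		if err := api.UpdateStatus(knode); err == nil { // (3) pushes the unchanged knode
			return nil
		}
	}
	return errConflict
}

// clearFixed works on a copy and always modifies the object it is about to send.
func clearFixed(api *fakeAPI, cached *Node, retries int) error {
	knode := *cached // copy, so the cache entry is never touched
	for i := 0; i < retries; i++ {
		if i > 0 {
			knode = *api.Get()
		}
		knode.NetworkUnavailable = false
		if err := api.UpdateStatus(&knode); err == nil {
			return nil
		}
	}
	return errConflict
}

// scenario returns a server plus a cache entry that is one version behind,
// as if the kubelet had updated the node status concurrently.
func scenario() (*fakeAPI, *Node) {
	api := &fakeAPI{stored: Node{ResourceVersion: 1, NetworkUnavailable: true}}
	cached := api.Get()
	api.stored.ResourceVersion++
	return api, cached
}

func main() {
	api, cached := scenario()
	_ = clearBuggy(api, cached, 3)
	fmt.Println("buggy: condition still set on server:", api.stored.NetworkUnavailable,
		"cache mutated:", !cached.NetworkUnavailable)

	api, cached = scenario()
	_ = clearFixed(api, cached, 3)
	fmt.Println("fixed: condition still set on server:", api.stored.NetworkUnavailable,
		"cache mutated:", !cached.NetworkUnavailable)
}
```

In this model the buggy variant returns without error after the retry, yet the server-side copy still carries the condition and the cached object has been modified. The fixed variant clears the condition on the server and leaves the cache entry alone.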

 Avoid NodeNetworkUnavailable condition check for every node status update

- We know that kubelet sets NodeNetworkUnavailable condition when the node is
created/registered with api server.
- So we only need to call clearInitialNodeNetworkUnavailableCondition()
for the first time and not during subsequent node status update events.

Comment 4 Hongan Li 2018-05-31 07:57:53 UTC
No issues found during regression testing on GCP with v3.10.0-0.54.0.

OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
kernel: Linux qe-310-crio-master-etcd-1 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 6 errata-xmlrpc 2018-07-30 19:10:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

