Bug 1464653 - Nodes become NotReady when the status-update connection to the master is dropped/severed, causing the node to wait on the 15-minute default net/http timeout before trying again.
Status: ASSIGNED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Kubernetes
Version: 3.3.1
Hardware/OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Seth Jennings
QA Contact: DeShuai Ma
Reported: 2017-06-24 05:24 EDT by Nicolas Nosenzo
Modified: 2017-08-16 12:24 EDT
CC: 9 users

Type: Bug

Attachments: None
Description Nicolas Nosenzo 2017-06-24 05:24:54 EDT
Description of problem:

Nodes unexpectedly entered a NodeNotReady state, with no information explaining why beyond "Kubelet stopped posting node status." The customer has confirmed there is monitoring between hosts and masters via the LB; no timeouts or problems were recorded in roughly the last 16 hours.


Version-Release number of selected component (if applicable):
OCP 3.3.1.20-1

How reproducible:
Partially, in the customer's environment

Steps to Reproduce:
Not reliably reproducible; occurs intermittently in the customer environment (see comments).

Actual results:
Nodes suddenly go into a NodeNotReady state.

Expected results:
Cluster continues working properly.

Additional info:

Added within the comments.
Comment 14 Nicolas Nosenzo 2017-06-27 04:44:37 EDT
@Seth, 

At the same time, in case we still think the problem is on the LB, I'm wondering if we can just bypass it by replacing it with a native HAProxy solution: change the DNS entries to point to the new LB, and edit haproxy.cfg on the new LB so its backend servers point to the cluster masters.

Is the above something that might help us isolate the source of this issue?
Comment 17 Ryan Howe 2017-06-27 13:10:31 EDT
Description of problem (refocus):

 A node stops posting status updates to the master because the connection has been severed. If the node does not receive a TCP FIN for the request, it waits on the 15-minute default timeout set in net/http. Only after this timeout does the node try to update its status again.
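
A minimal Go sketch of the failure mode described above, assuming a plain net/http client comparable to the kubelet's status loop (the URL, port, and payload are illustrative, not taken from this bug): with the zero-value Timeout, a request over a half-open connection can hang for the transport's full default wait, whereas explicit bounds surface the error within seconds.

    package main

    import (
        "bytes"
        "fmt"
        "net"
        "net/http"
        "time"
    )

    func main() {
        // With no Timeout set, client.Do can block on a severed
        // connection (no TCP FIN received) until the stack gives up.
        // Explicit bounds make the failure visible quickly instead.
        client := &http.Client{
            Timeout: 30 * time.Second, // hard cap for the whole request
            Transport: &http.Transport{
                DialContext: (&net.Dialer{
                    Timeout:   10 * time.Second, // TCP connect timeout
                    KeepAlive: 10 * time.Second, // probe dead peers sooner
                }).DialContext,
                ResponseHeaderTimeout: 30 * time.Second,
            },
        }

        // Illustrative status update; the real kubelet goes through
        // client-go rather than a hand-rolled PATCH like this.
        req, err := http.NewRequest(http.MethodPatch,
            "https://master.example.com:8443/api/v1/nodes/node-1/status",
            bytes.NewReader([]byte(`{}`)))
        if err != nil {
            panic(err)
        }
        resp, err := client.Do(req)
        if err != nil {
            fmt.Println("status update failed, retry on next tick:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status update:", resp.Status)
    }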

Version-Release number of selected component (if applicable):
OCP 3.3.1.20-1

Actual results:

 Thu, 22 Jun 2017 15:46:03 +0200         NodeStatusUnknown               Kubelet stopped posting node status.

 Thu, 22 Jun 2017 16:01:05 +0200         KubeletReady                    kubelet is posting ready status

Expected results:

 A timeout should occur on the node, generating an error so that the node retries sending its status update to the master. This timeout should fall somewhere between the node's node-status-update-frequency (default 10s) and the master controllers' node-monitor-grace-period (default 40s).
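
A hedged sketch of that expected behaviour, using a 30s per-attempt deadline chosen to sit between the two defaults; updateNodeStatus is a hypothetical stand-in for the kubelet's real update call:

    package main

    import (
        "context"
        "log"
        "time"
    )

    // updateNodeStatus is a hypothetical stand-in for the kubelet's
    // status update; it must honor cancellation of ctx.
    func updateNodeStatus(ctx context.Context) error {
        select {
        case <-time.After(2 * time.Second): // pretend the call succeeded
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    func main() {
        // node-status-update-frequency: one attempt every 10s.
        ticker := time.NewTicker(10 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            // 30s deadline: above the 10s update frequency, below the
            // 40s node-monitor-grace-period, so a severed connection
            // errors out before the master marks the node NotReady.
            ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
            if err := updateNodeStatus(ctx); err != nil {
                log.Printf("status update failed (%v); retrying on next tick", err)
            }
            cancel()
        }
    }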
Comment 18 Nicolas Nosenzo 2017-06-28 09:35:55 EDT
Hi Seth, 
Do you think that replacing the current LB with a native HAProxy and setting the client/server connection timeouts to a lower value (the default is 5m) would cause any side effects on cluster behaviour?
 
i.e.:
  
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          300s #5min  --> set it to 60s
    timeout server          300s #5min   --> set it to 60s
    timeout http-keep-alive 10s
