Summary: Nodes become NotReady when the status-update connection to the master is dropped or severed, causing the node to wait on the 15 min default net/http timeout before retrying.
Product: OpenShift Container Platform
Reporter: Nicolas Nosenzo <nnosenzo>
Component: Master
Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA
QA Contact: Mike Fiedler <mifiedle>
Version: 3.3.1
CC: aos-bugs, decarr, dsafford, ekuric, eparis, jkaur, jlee, jliggitt, jokerman, maszulik, mfojtik, mifiedle, misalunk, mmccomas, nnosenzo, rhowe, wsun, xtian
Fixed In Version:
Doc Type: Bug Fix
Doc Text: Cause: Node status updates were being rate limited during heavy traffic. Consequence: Some nodes could be considered not ready. Fix: Use a separate connection for node health reporting. Result: Node status is reported without problems.
Blocks: 1527389 (view as bug list)
Environment:
Last Closed: 2017-12-19 10:39:51 UTC
Type: Bug
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Description Nicolas Nosenzo 2017-06-24 09:24:54 UTC
Description of problem:
Nodes unexpectedly entered a NodeNotReady state, with no information explaining why besides "Kubelet stopped posting node status." The customer has confirmed there is monitoring between hosts and masters via the LB; no timeouts or problems were recorded in the last roughly 16 hours.

Version-Release number of selected component (if applicable):
OCP 126.96.36.199-1

How reproducible:
Partially, on the customer environment

Steps to Reproduce:
1.
2.
3.

Actual results:
Nodes suddenly going into a NodeNotReady status

Expected results:
Cluster working properly

Additional info:
Added within the comments.
Comment 14 Nicolas Nosenzo 2017-06-27 08:44:37 UTC
@Seth, at the same time, in case we still think the problem is on the LB, I'm wondering if we can just bypass it with a native HAProxy solution: change the DNS entries to point to the new LB, and set the backend servers in the new LB's haproxy.cfg to the cluster masters. Is the above something that might help us isolate the source of this issue?
Comment 17 Ryan Howe 2017-06-27 17:10:31 UTC
Description of problem (refocus):
A node stops posting status updates to the master because the connection is severed. If the node does not receive a TCP FIN on the request, it waits on the 15 min default timeout set in net/http. After this timeout the node tries to update its status again.

Version-Release number of selected component (if applicable):
OCP 188.8.131.52-1

Actual results:
Thu, 22 Jun 2017 15:46:03 +0200 NodeStatusUnknown Kubelet stopped posting node status.
Thu, 22 Jun 2017 16:01:05 +0200 KubeletReady kubelet is posting ready status

Expected results:
A timeout on the node that generates an error and retries sending the master a status update. This timeout should fall somewhere between the node's node-status-update-frequency (default 10s) and the master controllers' node-monitor-grace-period duration (default 40s).
Comment 18 Nicolas Nosenzo 2017-06-28 13:35:55 UTC
Hi Seth, do you think that replacing the current LB with a native HAProxy and setting the server/client connection timeouts to a lower value (default 5 min) would cause any side effect on cluster behaviour? i.e.:
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 300s  # 5 min --> set it to 60s
timeout server 300s  # 5 min --> set it to 60s
timeout http-keep-alive 10s
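Applied, the proposal above would look roughly like this haproxy.cfg `defaults` section; this is a sketch of the change Nicolas describes, with client/server timeouts lowered from 300s to 60s and the other values copied from his comment.

```
defaults
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          60s   # lowered from 300s (5 min)
    timeout server          60s   # lowered from 300s (5 min)
    timeout http-keep-alive 10s
```

One likely side effect to weigh: in HAProxy, `timeout client` and `timeout server` are inactivity timeouts, so they also bound idle time on long-lived connections (such as API watches) passing through the LB, which may be the cluster behaviour the question is getting at.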
Comment 32 Jordan Liggitt 2017-10-06 14:19:25 UTC
Not fixed in the rebase; fixed in 1.7.8 upstream, cherry-pick still pending.
Comment 35 zhou ying 2017-10-13 10:12:23 UTC
Nicolas Nosenzo: Could you please provide more details about the LB? Thanks.
Comment 36 zhou ying 2017-10-16 09:36:46 UTC
Nicolas Nosenzo, Maciej Szulik: With the old version 3.7.0-0.127.0 and HAProxy HA, I couldn't reproduce the issue when I stopped the master's API service. Could you please provide more details? Thanks.
Comment 37 Maciej Szulik 2017-10-17 07:50:13 UTC
It might be hard to reproduce, you'll need to generate load big enough to hit those limits. I'll defer to Nicolas for the reproducer.
Comment 38 zhou ying 2017-11-03 08:53:56 UTC
Need to test when 3.8 scalability lab is available.
Comment 40 Mike Fiedler 2017-11-09 12:17:45 UTC
This area was regression tested in a 300-node AWS cluster on 3.7.0-0.190.0. The originally reported problem was not reproduced. During the test, a cluster horizontal stress test was run and high-stress logging testing was performed at rates over 75 million messages/hour; no NotReady nodes were seen. Additionally, SVT has run its suite of network performance tests for 3.7 and no issues were seen. Marking this bug VERIFIED for 3.7 and creating a card (internal board) for SVT to create a test case to explicitly test this area again in 3.8.
Comment 43 errata-xmlrpc 2017-11-28 21:58:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188
Comment 44 Nicolas Nosenzo 2017-12-15 12:41:24 UTC
Hi, re-opening this BZ; the affected customer is concerned about whether this will be backported to 3.4-3.6. Are there any plans for doing so? Thanks.
Comment 45 Michal Fojtik 2017-12-18 14:53:13 UTC
(In reply to Nicolas Nosenzo from comment #44) > Hi, re-opening this BZ, affected customer are concerned about whether this > will be backported to 3.4-3.6. Is there any plans for doing so ? > > Thanks. Can you please open a separate bug or clone this bug for 3.6?
Comment 46 Nicolas Nosenzo 2017-12-19 10:39:51 UTC
(In reply to Michal Fojtik from comment #45) > (In reply to Nicolas Nosenzo from comment #44) > > Hi, re-opening this BZ, affected customer are concerned about whether this > > will be backported to 3.4-3.6. Is there any plans for doing so ? > > > > Thanks. > > Can you please open a separate bug or clone this bug for 3.6? Done, https://bugzilla.redhat.com/show_bug.cgi?id=1527389 closing this one.