Bug 1869629
| Summary: | [MSTR-991] When the kube-controller-manager leader master is being shut down, openshift-apiserver and login are always unavailable for 1m30s | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | kube-controller-manager | Assignee: | Lukasz Szaszkiewicz <lszaszki> |
| Status: | CLOSED CANTFIX | QA Contact: | Xingxing Xia <xxia> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | aos-bugs, mfojtik |
| Target Milestone: | --- | Keywords: | Reopened, UpcomingSprint |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Last Closed: | 2020-09-07 13:16:11 UTC | Type: | Bug |
Description
Xingxing Xia 2020-08-18 11:44:19 UTC
Per MSTR-991's "Description", tested a baremetal env on OSP and got the same result as this bug.

Are you sure it is not an issue with how the script (watch_available.sh) reports unavailability? Notice that there is a default timeout (60s) after which the request (oc new-project) will be marked as failed. So, for example, what will the script report after running oc new-project when you delay it for 30s? I think the request can be stuck for an additional 60s when you hit the wrong replica within the first 40/60s, assuming oc uses the default timeout.

The timing explanation sounds reasonable, so closing the bug.

Lukasz shared newly learned info that switching KCM to the internal LB might affect the "recovery time after a hard shutdown": it takes up to about 1m40s to acquire a lock. That looks like a misconfigured LB; it shouldn't take more than 30s for the LB to remove an unhealthy Kube API out of the pool. So I'm reopening it. He and maybe other Dev fellows are still looking into it. Since it seems to relate to KCM, I'm changing the component to KCM.

I asked the AWS ELB team what the maximum time is for an LB to see an instance as unhealthy and why I'm observing the delay. Here is the response I received from them:

> I understand that you would like to know the maximum time taken by an NLB to mark an instance Unhealthy. You have defined "HealthCheckIntervalSeconds" as 10 and "UnhealthyThresholdCount" as 2. As per that configuration the NLB shouldn't take longer than 30 seconds; however, you are observing that it takes approximately 60 seconds for the NLB to mark the instance as Unhealthy. I would like to inform you that this is a known issue with the Network Load Balancer; our internal ELB service team is aware of it and is actively working on a fix. Below is the wording from them highlighting the issue:
>
> Thank you for contacting Elastic Load Balancing about the length of time taken for a target to complete a healthy/unhealthy state transition on your Network Load Balancer. We can confirm that the expected behavior of the Network Load Balancer health check system includes an additional delay of approximately 10-15 seconds between when a target is detected as changing health state and when the new state is reflected in the load balancer. This is due to the distributed nature of the health check system, which aggregates data about the health state of the target and distributes it to the Network Load Balancer. For example, if you have configured a 30 second health check interval and a threshold of 3 health checks to change the state, a healthy target that becomes unhealthy may experience up to 129 seconds of new requests routed to it after failing its first health check. This is because it may fail 1 second after passing, so the first period between health checks could be up to 29 seconds; then there are the 90 seconds of health checks that must fail (30 second intervals with a failure threshold of 3); and then the 10 seconds to aggregate and distribute the health state information. To reduce impact from connections routed to an unhealthy target, we recommend clients retry connections on connection failures.
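The arithmetic in that explanation can be written as (interval − 1) + interval × threshold + aggregation delay. A minimal sketch in Go; the formula is my reading of the AWS wording above, not an official ELB specification, and the values come from this bug's comments:

```go
package main

import "fmt"

// worstCaseDetectionSec approximates the longest time, in seconds, between a
// target going unhealthy and the NLB acting on it, per the AWS explanation:
// up to (interval-1)s until the first failing check, then interval*threshold
// seconds of failing checks, then the aggregation/distribution delay.
func worstCaseDetectionSec(interval, threshold, aggregation int) int {
	return (interval - 1) + interval*threshold + aggregation
}

func main() {
	// AWS's own example: 30s interval, threshold 3, ~10s aggregation.
	fmt.Println(worstCaseDetectionSec(30, 3, 10)) // 129 -- matches the quoted 129s

	// This bug's target group: 10s interval, threshold 2, 10-15s aggregation.
	fmt.Println(worstCaseDetectionSec(10, 2, 15)) // 44 -- same ballpark as the ~52s observed
}
```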
Assuming the above is correct and it takes approximately 60s for the LB to observe a failure, we might see a delay before a node is marked as NotReady. As a consequence, the platform might appear unavailable to end users. During that time (60s), anything can happen. The new KCM might or might not be able to acquire a lock. Assuming it does acquire the lock, it might get stuck populating a cache: the LIST request might fail and time out (not sure if it is 30s or 60s), requiring a retry. Given the above, I think the worst-case scenario might be as follows, assuming the default request timeout is 60s:

- T+59s: the new KCM acquires the lock
- T+59s: we start populating the caches and fail
- T+119s: we try again
- T+120s: the caches are populated and the node controller starts (this is slightly unrealistic, but I don't know the actual population time)
- T+160s: the node controller marks the node as NotReady

For the end user, it might be even worse. Assuming the default request timeout is 60s, the following might happen:

- T+159s: the end user creates a project (oc new-project) and it fails
- T+219s: the end user tries again

I did 4 more trials today. An instance was marked as unhealthy after ~52s on average (51s, 55s, 52s, 50s).

I'm closing this issue as this is how an AWS LB works as of today.
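For reference, the worst-case timeline above is mostly the 60s request timeout compounding across failed steps. A minimal sketch of that arithmetic; treating the final 40s gap as kube-controller-manager's default node-monitor-grace-period is an assumption, not something stated in the bug:

```go
package main

import "fmt"

func main() {
	// Assumed default client/apiserver request timeout from the comment above.
	const requestTimeout = 60

	t := 59
	fmt.Printf("T+%3ds: new KCM acquires the leader lock\n", t)
	fmt.Printf("T+%3ds: cache population starts and the LIST fails\n", t)

	t += requestTimeout // the failed LIST burns a full request timeout
	fmt.Printf("T+%3ds: retrying the LIST\n", t)

	t += 1 // optimistic: the retry populates the caches almost instantly
	fmt.Printf("T+%3ds: caches populated, node controller starts\n", t)

	t += 40 // assumed: KCM's default node-monitor-grace-period of 40s
	fmt.Printf("T+%3ds: node controller marks the node NotReady\n", t)
}
```

Run as written, this reproduces the T+59s, T+119s, T+120s, and T+160s marks from the timeline.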