Description of problem:
When the master holding the kube-controller-manager leader lease is shut down, openshift-apiserver and login are always unavailable for about 1 minute.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-18-015339

How reproducible:
Always (checked UPI GCP, IPI AWS)

Steps to Reproduce:
1. Check the masters, the KCM leader, and which masters the auth pods run on:

$ oc get no --context admin -l node-role.kubernetes.io/master
NAME                                             STATUS   ROLES    AGE     VERSION
ip-10-0-155-29.ap-northeast-1.compute.internal   Ready    master   7h16m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-178-19.ap-northeast-1.compute.internal   Ready    master   7h16m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-222-52.ap-northeast-1.compute.internal   Ready    master   7h16m   v1.19.0-rc.2+99cb93a-dirty

$ oc get cm kube-controller-manager -o yaml -n kube-system | grep leader
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"ip-10-0-155-29_1f49b58d-a42a-4c3a-8d2a-7d9c7ec0ab24","leaseDurationSeconds":15,"acquireTime":"2020-08-18T10:17:18Z","renewTime":"2020-08-18T11:06:51Z","leaderTransitions":9}'

$ oc get po -o wide -n openshift-authentication --no-headers
oauth-openshift-5479c46966-cx49k   1/1   Running   0   5m3s   10.130.0.35   ip-10-0-222-52.ap-northeast-1.compute.internal   <none>   <none>   app=oauth-openshift,pod-template-hash=5479c46966
oauth-openshift-5479c46966-tttbr   1/1   Running   0   6h8m   10.129.0.45   ip-10-0-178-19.ap-northeast-1.compute.internal   <none>   <none>   app=oauth-openshift,pod-template-hash=5479c46966

We can see ip-10-0-155-29 is the KCM leader, and the auth pods are on the other two masters, ip-10-0-178-19 and ip-10-0-222-52.

2. In terminal A, run:
$ curl -o scripts/watch-available.sh http://file.rdu.redhat.com/~xxia/watch-available.sh
$ scripts/watch-available.sh |& tee scripts-results/watch-available.aws.log

3. In the cloud console, select the VM ip-10-0-155-29 and click the "Stop" button to shut it down. Then watch terminal A.

Actual results:
3. Terminal A shows:
...
2020-08-18T19:12:39+08:00 oc login succeeded
2020-08-18T19:12:43+08:00 oc new-project succeeded
2020-08-18T19:12:44+08:00 oc delete project succeeded
2020-08-18T19:12:51+08:00 oc get no succeeded
2020-08-18T19:12:52+08:00 oc get cm succeeded
2020-08-18T19:12:56+08:00 oc login failed: error: unexpected EOF

real    0m0.723s
user    0m0.134s
sys     0m0.036s

ip-10-0-155-29.ap-northeast-1.compute.internal   Ready   master   7h5m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-178-19.ap-northeast-1.compute.internal   Ready   master   7h5m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-222-52.ap-northeast-1.compute.internal   Ready   master   7h5m   v1.19.0-rc.2+99cb93a-dirty
pod/apiserver-69c855d5d9-9ln7s   2/2   Running   0   6h20m   10.128.0.15   ip-10-0-155-29.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79nv   2/2   Running   0   6h21m   10.129.0.42   ip-10-0-178-19.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79t7   2/2   Running   0   6h21m   10.130.0.29   ip-10-0-222-52.ap-northeast-1.compute.internal   <none>   <none>
endpoints/api               10.128.0.15:8443,10.129.0.42:8443,10.130.0.29:8443      7h2m
endpoints/check-endpoints   10.128.0.15:17698,10.129.0.42:17698,10.130.0.29:17698   7h1m

2020-08-18T19:12:59+08:00 oc new-project failed: Error from server (Timeout): Timeout: request did not complete within requested timeout 34s

real    0m35.038s
user    0m0.146s
sys     0m0.040s

ip-10-0-155-29.ap-northeast-1.compute.internal   Ready   master   7h5m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-178-19.ap-northeast-1.compute.internal   Ready   master   7h5m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-222-52.ap-northeast-1.compute.internal   Ready   master   7h5m   v1.19.0-rc.2+99cb93a-dirty
pod/apiserver-69c855d5d9-9ln7s   2/2   Running   0   6h21m   10.128.0.15   ip-10-0-155-29.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79nv   2/2   Running   0   6h22m   10.129.0.42   ip-10-0-178-19.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79t7   2/2   Running   0   6h21m   10.130.0.29   ip-10-0-222-52.ap-northeast-1.compute.internal   <none>   <none>
endpoints/api               10.128.0.15:8443,10.129.0.42:8443,10.130.0.29:8443      7h3m
endpoints/check-endpoints   10.128.0.15:17698,10.129.0.42:17698,10.130.0.29:17698   7h2m

2020-08-18T19:13:46+08:00 oc delete project failed: Error from server (ServiceUnavailable): the server is currently unable to handle the request

real    0m1.664s
user    0m0.165s
sys     0m0.042s

ip-10-0-155-29.ap-northeast-1.compute.internal   Ready   master   7h5m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-178-19.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-222-52.ap-northeast-1.compute.internal   Ready   master   7h5m   v1.19.0-rc.2+99cb93a-dirty
pod/apiserver-69c855d5d9-9ln7s   2/2   Running   0   6h21m   10.128.0.15   ip-10-0-155-29.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79nv   2/2   Running   0   6h22m   10.129.0.42   ip-10-0-178-19.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79t7   2/2   Running   0   6h22m   10.130.0.29   ip-10-0-222-52.ap-northeast-1.compute.internal   <none>   <none>
endpoints/api               10.128.0.15:8443,10.129.0.42:8443,10.130.0.29:8443      7h3m
endpoints/check-endpoints   10.128.0.15:17698,10.129.0.42:17698,10.130.0.29:17698   7h2m

2020-08-18T19:13:50+08:00 oc get no succeeded
2020-08-18T19:13:51+08:00 oc get cm succeeded
2020-08-18T19:14:01+08:00 oc login failed: Error from server (InternalError): Internal error occurred: unexpected response: 400

real    0m5.227s
user    0m0.157s
sys     0m0.050s

ip-10-0-155-29.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-178-19.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-222-52.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
pod/apiserver-69c855d5d9-9ln7s   2/2   Running   0   6h21m   10.128.0.15   ip-10-0-155-29.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79nv   2/2   Running   0   6h22m   10.129.0.42   ip-10-0-178-19.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79t7   2/2   Running   0   6h22m   10.130.0.29   ip-10-0-222-52.ap-northeast-1.compute.internal   <none>   <none>
endpoints/api               10.128.0.15:8443,10.129.0.42:8443,10.130.0.29:8443      7h3m
endpoints/check-endpoints   10.128.0.15:17698,10.129.0.42:17698,10.130.0.29:17698   7h3m

2020-08-18T19:14:08+08:00 oc new-project failed: Error from server (ServiceUnavailable): the server is currently unable to handle the request

real    0m2.829s
user    0m0.132s
sys     0m0.032s

ip-10-0-155-29.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-178-19.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-222-52.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
pod/apiserver-69c855d5d9-9ln7s   2/2   Running   0   6h21m   10.128.0.15   ip-10-0-155-29.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79nv   2/2   Running   0   6h22m   10.129.0.42   ip-10-0-178-19.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79t7   2/2   Running   0   6h22m   10.130.0.29   ip-10-0-222-52.ap-northeast-1.compute.internal   <none>   <none>
endpoints/api               10.128.0.15:8443,10.129.0.42:8443,10.130.0.29:8443      7h4m
endpoints/check-endpoints   10.128.0.15:17698,10.129.0.42:17698,10.130.0.29:17698   7h3m

2020-08-18T19:14:13+08:00 oc delete project succeeded
2020-08-18T19:14:14+08:00 oc get no succeeded
2020-08-18T19:14:15+08:00 oc get cm succeeded
2020-08-18T19:14:19+08:00 oc login failed: Error from server (InternalError): Internal error occurred: unexpected response: 500

real    0m3.726s
user    0m0.174s
sys     0m0.042s

ip-10-0-155-29.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-178-19.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-222-52.ap-northeast-1.compute.internal   Ready   master   7h6m   v1.19.0-rc.2+99cb93a-dirty
pod/apiserver-69c855d5d9-9ln7s   2/2   Running   0   6h22m   10.128.0.15   ip-10-0-155-29.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79nv   2/2   Running   0   6h22m   10.129.0.42   ip-10-0-178-19.ap-northeast-1.compute.internal   <none>   <none>
pod/apiserver-69c855d5d9-m79t7   2/2   Running   0   6h22m   10.130.0.29   ip-10-0-222-52.ap-northeast-1.compute.internal   <none>   <none>
endpoints/api               10.128.0.15:8443,10.129.0.42:8443,10.130.0.29:8443      7h4m
endpoints/check-endpoints   10.128.0.15:17698,10.129.0.42:17698,10.130.0.29:17698   7h3m

2020-08-18T19:14:24+08:00 oc new-project succeeded
2020-08-18T19:14:26+08:00 oc delete project succeeded
...

Grepping the log shows that from 2020-08-18T19:12:56+08:00 to 2020-08-18T19:14:19+08:00, i.e. for about 1.5 minutes, openshift-apiserver and login are unavailable:

$ grep "^2020-08-18" scripts-results/watch-available.aws.log
...succeeded lines...
2020-08-18T19:12:39+08:00 oc login succeeded
2020-08-18T19:12:43+08:00 oc new-project succeeded
2020-08-18T19:12:44+08:00 oc delete project succeeded
2020-08-18T19:12:51+08:00 oc get no succeeded
2020-08-18T19:12:52+08:00 oc get cm succeeded
2020-08-18T19:12:56+08:00 oc login failed: error: unexpected EOF
2020-08-18T19:12:59+08:00 oc new-project failed: Error from server (Timeout): Timeout: request did not complete within requested timeout 34s
2020-08-18T19:13:46+08:00 oc delete project failed: Error from server (ServiceUnavailable): the server is currently unable to handle the request
2020-08-18T19:13:50+08:00 oc get no succeeded
2020-08-18T19:13:51+08:00 oc get cm succeeded
2020-08-18T19:14:01+08:00 oc login failed: Error from server (InternalError): Internal error occurred: unexpected response: 400
2020-08-18T19:14:08+08:00 oc new-project failed: Error from server (ServiceUnavailable): the server is currently unable to handle the request
2020-08-18T19:14:13+08:00 oc delete project succeeded
2020-08-18T19:14:14+08:00 oc get no succeeded
2020-08-18T19:14:15+08:00 oc get cm succeeded
2020-08-18T19:14:19+08:00 oc login failed: Error from server (InternalError): Internal error occurred: unexpected response: 500
2020-08-18T19:14:24+08:00 oc new-project succeeded
2020-08-18T19:14:26+08:00 oc delete project succeeded
...succeeded lines...

Expected results:
3. Per MSTR-991's "Description": "A non-clean shutdown of masters should not lead to API (OpenShift and Kubernetes) errors beyond ... a minute"

Additional info:
Per MSTR-991's "Description", I also tested a baremetal env on OSP and got the same result as in this bug.
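The script URL above is on an internal server and may not be reachable; based on the output in the "Actual results" section, each iteration of the script does roughly the following. This is a minimal sketch reconstructed from the observed output, not the actual watch-available.sh (the real script also appears to time each command; the API URL and kubeadmin password variables below are placeholders):

#!/bin/bash
# Sketch: exercise login (oauth), openshift-apiserver (project create/delete)
# and kube-apiserver (get nodes/configmaps) in a loop; print a timestamped
# succeeded/failed line per command, and on failure also dump the masters and
# the openshift-apiserver pods/endpoints, as seen in the log above.
check() {
  local name="$1"; shift
  if out=$("$@" 2>&1); then
    echo "$(date --iso-8601=seconds) $name succeeded"
  else
    echo "$(date --iso-8601=seconds) $name failed: $out"
    oc get no -l node-role.kubernetes.io/master --no-headers
    oc get po -n openshift-apiserver -o wide --no-headers
    oc get endpoints -n openshift-apiserver --no-headers
  fi
}
while true; do
  check "oc login"          oc login "$API_URL" -u kubeadmin -p "$KUBEADMIN_PASSWORD"
  check "oc new-project"    oc new-project availability-check
  check "oc delete project" oc delete project availability-check
  check "oc get no"         oc get no
  check "oc get cm"         oc get cm
  sleep 1
done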
Are you sure it is not an issue with how the script (watch-available.sh) reports unavailability? Note that there is a default timeout (60s) after which the request (oc new-project) will be marked as failed. So, for example, what would the script report after running oc new-project if you delay it by 30s? I think the request can be stuck for an additional 60s when you hit the wrong replica within the first 40/60s, assuming oc uses the default timeout.
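One way to check whether the client-side default timeout is driving those numbers would be to rerun the failing request with an explicit, shorter timeout and compare how long it actually blocks (--request-timeout is a standard oc/kubectl global flag; the project name and value here are only examples):

$ time oc new-project timeout-probe --request-timeout=10s

If the ~35s seen in the log comes from oc's default request timeout, lowering it like this should make the request give up much earlier; if it still hangs for the full duration, the delay is coming from the LB/apiserver side rather than from the script.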
The timing explanation sounds reasonable, so I'm closing the bug.
Lukasz shared new information: switching KCM to the internal LB might affect the "recovery time after a hard shutdown", causing it to take up to about 1m40s to acquire a lock. It looks like a misconfigured LB; it shouldn't take more than 30s for the LB to remove an unhealthy Kube API server from the pool. So I'm reopening this bug. He and possibly other developers are still looking into it. Since it seems related to KCM, I'm changing the component to KCM.
I asked the AWS ELB team what the maximum time is for an LB to see an instance as unhealthy, and why I'm observing the delay. Here is the response I received from them:

I understand that you would like to know what is the maximum time taken by an NLB to mark an instance Unhealthy. You have defined "HealthCheckIntervalSeconds" as 10 and "UnhealthyThresholdCount" as 2. As per the configuration the NLB shouldn't take longer than 30 seconds; however, you are observing that it is taking approximately 60 seconds for the NLB to mark the instance as Unhealthy. I would like to inform you that this is a known issue with the Network Load Balancer and our internal ELB service team is aware of it and is actively working on a fix. Below is the wording from them highlighting the issue:

Thank you for contacting Elastic Load Balancing about the length of time taken for a target to complete a healthy/unhealthy state transition on your Network Load Balancer. We can confirm that the expected behavior of the Network Load Balancer health check system includes an additional delay of approximately 10-15 seconds between when a target is detected as changing health state and when the new state is reflected in the load balancer. This is due to the distributed nature of the health check system, which aggregates data about the health state of the target and distributes it to the Network Load Balancer. For example, if you have configured a 30 second health check interval and a threshold of 3 health checks to change the state, a healthy target that becomes unhealthy may experience up to 129 seconds of new requests routed to it after failing its first health check. This is because it may fail 1 second after passing, so the first period between health checks could be up to 29 seconds; then there are the 90 seconds of failing health checks (in this example, 30 second intervals with a failure threshold of 3); and then the 10 seconds to aggregate and distribute the health state information. To reduce impact from connections routed to an unhealthy target, we recommend clients retry connections on connection failures.
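For reference, the worst case the ELB team describes is just the sum of the gap before the first failed check, the failing checks themselves, and the aggregation delay. Applying the same arithmetic to this cluster's settings (HealthCheckIntervalSeconds=10, UnhealthyThresholdCount=2; the 10-15s aggregation delay is their figure, not something we configure):

# ELB team's example: 30s interval, threshold 3, ~10s aggregation delay
$ interval=30 threshold=3 aggregation=10
$ echo $(( (interval - 1) + interval * threshold + aggregation ))
129
# This cluster's NLB: 10s interval, threshold 2, plus the quoted 10-15s delay
$ echo $(( (10 - 1) + 10 * 2 + 15 ))
44

So even with the documented aggregation delay, the worst case for our settings would be roughly 40-45s, while we observe around 52-60s, which is consistent with the "known issue" they mention.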
Assuming the above is correct and it takes approximately 60s for the LB to observe a failure, we might see a delay before the node is marked as NotReady, and as a consequence the platform might appear unavailable to end users. During that time (60s) anything can happen. The new KCM might or might not be able to acquire the lock; assuming it does, it might get stuck populating its caches. The LIST request might fail and time out (not sure if it is 30s or 60s) and we might need to retry. Given the above, I think the worst-case scenario might be as follows, assuming the default request timeout is 60s:

T+59s: the new KCM acquires the lock
T+59s: we start populating the caches and fail
T+119s: we try again
T+120s: the caches are populated and the node controller starts (this is slightly unrealistic, but I don't know the actual population time)
T+160s: the node controller marks the node as NotReady

For the end user it might be even worse. Assuming the default request timeout is 60s, the following might happen:

T+159s: the end user creates a project and it fails (oc new-project)
T+219s: the end user tries again and ...
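To see where the time actually goes in such a test, one could watch the leader annotation used in the reproduction steps while the master is shutting down and note when holderIdentity/acquireTime change (this reuses the exact command from step 1; the polling interval is arbitrary):

$ while true; do
    date --iso-8601=seconds
    oc get cm kube-controller-manager -o yaml -n kube-system | grep leader
    sleep 5
  done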
I did 4 more trials today. An instance was marked as unhealthy after ~52s on average (51s, 55s, 52s, 50s).
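For anyone repeating these trials, the configured thresholds and the moment the stopped instance flips to unhealthy can be read via the AWS CLI (the target group name/ARN below are placeholders that need to be filled in for the cluster's internal API NLB):

# Confirm the health-check settings quoted above
$ aws elbv2 describe-target-groups --names <internal-api-target-group> \
    --query 'TargetGroups[].[HealthCheckIntervalSeconds,UnhealthyThresholdCount]'

# Poll target health with timestamps to measure the transition time
$ while true; do
    date --iso-8601=seconds
    aws elbv2 describe-target-health --target-group-arn <target-group-arn> \
      --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' --output text
    sleep 5
  done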
I'm closing this issue as this is how an AWS LB works as of today.