Bug 1869629
| Summary: | [MSTR-991] When the kube-controller-manager leader master is being shut down, openshift-apiserver and login are always unavailable for 1m30s | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | kube-controller-manager | Assignee: | Lukasz Szaszkiewicz <lszaszki> |
| Status: | CLOSED CANTFIX | QA Contact: | Xingxing Xia <xxia> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | aos-bugs, mfojtik |
| Target Milestone: | --- | Keywords: | Reopened, UpcomingSprint |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Last Closed: | 2020-09-07 13:16:11 UTC | Type: | Bug |
Description
Xingxing Xia 2020-08-18 11:44:19 UTC
Per MSTR-991's "Description", tested a baremetal env on OSP and got the same result as this bug.

Are you sure it is not an issue with how the script (watch_available.sh) reports unavailability? Notice that there is a default timeout (60s) after which the request (oc new-project) will be marked as failed. So, for example, what will the script report after running oc new-project when you delay it for 30s? I think the request can be stuck for an additional 60s when you hit the wrong replica within the first 40/60s, assuming oc uses the default timeout.

The timing explanation sounds reasonable, so closing the bug.

Lukasz shared newly learned info that switching KCM to the internal LB might affect the "recovery time after a hard shutdown": it takes up to about 1m40s to acquire a lock. That looks like a misconfigured LB; it shouldn't take more than 30s for the LB to remove an unhealthy Kube API out of the pool. So I'm reopening it. He and maybe other Dev fellows are still looking into it. Since it seems to relate to KCM, I'm changing the component to KCM.

I asked the AWS ELB team what the maximum time is for an LB to see an instance as unhealthy and why I'm observing the delay. Here is the response I received from them:

> I understand that you would like to know the maximum time taken by an NLB to mark an instance Unhealthy. You have defined "HealthCheckIntervalSeconds" as 10 and "UnhealthyThresholdCount" as 2. As per that configuration the NLB shouldn't take longer than 30 seconds; however, you are observing that it takes approximately 60 seconds for the NLB to mark the instance as Unhealthy. I would like to inform you that this is a known issue with the Network Load Balancer; our internal ELB service team is aware of it and is actively working on a fix. Below is the wording from them highlighting the issue:
>
> Thank you for contacting Elastic Load Balancing about the length of time taken for a target to complete a healthy/unhealthy state transition on your Network Load Balancer. We can confirm that the expected behavior of the Network Load Balancer health check system includes an additional delay of approximately 10-15 seconds between when a target is detected as changing health state and when the new state is reflected in the load balancer. This is due to the distributed nature of the health check system, which aggregates data about the health state of the target and distributes it to the Network Load Balancer. For example, if you have configured a 30 second health check interval and a threshold of 3 health checks to change the state, a healthy target that becomes unhealthy may experience up to 129 seconds of new requests routed to it after failing its first health check. This is because it may fail 1 second after passing, so the first period between health checks could be up to 29 seconds; then there are the 90 seconds of health checks that must fail (30 second intervals with a failure threshold of 3); and then the 10 seconds to aggregate and distribute the health state information. To reduce impact from connections routed to an unhealthy target, we recommend clients retry connections on connection failures.
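The arithmetic in that explanation can be written as (interval − 1) + interval × threshold + aggregation delay. A minimal sketch in Go; the formula is my reading of the AWS wording above, not an official ELB specification, and the values come from this bug's comments:

```go
package main

import "fmt"

// worstCaseDetectionSec approximates the longest time, in seconds, between a
// target going unhealthy and the NLB acting on it, per the AWS explanation:
// up to (interval-1)s until the first failing check, then interval*threshold
// seconds of failing checks, then the aggregation/distribution delay.
func worstCaseDetectionSec(interval, threshold, aggregation int) int {
	return (interval - 1) + interval*threshold + aggregation
}

func main() {
	// AWS's own example: 30s interval, threshold 3, ~10s aggregation.
	fmt.Println(worstCaseDetectionSec(30, 3, 10)) // 129 -- matches the quoted 129s

	// This bug's target group: 10s interval, threshold 2, 10-15s aggregation.
	fmt.Println(worstCaseDetectionSec(10, 2, 15)) // 44 -- same ballpark as the ~52s observed
}
```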
Assuming the above is correct and it takes approximately 60s for the LB to observe a failure, we might see a delay before a node is marked as NotReady. As a consequence, the platform might appear unavailable to end users. During that time (60s), anything can happen. The new KCM might or might not be able to acquire a lock. Assuming it does acquire the lock, it might get stuck populating a cache: the LIST request might fail and time out (not sure if it is 30s or 60s), requiring a retry. Given the above, I think the worst-case scenario might be as follows, assuming the default request timeout is 60s:

- T+59s: the new KCM acquires the lock
- T+59s: we start populating the caches and fail
- T+119s: we try again
- T+120s: the caches are populated and the node controller starts (this is slightly unrealistic, but I don't know the actual population time)
- T+160s: the node controller marks the node as NotReady

For the end user, it might be even worse. Assuming the default request timeout is 60s, the following might happen:

- T+159s: the end user creates a project (oc new-project) and it fails
- T+219s: the end user tries again

I did 4 more trials today. An instance was marked as unhealthy after ~52s on average (51s, 55s, 52s, 50s).

I'm closing this issue as this is how an AWS LB works as of today.
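For reference, the worst-case timeline above is mostly the 60s request timeout compounding across failed steps. A minimal sketch of that arithmetic; treating the final 40s gap as kube-controller-manager's default node-monitor-grace-period is an assumption, not something stated in the bug:

```go
package main

import "fmt"

func main() {
	// Assumed default client/apiserver request timeout from the comment above.
	const requestTimeout = 60

	t := 59
	fmt.Printf("T+%3ds: new KCM acquires the leader lock\n", t)
	fmt.Printf("T+%3ds: cache population starts and the LIST fails\n", t)

	t += requestTimeout // the failed LIST burns a full request timeout
	fmt.Printf("T+%3ds: retrying the LIST\n", t)

	t += 1 // optimistic: the retry populates the caches almost instantly
	fmt.Printf("T+%3ds: caches populated, node controller starts\n", t)

	t += 40 // assumed: KCM's default node-monitor-grace-period of 40s
	fmt.Printf("T+%3ds: node controller marks the node NotReady\n", t)
}
```

Run as written, this reproduces the T+59s, T+119s, T+120s, and T+160s marks from the timeline.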