Bug 2019215 - openshift-apiserver takes 60s to terminate, increasing node eviction time. [NEEDINFO]
Summary: openshift-apiserver takes 60s to terminate, increasing node eviction time.
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.11.0
Assignee: Abu Kashem
QA Contact: Rahul Gangwar
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-11-01 21:08 UTC by Zack Zlotnik
Modified: 2022-03-14 15:53 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
jerzhang: needinfo? (akashem)


Attachments (Terms of Use)
Detailed breakdown of apiserver eviction frequency (493.82 KB, text/plain)
2021-11-01 21:08 UTC, Zack Zlotnik
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-openshift-apiserver-operator pull 486 0 None open Bug 2019215: enable shutdown-send-retry-after for apiserver 2021-12-02 19:35:41 UTC

Description Zack Zlotnik 2021-11-01 21:08:27 UTC
Created attachment 1839039 [details]
Detailed breakdown of apiserver eviction frequency

Description of problem:
We (the MCO team) noticed that the openshift-apiserver does not consistently evict when an OpenShift cluster upgrade is being performed. We can see this in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1450721786428657664 as well as in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1450721797476454400. In both of those cases, the resulting CI job timed out:

2021-10-20 06:37:48	I1020 10:33:08.974199  462599 drain.go:73] Draining failed with: [error when waiting for pod "apiserver-7c65dfd584-gs789" terminating: global timeout reached: 1m30s, error when waiting for pod "apiserver-5d59559f8-q8zmx" terminating: global timeout reached: 1m30s], retrying
2021-10-20 06:37:48	I1020 10:32:17.711200  462599 daemon.go:334] evicting pod openshift-apiserver/apiserver-7c65dfd584-gs789
2021-10-20 06:37:48	I1020 10:32:12.922018  462599 daemon.go:334] evicting pod openshift-oauth-apiserver/apiserver-5d59559f8-q8zmx
2021-10-20 06:37:48	E1020 10:32:12.710804  462599 daemon.go:334] error when evicting pods/"apiserver-7c65dfd584-gs789" -n "openshift-apiserver" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2021-10-20 06:37:48	I1020 10:32:12.701273  462599 daemon.go:334] evicting pod openshift-apiserver/apiserver-7c65dfd584-gs789
2021-10-20 06:37:48	I1020 10:32:07.946806  462599 daemon.go:319] Evicted pod openshift-authentication/oauth-openshift-579f5b5d49-6kcl6
2021-10-20 06:37:48	E1020 10:32:07.921239  462599 daemon.go:334] error when evicting pods/"apiserver-5d59559f8-q8zmx" -n "openshift-oauth-apiserver" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2021-10-20 06:37:48	I1020 10:32:07.884443  462599 daemon.go:334] evicting pod openshift-oauth-apiserver/apiserver-5d59559f8-q8zmx
2021-10-20 06:37:48	E1020 10:32:07.700361  462599 daemon.go:334] error when evicting pods/"apiserver-7c65dfd584-gs789" -n "openshift-apiserver" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2021-10-20 06:37:48	I1020 10:32:07.691467  462599 daemon.go:334] evicting pod openshift-apiserver/apiserver-7c65dfd584-gs789
2021-10-20 06:37:48	I1020 10:32:05.812220  462599 daemon.go:319] Evicted pod openshift-console/console-5dc55565db-8b78r
2021-10-20 06:37:48	E1020 10:32:02.883697  462599 daemon.go:334] error when evicting pods/"apiserver-5d59559f8-q8zmx" -n "openshift-oauth-apiserver" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2021-10-20 06:37:48	I1020 10:32:02.874716  462599 daemon.go:334] evicting pod openshift-oauth-apiserver/apiserver-5d59559f8-q8zmx
2021-10-20 06:37:48	E1020 10:32:02.690437  462599 daemon.go:334] error when evicting pods/"apiserver-7c65dfd584-gs789" -n "openshift-apiserver" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.


Version-Release number of selected component (if applicable):
4.9 -> 4.10

How reproducible:
Seems fairly reproducible in CI.

Steps to Reproduce:
1.
2.
3.

Actual results:
We expected the openshift-apiserver pods to eventually be cleanly evicted when the MCD applies the updated MachineConfigs to the cluster.

Expected results:
We consistently see errors involving pod disruption budgets and / or eviction timeouts from the MCD in which always involve the openshift-apiserver. By comparison, the etcd-quorum-guard pod eventually becomes evicted and doesn't time out.

Additional info:

One can find more detailed logs for the above errors here: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221634616000000%22,%221634875199000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade%2F1450721797476454400%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22machine-config-daemon-mqxld%5C%22%22,%22refId%22:%22A%22%7D%5D.

Here are the etcd-quorum-guard logs for comparison: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221634616000000%22,%221634875199000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade%2F1450721797476454400%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22machine-config-daemon-mqxld%5C%22%20%7C%3D%20%5C%22etcd-quorum-guard%5C%22%22,%22refId%22:%22A%22%7D%5D

I suspect that this might be caused by the pause behavior found in the openshift-apiserver pods (as found in the event logs) where some delay is incurred between the time a SIGTERM is given and the openshift-apiserver pod shuts down. However, I'm not 100% sure if this is the cause.

While this behavior does not consistently cause a given CI run to fail, we do think it contributes to CI instability as shown above (those two jobs specifically have this as the root cause). I looked at all of the upgrade jobs and discovered that the eviction behavior was fairly consistent across the 4.9 -> 4.10 upgrade jobs, regardless of platform, variant, etc. I've attached a YAML file which contains a detailed breakdown of which CI jobs we observed this on. Each occurrence also includes links to the Loki logs for both the MCD pod as well as the openshift-apiserver pods.

Comment 2 Stefan Schimanski 2021-11-02 08:16:38 UTC
About the case in the description:

10:27:03	openshift-apiserver	replicaset-controller	apiserver-7c65dfd584	SuccessfulCreate Created pod: apiserver-7c65dfd584-wg4jx
10:31:39	openshift-apiserver		apiserver-7c65dfd584-wg4jx	Scheduled Successfully assigned openshift-apiserver/apiserver-7c65dfd584-wg4jx to ci-op-ig61ynq6-82914-fgzq9-master-1

So for some reason the third pods takes minutes to schedule. The is probably the reason why apiserver-7c65dfd584-gs789 cannot be evicted.

During that time master-1 is actually not ready:

  The master nodes not ready: node \"ci-op-ig61ynq6-82914-fgzq9-master-1\" not ready since 2021-10-20 10:29:36 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.

In other words: when considering a reasonable eviction timeout, consider two components:

  1. the time it takes for a process to cleanly shutdown (this is up to 70s for openshift-apiserver)
  2. the time it takes for PDB backed pods on a freshly ready (10:31:16 in upper case) other master node to start up. Start-up = being scheduled and time to startup.

About (2), in the given case I see:

  - 10:31:39 apiserver-7c65dfd584-wg4jx being scheduled to master-1
  - 10:32:02 apiserver-7c65dfd584-wg4jx startup, trying to connect to etcd
  - 10:32:13 apiserver-7c65dfd584-wg4jx starting listening on port (not ready yet)
  - 10:32:14 apiserver-7c65dfd584-wg4jx still busy creating RBAC rules
  - 10:32:16 apiserver-7c65dfd584-wg4jx ready

So assuming this is normal startup behaviour (it looks like it is), we have to take 40s (or a minute or more to be on the safe side) into account for starting up the pod on the other, freshly started node into account.

=> 60s + 70s, at least 130s. 90s is clearly not enough.

Comment 3 Yu Qi Zhang 2021-11-02 16:02:04 UTC
Some quick items to clarify:

1. The 90s timeout is not a hard fail. The 90s is the timeout at which point we retry the drain operation. Drains are only considered failing after 1 hour of failure to drain nodes in the latest releases
2. This apiserver pdb (drain slowness + start slowness) started happening post upgrade to 4.8 as far as I can tell, some data points:

a) 4.6.48 -> 4.6.49 has no error[1] for apiserver, total drain time of all pods was 75s on a master node
b) 4.6 -> 4.7 on AWS has no error[2] for apiserver, total drain time 73s
c) starting in 4.7 -> 4.8 we no longer capture the previous MCD logs[3] but we can use loki [4] to see that drains are now taking 153s as a result of drain slowness, and we see the pdb issue both for etcd-quorum-guard and apiserver, although etcd does eventually resolve whereas apiserver ends up as an error. The interesting part here is that in those logs, apiserver-668796cd65-xjrwc is the one failing to drain, but it never actually drains. It simply is no longer there (we assume) on the next drain run. An apiserver pod "apiserver-86bd7fb688-xswmz" is drained successfully, which makes we wonder if apiserver-668796cd65-xjrwc is a phantom pod of some sort? Or is it normal to have multiple apiserver pods?

To summarize as of 4.8 apiserver is taking twice the amount of time to drain, for which a lot of times it actually never successfully completes, but rather is silently gone on the next drain retry operation. This means that the pod always drains in the short window between retries always (feels unlikely), so we were wondering if somehow instead the status reporting is having issues or if there is some type of undrainable apiserver pod dangling. If this is expected behaviour, then we should maybe be bumping timeouts post-4.8 upgrade jobs, since the MCO will be taking longer time to complete that upgrade (in many of the jobs failing on MCO, we actually don't fail, but rather hit a timeout since the master nodes are slow to upgrade).

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1453312292669100032/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-cwtv7_machine-config-daemon.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1455363312848998400/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-6tnpl_machine-config-daemon.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1455363312848998400/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-6tnpl_machine-config-daemon.log https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1455328097904627712/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-6zfbr_machine-config-daemon.log https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1455328097904627712/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-bk8pk_machine-config-daemon.log
[4] https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade%2F1455328097904627712%5C%22%7D%20%7C%20unpack%20pod_name%3D%5C%22machine-config-daemon-bk8pk%5C%22%22,%22refId%22:%22A%22%7D%5D

Comment 4 Yu Qi Zhang 2021-11-29 16:50:25 UTC
Given that this:

a) is consistently reproduceable
b) is only an issue with the apiserver pods from our observation and aggregation
c) appears to also affect graceful shutdown efforts

Moving back to the apiserver team and upping severity. Please see comment 3 for the analysis on why we believe this is an issue from the apiserver pod side. If you have any questions/concerns, please feel free to reach out to me. Thanks!

Comment 5 Abu Kashem 2021-11-30 14:32:40 UTC
jerzhang@redhat.com,

I am sending it back to Machine Config Operator for the following

- take one job where you think eviction timed out for openshift apiserver
- establish a timeline of the eviction activities, we would like to see when it kicked off eviction process, when it timed out and the activities in between
- correlate the above timeline with openshift apiserver logs/events. 

Please share the CI job URL with us so we can take a deeper look. This way we can collaborate more efficiently and get to the root cause.

Comment 6 Yu Qi Zhang 2021-11-30 15:44:08 UTC
Hi Abu,

> - take one job where you think eviction timed out for openshift apiserver

Will use a latest 4.10 nightly upgrade job to help focus the discussion: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1465486284658053120 

> - establish a timeline of the eviction activities, we would like to see when it kicked off eviction process, when it timed out and the activities in between

We will look at the upgrade for this master node specifically: ip-10-0-254-172.us-east-2.compute.internal

First, the MCD logs on the node, during the upgrade:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486284658053120%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22machine-config-daemon-45m5n%5C%22%22,%22refId%22:%22A%22%7D%5D

The apiserver pod on that node starts to evict at 02:09:18:

I1130 02:09:18.314096    4629 daemon.go:333] evicting pod openshift-apiserver/apiserver-6679c4cf97-xtjjv

It fails to drain immediately due to PDB:
	
E1130 02:09:18.338848    4629 daemon.go:333] error when evicting pods/"apiserver-6679c4cf97-xtjjv" -n "openshift-apiserver" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

This retries 4 more times until eventually:
	
I1130 02:10:49.200913    4629 drain.go:91] Draining failed with: error when waiting for pod "apiserver-6679c4cf97-xtjjv" terminating: global timeout reached: 1m30s, retrying

So it times out after 90s as we define the overall drain timeout. The drain restarts, and succeeds, but notice that the pod in question apiserver-6679c4cf97-xtjjv never actually reports draining success

The apiserver pod is no longer in the must-gather since it is from a previous reboot. Looking at Loki again, for that specific pod,

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486284658053120%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22apiserver-6679c4cf97-xtjjv%5C%22%22,%22refId%22:%22A%22%7D%5D

It appears to have been failing for 40 minutes (?): 2021-11-29 21:09:02	
W1130 02:09:01.874010       1 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {10.0.153.225:2379 10.0.153.225 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.0.153.225:2379: connect: connection refused". Reconnecting...

Before eventually getting a SIGTERM and shutting down, 

This behaviour is consistent with jobs in 4.8, 4.9, 4.10 but we do not see this in previous versions, despite us using the same drain process. Our questions are mainly:

1. Why this happens starting in 4.8
2. Is this expected based on the analysis above (especially the apiserver pod errors)
3. Why the pod never reports drain success despite seemingly having terminated at 02:10:02 which is at least 47 seconds before our timeout

Let me know if you have any further questions

Comment 8 Abu Kashem 2021-12-01 17:22:13 UTC
> It appears to have been failing for 40 minutes (?): 2021-11-29 21:09:02	
> W1130 02:09:01.874010       1 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {10.0.153.225:2379 10.0.153.225 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.0.153.225:2379: connect: connection refused". Reconnecting...

This message is a noise, it has a warning label. I think the server was up and running during this window.


My analysis follows:

> I1130 02:09:18.314096    4629 daemon.go:333] evicting pod openshift-apiserver/apiserver-6679c4cf97-xtjjv
The eviction started at '02:09:18'. Now let's construct a timeline for the 'openshift-apiserver/apiserver-6679c4cf97-xtjjv' pod from logs:

> - W1130 01:30:29.805021:      first entry in log
The first log entry is at '01:30:29'

> - I1130 01:30:31.024869       1 genericapiserver.go:389] [graceful-termination] waiting for shutdown to be initiated
At '01:30:31' the server starts listening 

> - I1130 02:09:52.202768       1 genericapiserver.go:355] "[graceful-termination] shutdown event" name="ShutdownInitiated"
The server starts graceful shutdown at '02:09:52' after receiving the TERM signal

> - I1130 02:10:02.226923       1 genericapiserver.go:387] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"
The last entry is at '02:10:02'

From this we can infer:
- The server received the TERM signal about 34 seconds after the eviction started (02:09:52 - 02:09:18)
- It took the server about 10s to gracefully terminate after receiving the TERM signal (02:10:02 - 02:09:52)


As to why the pod received the TERM signal 34 seconds after the eviction was initiated - I loaded the metrics and broke it down by instance for openshift apiserver pods. 
I noticed that when the eviction was kicked off (02:09:18), there were only two pods reporting metrics. 
At 02:10:15 I see a third apiserver instance starts reporting, and then the pod being evicted receives TERM signal at 02:09:52. 
metrics scraping happens every 30s, so the third pod became ready around the same time the pod being evicted received the TERM signal. 
See the graph attached here https://bugzilla.redhat.com/attachment.cgi?id=1844374


The following is from your analysis:
> This retries 4 more times until eventually:	
> I1130 02:10:49.200913    4629 drain.go:91] Draining failed with: error when waiting for pod "apiserver-6679c4cf97-xtjjv" terminating: global timeout reached: 1m30s, retrying
> So it times out after 90s as we define the overall drain timeout. The drain restarts, and succeeds, but notice that the pod in question apiserver-6679c4cf97-xtjjv never actually reports draining success

I can't place it in the timeline, it shows 90s. but following the logs of the pod, we can see the eviction process took 45s.
So far looking at the pod logs, it took 45s for the pod to terminate after the eviction process kicked off. I used the loki UI you provided to inspect the log, feel free to double check my analysis.


Also, i want to reiterate Stefan's suggestion:
> In other words: when considering a reasonable eviction timeout, consider two components:
>
>  1. the time it takes for a process to cleanly shutdown (this is up to 70s for openshift-apiserver)
>  2. the time it takes for PDB backed pods on a freshly ready (10:31:16 in upper case) other master node to start up. Start-up = being scheduled and time to startup.

Comment 9 Abu Kashem 2021-12-01 18:03:36 UTC
> 1. Why this happens starting in 4.8
We have added PDB for openshift apiserver beginning with 4.9. https://github.com/openshift/cluster-openshift-apiserver-operator/blob/release-4.9/bindata/v3.11.0/openshift-apiserver/pdb.yaml. This does not explain why you are seeing this on 4.8. 


> 2. Is this expected based on the analysis above (especially the apiserver pod errors)
pod errors are noise, so we can ignore it. Based on the analysis I presented above, this is expected behavior.

> 3. Why the pod never reports drain success despite seemingly having terminated at 02:10:02 which is at least 47 seconds before our timeout
That's a good question, I would recommend adding debug/trace to MCD operator, inspecting other logs to get to the root cause. 

Having said that, I am assigning it back to you for now. If your follow up investigation points to openshift apiserver or anything related to control plane please feel free to assign it back to us.

Comment 10 Yu Qi Zhang 2021-12-01 18:13:41 UTC
Hi Abu,

Thank you for your analysis,

> From this we can infer:
> - The server received the TERM signal about 34 seconds after the eviction started (02:09:52 - 02:09:18)
> - It took the server about 10s to gracefully terminate after receiving the TERM signal (02:10:02 - 02:09:52)

I browsed some more logs and found also instances where the timeout was not reached, so we can construct a more clear timeline:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486287069777920%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22machine-config-daemon-7j92d%5C%22%5Cn%22,%22refId%22:%22A%22%7D%5D

From this job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1465486287069777920

The drain started at 01:52:20 and finished at 01:53:37 (77s total), looking at the corresponding pod log:

2021-11-29 20:52:36	
I1130 01:52:36.207381       1 genericapiserver.go:387] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"

Was the last line, meaning that it should have finished at 01:52:36 (16s) but then proceeded to take an additional 61 seconds before actually terminating.

Comparing that to e.g. the oauth pod, "openshift-oauth-apiserver/apiserver-6cf79cbc5b-f8hrt", which has its last log at 

2021-11-29 20:53:22	
I1130 01:53:22.426689       1 genericapiserver.go:709] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"kube-system", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationGracefulTerminationFinished' All pending requests processed

compared to the MCD completion:

2021-11-29 20:57:53	
I1130 01:53:23.768936    5161 daemon.go:318] Evicted pod openshift-oauth-apiserver/apiserver-6cf79cbc5b-f8hrt

only 1 second from termination to finish drain.

>  1. the time it takes for a process to cleanly shutdown (this is up to 70s for openshift-apiserver)
Do you know why the apiserver pod takes an additional ~60s to complete shutdown compared to when the pod appears to have terminated? I am not sure how to add a trace in the MCD for that case.

>  2. the time it takes for PDB backed pods on a freshly ready (10:31:16 in upper case) other master node to start up. Start-up = being scheduled and time to startup.
In the above linked job, I checked all 3 MCDs and there was no PDB failures (all 3 nodes drained in 70s), compared to the previous job where all 3 were blocked on apiserver PDBs for some time. Both are runs from last 2 days for 4.10. Could you help me take a look at why this run was much faster? Was the apiserver faster to boot up? Sorry I do not know how to construct the diagram to see apiserver status.

Comment 11 Abu Kashem 2021-12-01 20:53:51 UTC
jerzhang@redhat.com,

can you share:
- openshift apiserver pod name 
- loki url for the openshift apiserver pod

Thanks!

Comment 12 Yu Qi Zhang 2021-12-01 21:22:34 UTC
Hi Abu,

Of course. Will list all 3 pods and loki urls for them:

1. openshift-apiserver/apiserver-6946bdd965-kf298 : Last log at 01:52:36, drain success at 01:53:37 (61s)

  apiserver log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486287069777920%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%20%3D%20%5C%22apiserver-6946bdd965-kf298%5C%22%22,%22refId%22:%22A%22%7D%5D

  MCD log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486287069777920%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%20%3D%20%5C%22machine-config-daemon-7j92d%5C%22%22,%22refId%22:%22A%22%7D%5D

2. openshift-apiserver/apiserver-6946bdd965-dw7g8: This one actually has additional logs that I didn't notice before:

I1130 02:04:54.541105       1 genericapiserver.go:406] [graceful-termination] apiserver is exiting	
I1130 02:03:54.546638       1 genericapiserver.go:387] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"

MCD drain finished: 02:04:55 - this checks out. So apiserver appears to normally take a minute to exit, but we don't manage to save those logs normally?

  apiserver log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486287069777920%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%20%3D%20%5C%22apiserver-6946bdd965-dw7g8%5C%22%22,%22refId%22:%22A%22%7D%5D

  MCD log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486287069777920%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%20%3D%20%5C%22machine-config-daemon-7ngld%5C%22%22,%22refId%22:%22A%22%7D%5D


3. openshift-apiserver/apiserver-6946bdd965-cb6sj: last log at 01:58:17, drain success at 01:59:18, consistent with the above

  apiserver log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486287069777920%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%20%3D%20%5C%22apiserver-6946bdd965-cb6sj%5C%22%22,%22refId%22:%22A%22%7D%5D 

  MCD log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade%2F1465486287069777920%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%20%3D%20%5C%22machine-config-daemon-f2bch%5C%22%22,%22refId%22:%22A%22%7D%5D



So to summarize, all 3 nodes exhibit the same (from shutdown event to apiserver exit takes 60s, but generally isn't logged except the one case I found above).

And the other interesting thing about this CI run https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1465486287069777920 is that there was no PDB errors, implying that the apiserver pod running post-reboot starts faster(?) compared to the previous cluster logs.

Comment 13 Abu Kashem 2021-12-02 19:30:29 UTC
jerzhang@redhat.com,

Thanks for the information. I am going to use pod 'apiserver-6946bdd965-dw7g8' and try to explain the activities to make sense of things.

> I1130 02:03:43.601102    4759 daemon.go:333] evicting pod openshift-apiserver/apiserver-6946bdd965-dw7g8
(from MCD log) eviction started at '02:03:43'

> I1130 02:03:44.500493       1 genericapiserver.go:349] "[graceful-termination] shutdown event" name="ShutdownInitiated"
> I1130 02:03:44.499956       1 cmd.go:97] Received SIGTERM or SIGINT signal, shutting down controller.
The pod receives TERM signal and initiates graceful termination (working as expected)

> I1130 02:04:54.541105       1 genericapiserver.go:406] [graceful-termination] apiserver is exiting
> I1130 02:03:54.546638       1 genericapiserver.go:387] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"
The Pod terminates at '02:04:54' 

> I1130 02:04:55.453570    4759 daemon.go:318] Evicted pod openshift-apiserver/apiserver-6946bdd965-dw7g8
MCD decides that the pod has been evicted at '02:04:55'

Now let's inspect the audit logs to see when
- kubelet issued a DELETE for this pod, and
- when MCD issued a GET and receives a 404

> $ grep -rih "apiserver-6946bdd965-dw7g8" audit/logs/audit_logs/kube-apiserver/*audit* | jq '. | [.] | sort_by(.requestReceivedTimestamp) | .[] | select(.objectRef.name == "apiserver-6946bdd965-dw7g8") | .requestReceivedTimestamp + " " + .verb + " " + (.responseStatus.code|tostring) + " " + .requestURI + " " + .user.username'
>
> "2021-11-30T02:04:55.245249Z delete 200 /api/v1/namespaces/openshift-apiserver/pods/apiserver-6946bdd965-dw7g8 system:node:ip-10-0-170-65.us-west-2.compute.internal"
> "2021-11-30T02:04:55.451912Z get 404 /api/v1/namespaces/openshift-apiserver/pods/apiserver-6946bdd965-dw7g8 system:serviceaccount:openshift-machine-config-operator:machine-config-daemon"
>

- kubelet issued a DELETE for the pod at '02:04:55', 1s after the apiserver process terminated ('02:04:54') - this is expected
- MCD issues a GET for the pod, gets a 404 - this is expected


So everything seems to be working as expected here, other instances you have pointed out where we could not construct a proper timeline is due to missing log lines in loki. you can see here that the last line of apiserver log should be:
> [graceful-termination] apiserver is exiting
https://github.com/openshift/openshift-apiserver/blob/master/vendor/k8s.io/apiserver/pkg/server/genericapiserver.go#L406


as to why the apiserver pod always take 60s to terminate - this is a known issue and I believe the following PR will take it from a "fixed 60s" to "up to 60s".
https://github.com/openshift/cluster-openshift-apiserver-operator/pull/486
The above PR cannot merge now, we have to wait for 1.23 rebase to complete since it depends on https://github.com/kubernetes/kubernetes/pull/101257.


In addition, I would recommend that you increase the eviction timeout to cover the worst case, as Stefan suggested
> In other words: when considering a reasonable eviction timeout, consider two components:
>
>  1. the time it takes for a process to cleanly shutdown (this is up to 70s for openshift-apiserver)
>  2. the time it takes for PDB backed pods on a freshly ready (10:31:16 in upper case) other master node to start up. Start-up = being scheduled and time to startup.

Comment 14 Abu Kashem 2021-12-02 19:34:13 UTC
Once https://github.com/openshift/cluster-openshift-apiserver-operator/pull/486 merges we should expect node drain to take much less time. (unless there are other pods on the node that exhibit the same issue)

So let's do some tests after this PR merges

Comment 15 Yu Qi Zhang 2021-12-03 20:35:52 UTC
Hi Abu,

Thank you for the PR!

> In addition, I would recommend that you increase the eviction timeout to cover the worst case, as Stefan suggested

the eviction timeout is currently 1 hour. The 90 second triggers the retry per eviction cycle. It just so happens that the apiserver is the only one hitting the 90s retry, causing some additional reboot time, which was not the case in previous versions, prompting us to open this bug. We will consider modifying that if necessary.

Comment 16 Yu Qi Zhang 2022-01-24 19:42:53 UTC
Going to move this out of 4.10. @akashem@redhat.com do you have a rough idea of when we can land the 1.23 bump

Comment 17 Kirsten Garrison 2022-03-07 21:29:56 UTC
k8s was bumped to 1.23 in 4.10 but the PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/486 is still open.

I'm moving this to the api-server team for visibility as the related PR doesn't live in the MCO and seems to be getting lost in the shuffle as long as the BZ stays in the MCO list.

Please try to merge the fix so we can see if it resolves this BZ.

Thanks!

Comment 20 Rahul Gangwar 2022-03-14 15:53:11 UTC
oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-03-13-055724|grep -i openshift-apiserver-operator
  cluster-openshift-apiserver-operator           https://github.com/openshift/cluster-openshift-apiserver-operator           34b630d464d07e915eda01683b23fd05447ad55c

git log --date local --pretty="%h %an %cd - %s" 34b630d464d  |grep -i #486  
34b630d4 OpenShift Merge Robot Wed Mar 9 18:52:01 2022 - Merge pull request #486 from tkashem/graceful

oc get pod -n openshift-apiserver 
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-59d596f48b-72vsn   2/2     Running   0          29m
apiserver-59d596f48b-7cqb4   2/2     Running   0          26m
apiserver-59d596f48b-fqbxh   2/2     Running   0          27m

cat test.sh
date
oc delete pod/apiserver-59d596f48b-72vsn -n openshift-apiserver
date

Mon Mar 14 20:52:26 IST 2022
pod "apiserver-59d596f48b-72vsn" deleted
Mon Mar 14 20:53:40 IST 2022

Time taken terminate pod is more than 60 seconds which is 74 seconds. @Abukashem Please confirm, is this expected? Assigning back to confirm.


Note You need to log in before you can comment on or make changes to this bug.