Bug 1943804
Summary: API server on AWS takes disruption between 70s and 110s after pod begins termination via external LB

Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.8
Target Release: 4.8.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Clayton Coleman <ccoleman>
Assignee: Lukasz Szaszkiewicz <lszaszki>
QA Contact: Ke Wang <kewang>
CC: aojeagar, aos-bugs, mfojtik, obulatov, sttts, wking, xxia, yunjiang
Hardware: Unspecified
OS: Unspecified
Whiteboard: TechnicalReleaseBlocker
Type: Bug
Bug Blocks: 1791162, 1801885, 1845411, 1869788, 1869790, 1929396
Last Closed: 2021-07-27 22:56:00 UTC
Description    Clayton Coleman    2021-03-27 15:28:24 UTC
Clayton is right, AWS is broken. I did the following manual test, which demonstrates that new connections are still being forwarded to an API server that is undergoing termination.

1. Installed a 4.8 cluster (4.8.0-0.ci-2021-03-28-220420).
2. Took one API server out of the pool (mv kube-apiserver-pod.yaml ../) to increase my chances of hitting the server under test.
3. Ran the following script, which creates a new connection to the server on each iteration:

while true; do curl -H "User-Agent: james" -H "Authorization: Bearer REDACTED" https://api.polmonkmkimp.group-b.devcluster.openshift.com:6443/api/v1/namespaces -k -m 2 -o /dev/null -sw "$(date +%T) %{http_code} %{time_total}s\n"; sleep 1; done

4. Sent a SIGTERM to one server: kill -s SIGTERM $PID

Here is what I found: requests were still being received around ~T+70s (the --shutdown-delay-duration), which is VERY late and indicates that AWS did not take the API server out of rotation.

E0: T+0s: SIGTERM sent to 10-0-131-183
(date; kill -s SIGTERM 143136; Mon Mar 29 12:16:54 UTC 2021)

E1: T+0s: 10-0-131-183 receives a TERM signal
(oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq '.items' | jq '.[] | select(.reason=="TerminationStart")')

{
  "apiVersion": "v1",
  "eventTime": null,
  "firstTimestamp": null,
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal",
    "namespace": "openshift-kube-apiserver"
  },
  "kind": "Event",
  "lastTimestamp": null,
  "message": "Received signal to terminate, becoming unready, but keeping serving",
  "metadata": {
    "creationTimestamp": "2021-03-29T12:16:54Z",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal.1670cf5163faf315",
    "namespace": "openshift-kube-apiserver",
    "resourceVersion": "88811",
    "uid": "ea1efadc-3cdf-4275-a83f-cfd8ccc7472d"
  },
  "reason": "TerminationStart",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "apiserver",
    "host": "ip-10-0-131-183"
  },
  "type": "Normal"
}

E2: T+0s: logs from 10-0-131-183 filtered by the user agent
(oc logs -n openshift-kube-apiserver kube-apiserver-ip-10-0-131-183.ec2.internal -f | grep "james")

I0329 12:17:00.565819 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="232.707358ms" userAgent="james" srcIP="10.0.51.233:58801" resp=200
I0329 12:17:02.404500 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="232.509186ms" userAgent="james" srcIP="10.0.51.233:21409" resp=200
I0329 12:17:08.536342 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="226.236527ms" userAgent="james" srcIP="10.0.51.233:23306" resp=200
I0329 12:17:10.368416 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="231.572429ms" userAgent="james" srcIP="10.0.51.233:40110" resp=200
I0329 12:17:19.529428 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="223.988728ms" userAgent="james" srcIP="10.0.51.233:18542" resp=200
I0329 12:17:23.198021 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="232.905483ms" userAgent="james" srcIP="10.0.51.233:46005" resp=200
I0329 12:17:25.039396 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="233.728848ms" userAgent="james" srcIP="10.0.51.233:10436" resp=200
I0329 12:17:26.857096 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="227.374887ms" userAgent="james" srcIP="10.0.51.233:43878" resp=200
I0329 12:17:28.701398 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="223.498453ms" userAgent="james" srcIP="10.0.51.233:64204" resp=200
I0329 12:17:30.530245 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="229.98221ms" userAgent="james" srcIP="10.0.51.233:54945" resp=200
I0329 12:17:32.333707 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="224.751331ms" userAgent="james" srcIP="10.0.51.233:50888" resp=200
I0329 12:17:35.986866 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="228.489524ms" userAgent="james" srcIP="10.0.51.233:5798" resp=200
I0329 12:17:37.805527 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="229.850606ms" userAgent="james" srcIP="10.0.51.233:63211" resp=200
I0329 12:17:41.469994 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="233.488308ms" userAgent="james" srcIP="10.0.51.233:51799" resp=200
I0329 12:17:43.304633 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="231.870691ms" userAgent="james" srcIP="10.0.51.233:31088" resp=200
I0329 12:17:46.961367 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="228.730156ms" userAgent="james" srcIP="10.0.51.233:44535" resp=200
I0329 12:17:50.612538 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="242.212359ms" userAgent="james" srcIP="10.0.51.233:29328" resp=200
W0329 12:17:52.200501 18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:37342, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:17:52.432302 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="231.829746ms" userAgent="james" srcIP="10.0.51.233:37342" resp=200
W0329 12:17:54.026533 18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:27541, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:17:54.255197 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="228.686402ms" userAgent="james" srcIP="10.0.51.233:27541" resp=200
W0329 12:17:55.839230 18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:50792, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:17:56.065143 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="225.954251ms" userAgent="james" srcIP="10.0.51.233:50792" resp=200
W0329 12:17:59.508938 18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:10138, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:17:59.738239 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="229.355668ms" userAgent="james" srcIP="10.0.51.233:10138" resp=200
W0329 12:18:01.342323 18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:54300, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:18:01.575489 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="233.194447ms" userAgent="james" srcIP="10.0.51.233:54300" resp=200
W0329 12:18:03.170068 18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:28110, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:18:03.396975 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="226.930343ms" userAgent="james" srcIP="10.0.51.233:28110" resp=200
error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""

E3: T+70s: TerminationMinimalShutdownDurationFinished received at 12:18:04
(oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq '.items' | jq '.[] | select(.reason=="TerminationMinimalShutdownDurationFinished")')

{
  "apiVersion": "v1",
  "eventTime": null,
  "firstTimestamp": null,
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal",
    "namespace": "openshift-kube-apiserver"
  },
  "kind": "Event",
  "lastTimestamp": null,
  "message": "The minimal shutdown duration of 1m10s finished",
  "metadata": {
    "creationTimestamp": "2021-03-29T12:18:04Z",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal.1670cf61b0f72d2d",
    "namespace": "openshift-kube-apiserver",
    "resourceVersion": "89139",
    "uid": "4eab8165-cb40-4a7a-b7e7-5401938b82f9"
  },
  "reason": "TerminationMinimalShutdownDurationFinished",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "apiserver",
    "host": "ip-10-0-131-183"
  },
  "type": "Normal"
}

E4: T+70s: TerminationStoppedServing received at 12:18:04
(oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq '.items' | jq '.[] | select(.reason=="TerminationStoppedServing")')

{
  "apiVersion": "v1",
  "eventTime": null,
  "firstTimestamp": null,
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal",
    "namespace": "openshift-kube-apiserver"
  },
  "kind": "Event",
  "lastTimestamp": null,
  "message": "Server has stopped listening",
  "metadata": {
    "creationTimestamp": "2021-03-29T12:18:04Z",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal.1670cf61b1990110",
    "namespace": "openshift-kube-apiserver",
    "resourceVersion": "89140",
    "uid": "84d567bd-f653-4fbd-b542-82fff0c72d81"
  },
  "reason": "TerminationStoppedServing",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "apiserver",
    "host": "ip-10-0-131-183"
  },
  "type": "Normal"
}

E5: T+130s: TerminationGracefulTerminationFinished received at 12:19:04
(oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq '.items' | jq '.[] | select(.reason=="TerminationGracefulTerminationFinished")')

{
  "apiVersion": "v1",
  "eventTime": null,
  "firstTimestamp": null,
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal",
    "namespace": "openshift-kube-apiserver"
  },
  "kind": "Event",
  "lastTimestamp": null,
  "message": "All pending requests processed",
  "metadata": {
    "creationTimestamp": "2021-03-29T12:19:04Z",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal.1670cf6fa9f8ef82",
    "namespace": "openshift-kube-apiserver",
    "resourceVersion": "89452",
    "uid": "c6bddd39-5611-4e19-a7fe-3fcbf204ef02"
  },
  "reason": "TerminationGracefulTerminationFinished",
  "reportingComponent": "",
"reportingInstance": "", "source": { "component": "apiserver", "host": "ip-10-0-131-183" }, "type": "Normal" } Did a few more tests, in the last run the server received a request at T0 + 211s [root@ip-10-0-131-183 ~]# date; kill -s SIGTERM 278562 Mon Mar 29 16:25:23 UTC 2021 ❯ oc logs -n openshift-kube-apiserver kube-apiserver-ip-10-0-131-183.ec2.internal -f | grep "james" I0329 16:25:52.937653 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="222.902219ms" userAgent="james" srcIP="10.0.91.230:54834" resp=200 I0329 16:25:54.747723 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="223.178316ms" userAgent="james" srcIP="10.0.37.169:63291" resp=200 I0329 16:25:56.586519 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="236.856737ms" userAgent="james" srcIP="10.0.78.223:20031" resp=200 I0329 16:26:00.841269 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="455.263823ms" userAgent="james" srcIP="10.0.51.233:62462" resp=200 I0329 16:26:03.026039 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="462.656103ms" userAgent="james" srcIP="10.0.25.113:57850" resp=200 I0329 16:26:05.218424 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="456.940913ms" userAgent="james" srcIP="10.0.25.113:11272" resp=200 I0329 16:26:07.373366 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="442.235736ms" userAgent="james" srcIP="10.0.25.113:46960" resp=200 I0329 16:26:09.554608 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="462.029928ms" userAgent="james" srcIP="10.0.25.113:16254" resp=200 I0329 16:26:13.770448 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="460.215337ms" userAgent="james" srcIP="10.0.25.113:6017" resp=200 I0329 16:26:15.948896 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="458.875829ms" userAgent="james" srcIP="10.0.25.113:22118" resp=200 I0329 16:26:18.110111 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="447.566662ms" userAgent="james" srcIP="10.0.25.113:14208" resp=200 I0329 16:26:24.324797 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="446.271584ms" userAgent="james" srcIP="10.0.25.113:61408" resp=200 I0329 16:26:26.470754 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="444.751335ms" userAgent="james" srcIP="10.0.25.113:53187" resp=200 I0329 16:26:32.744479 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="455.035465ms" userAgent="james" srcIP="10.0.25.113:24460" resp=200 I0329 16:26:41.314015 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="448.624921ms" userAgent="james" srcIP="10.0.25.113:50744" resp=200 I0329 16:26:43.463580 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="446.781681ms" userAgent="james" srcIP="10.0.25.113:13908" resp=200 I0329 16:26:49.771985 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="439.832331ms" userAgent="james" srcIP="10.0.25.113:60474" resp=200 I0329 16:26:56.056219 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="461.870335ms" userAgent="james" srcIP="10.0.25.113:28137" resp=200 I0329 16:26:58.227994 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="444.899661ms" userAgent="james" srcIP="10.0.25.113:40385" resp=200 I0329 16:27:00.399969 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="455.621468ms" userAgent="james" srcIP="10.0.25.113:48095" resp=200 
I0329 16:27:04.588044 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="438.775726ms" userAgent="james" srcIP="10.0.25.113:61575" resp=200
I0329 16:27:08.826198 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="494.197819ms" userAgent="james" srcIP="10.0.25.113:16214" resp=200
I0329 16:27:10.960414 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="447.225468ms" userAgent="james" srcIP="10.0.25.113:5935" resp=200
I0329 16:27:17.268094 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="464.328062ms" userAgent="james" srcIP="10.0.25.113:51641" resp=200
I0329 16:27:21.469696 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="455.002004ms" userAgent="james" srcIP="10.0.25.113:18644" resp=200
I0329 16:27:23.650813 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="457.71601ms" userAgent="james" srcIP="10.0.25.113:13542" resp=200
I0329 16:27:27.865922 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="446.575823ms" userAgent="james" srcIP="10.0.25.113:44279" resp=200
I0329 16:27:30.018900 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="451.042525ms" userAgent="james" srcIP="10.0.25.113:38352" resp=200
I0329 16:27:36.330192 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="492.799839ms" userAgent="james" srcIP="10.0.25.113:45080" resp=200
I0329 16:27:38.470673 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="440.340293ms" userAgent="james" srcIP="10.0.25.113:22339" resp=200
I0329 16:27:46.864711 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="440.568683ms" userAgent="james" srcIP="10.0.25.113:17841" resp=200
I0329 16:27:49.003416 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="440.534977ms" userAgent="james" srcIP="10.0.25.113:47367" resp=200
I0329 16:27:51.176509 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="463.568847ms" userAgent="james" srcIP="10.0.25.113:18046" resp=200
I0329 16:27:53.334598 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="447.761414ms" userAgent="james" srcIP="10.0.25.113:61739" resp=200
I0329 16:28:08.160325 18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="228.933349ms" userAgent="james" srcIP="10.0.6.16:30366" resp=200

I'm going to open a support ticket with AWS.

Created attachment 1767396 [details]
cloudwatch metrics for the latest run
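
For anyone re-running this analysis, the T+ offsets above can be recomputed from the events alone. A minimal helper (my own sketch, not part of the test above; the pod name is the one from this run and should be substituted):

POD=kube-apiserver-ip-10-0-131-183.ec2.internal
# Print every Termination* event for the pod in creation order, so the offsets
# from the SIGTERM timestamp can be read off directly.
oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson \
  | jq -r --arg pod "$POD" '.items[]
      | select(.involvedObject.name == $pod)
      | select(.reason | startswith("Termination"))
      | "\(.metadata.creationTimestamp)  \(.reason)  \(.message)"'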
Since PR 1079 increases the termination timeouts for AWS, we need to verify on both AWS and non-AWS clusters.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-233425   True        False         24h     Cluster version is 4.8.0-0.nightly-2021-05-21-233425

$ oc get infrastructures.config.openshift.io -o yaml | grep -A1 " platformSpec"
    platformSpec:
      type: GCP

$ oc -n openshift-kube-apiserver get cm config -o json | jq -r '.data."config.yaml"' | jq '.apiServerArguments."shutdown-delay-duration"'
[
  "70s"
]

$ oc get pod -n openshift-kube-apiserver kube-apiserver-ci-ln-d6ygqnb-f76d1-d6kbg-master-1 -o yaml | grep 135s
exec watch-termination --termination-touch-file=/var/log/kube-apiserver/.terminating --termination-log-file=/var/log/kube-apiserver/termination.log --graceful-termination-duration=135s --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig -- hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=${HOST_IP} -v=2 --permit-address-sharing

- Kube-apiserver rollout:

$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'

Checked the kube-apiserver operator logs; the rollout started at 03:44:35 and ended at 03:55:27, see below:

I0525 03:44:35.707678 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3748cc54-2961-4e73-98c8-2f4e24a89b9d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapUpdated' Updated ConfigMap/kube-apiserver-pod -n openshift-kube-apiserver: cause by changes in data.forceRedeploymentReason
...
I0525 03:55:27.483033 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3748cc54-2961-4e73-98c8-2f4e24a89b9d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 9"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 8; 2 nodes are at revision 9" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 9"

- Etcd encryption:

$ oc patch apiserver/cluster -p '{"spec":{"encryption":{"type":"aescbc"}}}' --type merge
apiserver.config.openshift.io/cluster patched

Checked the kube-apiserver operator logs; the encryption started at 04:10:25:

...
I0525 04:10:25.175851 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3748cc54-2961-4e73-98c8-2f4e24a89b9d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'EncryptionKeyCreated' Secret "encryption-key-openshift-kube-apiserver-1" successfully created: ["key-does-not-exist"]
...

$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.lastTransitionTime}{"\n"}{.reason}{"\n"}{.message}{"\n"}'
2021-05-25T04:34:59Z
EncryptionCompleted
All resources encrypted: secrets, configmaps

The encryption ended at 04:34:59 and took about 24 minutes.
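
The same two values can be pulled on any platform before testing. A short sketch (assumptions: jq is available and a kube-apiserver static pod name is substituted for the placeholder):

# Configured shutdown delay, from the kube-apiserver config:
oc -n openshift-kube-apiserver get cm config -o json \
  | jq -r '.data."config.yaml"' \
  | jq -r '.apiServerArguments."shutdown-delay-duration"[0]'

# Grace period enforced by the watch-termination wrapper, from any kube-apiserver pod:
oc -n openshift-kube-apiserver get pod <kube-apiserver-pod-name> -o yaml \
  | grep -o 'graceful-termination-duration=[0-9]*s'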
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

$ oc get infrastructures.config.openshift.io -o yaml | grep -A1 " platformSpec"
    platformSpec:
      aws: {}

$ oc -n openshift-kube-apiserver get cm config -o json | jq -r '.data."config.yaml"' | jq '.apiServerArguments."shutdown-delay-duration"'
[
  "210s"
]

$ oc get pod -n openshift-kube-apiserver kube-apiserver-ip-10-0-49-171.us-east-2.compute.internal -o yaml | grep 275s
exec watch-termination --termination-touch-file=/var/log/kube-apiserver/.terminating --termination-log-file=/var/log/kube-apiserver/termination.log --graceful-termination-duration=275s --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig -- hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=${HOST_IP} -v=2 --permit-address-sharing

$ oc exec -n openshift-kube-apiserver-operator kube-apiserver-operator-6fc449958c-sl7p8 -- date; oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
Tue May 25 05:29:06 UTC 2021
kubeapiserver.operator.openshift.io/cluster patched

Checked the kube-apiserver operator logs; the rollout started at 05:29:09 and ended at 05:48:50, about 19m41s:

I0525 05:29:09.535071 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"8beb5658-6257-4e15-89cc-d7e17af29b3f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapUpdated' Updated ConfigMap/kube-apiserver-pod -n openshift-kube-apiserver: cause by changes in data.forceRedeploymentReason
...
I0525 05:48:50.210507 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"8beb5658-6257-4e15-89cc-d7e17af29b3f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 18"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 17; 2 nodes are at revision 18" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 18"

$ oc exec -n openshift-kube-apiserver-operator kube-apiserver-operator-6fc449958c-sl7p8 -- date; oc patch apiserver/cluster -p '{"spec":{"encryption":{"type":"aescbc"}}}' --type merge
Tue May 25 06:38:20 UTC 2021
apiserver.config.openshift.io/cluster patched

Checked the kube-apiserver operator logs; the encryption started at 06:38:21:

...
I0525 06:38:21.862745 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"8beb5658-6257-4e15-89cc-d7e17af29b3f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'EncryptionKeyCreated' Secret "encryption-key-openshift-kube-apiserver-1" successfully created: ["key-does-not-exist"]
...

$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.lastTransitionTime}{"\n"}{.reason}{"\n"}{.message}{"\n"}'
2021-05-25T07:21:02Z
EncryptionCompleted
All resources encrypted: secrets, configmaps

The encryption ended at 07:21:02 and took about 43 minutes.
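
Note that on both platforms the watch-termination grace period is the configured shutdown delay plus 65s (GCP: 70s + 65s = 135s; AWS: 210s + 65s = 275s). A trivial consistency check over the values observed above (sketch only; the 65s offset is inferred from these two samples, not from the code):

# Verify graceful-termination-duration = shutdown-delay-duration + 65s for both runs.
for sample in "GCP 70 135" "AWS 210 275"; do
  set -- $sample
  [ $(( $2 + 65 )) -eq $3 ] && echo "$1: shutdown-delay ${2}s + 65s = graceful-termination ${3}s"
done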
From the above results, the kube-apiserver rollout and etcd encryption on AWS took much longer than on GCP, which is expected, so moving the bug to VERIFIED.

*** Bug 1845412 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438