Bug 1943804 - API server on AWS takes disruption between 70s and 110s after pod begins termination via external LB
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard: TechnicalReleaseBlocker
Duplicates: 1845412
Depends On:
Blocks: 1791162 1801885 1845411 1869788 1869790 1929396
 
Reported: 2021-03-27 15:28 UTC by Clayton Coleman
Modified: 2021-10-09 04:38 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:56:00 UTC
Target Upstream Version:
Embargoed:


Attachments
cloudwatch metrics for the latest run (113.83 KB, image/png)
2021-03-29 16:40 UTC, Lukasz Szaszkiewicz


Links
System ID Status Summary Last Updated
Github openshift cluster-authentication-operator pull 446 closed Bug 1943804: stub for splitting encryption tests 2021-05-20 11:47:11 UTC
Github openshift cluster-authentication-operator pull 447 closed Bug 1943804: splits the encryption tests 2021-05-20 11:47:11 UTC
Github openshift cluster-kube-apiserver-operator pull 1079 closed Bug 1943804: increases termination timeouts for AWS 2021-07-20 04:32:49 UTC
Github openshift cluster-kube-apiserver-operator pull 1131 closed Bug 1943804: split encryption tests 2021-05-20 11:47:13 UTC
Github openshift cluster-openshift-apiserver-operator pull 453 closed Bug 1943804: stub for splitting encryption tests 2021-05-20 11:47:10 UTC
Github openshift cluster-openshift-apiserver-operator pull 454 closed Bug 1943804: splits the encryption tests 2021-05-20 11:47:10 UTC
Github openshift library-go pull 1076 closed Bug 1943804: increases the maximum time for migration for the encryption tests on AWS 2021-05-20 11:47:14 UTC
Github openshift release pull 18554 closed Bug 1943804: new CI job for the encryption rotation tests 2021-05-20 11:47:15 UTC
Github openshift release pull 18615 closed Bug 1943804: kas-o: moves the encryption tests to GCP 2021-05-20 11:47:15 UTC
Github openshift release pull 18620 closed Bug 1943804: oas-o a new CI job for the encryption rotation tests 2021-05-20 11:47:12 UTC
Github openshift release pull 18628 closed Bug 1943804: authn-o a new CI job for the encryption rotation tests 2021-05-20 11:47:17 UTC
Github openshift release pull 18714 Merged Bug 1943804: moves the encryption tests to GCP 2021-11-04 22:46:37 UTC
Red Hat Product Errata RHSA-2021:2438 2021-07-27 22:56:36 UTC

Description Clayton Coleman 2021-03-27 15:28:24 UTC
Sometime in the last few weeks we have regressed (or AWS has regressed?) the behavior of draining requests from the apiserver when it begins graceful shutdown. This occurs with no other disruption present: starting a cluster, triggering graceful shutdown of an apiserver pod via TERM on PID 1, and polling API calls from outside the cluster is sufficient to reproduce the issue on any 4.8 AWS cluster.

Looking at the nightly upgrade runs, this seems to have gotten substantially worse around 3/15. Unfortunately, that is also when the crio regression around shutdown landed, so no obvious start date emerges from our runs: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade. We definitely had full green for a period in 4.7, so either we are taking longer to shut down (in a way that triggers some unusual behavior) or AWS is still routing traffic to us when we don't expect it to.

Will gather and report more data on this shortly.  In the meantime, the test will be marked flaky while we investigate.
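
For reference, a minimal external probe of the kind described (a sketch, not the exact test harness; TOKEN and KUBE_API are placeholders for a bearer token and the external API URL, and a TERM is sent separately to the apiserver process on the node):

# Poll the API through the external load balancer, opening a new
# connection each second and printing time, HTTP status, and latency.
while true; do
  curl -k -m 2 -o /dev/null \
    -H "Authorization: Bearer ${TOKEN}" \
    -sw "$(date +%T) %{http_code} %{time_total}s\n" \
    "${KUBE_API}/api/v1/namespaces"
  sleep 1
done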

Comment 1 Lukasz Szaszkiewicz 2021-03-29 12:52:37 UTC
Clayton is right; AWS is broken. I ran the following manual test, which demonstrates that new connections are still being forwarded to an API server that is undergoing termination.

1. installed a 4.8 cluster (4.8.0-0.ci-2021-03-28-220420)

2. took one API server out of the pool (mv kube-apiserver-pod.yaml ../) to increase the chance of hitting the server under test.

3. ran the following script (which creates a new connection to the server on each iteration): while true; do curl -H "User-Agent: james" -H "Authorization: Bearer REDACTED" https://api.polmonkmkimp.group-b.devcluster.openshift.com:6443/api/v1/namespaces -k -m 2 -o /dev/null -sw "$(date +%T) %{http_code} %{time_total}s\n"; sleep 1; done

4. sent a SIGTERM to one server: kill -s SIGTERM $PID


Here is what I found: requests were still arriving as late as ~T+70s (the full --shutdown-delay-duration), which is VERY late and indicates that AWS didn't take the API server out of rotation.
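
For orientation, the expected timeline under these settings (a sketch reconstructed from the events below; 70s is --shutdown-delay-duration, and the 80% threshold comes from the patch_genericapiserver.go warnings):

T+0s    SIGTERM received; the pod becomes unready but keeps serving (TerminationStart)
T+56s   ~80% of the shutdown delay has passed; late-connection warnings begin
T+70s   shutdown-delay-duration elapses; the server stops listening (TerminationStoppedServing)
>T+70s  in-flight requests drain until TerminationGracefulTerminationFinished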



E0: T+0s: SIGTERM sent to 10-0-131-183 (date; kill -s SIGTERM 143136; Mon Mar 29 12:16:54 UTC 2021)

E1: T+0s: 10-0-131-183 receives a TERM signal (oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq '.items' | jq '.[] | select(.reason=="TerminationStart")')
{
  "apiVersion": "v1",
  "eventTime": null,
  "firstTimestamp": null,
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal",
    "namespace": "openshift-kube-apiserver"
  },
  "kind": "Event",
  "lastTimestamp": null,
  "message": "Received signal to terminate, becoming unready, but keeping serving",
  "metadata": {
    "creationTimestamp": "2021-03-29T12:16:54Z",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal.1670cf5163faf315",
    "namespace": "openshift-kube-apiserver",
    "resourceVersion": "88811",
    "uid": "ea1efadc-3cdf-4275-a83f-cfd8ccc7472d"
  },
  "reason": "TerminationStart",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "apiserver",
    "host": "ip-10-0-131-183"
  },
  "type": "Normal"
}

E2: T+0s: logs from 10-0-131-183 filtered by the user agent (oc logs -n openshift-kube-apiserver kube-apiserver-ip-10-0-131-183.ec2.internal -f | grep "james")

I0329 12:17:00.565819      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="232.707358ms" userAgent="james" srcIP="10.0.51.233:58801" resp=200
I0329 12:17:02.404500      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="232.509186ms" userAgent="james" srcIP="10.0.51.233:21409" resp=200
I0329 12:17:08.536342      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="226.236527ms" userAgent="james" srcIP="10.0.51.233:23306" resp=200
I0329 12:17:10.368416      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="231.572429ms" userAgent="james" srcIP="10.0.51.233:40110" resp=200
I0329 12:17:19.529428      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="223.988728ms" userAgent="james" srcIP="10.0.51.233:18542" resp=200
I0329 12:17:23.198021      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="232.905483ms" userAgent="james" srcIP="10.0.51.233:46005" resp=200
I0329 12:17:25.039396      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="233.728848ms" userAgent="james" srcIP="10.0.51.233:10436" resp=200
I0329 12:17:26.857096      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="227.374887ms" userAgent="james" srcIP="10.0.51.233:43878" resp=200
I0329 12:17:28.701398      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="223.498453ms" userAgent="james" srcIP="10.0.51.233:64204" resp=200
I0329 12:17:30.530245      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="229.98221ms" userAgent="james" srcIP="10.0.51.233:54945" resp=200
I0329 12:17:32.333707      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="224.751331ms" userAgent="james" srcIP="10.0.51.233:50888" resp=200
I0329 12:17:35.986866      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="228.489524ms" userAgent="james" srcIP="10.0.51.233:5798" resp=200
I0329 12:17:37.805527      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="229.850606ms" userAgent="james" srcIP="10.0.51.233:63211" resp=200
I0329 12:17:41.469994      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="233.488308ms" userAgent="james" srcIP="10.0.51.233:51799" resp=200
I0329 12:17:43.304633      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="231.870691ms" userAgent="james" srcIP="10.0.51.233:31088" resp=200
I0329 12:17:46.961367      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="228.730156ms" userAgent="james" srcIP="10.0.51.233:44535" resp=200
I0329 12:17:50.612538      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="242.212359ms" userAgent="james" srcIP="10.0.51.233:29328" resp=200
W0329 12:17:52.200501      18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:37342, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:17:52.432302      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="231.829746ms" userAgent="james" srcIP="10.0.51.233:37342" resp=200
W0329 12:17:54.026533      18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:27541, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:17:54.255197      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="228.686402ms" userAgent="james" srcIP="10.0.51.233:27541" resp=200
W0329 12:17:55.839230      18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:50792, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:17:56.065143      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="225.954251ms" userAgent="james" srcIP="10.0.51.233:50792" resp=200
W0329 12:17:59.508938      18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:10138, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:17:59.738239      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="229.355668ms" userAgent="james" srcIP="10.0.51.233:10138" resp=200
W0329 12:18:01.342323      18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:54300, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:18:01.575489      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="233.194447ms" userAgent="james" srcIP="10.0.51.233:54300" resp=200
W0329 12:18:03.170068      18 patch_genericapiserver.go:84] Request to "/api/v1/namespaces" (source IP 10.0.51.233:28110, user agent "james") through a connection created very late in the graceful termination process (more than 80% has passed), possibly a sign for a broken load balancer setup.
I0329 12:18:03.396975      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="226.930343ms" userAgent="james" srcIP="10.0.51.233:28110" resp=200
error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""

E3: T+70s: TerminationMinimalShutdownDurationFinished received at 12:18:04 (oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq '.items' | jq '.[] | select(.reason=="TerminationMinimalShutdownDurationFinished")')

{
  "apiVersion": "v1",
  "eventTime": null,
  "firstTimestamp": null,
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal",
    "namespace": "openshift-kube-apiserver"
  },
  "kind": "Event",
  "lastTimestamp": null,
  "message": "The minimal shutdown duration of 1m10s finished",
  "metadata": {
    "creationTimestamp": "2021-03-29T12:18:04Z",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal.1670cf61b0f72d2d",
    "namespace": "openshift-kube-apiserver",
    "resourceVersion": "89139",
    "uid": "4eab8165-cb40-4a7a-b7e7-5401938b82f9"
  },
  "reason": "TerminationMinimalShutdownDurationFinished",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "apiserver",
    "host": "ip-10-0-131-183"
  },
  "type": "Normal"
}

E4: T+70s: TerminationStoppedServing received at 12:18:04 (oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq '.items' | jq '.[] | select(.reason=="TerminationStoppedServing")')
{
  "apiVersion": "v1",
  "eventTime": null,
  "firstTimestamp": null,
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal",
    "namespace": "openshift-kube-apiserver"
  },
  "kind": "Event",
  "lastTimestamp": null,
  "message": "Server has stopped listening",
  "metadata": {
    "creationTimestamp": "2021-03-29T12:18:04Z",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal.1670cf61b1990110",
    "namespace": "openshift-kube-apiserver",
    "resourceVersion": "89140",
    "uid": "84d567bd-f653-4fbd-b542-82fff0c72d81"
  },
  "reason": "TerminationStoppedServing",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "apiserver",
    "host": "ip-10-0-131-183"
  },
  "type": "Normal"
}


E5: T+130s: TerminationGracefulTerminationFinished received at 12:19:04 (oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq '.items' | jq '.[] | select(.reason=="TerminationGracefulTerminationFinished")')
{
  "apiVersion": "v1",
  "eventTime": null,
  "firstTimestamp": null,
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal",
    "namespace": "openshift-kube-apiserver"
  },
  "kind": "Event",
  "lastTimestamp": null,
  "message": "All pending requests processed",
  "metadata": {
    "creationTimestamp": "2021-03-29T12:19:04Z",
    "name": "kube-apiserver-ip-10-0-131-183.ec2.internal.1670cf6fa9f8ef82",
    "namespace": "openshift-kube-apiserver",
    "resourceVersion": "89452",
    "uid": "c6bddd39-5611-4e19-a7fe-3fcbf204ef02"
  },
  "reason": "TerminationGracefulTerminationFinished",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "apiserver",
    "host": "ip-10-0-131-183"
  },
  "type": "Normal"
}
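
For convenience, all of the termination events above can also be pulled chronologically in a single query (a sketch that merges the per-reason queries used above):

oc get events -n openshift-kube-apiserver --sort-by={.metadata.creationTimestamp} -ojson | jq -r '.items[] | select(.reason | startswith("Termination")) | "\(.metadata.creationTimestamp)  \(.reason): \(.message)"'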

Comment 2 Lukasz Szaszkiewicz 2021-03-29 16:33:47 UTC
Did a few more tests; in the last run the server received a request at T0 + 211s.


[root@ip-10-0-131-183 ~]# date; kill -s SIGTERM 278562
Mon Mar 29 16:25:23 UTC 2021


❯ oc logs -n openshift-kube-apiserver kube-apiserver-ip-10-0-131-183.ec2.internal -f | grep "james"
I0329 16:25:52.937653      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="222.902219ms" userAgent="james" srcIP="10.0.91.230:54834" resp=200
I0329 16:25:54.747723      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="223.178316ms" userAgent="james" srcIP="10.0.37.169:63291" resp=200
I0329 16:25:56.586519      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="236.856737ms" userAgent="james" srcIP="10.0.78.223:20031" resp=200
I0329 16:26:00.841269      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="455.263823ms" userAgent="james" srcIP="10.0.51.233:62462" resp=200
I0329 16:26:03.026039      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="462.656103ms" userAgent="james" srcIP="10.0.25.113:57850" resp=200
I0329 16:26:05.218424      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="456.940913ms" userAgent="james" srcIP="10.0.25.113:11272" resp=200
I0329 16:26:07.373366      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="442.235736ms" userAgent="james" srcIP="10.0.25.113:46960" resp=200
I0329 16:26:09.554608      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="462.029928ms" userAgent="james" srcIP="10.0.25.113:16254" resp=200
I0329 16:26:13.770448      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="460.215337ms" userAgent="james" srcIP="10.0.25.113:6017" resp=200
I0329 16:26:15.948896      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="458.875829ms" userAgent="james" srcIP="10.0.25.113:22118" resp=200
I0329 16:26:18.110111      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="447.566662ms" userAgent="james" srcIP="10.0.25.113:14208" resp=200
I0329 16:26:24.324797      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="446.271584ms" userAgent="james" srcIP="10.0.25.113:61408" resp=200
I0329 16:26:26.470754      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="444.751335ms" userAgent="james" srcIP="10.0.25.113:53187" resp=200
I0329 16:26:32.744479      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="455.035465ms" userAgent="james" srcIP="10.0.25.113:24460" resp=200
I0329 16:26:41.314015      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="448.624921ms" userAgent="james" srcIP="10.0.25.113:50744" resp=200
I0329 16:26:43.463580      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="446.781681ms" userAgent="james" srcIP="10.0.25.113:13908" resp=200
I0329 16:26:49.771985      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="439.832331ms" userAgent="james" srcIP="10.0.25.113:60474" resp=200
I0329 16:26:56.056219      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="461.870335ms" userAgent="james" srcIP="10.0.25.113:28137" resp=200
I0329 16:26:58.227994      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="444.899661ms" userAgent="james" srcIP="10.0.25.113:40385" resp=200
I0329 16:27:00.399969      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="455.621468ms" userAgent="james" srcIP="10.0.25.113:48095" resp=200
I0329 16:27:04.588044      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="438.775726ms" userAgent="james" srcIP="10.0.25.113:61575" resp=200
I0329 16:27:08.826198      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="494.197819ms" userAgent="james" srcIP="10.0.25.113:16214" resp=200
I0329 16:27:10.960414      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="447.225468ms" userAgent="james" srcIP="10.0.25.113:5935" resp=200
I0329 16:27:17.268094      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="464.328062ms" userAgent="james" srcIP="10.0.25.113:51641" resp=200
I0329 16:27:21.469696      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="455.002004ms" userAgent="james" srcIP="10.0.25.113:18644" resp=200
I0329 16:27:23.650813      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="457.71601ms" userAgent="james" srcIP="10.0.25.113:13542" resp=200
I0329 16:27:27.865922      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="446.575823ms" userAgent="james" srcIP="10.0.25.113:44279" resp=200
I0329 16:27:30.018900      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="451.042525ms" userAgent="james" srcIP="10.0.25.113:38352" resp=200
I0329 16:27:36.330192      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="492.799839ms" userAgent="james" srcIP="10.0.25.113:45080" resp=200
I0329 16:27:38.470673      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="440.340293ms" userAgent="james" srcIP="10.0.25.113:22339" resp=200
I0329 16:27:46.864711      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="440.568683ms" userAgent="james" srcIP="10.0.25.113:17841" resp=200
I0329 16:27:49.003416      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="440.534977ms" userAgent="james" srcIP="10.0.25.113:47367" resp=200
I0329 16:27:51.176509      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="463.568847ms" userAgent="james" srcIP="10.0.25.113:18046" resp=200
I0329 16:27:53.334598      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="447.761414ms" userAgent="james" srcIP="10.0.25.113:61739" resp=200
I0329 16:28:08.160325      18 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/namespaces" latency="228.933349ms" userAgent="james" srcIP="10.0.6.16:30366" resp=200

Comment 3 Lukasz Szaszkiewicz 2021-03-29 16:34:24 UTC
I'm going to open a support ticket with AWS.

Comment 4 Lukasz Szaszkiewicz 2021-03-29 16:40:47 UTC
Created attachment 1767396 [details]
cloudwatch metrics for the latest run

Comment 18 Ke Wang 2021-05-25 08:23:59 UTC
Since PR 1079 increases the termination timeouts for AWS, we need to verify on both AWS and non-AWS clusters.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-233425   True        False         24h     Cluster version is 4.8.0-0.nightly-2021-05-21-233425

$  oc get infrastructures.config.openshift.io -o yaml | grep -A1 " platformSpec"
    platformSpec:
      type: GCP

$ oc -n openshift-kube-apiserver get cm config -o json | jq -r '.data."config.yaml"' |   jq '.apiServerArguments."shutdown-delay-duration"'
[
  "70s"
]
	
$ oc get pod -n openshift-kube-apiserver kube-apiserver-ci-ln-d6ygqnb-f76d1-d6kbg-master-1 -o yaml | grep 135s
      exec watch-termination --termination-touch-file=/var/log/kube-apiserver/.terminating --termination-log-file=/var/log/kube-apiserver/termination.log --graceful-termination-duration=135s --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig -- hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=${HOST_IP}  -v=2 --permit-address-sharing

- Kube-apiserver rollout:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
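
The rollout can also be watched without tailing the operator logs (a sketch; the nodeStatuses field names come from the static-pod operator status and are worth double-checking):

$ oc get clusteroperator kube-apiserver -w
$ oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}: revision {.currentRevision}{"\n"}{end}'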

Checked the kube-apiserver operator logs; the rollout started at 03:44:35 and ended at 03:55:27, see below:

I0525 03:44:35.707678       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3748cc54-2961-4e73-98c8-2f4e24a89b9d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapUpdated' Updated ConfigMap/kube-apiserver-pod -n openshift-kube-apiserver:
cause by changes in data.forceRedeploymentReason
...
I0525 03:55:27.483033       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3748cc54-2961-4e73-98c8-2f4e24a89b9d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 9"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 8; 2 nodes are at revision 9" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 9"

- Etcd encryption:
$  oc patch apiserver/cluster -p '{"spec":{"encryption":{"type":"aescbc"}}}' --type merge
apiserver.config.openshift.io/cluster patched
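
While the migration runs, progress can be watched on the same Encrypted condition that is queried after completion below (a sketch; -w re-prints the jsonpath output on each status update):

$ oc get kubeapiserver cluster -w -o jsonpath='{range .status.conditions[?(@.type=="Encrypted")]}{.reason}: {.message}{"\n"}{end}'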

Checked the kube-apiserver operator logs; the encryption started at 04:10:25:
...
I0525 04:10:25.175851       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3748cc54-2961-4e73-98c8-2f4e24a89b9d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'EncryptionKeyCreated' Secret "encryption-key-openshift-kube-apiserver-1" successfully created: ["key-does-not-exist"]
...

$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.lastTransitionTime}{"\n"}{.reason}{"\n"}{.message}{"\n"}{end}'
2021-05-25T04:34:59Z
EncryptionCompleted
All resources encrypted: secrets, configmaps

The encryption ended at 04:34:59 and took about 24 minutes.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

$ oc get infrastructures.config.openshift.io -o yaml | grep -A1 " platformSpec"
    platformSpec:
      aws: {}

$ oc -n openshift-kube-apiserver get cm config -o json | jq -r '.data."config.yaml"' |   jq '.apiServerArguments."shutdown-delay-duration"'
[
  "210s"
]

$ oc get pod -n openshift-kube-apiserver kube-apiserver-ip-10-0-49-171.us-east-2.compute.internal -o yaml | grep 275s
      exec watch-termination --termination-touch-file=/var/log/kube-apiserver/.terminating --termination-log-file=/var/log/kube-apiserver/termination.log --graceful-termination-duration=275s --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig -- hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=${HOST_IP}  -v=2 --permit-address-sharing

$ oc exec -n openshift-kube-apiserver-operator kube-apiserver-operator-6fc449958c-sl7p8 -- date; oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
Tue May 25 05:29:06 UTC 2021
kubeapiserver.operator.openshift.io/cluster patched

Checked the kube-apiserver operator logs; the rollout started at 05:29:09 and ended at 05:48:50, taking about 19m41s:

I0525 05:29:09.535071       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"8beb5658-6257-4e15-89cc-d7e17af29b3f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapUpdated' Updated ConfigMap/kube-apiserver-pod -n openshift-kube-apiserver:
cause by changes in data.forceRedeploymentReason
...
I0525 05:48:50.210507       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"8beb5658-6257-4e15-89cc-d7e17af29b3f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 18"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 17; 2 nodes are at revision 18" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 18"

$ oc exec -n openshift-kube-apiserver-operator kube-apiserver-operator-6fc449958c-sl7p8 -- date; oc patch apiserver/cluster -p '{"spec":{"encryption":{"type":"aescbc"}}}' --type merge
Tue May 25 06:38:20 UTC 2021
apiserver.config.openshift.io/cluster patched


Checked the kube-apiserver operator logs; the encryption started at 06:38:21:
...
I0525 06:38:21.862745       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"8beb5658-6257-4e15-89cc-d7e17af29b3f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'EncryptionKeyCreated' Secret "encryption-key-openshift-kube-apiserver-1" successfully created: ["key-does-not-exist"]
...

$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.lastTransitionTime}{"\n"}{.reason}{"\n"}{.message}{"\n"}{end}'
2021-05-25T07:21:02Z
EncryptionCompleted
All resources encrypted: secrets, configmaps

The encryption ended at 07:21:02 and took about 43 minutes.

From the above results, the kube-apiserver rollout and etcd encryption on AWS took much longer than on GCP, which is expected given the increased timeouts (shutdown-delay-duration 210s vs 70s, graceful-termination-duration 275s vs 135s), so moving the bug to VERIFIED.

Comment 19 Michal Fojtik 2021-06-08 14:56:19 UTC
*** Bug 1845412 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2021-07-27 22:56:00 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

