test: [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully is failing frequently in CI, see search results:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D%5C%5BFeature%3AAPIServer%5C%5D%5C%5BLate%5C%5D+kubelet+terminates+kube-apiserver+gracefully
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1308308963082637312

I0922 08:37:17.056692 3165 test_context.go:429] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
Sep 22 08:37:17.113: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Sep 22 08:37:17.247: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Sep 22 08:37:17.321: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Sep 22 08:37:17.321: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Sep 22 08:37:17.321: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Sep 22 08:37:17.346: INFO: e2e test version: v0.0.0-master+$Format:%h$
Sep 22 08:37:17.361: INFO: kube-apiserver version: v1.19.0+f5121a6
Sep 22 08:37:17.390: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:59
[BeforeEach] [sig-api-machinery][Feature:APIServer][Late]
  github.com/openshift/origin/test/extended/util/client.go:135
STEP: Creating a kubernetes client
[BeforeEach] [sig-api-machinery][Feature:APIServer][Late]
  github.com/openshift/origin/test/extended/util/client.go:111
Sep 22 08:37:17.709: INFO: configPath is now "/tmp/configfile135850088"
Sep 22 08:37:17.709: INFO: The user is now "e2e-test-terminating-kube-apiserver-q24v9-user"
Sep 22 08:37:17.709: INFO: Creating project "e2e-test-terminating-kube-apiserver-q24v9"
Sep 22 08:37:17.888: INFO: Waiting on permissions in project "e2e-test-terminating-kube-apiserver-q24v9" ...
Sep 22 08:37:17.906: INFO: Waiting for ServiceAccount "default" to be provisioned...
Sep 22 08:37:18.030: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
Sep 22 08:37:18.149: INFO: Waiting for ServiceAccount "builder" to be provisioned...
Sep 22 08:37:18.273: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...
Sep 22 08:37:18.307: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...
Sep 22 08:37:18.340: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...
Sep 22 08:37:33.539: INFO: Project "e2e-test-terminating-kube-apiserver-q24v9" has been fully provisioned.
[It] kubelet terminates kube-apiserver gracefully [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/graceful_termination.go:20
[AfterEach] [sig-api-machinery][Feature:APIServer][Late]
  github.com/openshift/origin/test/extended/util/client.go:133
STEP: Collecting events from namespace "e2e-test-terminating-kube-apiserver-q24v9".
STEP: Found 0 events.
Sep 22 08:37:33.656: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Sep 22 08:37:33.656: INFO:
Sep 22 08:37:33.689: INFO: skipping dumping cluster info - cluster too large
Sep 22 08:37:33.736: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-terminating-kube-apiserver-q24v9-user}, err: <nil>
Sep 22 08:37:33.772: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-terminating-kube-apiserver-q24v9}, err: <nil>
Sep 22 08:37:33.820: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  WsIj9qyeQACGSu1dMPeMcgAAAAAAAAAA}, err: <nil>
[AfterEach] [sig-api-machinery][Feature:APIServer][Late]
  github.com/openshift/origin/test/extended/util/client.go:134
Sep 22 08:37:33.820: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-terminating-kube-apiserver-q24v9" for this suite.
Sep 22 08:37:33.871: INFO: Running AfterSuite actions on all nodes
Sep 22 08:37:33.871: INFO: Running AfterSuite actions on node 1

fail [github.com/onsi/ginkgo.0-origin.1+incompatible/internal/leafnodes/runner.go:64]: kube-apiserver reports a non-graceful termination: v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-apiserver-ip-10-0-210-234.ec2.internal.16370d2ec3a8d999", GenerateName:"", Namespace:"openshift-kube-apiserver", SelfLink:"/api/v1/namespaces/openshift-kube-apiserver/events/kube-apiserver-ip-10-0-210-234.ec2.internal.16370d2ec3a8d999", UID:"5ea1e09a-83d8-4769-b84b-1b8dd1ae9deb", ResourceVersion:"64686", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63736359490, loc:(*time.Location)(0x9af3ae0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"watch-termination", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc0020baba0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0020babc0)}}}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-210-234.ec2.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"NonGracefulTermination", Message:"Previous pod kube-apiserver-ip-10-0-210-234.ec2.internal started at 2020-09-22 08:04:53.782347207 +0000 UTC did not terminate gracefully", Source:v1.EventSource{Component:"apiserver", Host:"ip-10-0-210-234"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63736359487, loc:(*time.Location)(0x9af3ae0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63736359487, loc:(*time.Location)(0x9af3ae0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}.

Probably kubelet or CRI-O is not giving the time to cleanly shut down. This can lead to connection refused and network I/O timeout errors in other components.
Trying to break down how this test works.

https://github.com/openshift/kubernetes/blob/master/cmd/watch-termination/main.go
copied from o/o, carried in https://github.com/openshift/kubernetes/pull/332, reworked in the rebase to 1.19 in https://github.com/openshift/kubernetes/pull/325

https://github.com/openshift/origin/blob/master/test/extended/apiserver/graceful_termination.go#L20-L40
The test just checks for the existence of an event with reason NonGracefulTermination.

watch-termination is exec'ed by the kube-apiserver pod: it wraps the execution of `hyperkube kube-apiserver` and emits the event:
https://github.com/openshift/cluster-kube-apiserver-operator/blob/0a69c78cea511a8c8e70b3910c637ffaf2a1dc36/bindata/v4.1.0/kube-apiserver/pod.yaml#L57

It seems like this test condition could be triggered either by the kubelet not honoring terminationGracePeriodSeconds (though we have an upstream test for this) _or_ by the kube-apiserver not terminating within its grace period and being ungracefully killed, which would not be the kubelet's fault. Do we know which it is?
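To reproduce by hand what the test asserts, the same events can be queried on a live cluster; on a healthy run this should come back empty (a sketch, with the reason and namespace taken from the event in the failure output above):

$ oc get events -n openshift-kube-apiserver --field-selector reason=NonGracefulTermination \
    -o custom-columns=LAST:.lastTimestamp,POD:.involvedObject.name,MESSAGE:.message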
This also seems like a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1882111
*** Bug 1882111 has been marked as a duplicate of this bug. ***
*** Bug 1883512 has been marked as a duplicate of this bug. ***
The dup'ed bug 1883512 indicates a higher rate of occurrence on metal-upi.
Increasing the severity and bringing this back into 4.6 to reflect the frequency of the failure on metal-upi: a 30% failure rate versus the normal 93+% pass rate on aws, gcp, azure, openstack, and metal-ipi.
We are looking into this. We see the kube-apiserver not terminating and receiving requests far too late in the termination process. The former might be a regression (Golang 1.15?); the latter is related to load balancers.
Moving back to NEW and assigning back to the node team. @Lukasz, can you attach the test failures that are real?
@sjenning we have only found one case where kube-apiserver kept running and then got killed. Such cases will no longer fail this test, but a different one that watches for GracefulTerminationTimeout events. The other test runs we looked at with NonGracefulTermination were real. Some of them showed that the process was started and killed 2s later. The pod API spec does not allow this behaviour, even if the pod is not ready. My assessment for priority and target release is that this should not block the release, but we have to keep digging and reassess if we notice a critical root cause.
For example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-node-tuning-operator/158/pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws/1311886485250641920

Failed on "Previous pod kube-apiserver-ip-10-0-159-220.us-west-2.compute.internal started at 2020-10-02 04:56:47.852086204 +0000 UTC did not terminate gracefully"

In the termination logs of that node we found:
"
I1002 04:56:46.586764 1 main.go:130] Deleting termination lock file "/var/log/kube-apiserver/.terminating"
---
I1002 04:56:48.551676 1 main.go:155] Received signal terminated. Forwarding to sub-process "hyperkube".
...
I1002 04:56:58.569124 18 httplog.go:94] "HTTP" verb="GET" URI="/readyz" latency="7.006882ms" userAgent="kube-apiserver/v1.19.0 (linux/amd64) kubernetes/1fc699e" srcIP="[::1]:47626" resp=0
W1002 04:56:58.582588 18 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.175.202:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.175.202:2379: connect: connection refused". Reconnecting...
Previous pod kube-apiserver-ip-10-0-159-220.us-west-2.compute.internal started at 2020-10-02 04:56:47.852086204 +0000 UTC did not terminate gracefully
---
"
which suggests that the new pod received the signal to terminate 2s after start.

In kubelet logs we found:
"
Oct 02 04:56:48.448044 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.448025 1500 config.go:278] Setting pods for source file
Oct 02 04:56:48.448330 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.448310 1500 config.go:383] Receiving a new pod "kube-apiserver-ip-10-0-159-220.us-west-2.compute.internal_openshift-kube-apiserver(cbaf1e3447758b0fd5a64e27b90d07df)"
Oct 02 04:56:48.448614 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.448600 1500 kubelet.go:1844] SyncLoop (REMOVE, "file"): "kube-apiserver-ip-10-0-159-220.us-west-2.compute.internal_openshift-kube-apiserver(bfad365b105ae5d6b142f99cc022b13c)"
Oct 02 04:56:48.448701 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.448629 1500 kubelet_pods.go:1212] marking pod pending termination "bfad365b105ae5d6b142f99cc022b13c"
Oct 02 04:56:48.448701 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.448669 1500 kubelet.go:1834] SyncLoop (ADD, "file"): "kube-apiserver-ip-10-0-159-220.us-west-2.compute.internal_openshift-kube-apiserver(cbaf1e3447758b0fd5a64e27b90d07df)"
Oct 02 04:56:48.448773 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.448724 1500 topology_manager.go:233] [topologymanager] Topology Admit Handler
Oct 02 04:56:48.448914 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.448870 1500 kubelet_pods.go:1247] Killing unwanted pod "kube-apiserver-ip-10-0-159-220.us-west-2.compute.internal"
...
Oct 02 04:56:48.470144 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.470130 1500 kubelet.go:1566] Trying to delete pod kube-apiserver-ip-10-0-159-220.us-west-2.compute.internal_openshift-kube-apiserver 17c5add8-4dba-4b33-8589-ee9c562f274e
Oct 02 04:56:48.470144 ip-10-0-159-220 hyperkube[1500]: I1002 04:56:48.470146 1500 mirror_client.go:125] Deleting a mirror pod "kube-apiserver-ip-10-0-159-220.us-west-2.compute.internal_openshift-kube-apiserver" (uid (*types.UID)(0xc002304460))
"
We would like to know why the kubelet killed the pod 2s after it started and why it did not wait for terminationGracePeriodSeconds.
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-upi/1311384650140094464

Failed on "Previous pod kube-apiserver-control-plane-0 started at 2020-09-30 19:32:23.948560597 +0000 UTC did not terminate gracefully"

In the termination logs of that node we found:
"
---
---
Previous pod kube-apiserver-control-plane-0 started at 2020-09-30 19:32:23.948560597 +0000 UTC did not terminate gracefully
---
"
In kubelet logs we found:
"
Sep 30 19:32:03.052045 control-plane-0 hyperkube[1554]: I0930 19:32:03.052015 1554 config.go:383] Receiving a new pod "kube-apiserver-control-plane-0_openshift-kube-apiserver(d11650bf506825db2daf49214b55233f)"
...
Sep 30 19:32:39.112803 control-plane-0 hyperkube[1554]: I0930 19:32:39.112783 1554 config.go:383] Receiving a new pod "kube-apiserver-control-plane-0_openshift-kube-apiserver(935ad700ba6660dbb6b16e08d69161d2)"
Sep 30 19:32:39.113374 control-plane-0 hyperkube[1554]: I0930 19:32:39.113356 1554 config.go:278] Setting pods for source file
Sep 30 19:32:39.113447 control-plane-0 hyperkube[1554]: I0930 19:32:39.113382 1554 kubelet.go:1844] SyncLoop (REMOVE, "file"): "kube-apiserver-control-plane-0_openshift-kube-apiserver(d11650bf506825db2daf49214b55233f)"
Sep 30 19:32:39.113447 control-plane-0 hyperkube[1554]: I0930 19:32:39.113412 1554 kubelet_pods.go:1212] marking pod pending termination "d11650bf506825db2daf49214b55233f"
Sep 30 19:32:39.115834 control-plane-0 hyperkube[1554]: I0930 19:32:39.115806 1554 kubelet.go:1834] SyncLoop (ADD, "file"): "kube-apiserver-control-plane-0_openshift-kube-apiserver(935ad700ba6660dbb6b16e08d69161d2)"
Sep 30 19:32:39.115967 control-plane-0 hyperkube[1554]: I0930 19:32:39.115900 1554 topology_manager.go:233] [topologymanager] Topology Admit Handler
Sep 30 19:32:39.116340 control-plane-0 hyperkube[1554]: I0930 19:32:39.116310 1554 kubelet_pods.go:1247] Killing unwanted pod "kube-apiserver-control-plane-0"
"
We would like to know why the kubelet killed the pod and whether it waited for terminationGracePeriodSeconds before starting a new one.
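For anyone triaging more of these, the two logs being compared above can be pulled straight from the affected master node (a sketch; the node name is the one from the vSphere job above, and the termination log path is an assumption based on the lock-file path shown in these logs):

$ oc adm node-logs control-plane-0 -u kubelet | grep -E 'SyncLoop \((ADD|REMOVE), "file"\)|Killing unwanted pod|pending termination'
$ oc debug node/control-plane-0 -- chroot /host cat /var/log/kube-apiserver/termination.log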
per https://bugzilla.redhat.com/show_bug.cgi?id=1882750#c9 deferring to 4.7 and setting priority high
Since comment #9, we have seen a dramatic shift in frequency. Some of the top movers are:

release-openshift-ocp-installer-e2e-azure-4.6                      88.00% (0.00%) (25 runs)    100.00% (0.00%) (39 runs)
release-openshift-origin-installer-e2e-remote-libvirt-s390x-4.6    90.00% (0.00%) (10 runs)    100.00% (0.00%) (6 runs)
release-openshift-ocp-installer-e2e-aws-4.6                        92.00% (0.00%) (50 runs)     97.06% (0.00%) (68 runs)
periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere           93.94% (0.00%) (33 runs)    100.00% (0.00%) (35 runs)

@sjenning: this suddenly got a lot more severe and would explain the 10% increase in upgrade availability downtime that happened at the same time. Given the rapid shift in frequency and the direct impact on azure 4.6 to 4.6 upgrade API availability, I think this deserves more investigation.
*** Bug 1886563 has been marked as a duplicate of this bug. ***
Cross-linking to https://bugzilla.redhat.com/show_bug.cgi?id=1886499, which is marked by a significant drop in availability of kube-apiserver on azure upgrades: 4.6 to 4.6 fails a lot more often, and 4.5 to 4.6 fails a little more often.
Triage of https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4250/pull-ci-openshift-installer-master-e2e-aws/1314345355613048832:

Master node 10.0.248.159 termination.log:

I1008 23:52:27.380785 1 main.go:176] Received signal terminated. Forwarding to sub-process "hyperkube".
...
I1008 23:52:37.330384 17 httplog.go:94] "HTTP" verb="GET" URI="/readyz" latency="2.792032ms" userAgent="kube-apiserver/v1.19.0 (linux/amd64) kubernetes/75c8d48" srcIP="[::1]:55198" resp=0
Previous pod kube-apiserver-ip-10-0-248-159.us-east-2.compute.internal started at 2020-10-08 23:52:27.031713422 +0000 UTC did not terminate gracefully

Conclusion: this looks like a real non-graceful termination after exactly 10s.
Addition to #16: the signal was received before the pod got ready. But during the following 10s it became ready and received traffic.
*** Bug 1892536 has been marked as a duplicate of this bug. ***
I dunno why the subcomponent wasn't set. Taking a guess now. Pretty common in recent CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&search=kube-apiserver%20reports%20a%20non-graceful%20termination' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-workers-rhel7 - 3 runs, 67% failed, 50% of failures match
pull-ci-openshift-cloud-credential-operator-master-e2e-aws - 6 runs, 100% failed, 17% of failures match
...
pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade - 6 runs, 17% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-4.7 - 6 runs, 50% failed, 67% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.7 - 7 runs, 71% failed, 20% of failures match
release-openshift-okd-installer-e2e-aws-4.6 - 10 runs, 60% failed, 33% of failures match
release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.5-stable-to-4.6-ci - 10 runs, 20% failed, 50% of failures match
release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.8 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 32 runs, 19% failed, 67% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci - 4 runs, 50% failed, 150% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6-stable-to-4.7-ci - 4 runs, 25% failed, 100% of failures match

Not sure if we expect aws-4.7 to still be showing these with bug 1888015 landed almost a month ago.
I am not seeing the UID changing when the static pod is laid down again.

One other issue with this for static pods is that StopPodSandbox only uses 10 seconds for the sandbox container shutdowns:
https://github.com/cri-o/cri-o/blob/928edf243bf18b3451d858f3132a6b76a64a6a03/server/sandbox_stop_linux.go#L75
Filling in a few details from a recent 4.7 cluster.

Different UIDs are generated for each revision on disk:

sh-4.4# cat /etc/kubernetes/static-pod-resources/kube-apiserver-pod-10/kube-apiserver-pod.yaml | jq
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-apiserver",
    "namespace": "openshift-kube-apiserver",
    "uid": "ecef213c-0269-414a-85dd-a84e3728f3b6",

sh-4.4# cat /etc/kubernetes/static-pod-resources/kube-apiserver-pod-9/kube-apiserver-pod.yaml | jq
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-apiserver",
    "namespace": "openshift-kube-apiserver",
    "uid": "0313bdc1-5cbb-4bf0-8120-3a9b35d5bff0",

But the kubelet appears to be ignoring the UID specified on disk and creating its own.

From the API (which the kubelet sets):

  annotations:
    kubectl.kubernetes.io/default-logs-container: kube-apiserver
    kubernetes.io/config.hash: e184e042-37e1-426c-89fb-46e7bc404b96
    kubernetes.io/config.mirror: e184e042-37e1-426c-89fb-46e7bc404b96
  ...
  resourceVersion: "548425"
  selfLink: /api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-150-183.us-west-1.compute.internal
  uid: f7baa88e-9d03-4c13-80a7-0131a1c9efa8

From disk (which the operator sets in the manifest):

sh-4.4# cat /etc/kubernetes/manifests/kube-apiserver-pod.yaml | jq
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-apiserver",
    "namespace": "openshift-kube-apiserver",
    "uid": "e184e042-37e1-426c-89fb-46e7bc404b96",

So I think it's being specified, but the kubelet may be creating duplicates.
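A quick way to put those values side by side: read the UID out of the on-disk manifest (it is JSON, which is why jq works on it above) and compare it against the mirror pod's uid and config.mirror annotation (a sketch; the pod name is the one from the excerpts above):

sh-4.4# jq -r .metadata.uid /etc/kubernetes/manifests/kube-apiserver-pod.yaml
$ oc get pod -n openshift-kube-apiserver kube-apiserver-ip-10-0-150-183.us-west-1.compute.internal -o yaml | grep -E '^  uid:|config\.mirror:'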
Talked to Mrunal and debugged this further. When the static pod is updated, the kubelet sees a stop and a start event. The kubelet sends down a StopPodSandbox, and CRI-O only waits 10 seconds (in parallel) for each container; this is by design. We added support earlier in the year for graceful termination on static pod deletion, but updates are proving to be a different scenario.
When a static pod is created, a corresponding mirror pod is created within the API. These two objects will have different UIDs. The kubelet maintains a mapping from the static pod to the mirror pod and vice versa.
This bug is also affecting at least https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-blocking#release-openshift-origin-installer-e2e-azure-4.7, which does block payload promotion. It's failing for this job approximately 10% of the time recently.

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.7/1328700621242503168
Master team added a UID to the static pod manifests recently. I am seeing the static pod be destroyed and recreated on an update. Do you see an improvement within CI?
(In reply to Ryan Phillips from comment #27)
> Master team added a UID to the static pod manifests recently. I am seeing
> the static pod be destroyed and recreated on an update. Do you see an
> improvement within CI?

No, actually the job I was starting to look at (release-openshift-origin-installer-e2e-azure-4.7) seems to be seeing this test case fail more often recently than before. It failed approximately 7 times in 75 jobs over the past week, but had zero failures the week before that.
I'm still seeing this fail frequently in the last 24 hours: https://search.ci.openshift.org/?search=failed%3A.*%5C%5Bsig-api-machinery%5C%5D%5C%5BFeature%3AAPIServer%5C%5D%5C%5BLate%5C%5D+kubelet+terminates+kube-apiserver+gracefully&maxAge=24h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
(In reply to Russell Teague from comment #29)
> I'm still seeing this fail frequently in the last 24 hours:

I took a sample (14) of these failures after removing cases where there were too many other failures. I have noticed a few things:

1. The graceful termination failure does not happen in isolation.
2. The following failures occur very frequently alongside the graceful termination failure:
   ": [sig-network] pods should successfully create sandboxes by getting pod" - 11 out of 14
   ": [sig-network] pods should successfully create sandboxes by reading container" - 9 out of 14
   ": [sig-network] pods should successfully create sandboxes by other" - 8 out of 14
3. I did not see any graceful termination failure on its own, i.e. it is always accompanied by other failures.
I tested this locally and am seeing the following: a log with a clean exit, then a partial log, and an exit code of 1. The test looks for any non-zero exit code.

I1121 02:36:31.517184 152 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1121 02:36:38.407664 152 client.go:360] parsed scheme: "passthrough"
I1121 02:36:38.407714 152 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.2.70:2379 <nil> 0 <nil>}] <nil> <nil>}
I1121 02:36:38.407722 152 clientconn.go:948] ClientConn switching balancer to "pick_first"
I1121 02:36:38.407983 152 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc02e03c380, {CONNECTING <nil>}
I1121 02:36:38.414470 152 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc02e03c380, {READY <nil>}
I1121 02:36:38.415207 152 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1121 02:36:41.385931 152 client.go:360] parsed scheme: "passthrough"
I1121 02:36:41.386102 152 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.2.71:2379 <nil> 0 <nil>}] <nil> <nil>}
I1121 02:36:41.386120 152 clientconn.go:948] ClientConn switching balancer to "pick_first"
I1121 02:36:41.386226 152 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc02e0f4ef0, {CONNECTING <nil>}
I1121 02:36:41.392751 152 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc02e0f4ef0, {READY <nil>}
I1121 02:36:41.393398 152 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1121 02:36:46.453117 152 genericapiserver.go:667] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-master-0.ocp.b1b.in", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationGracefulTerminationFinished' All pending requests processed
W1121 02:36:46.453335 152 genericapiserver.go:670] failed to create event openshift-kube-apiserver/kube-apiserver-master-0.ocp.b1b.in.164965604f209a93: Event "kube-apiserver-master-0.ocp.b1b.in.164965604f209a93" is invalid: [eventTime: Required value, source: Invalid value: "": needs to be unset]
I1121 02:36:46.478210 1 main.go:198] Termination finished with exit code 0
---
I1121 02:41:56.954001 1 main.go:176] Received signal terminated. Forwarding to sub-process "hyperkube".
I1121 02:41:56.954798 18 genericapiserver.go:667] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-master-0.ocp.b1b.in", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationStart' Received signal to terminate, becoming unready, but keeping serving
W1121 02:41:56.954935 18 genericapiserver.go:670] failed to create event openshift-kube-apiserver/kube-apiserver-master-0.ocp.b1b.in.164965a89a77e485: Event "kube-apiserver-master-0.ocp.b1b.in.164965a89a77e485" is invalid: [eventTime: Required value, source: Invalid value: "": needs to be unset]
I1121 02:41:56.954965 18 controller.go:181] Shutting down kubernetes service endpoint reconciler
I1121 02:41:57.026911 1 main.go:176] Received signal terminated. Forwarding to sub-process "hyperkube".
I1121 02:41:57.097255 1 main.go:198] Termination finished with exit code 1
I1121 02:41:57.097381 1 main.go:151] Deleting termination lock file "/var/log/kube-apiserver/.terminating"
---
Previous pod kube-apiserver-master-0.ocp.b1b.in started at 2020-11-21 02:43:11.186755311 +0000 UTC did not terminate gracefully
---

Additionally, the next log shows an exit code of 0, deletion of the lock file, and still a restart with a "did not terminate gracefully" message:

I1121 01:32:27.079341 19 genericapiserver.go:667] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-master-0.ocp.b1b.in", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationGracefulTerminationFinished' All pending requests processed
W1121 01:32:27.079554 19 genericapiserver.go:670] failed to create event openshift-kube-apiserver/kube-apiserver-master-0.ocp.b1b.in.164961ddba863f43: Event "kube-apiserver-master-0.ocp.b1b.in.164961ddba863f43" is invalid: [eventTime: Required value, source: Invalid value: "": needs to be unset]
I1121 01:32:27.109742 1 main.go:198] Termination finished with exit code 0
I1121 01:32:27.109876 1 main.go:151] Deleting termination lock file "/var/log/kube-apiserver/.terminating"
---
Previous pod kube-apiserver-master-0.ocp.b1b.in started at 2020-11-21 01:33:33.919585476 +0000 UTC did not terminate gracefully
---
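To see which of these patterns a given master hit, the relevant markers can be grepped out of the node's termination log (a sketch; the node name is the one from the local test above, and the termination log path is an assumption based on the lock-file path in the log):

$ oc debug node/master-0.ocp.b1b.in -- chroot /host grep -E 'Termination finished with exit code|Deleting termination lock file|did not terminate gracefully' /var/log/kube-apiserver/termination.log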
We are still seeing this failure. Search results for 4.7:
https://search.ci.openshift.org/?search=failed%3A.*%5C%5Bsig-api-machinery%5C%5D%5C%5BFeature%3AAPIServer%5C%5D%5C%5BLate%5C%5D+kubelet+terminates+kube-apiserver+gracefully&maxAge=24h&context=1&type=build-log&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Is this closed as NOTABUG because of comment #30, which seems to say this failure is just fall-out from other troubles/failures in the job? If so, do we have other BZs for those things?
From comment #32:

> 4. Our pod uses a resource on the host that cannot be shared (hostPort in our case). Since we have specified terminationGracePeriodSeconds (60) seconds in our pod yaml, our existing static pod won't be gone before that time, this also causes the new replacement static pod to fail to start,

This is wrong. Pods in graceful termination must keep existing, and no other pod of the same name is allowed to launch. That is the invariant pods have and how normal pods behave, and the same should be true for static pods. If the new pod is started before the termination period is over, that is a bug.
If we don't have that invariant, health check probes will be broken. Imagine the gracefully terminating pod keeps the /health port open. The new one starts and blocks because it cannot get the port, but the old one happily answers /health on its behalf, so the kubelet will set the wrong status for the pod. We have a number of static pods, and all of them have undefined startup behaviour because they expect the pod name to be an exclusive resource. The kubelet breaks that assumption, leading to undefined behaviour and hard-to-debug bugs. We have to get this right; OpenShift depends on it.
*** Bug 1886920 has been marked as a duplicate of this bug. ***
*** Bug 1913532 has been marked as a duplicate of this bug. ***
*** Bug 1917133 has been marked as a duplicate of this bug. ***
*** Bug 1916880 has been marked as a duplicate of this bug. ***
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-22-134922   True        False         5h      Cluster version is 4.7.0-0.nightly-2021-01-22-134922

Connect to one master node: oc debug node/ip-xx-xx-xxx-182.us-east-2.compute.internal

For kube-apiserver, change one container's requested memory size in kube-apiserver-pod.yaml. Before the change, check the current process IDs of kube-apiserver:

# ps -ef |grep ' kube-apiserver ' | grep -v grep | awk '{print $2}'
26614
26651

After waiting for a while, the process IDs of kube-apiserver have changed:

# ps -ef |grep ' kube-apiserver ' | grep -v grep | awk '{print $2}'
183910
183955

Checking the termination.log:
...
I0125 08:16:57.961330 18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0125 08:17:11.107619 18 genericapiserver.go:693] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-xx-xx-xxx-182.us-east-2.compute.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationGracefulTerminationFinished' All pending requests processed
I0125 08:17:11.155293 1 main.go:198] Termination finished with exit code 0

In another terminal, check whether the kube-apiserver has been restarted:

$ oc get pods -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                                         READY   STATUS    RESTARTS   AGE    LABELS
kube-apiserver-ip-xx-xx-xxx-182.us-east-2.compute.internal   5/5     Running   18         5h6m   apiserver=true,app=openshift-kube-apiserver,revision=5
...

Checked the pod describe:
...
Normal   Killing    3m57s (x2 over 83m)   kubelet  Stopping container kube-apiserver-insecure-readyz
Normal   Killing    3m57s (x2 over 83m)   kubelet  Stopping container kube-apiserver-cert-regeneration-controller
Normal   Killing    3m57s (x2 over 83m)   kubelet  Stopping container kube-apiserver-cert-syncer
Normal   Killing    3m57s (x2 over 83m)   kubelet  Stopping container kube-apiserver-check-endpoints
Warning  Unhealthy  3m57s                 kubelet  Readiness probe failed: Get "https://xx.xx.xxx.182:17697/healthz": dial tcp xx.xx.xxx.182:17697: connect: connection refused
Normal   Killing    3m57s (x2 over 83m)   kubelet  Stopping container kube-apiserver
Normal   Pulled     3m57s (x4 over 5h6m)  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bdd57b27d1cbd43709ebf411a6519e251a9b6d63867abe5997b3417c12adc4bf" already present on machine
Normal   Killing    3m57s                 kubelet  Container kube-apiserver-cert-syncer definition changed, will be restarted
Normal   Started    3m56s (x4 over 5h6m)  kubelet  Started container kube-apiserver-cert-syncer
Normal   Created    3m56s (x4 over 5h6m)  kubelet  Created container kube-apiserver-cert-syncer
Normal   Pulled     3m53s (x4 over 5h6m)  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bdd57b27d1cbd43709ebf411a6519e251a9b6d63867abe5997b3417c12adc4bf" already present on machine
Normal   Created    3m52s (x4 over 5h6m)  kubelet  Created container kube-apiserver-cert-regeneration-controller
Normal   Created    3m52s (x4 over 5h6m)  kubelet  Created container kube-apiserver-insecure-readyz
Normal   Pulled     3m52s (x4 over 5h6m)  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bdd57b27d1cbd43709ebf411a6519e251a9b6d63867abe5997b3417c12adc4bf" already present on machine
Normal   Started    3m52s (x4 over 5h6m)  kubelet  Started container kube-apiserver-cert-regeneration-controller
Warning  Unhealthy  2m47s (x7 over 82m)   kubelet  Readiness probe failed: Get "https://xx.xx.xxx.182:6443/healthz": dial tcp xx.xx.xxx.182:6443: connect: connection refused

Mon Jan 25 17:38:45 [kewang@kewang-fedora env471]$ oc get pods -n openshift-kube-apiserver --show-labels -l apiserver

Based on the above results, the static pod was updated respecting terminationGracePeriodSeconds, so moving the bug to VERIFIED.
This is still failing periodically in CI. Here are four recent jobs using payloads from 1/25, which is newer than the version @Kew was looking at.

search.ci shows it failing in 2.7% of all failed jobs in the past day:
https://search.ci.openshift.org/?search=failed%3A.*%5C%5Bsig-api-machinery%5C%5D%5C%5BFeature%3AAPIServer%5C%5D%5C%5BLate%5C%5D+kubelet+terminates+kube-apiserver+gracefully&maxAge=24h&context=1&type=build-log&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

The four recent jobs with the failure:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.7/1353632764955463680
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.7/1353632765123235840
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1353504728503291904
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1353484968273121280

Do we need to re-open this bug or file a new one?
I will try again with the latest payloads; not sure why this problem still arises.
Repeated https://bugzilla.redhat.com/show_bug.cgi?id=1882750#c51:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-26-024127   True        False         46m     Cluster version is 4.7.0-0.nightly-2021-01-26-024127

sh-4.4# vi kube-apiserver-pod.yaml   # Changed some fields and saved.
sh-4.4# date
Tue Jan 26 09:05:32 UTC 2021
sh-4.4# ps -ef |grep ' kube-apiserver ' | grep -v grep | awk '{print $2}'
42598
42657
sh-4.4# date; ps -ef |grep ' kube-apiserver ' | grep -v grep | awk '{print $2}'
Tue Jan 26 09:07:49 UTC 2021
85534
85574

After 135s (terminationGracePeriodSeconds), the new kube-apiserver was started. Checking the kube-apiserver events below, containers were killed at 7m14s and started at 5m1s, so the process is as expected.
...
Events:
  Type     Reason          Age                   From     Message
  ----     ------          ----                  ----     -------
  Normal   Pulled          33m                   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:032b991ac3321b32478efc76c159ac5d918e751a63840f0cb52ffe6ec344341b" already present on machine
  Normal   Created         33m                   kubelet  Created container kube-apiserver-cert-syncer
  Normal   Started         33m                   kubelet  Started container kube-apiserver-cert-syncer
  Normal   Pulled          33m                   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:032b991ac3321b32478efc76c159ac5d918e751a63840f0cb52ffe6ec344341b" already present on machine
  Normal   Pulled          33m                   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:032b991ac3321b32478efc76c159ac5d918e751a63840f0cb52ffe6ec344341b" already present on machine
  Normal   Started         33m                   kubelet  Started container kube-apiserver-cert-regeneration-controller
  Normal   Created         33m                   kubelet  Created container kube-apiserver-cert-regeneration-controller
  Normal   Pulled          33m                   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:032b991ac3321b32478efc76c159ac5d918e751a63840f0cb52ffe6ec344341b" already present on machine
  Normal   Started         33m                   kubelet  Started container kube-apiserver-insecure-readyz
  Normal   Created         33m                   kubelet  Created container kube-apiserver-insecure-readyz
  Normal   Created         33m                   kubelet  Created container kube-apiserver-check-endpoints
  Normal   Started         33m                   kubelet  Started container kube-apiserver-check-endpoints
  Normal   Killing         7m14s                 kubelet  Stopping container kube-apiserver
  Normal   Killing         7m14s                 kubelet  Stopping container kube-apiserver-check-endpoints
  Normal   Killing         7m14s                 kubelet  Stopping container kube-apiserver-cert-regeneration-controller
  Normal   Killing         7m14s                 kubelet  Stopping container kube-apiserver-insecure-readyz
  Normal   Killing         7m14s                 kubelet  Stopping container kube-apiserver-cert-syncer
  Normal   SandboxChanged  5m4s                  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          5m3s (x2 over 33m)    kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68bc81cc71914f43dac3f6915497904aec847ee1ab8fee2ee156832efd5cd247" already present on machine
  Normal   Pulled          5m2s (x2 over 33m)    kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68bc81cc71914f43dac3f6915497904aec847ee1ab8fee2ee156832efd5cd247" already present on machine
  Normal   Started         5m2s (x2 over 33m)    kubelet  Started container setup
  Normal   Created         5m2s (x2 over 33m)    kubelet  Created container setup
  Normal   Started         5m1s (x2 over 33m)    kubelet  Started container kube-apiserver
  Normal   Created         5m1s (x2 over 33m)    kubelet  Created container kube-apiserver

For the e2e case https://github.com/openshift/origin/pull/25568/files running in https://bugzilla.redhat.com/show_bug.cgi?id=1882750#c52 above, I guess it's just an events check for the keyword 'GracefulTerminationTimeout', but we still don't know the root cause of what happened.
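For reference, the grace period the kubelet is expected to honor (the 135s mentioned above) can be read directly off the running pods (a sketch):

$ oc get pods -n openshift-kube-apiserver -l apiserver -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.terminationGracePeriodSeconds}{"\n"}{end}'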
This is still happening, even with the different test. The underlying issue may not have been fully resolved, or the fix may have moved the issue to a different area of code.
We're still seeing this in aws-serial jobs, after https://bugzilla.redhat.com/show_bug.cgi?id=1925262 was verified. There may be a second unsolved issue here. Reopening. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.7/1359556162759954432
Still waiting on other patches... Marking as verified.
Proper bug for tracking this is https://bugzilla.redhat.com/show_bug.cgi?id=1927500
https://github.com/openshift/kubernetes/pull/558 merged, and https://amd64.ocp.releases.ci.openshift.org/releasestream/4.8.0-0.nightly/release/4.8.0-0.nightly-2021-02-11-053857 includes it according to the changelog, but the test still failed in several jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.8/1359791722200043520
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1359789235325898752
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.8-e2e-metal-ipi-upgrade/1359739147568287744
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.8-e2e-metal-ipi-upgrade/1359739160054730752

(updated https://bugzilla.redhat.com/show_bug.cgi?id=1926484 too)
*** Bug 1918930 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633