Bug 1921157

Summary: [sig-api-machinery] Kubernetes APIs remain available for new connections
Product: OpenShift Container Platform Reporter: Francesco Giudici <fgiudici>
Component: Node Assignee: Ryan Phillips <rphillips>
Node sub component: Kubelet QA Contact: Weinan Liu <weinliu>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, mfojtik, mmasters, nagrawal, tsweeney, wking, xxia
Version: 4.7   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
[sig-api-machinery] Kubernetes APIs remain available for new connections
[sig-api-machinery] Kubernetes APIs remain available with reused connections
[sig-api-machinery] OAuth APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available with reused connections
[sig-api-machinery] OpenShift APIs remain available for new connections
[sig-api-machinery] OpenShift APIs remain available with reused connections
[sig-imageregistry] Image registry remain available
operator conditions insights
Last Closed: 2021-03-03 18:15:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1845411    
Bug Blocks:    

Description Francesco Giudici 2021-01-27 15:51:17 UTC
test:
[sig-api-machinery] Kubernetes APIs remain available for new connections 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D+Kubernetes+APIs+remain+available+for+new+connections


Sample failure:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1354023402309947392

During the upgrade, connectivity to the cluster seems to be lost. As a result, tests that require an uninterrupted connection to the cluster fail.

---
API "kubernetes-api-available-new-connections" was unreachable during disruption for at least 16s of 1h0m4s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jan 26 12:06:29.017 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-bip2nktq-71cad.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 54.147.14.33:6443: connect: connection refused
Jan 26 12:06:29.983 E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
Jan 26 12:06:30.038 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
---
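
The length of the outage shown in this excerpt can be estimated from the timestamps. Below is a minimal sketch, not part of the CI tooling: the function name `disruption_windows` is hypothetical, the year 2021 is assumed (the log lines carry no year), and the pairing of "started failing" with "started responding" is an assumption about how these monitor lines relate.

```python
from datetime import datetime

# The three monitor lines quoted above, copied verbatim.
LOG = """\
Jan 26 12:06:29.017 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-bip2nktq-71cad.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 54.147.14.33:6443: connect: connection refused
Jan 26 12:06:29.983 E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
Jan 26 12:06:30.038 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
"""


def disruption_windows(log_text, year=2021):
    """Pair each 'started failing' line with the next 'started responding' line.

    The year is an assumption; the log timestamps do not include one.
    """
    windows = []
    start = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # The timestamp "Jan 26 12:06:29.017" occupies the first 19 characters.
        stamp, message = line[:19], line[20:]
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
        if "started failing" in message and start is None:
            start = ts
        elif "started responding" in message and start is not None:
            windows.append((start, ts))
            start = None
    return windows


windows = disruption_windows(LOG)
total_seconds = sum((end - begin).total_seconds() for begin, end in windows)
print(f"{len(windows)} window(s), {total_seconds:.3f}s unreachable")
```

This excerpt covers only one short window; the 16s reported by the test is the total over the whole run.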

As the issue is with keeping connectivity with the cluster, all the sig-api-machinery tests and a few more (added in the Environment field) fail:
[sig-api-machinery] Kubernetes APIs remain available for new connections
[sig-api-machinery] Kubernetes APIs remain available with reused connections
[sig-api-machinery] OAuth APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available with reused connections
[sig-api-machinery] OpenShift APIs remain available for new connections
[sig-api-machinery] OpenShift APIs remain available with reused connections

Initially assigning this to the OpenShift Update Service component for further investigation. The only information I was able to gather is that the issue happens while the cluster is being updated, so feel free to reassign to a different component if needed.

Comment 1 Francesco Giudici 2021-01-27 16:18:10 UTC
This looks like the 4.7 version of bug #1845411.
It seems to be a known issue, so I'm moving this to the kube-apiserver component.

Comment 2 Stefan Schimanski 2021-01-28 12:07:35 UTC
I don't see why this is urgent. No customer escalation, not blocking every CI run.

Comment 3 Lukasz Szaszkiewicz 2021-01-28 12:20:17 UTC
Per the discussion we had on Slack, I'm assigning this to the node team (https://coreos.slack.com/archives/CB48XQ4KZ/p1611761584134700).

We suspect that the MCO triggers a reboot and doesn't wait for the kubelet to finish all running processes.



For example https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6-stable-to-4.7-ci/1354295675558301696

T0: At 06:45:46: machine-config initiating reboot: Node will reboot into config rendered-master 854a2b589ebed29fbad70d67f2e243c1

T1: At 06:45:46: Stopped Kubernetes Kubelet on ci-op-z52cbzhi-6d7cd-pz2jw-master-0

T2: At 06:45:58: systemd-shutdown was sending SIGTERM to remaining processes...

T3: At 06:45:58: kube-apiserver-ci-op-z52cbzhi-6d7cd-pz2jw-master-0: Received signal to terminate, becoming unready, but keeping serving (TerminationStart event)

T4: At 06:47:08 kube-apiserver-ci-op-z52cbzhi-6d7cd-pz2jw-master-0: The minimal shutdown duration of 1m10s finished (TerminationMinimalShutdownDurationFinished event)

T5: At 06:47:08 kube-apiserver-ci-op-z52cbzhi-6d7cd-pz2jw-master-0: Server has stopped listening (TerminationStoppedServing event)


T5 is the last event reported from that API server. After T5, the server may wait up to 60s for all requests to complete, and then it fires the TerminationGracefulTerminationFinished event.

The ci-op-z52cbzhi-6d7cd-pz2jw-master-0-termination file (audit logs) suggests the server was forcefully shut down: no TerminationGracefulTerminationFinished event was reported.
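
The timing between T0 and the latest possible graceful finish can be checked with a short calculation. This is only a sketch of the arithmetic: the 1m10s and 60s values come from the events and explanation above, the helper `clock` is hypothetical, and times are treated as offsets from midnight because the date of the run is not stated here.

```python
from datetime import timedelta


def clock(hours, minutes, seconds):
    """Time of day as an offset from midnight (the run's date is not needed)."""
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


reboot_initiated = clock(6, 45, 46)    # T0: machine-config initiating reboot
termination_start = clock(6, 45, 58)   # T3: kube-apiserver received the signal

min_shutdown = timedelta(minutes=1, seconds=10)  # minimal shutdown duration
graceful_timeout = timedelta(seconds=60)         # maximum wait after T5

# When the server should stop listening (matches T4/T5 at 06:47:08).
stopped_serving = termination_start + min_shutdown

# Latest time at which TerminationGracefulTerminationFinished could fire.
graceful_deadline = stopped_serving + graceful_timeout

# How long after T0 the node may need to keep the API server alive.
needed_after_reboot = graceful_deadline - reboot_initiated

print("stopped serving:  ", stopped_serving)
print("graceful deadline:", graceful_deadline)
print("needed after T0:  ", needed_after_reboot)
```

Under these assumptions the API server may need up to about 2m22s after the reboot is initiated before it can report a graceful termination.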


It seems that the MCO must wait for the kubelet so that all processes finish, and only then start tearing down other things (like network or volumes).

Comment 4 Neelesh Agrawal 2021-02-03 20:30:35 UTC
Not a 4.7 blocker.

Comment 6 Miciah Dashiel Butler Masters 2021-02-19 05:13:21 UTC
I'm seeing the following tests fail consistently on the release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.6-to-4.7-ci job:

[sig-api-machinery] Kubernetes APIs remain available for new connections
[sig-api-machinery] Kubernetes APIs remain available with reused connections
[sig-api-machinery] OAuth APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available with reused connections
[sig-api-machinery] OpenShift APIs remain available for new connections
[sig-api-machinery] OpenShift APIs remain available with reused connections

See <https://search.ci.openshift.org/?search=%5C%5Bsig-api-machinery%5C%5D+OAuth+APIs+remain+available+for+new+connections&maxAge=168h&context=1&type=bug%2Bjunit&name=release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.6-to-4.7-ci&maxMatches=5&maxBytes=20971520&groupBy=job>.

Comment 8 Ryan Phillips 2021-03-03 18:15:34 UTC

*** This bug has been marked as a duplicate of bug 1928946 ***