1921157 – [sig-api-machinery] Kubernetes APIs remain available for new connections

Summary: [sig-api-machinery] Kubernetes APIs remain available for new connections

Keywords:
Status:	CLOSED DUPLICATE of bug 1928946
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Ryan Phillips
QA Contact:	Weinan Liu
Docs Contact:
URL:
Whiteboard:
Depends On:	1845411
Blocks:

Reported:	2021-01-27 15:51 UTC by Francesco Giudici
Modified:	2021-03-03 18:15 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:	[sig-api-machinery] Kubernetes APIs remain available for new connections [sig-api-machinery] Kubernetes APIs remain available with reused connections [sig-api-machinery] OAuth APIs remain available for new connections [sig-api-machinery] OAuth APIs remain available with reused connections [sig-api-machinery] OpenShift APIs remain available for new connections [sig-api-machinery] OpenShift APIs remain available with reused connections [sig-imageregistry] Image registry remain available operator conditions insights
Last Closed:	2021-03-03 18:15:34 UTC
Target Upstream Version:
Embargoed:

Description Francesco Giudici 2021-01-27 15:51:17 UTC

test:
[sig-api-machinery] Kubernetes APIs remain available for new connections 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D+Kubernetes+APIs+remain+available+for+new+connections
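For readers who want to see what the search above actually queries, here is a minimal sketch that decodes the URL's query string with the Python standard library. It only decodes the parameters already present in the link; what each parameter means to the search service is not stated in this report, and the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# The search URL from the description, split across lines for readability.
SEARCH_URL = (
    "https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit"
    "&name=&maxMatches=5&maxBytes=20971520&groupBy=job"
    "&search=%5C%5Bsig-api-machinery%5C%5D+Kubernetes+APIs+remain+available+for+new+connections"
)

# parse_qs percent-decodes values and turns '+' into a space; parameters
# with an empty value (here: name) are dropped by default.
params = {
    key: values[0]
    for key, values in parse_qs(urlsplit(SEARCH_URL).query).items()
}

for key, value in params.items():
    print(f"{key} = {value}")
```

The decoded `search` value is the test name with its square brackets backslash-escaped, which suggests the field is treated as a regular expression.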


Sample failure:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1354023402309947392

During the upgrade, connectivity to the cluster appears to be lost. As a result, the test, which requires continuous connectivity to the cluster, fails.

---
API "kubernetes-api-available-new-connections" was unreachable during disruption for at least 16s of 1h0m4s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jan 26 12:06:29.017 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-bip2nktq-71cad.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 54.147.14.33:6443: connect: connection refused
Jan 26 12:06:29.983 E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
Jan 26 12:06:30.038 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
---
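The length of the outage shown in the excerpt above can be worked out from the timestamps. The following is a rough sketch, not the logic the test suite itself uses: it assumes the line layout seen in the excerpt (month, day, time, level, locator, message), assumes the year 2021 from the report date, and uses log messages trimmed to their first words. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# The three lines quoted above, with the long error message trimmed.
LOG_LINES = [
    "Jan 26 12:06:29.017 E kube-apiserver-new-connection kube-apiserver-new-connection started failing",
    "Jan 26 12:06:29.983 E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests",
    "Jan 26 12:06:30.038 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests",
]


def parse_line(line, year=2021):
    """Split one line into (timestamp, level, locator, message)."""
    month, day, clock, level, locator, message = line.split(" ", 5)
    ts = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f")
    return ts, level, locator, message


def outage_windows(lines):
    """Pair the first 'E' entry per locator with the next 'I' entry."""
    started = {}
    windows = []
    for line in lines:
        ts, level, locator, _ = parse_line(line)
        if level == "E":
            started.setdefault(locator, ts)  # keep the earliest failure
        elif level == "I" and locator in started:
            windows.append((locator, ts - started.pop(locator)))
    return windows


windows = outage_windows(LOG_LINES)
for locator, duration in windows:
    print(f"{locator}: unreachable for {duration.total_seconds():.3f}s")
```

Note that the excerpt covers a single window of roughly one second, whereas the summary line reports at least 16s of disruption over the whole run, so the full log presumably contains further windows.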

As the issue is with keeping connectivity with the cluster, all the sig-api-machinery tests and a few more (added in the Environment field) fail:
[sig-api-machinery] Kubernetes APIs remain available for new connections
[sig-api-machinery] Kubernetes APIs remain available with reused connections
[sig-api-machinery] OAuth APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available with reused connections
[sig-api-machinery] OpenShift APIs remain available for new connections
[sig-api-machinery] OpenShift APIs remain available with reused connections

Initially assigning this to the OpenShift Update Service component for further investigation. The only information I was able to gather is that the issue happens when updating the cluster, so feel free to reassign to a different component if needed.

Comment 1 Francesco Giudici 2021-01-27 16:18:10 UTC

This looks like the 4.7 version of bug #1845411.
It seems to be a known issue, so I'm moving it to the kube-apiserver component.

Comment 2 Stefan Schimanski 2021-01-28 12:07:35 UTC

I don't see why this is urgent. No customer escalation, not blocking every CI run.

Comment 3 Lukasz Szaszkiewicz 2021-01-28 12:20:17 UTC

Per the discussion we had on Slack (https://coreos.slack.com/archives/CB48XQ4KZ/p1611761584134700), I'm assigning this to the node team.

We suspect that the MCO triggers a reboot and doesn't wait for the kubelet to finish all running processes.



For example https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6-stable-to-4.7-ci/1354295675558301696

T0: At 06:45:46: machine-config initiating reboot: Node will reboot into config rendered-master 854a2b589ebed29fbad70d67f2e243c1

T1: At 06:45:46: Stopped Kubernetes Kubelet on ci-op-z52cbzhi-6d7cd-pz2jw-master-0

T2: At 06:45:58: systemd-shutdown was sending SIGTERM to remaining processes...

T3: At 06:45:58: kube-apiserver-ci-op-z52cbzhi-6d7cd-pz2jw-master-0: Received signal to terminate, becoming unready, but keeping serving (TerminationStart event)

T4: At 06:47:08 kube-apiserver-ci-op-z52cbzhi-6d7cd-pz2jw-master-0: The minimal shutdown duration of 1m10s finished (TerminationMinimalShutdownDurationFinished event)

T5: At 06:47:08 kube-apiserver-ci-op-z52cbzhi-6d7cd-pz2jw-master-0: Server has stopped listening (TerminationStoppedServing event)


T5 is the last event reported from that API server. After T5 the server might wait up to 60s for all requests to complete, and then it fires the TerminationGracefulTerminationFinished event.

The ci-op-z52cbzhi-6d7cd-pz2jw-master-0-termination (audit-logs) file suggests the server was forcefully shut down: no TerminationGracefulTerminationFinished event was reported.
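The T0..T5 timeline above can be checked with a small sketch. It takes the times listed for T0..T5, computes the gaps between the steps, and tests whether the final graceful-termination event ever appears. The event labels for T0..T2 are hypothetical shorthand invented for this example; T3..T5 use the event names quoted in the timeline.

```python
from datetime import datetime

# (step, time, event) as listed in the timeline above.
TIMELINE = [
    ("T0", "06:45:46", "RebootInitiated"),   # hypothetical label
    ("T1", "06:45:46", "KubeletStopped"),    # hypothetical label
    ("T2", "06:45:58", "SystemdSigterm"),    # hypothetical label
    ("T3", "06:45:58", "TerminationStart"),
    ("T4", "06:47:08", "TerminationMinimalShutdownDurationFinished"),
    ("T5", "06:47:08", "TerminationStoppedServing"),
]
FINAL_EVENT = "TerminationGracefulTerminationFinished"


def seconds_between(timeline, start, end):
    """Return the number of seconds from step `start` to step `end`."""
    times = {step: datetime.strptime(clock, "%H:%M:%S") for step, clock, _ in timeline}
    return int((times[end] - times[start]).total_seconds())


kubelet_stop_to_sigterm = seconds_between(TIMELINE, "T1", "T2")
min_shutdown = seconds_between(TIMELINE, "T3", "T4")
graceful = FINAL_EVENT in {event for _, _, event in TIMELINE}

print(f"kubelet stop to SIGTERM: {kubelet_stop_to_sigterm}s")
print(f"TerminationStart to minimal shutdown finished: {min_shutdown}s")
print(f"graceful termination reported: {graceful}")
```

The 70s gap between T3 and T4 matches the 1m10s minimal shutdown duration quoted in T4, and the 12s gap between T1 and T2 shows that the API server received its termination signal from systemd-shutdown only after the kubelet had already been stopped.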


It seems that the MCO must wait for the kubelet so that all processes finish, and only then start tearing down other things (like the network or volumes).

Comment 4 Neelesh Agrawal 2021-02-03 20:30:35 UTC

Not a 4.7 blocker.

Comment 6 Miciah Dashiel Butler Masters 2021-02-19 05:13:21 UTC

I'm seeing the following tests fail consistently on the release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.6-to-4.7-ci job:

[sig-api-machinery] Kubernetes APIs remain available for new connections
[sig-api-machinery] Kubernetes APIs remain available with reused connections
[sig-api-machinery] OAuth APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available with reused connections
[sig-api-machinery] OpenShift APIs remain available for new connections
[sig-api-machinery] OpenShift APIs remain available with reused connections

See <https://search.ci.openshift.org/?search=%5C%5Bsig-api-machinery%5C%5D+OAuth+APIs+remain+available+for+new+connections&maxAge=168h&context=1&type=bug%2Bjunit&name=release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.6-to-4.7-ci&maxMatches=5&maxBytes=20971520&groupBy=job>.

Comment 8 Ryan Phillips 2021-03-03 18:15:34 UTC


*** This bug has been marked as a duplicate of bug 1928946 ***
