Bug 1971046 - apiserver stops responding during an e2e run (non-graceful shutdown) on GCP
Summary: apiserver stops responding during an e2e run (non-graceful shutdown) on GCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Bob Fournier
QA Contact: Victor Voronkov
URL:
Whiteboard: tag-ci
Depends On:
Blocks:
 
Reported: 2021-06-11 18:04 UTC by Clayton Coleman
Modified: 2021-10-18 17:34 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:33:48 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
GitHub openshift/machine-config-operator pull 2617 (closed): Bug 1971046: templates/master/00-master/gcp/files/opt-libexec-openshift-gcp-routes: Stderr for curl errors (last updated 2021-07-21 08:14:46 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:34:08 UTC)

Description Clayton Coleman 2021-06-11 18:04:23 UTC
About 25m into this e2e run on GCP 4.8, the apiserver stops responding:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403289869676974080

causing a number of test failures and flakes.


During e2e runs the apiserver should remain available with no disruption (other than controlled, graceful disruption that is invisible to users). This needs investigation to determine why the API server dropped out (it looks like a crash, but it may be environmental).

Setting to urgent because there is no allowable reason this should ever happen, and it may represent a rare but serious failure mode; pending investigation.

Similar but not identical behavior in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403193044588564480 (apiserver drop out).

May occur on other platforms, but I don't see an exact duplicate from a quick scan.

Comment 1 W. Trevor King 2021-06-11 18:36:58 UTC
Suspicious jobs, excluding jobs where we expect more aggressive disruption:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=kube-apiserver-new-connection+started+failing' | grep 'failures match' | grep -v 'disruptive\|serial\|upgrade\|launch' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 15 runs, 20% failed, 200% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp (all) - 13 runs, 23% failed, 67% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt (all) - 5 runs, 20% failed, 300% of failures match = 60% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi (all) - 12 runs, 75% failed, 11% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 8 runs, 63% failed, 40% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt (all) - 4 runs, 25% failed, 300% of failures match = 75% impact
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-five-control-plane-replicas (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
pull-ci-openshift-console-operator-master-e2e-gcp (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
pull-ci-openshift-kubernetes-master-e2e-gcp (all) - 11 runs, 27% failed, 33% of failures match = 9% impact
pull-ci-openshift-operator-framework-olm-master-e2e-gcp (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 16 runs, 44% failed, 43% of failures match = 19% impact
rehearse-19156-pull-ci-openshift-ironic-inspector-image-master-e2e-metal-ipi (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-ocp-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.8 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

Comment 3 Lukasz Szaszkiewicz 2021-06-14 12:11:46 UTC
From what I have seen the outages are usually brief, just a few seconds, and affect individual requests.

Usually it fails with an i/o timeout or a connection reset by peer.

To debug the i/o timeouts we would have to at least enable logging on GCP.
To debug the connection resets I would need TCP dumps.
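
For reference, a capture along these lines (run on a control-plane node; the interface choice and output file are placeholders, not from this bug) would provide such TCP dumps:

$ # illustrative only: capture API traffic, then check the capture for TCP resets
$ tcpdump -i any -w /tmp/apiserver-6443.pcap 'tcp port 6443'
$ tcpdump -r /tmp/apiserver-6443.pcap 'tcp[tcpflags] & tcp-rst != 0'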

Comment 4 Lukasz Szaszkiewicz 2021-06-14 12:28:01 UTC
In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403193044588564480 the outage was ~1 second:


Jun 11 04:12:42.382 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-iv87hgx3-f23e1.gcp-2.ci.openshift.org:6443/api/v1/namespaces/default": dial tcp 35.229.49.178:6443: i/o timeout
Jun 11 04:12:42.382 - 1s    E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
...
Jun 11 04:12:43.520 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
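
For context, the kube-apiserver-new-connection sampler opens a fresh connection for each GET against /api/v1/namespaces/default; a rough shell approximation (not the monitor's actual code; the one-second interval and five-second timeout are assumptions) is:

$ # illustrative probe loop against the endpoint from the log above
$ while sleep 1; do
    curl -sk --connect-timeout 5 -o /dev/null -w '%{http_code}\n' \
      "https://api.ci-op-iv87hgx3-f23e1.gcp-2.ci.openshift.org:6443/api/v1/namespaces/default"
  done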

Comment 5 Lukasz Szaszkiewicz 2021-06-14 13:07:12 UTC
@Clayton, would it be okay to move this to the network team to help troubleshoot the "read: connection reset by peer" error? After all, this seems to be at the TCP level.

The issue is not that uncommon: https://search.ci.openshift.org/?search=.*6443%3A+read%3A+connection+reset+by+peer&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 6 Stefan Schimanski 2021-06-21 06:14:33 UTC
Moving to MCO to improve the gcp-routes script's handling of metadata-service failures.
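
The linked machine-config-operator PR (#2617) routes curl errors in the gcp-routes script to stderr; a simplified sketch of that pattern (the metadata URL and variable name here are illustrative, not the script's exact contents) is:

# illustrative sketch, not the actual gcp-routes code
if ! vips="$(curl --silent --show-error --fail -H 'Metadata-Flavor: Google' \
      'http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/forwarded-ips/?recursive=true')"; then
    # report the metadata-service failure on stderr so callers parsing stdout
    # do not mistake the error text for a list of forwarded IPs
    echo "Failed to query GCP metadata service for forwarded IPs" >&2
fi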

Comment 8 Ke Wang 2021-07-21 11:53:31 UTC
Since the PR was applied to the MCO, moving to MCO QA.

Comment 12 errata-xmlrpc 2021-10-18 17:33:48 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

