Bug 1971046
Summary: apiserver stops responding during an e2e run (non-graceful shutdown) on GCP

Product: OpenShift Container Platform
Component: Machine Config Operator
Machine Config Operator sub component: platform-none
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: high
Status: CLOSED ERRATA
Target Milestone: ---
Target Release: 4.9.0
Whiteboard: tag-ci
Doc Type: No Doc Update
Story Points: ---
Last Closed: 2021-10-18 17:33:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Reporter: Clayton Coleman <ccoleman>
Assignee: Bob Fournier <bfournie>
QA Contact: Victor Voronkov <vvoronko>
CC: aojeagar, aos-bugs, dmistry, mfojtik, rioliu, smilner, sttts, tsedovic, wking, xxia
Description (Clayton Coleman, 2021-06-11 18:04:23 UTC):
Suspicious jobs, excluding jobs where we expect more aggressive disruption:

    $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=kube-apiserver-new-connection+started+failing' | grep 'failures match' | grep -v 'disruptive\|serial\|upgrade\|launch' | sort
    periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 15 runs, 20% failed, 200% of failures match = 40% impact
    periodic-ci-openshift-release-master-ci-4.9-e2e-gcp (all) - 13 runs, 23% failed, 67% of failures match = 15% impact
    periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt (all) - 5 runs, 20% failed, 300% of failures match = 60% impact
    periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi (all) - 12 runs, 75% failed, 11% of failures match = 8% impact
    periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 8 runs, 63% failed, 40% of failures match = 25% impact
    periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt (all) - 4 runs, 25% failed, 300% of failures match = 75% impact
    pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-five-control-plane-replicas (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
    pull-ci-openshift-console-operator-master-e2e-gcp (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
    pull-ci-openshift-kubernetes-master-e2e-gcp (all) - 11 runs, 27% failed, 33% of failures match = 9% impact
    pull-ci-openshift-operator-framework-olm-master-e2e-gcp (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
    pull-ci-openshift-origin-master-e2e-gcp (all) - 16 runs, 44% failed, 43% of failures match = 19% impact
    rehearse-19156-pull-ci-openshift-ironic-inspector-image-master-e2e-metal-ipi (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
    release-openshift-ocp-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
    release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.8 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
    release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

    $ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&name=periodic-ci-openshift-release-master-ci-4.8-e2e-gcp$&search=kube-apiserver-new-connection+started+failing' | jq -r 'keys[]' | uniq
    https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403059842876182528
    https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403144331979657216
    https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403193044588564480
    https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403242655780966400
    https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403289869676974080
    https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403300727857614848

All of those have similar e2e-interval charts to the original comment 0 job. From what I have seen, the outages are usually brief, just a few seconds, and affect individual requests.
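The "% of failures match" figures above can exceed 100% because junit matches also occur in runs that did not fail outright. A minimal sketch of the arithmetic for the first job in the list, assuming "impact" means matching runs over all runs (my reading of the search.ci output, not documented in this bug):

```shell
# Hypothetical reconstruction of the impact arithmetic for
# periodic-ci-openshift-release-master-ci-4.8-e2e-gcp above.
runs=15; failed_pct=20; match_pct=200

failures=$(( runs * failed_pct / 100 ))    # 3 failed runs
matches=$(( failures * match_pct / 100 ))  # 6 runs contained a match
impact=$(( matches * 100 / runs ))         # matches over all runs
echo "impact = ${impact}%"
```

With 6 matching runs out of 15, this reproduces the reported 40% impact.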
Usually, it fails with "i/o timeout" or "connection reset by peer". To debug the i/o timeouts we would have to enable at least logging on GCP; to debug the connection resets I would need TCP dumps.

In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1403193044588564480 the outage was ~1 second:

    Jun 11 04:12:42.382 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-iv87hgx3-f23e1.gcp-2.ci.openshift.org:6443/api/v1/namespaces/default": dial tcp 35.229.49.178:6443: i/o timeout
    Jun 11 04:12:42.382 - 1s E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
    ...
    Jun 11 04:12:43.520 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests

@Clayton, would it be okay to move this to the network team to help troubleshoot the "read: connection reset by peer" error? After all, this seems to be at the TCP level. The issue is not that uncommon: https://search.ci.openshift.org/?search=.*6443%3A+read%3A+connection+reset+by+peer&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Moving to MCO to improve the gcp-routes script to handle metadata service failures.

Since the PR was applied on MCO, moving to MCO QA.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
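The MCO change referenced above hardens the gcp-routes script against metadata-service hiccups. As an illustrative sketch only (fetch_with_retry, the retry count, and the backoff are hypothetical, not the actual gcp-routes code), the idea is to retry the metadata query and keep the existing routes rather than act on a single failed read:

```shell
# Hypothetical sketch of retrying a flaky metadata-service read.
# Takes the fetch command as arguments so it can be exercised without
# a real GCP metadata server.
fetch_with_retry() {
  local attempt out
  for attempt in 1 2 3; do
    if out=$("$@"); then
      printf '%s\n' "$out"
      return 0
    fi
    # Linear backoff between attempts; no sleep after the last one.
    [ "$attempt" -lt 3 ] && sleep "$attempt"
  done
  return 1  # caller should keep its current routes, not tear them down
}

# Roughly the kind of query gcp-routes makes (assumed endpoint):
# fetch_with_retry curl -sf -H 'Metadata-Flavor: Google' \
#   http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/forwarded-ips/
```

Treating a failed read as "no change" instead of "no forwarded IPs" is what prevents the brief apiserver blackouts seen in the e2e intervals.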