Bug 1779938 - GCP: Kube API started failing ... request canceled (Client.Timeout exceeded while awaiting headers)
Summary: GCP: Kube API started failing ... request canceled (Client.Timeout exceeded while awaiting headers)
Keywords:
Status: CLOSED DUPLICATE of bug 1845410
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-05 03:52 UTC by W. Trevor King
Modified: 2020-06-30 10:01 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-18 10:03:16 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2019-12-05 03:52:10 UTC
4.3 release promotion CI [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/build-log.txt | grep 'Kube API started failing\|Kube API started responding' | sort | uniq
Dec 03 22:13:19.021 E kube-apiserver Kube API started failing: Get https://api.ci-op-jw2mh699-34698.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/kube-system?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Dec 03 22:13:19.299 I kube-apiserver Kube API started responding to GET requests
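The "(Client.Timeout exceeded while awaiting headers)" suffix in that error comes from Go's net/http client: it is appended whenever http.Client.Timeout elapses before the server has returned response headers. A minimal sketch (not OpenShift code, only the Go standard library) that reproduces the same error class, with the 3-second client timeout mirroring the timeout=3s query parameter in the log line above and an httptest server standing in for a slow API server:

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Test server that stalls longer than the client is willing to wait for headers.
	slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)
	}))
	defer slow.Close()

	client := &http.Client{Timeout: 3 * time.Second}
	_, err := client.Get(slow.URL)
	// Prints an error ending in "(Client.Timeout exceeded while awaiting headers)";
	// the prefix ("net/http: request canceled" vs. "context deadline exceeded")
	// depends on the Go version.
	fmt.Println(err)
}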

My impression was that we want 100% uptime for the Kube API and that even short flaps like that are things we want to fix.  But if we think they're actually fine (most clients will survive a sub-second outage and retry if they happen to hit the outage), then we should teach the monitor [2] to consider the duration of the outage before complaining, to reduce the noise (a rough sketch of that idea follows the links below).  Currently these outages are very common, showing up in 103 jobs from the past 24h [3] (although most of those outages might also be brief).

Spun off from bug 1779413, which has been re-purposed to look at samples.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513
[2]: https://github.com/openshift/origin/blob/9d9c044e53d4d27b64f9407f7596ba86a0f78e23/pkg/monitor/api.go#L78-L82
[3]: https://search.svc.ci.openshift.org/chart?search=Kube%20API%20started%20failing.*Client.Timeout%20exceeded%20while%20awaiting%20headers
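As a sketch of the "consider the duration before complaining" idea above: instead of flagging every failed GET immediately, the monitor could remember when the API first stopped responding and only emit the "Kube API started failing" event once the outage outlives a grace period. This is not the actual monitor code in [2]; every name here (apiMonitor, observe, outageTolerance, the 10s value) is hypothetical and only illustrates duration-based debouncing:

package main

import (
	"errors"
	"fmt"
	"time"
)

// outageTolerance is the longest blip we would silently ignore (hypothetical value).
const outageTolerance = 10 * time.Second

// apiMonitor tracks the state of the current outage between polls.
type apiMonitor struct {
	apiDown   bool      // whether the last probe failed
	downSince time.Time // when the current outage began
	reported  bool      // whether this outage has already been reported
}

// observe is called after every polling attempt with the probe error (nil on success).
func (m *apiMonitor) observe(now time.Time, probeErr error, report func(string)) {
	switch {
	case probeErr != nil && !m.apiDown:
		// First failure: start the clock but stay quiet for now.
		m.apiDown, m.downSince, m.reported = true, now, false
	case probeErr != nil && m.apiDown && !m.reported && now.Sub(m.downSince) >= outageTolerance:
		// The outage has outlived the grace period, so it is worth an event.
		report("Kube API started failing: " + probeErr.Error())
		m.reported = true
	case probeErr == nil && m.apiDown:
		if m.reported {
			report("Kube API started responding to GET requests")
		}
		m.apiDown = false
	}
}

func main() {
	m := &apiMonitor{}
	report := func(msg string) { fmt.Println(msg) }
	t0 := time.Now()

	// A sub-second blip like the one in the log above: nothing is reported.
	m.observe(t0, errors.New("client timeout"), report)
	m.observe(t0.Add(300*time.Millisecond), nil, report)

	// A sustained outage: reported once it exceeds the tolerance, then the recovery is reported too.
	m.observe(t0.Add(time.Second), errors.New("client timeout"), report)
	m.observe(t0.Add(15*time.Second), errors.New("client timeout"), report)
	m.observe(t0.Add(20*time.Second), nil, report)
}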

Comment 7 Neelesh Agrawal 2020-05-05 15:26:36 UTC
Bumping up severity as 10% of tests are failing with this error.

Comment 8 Stefan Schimanski 2020-05-19 09:49:01 UTC
https://github.com/openshift/machine-config-operator/pull/1670 might help. We will revisit this BZ once the PR has merged.

Comment 9 Stefan Schimanski 2020-05-25 09:02:41 UTC
https://github.com/openshift/machine-config-operator/pull/1670 has merged. Moving to MODIFIED.

Comment 12 Ke Wang 2020-06-03 02:49:00 UTC
This seems to be a common problem. From the search results below, there are 11 bugs related to this, and the error matched 16.52% of failing runs and 12.26% of jobs, so we don't see a significant decline. I don't think it was resolved by PR https://github.com/openshift/machine-config-operator/pull/1670, so I'm assigning it back.

https://search.apps.build01.ci.devcluster.openshift.com/?search=Client.Timeout+exceeded+while+awaiting+headers&maxAge=12h&context=1&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 13 Stefan Schimanski 2020-06-03 08:01:38 UTC
We are working on this topic, but it does not block 4.5 in its current form. Moving to 4.6.

Note that many of the runs in the search in comment 12 are not GCP-related; this issue was about GCP only.

Comment 14 Stefan Schimanski 2020-06-18 10:03:16 UTC
We are tracking these in #1845410.

*** This bug has been marked as a duplicate of bug 1845410 ***

