Bug 1748434
| Summary: | KubeAPIErrorsHigh firing on a daily basis but at random times | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Reber <sreber> |
| Component: | kube-apiserver | Assignee: | Lukasz Szaszkiewicz <lszaszki> |
| Status: | CLOSED NOTABUG | QA Contact: | Xingxing Xia <xxia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.11.0 | CC: | aabhishe, achyudanand.singh, agogala, aivaraslaimikis, alchan, alegrand, anpicker, aos-bugs, christian.heike, ckoep, cruhm, dahernan, dkulkarn, erich, erooth, fhirtz, fshaikh, hongkliu, jmatthew, kakkoyun, kelly.brown1, lars.erhardt.extern, lcosic, lszaszki, mf.flip, mfojtik, mloibl, mrhodes, nijoshi, pbertera, pkanthal, pkrupa, prdeshpa, rekhan, rkshirsa, rpalathi, rsandu, sferguso, simore, sttts, surbania, thomas.rumbaut, tmicheli, weihuang, wking |
| Target Milestone: | --- | Flags: | erich: needinfo? |
| Target Release: | 3.11.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-25 06:48:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Hi there, I'm getting the same issue here: KubeAPIErrorsHigh with 100% of the requests, but the cluster is fully functional (etcd and the Kubernetes API are working) and there are no related errors in the logs. The only thing I found in my case was these log entries, which appear repeatedly at certain times:

~~~
atomic-openshift-node[515]: I0108 00:15:31.275274 515 prober.go:111] Readiness probe for "apiserver-tmn7j_kube-service-catalog(5f772144-822e-11e9-b13c-00505697c84d):apiserver" failed (failure): Get https://192.168.0.1:6443/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
~~~

That is the apiserver from the service catalog, not the Kubernetes API, so I don't know whether this is related. Do you need any more information to investigate further?

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I am facing a similar issue for one of my clients. The summary of the issue and the logs are below. Can you please confirm whether you have any solution?

View in AlertManager [2] Resolved

Labels:
alertname = KubeAPIErrorsHigh
client = Swagger-Codegen/8.0.1/python
cluster = np1
code = 500
contentType = application/json
endpoint = https
job = apiserver
namespace = default
prometheus = openshift-monitoring/k8s
resource = customresourcedefinitions
scope = cluster
service = kubernetes
severity = critical
verb = PATCH

Annotations:
message = API server is erroring for 100% of requests.

Source Labels:
alertname = KubeAPIErrorsHigh
client = Swagger-Codegen/8.0.1/python
cluster = np1
code = 500
contentType = application/json
endpoint = https
job = apiserver
namespace = default
prometheus = openshift-monitoring/k8s
resource = customresourcedefinitions
scope = cluster
service = kubernetes
severity = warning
verb = PATCH

Annotations:
message = API server is erroring for 100% of requests.

Source

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

@Achyudanand Singh: do you have any update on this?

We are facing exactly the same issue after installing Cluster Application Migration (CAM) in the openshift-migration project on an OCP 3.11.200 cluster. It is unclear to us which component (a CRD?) is the root cause of these 500s. The API logs don't show anything related (DEBUG_LOGLEVEL=6).
Query:
~~~
sum without(instance, pod) (rate(apiserver_request_count{code=~"^(?:5..)$",job="apiserver"}[5m])) / sum without(instance, pod) (rate(apiserver_request_count{job="apiserver"}[5m])) * 100 > 5
~~~
Returns:
~~~
{client="Swagger-Codegen/8.0.1/python",code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="customresourcedefinitions",scope="cluster",service="kubernetes",verb="PATCH"}
~~~
Alert:
~~~
alertname="KubeAPIErrorsHigh" client="Swagger-Codegen/8.0.1/python" code="500" contentType="application/json" endpoint="https" job="apiserver" namespace="default" resource="customresourcedefinitions" scope="cluster" service="kubernetes" severity="critical" verb="PATCH"
~~~
CRDs created at the point this alert appeared:
~~~
$ oc get crd | grep '06-30'
backups.velero.io 2020-06-30T14:04:32Z
backupstoragelocations.velero.io 2020-06-30T14:04:33Z
deletebackuprequests.velero.io 2020-06-30T14:04:33Z
downloadrequests.velero.io 2020-06-30T14:04:33Z
migclusters.migration.openshift.io 2020-06-30T14:04:38Z
mighooks.migration.openshift.io 2020-06-30T14:04:41Z
migmigrations.migration.openshift.io 2020-06-30T14:04:39Z
migplans.migration.openshift.io 2020-06-30T14:04:40Z
migrationcontrollers.migration.openshift.io 2020-06-30T14:03:11Z
migstorages.migration.openshift.io 2020-06-30T14:04:42Z
podvolumebackups.velero.io 2020-06-30T14:04:33Z
podvolumerestores.velero.io 2020-06-30T14:04:33Z
resticrepositories.velero.io 2020-06-30T14:04:33Z
restores.velero.io 2020-06-30T14:04:33Z
schedules.velero.io 2020-06-30T14:04:33Z
serverstatusrequests.velero.io 2020-06-30T14:04:33Z
volumesnapshotlocations.velero.io 2020-06-30T14:04:33Z
~~~
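For context, the KubeAPIErrorsHigh expression quoted earlier is just an error-rate percentage compared against a 5% threshold. A minimal Python sketch of that arithmetic (all sample rates below are made up for illustration):

```python
# Sketch of the KubeAPIErrorsHigh alert arithmetic: the 5xx request rate
# divided by the total request rate, times 100, compared against 5%.
# The sample numbers are hypothetical, not taken from this cluster.

def error_percentage(rates_by_code):
    """rates_by_code: dict mapping HTTP status code -> requests/sec."""
    total = sum(rates_by_code.values())
    errors = sum(r for code, r in rates_by_code.items() if 500 <= code <= 599)
    return errors / total * 100 if total else 0.0

# A label set where every request returns 500 (as in the PATCH example above):
rates = {500: 0.2}
print(error_percentage(rates))            # 100.0 -> alert fires (> 5)

# A mostly healthy mix with the same absolute 5xx rate:
rates = {200: 9.8, 500: 0.2}
print(round(error_percentage(rates), 1))  # 2.0 -> below threshold
```

This also shows why a low-traffic label set (like a single misbehaving client PATCHing CRDs) can hit 100% even though the cluster as a whole is healthy: the alert is computed per label set, not per cluster.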
Sorry, I haven't had time to look into this issue this sprint. I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I haven't had time to look into this issue this sprint (again). I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Actually, I can add some detail to it.
We have correlated some of the errors with log entries of the API.
* {code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="WATCH"}:
E0803 08:34:09.311924 1 watch.go:233] unable to encode watch object: http2: stream closed (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc44b488760), encoder:(*versioning.codec)(0xc4cfa5f9e0), buf:(*bytes.Buffer)(0xc49757d500)})
* {code="500",contentType="application/vnd.kubernetes.protobuf",endpoint="https",job="apiserver",namespace="default",resource="buildconfigs",scope="namespace",service="kubernetes",subresource="instantiate",verb="POST"}
E0801 07:36:12.752566 1 status.go:64] apiserver received an error that is not an metav1.Status: storage.InvalidError{Errs:field.ErrorList{(*field.Error)(0xc480cb2480)}}
* {code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="deploymentconfigs",scope="namespace",service="kubernetes",subresource="log",verb="GET"}
E0803 07:50:17.680321 1 writers.go:74] error encountered while streaming results via websocket: tls: use of closed connection
My deductions from gathering information from various sources:
- etcd 3 has a 2MiB limit for bodies, but the API might still have the old 1MiB limit set for the encoder; if that is the case, there is a mismatch between what can be stored and what can be retrieved
- I don't know if this already applies (I have not read the API-server code yet), but certain errors should map to other HTTP response codes (e.g. client errors to 4xx)
- it would be of great help if:
  - the logger were attached to a request context (as in logr.Logger) and used a generated request UUID (to disambiguate parallel requests)
  - the mapped response code were logged as well
Best regards,
Christian
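The first deduction above can be illustrated with a small hypothetical sketch: if the write path accepts larger bodies than the read-path encoder allows, objects in the gap are stored fine but error on every retrieval. Both limits below are assumptions for illustration, not values read from the apiserver or etcd code:

```python
# Hypothetical illustration of the limit mismatch described above: a store
# that accepts bodies up to 2 MiB while the (assumed stale) read-path
# encoder still enforces 1 MiB. Objects between the two limits can be
# written but never read back, producing persistent 500s on GET/WATCH.
STORE_LIMIT = 2 * 1024 * 1024    # assumed etcd v3 request limit
ENCODER_LIMIT = 1 * 1024 * 1024  # hypothetical old encoder limit

def write_ok(size_bytes):
    return size_bytes <= STORE_LIMIT

def read_ok(size_bytes):
    return size_bytes <= ENCODER_LIMIT

# An object inside the mismatch window:
size = int(1.5 * 1024 * 1024)
print(write_ok(size), read_ok(size))  # True False -> stored but unretrievable
```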
Another customer is facing this issue after upgrading to OCP v4.5.6. Any thoughts?

Hey guys, any info on this issue in OCP 4.5? We have a couple of customers facing it. Thanks.

I haven't had time to look into this issue this sprint (again, sorry). I'm adding the UpcomingSprint label. I will re-evaluate working on this issue in the upcoming sprint.

"http2: stream closed" has a high chance of being either a networking issue or the client dying.

Hi Stefan,
as clearly stated:
>> I don't know if this already applies (I have not read the API-server code yet), but certain errors should map to other HTTP response codes (e.g. client errors to 4xx)
If an HTTP stream is closed by the client side, this should IMHO not map to an erroneous response code and/or increase the error counts in the metrics.
Otherwise the metrics are rendered useless, since every closed HTTP stream would increase the error counter on the server side, even though the error is on the client side.
Of course, the error might be caused by a network problem. But since we cannot distinguish a network problem from a simple connection cancellation at this layer,
it might not be wise to count this as an error.
Best regards,
Heike
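The counting policy argued for above could be sketched roughly like this (not the actual apiserver code; the error strings are taken from the log lines quoted earlier in this bug, and the classification set is a simplifying assumption):

```python
# Sketch of a metrics-counting policy that excludes client-initiated
# disconnects from the server-side error counter. The set of "client-side"
# error strings is an assumption based on the logs quoted in this bug.
CLIENT_SIDE_ERRORS = {
    "http2: stream closed",
    "tls: use of closed connection",
}

def should_count_as_server_error(status_code, error_message):
    """Only count genuine 5xx server faults, not client disconnects."""
    if status_code < 500:
        return False
    return error_message not in CLIENT_SIDE_ERRORS

# A watch torn down by the client would not inflate the 5xx counter:
print(should_count_as_server_error(500, "http2: stream closed"))          # False
# A real backend failure still would:
print(should_count_as_server_error(500, "etcdserver: request timed out")) # True
```

The trade-off, as noted above, is that a genuine network fault between client and server becomes invisible to this counter, since it is indistinguishable from a deliberate cancellation at this layer.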
Description of problem:

The customer reports `KubeAPIErrorsHigh` firing daily, at random times, on one OpenShift Container Platform cluster (3.11.98-1.git.0.0cbaff3). At some point in time the environment recovers, but so far we have no trace of where the problem is coming from. The OpenShift Container Platform API, controller and `etcd` logs look mostly clean (meaning we see no correlation between events in there and the alert firing). At the system level nothing obvious was found either, so we are wondering where/why the HTTP 500 errors are reported:

~~~
apiserver_request_count{client="Browser",code="500",contentType="application/json",resource="deploymentconfigs",scope="namespace",subresource="log",verb="GET"} 7
apiserver_request_count{client="Browser",code="500",contentType="application/json",resource="resourcequotas",scope="namespace",subresource="",verb="LIST"} 1
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/json",resource="nodes",scope="cluster",subresource="",verb="GET"} 53
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="GET"} 13
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="nodes",scope="cluster",subresource="",verb="GET"} 12
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="persistentvolumeclaims",scope="namespace",subresource="",verb="GET"} 1
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="namespace",subresource="",verb="GET"} 75
apiserver_request_count{client="oc-3.11.98/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/json",resource="deploymentconfigs",scope="namespace",subresource="",verb="PATCH"} 3
~~~

Version-Release number of selected component (if applicable):
- 3.11.98

How reproducible:
- N/A

Steps to Reproduce:
1. N/A - only seen in one OpenShift Container Platform cluster at one customer

Actual results:
`KubeAPIErrorsHigh` firing due to HTTP 500 errors reported in metrics

Expected results:
It would be nice to have a runbook to help investigate when `KubeAPIErrorsHigh` is firing; overall, the alert should not trigger in a healthy and stable environment.

Additional info:
The OpenShift Container Platform cluster is functioning properly and no impact or problem has been found so far, so we need to understand why `KubeAPIErrorsHigh` is firing and how to prevent it going forward.
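When triaging a dump like the one above, it helps to aggregate the `code="500"` samples per client to see who dominates the errors. A minimal sketch that parses Prometheus exposition-format lines (the sample data is an abbreviated, hypothetical version of the dump above):

```python
import re

# Abbreviated, hypothetical sample in Prometheus exposition format:
SAMPLE = '''\
apiserver_request_count{client="Browser",code="500",resource="deploymentconfigs",verb="GET"} 7
apiserver_request_count{client="Browser",code="500",resource="resourcequotas",verb="LIST"} 1
apiserver_request_count{client="hyperkube/v1.11.0",code="500",resource="nodes",verb="GET"} 53
'''

def count_500s_per_client(text):
    """Sum the values of all code="500" samples, grouped by client label."""
    totals = {}
    for m in re.finditer(r'\{([^}]*)\}\s+(\d+)', text):
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group(1)))
        if labels.get("code") == "500":
            client = labels.get("client", "unknown")
            totals[client] = totals.get(client, 0) + int(m.group(2))
    return totals

print(count_500s_per_client(SAMPLE))  # {'Browser': 8, 'hyperkube/v1.11.0': 53}
```

Note that `apiserver_request_count` is a counter, so the absolute values only matter relative to the scrape window; the alert itself works on `rate()` over 5 minutes.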