Description of problem:

A customer is reporting `KubeAPIErrorsHigh` firing on a daily basis on one OpenShift Container Platform cluster (3.11.98-1.git.0.0cbaff3). The timing is random and at some point the environment recovers, but so far we have no trace of where the problem is coming from. The OpenShift Container Platform API, controller, and `etcd` logs look mostly clear (meaning we see no correlation between events in there and the alert firing). At the system level nothing obvious was found either, so we are wondering where and why the HTTP 500 errors are reported:

~~~
apiserver_request_count{client="Browser",code="500",contentType="application/json",resource="deploymentconfigs",scope="namespace",subresource="log",verb="GET"} 7
apiserver_request_count{client="Browser",code="500",contentType="application/json",resource="resourcequotas",scope="namespace",subresource="",verb="LIST"} 1
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/json",resource="nodes",scope="cluster",subresource="",verb="GET"} 53
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="GET"} 13
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="nodes",scope="cluster",subresource="",verb="GET"} 12
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="persistentvolumeclaims",scope="namespace",subresource="",verb="GET"} 1
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="namespace",subresource="",verb="GET"} 75
apiserver_request_count{client="oc-3.11.98/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/json",resource="deploymentconfigs",scope="namespace",subresource="",verb="PATCH"} 3
~~~

Version-Release number of selected component (if applicable):
- 3.11.98

How reproducible:
- N/A

Steps to Reproduce:
1. N/A - only seen in one OpenShift Container Platform cluster at one customer

Actual results:
`KubeAPIErrorsHigh` firing due to HTTP 500 errors reported in metrics

Expected results:
It would be nice to have a runbook to help investigate when `KubeAPIErrorsHigh` is firing; overall, the alert should not trigger in a healthy and stable environment.

Additional info:
The OpenShift Container Platform cluster is functioning properly and no impact or problem has been found so far. We need to understand why `KubeAPIErrorsHigh` is firing and how to prevent it going forward.
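For context on how the counters above feed the alert: `KubeAPIErrorsHigh` is driven by the ratio of 5xx request rates to total request rates over `apiserver_request_count`. A minimal sketch of that computation over exposition-format samples; the sample values here are made up for illustration and are not from a real cluster:

```python
import re

# Hypothetical counter samples in Prometheus exposition format
# (illustrative values only).
SAMPLES = """\
apiserver_request_count{code="200",resource="pods",verb="GET"} 9500
apiserver_request_count{code="500",resource="nodes",verb="GET"} 53
apiserver_request_count{code="500",resource="secrets",verb="GET"} 75
apiserver_request_count{code="404",resource="configmaps",verb="GET"} 372
"""

def error_ratio(text: str) -> float:
    """Fraction of requests with a 5xx status code, mirroring the alert's
    sum(rate(...{code=~"5.."})) / sum(rate(...)) shape (rate() collapses
    to the raw count here because we only have one sample per series)."""
    total = errors = 0.0
    for line in text.splitlines():
        m = re.match(r'\w+\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)', line)
        if not m:
            continue
        value = float(m.group("value"))
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels")))
        total += value
        if labels.get("code", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0

print(f"5xx ratio: {error_ratio(SAMPLES):.2%}")  # → 5xx ratio: 1.28%
```

With these made-up numbers the ratio stays below the 5% threshold; in the customer's case the same computation crosses it intermittently.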
Hi there, I'm getting the same issue here: `KubeAPIErrorsHigh` with 100% of the requests, but the cluster is fully functional (etcd and the Kubernetes API are working) and there are no related errors in the logs. The only thing I found in my case was these log entries, which appear repeatedly at times:

~~~
atomic-openshift-node[515]: I0108 00:15:31.275274 515 prober.go:111] Readiness probe for "apiserver-tmn7j_kube-service-catalog(5f772144-822e-11e9-b13c-00505697c84d):apiserver" failed (failure): Get https://192.168.0.1:6443/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
~~~

That is the apiserver from the service catalog, not the Kubernetes API, so I don't know if this is related. Do you need any more information to investigate further?
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
I am facing a similar issue for one of my clients. The summary of the issue and the alert labels are below. Can you please confirm whether you have a solution?

View in AlertManager [2]

Resolved

~~~
Labels:
alertname = KubeAPIErrorsHigh
client = Swagger-Codegen/8.0.1/python
cluster = np1
code = 500
contentType = application/json
endpoint = https
job = apiserver
namespace = default
prometheus = openshift-monitoring/k8s
resource = customresourcedefinitions
scope = cluster
service = kubernetes
severity = critical
verb = PATCH
Annotations:
message = API server is erroring for 100% of requests.
~~~

Source

~~~
Labels:
(same as above, but with severity = warning)
Annotations:
message = API server is erroring for 100% of requests.
~~~
@Achyudanand Singh: do you have any update on this? We are facing exactly the same issue after installing Cluster Application Migration (CAM) in the openshift-migration project on an OCP 3.11.200 cluster. It is unclear to us which component (a CRD?) is the root cause of these 500s. The API logs don't show anything related (DEBUG_LOGLEVEL=6).

Query:

~~~
sum without(instance, pod) (rate(apiserver_request_count{code=~"^(?:5..)$",job="apiserver"}[5m])) / sum without(instance, pod) (rate(apiserver_request_count{job="apiserver"}[5m])) * 100 > 5
~~~

Returns:

~~~
{client="Swagger-Codegen/8.0.1/python",code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="customresourcedefinitions",scope="cluster",service="kubernetes",verb="PATCH"}
~~~

Alert:

~~~
alertname="KubeAPIErrorsHigh" client="Swagger-Codegen/8.0.1/python" code="500" contentType="application/json" endpoint="https" job="apiserver" namespace="default" resource="customresourcedefinitions" scope="cluster" service="kubernetes" severity="critical" verb="PATCH"
~~~

CRDs created at the point this alert appeared:

~~~
$ oc get crd | grep '06-30'
backups.velero.io                             2020-06-30T14:04:32Z
backupstoragelocations.velero.io              2020-06-30T14:04:33Z
deletebackuprequests.velero.io                2020-06-30T14:04:33Z
downloadrequests.velero.io                    2020-06-30T14:04:33Z
migclusters.migration.openshift.io            2020-06-30T14:04:38Z
mighooks.migration.openshift.io               2020-06-30T14:04:41Z
migmigrations.migration.openshift.io          2020-06-30T14:04:39Z
migplans.migration.openshift.io               2020-06-30T14:04:40Z
migrationcontrollers.migration.openshift.io   2020-06-30T14:03:11Z
migstorages.migration.openshift.io            2020-06-30T14:04:42Z
podvolumebackups.velero.io                    2020-06-30T14:04:33Z
podvolumerestores.velero.io                   2020-06-30T14:04:33Z
resticrepositories.velero.io                  2020-06-30T14:04:33Z
restores.velero.io                            2020-06-30T14:04:33Z
schedules.velero.io                           2020-06-30T14:04:33Z
serverstatusrequests.velero.io                2020-06-30T14:04:33Z
volumesnapshotlocations.velero.io             2020-06-30T14:04:33Z
~~~
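When hunting for the component behind the 500s, it helps to aggregate the error series by client, resource, and verb rather than eyeballing raw query output; the dominant series usually points at the misbehaving client (e.g. an operator PATCHing CRDs in a loop). A small sketch with illustrative, made-up series:

```python
from collections import Counter

# (client, resource, verb, count) tuples as they might be extracted from
# a Prometheus query result over apiserver_request_count{code="500"}.
# These values are illustrative, not real data.
series = [
    ("Swagger-Codegen/8.0.1/python", "customresourcedefinitions", "PATCH", 41),
    ("hyperkube/v1.11.0", "nodes", "GET", 53),
    ("hyperkube/v1.11.0", "secrets", "GET", 75),
]

by_key = Counter()
for client, resource, verb, count in series:
    by_key[(client, resource, verb)] += count

# Rank offenders: the top entry is the client/resource pair to
# investigate first.
for key, count in by_key.most_common():
    print(*key, count)
```

In the CAM case above, such a ranking would immediately surface the `Swagger-Codegen/8.0.1/python` client PATCHing `customresourcedefinitions`.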
Sorry, I haven't had time to look into this issue this sprint. I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
I haven't had time to look into this issue this sprint (again). I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Actually, I can add some detail. We have correlated some of the errors with log entries of the API server:

* `{code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="WATCH"}`:

~~~
E0803 08:34:09.311924 1 watch.go:233] unable to encode watch object: http2: stream closed (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc44b488760), encoder:(*versioning.codec)(0xc4cfa5f9e0), buf:(*bytes.Buffer)(0xc49757d500)})
~~~

* `{code="500",contentType="application/vnd.kubernetes.protobuf",endpoint="https",job="apiserver",namespace="default",resource="buildconfigs",scope="namespace",service="kubernetes",subresource="instantiate",verb="POST"}`:

~~~
E0801 07:36:12.752566 1 status.go:64] apiserver received an error that is not an metav1.Status: storage.InvalidError{Errs:field.ErrorList{(*field.Error)(0xc480cb2480)}}
~~~

* `{code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="deploymentconfigs",scope="namespace",service="kubernetes",subresource="log",verb="GET"}`:

~~~
E0803 07:50:17.680321 1 writers.go:74] error encountered while streaming results via websocket: tls: use of closed connection
~~~

My deductions from gathering information from various sources:

- etcd 3 has a 2 MiB limit for request bodies, while the API server might still have the old 1 MiB limit set for the encoder; if that is the case, there is a mismatch between what can be stored and what can be retrieved.
- I don't know if this already applies (I have not read the API server code yet :D), but certain errors should map to other HTTP response codes (e.g. client errors, 4xx).
- It would be of great help if:
  - the logger were bound to a context (as in logr.Logger) and used a generated per-request UUID (parallel threading)
  - the mapped response code were logged as well

Best regards,
Christian
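On the size-limit theory: a quick way to rule it out client-side is to measure the serialized size of the objects involved before they are sent. A rough sketch; the 1 MiB figure is taken as an assumption here, so check your cluster's actual request-size configuration rather than trusting this constant:

```python
import json

# Assumed limit for illustration; verify against your apiserver/etcd
# configuration (e.g. etcd's --max-request-bytes) before relying on it.
LIMIT_BYTES = 1 * 1024 * 1024

def oversized(obj: dict, limit: int = LIMIT_BYTES) -> bool:
    """Return True if the JSON-serialized object exceeds the limit."""
    return len(json.dumps(obj).encode("utf-8")) > limit

# A ConfigMap-shaped object with a deliberately large data payload.
cm = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "big-cm", "namespace": "default"},
    "data": {"blob": "x" * (2 * 1024 * 1024)},
}
print(oversized(cm))  # → True: over a 1 MiB limit
```

If objects in the failing namespaces are all comfortably under the limit, the store/retrieve mismatch hypothesis can be set aside.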
Another customer is facing this issue after upgrading to OCP v4.5.6. Any thoughts?
Hey guys, any info on this issue in OCP 4.5? We have a couple of customers facing it. Thanks.
I haven't had time to look into this issue this sprint (again, sorry). I'm adding the UpcomingSprint label. I will re-evaluate working on this issue in the upcoming sprint.
"http2: stream closed" is most likely either a networking issue or the client dying.
Hi Stefan,

as clearly stated:

>> IDK if this already applies (I have not read the API-server code, yet :D), but certain errors should map to certain other HTTP response code (e.g. client errors 4xx)

If an HTTP stream is closed by the client side, this should IMHO not map to an erroneous response code and/or an increase of the error counts in the metrics. Otherwise the metrics are rendered useless, since every closed HTTP stream would increase the error counter on the server side even though the error is on the client side. Of course, the error might be caused by a network problem. But since at this layer we cannot distinguish a network problem from a simple connection cancellation, it might not be wise to count this as a server error.

Best regards,
Heike
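To make the point concrete: if the alert's ratio could exclude series whose 500s stem from client disconnects (e.g. WATCH streams closed by the peer), it would stop penalizing the server for client behavior. A sketch of such a filtered ratio; which verbs count as "client-driven" is an assumption for illustration, not an established classification:

```python
# Each entry: (verb, code, count). Values are illustrative, not real.
series = [
    ("GET", "200", 900),
    ("WATCH", "500", 80),   # e.g. "http2: stream closed" -- client went away
    ("PATCH", "500", 20),
]

def error_ratio(series, exclude_verbs=frozenset()):
    """5xx/total ratio, optionally ignoring verbs whose 500s are
    typically caused by the client closing the connection."""
    total = errors = 0
    for verb, code, count in series:
        if verb in exclude_verbs:
            continue
        total += count
        if code.startswith("5"):
            errors += count
    return errors / total if total else 0.0

print(f"naive:    {error_ratio(series):.1%}")             # → naive:    10.0%
print(f"filtered: {error_ratio(series, {'WATCH'}):.1%}")  # → filtered: 2.2%
```

The filtered number reflects server-side failures only; the naive one fires the alert even when most "errors" are clients walking away mid-stream, which is exactly the failure mode described in this thread.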