Bug 1748434 - KubeAPIErrorsHigh firing on a daily basis but at random times [NEEDINFO]
Summary: KubeAPIErrorsHigh firing on a daily basis but at random times
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Lukasz Szaszkiewicz
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-03 14:57 UTC by Simon Reber
Modified: 2024-03-25 15:24 UTC
CC List: 45 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-25 06:48:21 UTC
Target Upstream Version:
Embargoed:
erich: needinfo?


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4415611 0 None None None 2019-09-15 18:36:27 UTC

Description Simon Reber 2019-09-03 14:57:56 UTC
Description of problem:

The customer is reporting `KubeAPIErrorsHigh` firing on a daily basis on one OpenShift Container Platform cluster (3.11.98-1.git.0.0cbaff3).

The timing is random, and at some point the environment recovers on its own. So far we have no trace of where the problem is coming from. The OpenShift Container Platform API, Controllers and `etcd` logs look mostly clean (meaning we see no correlation between events in those logs and the alert firing).

At the system level nothing obvious was found either, so we are wondering where/why the HTTP 500 errors are being reported:

~~~
apiserver_request_count{client="Browser",code="500",contentType="application/json",resource="deploymentconfigs",scope="namespace",subresource="log",verb="GET"} 7
apiserver_request_count{client="Browser",code="500",contentType="application/json",resource="resourcequotas",scope="namespace",subresource="",verb="LIST"} 1
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/json",resource="nodes",scope="cluster",subresource="",verb="GET"} 53
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="GET"} 13
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="nodes",scope="cluster",subresource="",verb="GET"} 12
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="persistentvolumeclaims",scope="namespace",subresource="",verb="GET"} 1
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="namespace",subresource="",verb="GET"} 75
apiserver_request_count{client="oc-3.11.98/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="500",contentType="application/json",resource="deploymentconfigs",scope="namespace",subresource="",verb="PATCH"} 3
~~~
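
To narrow down where these 500s come from, a per-client breakdown of the 5xx rate can be graphed in Prometheus. This is only a sketch: it assumes the cluster Prometheus scrapes `apiserver_request_count` with the `job="apiserver"` label used by the alert, and it uses the same label names as in the sample above.

~~~
# Rate of 5xx responses over the last 5 minutes, grouped by the labels
# identifying the caller and the resource involved.
sum by (client, resource, subresource, verb, code) (
  rate(apiserver_request_count{job="apiserver", code=~"5.."}[5m])
)
~~~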

Version-Release number of selected component (if applicable):
 - 3.11.98

How reproducible:
 - N/A

Steps to Reproduce:
1. N/A - only seen in one OpenShift Container Platform cluster at one customer

Actual results:

`KubeAPIErrorsHigh` firing due to HTTP 500 errors reported in metrics

Expected results:

It would be nice to have a runbook to help investigate when `KubeAPIErrorsHigh` is firing, and overall the alert should not be triggered in a healthy and stable environment.

Additional info:

The OpenShift Container Platform cluster is functioning properly and no impact or problem has been found so far, so we need to understand why `KubeAPIErrorsHigh` is firing and how to prevent it going forward.

Comment 17 Hugo Cisneiros (Eitch) 2020-01-08 22:01:45 UTC
Hi there,

I'm getting the same issue here: KubeAPIErrorsHigh firing with 100% of requests erroring, but the cluster is fully functional (etcd and the Kubernetes API are working) and there are no related errors in the logs.

The only thing I found in my case was this log entry, which appears repeatedly at certain times:

atomic-openshift-node[515]: I0108 00:15:31.275274     515 prober.go:111] Readiness probe for "apiserver-tmn7j_kube-service-catalog(5f772144-822e-11e9-b13c-00505697c84d):apiserver" failed (failure): Get https://192.168.0.1:6443/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This is the apiserver from the service catalog, not the Kubernetes API... so I don't know if this is related.

Do you need any more information to investigate this further?

Comment 22 Lukasz Szaszkiewicz 2020-05-20 09:49:12 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level.
I will revisit this bug next sprint.

Comment 23 Achyudanand Singh 2020-06-12 19:11:03 UTC
I am facing a similar issue for one of my clients. The summary of the issue and logs are below. Can you please confirm whether you have any solution?


View in AlertManager 

[2] Resolved 
Labels
alertname = KubeAPIErrorsHigh
client = Swagger-Codegen/8.0.1/python
cluster = np1
code = 500
contentType = application/json
endpoint = https
job = apiserver
namespace = default
prometheus = openshift-monitoring/k8s
resource = customresourcedefinitions
scope = cluster
service = kubernetes
severity = critical
verb = PATCH 
Annotations
message = API server is erroring for 100% of requests.
Source

Labels
alertname = KubeAPIErrorsHigh
client = Swagger-Codegen/8.0.1/python
cluster = np1
code = 500
contentType = application/json
endpoint = https
job = apiserver
namespace = default
prometheus = openshift-monitoring/k8s
resource = customresourcedefinitions
scope = cluster
service = kubernetes
severity = warning
verb = PATCH 
Annotations
message = API server is erroring for 100% of requests.
Source

Comment 24 Lukasz Szaszkiewicz 2020-06-18 10:14:30 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 27 trumbaut 2020-07-02 11:09:53 UTC
@Achyudanand Singh: do you have any update on this? We are facing exactly the same issue after installing Cluster Application Migration (CAM) in the openshift-migration project on an OCP 3.11.200 cluster. It is unclear to us which component (CRD?) is the root cause of these 500s. The API logs don't show anything related (DEBUG_LOGLEVEL=6).

Query:
sum without(instance, pod) (rate(apiserver_request_count{code=~"^(?:5..)$",job="apiserver"}[5m])) / sum without(instance, pod) (rate(apiserver_request_count{job="apiserver"}[5m])) * 100 > 5

Returns:
{client="Swagger-Codegen/8.0.1/python",code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="customresourcedefinitions",scope="cluster",service="kubernetes",verb="PATCH"}

Alert:
alertname="KubeAPIErrorsHigh" client="Swagger-Codegen/8.0.1/python" code="500" contentType="application/json" endpoint="https" job="apiserver" namespace="default" resource="customresourcedefinitions" scope="cluster" service="kubernetes" severity="critical" verb="PATCH"

CRDs created at the point this alert appeared:
$ oc get crd | grep '06-30'
backups.velero.io                             2020-06-30T14:04:32Z
backupstoragelocations.velero.io              2020-06-30T14:04:33Z
deletebackuprequests.velero.io                2020-06-30T14:04:33Z
downloadrequests.velero.io                    2020-06-30T14:04:33Z
migclusters.migration.openshift.io            2020-06-30T14:04:38Z
mighooks.migration.openshift.io               2020-06-30T14:04:41Z
migmigrations.migration.openshift.io          2020-06-30T14:04:39Z
migplans.migration.openshift.io               2020-06-30T14:04:40Z
migrationcontrollers.migration.openshift.io   2020-06-30T14:03:11Z
migstorages.migration.openshift.io            2020-06-30T14:04:42Z
podvolumebackups.velero.io                    2020-06-30T14:04:33Z
podvolumerestores.velero.io                   2020-06-30T14:04:33Z
resticrepositories.velero.io                  2020-06-30T14:04:33Z
restores.velero.io                            2020-06-30T14:04:33Z
schedules.velero.io                           2020-06-30T14:04:33Z
serverstatusrequests.velero.io                2020-06-30T14:04:33Z
volumesnapshotlocations.velero.io             2020-06-30T14:04:33Z
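
Since the firing series carries client="Swagger-Codegen/8.0.1/python", resource="customresourcedefinitions" and verb="PATCH", a per-client error ratio derived from the alert expression above can confirm whether that single client accounts for all of the 500s. Note that because `sum without(instance, pod)` keeps the `code` label on both sides of the division in the original expression, each 5xx series is effectively divided by itself, which would explain why the alert always reports "100% of requests" even when the absolute error rate is small. The following is only a sketch; label values are taken from the firing alert:

~~~
# Fraction of PATCH requests against CRDs that return 5xx, per client.
sum by (client) (
  rate(apiserver_request_count{job="apiserver", code=~"5..",
    resource="customresourcedefinitions", verb="PATCH"}[5m])
)
/
sum by (client) (
  rate(apiserver_request_count{job="apiserver",
    resource="customresourcedefinitions", verb="PATCH"}[5m])
)
~~~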

Comment 28 Lukasz Szaszkiewicz 2020-07-10 08:01:37 UTC
Sorry, I haven't had time to look into this issue this sprint. I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 29 Lukasz Szaszkiewicz 2020-08-03 11:13:51 UTC
I haven't had time to look into this issue this sprint (again). I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 30 Christian Heike 2020-08-03 14:01:11 UTC
Actually, I can add some detail to it.

We have correlated some of the errors with log entries of the API.

* {code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="WATCH"}:
E0803 08:34:09.311924       1 watch.go:233] unable to encode watch object: http2: stream closed (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc44b488760), encoder:(*versioning.codec)(0xc4cfa5f9e0), buf:(*bytes.Buffer)(0xc49757d500)})

* {code="500",contentType="application/vnd.kubernetes.protobuf",endpoint="https",job="apiserver",namespace="default",resource="buildconfigs",scope="namespace",service="kubernetes",subresource="instantiate",verb="POST"}
E0801 07:36:12.752566       1 status.go:64] apiserver received an error that is not an metav1.Status: storage.InvalidError{Errs:field.ErrorList{(*field.Error)(0xc480cb2480)}}

* {code="500",contentType="application/json",endpoint="https",job="apiserver",namespace="default",resource="deploymentconfigs",scope="namespace",service="kubernetes",subresource="log",verb="GET"}
E0803 07:50:17.680321       1 writers.go:74] error encountered while streaming results via websocket: tls: use of closed connection
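
To make the correlation with the log timestamps easier to see, the rate of just these series can be graphed. This is a sketch restricted to the resources quoted above, assuming the same apiserver_request_count metric as in the earlier comments (newer releases rename it to apiserver_request_total):

~~~
# 500s for the resource/verb combinations correlated with the log lines above.
sum by (resource, subresource, verb) (
  rate(apiserver_request_count{job="apiserver", code="500",
    resource=~"configmaps|buildconfigs|deploymentconfigs"}[5m])
)
~~~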


My deductions from gathering information from various sources:
- etcd 3 has a 2 MiB limit for request bodies, while the API server might still have the old 1 MiB limit set for the encoder; if that is the case, there is a mismatch between what can be stored and what can be retrieved
- IDK if this already applies (I have not read the API-server code yet :D), but certain errors should map to other HTTP response codes (e.g. client errors, 4xx)
- it would be of great help if:
  - the logger were attached to a context (as in logr.Logger) and used a generated request UUID (for parallel requests)
  - the mapped response code were logged as well

Best regards,
Christian

Comment 34 Fatima 2020-08-24 09:03:57 UTC
Another customer is facing this issue after upgrading to OCP v4.5.6

Any thoughts?

Comment 36 Fatima 2020-08-25 10:20:58 UTC
Hey guys, 

Any info on this issue in OCP 4.5?

We have a couple of customers facing it.

Thanks.

Comment 37 Lukasz Szaszkiewicz 2020-08-25 14:28:36 UTC
I haven't had time to look into this issue this sprint (again, sorry). I'm adding the UpcomingSprint label. I will re-evaluate working on this issue in the upcoming sprint.

Comment 43 Stefan Schimanski 2020-09-01 13:36:05 UTC
"http2: stream closed" has a high chance of either be a networking issue or the client to die.

Comment 51 Christian Heike 2020-09-09 11:08:57 UTC
Hi Stefan,

as clearly stated:
>> IDK if this already applies (I have not read the API-server code yet :D), but certain errors should map to other HTTP response codes (e.g. client errors, 4xx)

If an HTTP stream is closed by the client side, this should IMHO not map to an error response code and/or an increase of the error counts in the metrics.
Otherwise the metrics are rendered useless, since every closed HTTP stream would increase the error counter on the server side even though the error is on the client side.
Of course, the error might be caused by a network problem. But since we cannot distinguish a network problem from a simple connection cancellation at this layer, it might not be wise to count it as an error.
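
One possible mitigation on the alerting side, if client-initiated disconnects dominate, is a variant of the expression from comment 27 that ignores WATCH requests. This is only a sketch, not an official rule change:

~~~
# Same ratio as KubeAPIErrorsHigh (comment 27), but excluding long-lived
# WATCH requests, which clients frequently close and which then get counted
# as 500s on the server side.
sum without(instance, pod) (rate(apiserver_request_count{job="apiserver",
  code=~"^(?:5..)$", verb!="WATCH"}[5m]))
/
sum without(instance, pod) (rate(apiserver_request_count{job="apiserver",
  verb!="WATCH"}[5m])) * 100 > 5
~~~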

Best regards,
Heike

