Description of problem:
We have come across another cluster where an issue similar to https://bugzilla.redhat.com/show_bug.cgi?id=1748434 has been observed. Since that BZ was closed, we thought of opening a new one to understand our course of action for fixing this on v3.11.

---------
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0/system:serviceaccount:openshift-infra:unidling-controller",cluster="abc",code="500",endpoint="https",instance="<lb_ip>:443",job="apiserver",namespace="default",resource="events",scope="cluster",service="kubernetes",verb="LIST"} 2772

apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0/system:serviceaccount:openshift-infra:unidling-controller",cluster="abc",code="504",contentType="cluster",endpoint="https",instance="lb_ip:443",job="apiserver",namespace="default",resource="events",service="kubernetes",verb="LIST"} 2783

apiserver_request_count{client="felix-stats/v0.0.0 (linux/amd64) kubernetes/$Format",cluster="abc",code="500",contentType="application/json",endpoint="https",instance="lb_ip:443",job="apiserver",namespace="default",resource="events",scope="cluster",service="kubernetes",verb="LIST"} 1
---------

Also, the errors below were seen in the API logs:

--------
E1012 16:29:41.699881       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1012 16:29:41.699929       1 writers.go:149] apiserver was unable to write a JSON response: http: Handler timeout
E1012 16:29:41.701173       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
--------

Version-Release number of selected component (if applicable):
v3.11.146

How reproducible:
OpenshiftAPIErrorsHigh warnings are still noticed.

Actual results:
KubeAPIErrorsHigh firing due to HTTP 500 and 504 errors reported in metrics.

Expected results:
The alert should not be triggered, especially when the API is working fine.
Looks like the unidling-controller lists events over the whole cluster: /api/v1/events. This times out after a minute. The controller should not do that; it is expected to fail eventually on big clusters.
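For context, the safer pattern for listing a very large collection is chunked (paginated) LIST requests using the API's limit/continue semantics, rather than one full-cluster LIST that can time out. The following is a minimal sketch of that continue-token loop; list_in_chunks and the fake backend are illustrative names, not the actual unidling-controller code.

```python
def list_in_chunks(list_page, limit=500):
    """Collect all items via paginated LIST calls.

    list_page(limit, continue_token) -> (items, next_token) mimics the
    Kubernetes LIST API's limit/continue semantics (hypothetical helper).
    """
    items, token = [], None
    while True:
        page, token = list_page(limit, token)
        items.extend(page)
        if not token:  # an empty continue token means the list is complete
            return items

# Fake backend standing in for the apiserver: 1200 events, served 500 at a time.
events = [f"event-{i}" for i in range(1200)]

def fake_list_page(limit, token):
    start = int(token or 0)
    end = min(start + limit, len(events))
    next_token = str(end) if end < len(events) else None
    return events[start:end], next_token

all_events = list_in_chunks(fake_list_page)
print(len(all_events))  # 1200
```

Each individual request stays small and fast, so a single slow full LIST never has to complete within the one-minute timeout.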
Is this a persistent issue? Events are supposed to be pruned approximately an hour after they fire. When deployments are stable, the event count on the cluster should be manageable for the apiserver. Are there particular events that are firing repeatedly?
Sharing the latest observations from the customer:

---
There was a job which had been failing for the last 12 days and generated 6230 job objects while failing. It was also permanently consuming all resources within their quota in the sandbox namespace, as the job in question was requesting 2GB of RAM for each job/pod (not very efficient when only curl was being executed). The CronJob itself was executed every minute, and each execution left behind 1 job object plus 4 crash-looping pods that filled up the 8 GB RAM quota for the namespace. All of the above was causing unwanted and unneeded stress on our OpenShift platform. "oc get events" for this namespace was stuck for about 2 minutes before returning results because of how huge the query was with all those quota error messages; there were thousands of events with errors.
---

Then I checked the SOS report again for the controller and etcd logs to see if they recorded any failures due to this faulty job.

---
E1020 07:57:07.483445       1 job_controller.go:789] pods "pi-1603116120-kmlgv" is forbidden: exceeded quota: quota-svc-its-sandbox4-test, requested: limits.cpu=500m,limits.memory=2000Mi, used: limits.cpu=2,limits.memory=8000Mi, limited: limits.cpu=2,limits.memory=8Gi
---

# cat master-logs_controllers_controllers | grep -i quota-svc-its-sandbox4-test | wc -l
75897

$ etcd_logs
---
2020-10-17 01:58:00.030504 W | etcdserver: request "header:<ID:16547760597330720220 username:\"abc.com\" auth_revision:1 > txn:<compare:<target:MOD key:\"/openshift.io/clusterresourcequotas/quota-svc-its-sandbox4-test\" mod_revision:1042719945 > success:<request_put:<key:\"/openshift.io/clusterresourcequotas/quota-svc-its-sandbox4-test\" value_size:1760 >> failure:<>>" with result "size:22" took too long (228.984198ms) to execute
---

They have deleted the job, and the API error alerts have now stopped. We will keep monitoring the situation. Could such a failing job impact api<->etcd performance and end up triggering these alerts?
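The numbers in the job_controller quota error are internally consistent; a quick sanity check (all values taken from the log line above, the headroom calculation is my own):

```python
# Values from the "exceeded quota" error above.
used_memory_mi = 8000          # limits.memory used: 8000Mi (4 pods x 2000Mi)
limit_memory_gi = 8            # limits.memory limited: 8Gi
requested_memory_mi = 2000     # each new job pod requests 2000Mi

# 8Gi = 8 * 1024Mi = 8192Mi, so 8000Mi used leaves only 192Mi of headroom.
headroom_mi = limit_memory_gi * 1024 - used_memory_mi
print(headroom_mi)                        # 192

# The next 2000Mi pod cannot fit, hence every new pod is forbidden,
# generating a fresh failure (and events) on every CronJob run.
print(requested_memory_mi > headroom_mi)  # True
```

This is why the failure repeated indefinitely: the quota never freed up, so every minute's run was rejected and logged again.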
> Could such failure in the job create an impact on api<->etcd performance which then end up triggering such alerts?

Absolutely. Every one of these job failures was creating multiple events. The unidling controller essentially does an "oc get events" over the whole cluster in each reconciliation, and with sufficiently high event counts that can lead to timeouts or out-of-memory errors. In this case, 1 failing job per minute with 4 crash-looping pods per minute is at least 300 events per hour (perhaps an order of magnitude more). The right way to work around this issue is to determine why there are so many events on the cluster; in this case, it was a CronJob run amok.
After PR #145 and the comments here, removing the "needinfo" flag.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633