Description of problem:
A large number of "Throttling request" messages floods the kube-apiserver operator log in a short time.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-13-221558
OCP install: IPI on AWS

How reproducible:
Always

Steps to Reproduce:
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-745f6658c8-c9ggm | grep 'Throttling request' | wc -l
7030

Actual results:
1. The kube-apiserver operator log is filled with lots of throttling requests, see below,
...
I0514 09:21:22.546693 1 request.go:621] Throttling request took 1.787306428s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-kube-apiserver/configmaps/config
I0514 09:21:23.546696 1 request.go:621] Throttling request took 1.795265499s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-...-187.us-east-2.compute.internal
I0514 09:21:24.746697 1 request.go:621] Throttling request took 1.790598242s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-kube-apiserver/serviceaccounts/localhost-recovery-client
I0514 09:21:25.746707 1 request.go:621] Throttling request took 1.595281175s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-config-managed/secrets?labelSelector=encryption.apiserver.operator.openshift.io%2Fcomponent%3Dopenshift-kube-apiserver
I0514 09:21:26.746713 1 request.go:621] Throttling request took 1.593812045s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=apiserver%3Dtrue
...
This seriously hampers log reading.

Expected results:
Reduce the throttling of requests so they do not pollute the kube-apiserver operator log.

Additional info:
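For triage, one way to gauge how bad the client-side throttling delays get is to pull out the reported durations and sort them; a minimal sketch (the deploy/kube-apiserver-operator log target and the awk field position are assumptions based on the log format above):

$ oc logs -n openshift-kube-apiserver-operator deploy/kube-apiserver-operator \
    | grep 'Throttling request' \
    | awk '{print $8}' | tr -d ',' | sort -n | tail -5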
Discussed with xingxing, this could be a performance issue of etcd. Did a quick check on the etcd pod; it seems reading keys took a long time.

$ oc logs -n openshift-etcd etcd-member-6fmwc-m-0.c.openshift-qe.internal -c etcd-member | grep -n 'took too long' | wc -l
301
...
77:2020-05-14 09:24:41.420117 W | etcdserver: request "header:<ID:2425876772913985041 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/secrets/openshift-kube-controller-manager/csr-signer-4\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/secrets/openshift-kube-controller-manager/csr-signer-4\" value_size:3098 >> failure:<>>" with result "size:16" took too long (215.750159ms) to execute
278:2020-05-14 09:24:41.420141 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca/configmap-cabundle-injector-lock\" " with result "range_response_count:1 size:472" took too long (330.950256ms) to execute
279:2020-05-14 09:24:41.420332 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-scheduler/installer-4-kewang-6fmwc-m-1.c.openshift-qe.internal\" " with result "range_response_count:1 size:1587" took too long (351.36024ms) to execute
280:2020-05-14 09:24:41.420580 W | etcdserver: read-only range request "key:\"/kubernetes.io/deployments/openshift-monitoring/kube-state-metrics\" " with result "range_response_count:1 size:2692" took too long (573.790763ms) to execute
281:2020-05-14 09:24:41.420658 W | etcdserver: read-only range request "key:\"/kubernetes.io/minions/\" range_end:\"/kubernetes.io/minions0\" " with result "range_response_count:5 size:25883" took too long (595.191608ms) to execute
282:2020-05-14 09:24:41.420729 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-sdn/openshift-network-controller\" " with result "range_response_count:1 size:454" took too long (576.390566ms) to execute
...
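To rank the slow etcd requests by their reported latency, something along these lines can be used (a sketch; the sed pattern matches the "took too long (...) to execute" format above, and a plain numeric sort is only meaningful while all durations share the ms unit):

$ oc logs -n openshift-etcd etcd-member-6fmwc-m-0.c.openshift-qe.internal -c etcd-member \
    | sed -n 's/.*took too long (\(.*\)) to execute.*/\1/p' \
    | sort -n | tail -5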
CC: mifiedle
Please ignore Comment 1, it's not from this bug's environment. Discussed with xingxing, this could be a performance issue of etcd. Did a quick check on the etcd pod; it seems reading keys took a long time, see below,

$ etcd_pod=$(oc get pods -n openshift-etcd | grep etcd | head -1 | cut -d " " -f1)
$ oc logs -n openshift-etcd $etcd_pod -c etcd | grep -n 'took too long' | wc -l
423

Pasted a snippet of the log here,
...
681:2020-05-14 03:11:44.599784 W | etcdserver: read-only range request "key:\"/kubernetes.io/services/endpoints/openshift-authentication/oauth-openshift\" " with result "range_response_count:1 size:525" took too long (236.278694ms) to execute
682:2020-05-14 03:11:44.599945 W | etcdserver: read-only range request "key:\"/kubernetes.io/deployments/openshift-monitoring/openshift-state-metrics\" " with result "range_response_count:1 size:4866" took too long (271.452897ms) to execute
683:2020-05-14 03:11:44.600272 W | etcdserver: read-only range request "key:\"/kubernetes.io/cloudcredential.openshift.io/credentialsrequests/openshift-cloud-credential-operator/openshift-image-registry-openstack\" " with result "range_response_count:1 size:1601" took too long (203.700944ms) to execute
684:2020-05-14 03:11:44.600589 W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-kube-apiserver/user-serving-cert-003\" " with result "range_response_count:0 size:6" took too long (235.389736ms) to execute
...

In the openshift-etcd-operator pod, found lots of the following errors,

$ oc logs -n openshift-etcd-operator etcd-operator-ff88476b7-tffgl | grep 'transport is closing' | wc -l
27477
I0514 03:16:58.171427 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
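To check whether that gRPC connection churn is constant or bursty, the log timestamps can be bucketed per minute; a rough sketch (the awk field position is an assumption based on the klog line format above):

$ oc logs -n openshift-etcd-operator etcd-operator-ff88476b7-tffgl \
    | grep 'transport is closing' \
    | awk '{print substr($2, 1, 5)}' | uniq -c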
Hit the bug again. We have one upgrade env, OCP 4.3.24 -> 4.4.7 -> 4.5.0 nightly. This caused bad performance of the system.

oc logs -n openshift-apiserver-operator openshift-apiserver-operator-67d856d6fd-tfpzb
also shows many "Throttling request took 1.787306428s" messages.

When we run the below command line,
while true; do oc get oauthclients.oauth.openshift.io; date; echo; sleep 1; done
it will intermittently hit:
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get oauthclients.oauth.openshift.io)

The authentication operator reports many related errors:
$ oc logs -n openshift-authentication-operator authentication-operator-66c7d588dd-76dtn | grep 'the server is currently unable to handle the request' | wc -l
1637
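Since oauthclients.oauth.openshift.io is served by an aggregated API server, it may help to check which APIService is flapping to Unavailable; a quick sketch (filtering on the AVAILABLE column of the default table output, and v1.oauth.openshift.io being the expected APIService name for that group/version is an assumption):

$ oc get apiservices | grep -v True
$ oc get apiservice v1.oauth.openshift.io -o yaml | grep -A 5 'conditions:'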
Please don't hijack BZs with different topics; start a new BZ instead. Here in comment #7, this is obviously about etcd performance, not about throttling events in some operator log.
Found the right bug, bug 1825219, to track this issue.
Hello team,

Observed the same issue in a 4.6 cluster. Noticed high CPU usage on the nodes. Following is a snippet of the TOP command output and the etcd log on a master node.

1] TOP output
```````````````````````````````````````````````````````````````````````````````````````````````````````````````
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2427 root 20 0 6269864 494828 64316 S 181.8 0.8 848:07.26 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote+
3671 nfsnobo+ 20 0 721912 42556 11236 S 68.2 0.1 86:18.39 /bin/node_exporter --web.listen-address=127.0.0.1:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/host/root --no-collector.wifi --no-collector.+
25168 root 20 0 2672940 420468 30672 S 54.5 0.6 1205:42 /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump --no-supervisor -r /usr/local/share/gems/gems/fluent-plugin-elasticsearch-4.1.1/lib/fluent/plugin/elast+
3073278 core 20 0 71620 6840 4712 R 22.7 0.0 0:00.09 top -bc
1 root 20 0 267676 26908 9052 S 18.2 0.0 176:06.21 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
3072973 core 20 0 101548 17308 8132 S 13.6 0.0 0:00.78 /usr/lib/systemd/systemd --user
1499 openvsw+ 10 -10 1247940 98004 34688 S 9.1 0.1 57:47.21 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openv+
2377 root 20 0 9706628 459056 35280 S 9.1 0.7 81:03.56 /usr/bin/crio --enable-metrics=true --metrics-port=9537
26747 1004800+ 20 0 1216088 101020 53224 S 9.1 0.2 4:15.34 node ./dist/server.js
957 root 20 0 320776 172580 127028 S 4.5 0.3 25:39.76 /usr/lib/systemd/systemd-journald
1368 root 20 0 223872 38252 36444 S 4.5 0.1 0:05.26 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
1408 openvsw+ 10 -10 81796 10256 4748 S 4.5 0.0 4:11.29 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private+
643038 1002660+ 20 0 4639344 209704 63028 S 4.5 0.3 1:17.29 /opt/rh/rh-dotnet21/root/usr/bin/dotnet ./app/publish/PapComprobanteApi.dll
```````````````````````````````````````````````````````````````````````````````````````````````````````````````

2] etcd log
```````````````````````````````````````````````````````````````````````````````````````````````````````````````
W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/dev-cm-emailrecaudador/cm-batch-mailrecaudador-9-2wwm2\" " with result "range_response_count:1 size:5540" took too long (433.202098ms) to execute
2021-04-17 16:36:23.348229 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-lxmaster01\" " with result "range_response_count:1 size:15220" took too long (433.617051ms) to execute
2021-04-17 16:36:23.348472 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/kube-system/certificate-controller\" " with result "range_response_count:1 size:368" took too long (433.980821ms) to execute
2021-04-17 16:36:23.348759 W | etcdserver: read-only range request "key:\"/kubernetes.io/events/dev-prueba-variable/legajos-okd-1-hqp25.1676a5a82c0c66dd\" " with result "range_response_count:1 size:926" took too long (454.855935ms) to execute
2021-04-17 16:36:23.348970 W | etcdserver: read-only range request "key:\"/kubernetes.io/clusterrolebindings/system:node-proxiers\" " with result "range_response_count:1 size:397" took too long (469.170154ms) to execute
```````````````````````````````````````````````````````````````````````````````````````````````````````````````
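To confirm whether the CPU pressure is sustained across the control-plane nodes rather than a momentary spike, the metrics API can be queried directly (assumes cluster monitoring is running so the metrics are available):

$ oc adm top nodes
$ oc adm top pods -n openshift-etcd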
Hi ksalunkh,

I tried the latest 4.6 nightly and gave the cluster some workload; after a while I checked etcd and the cluster and no longer found this issue.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-05-08-154328   True        False         4h25m   Cluster version is 4.6.0-0.nightly-2021-05-08-154328

Added 60 pods in client-server mode,
$ oc get po -A | grep -c clusterbuster
60

If you think this is still an issue, you might open a new bug on the etcd component and attach the must-gather.
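For anyone re-verifying, a per-member count of slow etcd requests can be taken along these lines (the app=etcd label selector is an assumption about the pod labels in the openshift-etcd namespace):

$ for p in $(oc get pods -n openshift-etcd -l app=etcd -o name); do
    echo "$p: $(oc logs -n openshift-etcd $p -c etcd | grep -c 'took too long')"
  done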