1686503 – Many ConfigMaps and Pods slow down cluster, until it becomes unavailable (since 1.12)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Michal Fojtik
QA Contact:	Simon
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-03-07 15:11 UTC by Tomáš Nožička
Modified:	2019-09-10 14:08 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:45:20 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	74412	0	None	None	None	2020-08-18 14:15:20 UTC
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:45:26 UTC

Description Tomáš Nožička 2019-03-07 15:11:56 UTC

https://github.com/kubernetes/kubernetes/issues/74412

---
There are two mitigations with current 1.12/1.13 versions:

1. Start the apiserver with a higher --http2-max-streams-per-connection setting.
2. Start the kubelet with a config file that switches back to the pre-1.12 secret/configmap lookup method: configMapAndSecretChangeDetectionStrategy: "Cache"
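
A minimal sketch of the two mitigations, for illustration only. The file name kubelet-config.yaml is an assumption, and how the file reaches the kubelet (for example through its --config flag) depends on the installation. No value for the apiserver flag is given in this report, so it appears only as a commented placeholder.

```shell
# Sketch only: the file name and location are assumptions, not taken from this bug.
cat > kubelet-config.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Switch back to the pre-1.12 lookup method for ConfigMaps and Secrets.
configMapAndSecretChangeDetectionStrategy: Cache
EOF

# The apiserver mitigation is a command-line flag; the right value is
# deployment-specific, so it is left as a placeholder here:
#   kube-apiserver ... --http2-max-streams-per-connection=<higher value>
```

The kubelet would then be started with this file as its configuration; on a managed platform the equivalent setting is normally applied through the platform's own configuration mechanism rather than by editing files by hand.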

Comment 3 Michal Fojtik 2019-03-08 10:33:20 UTC

Bumping apiserver max streams here: https://github.com/openshift/cluster-kube-apiserver-operator/pull/332

Comment 4 Michal Fojtik 2019-03-08 14:44:57 UTC

Not sure what QE can test here, but both PRs were merged, so the mentioned kube issue should be mitigated for 4.0.
Moving to QE so they can set to VERIFIED.

Comment 7 zhou ying 2019-03-14 09:27:39 UTC

@Tomáš Nožička, @Michal Fojtik, could you please give me some advice on how to verify this issue? Thanks in advance.

Comment 8 Tomáš Nožička 2019-03-14 11:03:53 UTC

Try something like this: https://github.com/kubernetes/kubernetes/issues/74412#issuecomment-471456235

or as described here: https://github.com/kubernetes/kubernetes/issues/74412#issue-413387234

"""
For example, consider a scenario in which I schedule 400 jobs on a single-node cluster, each with its own ConfigMap, which print "Hello World".

On v1.11, it takes about 10 minutes for the cluster to process all jobs. New jobs can be scheduled.
On v1.12 and v1.13, it takes about 60 minutes for the cluster to process all jobs. After this, no new jobs can be scheduled.

What you expected to happen:

I did not expect this scenario to cause my nodes to become unavailable in Kubernetes 1.12 and 1.13, and would have expected the behavior which I observe in 1.11.
"""

Ideally make sure it breaks the cluster without this fix and then try the same with the fix.
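
The scenario quoted above could be approximated with a sketch like the following, which generates one ConfigMap and one Job per iteration into a single manifest file. This is not the script used in this bug or in the upstream issue; the resource names (repro-cm-N, repro-job-N), the busybox image, the COUNT variable, and the jobs.yaml file name are all assumptions made for illustration.

```shell
# gen_job N: print a ConfigMap and a Job that mounts it and prints its content.
# All names and the container image are hypothetical.
gen_job() {
  n="$1"
  cat <<EOF
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: repro-cm-${n}
data:
  message: Hello World
---
apiVersion: batch/v1
kind: Job
metadata:
  name: repro-job-${n}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: hello
        image: busybox
        command: ["cat", "/config/message"]
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: repro-cm-${n}
EOF
}

# Number of jobs to generate; 400 matches the quoted scenario.
COUNT="${COUNT:-400}"

# Write all manifests into one file so they can be applied in a single call.
: > jobs.yaml
i=1
while [ "$i" -le "$COUNT" ]; do
  gen_job "$i" >> jobs.yaml
  i=$((i + 1))
done
```

The generated file can then be submitted with `oc apply -f jobs.yaml` and progress followed with `oc get jobs`, once against a cluster without the fix and once against a cluster with it.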

Comment 9 zhou ying 2019-03-15 03:01:40 UTC

Hi Mike,

Could you please help verify this issue? Thanks in advance.

Comment 11 Simon 2019-03-19 17:58:55 UTC

Retest:

BUILD: 4.0.0-0.nightly-2019-03-19-004004 https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.nightly/release/4.0.0-0.nightly-2019-03-19-004004


oc version --short
Client Version: v4.0.6
Server Version: v1.12.4+befe71b


One worker node:
oc get machineset -n openshift-machine-api 
NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
skordas19-d7s2t-worker-us-east-2a   1         1         1       1           86m


Script: https://github.com/openshift/svt/pull/567


Test run:
Run #1: 300 jobs:
383s (6m 23s) 1.277s/job

Run #2: 300 jobs:
397s (6m 37s) 1.323s/job

Run #3: 300 jobs:
379s (6m 19s) 1.263s/job

Run #4: 600 jobs:
771s (12m 51s) 1.285s/job

Run #5: 600 jobs:
796s (13m 16s) 1.326s/job
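
The per-job figures are elapsed seconds divided by the number of jobs. A small sketch of that arithmetic, using the elapsed times and job counts from the runs above; the helper name per_job is hypothetical, and the last digit can differ from the quoted figures depending on rounding:

```shell
# per_job SECONDS JOBS: print seconds per job, rounded to three decimals.
per_job() {
  awk -v s="$1" -v n="$2" 'BEGIN { printf "%.3f\n", s / n }'
}

# Elapsed seconds and job counts copied from the five runs.
for run in "383 300" "397 300" "379 300" "771 600" "796 600"; do
  set -- $run   # split into $1 = seconds, $2 = jobs
  printf '%s jobs in %ss: %ss/job\n' "$2" "$1" "$(per_job "$1" "$2")"
done
```

Per-job time stays at roughly 1.3s for both 300 and 600 jobs, so doubling the load roughly doubles the total time rather than slowing the cluster down disproportionately.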

Comment 13 errata-xmlrpc 2019-06-04 10:45:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758