| Summary: | Many ConfigMaps and Pods slow down the cluster until it becomes unavailable (since 1.12) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomáš Nožička <tnozicka> |
| Component: | Master | Assignee: | Michal Fojtik <mfojtik> |
| Status: | CLOSED ERRATA | QA Contact: | Simon <skordas> |
| Severity: | high | Priority: | unspecified |
| Version: | 4.1.0 | Target Release: | 4.1.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Last Closed: | 2019-06-04 10:45:20 UTC | Type: | Bug |
| CC: | aos-bugs, jokerman, mifiedle, mmccomas, rkrawitz, sjenning, xxia, yinzhou | | |
Description
Tomáš Nožička
2019-03-07 15:11:56 UTC
Bumping apiserver max streams here: https://github.com/openshift/cluster-kube-apiserver-operator/pull/332

Not sure what QE can test here, but both PRs were merged, so the mentioned kube issue should be mitigated for 4.0. Moving to QE so they can set this to VERIFIED.

@Tomáš Nožička, @Michal Fojtik, could you please give me some advice on how to verify this issue? Thanks in advance.

Try something like this: https://github.com/kubernetes/kubernetes/issues/74412#issuecomment-471456235

Or as described here: https://github.com/kubernetes/kubernetes/issues/74412#issue-413387234

"""
For example, consider a scenario in which I schedule 400 jobs, each with its own ConfigMap, which print "Hello World", on a single-node cluster.

On v1.11, it takes about 10 minutes for the cluster to process all jobs. New jobs can be scheduled.

On v1.12 and v1.13, it takes about 60 minutes for the cluster to process all jobs. After this, no new jobs can be scheduled.

What you expected to happen: I did not expect this scenario to cause my nodes to become unavailable in Kubernetes 1.12 and 1.13, and would have expected the behavior I observe in 1.11.
"""

Ideally, make sure this breaks the cluster without the fix, and then try the same with the fix. (Sketches for spot-checking the apiserver setting, pinning the cluster to one worker, and reproducing the load appear at the end of this report.)

Hi Mike, could you please help verify this issue? Thanks in advance.

Retest:

BUILD: 4.0.0-0.nightly-2019-03-19-004004
https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.nightly/release/4.0.0-0.nightly-2019-03-19-004004

    $ oc version --short
    Client Version: v4.0.6
    Server Version: v1.12.4+befe71b

One worker node:

    $ oc get machineset -n openshift-machine-api
    NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
    skordas19-d7s2t-worker-us-east-2a   1         1         1       1           86m

Script: https://github.com/openshift/svt/pull/567

Test runs:

Run #1: 300 jobs: 383s (6m 23s), 1.277s/job
Run #2: 300 jobs: 397s (6m 37s), 1.323s/job
Run #3: 300 jobs: 379s (6m 19s), 1.263s/job
Run #4: 600 jobs: 771s (12m 51s), 1.285s/job
Run #5: 600 jobs: 796s (13m 16s), 1.327s/job

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
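A note on the fix itself: the operator PR above raises the kube-apiserver's HTTP/2 max streams per connection, the limit the kubelet's many watch connections were exhausting in the upstream issue. A minimal sketch for spot-checking the rendered setting on a live 4.x cluster; the flag name is the upstream kube-apiserver option (`http2-max-streams-per-connection`), but the ConfigMap location is an assumption here, and the exact value set by the operator is not stated in this report:

```bash
# Spot-check the rendered kube-apiserver config for the raised HTTP/2 stream
# limit. Assumes the operator renders its config into the "config" ConfigMap
# in the openshift-kube-apiserver namespace (assumption, not from this report).
oc -n openshift-kube-apiserver get configmap config -o yaml \
  | grep -i http2-max-streams-per-connection
```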
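The single-worker setup used in the retest can be reproduced by scaling the worker MachineSet down to one replica; a sketch, assuming a cluster-specific MachineSet name (the one shown is taken from the retest output above):

```bash
# Scale the (cluster-specific) worker MachineSet to a single replica so all
# jobs land on one node, matching the single-node scenario in the upstream issue.
oc -n openshift-machine-api scale machineset skordas19-d7s2t-worker-us-east-2a --replicas=1

# Confirm only one worker node remains.
oc get nodes -l node-role.kubernetes.io/worker
```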
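Finally, the reproduction sketch referenced above. This is not the actual SVT script from openshift/svt#567; it is a minimal bash sketch of the quoted scenario (job count, namespace name, and image are illustrative), creating N Jobs that each mount their own ConfigMap and reporting wall-clock time per job:

```bash
#!/usr/bin/env bash
# Minimal sketch of the many-ConfigMaps/Jobs load test (illustrative, not the
# SVT script). Creates N Jobs, each mounting its own ConfigMap, then times how
# long the cluster takes to complete them all.
set -euo pipefail

N=${1:-300}                  # job count, matching the retest runs above
NAMESPACE=cm-load-test       # illustrative namespace
oc create namespace "$NAMESPACE" 2>/dev/null || true

start=$(date +%s)
for i in $(seq 1 "$N"); do
  # One ConfigMap per Job, as in the upstream reproduction scenario.
  oc -n "$NAMESPACE" create configmap "hello-cm-$i" --from-literal=msg="Hello World $i"
  cat <<EOF | oc -n "$NAMESPACE" apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job-$i
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: hello
        image: busybox
        command: ["sh", "-c", "cat /config/msg"]
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: hello-cm-$i
EOF
done

# Wait for every Job to complete, then report seconds per job.
oc -n "$NAMESPACE" wait --for=condition=complete job --all --timeout=60m
end=$(date +%s)
echo "$N jobs in $((end - start))s ($(awk -v s="$start" -v e="$end" -v n="$N" \
  'BEGIN {printf "%.3f", (e - s) / n}')s/job)"
```

With the fix in place, per-job time should stay roughly flat as N grows, as in the Run #1 through #5 numbers above; without it, on 1.12/1.13 the run degrades sharply and the node can become unavailable, per the upstream issue.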