priority & fairness: Increase the concurrency share of the workload-low priority level

Carry of upstream PR: https://github.com/kubernetes/kubernetes/pull/95259

All workloads running with a service account (except for the ones matched by P&F flow schemas with a logically higher matching precedence) match the `service-accounts` flow schema, are assigned to the `workload-low` priority level, and thus get only `20` concurrency shares (~10% of the total). On the other hand, the `global-default` flow schema is assigned to the `global-default` priority level configuration and thus gets `100` concurrency shares (~50% of the total). If I am not mistaken, `global-default` goes pretty much unused, since only workloads running as a user (not a service account) fall into this category, which is not very common.

Workloads with service accounts do not have enough concurrency shares and may starve. Increase the concurrency shares of `workload-low` from `20` to `100` and reduce those of `global-default` from `100` to `20`.

We have been asking customers to apply the patch manually:
https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c56
> oc patch prioritylevelconfiguration workload-low --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 100}}}'
> oc patch prioritylevelconfiguration global-default --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 20}}}'

This change will get rid of the need for the manual patch.
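For reference, a minimal sketch of what the adjusted `workload-low` object would look like after this change, assuming the `flowcontrol.apiserver.k8s.io/v1alpha1` API version shipped in this release and keeping the existing queuing parameters (128 queues, hand size 6, queue length limit 50, as seen in the `oc get PriorityLevelConfiguration` output later in this bug):

apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
kind: PriorityLevelConfiguration
metadata:
  name: workload-low
spec:
  type: Limited
  limited:
    # raised from 20 (~10% of total shares) to 100
    assuredConcurrencyShares: 100
    limitResponse:
      type: Queuing
      queuing:
        queues: 128
        handSize: 6
        queueLengthLimit: 50

The `global-default` object would be the mirror image, dropping from 100 to 20 assured concurrency shares.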
The 4.5 PR is open: https://github.com/openshift/origin/pull/25627.
Waiting on 4.6 QE verification.
> Waiting on 4.6 QE verification.

Precisely speaking, we are waiting on the 4.6 bug to become ON_QA; without that, this bug cannot be moved to VERIFIED. That said, Ke is already attempting pre-merge verification on the 4.6 bug, with a comment there.
For this 4.5 bug, also following the pre-merge verification process defined in the Description of Jira issue DPTP-660, I tried: use the cluster-bot `build openshift/origin#25627`, then use the returned payload "4.5.0-0.ci.test-2020-12-11-040947-ci-ln-dr3j9rb" to install an env. But the installation has failed twice so far with:

level=info msg="Bootstrap gather logs captured here \"/home/jenkins/ws/workspace/Launch Environment Flexy/workdir/install-dir/log-bundle-20201211062503.tar.gz\""
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition"

Checking `oc get co`, found:
kube-scheduler   4.5.0-0.ci.test-2020-12-11-040947-ci-ln-dr3j9rb   False   True   True   37m

Checking the kube-scheduler pods, found:
openshift-kube-scheduler-ip-10-0-134-155.ap-northeast-2.compute.internal   1/2   CrashLoopBackOff   10   27m
openshift-kube-scheduler-ip-10-0-167-25.ap-northeast-2.compute.internal    1/2   CrashLoopBackOff   9    24m
openshift-kube-scheduler-ip-10-0-207-53.ap-northeast-2.compute.internal    1/2   CrashLoopBackOff   10   26m

The kube-scheduler pods' YAML shows:
I1211 06:29:57.779669 1 flags.go:33] FLAG: --write-config-to=""
no kind "KubeSchedulerConfiguration" is registered for version "kubescheduler.config.k8s.io/v1beta1" in scheme "k8s.io/kubernetes/pkg/scheduler/apis/config/scheme/scheme.go:31"
reason: Error
startedAt: "2020-12-11T06:29:57Z"
name: kube-scheduler
ready: false
restartCount: 10

I have no idea why I'm hitting these errors; the same approach worked for me when installing pre-merge verification envs for other bugs, though. Let me try cluster-bot launching directly instead: `launch openshift/origin#25627 aws`, and check later.
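For the record, the failing container's log can also be pulled directly from one of the pods above; a minimal sketch, assuming cluster-admin access and the `kube-scheduler` container name shown in the pod YAML:

$ oc -n openshift-kube-scheduler logs openshift-kube-scheduler-ip-10-0-134-155.ap-northeast-2.compute.internal -c kube-scheduler --previous

(`--previous` shows the log of the container's last crashed attempt.)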
(In reply to Xingxing Xia from comment #4)
> I have no idea why I'm hitting these errors. ... Let me try cluster-bot launching directly instead: `launch openshift/origin#25627 aws`, and check later.

The cluster-bot `launch openshift/origin#25627 aws` returns:
"your cluster failed to launch: pod never became available: container setup did not succeed, see logs for details (logs)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1337283387622166528"
where it also shows the same CrashLoopBackOff installation failures:

level=error msg="... \"kube-scheduler\" is not ready: CrashLoopBackOff ...
...
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition"

Abu Kashem, I thus cannot launch a successful cluster with this bug's PR, let alone have a cluster on which to pre-merge verify this bug, so the pre-merge verification fails for now. Please check.
Found the reason: the errors are from bug 1887942, whose "origin" repo PR was merged on Oct 25, while your "origin" repo PR was opened on Oct 24.
The incomplete installation does not block running the `oc` checks:

$ oc get flowschema
NAME                                PRIORITYLEVEL                       MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE     MISSINGPL
exempt                              exempt                              1                    <none>                4h7m    False
openshift-apiserver-sar             exempt                              2                    ByUser                3h47m   False
system-leader-election              leader-election                     100                  ByUser                4h7m    False
workload-leader-election            leader-election                     200                  ByUser                4h7m    False
system-nodes                        system                              500                  ByUser                4h7m    False
kube-controller-manager             workload-high                       800                  ByNamespace           4h7m    False
kube-scheduler                      workload-high                       800                  ByNamespace           4h7m    False
kube-system-service-accounts        workload-high                       900                  ByNamespace           4h7m    False
openshift-apiserver                 workload-high                       1000                 ByUser                3h47m   False
openshift-controller-manager        workload-high                       1000                 ByUser                3h47m   False
openshift-oauth-server              workload-high                       1000                 ByUser                3h47m   False
openshift-apiserver-operator        openshift-control-plane-operators   2000                 ByUser                3h47m   False
openshift-authentication-operator   openshift-control-plane-operators   2000                 ByUser                3h47m   False
openshift-etcd-operator             openshift-control-plane-operators   2000                 ByUser                3h47m   False
openshift-kube-apiserver-operator   openshift-control-plane-operators   2000                 ByUser                3h47m   False
openshift-monitoring-metrics        workload-high                       2000                 ByUser                3h47m   False
service-accounts                    workload-low                        9000                 ByUser                4h7m    False
global-default                      global-default                      9900                 ByUser                4h7m    False
catch-all                           catch-all                           10000                ByUser                4h7m    False

$ oc get PriorityLevelConfiguration
NAME                                TYPE      ASSUREDCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
catch-all                           Limited   1                          <none>   <none>     <none>             4h8m
exempt                              Exempt    <none>                     <none>   <none>     <none>             4h8m
global-default                      Limited   20                         128      6          50                 4h8m
leader-election                     Limited   10                         16       4          50                 4h8m
openshift-control-plane-operators   Limited   10                         128      6          50                 3h48m
system                              Limited   30                         64       6          50                 4h8m
workload-high                       Limited   40                         128      6          50                 4h8m
workload-low                        Limited   100                        128      6          50                 4h8m

The pre-merge verification of this bug passes as expected per the PR: workload-low is changed to 100 ACS and global-default is changed to 20 ACS.
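To check just the two values touched by the PR without scanning the whole table, something like the following jsonpath query should also work (a sketch; requesting the two objects by name wraps them in a List, so `.items[*]` applies):

$ oc get prioritylevelconfiguration workload-low global-default -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.limited.assuredConcurrencyShares}{"\n"}{end}'

This should print 100 for workload-low and 20 for global-default, matching the table above.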
xxia, looks good, also see https://bugzilla.redhat.com/show_bug.cgi?id=1891107#c8. Thanks for doing this!
(In reply to Abu Kashem from comment #0)
> We have been asking customers to apply the patch manually:
> https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c56
> > oc patch prioritylevelconfiguration workload-low --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 100}}}'
> > oc patch prioritylevelconfiguration global-default --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 20}}}'
>
> This change will get rid of the need for the manual patch.

In more detail, bug #1883589 was a 4.5 bug with 16 cases attached. https://access.redhat.com/solutions/5448851 documents the recommended workarounds for the issue described there. This bug removes the need for the prioritylevelconfiguration patches in workaround 1.
The bug should have been moved to VERIFIED automatically, but it wasn't. Manually tested again in 4.5.0-0.nightly-2021-03-04-150339:

$ oc get flowschema
NAME                                PRIORITYLEVEL                       MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE   MISSINGPL
exempt                              exempt                              1                    <none>                63m   False
openshift-apiserver-sar             exempt                              2                    ByUser                47m   False
system-leader-election              leader-election                     100                  ByUser                63m   False
workload-leader-election            leader-election                     200                  ByUser                63m   False
system-nodes                        system                              500                  ByUser                63m   False
kube-controller-manager             workload-high                       800                  ByNamespace           63m   False
kube-scheduler                      workload-high                       800                  ByNamespace           63m   False
kube-system-service-accounts        workload-high                       900                  ByNamespace           63m   False
openshift-apiserver                 workload-high                       1000                 ByUser                47m   False
openshift-controller-manager        workload-high                       1000                 ByUser                47m   False
openshift-oauth-server              workload-high                       1000                 ByUser                47m   False
openshift-apiserver-operator        openshift-control-plane-operators   2000                 ByUser                47m   False
openshift-authentication-operator   openshift-control-plane-operators   2000                 ByUser                47m   False
openshift-etcd-operator             openshift-control-plane-operators   2000                 ByUser                47m   False
openshift-kube-apiserver-operator   openshift-control-plane-operators   2000                 ByUser                48m   False
openshift-monitoring-metrics        workload-high                       2000                 ByUser                48m   False
service-accounts                    workload-low                        9000                 ByUser                63m   False
global-default                      global-default                      9900                 ByUser                63m   False
catch-all                           catch-all                           10000                ByUser                63m   False

$ oc get PriorityLevelConfiguration
NAME                                TYPE      ASSUREDCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
catch-all                           Limited   1                          <none>   <none>     <none>             63m
exempt                              Exempt    <none>                     <none>   <none>     <none>             63m
global-default                      Limited   20                         128      6          50                 63m
leader-election                     Limited   10                         16       4          50                 63m
openshift-control-plane-operators   Limited   10                         128      6          50                 49m
system                              Limited   30                         64       6          50                 63m
workload-high                       Limited   40                         128      6          50                 63m
workload-low                        Limited   100                        128      6          50                 63m

Now workload-low is changed to 100 ACS and global-default is changed to 20 ACS, as expected.
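As an additional sanity check, the per-priority-level concurrency limits derived from these shares can be read from the kube-apiserver metrics; a sketch, assuming the `apiserver_flowcontrol_request_concurrency_limit` metric is exposed by this version's API Priority and Fairness implementation:

$ oc get --raw /metrics | grep apiserver_flowcontrol_request_concurrency_limit

The workload-low series should now report a noticeably larger limit than the global-default one, reflecting the 100 vs 20 shares.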
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.34 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0714