Bug 1926724

Summary: p&f: add auto update for priority & fairness bootstrap configuration objects
Product: OpenShift Container Platform
Reporter: Ke Wang <kewang>
Component: kube-apiserver
Assignee: Abu Kashem <akashem>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Docs Contact:
Priority: low
Version: 4.7
CC: aos-bugs, mfojtik, wlewis, xxia
Target Milestone: ---   
Target Release: 4.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: LifecycleReset
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1927358 1927397 1930005 (view as bug list)
Environment:
Last Closed: 2021-12-01 13:35:22 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1927397    
Bug Blocks: 1927358, 1930005    

Description Ke Wang 2021-02-09 10:49:45 UTC
This bug was initially created as a copy of Bug #1891108

I am copying this bug because, per https://bugzilla.redhat.com/show_bug.cgi?id=1891108#c8 and https://bugzilla.redhat.com/show_bug.cgi?id=1891108#c11, the issue still occurs on upgrade, so a new bug is needed to track it.

+++ This bug was initially created as a clone of Bug #1891107 +++

+++ This bug was initially created as a clone of Bug #1891106 +++

priority & fairness: Increase the concurrency share of workload-low priority level

carry upstream PR: https://github.com/kubernetes/kubernetes/pull/95259

All workloads running with a service account (except those that p&f distinguishes with a logically higher matching precedence) match the `service-accounts` flow schema and are assigned to the `workload-low` priority level, so they get only `20` concurrency shares (~10% of the total).

On the other hand, the `global-default` flow schema is assigned to the `global-default` priority level configuration and thus has `100` concurrency shares (~50% of the total). If I am not mistaken, `global-default` goes pretty much unused, since only workloads running as a user (not a service account) fall into this category, and that is not very common.

Workloads with service accounts do not have enough concurrency shares and may starve. Increase the concurrency shares of `workload-low` from `20` to `100` and reduce those of `global-default` from `100` to `20`.
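
The percentages above can be reproduced with a small sketch. The per-level share values and the resulting total of 201 are assumptions for illustration (they are not listed in this report); a real cluster's total depends on which priority levels are configured. `share_percent` is a hypothetical helper name.

```shell
# share_percent <shares> <total-shares>
# Print one priority level's portion of all concurrency shares, in percent.
share_percent() {
    awk -v s="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * s / t }'
}

# ASSUMED share values, for illustration only:
# system=30 leader-election=10 workload-high=40 workload-low=20 global-default=100 catch-all=1
total=$((30 + 10 + 40 + 20 + 100 + 1))

# Swapping 20 and 100 between the two levels leaves the total unchanged.
echo "before: workload-low=$(share_percent 20 "$total")% global-default=$(share_percent 100 "$total")%"
echo "after:  workload-low=$(share_percent 100 "$total")% global-default=$(share_percent 20 "$total")%"
```

Under these assumed values the swap gives `workload-low` roughly half of the total while `global-default` drops to roughly a tenth.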

We have been asking customers to apply the patch manually: https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c56
> oc patch prioritylevelconfiguration workload-low --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 100}}}'
> oc patch prioritylevelconfiguration global-default --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 20}}}'

This change removes the need for the manual patch.

Comment 2 Abu Kashem 2021-02-10 15:20:28 UTC
Okay, so I have opened a PR upstream to auto update the p&f bootstrap configuration objects: https://github.com/kubernetes/kubernetes/pull/98028. This should resolve this BZ.

I have also opened a test PR in o/k 4.7 so you can do an early upgrade test and verify that this PR resolves the issue. Also, please go through the PR description and come up with a test plan. Please let me know if you have any questions.
> o/k PR: https://github.com/openshift/kubernetes/pull/563

This will go into a 4.7 Z stream, is that correct?

Comment 3 Abu Kashem 2021-02-10 15:30:23 UTC
setting it to 4.7.Z release

Comment 5 Michal Fojtik 2021-03-12 16:07:20 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 6 Ke Wang 2021-09-22 14:34:54 UTC
This bug's PR is dev-approved but not yet merged, so I'm following issue DPTP-660 to do pre-merge verification (the QE pre-merge verification goal of issue OCPQE-815), using the bot to build an image with the open PR. Here are the verification steps:
1. Freshly installed an OCP 4.6 cluster.
2. Upgraded to 4.7 using the image built with the PR.

$ oc get clusterversion -o json|jq ".items[0].status.history"   
[
  {
    "completionTime": "2021-09-22T11:47:56Z",
    "image": "registry.build01.ci.openshift.org/ci-ln-fwqtwkt/release:latest",
    "startedTime": "2021-09-22T10:46:14Z",
    "state": "Completed",
    "verified": false,
    "version": "4.7.0-0.ci.test-2021-09-22-071911-ci-ln-fwqtwkt-latest"
  },
  {
    "completionTime": "2021-09-22T09:00:58Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:a3a26cf19be8b991ab94337580bd693857474f07c961f180c6ba67683ab91b8c",
    "startedTime": "2021-09-22T08:35:03Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.45"
  }
]

$ oc get FlowSchema
NAME                                PRIORITYLEVEL                       MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE     MISSINGPL
exempt                              exempt                              1                    <none>                5h52m   False
openshift-apiserver-sar             exempt                              2                    ByUser                5h47m   False
openshift-oauth-apiserver-sar       exempt                              2                    ByUser                5h47m   False
probes                              exempt                              2                    <none>                3h32m   False
system-leader-election              leader-election                     100                  ByUser                5h52m   False
workload-leader-election            leader-election                     200                  ByUser                5h52m   False
openshift-sdn                       system                              500                  ByUser                3h7m    False
system-nodes                        system                              500                  ByUser                5h52m   False
kube-controller-manager             workload-high                       800                  ByNamespace           5h52m   False
kube-scheduler                      workload-high                       800                  ByNamespace           5h52m   False
kube-system-service-accounts        workload-high                       900                  ByNamespace           5h52m   False
openshift-apiserver                 workload-high                       1000                 ByUser                5h47m   False
openshift-controller-manager        workload-high                       1000                 ByUser                5h47m   False
openshift-oauth-apiserver           workload-high                       1000                 ByUser                5h47m   False
openshift-oauth-server              workload-high                       1000                 ByUser                5h47m   False
openshift-apiserver-operator        openshift-control-plane-operators   2000                 ByUser                5h47m   False
openshift-authentication-operator   openshift-control-plane-operators   2000                 ByUser                5h47m   False
openshift-etcd-operator             openshift-control-plane-operators   2000                 ByUser                5h47m   False
openshift-kube-apiserver-operator   openshift-control-plane-operators   2000                 ByUser                5h47m   False
openshift-monitoring-metrics        workload-high                       2000                 ByUser                5h47m   False
service-accounts                    workload-low                        9000                 ByUser                5h52m   False
global-default                      global-default                      9900                 ByUser                5h52m   False
catch-all                           catch-all                           10000                ByUser                5h52m   False
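
To see at a glance which flow schemas feed a given priority level, the table above can be filtered. This is only a sketch: `schemas_for_level` is a hypothetical helper, and it assumes the column layout shown in this listing (name, priority level, matching precedence).

```shell
# schemas_for_level <priority-level>
# Read `oc get FlowSchema` output on stdin (header line included) and print
# "<matching-precedence> <flow-schema>" for each schema bound to that level,
# ordered by ascending matching precedence.
schemas_for_level() {
    awk -v level="$1" 'NR > 1 && $2 == level { print $3, $1 }' | sort -n
}
```

For example, `oc get FlowSchema | schemas_for_level workload-low` would list only the schemas assigned to `workload-low`.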

$ oc get prioritylevelconfiguration workload-low -o jsonpath='{.spec.limited.assuredConcurrencyShares}'

$ oc get prioritylevelconfiguration global-default -o jsonpath='{.spec.limited.assuredConcurrencyShares}'

So the bug is pre-merge verified. After the PR gets merged, the bug will be moved to VERIFIED by the bot automatically or, if that does not work, by me manually.
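
The two checks above could be wrapped in a small helper so that a mismatch fails loudly. This is a sketch, not part of the verification that was actually run: `check_shares` is a hypothetical function name, and the expected values (100 and 20) are the targets stated in the description.

```shell
# check_shares <priority-level-name> <expected-shares>
# Query the live object and compare its assuredConcurrencyShares to the expected value.
# Returns 0 on match, 1 on mismatch, 2 if the oc query itself fails.
check_shares() {
    name=$1
    expected=$2
    actual=$(oc get prioritylevelconfiguration "$name" \
        -o jsonpath='{.spec.limited.assuredConcurrencyShares}') || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: $name has $actual shares"
    else
        echo "MISMATCH: $name has '$actual' shares, expected $expected"
        return 1
    fi
}
```

After an upgrade, `check_shares workload-low 100 && check_shares global-default 20` would then succeed only if both objects carry the new values.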

Comment 7 Michal Fojtik 2021-09-22 15:28:16 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset.
The bug assignee was notified.

Comment 8 Ke Wang 2021-11-04 02:35:01 UTC
Based on comment 6 above (https://bugzilla.redhat.com/show_bug.cgi?id=1926724#c6), the bug was verified.

Comment 15 errata-xmlrpc 2021-12-01 13:35:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.