Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1902018

Summary: Many HTTP 429 error reported by kube-apiserver - problem disappeared after disabling APIPriorityAndFairness
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.5
Hardware: x86_64
OS: Linux
Reporter: Simon Reber <sreber>
Assignee: Abu Kashem <akashem>
QA Contact: Ke Wang <kewang>
Status: CLOSED NOTABUG
Severity: high
Priority: high
CC: akashem, anowak, aos-bugs, mfojtik, oarribas, ocasalsa, sttts, wlewis, xxia
Type: Bug
Last Closed: 2022-02-25 18:10:53 UTC

Description Simon Reber 2020-11-26 15:24:11 UTC
Description of problem:

Many HTTP 429 errors are reported by the kube-apiserver, rendering OpenShift Container Platform 4.5.16 unusable (the only way to recover is to reboot all OpenShift Container Platform Control-Plane nodes).

Since we were aware of https://bugzilla.redhat.com/show_bug.cgi?id=1883589, we disabled APIPriorityAndFairness and monitored the behavior to see whether the problem came back. So far the cluster remains stable, which raises the question of why we still saw problems, considering that improvements around APIPriorityAndFairness landed in OpenShift Container Platform 4.5.16:

 + https://access.redhat.com/solutions/5448851
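For reference, a minimal sketch of how APIPriorityAndFairness can be disabled on this release, assuming the mechanism is the `unsupportedConfigOverrides` stanza of the `cluster` KubeAPIServer operator resource (the exact procedure is in the linked solution article; treat this as an unsupported debugging measure, not a permanent configuration):

```yaml
# Sketch (assumption): turn off the APIPriorityAndFairness feature gate via an
# unsupported override on the kube-apiserver operator. Merge-patch this into
# the existing "cluster" KubeAPIServer resource rather than replacing it.
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  unsupportedConfigOverrides:
    apiServerArguments:
      feature-gates:
      - APIPriorityAndFairness=false
```

Setting `unsupportedConfigOverrides` back to `null` would re-enable the feature gate once a fixed version is rolled out.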


Version-Release number of selected component (if applicable):

 - 4.5.16

How reproducible:

 - N/A

Steps to Reproduce:
1. N/A

Actual results:

Many HTTP 429 responses are reported in the audit logs and the OpenShift Container Platform Control-Plane becomes unusable. Disabling APIPriorityAndFairness resolved the problem.
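For context, each audit log line is a JSON event, and a throttled request shows up with `responseStatus.code: 429`. A heavily abbreviated, illustrative example, shown in YAML form for readability (the username and URI below are made up; on OpenShift the logs live under `/var/log/kube-apiserver/audit.log` on each control-plane node):

```yaml
# Illustrative only: an audit event for a request rejected with HTTP 429.
# Real entries are one JSON object per line; field values here are made up.
kind: Event
apiVersion: audit.k8s.io/v1
verb: get
user:
  username: system:serviceaccount:openshift-monitoring:prometheus-k8s
requestURI: /api/v1/namespaces/openshift-monitoring/pods
responseStatus:
  code: 429
  reason: TooManyRequests
```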

Expected results:

No problems reported, even with APIPriorityAndFairness enabled

Additional info:

Comment 11 Abu Kashem 2020-12-07 19:34:17 UTC
sreber,

I downloaded the prometheus dump, but the tarball seems to be corrupt. I get an `unexpected EOF` error when I try to extract it.

Comment 17 Abu Kashem 2021-01-08 15:36:03 UTC
sreber,
Looking at the metrics, it looks like the cluster is hitting the p&f panic bug.
To work around the issue, you can apply the following YAML, which exempts service accounts (i.e. cluster workloads) from priority-and-fairness throttling. This should stabilize the cluster while you have p&f enabled.
We also need to pinpoint the underlying root cause (the panic we are seeing); I will check the must-gather logs to pinpoint the panic.


apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
kind: FlowSchema
metadata:
  name: exempt-service-accounts
spec:
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 10   # lower values are matched first, ahead of the default schemas
  priorityLevelConfiguration:
    name: exempt           # requests at this level bypass p&f queuing and limits
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - '*'
      verbs:
      - '*'
    resourceRules:
    - apiGroups:
      - '*'
      clusterScope: true
      namespaces:
      - '*'
      resources:
      - '*'
      verbs:
      - '*'
    subjects:              # all service accounts, i.e. cluster workloads
    - group:
        name: system:serviceaccounts
      kind: Group


Please delete this flowschema once you upgrade to the version with the fix.


Update:
- The PR that fixes the p&f panic issue has merged upstream: https://github.com/kubernetes/kubernetes/pull/97206

We are backporting the fix to 4.5, 4.6 and 4.7:
> 4.5: https://github.com/openshift/origin/pull/25777
> 4.6: https://github.com/openshift/kubernetes/pull/502 and https://github.com/openshift/kubernetes/pull/501
> 4.7: https://github.com/openshift/kubernetes/pull/509 and https://github.com/openshift/kubernetes/pull/508

The PRs for master/4.7 have merged and I have asked QE to expedite testing. The corresponding BZ for this is: https://bugzilla.redhat.com/show_bug.cgi?id=1912564