Bug 1908383 - OCP4.5 cluster unstable due to API Priority and Fairness feature
Summary: OCP4.5 cluster unstable due to API Priority and Fairness feature
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-16 14:59 UTC by Angelo Gabrieli
Modified: 2024-06-13 23:44 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-19 09:41:05 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 5880571 (last updated 2021-03-16 18:52:33 UTC)

Description Angelo Gabrieli 2020-12-16 14:59:06 UTC
Description of problem:

We can see the same error logs and symptoms described in

https://access.redhat.com/solutions/5448851
https://bugzilla.redhat.com/show_bug.cgi?id=1883589

even on an upgraded OpenShift 4.5.16 cluster:


2020-12-14T08:15:37.92824603Z E1214 08:15:37.926755       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
2020-12-14T08:15:37.92824603Z goroutine 95791021 [running]:
2020-12-14T08:15:37.92824603Z github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3f33ac0, 0xc00052e9c0)
2020-12-14T08:15:37.92824603Z   /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
2020-12-14T08:15:37.92824603Z github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc052b99bd8, 0x1, 0x1)
2020-12-14T08:15:37.92824603Z   /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
2020-12-14T08:15:37.92824603Z panic(0x3f33ac0, 0xc00052e9c0)
2020-12-14T08:15:37.92824603Z   /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
...
2020-12-14T09:23:05.05668025Z E1214 09:23:05.056185       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.05668025Z E1214 09:23:05.056331       1 errors.go:77] Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.058194459Z E1214 09:23:05.056932       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.058194459Z E1214 09:23:05.057034       1 errors.go:77] Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled


The cluster suddenly became unavailable and required a reboot of the master nodes.



Version-Release number of selected component (if applicable):

OCP 4.5.16


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 6 Abu Kashem 2021-01-11 15:08:20 UTC
agabriel,
Can you have the customer run the following Prometheus query in the web console (time range: the 48 hours leading up to the time the master was rebooted)?
> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)
Please share a screenshot of the graph here.
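If capturing the graph from the console is not practical, the same query can also be run against the cluster's monitoring API from a workstation. This is only a sketch; it assumes the thanos-querier route exists in openshift-monitoring and that the logged-in user is allowed to query cluster metrics:
> TOKEN=$(oc whoami -t)
> HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
> curl -skG -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" --data-urlencode 'query=sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)'
The console graph is still preferred for the screenshot; the curl form only returns the instantaneous values.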

Also, in the captured must-gather, grep the kube-apiserver logs for "timeout.go" and share the results with us here:
> grep -rni "timeout.go" {must-gather-folder}/logs/namespaces/openshift-kube-apiserver*


I suspect the customer cluster is running into a known issue; I have documented a workaround here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1902018#c17

Please share the must-gather and the Prometheus dump in a Google Drive folder. I don't have access to the support shell.

Comment 7 Abu Kashem 2021-01-11 15:21:44 UTC
agabriel,
Let's not apply the workaround just yet; let's first look at the query results I requested above and confirm.

Comment 14 Abu Kashem 2021-01-15 19:40:54 UTC
> Can you have the customer run the following Prometheus query in the web console (time range: the 48 hours leading up to the time the master was rebooted)?
> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)

I didn't see the result of the query in the attachment.
I have not yet downloaded the Prometheus data capture, so can you share the query result with me?

Comment 16 Abu Kashem 2021-01-21 14:48:22 UTC
sople, agabriel,


Can you have the customer run the following Prometheus query in the web console (time range: the 48 hours leading up to the time the master was rebooted)?
> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)

Please share the screenshot of the prometheus query result for both clusters so I can confirm whether these clusters are hitting the p&f issue.

Comment 19 Abu Kashem 2021-01-26 17:26:15 UTC
I can think of the following two scenarios:
- A: Traffic from certain workloads is being rejected due to a lack of concurrency shares.

- B: Panics in the apiserver are causing p&f to reject requests.
This is a known issue; the corresponding BZs are:
> 4.5 - https://bugzilla.redhat.com/show_bug.cgi?id=1912566 (it should land in the Z stream any day now, already qe verified)
> 4.6 - https://bugzilla.redhat.com/show_bug.cgi?id=1912564 (it has landed in 4.6.13)
> 4.7 - https://bugzilla.redhat.com/show_bug.cgi?id=1912563


A and B are not mutually exclusive; you can apply the solutions for both on the same cluster.

How do we identify this issue?
Monitor the kube-apiserver logs (all instances) with the following grep. Basically, we want to find out how frequent these panics are and whether they can trigger p&f to reject requests.
> grep -rni -E "timeout.go:(132|134)"
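If you have live access to the cluster rather than a must-gather, a quick way to count the panics per apiserver instance is something like the sketch below (the pod label and container name are assumptions; adjust them to your environment):
> for pod in $(oc get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver -o name); do
>   echo "== $pod =="
>   oc logs -n openshift-kube-apiserver -c kube-apiserver "$pod" | grep -ciE "timeout.go:(132|134)"
> done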

Also, the following Prometheus queries will give you an indication of which workload is exceeding its concurrency share:
> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))
> topk(25, sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel,instance))
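A rough way to see how close each priority level is to its limit is to graph the ratio of the two (a sketch derived from the queries above; sustained values near or above 1 for a priority level suggest it is saturated):
> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel) / sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel))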



Solution 'A': 
I would recommend applying the following (make sure priority and fairness is enabled):
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'
> oc patch prioritylevelconfiguration workload-low --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 100}}}'
> oc patch prioritylevelconfiguration global-default --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 20}}}'
 
It is always safe to apply the patches above on any cluster.
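To confirm the patches took effect, something like the following can be used (the field paths follow directly from the patch commands above):
> oc get flowschema service-accounts -o jsonpath='{.spec.priorityLevelConfiguration.name}{"\n"}'
> oc get prioritylevelconfiguration workload-low global-default -o custom-columns=NAME:.metadata.name,SHARES:.spec.limited.assuredConcurrencyShares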


Steps for B:
If the fix has not landed in your Z stream yet, or your cluster is not yet ready to upgrade to the Z stream that contains the fix, you can use the following workaround. This will stabilize the cluster.

Workaround 'B':
> 'oc apply' the following p&f rule as a temporary workaround.

apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
kind: FlowSchema
metadata:
  name: exempt-service-accounts
spec:
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 10
  priorityLevelConfiguration:
    name: exempt
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - '*'
      verbs:
      - '*'
    resourceRules:
    - apiGroups:
      - '*'
      clusterScope: true
      namespaces:
      - '*'
      resources:
      - '*'
      verbs:
      - '*'
    subjects:
    - group:
        name: system:serviceaccounts
      kind: Group


IMPORTANT: once the cluster upgrades to the Z stream with the fix, you need to delete the above p&f object:
> oc delete flowschema exempt-service-accounts


Solution 'B':
(make sure p&f is enabled)
- upgrade to the latest Z stream with the fix

- delete the p&f rule created from "Workaround B"
> oc delete flowschema exempt-service-accounts
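To double-check which Z stream the cluster is on before and after the upgrade (compare against the fixed versions listed in the BZs above), something like this works:
> oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
Afterwards, 'oc get flowschema exempt-service-accounts' should return NotFound once the workaround object has been deleted.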

Comment 28 Stefan Schimanski 2021-03-19 09:41:05 UTC
The initial customer cases are closed. Do not piggyback on existing BZs with new customer cases. Create new BZs and suggest that they look similar. We will decide what to do. Reusing BZs to escalate is not acceptable.

Closing, as the original customer cases are closed.

Comment 29 Red Hat Bugzilla 2023-09-15 00:53:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

