Bug 1908383 - OCP4.5 cluster unstable due to API Priority and Fairness feature
Summary: OCP4.5 cluster unstable due to API Priority and Fairness feature
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-16 14:59 UTC by Angelo Gabrieli
Modified: 2024-06-13 23:44 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-19 09:41:05 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 5880571 (last updated 2021-03-16 18:52:33 UTC)

Description Angelo Gabrieli 2020-12-16 14:59:06 UTC
Description of problem:

We can see the same error logs and symptoms described in

https://access.redhat.com/solutions/5448851
https://bugzilla.redhat.com/show_bug.cgi?id=1883589

even on an upgraded OpenShift 4.5.16 cluster:


2020-12-14T08:15:37.92824603Z E1214 08:15:37.926755       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
2020-12-14T08:15:37.92824603Z goroutine 95791021 [running]:
2020-12-14T08:15:37.92824603Z github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3f33ac0, 0xc00052e9c0)
2020-12-14T08:15:37.92824603Z   /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
2020-12-14T08:15:37.92824603Z github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc052b99bd8, 0x1, 0x1)
2020-12-14T08:15:37.92824603Z   /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
2020-12-14T08:15:37.92824603Z panic(0x3f33ac0, 0xc00052e9c0)
2020-12-14T08:15:37.92824603Z   /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
...
2020-12-14T09:23:05.05668025Z E1214 09:23:05.056185       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.05668025Z E1214 09:23:05.056331       1 errors.go:77] Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.058194459Z E1214 09:23:05.056932       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.058194459Z E1214 09:23:05.057034       1 errors.go:77] Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled


The cluster suddenly became unavailable and required a reboot of the master nodes.



Version-Release number of selected component (if applicable):

OCP 4.5.16


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 6 Abu Kashem 2021-01-11 15:08:20 UTC
agabriel,
Can you have the customer run the following Prometheus query in the web console (time range: the 48 hours leading up to the time the master was rebooted)?
> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)
Please share a screenshot of the graph here.
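If capturing the graph from the console is not practical, the same query can also be run against the cluster's monitoring API from a workstation. This is only a sketch; it assumes the thanos-querier route exists in openshift-monitoring and that the logged-in user is allowed to query cluster metrics:
> TOKEN=$(oc whoami -t)
> HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
> curl -skG -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" --data-urlencode 'query=sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)'
The console graph is still preferred for the screenshot; the curl form only returns the instantaneous values.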

Also, in the captured must-gather, grep the kube-apiserver logs for "timeout.go" and share the results with us here:
> grep -rni "timeout.go" {must-gather-folder}/logs/namespaces/openshift-kube-apiserver*


I suspect the customer cluster is running into a known issue; I have documented a workaround here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1902018#c17

Please share the must-gather and the Prometheus dump in a Google Drive folder. I don't have access to the support shell.

Comment 7 Abu Kashem 2021-01-11 15:21:44 UTC
agabriel,
Let's not apply the workaround just yet; let's first look at the query results I requested above and confirm.

Comment 14 Abu Kashem 2021-01-15 19:40:54 UTC
> Can you have the customer run the following Prometheus query in the web console (time range: the 48 hours leading up to the time the master was rebooted)?
> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)

I didn't see the result of the query in the attachment.
I have not yet downloaded the Prometheus data capture, so can you share the query result with me?

Comment 16 Abu Kashem 2021-01-21 14:48:22 UTC
sople, agabriel,


Can you have the customer run the following Prometheus query in the web console (time range: the 48 hours leading up to the time the master was rebooted)?
> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)

Please share the screenshot of the prometheus query result for both clusters so I can confirm whether these clusters are hitting the p&f issue.

Comment 19 Abu Kashem 2021-01-26 17:26:15 UTC
I can think of the following two scenarios:
- A: Traffic from certain workloads is being rejected due to a lack of concurrency shares.

- B: Panics in the apiserver are causing p&f to reject requests.
This is a known issue; the corresponding BZs are:
> 4.5 - https://bugzilla.redhat.com/show_bug.cgi?id=1912566 (it should land in the Z stream any day now, already qe verified)
> 4.6 - https://bugzilla.redhat.com/show_bug.cgi?id=1912564 (it has landed in 4.6.13)
> 4.7 - https://bugzilla.redhat.com/show_bug.cgi?id=1912563


A and B are not mutually exclusive; you can apply the solutions for both on the same cluster.

How do we identify this issue?
Monitor the kube-apiserver logs (all instances) with the following grep. Basically, we want to find out how frequent these panics are and whether they can trigger p&f to reject requests.
> grep -rni -E "timeout.go:(132|134)"
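If you have live access to the cluster rather than a must-gather, a quick way to count the panics per apiserver instance is something like the sketch below (the pod label and container name are assumptions; adjust them to your environment):
> for pod in $(oc get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver -o name); do
>   echo "== $pod =="
>   oc logs -n openshift-kube-apiserver -c kube-apiserver "$pod" | grep -ciE "timeout.go:(132|134)"
> done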

Also, the following Prometheus queries will give you an indication of which workload is exceeding its concurrency share:
> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))
> topk(25, sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel,instance))
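A rough way to see how close each priority level is to its limit is to graph the ratio of the two (a sketch derived from the queries above; sustained values near or above 1 for a priority level suggest it is saturated):
> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel) / sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel))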



Solution 'A': 
I would recommend applying the following (make sure priority and fairness is enabled):
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'
> oc patch prioritylevelconfiguration workload-low --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 100}}}'
> oc patch prioritylevelconfiguration global-default --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 20}}}'
 
It is always safe to apply the patches above on any cluster.
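To confirm the patches took effect, something like the following can be used (the field paths follow directly from the patch commands above):
> oc get flowschema service-accounts -o jsonpath='{.spec.priorityLevelConfiguration.name}{"\n"}'
> oc get prioritylevelconfiguration workload-low global-default -o custom-columns=NAME:.metadata.name,SHARES:.spec.limited.assuredConcurrencyShares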


Steps for B:
If the fix has not landed in your Z stream yet, or your cluster is not yet ready to upgrade to the Z stream that contains the fix, you can use the following workaround. This will stabilize the cluster.

Workaround 'B':
> 'oc apply' the following p&f rule as a temporary workaround.

apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
kind: FlowSchema
metadata:
  name: exempt-service-accounts
spec:
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 10
  priorityLevelConfiguration:
    name: exempt
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - '*'
      verbs:
      - '*'
    resourceRules:
    - apiGroups:
      - '*'
      clusterScope: true
      namespaces:
      - '*'
      resources:
      - '*'
      verbs:
      - '*'
    subjects:
    - group:
        name: system:serviceaccounts
      kind: Group


IMPORTANT: once the cluster upgrades to the Z stream with the fix, you need to delete the above p&f object:
> oc delete flowschema exempt-service-accounts


Solution 'B':
(make sure p&f is enabled)
- upgrade to the latest Z stream with the fix

- delete the p&f rule created from "Workaround B"
> oc delete flowschema exempt-service-accounts
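To double-check which Z stream the cluster is on before and after the upgrade (compare against the fixed versions listed in the BZs above), something like this works:
> oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
Afterwards, 'oc get flowschema exempt-service-accounts' should return NotFound once the workaround object has been deleted.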

Comment 28 Stefan Schimanski 2021-03-19 09:41:05 UTC
The initial customer cases are closed. Do not piggyback on existing BZs with new customer cases. Create new BZs and suggest that they look similar. We will decide what to do. Reusing BZs to escalate is not acceptable.

Closing, as the original customer cases are closed.

Comment 29 Red Hat Bugzilla 2023-09-15 00:53:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

