Description of problem:

We can see the same error logs and symptoms described in https://access.redhat.com/solutions/5448851 and https://bugzilla.redhat.com/show_bug.cgi?id=1883589 even on an upgraded OpenShift 4.5.16 cluster:

2020-12-14T08:15:37.92824603Z E1214 08:15:37.926755       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
2020-12-14T08:15:37.92824603Z goroutine 95791021 [running]:
2020-12-14T08:15:37.92824603Z github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3f33ac0, 0xc00052e9c0)
2020-12-14T08:15:37.92824603Z   /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
2020-12-14T08:15:37.92824603Z github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc052b99bd8, 0x1, 0x1)
2020-12-14T08:15:37.92824603Z   /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
2020-12-14T08:15:37.92824603Z panic(0x3f33ac0, 0xc00052e9c0)
2020-12-14T08:15:37.92824603Z   /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
...
2020-12-14T09:23:05.05668025Z E1214 09:23:05.056185       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.05668025Z E1214 09:23:05.056331       1 errors.go:77] Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.058194459Z E1214 09:23:05.056932       1 webhook.go:199] Failed to make webhook authorizer request: Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled
2020-12-14T09:23:05.058194459Z E1214 09:23:05.057034       1 errors.go:77] Post https://172.18.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled

The cluster suddenly became unavailable and required a reboot of the master nodes.

Version-Release number of selected component (if applicable):
OCP 4.5.16

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
agabriel, can you have the customer run the following Prometheus query in the web console (time range = starting at the time the master was rebooted and going back 48 hours)?

> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)

Please share a screenshot of the graph here.

Also, on the captured must-gather, grep the kube-apiserver logs for "timeout.go" and share the results with us here:

> grep -rni "timeout.go" {must-gather-folder}/logs/namespaces/openshift-kube-apiserver*

I suspect the customer cluster is running into a known issue; I have documented a workaround here:

> https://bugzilla.redhat.com/show_bug.cgi?id=1902018#c17

Please share the must-gather and the Prometheus dump in a Google Drive folder. I don't have access to the support shell.
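If a console screenshot is awkward to capture, the same data can also be pulled over the Prometheus HTTP API. This is only a sketch, assuming the cluster exposes the standard thanos-querier route in openshift-monitoring and the logged-in user is allowed to query cluster monitoring; the start/end timestamps below are placeholders and should be adjusted to cover the 48 hours before the master reboot:

# hypothetical helper commands, not part of the must-gather tooling
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)
# range query: executing requests per flowSchema/priorityLevel, 5-minute resolution
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query_range" \
  --data-urlencode 'query=sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)' \
  --data-urlencode 'start=2020-12-12T08:00:00Z' \
  --data-urlencode 'end=2020-12-14T08:00:00Z' \
  --data-urlencode 'step=300'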
agabriel, let's not apply the workaround just yet; let's look at the query results I requested ^ first and confirm.
> can you have the customer run the following prometheus query on the web console (time range = starting with the time master was rebooted and going back to 48 hours)
> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)

I didn't see the result of the query in the attachment. I have not yet downloaded the Prometheus data capture, so can you share the query result with me?
sople, agabriel, can you have the customer run the following Prometheus query in the web console (time range = starting at the time the master was rebooted and going back 48 hours)?

> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)

Please share a screenshot of the Prometheus query result for both clusters so I can confirm whether these clusters are hitting the P&F issue.
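In case it helps with reading the results: a rough saturation check (a sketch, using the same flowcontrol metrics already referenced in this bug) is to divide the executing requests by the configured concurrency limit per priority level; values approaching 1 for a sustained period suggest that priority level is running out of concurrency shares:

> sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel) / sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel)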
I can think of the two following scenarios:

- A: traffic from certain workloads is being rejected due to lack of concurrency share.
- B: panics from the apiserver are causing P&F to reject requests.

It's a known issue and the corresponding BZs are:
> 4.5 - https://bugzilla.redhat.com/show_bug.cgi?id=1912566 (it should land in the Z stream any day now, already QE verified)
> 4.6 - https://bugzilla.redhat.com/show_bug.cgi?id=1912564 (it has landed in 4.6.13)
> 4.7 - https://bugzilla.redhat.com/show_bug.cgi?id=1912563

A and B are not mutually exclusive; you can apply the solutions for both on a cluster.

How do we identify this issue? Monitor the kube-apiserver logs (all instances) with the following grep. Basically, we want to find out how frequent these panics are and whether they can trigger P&F to reject requests.
> grep -rni -E "timeout.go:(132|134)"

Also, the following Prometheus queries will give you an indication of which workload is exceeding its concurrency share:
> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))
> topk(25, sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel,instance))

Solution 'A': I would recommend applying the following (make sure priority and fairness is enabled):
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'
> oc patch prioritylevelconfiguration workload-low --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 100}}}'
> oc patch prioritylevelconfiguration global-default --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 20}}}'

It is always safe to apply the ^ on any cluster.

Steps for B: if the fix has not landed in the Z stream yet, or your cluster is not ready to upgrade to the Z stream with the fix, then you can use the following workaround. This will stabilize the cluster.

Workaround 'B':
> 'oc apply' the following P&F rule as a temporary workaround.

apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
kind: FlowSchema
metadata:
  name: exempt-service-accounts
spec:
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 10
  priorityLevelConfiguration:
    name: exempt
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - '*'
      verbs:
      - '*'
    resourceRules:
    - apiGroups:
      - '*'
      clusterScope: true
      namespaces:
      - '*'
      resources:
      - '*'
      verbs:
      - '*'
    subjects:
    - group:
        name: system:serviceaccounts
      kind: Group

IMPORTANT: once the cluster upgrades to the Z stream with the fix, you need to delete the ^ P&F object:
> oc delete flowschema exempt-service-accounts

Solution 'B' (make sure P&F is enabled):
- upgrade to the latest Z stream with the fix
- delete the P&F rule created in "Workaround B"
> oc delete flowschema exempt-service-accounts
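After applying the Solution 'A' patches or the Workaround 'B' FlowSchema, a quick sanity check is possible. This is only a sketch, assuming API Priority and Fairness is enabled on the cluster (so the flowcontrol.apiserver.k8s.io objects and the alpha flowcontrol metrics are available on this version):

> oc get flowschema exempt-service-accounts
> oc get prioritylevelconfiguration workload-low global-default -o yaml | grep assuredConcurrencyShares

In Prometheus, the rejection rate per flowSchema/priorityLevel should drop back towards zero once the patched/exempted priority levels match the noisy service-account traffic:

> sum(rate(apiserver_flowcontrol_rejected_requests_total[5m])) by (flowSchema,priorityLevel,reason)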
The initial customer cases are closed. Do not piggyback on existing BZs with new customer cases. Create new BZs and suggest that they look similar; we will decide what to do. Reusing BZs to escalate is not acceptable. Closing, as the original customer cases are closed.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days