Bug 1883589
Summary: Cascading failure across all openshift-apiservers renders cluster unusable

Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.5
Hardware: x86_64
OS: Unspecified
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.7.0
Reporter: Pedro Amoedo <pamoedom>
Assignee: Abu Kashem <akashem>
QA Contact: Xingxing Xia <xxia>
Keywords: ServiceDeliveryBlocker
Flags: sreichar: needinfo-
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2020-12-11 11:30:54 UTC
Bug Depends On: 1885353, 1885356, 1886442
CC: agawand, aivaras.laimikis, akaris, akashem, alchan, aos-bugs, bleanhar, bmontgom, braander, bshirren, cblecker, cpassare, cruhm, dahernan, ehashman, erich, ffranz, gshereme, gwest, hgomes, inecas, jcromer, jeder, jhuddles, jkaur, jlucky, jminter, jnordell, jwang, kelly.brown1, lmohanty, lszaszki, mabajodu, mfojtik, mike, mleonard, mtleilia, mvardhan, mwoodson, nmalik, nstielau, ocasalsa, openshift-bugs-escalate, palonsor, palshure, pbergene, ppostler, pstrick, rcarrier, rdiscala, rsandu, sburke, sdodson, sople, sreber, sreichar, sttts, tahonen, tmanor, travi, wking, xxia, yjoseph
Description (Pedro Amoedo, 2020-09-29 16:03:22 UTC)
After looking at the metrics I saw some evidence of P&F throttling requests; I am not entirely sure this is the only symptom. The customer has disabled P&F on their cluster just to see (as a test) if this resolves the issue. The next step is to take a must-gather (with audit logs) and do further investigation.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1

> Who is impacted?
> If we have to block upgrade edges based on this issue, which edges would need blocking?

- A certain type of workload with a certain level of load (requests in flight) seems to trigger this issue. We are currently investigating to find the root cause.
- If we were to block an upgrade, we would block upgrades to 4.5.*. Upgrading to 4.4.* is okay, since this apiserver feature is enabled in 4.5 and onward.

> What is the impact? Is it serious enough to warrant blocking edges?

The cluster is pretty much unusable, so it is very serious. Based on the conversation with the customer, the cluster is unusable until the customer restarts the master nodes. After a restart, the cluster seems to get into a bad state again after some time.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

There are two approaches:

- Disable the priority and fairness feature of the apiserver using the following command:

  > oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"feature-gates":["APIPriorityAndFairness=false"]}}}}'

  This will trigger a new kube-apiserver revision. Since the kube-apiserver operator might be borked as well, this change will likely not take effect; in that case, you might need to manipulate the static kube-apiserver resources on the host.

- The following 'oc patch' should work seamlessly:

  > oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'

  But this is yet to be tested on a real cluster with this issue. If it works, it will not require any apiserver revision or node restart. I am working with the ARO team to test this solution.

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

"Priority and Fairness" has been enabled since 4.5.
Updating to 4.6 does not increase exposure/risk.

This has impacted an Azure Red Hat OpenShift customer who upgraded their cluster to 4.5, resulting in a ~10 hour period where they were unable to connect to their cluster to run commands, as `oc login` was failing. I believe that it potentially affects *all* 4.5 clusters with non-trivial control plane load.

Summary
-------

Symptoms to detect the issue, in the order they appeared:

- oc is slow and/or commands are failing on 4.5.x
- "Throttling request" log entries for the kubernetes-apiserver-operator and openshift-apiserver-operator
- Many 429 response codes in the Kubernetes audit logs
- Many "Failed to make webhook authorizer request ... context cancelled" log entries on the kubernetes apiservers
- "\"<COMPONENT>.openshift.io.v1\" is not ready: 503 (the server is currently unable to handle the request)" errors on the openshift-apiserver-operator, but openshift-apiserver pods are running, NOT crashlooping

The fix detailed in the comment above mitigated the issue.

Detailed timeline
-----------------

Customer upgraded the cluster from 4.4.14 to 4.5.11 around 2020-09-24 12:00 UTC. Starting at 12:17, after the upgrade began, we see many instances of "Throttling request" errors from client-go on various components, in particular on the Kubernetes and OpenShift apiserver-operators, in the 1-2s range. Prior to the upgrade, we did see other instances of the "Throttling request" error on various components, but none from the API server components, and the time ranges were significantly smaller (ms vs. seconds range).

From the cluster audit logs, we see that prior to the upgrade, no requests in the past 5 days received a response code of 429. But after the upgrade, beginning at 2020-09-26 03:09 UTC, we begin logging nearly 135k errors with code 429 per hour.

Customer filed a support ticket around 2020-09-28 05:00 UTC reporting that the control plane (oc commands, web console) was slow and that `oc adm top node` commands were failing. Customer responded by rebooting master-1 around 2020-09-29 11:00 UTC, but this did not improve the slowness.

We begin to see "E0929 ... Failed to make webhook authorizer request: Post https://<kubernetes load balancer IP>:443/apis/authorization.k8s.io/v1/subjectaccessreviews: context canceled" errors from the KASes around 2020-09-29 11:06 UTC. At 2020-09-29 11:11 UTC, we begin seeing errors in the openshift-apiserver-operator logs stating e.g. "\"build.openshift.io.v1\" is not ready: 503 (the server is currently unable to handle the request)" for various OpenShift API components (build, image, oauth, etc.). Customer reported that `oc login` was failing at 2020-09-30 13:04 UTC.

We narrowed the issue down to the OpenShift API servers and were finally able to identify the issue as matching this BZ. It is notable that the OpenShift API Server pods appeared to be healthy throughout the incident and liveness probes were *not* failing, making it difficult to detect the issue.

Per Abu's guidance, we were able to deploy the above fix to disable P&F on the KASes. However, we needed to manually edit the static pod manifests for the apiservers, as the cluster was so unhealthy that the install containers were failing to provision. The customer confirmed that this workaround mitigated the issue.

One addition to my previous comment: I see that the openshift-apiserver operator was reporting a degraded/unavailable status for about 35 hours. It was Degraded and Available=False starting from around 2020-09-29 11:00 UTC, approximately when the 503 errors started showing up in the openshift-apiserver-operator logs, through 2020-09-30 22:00 UTC when the fix was applied. (Note the dashboards I'm looking at have limited granularity, so the timestamps are approximate.)
The underlying pods were "healthy", so we did not think to look at the operator status at the time; we were too heads-down in logs to look at dashboards. Prior to 2020-09-29 11:00 UTC, all cluster operators were reporting healthy, despite the significant cluster slowness/error rates we saw above.

Going to assume that to undo the flowschema patch we'd put it back to the prior state. Therefore:

> oc patch flowschema service-accounts --type merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'

We can adjust this as we get more info, but this is my placeholder for OSD.

There is a KCS in progress here [1], but it pretty much repeats the suggested actions here in the BZ:

[1] https://access.redhat.com/solutions/5448851

Created attachment 1719472 [details]
apf configuration for 4.5
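As a rough sketch of how the 429 spike described in the timeline above could be quantified: the audit logs collected via `oc adm must-gather -- /usr/bin/gather_audit_logs` are JSON lines following the Kubernetes audit event schema, so throttled requests can be bucketed per hour. The sample event below is synthetic (illustrative only, not taken from the affected cluster); on a real must-gather you would feed the collected audit log files into the same pipeline.

```shell
# Hypothetical sketch: count HTTP 429 responses per hour in a Kubernetes
# audit log. "sample" stands in for the contents of an audit .log file.
sample='{"requestReceivedTimestamp":"2020-09-26T03:09:12.000000Z","responseStatus":{"code":429},"verb":"get"}'
printf '%s\n' "$sample" | python3 -c '
import json, sys
from collections import Counter

hours = Counter()
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    ev = json.loads(line)
    if ev.get("responseStatus", {}).get("code") == 429:
        # bucket by the hour portion of the RFC 3339 timestamp
        hours[ev["requestReceivedTimestamp"][:13]] += 1
for hour, n in sorted(hours.items()):
    print(hour, n)
'
```

On the affected cluster, a per-hour count like this is what surfaced the jump from zero 429s to roughly 135k per hour after the upgrade.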
Hi sople,

Can you please gather data from the cluster so that we can find the root cause?

Data Capture:

A: Run must-gather to gather whatever must-gather collects by default.
> oc adm must-gather

B: Run must-gather to capture audit logs.
> oc adm must-gather -- /usr/bin/gather_audit_logs

C: Capture cluster prometheus metrics.

Once the capture is complete (A, B and C), please share the data with us.

Resolution:

Try the following:
- Download the attachment https://bugzilla.redhat.com/attachment.cgi?id=1719472 into a file 'apf-configuration-45.yaml' and run the following command:
> oc apply -f apf-configuration-45.yaml

If the above does not resolve the issue, then try the following:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'

For C: if you don't already have a way to collect a prometheus snapshot, you can run the following script:

  #!/usr/bin/env bash

  # Queue a command to run in the background, writing its output to TARGET,
  # while limiting the number of concurrent background jobs to 45.
  function queue() {
    local TARGET="${1}"
    shift
    local LIVE
    LIVE="$(jobs | wc -l)"
    while [[ "${LIVE}" -ge 45 ]]; do
      sleep 1
      LIVE="$(jobs | wc -l)"
    done
    echo "${@}"
    if [[ -n "${FILTER:-}" ]]; then
      "${@}" | "${FILTER}" >"${TARGET}" &
    else
      "${@}" >"${TARGET}" &
    fi
  }

  ARTIFACT_DIR=$PWD
  mkdir -p "${ARTIFACT_DIR}/metrics"

  echo "Snapshotting prometheus (may take 15s) ..."
  queue "${ARTIFACT_DIR}/metrics/prometheus.tar.gz" oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- tar cvzf - -C /prometheus .
  FILTER=gzip queue "${ARTIFACT_DIR}/metrics/prometheus-target-metadata.json.gz" oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- /bin/bash -c "curl -G http://localhost:9090/api/v1/targets/metadata --data-urlencode 'match_target={instance!=\"\"}'"

  # wait for the backgrounded captures to finish before the script exits
  wait

Related BZs:
- https://bugzilla.redhat.com/show_bug.cgi?id=1885353
- https://bugzilla.redhat.com/show_bug.cgi?id=1885356
- https://bugzilla.redhat.com/show_bug.cgi?id=1885358

Created attachment 1719786 [details]
apf-configuration-with-oauth-server-45.yaml
If a customer is running into this issue, please follow the steps below:
- Capture data from the cluster, if possible.
- Apply the patch.

Data Capture:

A: Run must-gather to gather whatever must-gather collects by default.
> oc adm must-gather

B: Run must-gather to capture audit logs.
> oc adm must-gather -- /usr/bin/gather_audit_logs

C: Capture cluster prometheus metrics.

Once the capture is complete (A, B and C), please share the data with us.

Apply the patch:

- Download the attachment https://bugzilla.redhat.com/attachment.cgi?id=1719786 into a file 'apf-configuration.yaml' and run the following command:
> oc apply -f apf-configuration.yaml

If the above does not resolve the issue, then try the following:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'

Hi agawand,

That's fine. Please follow the steps outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c31 to apply the patch, and let us know if the issue resolves. Also, please advise the customer to try the data collection if the issue is resolved after applying the patch. Thanks!

Hi mleonard,

For the KCS you are working on: if the customer has already taken steps to resolve the issue, they need to undo those changes before applying https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c31

Scenario 1: The customer has disabled the P&F feature. In this case the customer needs to reset it and re-enable the P&F feature.
Scenario 2: The customer has applied the following patch:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'

In this case the customer can undo it by applying the following patch:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'

Hi Abu,

I published the aforementioned KCS and made a few changes to it: https://access.redhat.com/solutions/5448851. Would you mind going over it quickly and helping out by verifying the provided workaround steps? I'd also like to provide further diagnostic steps, e.g. which logs to check and which log messages we would see when running into this issue. Thanks! Also, we might want to specify rollback steps for the different workarounds 1, 2, 3.

I have a question on the prometheus metrics grab code: how long is it supposed to take? I'm on a cluster with no significant workload and it has been running for 15m. Is that to be expected?

It can take a while if the prometheus db is large.

Hi sople,
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'
Even the above "oc patch" did not resolve the customer issue?
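When applying the flowschema patch, a quick sanity check can help rule out a malformed payload or a patch that silently did not stick. A minimal sketch, assuming only python3 locally; the `oc` commands in the comments require a live cluster and are the same ones recommended in this BZ:

```shell
# Hypothetical sketch: validate the merge-patch JSON locally before sending
# it to the cluster with `oc patch`, then confirm the field on the cluster.
PATCH='{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'
python3 -c '
import json, sys

doc = json.loads(sys.argv[1])  # raises if the JSON is malformed
print(doc["spec"]["priorityLevelConfiguration"]["name"])
' "$PATCH"
# On a live cluster, apply and then confirm the patch took effect:
#   oc patch flowschema service-accounts --type=merge -p "$PATCH"
#   oc get flowschema service-accounts -o jsonpath='{.spec.priorityLevelConfiguration.name}'
```

If the `oc get` output still shows the previous priority level name, the patch did not take effect and the static-pod/manual remediation path described earlier may be needed.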
*** Bug 1880463 has been marked as a duplicate of this bug. ***

(In reply to Abu Kashem from comment #41)
> Hi sople,
> > oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"global-default"}}}'
> Even the above "oc patch" did not resolve the customer issue?

After patching, it is not working for them. Do we need to reboot the master nodes after patching? Thanks

Hi gwest,

> I tested an upgrade of 4.4.5 -> 4.4.11
> 1. Without the patch mentioned, I saw the issue.
> 2. With the patch the upgrade was clean. (Validated Qty 2 over the weekend.)

I have a couple of questions:
- Which patch are you referring to, the oc apply or the oc patch in https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c31?
- This feature has been enabled since 4.5, so an upgrade of "4.4.5 -> 4.4.11" failing is not related to P&F. I am not clear how you were able to apply the patch in 4.4.

Hello sople, akashem. Rebooting the cluster according to the official docs[1] made the cluster responsive again and it appears to be working correctly, with the exception of the cluster operator `image-registry`, which is Available but in a Degraded state. This was done after patching flowschema/service-accounts, and the modification is still present. However, please consider that when the problem occurred for the first time in early September, rebooting the cluster appeared to work, with no patch applied, until it became unresponsive again nearly 4 weeks later.

1. https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-restart.html

Hi gwest,

> In 4.4.5, the patch did appear to apply, and did seem to make the upgrade to 4.4.11 work.
> From 4.4.11, it seemed to cause failure of further upgrades.
>
> oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"feature-gates":["APIPriorityAndFairness=false"]}}}}'

The above patch disables the APIPriorityAndFairness feature. In 4.4 this feature is not enabled, so the oc patch command was ineffective. If you check the revision before and after the patch, I expect it to be at the same revision.

Hi rdiscala,

Is it the same cluster that sople and I have been looking at? If so, my hypothesis is that this cluster is experiencing a different issue than what this BZ relates to. I am waiting for the logs to do further investigation.

Created attachment 1721522 [details]
apf-configuration-with-oauth-server-with-exempt-45.yaml
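The revision check suggested in the reply above (confirming that the feature-gate patch on a 4.4 cluster was a no-op) can be sketched as follows. The revision numbers here are example values; on a live cluster they would come from the `oc get` command shown in the comments:

```shell
# Hypothetical sketch: decide whether an `oc patch kubeapiserver ...` change
# rolled out a new kube-apiserver static-pod revision. On a live cluster:
#   oc get kubeapiserver cluster -o jsonpath='{.status.latestAvailableRevision}'
before=12   # example revision observed before the patch
after=12    # example revision observed after the patch
if [ "$before" -eq "$after" ]; then
  echo "no new revision: the patch was a no-op (expected on 4.4, where P&F is not enabled)"
else
  echo "new revision ${after}: the patch rolled out"
fi
```

An unchanged revision is consistent with the explanation above: on 4.4 the feature gate does not exist, so the override changes nothing in the rendered static-pod configuration.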
(We have made some improvements to the P&F rules and so would like the customer to try the following.)

If a customer is running into this issue, please follow the steps below:
- Capture data from the cluster, if possible.
- Tweak the P&F rules.

Data Capture:

A: Run must-gather to gather whatever must-gather collects by default.
> oc adm must-gather

B: Run must-gather to capture audit logs.
> oc adm must-gather -- /usr/bin/gather_audit_logs

C: Capture cluster prometheus metrics.

Once the capture is complete (A, B and C), please share the data with us.

Tweak the P&F rules:

- If the "Priority & Fairness" feature is disabled, then you must enable it before the rules can be applied.
- The new attachment yaml can be safely applied over the old attachment (https://bugzilla.redhat.com/attachment.cgi?id=1719786), so if you have already applied the old attachment you don't need to "oc delete" it.

Solution A:
- Download the new attachment https://bugzilla.redhat.com/attachment.cgi?id=1721522 into a file 'apf-configuration.yaml' and run the following command:
> oc apply -f apf-configuration.yaml

Solution B: If the above does not resolve the issue, then try the following 'oc patch' commands:
> oc patch flowschema service-accounts --type=merge -p '{"spec":{"priorityLevelConfiguration":{"name":"workload-low"}}}'
> oc patch prioritylevelconfiguration workload-low --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 100}}}'
> oc patch prioritylevelconfiguration global-default --type=merge -p '{"spec":{"limited":{"assuredConcurrencyShares": 20}}}'

Neither Solution A nor Solution B will block an upgrade from 4.5.x to 4.6.x.

@akashem I will plan the test with the client next week and let you know. Thanks.
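For context on what the assuredConcurrencyShares patches in Solution B change: in the upstream API Priority and Fairness design, each priority level is assured a slice of the server's total concurrency proportional to its shares. A rough sketch of that arithmetic; the server concurrency limit and the third priority level below are example values, not taken from the affected cluster:

```shell
python3 - <<'EOF'
# Hypothetical sketch of the P&F assured-concurrency formula from the
# upstream design:
#   ACV(level) = ceil(SCL * ACS(level) / sum of ACS over all levels)
# where SCL is the server's total concurrency limit and ACS is
# assuredConcurrencyShares. All numbers are illustrative.
import math

server_concurrency_limit = 600  # e.g. max-requests-inflight + max-mutating-requests-inflight
shares = {
    "workload-low": 100,   # value set by the Solution B patch
    "global-default": 20,  # value set by the Solution B patch
    "workload-high": 40,   # example third level
}
total = sum(shares.values())
for level, acs in sorted(shares.items()):
    acv = math.ceil(server_concurrency_limit * acs / total)
    print(f"{level}: {acv}")
EOF
```

Raising the shares of workload-low (where service-account traffic lands after the flowschema patch) grows its assured slice at the expense of the other levels, which is why the patches above relieve the throttling without disabling the feature.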
Hi rdiscala, sople,

I looked at the openshift-apiserver logs.

Instance apiserver-fd95d8b8b-m9q58:
> First entry timestamp: 2020-10-11T15:05:41.965736348Z
> Last entry timestamp: 2020-10-12T13:39:28.058463563Z
> Time span: ~22 hours
> Total lines: 67373
> grep 'http: TLS handshake error from' | wc -l
> 65909

This is the actual error:
> 2020-10-12T13:38:43.252655666Z I1012 13:38:43.252549 1 log.go:172] http: TLS handshake error from 10.130.0.1:36798: EOF

Instance apiserver-fd95d8b8b-wvlgq:
> First entry timestamp: 2020-10-12T01:34:32.525040392Z
> Last entry timestamp: 2020-10-12T13:40:38.602400659Z
> Total lines: 174453
> grep 'http: TLS handshake error from' | wc -l
> 174360

So two log files from the openshift apiserver are completely filled with the above error.

Instance apiserver-fd95d8b8b-qbw2h has a lot of the following errors:
> 2020-10-07T12:23:25.313630894Z E1007 12:23:25.313597 1 errors.go:77] context canceled
> 2020-10-07T12:23:25.314132997Z E1007 12:23:25.314093 1 webhook.go:199] Failed to make webhook authorizer request: context canceled

I grepped for '10.130.0.1' across all log files:

kube-apiserver:
> loading OpenAPI spec for "v1.image.openshift.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'read tcp 10.130.0.1:43092->10.129.0.11:8443: read: connection timed out', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]

package server (component of OLM):
> namespaces/openshift-operator-lifecycle-manager/pods/packageserver-54868d888b-hbqdk/packageserver/packageserver/logs/current.log:156:2020-10-11T08:25:11.776640335Z I1011 08:25:11.776546 1 log.go:172] http: TLS handshake error from 10.130.0.1:55780: EOF

openshift-monitoring:
> namespaces/openshift-monitoring/pods/prometheus-adapter-5497f8b5bb-9ztgr/prometheus-adapter/prometheus-adapter/logs/current.log:467:2020-10-10T05:24:39.992392312Z I1010 05:24:39.992358 1 log.go:172] http2: panic serving 10.130.0.1:38754: killing connection/stream because serving request timed out and response had been started

I have opened a thread to learn more about '10.130.0.1' (it is used for routing purposes, if I am not mistaken): https://coreos.slack.com/archives/CDCP2LA9L/p1603141588175200

Hi pamoedom,

The solution (https://access.redhat.com/solutions/5448851) needs an update. The recommended steps (with the improved P&F rules) have been captured in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1883589#c56. Did the customer apply the new rules from comment 56?

Hi pamoedom,

Now that the cluster in question is stable, can you close this BZ? If the issue(s) reappear, feel free to reopen it. Thanks!

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.