Bug 1826665

Summary: [4.4] Consistently increasing rate of traffic into the oauth server over time

Product: OpenShift Container Platform
Component: apiserver-auth
Version: 4.4
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Standa Laznicka <slaznick>
Assignee: Stefan Schimanski <sttts>
QA Contact: Mike Fiedler <mifiedle>
CC: aos-bugs, bparees, jforrest, mfojtik, mifiedle, mnewby, pmali, scheng, slaznick, xxia, zyu
Clone Of: 1826341
Bug Depends On: 1826341
Last Closed: 2020-05-04 11:50:00 UTC
Attachments:
  Issue still exists in 4.4.0-0.nightly-2020-04-23-083610
  oauth transmit rate on 4.4.0-0.nightly-2020-04-23-192521
  oauth receive rate on 4.4.0-0.nightly-2020-04-23-192521

Description Standa Laznicka 2020-04-22 09:27:27 UTC
+++ This bug was initially created as a clone of Bug #1826341 +++

Created attachment 1680555 [details]
network-rate-increase

Looking at dashboards on a longer-lived 4.4 cluster, I was able to see that something is causing a consistently increasing rate of traffic to the oauth server. At first glance it looked like the dashboard was wrong and was showing total traffic, but the dashboard is correct when used against other namespaces, so the rate of traffic really is increasing.

From talking to deads, it may be challenging to determine what is causing this traffic because we are missing some instrumentation in the oauth server, so I am opening the bug against auth to start with.

The dashboard where you can see this is the OOTB "Kubernetes / Compute Resources / Namespace (Pods)" when targeted at the openshift-authentication namespace.
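
For anyone who wants the raw numbers behind that dashboard, they can also be pulled straight from the cluster monitoring stack. The following is a minimal sketch rather than the dashboard's exact query: it assumes the standard thanos-querier route in openshift-monitoring and the cAdvisor container_network_* metrics the dashboard panels are built on.

# Approximate per-namespace receive rate for openshift-authentication (sketch)
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=sum(irate(container_network_receive_bytes_total{namespace="openshift-authentication"}[5m]))' \
  "https://$HOST/api/v1/query"

Re-running this (and the matching container_network_transmit_bytes_total query) a few hours apart should show a steadily growing value if the rate really is leaking.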

Attached is a sample of what I am seeing over a 2-day window (the version of this cluster is 4.4.0-rc.8).

Have not yet checked 4.3 / 4.5 to see if they have the same behavior.

--- Additional comment from Jessica Forrester on 2020-04-21 15:42:35 CEST ---



--- Additional comment from Jessica Forrester on 2020-04-21 15:44:27 CEST ---

Just added a new attachment that shows what happens to the network against openshift-authentication after deleting the auth operator.
Ran:
oc delete -n openshift-authentication-operator pods --all

See attachment auth-network-after-operator-deleted

Drastic drop in the rate of traffic after that. Suspect there is a leaking healthcheck process here.
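
One rough way to poke at the health-check theory is to watch socket counts on the oauth server pods over time. This is only a sketch; it assumes the oauth server pods carry the app=oauth-openshift label and that their image ships a shell:

# Count open TCP sockets inside one oauth server pod (illustrative)
POD=$(oc -n openshift-authentication get pods -l app=oauth-openshift -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-authentication exec "$POD" -- sh -c 'cat /proc/net/tcp /proc/net/tcp6 | wc -l'

A climbing count would point at connections being held open; a flat count while the traffic rate keeps rising would instead point at an ever-growing number of short-lived requests.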

--- Additional comment from Jessica Forrester on 2020-04-21 15:56:37 CEST ---

Checked a 4.3 long-lived cluster and we do not see the same leak. This was likely introduced in 4.4.

--- Additional comment from Standa Laznicka on 2020-04-21 16:03:25 CEST ---

I don't see the behavior on 4.5 (4.5.0-0.nightly-2020-04-18-184707) either

--- Additional comment from Jessica Forrester on 2020-04-21 16:05:36 CEST ---

On a slightly older (longer-lived) 4.4 cluster, this eventually leads to the authentication operator degrading.

Alerts Firing:
ClusterOperatorDegraded
Cluster operator authentication has been degraded for 10 mins. Operator is degraded because IngressStateEndpoints_UnhealthyAddresses and cluster upgrades will be unstable.

ClusterOperatorDown
Cluster operator authentication has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
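
The same degradation can be inspected without the alerting stack by reading the ClusterOperator conditions directly (plain oc commands, nothing specific to this bug):

# Summary view, then the full condition messages
oc get clusteroperator authentication
oc get clusteroperator authentication -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'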

--- Additional comment from Ben Parees on 2020-04-21 16:15:23 CEST ---

Tentatively marking as a high-severity / 4.4 blocker based on this assessment from Jessica:


> we are aware it leads to the auth operator degrading after 1.5 weeks of cluster life
> note: it is recoverable by bouncing the auth operator pods.


I don't think we can just tell all our 4.4.0 customers to restart their auth pods every week until 4.4.1 comes out.

--- Additional comment from Maru Newby on 2020-04-22 09:10:45 CEST ---

The reported behavior reproduces immediately against a 4.4 nightly cluster.

--- Additional comment from Standa Laznicka on 2020-04-22 10:50:13 CEST ---

Actually, I notice the same in 4.5

--- Additional comment from Standa Laznicka on 2020-04-22 10:55:01 CEST ---

... I only had to zoom out far enough, because the graphing seems to differ in the latest version. The fix in the operator code is ready.

Comment 5 Mike Fiedler 2020-04-23 17:02:20 UTC
Created attachment 1681200 [details]
Issue still exists in 4.4.0-0.nightly-2020-04-23-083610

Verified that the fix is not included in the latest 4.4 nightly:

https://openshift-release.svc.ci.openshift.org/releasestream/4.4.0-0.nightly/release/4.4.0-0.nightly-2020-04-23-083610

diff:  https://openshift-release.svc.ci.openshift.org/releasestream/4.4.0-0.nightly/release/4.4.0-0.nightly-2020-04-23-083610?from=4.4.0-0.nightly-2020-04-22-215658

Moving back to MODIFIED pending a new build with the fix.

Comment 6 Mike Fiedler 2020-04-23 19:43:31 UTC
The fix will be available in 4.4.0-0.nightly-2020-04-23-192521 when it completes

Comment 8 Mike Fiedler 2020-04-23 23:40:10 UTC
Created attachment 1681289 [details]
oauth transmit rate on 4.4.0-0.nightly-2020-04-23-192521

Comment 9 Mike Fiedler 2020-04-23 23:41:35 UTC
Created attachment 1681290 [details]
oauth receive rate on 4.4.0-0.nightly-2020-04-23-192521

Verified on 4.4.0-0.nightly-2020-04-23-192521

oauth pod transmit and receive rates are stable over time - not growing.

Comment 13 errata-xmlrpc 2020-05-04 11:50:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581