1826341 – Consistently increasing rate of traffic into the oauth server over time

Summary: Consistently increasing rate of traffic into the oauth server over time

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	apiserver-auth
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Standa Laznicka
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1826665

Reported:	2020-04-21 13:20 UTC by Jessica Forrester
Modified:	2020-07-13 17:29 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The authentication operator was not closing the connections to the oauth-server that it was opening in a loop. Consequence: The rate of traffic to the oauth-server grew consistently because connections were being opened faster than they were being dropped. Fix: Close the connections. Result: The authentication operator does not degrade the service of its own payload.
Clone Of:
Clones:	1826665 (view as bug list)
Environment:
Last Closed:	2020-07-13 17:29:42 UTC
Target Upstream Version:
Embargoed:

Attachments
network-rate-increase (594.68 KB, image/png) 2020-04-21 13:20 UTC, Jessica Forrester
auth-network-after-operator-deleted (546.88 KB, image/png) 2020-04-21 13:42 UTC, Jessica Forrester
4.4-oauth-12h-net-traffic (152.75 KB, image/png) 2020-04-22 07:10 UTC, Maru Newby

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-authentication-operator pull 279	0	None	closed	Bug 1826341: ingress controller does not closes its connection to healthz	2021-02-09 03:26:12 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:29:56 UTC

Description Jessica Forrester 2020-04-21 13:20:23 UTC

Created attachment 1680555 [details]
network-rate-increase

Looking at dashboards on a longer lived 4.4 cluster, I was able to see that something is causing a consistently increasing -rate- of traffic to the oauth server. At first glance it looked like the dashboard was wrong and was showing total traffic, but the dashboard is correct when used against other namespaces, so the rate of traffic really is increasing.

From talking to deads it may be challenging to determine what is causing this traffic because we are missing some instrumentation in the oauth server. So I am opening the bug against auth to start with.

The dashboard where you can see this is the OOTB "Kubernetes / Compute Resources / Namespace (Pods)" when targeted at the openshift-authentication namespace.

Attached is a sample of what I am seeing over a 2 day window (version of this cluster is 4.4.0-rc.8).

Have not yet checked 4.3 / 4.5 to see if they have the same behavior.

Comment 1 Jessica Forrester 2020-04-21 13:42:35 UTC

Created attachment 1680560 [details]
auth-network-after-operator-deleted

Comment 2 Jessica Forrester 2020-04-21 13:44:27 UTC

Just added a new attachment that shows what happens to the network against openshift-authentication after deleting the auth operator.
Ran:
oc delete -n openshift-authentication-operator pods --all

See attachment auth-network-after-operator-deleted

Drastic drop in the rate of traffic after that. Suspect there is a leaking healthcheck process here.

Comment 3 Jessica Forrester 2020-04-21 13:56:37 UTC

Checked a 4.3 long lived cluster and we do not see the same leak. This was likely introduced in 4.4.

Comment 4 Standa Laznicka 2020-04-21 14:03:25 UTC

I don't see the behavior on 4.5 (4.5.0-0.nightly-2020-04-18-184707) either

Comment 5 Jessica Forrester 2020-04-21 14:05:36 UTC

On a slightly older (longer lived) 4.4 cluster, this eventually leads to the authentication operator degrading.

Alerts Firing:
ClusterOperatorDegraded
Cluster operator authentication has been degraded for 10 mins. Operator is degraded because IngressStateEndpoints_UnhealthyAddresses and cluster upgrades will be unstable.

ClusterOperatorDown
Cluster operator authentication has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.

Comment 6 Ben Parees 2020-04-21 14:15:23 UTC

tentatively marking as a high severity/4.4 blocker based on this assessment from Jessica:


> we are aware it leads to the auth operator degrading after 1.5 weeks of cluster life
> note: it is recoverable by bouncing the auth operator pods.


I don't think we can just tell all our 4.4.0 customers to restart their auth pods every week until 4.4.1 comes out.

Comment 7 Maru Newby 2020-04-22 07:10:45 UTC

Created attachment 1680738 [details]
4.4-oauth-12h-net-traffic

The reported behavior reproduces immediately against a 4.4 nightly cluster.

Comment 8 Standa Laznicka 2020-04-22 08:50:13 UTC

Actually, I notice the same in 4.5

Comment 9 Standa Laznicka 2020-04-22 08:55:01 UTC

... I only had to zoom out enough because the graphing seems to differ in the latest version. The fix in the operator code is ready.

Comment 12 Mike Fiedler 2020-04-22 21:53:34 UTC

Verified on 4.5.0-0.ci-2020-04-22-162958 since 4.5 nightly builds are not currently stable.

The Grafana dashboard referenced in the description was broken on this cluster, so I verified with Prometheus queries instead:

sum(irate(container_network_transmit_packets_total{namespace=~"openshift-authentication"}[1m])) by (pod)
sum(irate(container_network_receive_packets_total{namespace=~"openshift-authentication"}[1m])) by (pod)

Both graphs showed stable (not growing) transmit/receive packet rates over ~3 hours

Comment 13 errata-xmlrpc 2020-07-13 17:29:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
