Bug 1950379

Summary: oauth-server is in Pending/CrashLoopBackOff at the beginning of ~50% of CI runs
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: apiserver-auth
Assignee: Standa Laznicka <slaznick>
Status: CLOSED ERRATA
QA Contact: pmali
Severity: high
Priority: urgent
Version: 4.8
CC: aos-bugs, mfojtik, obulatov, sttts, surbania, xxia
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Doc Text:
Feature: The authentication operator now synchronizes only the relevant keys from the openshift-config-managed/router-certs secret.
Reason: The authentication operator would redeploy the oauth-openshift pods even though no applicable changes had occurred. Besides being generally undesirable, this could fail tests that watch for pods stuck in Pending.
Result: The authentication operator synchronizes only the relevant keys of the openshift-config-managed/router-certs secret.
Last Closed: 2021-07-27 23:01:29 UTC
Type: Bug
Attachments:
observed router-certs secret config changes

Description Clayton Coleman 2021-04-16 13:37:10 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1383013784422977536

[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]
Run #0: Failed (4m2s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:160]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-authentication/oauth-openshift-6d4747d7dc-bz89m was pending entire time: unknown error",
    ]

The test runs at an arbitrary time in the suite, but it appears that on average we roll out the oauth-apiserver 12-13 times

            "metadata": {
                "annotations": {
                    "deployment.kubernetes.io/revision": "13",
                },
                "generation": 13,
                "name": "oauth-openshift",
                "namespace": "openshift-authentication",
            },

during an e2e run. 
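(For reference, a minimal sketch of how the rollout count above could be read with client-go; this is hypothetical helper code, not part of the test suite, and the kubeconfig handling is simplified.)

// Hypothetical helper: read the rollout revision of the oauth-openshift
// deployment, i.e. the same "deployment.kubernetes.io/revision" annotation
// shown in the JSON excerpt above.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Simplified: load the local kubeconfig; an e2e helper would reuse the
    // suite's rest.Config instead.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    deploy, err := client.AppsV1().Deployments("openshift-authentication").
        Get(context.TODO(), "oauth-openshift", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    // The deployment controller bumps this annotation for every new
    // ReplicaSet, i.e. every rollout.
    fmt.Println("revision:", deploy.Annotations["deployment.kubernetes.io/revision"])
    fmt.Println("generation:", deploy.Generation)
}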

There are a number of different events fired that might indicate delays starting the pod, but given the number of rollouts none of them is consistent and it's unclear which is involved. For example, in this run:

At 2021-04-16 12:15:45 +0000 UTC - event for oauth-openshift-6bc97f5f68-s2xm2: {kubelet ip-10-0-153-66.us-west-1.compute.internal} Failed: Error: cannot find volume "v4-0-config-system-session" to mount into container "oauth-openshift"

and status from the pod

INFO[2021-04-16T12:23:55Z]       "lastTransitionTime": "2021-04-16T12:16:06Z",
INFO[2021-04-16T12:23:55Z]       "reason": "Unschedulable",
INFO[2021-04-16T12:23:55Z]       "message": "0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules."
INFO[2021-04-16T12:23:55Z]     }
INFO[2021-04-16T12:23:55Z]   ],

If the oauth-apiserver has hard anti-affinity, it must be a maxUnavailable=1 deployment (not maxSurge), but it *looks* like it already is maxUnavailable, so this may be irrelevant.
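(To make the interaction concrete: with hard, i.e. required, pod anti-affinity a surged pod has nowhere to schedule until an old pod frees up a node, so such a deployment effectively has to roll via maxUnavailable. The sketch below uses the upstream API types purely for illustration; it is not the actual oauth-openshift manifest, and the label and topology key are assumptions.)

// Sketch only (not the actual oauth-openshift manifest): a deployment with
// hard pod anti-affinity cannot surge, because the surged pod cannot land on
// a node that still runs an old replica, so the rollout has to free capacity
// via maxUnavailable instead.
package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

func exampleStrategy() (appsv1.DeploymentStrategy, corev1.Affinity) {
    maxUnavailable := intstr.FromInt(1)
    maxSurge := intstr.FromInt(0)

    strategy := appsv1.DeploymentStrategy{
        Type: appsv1.RollingUpdateDeploymentStrategyType,
        RollingUpdate: &appsv1.RollingUpdateDeployment{
            MaxUnavailable: &maxUnavailable, // take an old pod down first...
            MaxSurge:       &maxSurge,       // ...because we cannot surge past anti-affinity
        },
    }

    affinity := corev1.Affinity{
        PodAntiAffinity: &corev1.PodAntiAffinity{
            // "Hard" anti-affinity: the scheduler refuses to co-locate two
            // replicas on one node, which is the "didn't match pod
            // anti-affinity rules" message in the pod status above.
            RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
                LabelSelector: &metav1.LabelSelector{
                    MatchLabels: map[string]string{"app": "oauth-openshift"}, // assumed label
                },
                TopologyKey: "kubernetes.io/hostname",
            }},
        },
    }
    return strategy, affinity
}

func main() {
    strategy, _ := exampleStrategy()
    fmt.Println("rollout strategy:", strategy.Type)
}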

Comment 1 Standa Laznicka 2021-04-16 13:42:53 UTC
*** Bug 1949986 has been marked as a duplicate of this bug. ***

Comment 2 Standa Laznicka 2021-04-20 10:26:06 UTC
The number of revisions is caused by the tests that are being run. I looked at the events generated by the operator once the status stops reporting `OAuthServerDeployment_PreconditionNotFulfilled`, which is when the first deployments start to roll out.

There are a number of `ObserveRouterSecret` events fired which show that the openshift-config-managed/router-certs secret was updated with domains like "e2e-test-router-h2spec-*". I'm going to attach that to the BZ.

The observed behavior matches the tests found at https://github.com/openshift/origin/blob/7f6c3218d227329ae9dc30f22e5d300786e32a44/test/extended/router/h2spec.go#L40. While it is not necessarily disrupting the cluster, I wonder whether it should perhaps run in a serial suite given how it changes global cluster configuration.

With the number of revisions cleared up, I have yet to see why the oauth-server pods might crashloop.

Comment 3 Standa Laznicka 2021-04-20 10:27:24 UTC
Created attachment 1773727 [details]
observed router-certs secret config changes

Comment 4 Standa Laznicka 2021-04-21 07:55:17 UTC
Looking at the test result more carefully, I can see that the pod in question was actually started 30 seconds before the test ended. Since the networking tests have caused many rollouts, as shown in comment 2, it's very likely that this is just another of those rollouts. The pod is unschedulable because the pod it is supposed to replace is still trying to shut down gracefully.

1. We can modify the crashlooping/pending test to check on the pods that it marks as failing for another 4 minutes to make sure they end up being rolled out successfully, but I am not sure that's what the test is supposed to be checking.
2. Another option would of course be running the tests that cause these rollouts in a separate test suite which tolerates this behavior.

Clayton, please let me know which one you'd prefer. I can have a look and implement 1., or we should assign Routing to deal with 2. I am also open to other options.
---
For the record: the authentication operator goes degraded if a rollout of a new revision takes over 5 minutes, which does not match the 4 minutes the test assumes, but that alone should let us notice a pod that misbehaves for too long.

Comment 5 Clayton Coleman 2021-04-21 16:18:42 UTC
In comment 2:

> There are a number of `ObserveRouterSecret` events fired which show that the openshift-config-managed/router-certs secret was updated with domains like "e2e-test-router-h2spec-*". I'm going to attach that to the BZ.

Can we ignore these changes when they happen?  Shouldn't oauth only care about the default router?  Roughly: "is the rollout here necessary" (and if not, let's not roll out).

For 2 I think I'd prefer not to because this test is *supposed* to be catching that something is rolling out excessively.  "Pending the entire time" (if we're testing correctly) is abnormal (nothing should be pending for more than a few seconds).

I think you're implying via 1 that the pod isn't actually pending for 4 minutes total?  It's possible that we want to move the "pending excessively" check to a post-test condition that looks at all pods and fails if any pod was pending longer than X, rather than testing it here poorly.  I think, looking at the test, that I agree.  Let me look at the test in more detail.
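(A rough sketch of what such a post-test condition could look like, assuming client-go; the namespace prefix, the threshold "X", and the failure reporting are placeholders, not the origin test code.)

// Hypothetical post-suite check: fail if any pod in a core namespace has been
// Pending for longer than a threshold, instead of sampling a 4-minute window
// mid-suite. Names and threshold are illustrative only.
package main

import (
    "context"
    "fmt"
    "strings"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

const pendingThreshold = 2 * time.Minute // placeholder for "X"

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }

    var failures []string
    for _, pod := range pods.Items {
        // Rough stand-in for the existing test's notion of "core namespaces".
        if !strings.HasPrefix(pod.Namespace, "openshift-") {
            continue
        }
        if pod.Status.Phase == corev1.PodPending &&
            time.Since(pod.CreationTimestamp.Time) > pendingThreshold {
            failures = append(failures,
                fmt.Sprintf("pod %s/%s pending for over %s", pod.Namespace, pod.Name, pendingThreshold))
        }
    }
    if len(failures) > 0 {
        panic("pods pending too long:\n" + strings.Join(failures, "\n"))
    }
    fmt.Println("no pods pending past the threshold")
}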

Comment 6 Standa Laznicka 2021-04-22 09:24:03 UTC
> Can we ignore these changes when they happen?  Shouldn't oauth only care about the default router?  Roughly: "is the rollout here necessary" (and if not, let's not roll out).

I think I might be able to make us ignore the changes to other routers. Today we just sync the whole router-certs secret from the openshift-config-managed namespace, but I can see how I could make it so that we only synchronize the single key for the single domain we care about. I'll look into that.
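(A minimal sketch of that idea, not the operator's actual sync code: copy only the router-certs key for the single ingress domain the oauth route uses, so churn on unrelated e2e-test-router-h2spec-* domains never changes the synced secret.)

// Sketch of the idea only: instead of copying the whole
// openshift-config-managed/router-certs secret, keep only the key for the
// single domain the oauth route is served under, so updates to unrelated
// domains don't change the synced secret and don't trigger a rollout.
package main

import "fmt"

// relevantRouterCerts returns a copy of routerCerts containing only the entry
// for ingressDomain, if present. routerCerts maps domain -> PEM bundle.
func relevantRouterCerts(routerCerts map[string][]byte, ingressDomain string) map[string][]byte {
    filtered := map[string][]byte{}
    if pem, ok := routerCerts[ingressDomain]; ok {
        filtered[ingressDomain] = pem
    }
    return filtered
}

func main() {
    // Example data; the domains here are illustrative only.
    secretData := map[string][]byte{
        "apps.example-cluster.devcluster.openshift.com": []byte("<default wildcard cert>"),
        "e2e-test-router-h2spec-xyz.example.com":        []byte("<test-created cert>"),
    }
    out := relevantRouterCerts(secretData, "apps.example-cluster.devcluster.openshift.com")
    for domain := range out {
        fmt.Println("would sync key for:", domain)
    }
}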

> I think you're implying via 1 that the pod isn't actually pending for 4 minutes total?

Yes, that was my point.

Comment 7 Standa Laznicka 2021-05-11 07:32:08 UTC
*** Bug 1959149 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2021-07-27 23:01:29 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438