Bug 1950379

Summary: oauth-server is in Pending/CrashLoopBackOff at the beginning of ~50% of CI runs
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: apiserver-auth
Assignee: Standa Laznicka <slaznick>
Status: CLOSED ERRATA
QA Contact: pmali
Severity: high
Priority: urgent
Version: 4.8
CC: aos-bugs, mfojtik, obulatov, sttts, surbania, xxia
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Doc Text:
Feature: The authentication operator now synchronizes only the relevant keys from the openshift-config-managed/router-certs secret.
Reason: The authentication operator would redeploy the oauth-openshift pods even though no applicable changes had occurred. Besides being generally undesirable, this could fail tests that watch for pods stuck in Pending.
Result: The authentication operator synchronizes only the relevant keys of the openshift-config-managed/router-certs secret.
Last Closed: 2021-07-27 23:01:29 UTC
Type: Bug
Attachments:
observed router-certs secret config changes

Description Clayton Coleman 2021-04-16 13:37:10 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1383013784422977536

[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]
Run #0: Failed (4m2s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:160]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-authentication/oauth-openshift-6d4747d7dc-bz89m was pending entire time: unknown error",
    ]

The test runs at an arbitrary time in the suite, but it appears that on average we roll out the oauth-apiserver 12-13 times

            "metadata": {
                "annotations": {
                    "deployment.kubernetes.io/revision": "13",
                },
                "generation": 13,
                "name": "oauth-openshift",
                "namespace": "openshift-authentication",
            },

during an e2e run. 
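(For reference, a minimal sketch of how the rollout count above could be read with client-go; this is hypothetical helper code, not part of the test suite, and the kubeconfig handling is simplified.)

// Hypothetical helper: read the rollout revision of the oauth-openshift
// deployment, i.e. the same "deployment.kubernetes.io/revision" annotation
// shown in the JSON excerpt above.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Simplified: load the local kubeconfig; an e2e helper would reuse the
    // suite's rest.Config instead.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    deploy, err := client.AppsV1().Deployments("openshift-authentication").
        Get(context.TODO(), "oauth-openshift", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    // The deployment controller bumps this annotation for every new
    // ReplicaSet, i.e. every rollout.
    fmt.Println("revision:", deploy.Annotations["deployment.kubernetes.io/revision"])
    fmt.Println("generation:", deploy.Generation)
}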

There are a number of different events fired that might indicate delays starting the pod, but given the number of rollouts none of them is consistent and it's unclear which is involved. For example, in this run:

At 2021-04-16 12:15:45 +0000 UTC - event for oauth-openshift-6bc97f5f68-s2xm2: {kubelet ip-10-0-153-66.us-west-1.compute.internal} Failed: Error: cannot find volume "v4-0-config-system-session" to mount into container "oauth-openshift"

and status from the pod

INFO[2021-04-16T12:23:55Z]       "lastTransitionTime": "2021-04-16T12:16:06Z",
INFO[2021-04-16T12:23:55Z]       "reason": "Unschedulable",
INFO[2021-04-16T12:23:55Z]       "message": "0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules."
INFO[2021-04-16T12:23:55Z]     }
INFO[2021-04-16T12:23:55Z]   ],

If the oauth-apiserver has hard anti-affinity, it must be a maxUnavailable=1 deployment (not maxSurge), but it *looks* like it already is maxUnavailable, so this may be irrelevant.
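(To make the interaction concrete: with hard, i.e. required, pod anti-affinity a surged pod has nowhere to schedule until an old pod frees up a node, so such a deployment effectively has to roll via maxUnavailable. The sketch below uses the upstream API types purely for illustration; it is not the actual oauth-openshift manifest, and the label and topology key are assumptions.)

// Sketch only (not the actual oauth-openshift manifest): a deployment with
// hard pod anti-affinity cannot surge, because the surged pod cannot land on
// a node that still runs an old replica, so the rollout has to free capacity
// via maxUnavailable instead.
package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

func exampleStrategy() (appsv1.DeploymentStrategy, corev1.Affinity) {
    maxUnavailable := intstr.FromInt(1)
    maxSurge := intstr.FromInt(0)

    strategy := appsv1.DeploymentStrategy{
        Type: appsv1.RollingUpdateDeploymentStrategyType,
        RollingUpdate: &appsv1.RollingUpdateDeployment{
            MaxUnavailable: &maxUnavailable, // take an old pod down first...
            MaxSurge:       &maxSurge,       // ...because we cannot surge past anti-affinity
        },
    }

    affinity := corev1.Affinity{
        PodAntiAffinity: &corev1.PodAntiAffinity{
            // "Hard" anti-affinity: the scheduler refuses to co-locate two
            // replicas on one node, which is the "didn't match pod
            // anti-affinity rules" message in the pod status above.
            RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
                LabelSelector: &metav1.LabelSelector{
                    MatchLabels: map[string]string{"app": "oauth-openshift"}, // assumed label
                },
                TopologyKey: "kubernetes.io/hostname",
            }},
        },
    }
    return strategy, affinity
}

func main() {
    strategy, _ := exampleStrategy()
    fmt.Println("rollout strategy:", strategy.Type)
}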

Comment 1 Standa Laznicka 2021-04-16 13:42:53 UTC
*** Bug 1949986 has been marked as a duplicate of this bug. ***

Comment 2 Standa Laznicka 2021-04-20 10:26:06 UTC
The number of revisions is caused by the tests that are being run. I looked at the events generated by the operator once the status stops reporting `OAuthServerDeployment_PreconditionNotFulfilled`, which is when the first deployments start to roll out.

There are a number of `ObserveRouterSecret` events fired which show that the openshift-config-managed/router-certs secret was updated with domains like "e2e-test-router-h2spec-*". I'm going to attach that to the BZ.

The observed behavior matches the tests found at https://github.com/openshift/origin/blob/7f6c3218d227329ae9dc30f22e5d300786e32a44/test/extended/router/h2spec.go#L40. While it is not necessarily disrupting the cluster, I wonder whether it should perhaps run in a serial suite given how it changes global cluster configuration.

With the number of revisions cleared up, I have yet to see why the oauth-server pods might crashloop.

Comment 3 Standa Laznicka 2021-04-20 10:27:24 UTC
Created attachment 1773727 [details]
observed router-certs secret config changes

Comment 4 Standa Laznicka 2021-04-21 07:55:17 UTC
Looking at the test result more carefully, I can see that the pod in question was actually started 30 seconds before the test ended. Since the networking tests have caused many rollouts, as shown in comment 2, it's very likely that this is just another of those rollouts. The pod is unschedulable because the pod it is supposed to replace is still trying to shut down gracefully.

1. We can modify the crashlooping/pending test to check on the pods that it marks as failing for another 4 minutes to make sure they end up being rolled out successfully, but I am not sure that's what the test is supposed to be checking.
2. Another option would of course be running the tests that cause these rollouts in a separate test suite which tolerates this behavior.

Clayton, please let me know which one you'd prefer. I can have a look and implement 1., or we should assign Routing to deal with 2. I am also open to other options.
---
For the record: the authentication operator goes degraded if a rollout of a new revision takes over 5 minutes, which does not match the 4 minutes the test assumes, but that alone should let us notice a pod that misbehaves for too long.

Comment 5 Clayton Coleman 2021-04-21 16:18:42 UTC
In comment 2:

> There are a number of `ObserveRouterSecret` events fired which show that the openshift-config-managed/router-certs secret was updated with domains like "e2e-test-router-h2spec-*". I'm going to attach that to the BZ.

Can we ignore these changes when they happen?  Shouldn't oauth only care about the default router?  Roughly: "is the rollout here necessary" (and if not, let's not roll out).

For 2 I think I'd prefer not to because this test is *supposed* to be catching that something is rolling out excessively.  "Pending the entire time" (if we're testing correctly) is abnormal (nothing should be pending for more than a few seconds).

I think you're implying via 1 that the pod isn't actually pending for 4 minutes total?  It's possible that we want to move the "pending excessively" check to a post-test condition that looks at all pods and fails if any pod was pending longer than X, rather than testing it here poorly.  I think, looking at the test, that I agree.  Let me look at the test in more detail.
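(A rough sketch of what such a post-test condition could look like, assuming client-go; the namespace prefix, the threshold "X", and the failure reporting are placeholders, not the origin test code.)

// Hypothetical post-suite check: fail if any pod in a core namespace has been
// Pending for longer than a threshold, instead of sampling a 4-minute window
// mid-suite. Names and threshold are illustrative only.
package main

import (
    "context"
    "fmt"
    "strings"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

const pendingThreshold = 2 * time.Minute // placeholder for "X"

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }

    var failures []string
    for _, pod := range pods.Items {
        // Rough stand-in for the existing test's notion of "core namespaces".
        if !strings.HasPrefix(pod.Namespace, "openshift-") {
            continue
        }
        if pod.Status.Phase == corev1.PodPending &&
            time.Since(pod.CreationTimestamp.Time) > pendingThreshold {
            failures = append(failures,
                fmt.Sprintf("pod %s/%s pending for over %s", pod.Namespace, pod.Name, pendingThreshold))
        }
    }
    if len(failures) > 0 {
        panic("pods pending too long:\n" + strings.Join(failures, "\n"))
    }
    fmt.Println("no pods pending past the threshold")
}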

Comment 6 Standa Laznicka 2021-04-22 09:24:03 UTC
> Can we ignore these changes when they happen?  Shouldn't oauth only care about the default router?  Roughly: "is the rollout here necessary" (and if not, let's not roll out).

I think I might be able to make us ignore the changes to other routers. Today we just sync the whole router-certs secret from the openshift-config-managed namespace, but I can see how I could make it so that we only synchronize the single key for the single domain we care about. I'll look into that.
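(A minimal sketch of that idea, not the operator's actual sync code: copy only the router-certs key for the single ingress domain the oauth route uses, so churn on unrelated e2e-test-router-h2spec-* domains never changes the synced secret.)

// Sketch of the idea only: instead of copying the whole
// openshift-config-managed/router-certs secret, keep only the key for the
// single domain the oauth route is served under, so updates to unrelated
// domains don't change the synced secret and don't trigger a rollout.
package main

import "fmt"

// relevantRouterCerts returns a copy of routerCerts containing only the entry
// for ingressDomain, if present. routerCerts maps domain -> PEM bundle.
func relevantRouterCerts(routerCerts map[string][]byte, ingressDomain string) map[string][]byte {
    filtered := map[string][]byte{}
    if pem, ok := routerCerts[ingressDomain]; ok {
        filtered[ingressDomain] = pem
    }
    return filtered
}

func main() {
    // Example data; the domains here are illustrative only.
    secretData := map[string][]byte{
        "apps.example-cluster.devcluster.openshift.com": []byte("<default wildcard cert>"),
        "e2e-test-router-h2spec-xyz.example.com":        []byte("<test-created cert>"),
    }
    out := relevantRouterCerts(secretData, "apps.example-cluster.devcluster.openshift.com")
    for domain := range out {
        fmt.Println("would sync key for:", domain)
    }
}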

> I think you're implying via 1 that the pod isn't actually pending for 4 minutes total?

Yes, that was my point.

Comment 7 Standa Laznicka 2021-05-11 07:32:08 UTC
*** Bug 1959149 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2021-07-27 23:01:29 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438