https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1383013784422977536 :

[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]

Run #0: Failed (4m2s)

    fail [github.com/openshift/origin/test/extended/operators/cluster.go:160]: Expected
        <[]string | len:1, cap:1>: [
            "Pod openshift-authentication/oauth-openshift-6d4747d7dc-bz89m was pending entire time: unknown error",
        ]

The test runs at an arbitrary time in the suite, but it appears that on average we roll out the oauth server (the oauth-openshift deployment) 12-13 times during an e2e run:

    "metadata": {
        "annotations": {
            "deployment.kubernetes.io/revision": "13",
        },
        "generation": 13,
        "name": "oauth-openshift",
        "namespace": "openshift-authentication",
    },

A number of different events are fired that might indicate delays starting the pod, but given the number of rollouts none of them is consistent and it's unclear which is involved. E.g., in this run:

    At 2021-04-16 12:15:45 +0000 UTC - event for oauth-openshift-6bc97f5f68-s2xm2: {kubelet ip-10-0-153-66.us-west-1.compute.internal} Failed: Error: cannot find volume "v4-0-config-system-session" to mount into container "oauth-openshift"

and this status from the pod:

    INFO[2021-04-16T12:23:55Z]         "lastTransitionTime": "2021-04-16T12:16:06Z",
    INFO[2021-04-16T12:23:55Z]         "reason": "Unschedulable",
    INFO[2021-04-16T12:23:55Z]         "message": "0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules."
    INFO[2021-04-16T12:23:55Z]     }
    INFO[2021-04-16T12:23:55Z] ],

If the oauth server has hard anti-affinity, it must be a maxUnavailable=1 deployment (not maxSurge), but it *looks* like it is maxUnavailable already, so this may be irrelevant.
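For reference, this is roughly what the strategy has to look like when hard anti-affinity pins one replica per node (a hypothetical Go sketch of the strategy stanza, not the actual operator code):

    package sketch

    import (
        appsv1 "k8s.io/api/apps/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    // With requiredDuringScheduling pod anti-affinity and one replica per
    // control-plane node, a surged pod would violate anti-affinity on every
    // node. The rollout therefore has to delete an old pod (maxUnavailable=1)
    // before the replacement can schedule, which is also why a new pod
    // briefly reports Unschedulable while the old one shuts down.
    func oauthServerStrategy() appsv1.DeploymentStrategy {
        maxUnavailable := intstr.FromInt(1)
        maxSurge := intstr.FromInt(0)
        return appsv1.DeploymentStrategy{
            Type: appsv1.RollingUpdateDeploymentStrategyType,
            RollingUpdate: &appsv1.RollingUpdateDeployment{
                MaxUnavailable: &maxUnavailable,
                MaxSurge:       &maxSurge,
            },
        }
    }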
*** Bug 1949986 has been marked as a duplicate of this bug. ***
The number of revisions is caused by the tests that are being run. I looked at the events generated by the operator once the status stops reporting `OAuthServerDeployment_PreconditionNotFulfilled`, which is when the first deployments start to roll out. There are a number of `ObserveRouterSecret` events which show that the openshift-config-managed/router-certs secret was updated with domains like "e2e-test-router-h2spec-*"; I'm going to attach the observed changes to the BZ. This behavior matches the tests at https://github.com/openshift/origin/blob/7f6c3218d227329ae9dc30f22e5d300786e32a44/test/extended/router/h2spec.go#L40. While they are not necessarily disrupting the cluster, I wonder whether they should run in a serial suite, given how they change global cluster configuration.

With the number of revisions explained, I still have to figure out why the oauth-server pods might crashloop.
Created attachment 1773727: observed router-certs secret config changes
Looking at the test result more carefully, I can see that the pod in question was actually started 30 seconds before the test ended. Since the networking tests have caused many rollouts, as shown in comment 2, it's very likely that this is just another of those rollouts. The pod is unschedulable because the pod it is supposed to replace is still shutting down gracefully.

1. We can modify the crashlooping/pending test to keep checking the pods it marks as failing for another 4 minutes, to make sure they eventually get rolled out successfully (see the sketch after this comment for what I mean). I am not sure that's what the test is supposed to be checking, though.
2. Another option would of course be to run the tests that cause these rollouts in a separate test suite which tolerates this behavior.

Clayton, please let me know which one you'd prefer. I can have a look and implement 1., or we should assign this to Routing to deal with 2. I am also open to other options.

---

For the record: the authentication operator goes Degraded if a rollout of a new revision takes over 5 minutes, which is not the 4 minutes the test assumes, but that alone should let us notice a pod that misbehaves for too long.
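A rough sketch of option 1 (hypothetical client-go based code; `recheckPending` and its wiring are made up for illustration):

    package sketch

    import (
        "context"
        "time"

        corev1 "k8s.io/api/core/v1"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // recheckPending re-polls the pods the test flagged as pending for
    // another 4 minutes and only reports a failure if a pod never leaves
    // the Pending phase, i.e. the rollout never completed.
    func recheckPending(ctx context.Context, c kubernetes.Interface, flagged []types.NamespacedName) error {
        return wait.PollImmediate(10*time.Second, 4*time.Minute, func() (bool, error) {
            for _, ref := range flagged {
                pod, err := c.CoreV1().Pods(ref.Namespace).Get(ctx, ref.Name, metav1.GetOptions{})
                if apierrors.IsNotFound(err) {
                    continue // the pod was replaced by a newer rollout, which is fine
                }
                if err != nil {
                    return false, err
                }
                if pod.Status.Phase == corev1.PodPending {
                    return false, nil // still pending, keep polling
                }
            }
            return true, nil
        })
    }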
In comment 2:

> There are a number of `ObserveRouterSecret` events fired which show that the openshift-config-managed/router-certs secret was updated with domains like "e2e-test-router-h2spec-*".

Can we ignore these changes when they happen? Shouldn't oauth only care about the default router? Roughly: "is the rollout here necessary" (and if not, let's not roll out).

For 2 I think I'd prefer not to, because this test is *supposed* to catch something rolling out excessively. "Pending the entire time" (if we're testing correctly) is abnormal; nothing should be pending for more than a few seconds. I think you're implying via 1 that the pod isn't actually pending for 4 minutes total?

It's possible that we want to move the "pending excessively" check to a post-test condition that looks at all pods and fails if any pod has been pending longer than X, instead of testing it here poorly; see the sketch below. I think, looking at the test, that I agree. Let me look at the test in more detail.
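Such a post-suite invariant might look roughly like this (hypothetical; `pendingTooLong` is made up, and a real check would probably inspect pod conditions rather than just age):

    package sketch

    import (
        "context"
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // pendingTooLong lists every pod in the cluster after the suite ends
    // and flags any pod that is still Pending and older than the threshold.
    func pendingTooLong(ctx context.Context, c kubernetes.Interface, threshold time.Duration) ([]string, error) {
        pods, err := c.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
        if err != nil {
            return nil, err
        }
        var failures []string
        for _, pod := range pods.Items {
            age := time.Since(pod.CreationTimestamp.Time)
            if pod.Status.Phase == corev1.PodPending && age > threshold {
                failures = append(failures,
                    fmt.Sprintf("pod %s/%s pending for %v", pod.Namespace, pod.Name, age.Round(time.Second)))
            }
        }
        return failures, nil
    }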
> Can we ignore these changes when they happen? Shouldn't oauth only care about the default router? Roughly: "is the rollout here necessary" (and if not, let's not roll out).

I think I might be able to make us ignore the changes to the other routers. Today we just sync the whole router-certs secret from the openshift-config-managed NS, but I can see how to make it so that we only synchronize the single key for the one domain we care about; a sketch of the idea follows. I'll look into that.

> I think you're implying via 1 that the pod isn't actually pending for 4 minutes total?

Yes, that was my point.
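Something like this (a hypothetical sketch of the filtering step; the destination secret name and namespace are assumptions, not the operator's actual code):

    package sketch

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // filterRouterCerts copies only the key for the ingress domain the oauth
    // server actually serves, so cert churn for e2e-created routers would no
    // longer change the synced secret and trigger a rollout.
    func filterRouterCerts(src *corev1.Secret, ingressDomain string) *corev1.Secret {
        dst := &corev1.Secret{
            ObjectMeta: metav1.ObjectMeta{
                // destination name/namespace assumed for illustration
                Name:      "v4-0-config-system-router-certs",
                Namespace: "openshift-authentication",
            },
            Data: map[string][]byte{},
        }
        if pem, ok := src.Data[ingressDomain]; ok {
            dst.Data[ingressDomain] = pem
        }
        return dst
    }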
*** Bug 1959149 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438