Bug 1950379
Summary: oauth-server is in pending/crashbackoff at beginning 50% of CI runs
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: apiserver-auth
Assignee: Standa Laznicka <slaznick>
Status: CLOSED ERRATA
QA Contact: pmali
Severity: high
Docs Contact:
Priority: urgent
Version: 4.8
CC: aos-bugs, mfojtik, obulatov, sttts, surbania, xxia
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: The authentication operator should only synchronize relevant keys from the openshift-config-managed/router-certs secret.
Reason: The authentication operator would redeploy the oauth-openshift pods even though no applicable changes occurred. This may cause failures in tests that watch for pending pods, and is also generally undesirable.
Result: The authentication operator synchronizes only the relevant keys of the openshift-config-managed/router-certs secret.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:01:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description (Clayton Coleman, 2021-04-16 13:37:10 UTC)
*** Bug 1949986 has been marked as a duplicate of this bug. ***

The number of revisions is caused by the tests that are being run. I looked at the events generated by the operator once the status stops reporting `OAuthServerDeployment_PreconditionNotFulfilled`, which is when the first deployments start to roll out. There are a number of `ObserveRouterSecret` events fired which show that the openshift-config-managed/router-certs secret was updated with domains like "e2e-test-router-h2spec-*". I'm going to attach that to the BZ.

The observed behavior matches the tests found at https://github.com/openshift/origin/blob/7f6c3218d227329ae9dc30f22e5d300786e32a44/test/extended/router/h2spec.go#L40. While it is not necessarily disrupting the cluster, I wonder whether that test should run in a serial suite, given how it changes global cluster configuration.

With the number of revisions cleared up, I have yet to see why the oauth-server pods might crashloop.

Created attachment 1773727 [details]
observed router-certs secret config changes
Looking at the test result more carefully, I can see that the pod in question was actually started 30 seconds before the test ended. Since the networking tests have caused many rollouts, as shown in comment 2, it's very likely that this is just another of those rollouts. The pod is unschedulable because the pod it is supposed to replace is still shutting down gracefully.

1. We can modify the crashlooping/pending test to keep checking the pods it marks as failing for another 4 minutes, to make sure they end up being rolled out successfully, but I am not sure that's what the test is supposed to be checking.
2. Another option would of course be running the tests that cause these rollouts in a separate test suite which tolerates this behavior.

Clayton, please let me know which one you'd prefer. I can have a look and implement 1., or we should assign Routing to deal with 2. I am also open to other options.

For the record: the authentication operator goes degraded if a rollout of a new revision takes over 5 minutes, which is not the 4 minutes the test assumes, but we should be able to notice a pod that misbehaves for too long just by that.

In comment 2:

> There are a number of `ObserveRouterSecret` events fired which show that the openshift-config-managed/router-certs secret was updated with domains like "e2e-test-router-h2spec-*". I'm going to attach that to the BZ.

Can we ignore these changes when they happen? Shouldn't oauth only care about the default router? Roughly: "is the rollout here necessary" (and if not, let's not roll out).

For 2 I think I'd prefer not to, because this test is *supposed* to be catching that something is rolling out excessively. "Pending the entire time" (if we're testing correctly) is abnormal (nothing should be pending for more than a few seconds). I think you're implying via 1 that the pod isn't actually pending for 4 minutes total?
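The 4-minute pending check floated in option 1 could instead live in a post-test sweep over all pods. A minimal sketch of that idea follows; the `(phase, pending_since)` record shape and the `pods_pending_too_long` helper are illustrative assumptions, not actual origin test-suite code:

```python
from datetime import datetime, timedelta

# Hypothetical post-test check: flag any pod that has been Pending
# longer than a threshold (the 4 minutes the test currently assumes).
def pods_pending_too_long(pods, threshold=timedelta(minutes=4), now=None):
    now = now or datetime.utcnow()
    return sorted(
        name
        for name, (phase, pending_since) in pods.items()
        if phase == "Pending" and now - pending_since > threshold
    )

now = datetime(2021, 4, 16, 13, 37)
pods = {
    # New pod briefly Pending while the pod it replaces terminates: not a failure.
    "oauth-openshift-abc": ("Pending", now - timedelta(seconds=30)),
    # A pod stuck Pending for the whole window: should fail the suite.
    "oauth-openshift-def": ("Pending", now - timedelta(minutes=10)),
    "oauth-openshift-ghi": ("Running", now - timedelta(minutes=20)),
}
print(pods_pending_too_long(pods, now=now))  # -> ['oauth-openshift-def']
```

A sweep like this tolerates the short Pending windows that normal rollouts produce while still catching pods that never get scheduled.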
It's possible that we want to move the "pending excessively" check to a post-test condition that just looks at all pods, with any pod pending longer than X triggering a failure, rather than testing it here poorly.

I think, looking at the test, that I agree. Let me look at the test in more detail.

> Can we ignore these changes when they happen? Shouldn't oauth only care about the default router? Roughly: "is the rollout here necessary" (and if not, let's not roll out).

I think I might be able to make us ignore the changes to other routers. Today we are just syncing the whole router-certs secret from the openshift-config-managed namespace, but I can see how I could make it so that we only synchronize the single key for the single domain we care about. I'll look into that.

> I think you're implying via 1 that the pod isn't actually pending for 4 minutes total?

Yes, that was my point.

*** Bug 1959149 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
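The single-key synchronization described in the discussion can be sketched roughly as follows. This is an illustrative model only: the `filter_router_certs` and `needs_redeploy` names and the domain-to-PEM dict shape are assumptions, not the operator's actual Go implementation:

```python
# Hypothetical sketch: sync only the relevant key of the
# openshift-config-managed/router-certs secret (domain -> PEM bundle).
def filter_router_certs(router_certs: dict, oauth_domain: str) -> dict:
    """Keep only the cert entry for the domain the OAuth route is served under."""
    return {d: pem for d, pem in router_certs.items() if d == oauth_domain}

def needs_redeploy(old_secret: dict, new_secret: dict, oauth_domain: str) -> bool:
    """Redeploy oauth-openshift only when the relevant key actually changed."""
    return (filter_router_certs(old_secret, oauth_domain)
            != filter_router_certs(new_secret, oauth_domain))

old = {"apps.example.com": "PEM-A"}
# An e2e test adds an unrelated router domain to the secret:
new = dict(old, **{"e2e-test-router-h2spec-xyz.example.com": "PEM-B"})
print(needs_redeploy(old, new, "apps.example.com"))  # -> False: no rollout needed
```

Under this model, the h2spec test's churn on unrelated domains leaves the synced subset unchanged, so no new oauth-openshift revision rolls out, which is the behavior the errata fix delivers.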