Bug 1846203
| Summary: | install fails to complete: authentication operator Progressing=True _WellKnownNotReady: ...:6443/.well-known/oauth-authorization-server 404 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | kube-apiserver | Assignee: | Standa Laznicka <slaznick> |
| Status: | CLOSED WONTFIX | QA Contact: | Ke Wang <kewang> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.5 | CC: | aos-bugs, ffranz, jima, mfojtik, pprinett, sttts, tsze, vareti, xxia, yanyang |
| Target Milestone: | --- | Flags: | mfojtik: needinfo? |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | LifecycleReset | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-16 11:45:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2020-06-11 05:10:25 UTC
Bug 1822289 is a similar 404, although this one occurs at install time and that one later in the cluster's lifecycle. But maybe not all that much later, in which case this bug may be a dup of 1822289.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I checked the artifacts for each run that failed in the last 48 hours. In each of these runs, the error reported by the authentication operator appears to be genuine: the auth operator is unable to read the well-known endpoint information from the kube-apiserver. This is because the kube-apiserver is in the process of rolling out, and these small disruptions are expected when checking each kube-apiserver. The reason for the job failure, as I understand it, is that the install did not complete in the expected time, because the kube-apiserver is still rolling out. Various factors could slow a kube-apiserver rollout, and I did not find anything suspicious in these failures that would cause the rollout to take longer. I am occupied with other priority items; working on this bug will be re-evaluated next sprint.

> I checked the artifacts for each run that failed in the last 48 hours. In each of these runs, the error reported by the authentication operator appears to be genuine: the auth operator is unable to read the well-known endpoint information from the kube-apiserver. This is because the kube-apiserver is in the process of rolling out, and these small disruptions are expected when checking each kube-apiserver.
How is the auth-operator accessing the kube-apiserver? A 404 is not a sign of a rollout but rather that it still lacks the right configuration.
> How is the auth-operator accessing the kube-apiserver? A 404 is not a sign of a rollout but rather that it still lacks the right configuration.
If I understand correctly, the auth operator checks every kube-apiserver replica for the well-known endpoint. As the well-known endpoint is not accessible during a kube-apiserver rollout, the auth operator reports this error.
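As an illustration of that per-replica check (a minimal sketch, not the auth operator's actual implementation, which is written in Go), the logic amounts to probing the well-known endpoint on each kube-apiserver address and treating anything but a 200 as not-ready. The IPs and the `classify_replica` helper below are hypothetical; the HTTP status is passed in as an argument so the sketch runs without a cluster:

```shell
# Hypothetical sketch of the per-replica well-known check. In a live cluster
# the status code would come from something like:
#   curl -ks -o /dev/null -w '%{http_code}' \
#     "https://$ip:6443/.well-known/oauth-authorization-server"
classify_replica() {
  ip="$1"
  status="$2"
  if [ "$status" = "200" ]; then
    echo "$ip: ready"
  else
    # This is the situation reported during a rollout: the replica answers,
    # but without the OAuth metadata configured yet, so it returns a 404.
    echo "$ip: got '$status' while trying to GET the OAuth well-known endpoint"
  fi
}

classify_replica 10.0.0.5 404
classify_replica 10.0.0.3 200
```

Because the operator requires every replica to answer 200, a single replica that is mid-rollout is enough to keep `_WellKnownNotReady` set.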
From what I can see, most of the failures appear on OpenStack (and in the above search link, none are actually 4.6 despite the reported claim). In most cases, it seems that the KAS simply fails to roll out in time to the revision that contains the configuration with the metadata. The logs of the kube-apiserver-operator of OpenStack installations in version 4.5 show that some of the requests take up to 2.5s to process. Based on the above observation, I'm moving this to the "OpenShift on OpenStack" component for investigation into why the KAS rollouts are so slow.

No idea if it's related, but when I hear "Kubernetes ...mumble mumble... rollouts are slow" recently, it makes me wonder if it's a dup of bug 1868750.

My strong suspicion is that this is just an issue of slow test infrastructure. The observation that "the logs of the kube-apiserver-operator of OpenStack installations in version 4.5 show that some of the requests take up to 2.5s to process" seems to strongly indicate this. Also, since this bug was reported, the statistics on this failure occurring for the openstack platform are now on par with gcp, suggesting that this is not an OpenStack-specific issue:

```
openstack 4.5: 61 runs, 36% failed, 27% of failures match
gcp:           113 runs, 22% failed, 24% of failures match
```

Not sure what info I'm being asked to provide. Maybe slow infra is the culprit, but if so, it would be good if we could point at something in the cluster that directs the managing admin toward whatever steps they should take to mitigate/recover.

This has been observed frequently in some Azure jobs, e.g. the e2e-azure-operator: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-kubernetes-autoscaler-master-e2e-azure-operator

They should be able to recover by just running `openshift-install wait-for install-complete` when it crashes.
I will sift through a few more logs just to make sure that nothing unusual is in there, but if it's just a timeout due to slow hardware, then I'm not sure it's really a bug.

I haven't really found anything out of the ordinary, other than the kube-apiserver replies on average being much slower on runs that failed with this error. It is a shame that Prometheus doesn't run until the cluster is up; it would be nice to have those metrics. To better inform our debugging process, it might be a good idea to run i/o benchmarks on the instances as part of the must-gather script so that we have some data points to work with in case Prometheus hasn't been started yet.

It has been observed in a GCP IPI installation:

```
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.8: 86% complete"
level=info msg="Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.0.5:6443/.well-known/oauth-authorization-server endpoint data"
level=info msg="Cluster operator authentication Available is False with : "
level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.5.8"
level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
level=info msg="Cluster operator insights Disabled is True with Disabled: Health reporting is disabled"
level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8"
level=fatal msg="failed to initialize the cluster: Working towards 4.5.8: 86% complete"
```

It's been a while since the last CI-search dump here, but yeah, OpenStack seems OK now, while GCP is still impacted:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=info.*Cluster+operator+authentication+Progressing+is+True+with+_WellKnownNotReady.*404+Not+Found&maxAge=48h&type=junit&name=release-openshift-' | grep 'failures match'
release-openshift-origin-installer-e2e-gcp-4.5 - 19 runs, 37% failed, 14% of failures match
release-openshift-origin-installer-launch-gcp - 223 runs, 81% failed, 1% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci - 7 runs, 29% failed, 50% of failures match
```

Example 4.5 GCP job [1] has:

```
level=info msg="Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.0.3:6443/.well-known/oauth-authorization-server endpoint data"
level=info msg="Cluster operator authentication Available is False with : "
level=info msg="Cluster operator console Available is False with Deployment_FailedUpdate: DeploymentAvailable: 1 replicas ready at version 4.5.0-0.ci-2020-09-18-032040"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 5; 1 nodes are at revision 6"
level=fatal msg="failed to initialize the cluster: Working towards 4.5.0-0.ci-2020-09-18-032040: 86% complete"
```

All six nodes are Ready=True [2]. I'm not clear on why auth needs all three kube-apiservers happy, but apparently that's the case [3]. On the other hand, it's not complaining about "need at least 3 kube-apiservers" here either.
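On the earlier suggestion to run i/o benchmarks on the instances as part of must-gather: a minimal spot-check could be as simple as timing a synchronous write. This is a hypothetical sketch, not part of any existing gather script; `dd` with `oflag=dsync` (GNU coreutils, Linux) forces each block to disk so the reported throughput reflects storage rather than the page cache:

```shell
# Hypothetical i/o spot-check a gather script could record alongside its
# other artifacts. Writes 8 MiB synchronously and reports the resulting
# file size; the dd summary line includes the measured throughput.
probe="$(mktemp)"
dd if=/dev/zero of="$probe" bs=1M count=8 oflag=dsync 2>&1 | tail -n 1
wc -c < "$probe"
rm -f "$probe"
```

Persisting a line like that per node would give at least one data point for runs where Prometheus never came up.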
Looking into the kube-apiserver pod:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1306798216066371584/artifacts/e2e-gcp/pods.json | jq '.items[] | select(.metadata.name == "kube-apiserver-ci-op-9bjbcigk-2aad9-9tc5j-master-2").status | {phase, initContainerStatuses}'
{
  "phase": "Pending",
  "initContainerStatuses": [
    {
      "containerID": "cri-o://c1fc27503768fcbebac7c0ce2ced5d1b7cdf9ac131bf59c6b0173221a00600f1",
      "image": "registry.svc.ci.openshift.org/ocp/4.5-2020-09-18-032040@sha256:224f8ee8bf76835e404f0ca8d21690557fa55ca929a14f4ed39e0d0471b59f0c",
      "imageID": "registry.svc.ci.openshift.org/ocp/4.5-2020-09-18-032040@sha256:224f8ee8bf76835e404f0ca8d21690557fa55ca929a14f4ed39e0d0471b59f0c",
      "lastState": {},
      "name": "setup",
      "ready": false,
      "restartCount": 0,
      "state": {
        "running": {
          "startedAt": "2020-09-18T04:18:14Z"
        }
      }
    }
  ]
}
```

That also sounds similar to stuff reported in bug 1877481. So maybe this can be closed as a dup of bug 1877984? Assigning back to the API-server folks to confirm.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1306798216066371584
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1306798216066371584/artifacts/e2e-gcp/nodes.json
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1877481#c18

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days.
The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

This bug got stale, and I probably won't have time anytime soon to loop through all the failures and see if there might be a common root cause for the failed runs. Closing.