Bug 1846203
| Summary: | install fails to complete: authentication operator Progressing=True _WellKnownNotReady: ...:6443/.well-known/oauth-authorization-server 404 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | kube-apiserver | Assignee: | Standa Laznicka <slaznick> |
| Status: | CLOSED WONTFIX | QA Contact: | Ke Wang <kewang> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.5 | CC: | aos-bugs, ffranz, jima, mfojtik, pprinett, sttts, tsze, vareti, xxia, yanyang |
| Target Milestone: | --- | Flags: | mfojtik: needinfo? |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | LifecycleReset | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-16 11:45:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2020-06-11 05:10:25 UTC
Bug 1822289 is a similar 404, although this one occurs at install time and that one later in the cluster's lifecycle. But maybe not all that much later, in which case this bug may be a dup of 1822289.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I checked the artifacts for each run that failed in the last 48 hours. In each of these runs, the error reported by the authentication operator appears to be genuine: the auth operator is unable to read the well-known endpoint information from the kube-apiserver. This is because the kube-apiserver is in the process of rolling out, and these small disruptions are expected when checking each kube-apiserver. The reason for the job failure, as I understand it, is that the install did not complete in the expected time, because the kube-apiserver is still rolling out. Various factors could slow a kube-apiserver rollout, and I did not find anything suspicious in these failures that would cause the rollout to take longer. I am occupied with other priority items; working on this bug will be re-evaluated next sprint.

> I checked the artifacts for each run that failed in the last 48 hours. In each of these runs, the error reported by the authentication operator appears to be genuine: the auth operator is unable to read the well-known endpoint information from the kube-apiserver. This is because the kube-apiserver is in the process of rolling out, and these small disruptions are expected when checking each kube-apiserver.
How is the auth-operator accessing the kube-apiserver? A 404 is not a sign of a rollout but rather that it still lacks the right configuration.
> How is the auth-operator accessing the kube-apiserver? A 404 is not a sign of a rollout but rather that it still lacks the right configuration.
If I understand correctly, the auth operator checks every kube-apiserver replica for the well-known endpoint. As the well-known endpoint is not accessible during a kube-apiserver rollout, the auth operator reports this error.
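As an illustration of that per-replica check (a minimal sketch, not the auth operator's actual implementation, which is written in Go), the logic amounts to probing the well-known endpoint on each kube-apiserver address and treating anything but a 200 as not-ready. The IPs and the `classify_replica` helper below are hypothetical; the HTTP status is passed in as an argument so the sketch runs without a cluster:

```shell
# Hypothetical sketch of the per-replica well-known check. In a live cluster
# the status code would come from something like:
#   curl -ks -o /dev/null -w '%{http_code}' \
#     "https://$ip:6443/.well-known/oauth-authorization-server"
classify_replica() {
  ip="$1"
  status="$2"
  if [ "$status" = "200" ]; then
    echo "$ip: ready"
  else
    # This is the situation reported during a rollout: the replica answers,
    # but without the OAuth metadata configured yet, so it returns a 404.
    echo "$ip: got '$status' while trying to GET the OAuth well-known endpoint"
  fi
}

classify_replica 10.0.0.5 404
classify_replica 10.0.0.3 200
```

Because the operator requires every replica to answer 200, a single replica that is mid-rollout is enough to keep `_WellKnownNotReady` set.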
From what I can see, most of the failures appear on OpenStack (and in the above search link, none are actually 4.6 despite the reported claim). In most cases, it seems that the KAS simply fails to roll out in time to the revision that contains the configuration with the metadata. The logs of the kube-apiserver-operator of OpenStack installations in version 4.5 show that some of the requests take up to 2.5s to process. Based on the above observation, I'm moving this to the "OpenShift on OpenStack" component for investigation into why the KAS rollouts are so slow.

No idea if it's related, but when I hear "Kubernetes ...mumble mumble... rollouts are slow" recently, it makes me wonder if it's a dup of bug 1868750.

My strong suspicion is that this is just an issue of slow test infrastructure. The observation that "the logs of the kube-apiserver-operator of OpenStack installations in version 4.5 show that some of the requests take up to 2.5s to process" seems to strongly indicate this. Also, since this bug was reported, the statistics on this failure occurring for the openstack platform are now on par with gcp, suggesting that this is not an OpenStack-specific issue:

```
openstack 4.5: 61 runs, 36% failed, 27% of failures match
gcp:           113 runs, 22% failed, 24% of failures match
```

Not sure what info I'm being asked to provide. Maybe slow infra is the culprit, but if so, it would be good if we could point at something in the cluster that directs the managing admin toward whatever steps they should take to mitigate/recover.

This has been observed frequently in some Azure jobs, e.g. the e2e-azure-operator: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-kubernetes-autoscaler-master-e2e-azure-operator

They should be able to recover by just running `openshift-install wait-for install-complete` when it crashes.
I will sift through a few more logs just to make sure that nothing unusual is in there, but if it's just a timeout due to slow hardware, then I'm not sure it's really a bug.

I haven't really found anything out of the ordinary, other than the kube-apiserver replies on average being much slower on runs that failed with this error. It is a shame that Prometheus doesn't run until the cluster is up; it would be nice to have those metrics. To better inform our debugging process, it might be a good idea to run i/o benchmarks on the instances as part of the must-gather script so that we have some data points to work with in case Prometheus hasn't been started yet.

It has been observed in a GCP IPI installation:

```
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.8: 86% complete"
level=info msg="Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.0.5:6443/.well-known/oauth-authorization-server endpoint data"
level=info msg="Cluster operator authentication Available is False with : "
level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.5.8"
level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
level=info msg="Cluster operator insights Disabled is True with Disabled: Health reporting is disabled"
level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8"
level=fatal msg="failed to initialize the cluster: Working towards 4.5.8: 86% complete"
```

It's been a while since the last CI-search dump here, but yeah, OpenStack seems OK now, while GCP is still impacted:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=info.*Cluster+operator+authentication+Progressing+is+True+with+_WellKnownNotReady.*404+Not+Found&maxAge=48h&type=junit&name=release-openshift-' | grep 'failures match'
release-openshift-origin-installer-e2e-gcp-4.5 - 19 runs, 37% failed, 14% of failures match
release-openshift-origin-installer-launch-gcp - 223 runs, 81% failed, 1% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci - 7 runs, 29% failed, 50% of failures match
```

Example 4.5 GCP job [1] has:

```
level=info msg="Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.0.3:6443/.well-known/oauth-authorization-server endpoint data"
level=info msg="Cluster operator authentication Available is False with : "
level=info msg="Cluster operator console Available is False with Deployment_FailedUpdate: DeploymentAvailable: 1 replicas ready at version 4.5.0-0.ci-2020-09-18-032040"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 5; 1 nodes are at revision 6"
level=fatal msg="failed to initialize the cluster: Working towards 4.5.0-0.ci-2020-09-18-032040: 86% complete"
```

All six nodes are Ready=True [2]. I'm not clear on why auth needs all three kube-apiservers happy, but apparently that's the case [3]. On the other hand, it's not complaining about "need at least 3 kube-apiservers" here either.
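On the earlier suggestion to run i/o benchmarks on the instances as part of must-gather: a minimal spot-check could be as simple as timing a synchronous write. This is a hypothetical sketch, not part of any existing gather script; `dd` with `oflag=dsync` (GNU coreutils, Linux) forces each block to disk so the reported throughput reflects storage rather than the page cache:

```shell
# Hypothetical i/o spot-check a gather script could record alongside its
# other artifacts. Writes 8 MiB synchronously and reports the resulting
# file size; the dd summary line includes the measured throughput.
probe="$(mktemp)"
dd if=/dev/zero of="$probe" bs=1M count=8 oflag=dsync 2>&1 | tail -n 1
wc -c < "$probe"
rm -f "$probe"
```

Persisting a line like that per node would give at least one data point for runs where Prometheus never came up.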
Looking into the kube-apiserver pod:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1306798216066371584/artifacts/e2e-gcp/pods.json | jq '.items[] | select(.metadata.name == "kube-apiserver-ci-op-9bjbcigk-2aad9-9tc5j-master-2").status | {phase, initContainerStatuses}'
{
  "phase": "Pending",
  "initContainerStatuses": [
    {
      "containerID": "cri-o://c1fc27503768fcbebac7c0ce2ced5d1b7cdf9ac131bf59c6b0173221a00600f1",
      "image": "registry.svc.ci.openshift.org/ocp/4.5-2020-09-18-032040@sha256:224f8ee8bf76835e404f0ca8d21690557fa55ca929a14f4ed39e0d0471b59f0c",
      "imageID": "registry.svc.ci.openshift.org/ocp/4.5-2020-09-18-032040@sha256:224f8ee8bf76835e404f0ca8d21690557fa55ca929a14f4ed39e0d0471b59f0c",
      "lastState": {},
      "name": "setup",
      "ready": false,
      "restartCount": 0,
      "state": {
        "running": {
          "startedAt": "2020-09-18T04:18:14Z"
        }
      }
    }
  ]
}
```

That also sounds similar to stuff reported in bug 1877481. So maybe this can be closed as a dup of bug 1877984? Assigning back to the API-server folks to confirm.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1306798216066371584
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1306798216066371584/artifacts/e2e-gcp/nodes.json
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1877481#c18

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days.
The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

This bug got stale, and I probably won't have time anytime soon to loop through all the failures and see if there might be a common root cause for the failed runs. Closing.