Bug 1994820 - machine controller doesn't send vCPU quota failed messages to cluster install logs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Duplicates: 2090780
Depends On:
Blocks:
 
Reported: 2021-08-17 22:19 UTC by Karthik Perumal
Modified: 2022-08-10 10:37 UTC (History)
6 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: The Machine API now reports Degraded if an insufficient number of worker machines start when the cluster is installed.
Reason: Previously, only operators such as auth and ingress showed as degraded in this scenario, hiding the fact that the Machine API was the real issue.
Result: The Machine API is now included in the list of failed operators, giving users a hint that they should look at the state of their Machines.
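The check described in the doc text can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual operator code: the `Machine` struct and `degradedMessage` helper are hypothetical stand-ins, though the message format mirrors the one that appears in the verification logs in this bug.

```go
package main

import (
	"fmt"
	"strings"
)

// Machine is a hypothetical stand-in for the Machine API resource;
// only the fields needed for this sketch are modelled.
type Machine struct {
	Name  string
	Phase string // e.g. "Running", "Provisioning", "Failed"
}

// degradedMessage mirrors the kind of check the fix performs at install
// time: if any of the initial Machines is not Running, report a Degraded
// message naming the offenders so it surfaces in the install logs.
func degradedMessage(machines []Machine) (bool, string) {
	var notRunning []string
	for _, m := range machines {
		if m.Phase != "Running" {
			notRunning = append(notRunning, m.Name)
		}
	}
	if len(notRunning) == 0 {
		return false, ""
	}
	return true, fmt.Sprintf("found %d non running machine(s): %s",
		len(notRunning), strings.Join(notRunning, ", "))
}

func main() {
	machines := []Machine{
		{Name: "worker-us-east-2a-ps77t", Phase: "Failed"},
		{Name: "worker-us-east-2b-lbfb7", Phase: "Running"},
	}
	degraded, msg := degradedMessage(machines)
	fmt.Println(degraded, msg)
}
```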
Clone Of:
Environment:
Last Closed: 2022-08-10 10:36:53 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 1019 0 None open Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running 2022-05-23 14:55:44 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:37:07 UTC

Description Karthik Perumal 2021-08-17 22:19:05 UTC
Description of problem:
OSD/ROSA cluster failed to complete installation due to insufficient vCPU quota available in the AWS account. From the .status.errorMessage of the machine that failed to provision:

> error launching instance: You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

This should ideally be exposed in the cluster's install logs. Instead, the install logs only hint that a few operators are degraded (such as monitoring, ingress, and image-registry).
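The kind of aggregation the report is asking for could look like this sketch: collect each failed machine's `.status.errorMessage` so an installer could print the real root cause. The `MachineStatus` struct and `installLogHint` helper are hypothetical simplifications, not the real Machine API types.

```go
package main

import "fmt"

// MachineStatus is a hypothetical stand-in for the Machine API's
// .status block; in the real API, errorMessage is set when
// provisioning fails (e.g. the vCPU quota error quoted above).
type MachineStatus struct {
	Phase        string
	ErrorMessage *string // nil when the machine has no error
}

// installLogHint collects provisioning errors so an installer could
// surface the root cause instead of only "operator X is degraded".
func installLogHint(statuses map[string]MachineStatus) []string {
	var hints []string
	for name, s := range statuses {
		if s.ErrorMessage != nil {
			hints = append(hints, fmt.Sprintf("machine %s: %s", name, *s.ErrorMessage))
		}
	}
	return hints
}

func main() {
	em := "error launching instance: You have requested more vCPU capacity than your current vCPU limit of 32 allows"
	hints := installLogHint(map[string]MachineStatus{
		"worker-us-east-2a-ps77t": {Phase: "Failed", ErrorMessage: &em},
	})
	for _, h := range hints {
		fmt.Println(h)
	}
}
```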

Version-Release number of selected component (if applicable): ROSA (OCP 4.8.4 on AWS)


How reproducible: Seems easily reproducible with an AWS account that lacks the vCPU quota required for the cluster to provision. I have not tried reproducing it, however.


Steps to Reproduce:
1. Use an AWS account with insufficient vCPU quota
2. Provision a ROSA cluster on that AWS account
3. The cluster provision should fail with degraded operators

Actual results:
The cluster provision fails with the logs only highlighting a few cluster operators being degraded.

Expected results:
The cluster provision fails with the installer logs suggesting the real root cause, in this case the vCPU quota being exceeded. These error messages should be available in the install logs.

Additional info:
This is not an outlier, as we have seen the same kind of clusterProvision failure occur a few times now. This is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1943376, although that was related to a different component and a different quota error.

Comment 12 Greg Sheremeta 2021-12-09 21:08:32 UTC
1. Is there any ETA on this?
2. We see a similar problem when machine-api doesn't have permission to ec2:CreateInstance (because of a bad STS role passed in). The 403 that machine-api is encountering is also not being surfaced to the install log, so the only clue we have that something is wrong is a vague "0 workers created" message. Do you consider that to be this same problem (generically, "we aren't setting cluster operator status") or do you want me to open a new bug about it?

Comment 13 Michael McCune 2021-12-10 14:13:07 UTC
@

Comment 14 Michael McCune 2021-12-10 14:15:38 UTC
I seem to have had a browser malfunction. @jspeed, is there any update from our side?

Comment 15 Joel Speed 2022-01-17 17:17:41 UTC
> 1. Is there any ETA on this?

Not presently; we need to think about a good way to surface this. It's an install-time problem IMO; we haven't historically reported broken machines as a cluster operator issue because they don't affect the general running of a cluster.
If we are going to report on the cluster operator, then it needs to be a soft failure that doesn't cause upgrades to block once the cluster is up and running.

> 2. We see a similar problem when machine-api doesn't have permission to ec2:CreateInstance (because of a bad STS role passed in). The 403 that machine-api is encountering is also not being surfaced to the install log, so the only clue we have that something is wrong is a vague "0 workers created" message. Do you consider that to be this same problem (generically, "we aren't setting cluster operator status") or do you want me to open a new bug about it?

As far as I understand the issue, this sounds the same to me.

Comment 18 Michael McCune 2022-04-22 13:34:34 UTC
As far as I know, we do not have an update on this bug. We still need to answer the issues that Joel raised in comment 15, specifically about reporting broken machines as a cluster operator issue during installation.

Comment 19 Joel Speed 2022-05-13 10:40:11 UTC
This was discussed on the Cluster Lifecycle architecture call yesterday. We are going to add an intermediate step of making the Machine controller report "progressing" until it has observed that all of the Machines from the initial set are running. We have set up a WG to define what a degraded operator means, and we will advance on making the Machine API Operator report Degraded once we have a clearer direction on that.
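The intermediate step described here could look roughly like the following (a hypothetical sketch, not the merged implementation): the operator keeps reporting Progressing until every Machine in the initial set has been observed Running at least once, so a machine that changes phase later does not re-block a cluster that is already up.

```go
package main

import "fmt"

// bootstrapProgressing reports whether the operator should still be
// Progressing: true until every Machine from the initial set has been
// observed Running at least once. observedRunning records machines
// that have ever reached Running, so later phase changes on a running
// cluster do not flip the condition back and block upgrades.
func bootstrapProgressing(initial []string, observedRunning map[string]bool) bool {
	for _, name := range initial {
		if !observedRunning[name] {
			return true // still waiting on at least one machine
		}
	}
	return false // initial set fully observed; stop gating
}

func main() {
	seen := map[string]bool{"worker-a": true}
	fmt.Println(bootstrapProgressing([]string{"worker-a", "worker-b"}, seen))
}
```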

Comment 20 sunzhaohua 2022-05-24 12:41:44 UTC
Verified before the PR merged.
1. Build an image with PR openshift/machine-api-operator/pull/1019.
2. Create manifests and update one MachineSet YAML file, such as 99_openshift-cluster-api_worker-machineset-0.yaml, changing instanceType to an invalid value.
3. Set up an IPI cluster.
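Step 2 could look like the following excerpt (a hypothetical fragment: only instanceType is changed, and the surrounding fields are abbreviated):

```yaml
# 99_openshift-cluster-api_worker-machineset-0.yaml (excerpt)
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
spec:
  template:
    spec:
      providerSpec:
        value:
          instanceType: invalid-type  # was e.g. m6i.xlarge; Machine will fail to provision
```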

Cluster setup failed. If we delete the failed MachineSet, the cluster returns to normal.
Tested similar steps on a UPI cluster: if the failed MachineSet YAML file is not removed, machine creation fails and cluster setup fails; if it is removed, the cluster returns to normal.
With the same steps on payload 4.11.0-0.nightly-2022-05-20-213928, the cluster installation is successful.

05-24 16:53:18.383  level=debug msg=Still waiting for the cluster to initialize: Cluster operator machine-api is not available
05-24 17:22:13.376  level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
05-24 17:22:13.376  level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
05-24 17:22:13.376  level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
05-24 17:22:13.377  level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
05-24 17:22:13.377  level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
05-24 17:22:13.377  level=error msg=Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
05-24 17:22:13.377  level=error msg=Cluster operator etcd Degraded is True with UpgradeBackupController_Error: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-05-24 08:32:47 +0000 UTC <nil> 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest registry.build01.ci.openshift.org/ci-ln-h7pstxb/release@sha256:a570f4b607377f7fe9e09157ce4c08d6f07aed81a86ca9856c051997ac300527 false }]
05-24 17:22:13.377  level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
05-24 17:22:13.377  level=info msg=Cluster operator insights SCANotAvailable is True with NotFound: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 404: {"id":"7","kind":"Error","href":"/api/accounts_mgmt/v1/errors/7","code":"ACCT-MGMT-7","reason":"The organization (id= 1V6IJrh1cNmDxgNlAAWZRfupr3B) does not have any certificate of type sca. Enable SCA at https://access.redhat.com/management.","operation_id":"adf1f8ad-fa1e-4833-a8b8-aa7a77459db7"}
05-24 17:22:13.377  level=info msg=Cluster operator insights Disabled is False with AsExpected: 
05-24 17:22:13.378  level=info msg=Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest
05-24 17:22:13.378  level=error msg=Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest because found 1 non running machine(s): zhsunaws222-6p2ww-worker-us-east-2a-ps77t
05-24 17:22:13.378  level=info msg=Cluster operator machine-api Available is False with Initializing: Operator is initializing
05-24 17:22:13.378  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
05-24 17:22:13.378  level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
05-24 17:22:13.378  level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
05-24 17:22:13.378  level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
05-24 17:22:13.378  level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
05-24 17:22:13.378  level=error msg=failed to initialize the cluster: Cluster operator machine-api is not available

$ oc get clusterversion                                                                                                                             
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          64m     Unable to apply 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest: the cluster operator machine-api has not yet successfully rolled out

$ oc get machine                                                                                                                    
NAME                                        PHASE     TYPE         REGION      ZONE         AGE
zhsunaws222-6p2ww-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   63m
zhsunaws222-6p2ww-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   63m
zhsunaws222-6p2ww-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   63m
zhsunaws222-6p2ww-worker-us-east-2a-ps77t   Failed                                          60m
zhsunaws222-6p2ww-worker-us-east-2b-lbfb7   Running   m6i.xlarge   us-east-2   us-east-2b   60m
zhsunaws222-6p2ww-worker-us-east-2c-r7qr2   Running   m6i.xlarge   us-east-2   us-east-2c   60m

$ oc get co                                                                                                                          
NAME                                       VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      48m
baremetal                                  4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      60m
cloud-controller-manager                   4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      63m
cloud-credential                           4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      62m
cluster-autoscaler                                                                                   True        False         True       60m     machine-api not ready
config-operator                            4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      62m
console                                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      51m
csi-snapshot-controller                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      61m
dns                                        4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      60m
etcd                                       4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         True       60m     UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-05-24 08:32:47 +0000 UTC <nil> 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest registry.build01.ci.openshift.org/ci-ln-h7pstxb/release@sha256:a570f4b607377f7fe9e09157ce4c08d6f07aed81a86ca9856c051997ac300527 false }]
image-registry                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      55m
ingress                                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      55m
insights                                   4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      55m
kube-apiserver                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      56m
kube-controller-manager                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      59m
kube-scheduler                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      58m
kube-storage-version-migrator              4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      61m
machine-api                                                                                          False       True          True       61m     Operator is initializing

$ oc edit co machine-api
status:
  conditions:
  - lastTransitionTime: "2022-05-24T08:35:53Z"
    message: 'Progressing towards operator: 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest'
    reason: SyncingResources
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-05-24T08:39:33Z"
    message: 'Failed when progressing towards operator: 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest
      because found 1 non running machine(s): zhsunaws222-6p2ww-worker-us-east-2a-ps77t'
    reason: SyncingFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-05-24T08:35:53Z"
    message: Operator is initializing
    reason: Initializing
    status: "False"
    type: Available
  - lastTransitionTime: "2022-05-24T08:35:53Z"
    status: "True"
    type: Upgradeable

$ oc delete machineset zhsunaws222-6p2ww-worker-us-east-2a
machineset.machine.openshift.io "zhsunaws222-6p2ww-worker-us-east-2a" deleted

$ oc get machine
NAME                                        PHASE     TYPE         REGION      ZONE         AGE
zhsunaws222-6p2ww-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   111m
zhsunaws222-6p2ww-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   111m
zhsunaws222-6p2ww-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   111m
zhsunaws222-6p2ww-worker-us-east-2b-lbfb7   Running   m6i.xlarge   us-east-2   us-east-2b   108m
zhsunaws222-6p2ww-worker-us-east-2c-r7qr2   Running   m6i.xlarge   us-east-2   us-east-2c   108m

$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         6s      Cluster version is 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest

$ oc get co
NAME                                       VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      96m
baremetal                                  4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
cloud-controller-manager                   4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      111m
cloud-credential                           4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      111m
cluster-autoscaler                         4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
config-operator                            4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      110m
console                                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      99m
csi-snapshot-controller                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
dns                                        4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
etcd                                       4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      108m
image-registry                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      103m
ingress                                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      104m
insights                                   4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      104m
kube-apiserver                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      104m
kube-controller-manager                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      107m
kube-scheduler                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      106m
kube-storage-version-migrator              4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      110m
machine-api                                4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      24s
machine-approver                           4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
machine-config                             4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      108m
marketplace                                4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
monitoring                                 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      100m
network                                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      111m
node-tuning                                4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
openshift-apiserver                        4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      104m
openshift-controller-manager               4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
openshift-samples                          4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      103m
operator-lifecycle-manager                 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
operator-lifecycle-manager-catalog         4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m
operator-lifecycle-manager-packageserver   4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      104m
service-ca                                 4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      110m
storage                                    4.11.0-0.ci.test-2022-05-24-021537-ci-ln-h7pstxb-latest   True        False         False      109m

Comment 21 Joel Speed 2022-05-24 13:45:22 UTC
I also tried doing this without the Degraded condition, using only Progressing; the output of the installer is pretty much the same:

INFO Waiting up to 30m0s (until 2:13PM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 2:31PM) for the cluster at https://api.jspeed-test-2.devcluster.openshift.com:6443 to initialize...
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
ERROR Cluster operator etcd Degraded is True with UpgradeBackupController_Error: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-05-24 12:44:16 +0000 UTC <nil> 4.11.0-0.nightly-2022-05-24-062131 quay.io/jspeed/release@sha256:3e84ce1004b7312c8bedcde5c7f63521c1a7fc89fd8cc4564135acfd65f8562b false }]
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights SCANotAvailable is True with NotFound: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 404: {"id":"7","kind":"Error","href":"/api/accounts_mgmt/v1/errors/7","code":"ACCT-MGMT-7","reason":"The organization (id= 1W4cVqx5p9Ty1StMSTk4reQMa07) does not have any certificate of type sca. Enable SCA at https://access.redhat.com/management.","operation_id":"c99eae58-342a-406b-9e36-13d0f9fa503c"}
INFO Cluster operator machine-api Progressing is True with Initializing: found 1 non running machine(s): jspeed-test-2-wk64h-worker-us-east-2c-scxz7
INFO Cluster operator machine-api Available is False with Initializing: Operator is initializing
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Cluster operator machine-api is not available

Comment 22 Joel Speed 2022-05-25 08:09:41 UTC
If we look at using Progressing with Available=True (the previous post used Available=False), then the error from MAPI is much harder to find: it is still only an info log, and not the headline problem, in a cluster that is having other issues (i.e. other operators aren't up because MAPI hasn't created enough machines):

INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 5:59PM) for the Kubernetes API at https://api.jspeed-test-2.devcluster.openshift.com:6443...
INFO API v1.23.3+ad897c4 up
INFO Waiting up to 30m0s (until 6:10PM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 6:39PM) for the cluster at https://api.jspeed-test-2.devcluster.openshift.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
ERROR OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.jspeed-test-2.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded:
ERROR OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a valid host address
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.119.56:443/healthz": dial tcp 172.30.119.56:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
INFO Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.119.56:443/healthz": dial tcp 172.30.119.56:443: connect: connection refused
INFO OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
INFO ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
INFO WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
ERROR Cluster operator console Degraded is True with DefaultRouteSync_FailedAdmitDefaultRoute::RouteHealth_RouteNotAdmitted::SyncLoopRefresh_FailedIngress: DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.jspeed-test-2.devcluster.openshift.com in route console in namespace openshift-console
ERROR RouteHealthDegraded: console route is not admitted
ERROR SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.jspeed-test-2.devcluster.openshift.com in route console in namespace openshift-console
INFO Cluster operator console Available is False with RouteHealth_RouteNotAdmitted: RouteHealthAvailable: console route is not admitted
ERROR Cluster operator etcd Degraded is True with UpgradeBackupController_Error: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-05-24 16:41:17 +0000 UTC <nil> 4.11.0-0.nightly-2022-05-24-062131 quay.io/jspeed/release@sha256:3e84ce1004b7312c8bedcde5c7f63521c1a7fc89fd8cc4564135acfd65f8562b false }]
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
INFO Cluster operator image-registry Available is False with NoReplicasAvailable: Available: The deployment does not have available replicas
INFO NodeCADaemonAvailable: The daemon set node-ca has available replicas
INFO ImagePrunerAvailable: Pruner CronJob has been created
INFO Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed
ERROR Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not have available replicas
INFO Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-85b5cccc4c-gpfqt" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Pod "router-default-85b5cccc4c-xjcm8" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
INFO Cluster operator insights SCANotAvailable is True with NotFound: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 404: {"id":"7","kind":"Error","href":"/api/accounts_mgmt/v1/errors/7","code":"ACCT-MGMT-7","reason":"The organization (id= 1W4cVqx5p9Ty1StMSTk4reQMa07) does not have any certificate of type sca. Enable SCA at https://access.redhat.com/management.","operation_id":"5f4a53be-16cf-45f7-82f4-9284646b9937"}
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator machine-api Progressing is True with Initializing: found 1 non running machine(s): jspeed-test-2-gsbjg-worker-us-east-2c-bbjlj
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
INFO Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring

Comment 23 Joel Speed 2022-07-04 15:24:11 UTC
*** Bug 2090780 has been marked as a duplicate of this bug. ***

Comment 25 sunzhaohua 2022-07-08 10:08:37 UTC
Moving to verified; this was verified before the PR merged.

Comment 27 errata-xmlrpc 2022-08-10 10:36:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

