Description of problem: During installation of 4.4.0-0.nightly-2020-03-06-030852 on VMware, using RHCOS image rhcos-44.81.202002241126-0-vmware.x86_64.ova, I'm getting this error after issuing the command `openshift-install wait-for install-complete`:

module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-vlan44.brown-chesterfield.com: []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to release version "4.4.0-0.nightly-2020-03-06-030852".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b039973cb8ade25cdc53e05f2336f402263a2e1bed976a2a80e60986fbc90a6b".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator insights Disabled is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.ope
Can you provide details on how you are creating the cluster? The logs don't look familiar to me. Secondly, moving this to the Auth team. Please attach the must-gather logs for debugging.
We use Terraform to create the cluster. Please provide information on how to get must-gather logs?
*** Bug 1811225 has been marked as a duplicate of this bug. ***
I was given the following as the must-gather process:

# oc get co
# oc get pods --all-namespaces
# oc adm must-gather

However, at this point the cluster is not in a state where you can run oc commands, so I cannot get the must-gather.
Turned on debug for the openshift-install wait-for command as:

openshift-install wait-for install-complete --log-level="debug"

Failure output with debug on:

module.ocp4_finish_up.null_resource.do_action[0]: Still creating... [30m0s elapsed]
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-t44two.brown-chesterfield.com: []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to release version "4.4.0-0.nightly-2020-03-09-120006".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fec9371baf9082b513045d76becde894b6d5b57581365649664e22939d3e9faa".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator insights Disabled is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Available is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Cluster installation has failed, exit code was: 1
Is there a way to increase the wait time used by the openshift-install wait-for command?
That's really a question for someone from the installer team, I do not know. Also, when you say you are using terraform to create the cluster, does that mean you are not using the openshift-installer?
As I've stated above a couple of times, we do call the openshift-install command, and the results of doing that are shown above. Terraform is a way to wrap the openshift-install command. As you know, VMware is a UPI install option, which means the VMs are not created as part of the install process. We use Terraform to automate the creation of the VMs and then call the openshift-install commands within scripts in Terraform.
The Terraform we use has worked perfectly for installing 4.3 GA (and nightly builds in the past) and 4.2. The only things we are changing are pointing to the 4.4 nightly to pick up and which VMware template to use (it comes from the downloaded OVA file pointed to here: https://github.com/openshift/installer/blob/master/data/data/rhcos.json#L123-L127).
We continue to see this error:

up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server

It goes all the way back to the Feb 10th 4.4 nightly install attempts on VMware. It was marked as a duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1802678. That one is in https://bugzilla.redhat.com/show_bug.cgi?id=1801898 as Status: ON_QA → VERIFIED; its last entry shows it "blocked" by https://bugzilla.redhat.com/show_bug.cgi?id=1798945, but that one shows CLOSED CURRENTRELEASE. It's hard to know for sure the status of this particular issue ("No subsets found for the endpoints of oauth-server") that we continue to see.

This occurred today trying it against 4.4.0-rc.0 on VMware. We are also using CoreOS rhcos-44.81.202003110830-0. This is all coming from https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.4.0-rc.0/. The release.txt there references:

Component Versions:
  kubernetes 1.17.1
  machine-os 44.81.202003110830-0 Red Hat Enterprise Linux CoreOS

Is this the correct CoreOS level to be using? This is blocking our ability to install OCP 4.4 on VMware for teams that need it for ongoing test/dev.
Still getting this error on the latest nightly, 4.4.0-0.nightly-2020-03-13-073111, using machine-os 44.81.202003130330-0 Red Hat Enterprise Linux CoreOS. The error is:

ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server

Please advise on your next debug steps or any additional information you need.
@Abhinav Dahiya: Can we get an update on this issue? It has been sitting for a long time.
Looks like the installer folks are ignoring the question from comment 7; let me move the BZ to their component, which will hopefully make them notice it more easily.
(In reply to krapohl from comment #6)
> Is there a way to increase the wait time used by the openshift-install
> wait-for command?

No, just run it again.

Auth is complaining about the route being degraded, and thus it should be triaged by the Routing team if the Auth team is not sure why that's happening.
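Since "just run it again" is the recommended path, the re-run can be scripted for automation. Below is a hedged sketch only: the `wait_for_install` function name, the attempt count, and the `INSTALLER`/`CLUSTER_DIR` variables are my assumptions, not part of openshift-install.

```shell
# Hypothetical retry wrapper: `wait-for install-complete` can simply be
# re-invoked, so loop until it succeeds or the attempt budget runs out.
# INSTALLER and CLUSTER_DIR are placeholders for your environment.
wait_for_install() {
  local attempts=${1:-3} i
  for ((i = 1; i <= attempts; i++)); do
    if "${INSTALLER:-openshift-install}" --dir "${CLUSTER_DIR:-.}" wait-for install-complete; then
      return 0
    fi
    echo "wait-for install-complete failed (attempt ${i}/${attempts}), retrying" >&2
  done
  return 1
}
```

Note this only helps when the cluster eventually converges; it will not fix a cluster that is stuck for a structural reason (as turned out to be the case here).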
The nature of UPI installations limits our ability to diagnose your issue, and we need more detailed information about your specific environment. Areas that can impact ingress and DNS (and by extension auth) in a UPI installation are broad, and include:

* VPC configuration
* Security group configuration
* Load balancer configuration
* DNS configuration

Can you provide any of the following?

1. Exact steps and Terraform assets to reproduce your cluster (preferred)
2. The Terraform module(s) used to satisfy all the networking requirements laid out in the UPI guide[1]
3. More detailed info about each of the components above to audit against the UPI requirements

[1] https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md
I have tried the 4.4.0-rc.1 build now, which said it used machine-os 44.81.202003161031-0 Red Hat Enterprise Linux CoreOS; I downloaded that and referenced it as the template. It failed as before, but this time, as advised above, I tried running the `openshift-install wait-for install-complete` command again, and it is still failing the same way. Following is the output from running the command a second time:

INFO Waiting up to 30m0s for the cluster at https://api.walt-44rc2.brown-chesterfield.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-44rc2.brown-chesterfield.com: []
INFO Cluster operator authentication Progressing is Unknown with NoData:
INFO Cluster operator authentication Available is Unknown with NoData:
INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
INFO Cluster operator console Available is Unknown with NoData:
INFO Cluster operator dns Progressing is True with Reconciling: Not all DNS DaemonSets available.
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
Moving to release version "4.4.0-rc.2".
Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:32ff96f3da22ca4134ccfa46e898b8e910bced87cec4226d8f0a99e682083673".
INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
INFO Cluster operator insights Disabled is False with :
INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
INFO Cluster operator monitoring Available is False with :
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
INFO Cluster operator service-ca Progressing is True with _ManagedDeploymentsAvailable: Progressing: Progressing: service-ca does not have available replicas
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring
Cluster installation has failed, exit code was: 1

Also, the document referenced here, https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md, is the exact document we used to implement the Terraform. As I've stated before, this Terraform has been used successfully on previous OCP VMware installations of 4.2 and all its fixpacks, and 4.3 and all its fixpacks. The only release it is not working on is 4.4. Are there major installation changes in 4.4?
Note: in the previous comment, anywhere I say 4.4.0-rc.1, I meant 4.4.0-rc.2.
Since I saw the nightlies are now using a new RHCOS, I went ahead and tried latest-4.4 using machine-os 44.81.202003192230-0 Red Hat Enterprise Linux CoreOS. Same result:

module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-44nightly.brown-chesterfield.com: []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to release version "4.4.0-0.nightly-2020-03-23-010639".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e109b7e2afc8e208f157ea9334b0638420d9e2200d7110016db97046dfbc712".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator insights Disabled is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Available is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Cluster installation has failed, exit code was: 1
Don't know if you can get at it, but I will email dmace a pointer to the repository where our Terraform lives. Once cloned, go into the dir tf_openshift_4/vsphere-upi. We run it by doing:

terraform init
./install-openshift.sh

It's fairly complex, but like I said, this has been working in our vCenter env on 4.2 and 4.3, and is only broken with the nightly and rc builds of 4.4. So my assumption is there are still issues with 4.4 on VMware. We saw a similar pattern on 4.3, where we could not get any nightlies working on our vCenter until the rc builds came out; however, with 4.4 we cannot even get the 4.4.0-rc.X builds working. All VMs in the cluster have public 9-dot IPs. The DNS server is AWS Route 53. There is no load balancer, but we turn ingress routing on in all worker nodes.
We have also tried this on a VMware setup where we implemented a load balancer and a DNS server in a VM, and then put the OCP cluster completely on private IPs, with a separate VM used to route all requests through to the public IP side of the cluster. This scenario was implemented using VLANs in VMware. It has the same failures.
I see no issues with the latest nightly.

DEBUG Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-03-25-101443: 100% complete, waiting on authentication, monitoring
DEBUG Cluster is initialized
INFO Waiting up to 10m0s for the openshift-console route to be created...
DEBUG Route found in openshift-console namespace: console
DEBUG Route found in openshift-console namespace: downloads
DEBUG OpenShift console route is created
INFO Install complete!

[root@control-plane-0 ~]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba7b8a0f0a9d9a3420acefa2a569db9aab5d630d7c705c0f17df829e5badf3fd
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202003230949-0 (2020-03-23T09:54:01Z)

  ostree://f61524fda480c611dcd25629fd15eb6de27a306689261c211dbc8e88c19a5219
                   Version: 44.81.202001241431.0 (2020-01-24T14:36:48Z)
(In reply to krapohl from comment #4)
> I was given the following as the must-gather process
>
> #oc get co
> #oc get pods --all-namespaces
> #oc adm must-gather
>
> However, at this point the cluster is not at a point where you can run oc
> commands, so I cannot get the must-gather.

Can you attach the log from the failed must-gather command here? The API should be up, right?
(In reply to krapohl from comment #2)
> We use a terraform to create the cluster.
>
> Please provide information on how to get must-gather logs?

https://docs.openshift.com/container-platform/4.3/support/gathering-cluster-data.html
Just installed 4.4.0-rc.4 using machine-os 44.81.202003230949-0 Red Hat Enterprise Linux CoreOS. Same failures:

module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-rc4.brown-chesterfield.com: []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to release version "4.4.0-rc.4".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:981953eb5c642bbf69e3dd69d4cf0493f3b158360d1a1483b3ecb6f40b13005c".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator insights Disabled is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Available is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Cluster installation has failed, exit code was: 1

Error: error executing "/tmp/terraform_482330057.sh": Process exited with status 1
I've stated this before: we cannot do an oc login to the cluster in its current state. This is what I get when I do:

oc login api.walt-rc4.brown-chesterfield.com:6443 -u kubeadmin -p $(cat $ocp_dir/auth/kubeadmin-password) --insecure-skip-tls-verify=true > /dev/null
error: couldn't get https://api.walt-rc4.brown-chesterfield.com:6443/.well-known/oauth-authorization-server: unexpected response status 404

So running must-gather is not an option! I cannot run "oc" commands.
`oc login` has a dependency on oauth and thus won't work if that component is failing. Please use the admin kubeconfig at auth/kubeconfig rather than password-based auth to avoid the dependency on oauth.
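To make this concrete, here is a minimal sketch; the `use_admin_kubeconfig` helper and its install-dir argument are my invention, not an official interface. The installer writes the admin kubeconfig under the install directory's auth/ folder.

```shell
# Point KUBECONFIG at the installer-generated admin credentials. These use
# client certificates rather than oauth tokens, so oc keeps working even
# while the authentication operator is degraded.
use_admin_kubeconfig() {
  export KUBECONFIG="${1:-.}/auth/kubeconfig"
}

# Usage sketch (cluster commands shown as comments since they need a live cluster):
# use_admin_kubeconfig /path/to/install-dir
# oc get clusteroperators
# oc adm must-gather
```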
Created attachment 1673641 [details]
must-gather.tar.gz.partaa

Must-gather information (partaa), gathered with:

oc adm must-gather --config=ocp/auth/kubeconfig

Cluster ID from the command:

oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}' --config=ocp/auth/kubeconfig
6c852220-81a2-4ae0-bc54-0942e09438c2

Had to split the tar.gz into two parts, partaa and partab, because of the 19M upload file limitation. To put it back together:

cat must-gather.tar.gz.parta* > must-gather.tar.gz
Created attachment 1673642 [details]
must-gather.tar.gz.partab

Must-gather information (partab), gathered with:

oc adm must-gather --config=ocp/auth/kubeconfig

Cluster ID from the command:

oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}' --config=ocp/auth/kubeconfig
6c852220-81a2-4ae0-bc54-0942e09438c2

Had to split the tar.gz into two parts, partaa and partab, because of the 19M upload file limitation. To put it back together:

cat must-gather.tar.gz.parta* > must-gather.tar.gz
Just based on the must-gather, the reason ingress isn't running seems pretty clear. From openshift-ingress-operator/ingresscontrollers/default:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-03-25T22:16:23Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "13300"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: 2ce57c12-ba2d-4173-902e-25ac6a1719b3
spec:
  replicas: 2
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2020-03-25T22:16:24Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-03-25T22:16:30Z"
    message: 'The deployment is unavailable: Deployment does not have minimum availability.'
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-03-25T22:16:30Z"
    message: 'The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.'
    reason: DeploymentUnavailable
    status: "True"
    type: DeploymentDegraded

So, ingress is reporting degraded because the deployment isn't available. Looking at the deployment in openshift-ingress/deployments/router-default and the events in openshift-ingress, you can see that the reason the deployment is stuck is because the router pods are pending scheduling. The reason they're pending scheduling is because there are no worker nodes on which to schedule them. Looking at the node resources, I see only three master nodes and no workers. Without ready workers, ingress can't roll out.
Are you approving CSRs per the documentation? https://docs.openshift.com/container-platform/4.3/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere
Why aren't the workers becoming ready? Is there something wrong with the worker ignition files? The VMs are definitely created, have IP addresses, and from my experience are showing the correct CoreOS on the login screen. The boot sequence has completed successfully; we are just running "openshift-install wait-for install-complete" and waiting for the cluster to come up. I can log into each worker node using, for example, ssh -i id_ocp_key core.68.61. What additional information do you need to debug why the workers are not becoming ready?
Sorry, I didn't see the previous note about checking CSRs before entering my last comment. Working on that now.
Following the direction given previously about validating CSRs, the first thing I did was an `oc get no`, and I'm not seeing any worker nodes:

[root@walt-rc55555-keg ~]# oc get no --config=ocp/auth/kubeconfig
Flag --config has been deprecated, use --kubeconfig instead
NAME                             STATUS   ROLES    AGE   VERSION
ip-9-46-68-58.cluster.internal   Ready    master   16h   v1.17.1
ip-9-46-68-59.cluster.internal   Ready    master   16h   v1.17.1
ip-9-46-68-60.cluster.internal   Ready    master   16h   v1.17.1

So it does not appear the cluster has recognized the worker nodes per the doc. What is next?
Did you approve the CSRs?

oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve
Ran the command:

oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve

then `oc get no`, and started to see two of the worker nodes show up. I had to run the approve command multiple times before all the worker nodes showed up. Then I re-ran "openshift-install wait-for install-complete" again, and it looks like OCP is now up. Thank you for your help. Looks like we need some modifications in our Terraform automation to handle the unapproved CSRs; we never ran into this on 4.3 or 4.2.
Just another note: the documentation here, https://docs.openshift.com/container-platform/4.2/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere, has problems. Procedure 1 seems to imply that you must get to the point where your master and worker nodes are all in the Ready state before you go any further, when in fact the worker nodes do not go to the Ready state until you have approved all the CSRs first. Based on my experience, this section does not reflect how it really works and needs to be rewritten. I would not have progressed past Procedure 1, thinking there was some other problem I needed to solve to get all the workers visible and in the Ready state before I could approve the CSRs.
This is likely due to a change in kubelet behavior. Re-opening and assigning to the Node team; they'll likely dupe this, but I think we need clarity on the node approval workflow too, which I hope they can deliver.
Duping this issue to 1811225 makes no sense; its resolution was to dupe to this issue, 1811221. So in effect there is no resolution, because the two issues dupe to each other. Can we please have a resolution of the problem and an explanation, per Scot Dodson's request?
Do we have an explanation for why, in OCP 4.4, the command "openshift-install wait-for bootstrap-complete" no longer waits for all pending CSRs to be approved and all worker nodes to show the Ready state before it completes? In OCP 4.2 and 4.3, this is not how it worked. This seems like a new bug in the OCP 4.4 VMware UPI code.
I hit this issue too. The problem is that the issue is visible even when following the documentation. We have a script which does the following:

./openshift-install --dir "${CLUSTER_DIR}" wait-for bootstrap-complete
oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve || true
./openshift-install --dir "${CLUSTER_DIR}" wait-for install-complete

and we still hit the issue. It's necessary to run the approve command multiple times while './openshift-install --dir "${CLUSTER_DIR}" wait-for install-complete' is in progress.
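One workaround sketch for automation, not an official fix: keep approving pending CSRs in the background while wait-for install-complete runs. The `approve_pending_csrs` name, the 30-second poll interval, and the 20-minute default deadline are my choices; it assumes oc, jq, GNU xargs, and an admin KUBECONFIG are available.

```shell
# Repeatedly approve any pending CSRs until the deadline passes. Worker nodes
# may submit their CSRs only after bootstrap-complete returns, so a single
# approval pass is not enough.
approve_pending_csrs() {
  local deadline=$((SECONDS + ${1:-1200}))   # default: poll for 20 minutes
  while ((SECONDS < deadline)); do
    oc get csr -o json \
      | jq -r '.items[] | select(.status == {}) | .metadata.name' \
      | xargs -r oc adm certificate approve   # -r: do nothing if no CSRs pending (GNU xargs)
    sleep 30
  done
}

# Usage sketch:
# approve_pending_csrs 1200 &
# ./openshift-install --dir "${CLUSTER_DIR}" wait-for install-complete
# wait
```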
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
Please point to the fix within the advisory which fixes the problem?
We do not re-open bugs that were closed as part of the errata process under any circumstances. You're welcome to post a question and set needinfo but do not re-open the bug.
(In reply to krapohl from comment #51)
> Please point to the fix within the advisory which fixes the problem?

This issue was only in our CI environment. We were destroying the bootstrap VM and its address within IPAM, and a job would come after and reuse those addresses. The problem was that the API A record contained that bootstrap IP. This was causing random failures (depending on the DNS query). In the meantime, until a fix can be determined, we removed deletion of the bootstrap node, at least in CI. This past weekend, every version of OCP that was tested passed CI: https://prow.svc.ci.openshift.org/?job=*vsphere*
I have seen this issue on the ppc64le platform as well, via UPI installation on the latest 4.4 builds, so I'm wondering where the fix went in, and in what build?
The fact remains that the issue is still happening, whether or not you want the issue re-opened. You did not solve the problem.
(In reply to krapohl from comment #46)
(In reply to Filip Brychta from comment #48)
> Do we have an explanation on why in OCP 4.4 that the command
> "openshift-install wait-for bootstrap-complete" is now not waiting for all
> pending CSRs to be approved and all worker nodes are not showing Ready
> state, before it completes.
>
> In OCP 4.2 and 4.3, this is not how it worked.
>
> Seems like a new bug in OCP 4.4 Vmware UPI code.

On the bootstrap node, there is an approve-csr.service constantly running to monitor and approve any pending CSRs from the worker nodes via the API endpoint. This works flawlessly when the worker nodes generate their CSRs BEFORE bootstrapping completes, which is the case in pre-4.4 OCP deployments.

With OCP 4.4 (at least 4.4.3 on our bare-metal deployment), however, this is no longer the case: either it takes longer for the worker nodes to reach the stage of submitting CSRs, or the 4.4 bootstrapping ends sooner than before. This causes the CSRs to get stuck pending forever, and the installation eventually fails if you don't catch the moment to manually approve the CSRs with the oc command. This is not acceptable for our automated cluster build process.
Knowing the root cause, I've come up with this simple patch, which will modify the approve-csr.sh script to stay active for another 20 minutes after bootstrap ends:

CSR_PATCH=$(jq -r '.storage.files[]|select(.path=="/usr/local/bin/approve-csr.sh")|.contents.source|split(",")[1]' bootstrap.ign | base64 -d | sed -e 's/-f/$(find/' -e 's/-f/$(find/' -e 's/]/-type f -mmin +20) ]/' -e 's/{1}/{1%-loopback}/' | base64 -w0)

IGN=$(cat << EOF
{
  "overwrite": true,
  "mode": 365,
  "filesystem": "root",
  "path": "/usr/local/bin/approve-csr.sh",
  "contents": {
    "source": "data:text/plain;charset=utf-8;base64,$CSR_PATCH"
  },
  "user": {
    "name": "root"
  }
}
EOF
)

jq '.storage.files += '"[$IGN]" bootstrap.ign > bootstrap-new.ign
cp bootstrap-new.ign bootstrap.ign

(You can include this patch in your cluster build process easily. Use at your own risk.)