Bug 1811221 - During installing of 4.4.0-0.nightly-2020-03-06-030852 on VMware getting ERROR Cluster operator authentication Degraded
Summary: During installing of 4.4.0-0.nightly-2020-03-06-030852 on VMware getting ERRO...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1811225 (view as bug list)
Depends On:
Blocks: 1731242
 
Reported: 2020-03-06 21:59 UTC by krapohl
Modified: 2020-05-07 21:41 UTC
17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The kubelet would erroneously exit on startup when CSI storage timed out. Fix: CSI bring-up now waits for API readiness.
Clone Of:
Environment:
Last Closed: 2020-05-04 13:07:49 UTC
Target Upstream Version:
Embargoed:
krapohl: needinfo-


Attachments
must-gather.tar.gz.partaa (12.00 MB, application/gzip)
2020-03-25 23:23 UTC, krapohl
no flags Details
must-gather.tar.gz.partab (11.65 MB, application/octet-stream)
2020-03-25 23:24 UTC, krapohl
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:46:11 UTC

Description krapohl 2020-03-06 21:59:58 UTC
Description of problem:
During installation of 4.4.0-0.nightly-2020-03-06-030852 on VMware, using the RHCOS image rhcos-44.81.202002241126-0-vmware.x86_64.ova, I'm getting the following error

after issuing this command:
openshift-install wait-for install-complete

The error is:

module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-vlan44.brown-chesterfield.com: []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to release version "4.4.0-0.nightly-2020-03-06-030852".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b039973cb8ade25cdc53e05f2336f402263a2e1bed976a2a80e60986fbc90a6b".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator insights Disabled is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.ope

Comment 1 Abhinav Dahiya 2020-03-06 22:06:25 UTC
Can you describe how you are creating the cluster? The logs don't look familiar to me.

Secondly, I'm moving this to the Auth team. Please attach the must-gather logs for debugging.

Comment 2 krapohl 2020-03-06 22:12:07 UTC
We use Terraform to create the cluster.

Can you provide information on how to get must-gather logs?

Comment 3 krapohl 2020-03-06 22:15:25 UTC
*** Bug 1811225 has been marked as a duplicate of this bug. ***

Comment 4 krapohl 2020-03-09 15:43:09 UTC
I was given the following as the must-gather process:

#oc get co
#oc get pods --all-namespaces 
#oc adm must-gather

However, the cluster is not at a point where you can run oc commands, so I cannot get the must-gather.

Comment 5 krapohl 2020-03-09 20:18:18 UTC
Turned on debug for the openshift-install wait-for command:

openshift-install wait-for install-complete --log-level="debug"

Failure output with debug on:

module.ocp4_finish_up.null_resource.do_action[0]: Still creating... [30m0s elapsed]
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-t44two.brown-chesterfield.com: []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to release version "4.4.0-0.nightly-2020-03-09-120006".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fec9371baf9082b513045d76becde894b6d5b57581365649664e22939d3e9faa".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator insights Disabled is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Available is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Cluster installation has failed, exit code was: 1

Comment 6 krapohl 2020-03-09 20:19:23 UTC
Is there a way to increase the wait time used by the openshift-install wait-for command?

Comment 7 Standa Laznicka 2020-03-10 09:02:49 UTC
That's really a question for someone from the installer team; I do not know.

Also, when you say you are using Terraform to create the cluster, does that mean you are not using the openshift-installer?

Comment 8 krapohl 2020-03-10 12:54:26 UTC
As I've stated above a couple of times, we do call the openshift-install command, and the results of doing that are shown above.

Terraform is a way to wrap the openshift-install command. As you know, VMware is a UPI install option, which means the VMs are not created as part of the install process. We use Terraform to automate the creation of the VMs and then call the openshift-install commands within scripts in Terraform.

Comment 9 krapohl 2020-03-10 14:40:52 UTC
The Terraform we use has worked perfectly for installing 4.3 GA, 4.2, and nightly builds in the past. The only things we are changing are which 4.4 nightly to pick up and which VMware template to use (it comes from the downloaded OVA file pointed to here: https://github.com/openshift/installer/blob/master/data/data/rhcos.json#L123-L127).

Comment 11 isaic 2020-03-12 21:02:21 UTC
We continue to see this error. 


up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server

It goes all the way back to our Feb 10th 4.4 nightly install attempts on VMware. It was marked as a duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1802678

That one is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1801898 as Status: ON_QA → VERIFIED; its last entry shows it "blocked" by https://bugzilla.redhat.com/show_bug.cgi?id=1798945, which shows CLOSED CURRENTRELEASE. It's hard to know for sure the status of this particular issue (No subsets found for the endpoints of oauth-server) that we continue to see.


This occurred today trying it against 4.4.0-rc.0  on VMware. 

We are also using CoreOS rhcos-44.81.202003110830-0.

This is all coming from here:
https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.4.0-rc.0/


In the release.txt it references:
Component Versions:
  kubernetes 1.17.1               
  machine-os 44.81.202003110830-0 Red Hat Enterprise Linux CoreOS

Are these the correct CoreOS levels to be using?

This is blocking our ability to install OCP 4.4 on VMware for teams that need it for ongoing test/dev.

Comment 12 krapohl 2020-03-13 15:56:22 UTC
Still getting this error on the latest nightly, 4.4.0-0.nightly-2020-03-13-073111, using machine-os 44.81.202003130330-0 Red Hat Enterprise Linux CoreOS.

The error is:

 ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server


Please advise on your next debug steps or any additional information you need.

Comment 13 krapohl 2020-03-17 20:44:31 UTC
@Abhinav Dahiya Can we get an update on this issue? It has been sitting for a long time.

Comment 14 Standa Laznicka 2020-03-18 10:34:30 UTC
Looks like the installer folks are ignoring the question from comment 7; let me move the BZ to their component, hopefully that will help them notice it.

Comment 15 Scott Dodson 2020-03-18 12:22:20 UTC
(In reply to krapohl from comment #6)
> Is there a way to increase the wait time used by the openshift-install
> wait-for command?

No, just run it again.


Auth is complaining about the route being degraded and thus it should be triaged by the Routing team if the Auth team is not sure why that's happening.
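
To script the "just run it again" approach, something like the following works (a sketch only; the retry count is an arbitrary assumption, and --log-level=debug is optional):

# Each invocation waits up to another 30 minutes, so retry a few times before giving up.
for attempt in 1 2 3; do
  openshift-install wait-for install-complete --log-level=debug && break
  echo "install-complete wait failed (attempt ${attempt}), retrying..."
done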

Comment 16 Dan Mace 2020-03-18 13:45:16 UTC
The nature of UPI installations limits our ability to diagnose your issue, and we need more detailed information about your specific environment. Areas that can impact ingress and DNS (and by extension auth) in a UPI installation are broad, and include:

* VPC configuration
* Security group configuration
* Load balancer configuration
* DNS configuration

Can you provide any of the following?

1. Exact steps and Terraform assets to reproduce your cluster (preferred)
2. The Terraform module(s) used to satisfy all the networking requirements laid out in the UPI guide[1]
3. More detailed info about each of the components above to audit against the UPI requirements

[1] https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md

Comment 17 krapohl 2020-03-20 13:57:48 UTC
I have now tried the 4.4.0-rc.1 build, which said it used machine-os 44.81.202003161031-0 Red Hat Enterprise Linux CoreOS; I downloaded that and referenced it as the template. It failed as before, but this time, as advised above, I tried running the openshift-install wait-for install-complete command again, and it looks like it is still failing the same way. Following is the output from running the command a second time.

INFO Waiting up to 30m0s for the cluster at https://api.walt-44rc2.brown-chesterfield.com:6443 to initialize... 
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-44rc2.brown-chesterfield.com: [] 
INFO Cluster operator authentication Progressing is Unknown with NoData:  
INFO Cluster operator authentication Available is Unknown with NoData:  
INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host [] 
INFO Cluster operator console Available is Unknown with NoData:  
INFO Cluster operator dns Progressing is True with Reconciling: Not all DNS DaemonSets available. 
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default 
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
Moving to release version "4.4.0-rc.2".
Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:32ff96f3da22ca4134ccfa46e898b8e910bced87cec4226d8f0a99e682083673". 
INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available. 
INFO Cluster operator insights Disabled is False with :  
INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available 
INFO Cluster operator monitoring Available is False with :  
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack. 
ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main 
INFO Cluster operator service-ca Progressing is True with _ManagedDeploymentsAvailable: Progressing: 
Progressing: service-ca does not have available replicas 
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring 
Cluster installation has failed, exit code was: 1


Also, the document referenced here, https://github.com/openshift/installer/blob/master/docs/user/vsphere/install_upi.md, is the exact document we used to implement the Terraform. As I've stated before, this Terraform has been successfully used on previous OCP VMware installations of 4.2 and all its fixpacks, and 4.3 and all its fixpacks. The only release it is not working on is 4.4. Are there major installation changes for 4.4?

Comment 18 krapohl 2020-03-20 14:01:16 UTC
Note: in the previous comment, anywhere I say 4.4.0-rc.1, I meant 4.4.0-rc.2.

Comment 19 krapohl 2020-03-23 14:24:42 UTC
Since I saw the nightlies are now using a new RHCOS, I went ahead and tried latest-4.4 using machine-os 44.81.202003192230-0 Red Hat Enterprise Linux CoreOS.

Same result:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-44nightly.brown-chesterfield.com: []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to release version "4.4.0-0.nightly-2020-03-23-010639".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e109b7e2afc8e208f157ea9334b0638420d9e2200d7110016db97046dfbc712".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator insights Disabled is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Available is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Cluster installation has failed, exit code was: 1

Comment 20 krapohl 2020-03-23 15:04:02 UTC
I don't know if you can get at it, but I will email dmace a pointer to the repository where our Terraform lives.

Once cloned, go into the dir tf_openshift_4/vsphere-upi. We run it with terraform init followed by ./install-openshift.sh.

It's fairly complex, but like I said, this has been working in our vCenter env on 4.2 and 4.3, and is only broken with the nightly and RC builds of 4.4. So my assumption is there are still issues with 4.4 on VMware.

We saw a similar pattern on 4.3, where we could not get any nightlies working on our vCenter until the RC builds came out; however, on 4.4 it seems we cannot even get 4.4.0-rc.X working either.


All VMs in the cluster have public 9-dot IP addresses.
The DNS server is AWS Route 53.
There is no load balancer, but we turn ingress routing on in all worker nodes.

Comment 21 krapohl 2020-03-23 15:33:12 UTC
We have also tried this on a VMware setup where we have implemented a load balancer and a DNS server in a VM, and then put the OCP cluster completely on private IPs, with a separate VM used to route all requests to the public IP side of the cluster. This scenario was implemented using VLANs in VMware. It has the same failures.

Comment 22 Joseph Callen 2020-03-25 16:46:14 UTC
I see no issues with the latest nightly.  


DEBUG Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-03-25-101443: 100% complete, waiting on authentication, monitoring
DEBUG Cluster is initialized                       
INFO Waiting up to 10m0s for the openshift-console route to be created... 
DEBUG Route found in openshift-console namespace: console 
DEBUG Route found in openshift-console namespace: downloads 
DEBUG OpenShift console route is created           
INFO Install complete!                            

[root@control-plane-0 ~]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba7b8a0f0a9d9a3420acefa2a569db9aab5d630d7c705c0f17df829e5badf3fd
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202003230949-0 (2020-03-23T09:54:01Z)

  ostree://f61524fda480c611dcd25629fd15eb6de27a306689261c211dbc8e88c19a5219
                   Version: 44.81.202001241431.0 (2020-01-24T14:36:48Z)

Comment 23 Abhinav Dahiya 2020-03-25 16:55:09 UTC
(In reply to krapohl from comment #4)
> I was given the following as the must-gather process
> 
> #oc get co
> #oc get pods --all-namespaces 
> #oc adm must-gather
> 
> However, at this point the cluster is not at a point where you can run oc
> commands, so I cannot get the must-gather.

Can you attach the log from the failed must-gather command here? The API should be up, right?

Comment 24 Abhinav Dahiya 2020-03-25 16:57:32 UTC
(In reply to krapohl from comment #2)
> We use a terraform to create the cluster. 
> 
> Please provide information on how to get must-gather logs?

https://docs.openshift.com/container-platform/4.3/support/gathering-cluster-data.html

Comment 25 krapohl 2020-03-25 18:44:30 UTC
Just installed 4.4.0-rc.4 using machine-os 44.81.202003230949-0 Red Hat Enterprise Linux CoreOS.
Same failures.

module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.walt-rc4.brown-chesterfield.com: []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Progressing is True with RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator console Available is Unknown with NoData:
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to release version "4.4.0-rc.4".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:981953eb5c642bbf69e3dd69d4cf0493f3b158360d1a1483b3ecb6f40b13005c".
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator insights Disabled is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Available is False with :
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring
module.ocp4_finish_up.null_resource.do_action[0] (remote-exec): Cluster installation has failed, exit code was: 1


Error: error executing "/tmp/terraform_482330057.sh": Process exited with status 1

Comment 26 krapohl 2020-03-25 19:06:13 UTC
I've stated this before: we cannot do an oc login to the cluster in its current state.

This is what I get when I do:

oc login api.walt-rc4.brown-chesterfield.com:6443 -u kubeadmin -p $(cat $ocp_dir/auth/kubeadmin-password) --insecure-skip-tls-verify=true > /dev/null


error: couldn't get https://api.walt-rc4.brown-chesterfield.com:6443/.well-known/oauth-authorization-server: unexpected response status 404

So running must-gather is not an option!

I cannot run "oc" commands.

Comment 27 Scott Dodson 2020-03-25 19:44:04 UTC
`oc login` has a dependency on oauth and thus won't work if that component is failing. Please use the admin kubeconfig at auth/kubeconfig rather than password-based auth to avoid the dependency on oauth.
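
For example, with the layout used later in this bug (install assets under ./ocp; the exact path is an assumption):

export KUBECONFIG=ocp/auth/kubeconfig
oc get clusteroperators
oc adm must-gather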

Comment 28 krapohl 2020-03-25 23:23:27 UTC
Created attachment 1673641 [details]
must-gather.tar.gz.partaa

Must-gather information (part aa)

oc adm must-gather --config=ocp/auth/kubeconfig

Cluster ID from command: oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}' --config=ocp/auth/kubeconfig 

6c852220-81a2-4ae0-bc54-0942e09438c2

Had to split the tar.gz into two parts, partaa and partab, because of the upload file limitation of 19M.

To put it back together:
cat must-gather.tar.gz.parta* > must-gather.tar.gz
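
For reference, parts of this size could be produced with something like the following (the exact command used here is an assumption):

split -b 12M must-gather.tar.gz must-gather.tar.gz.part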

Comment 29 krapohl 2020-03-25 23:24:35 UTC
Created attachment 1673642 [details]
must-gather.tar.gz.partab

Must-gather information (part ab)

oc adm must-gather --config=ocp/auth/kubeconfig

Cluster ID from command: oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}' --config=ocp/auth/kubeconfig 

6c852220-81a2-4ae0-bc54-0942e09438c2

Had to split the tar.gz into two parts, partaa and partab, because of the upload file limitation of 19M.

To put it back together:
cat must-gather.tar.gz.parta* > must-gather.tar.gz

Comment 30 Dan Mace 2020-03-26 12:31:09 UTC
Just based on the must-gather, the reason ingress isn't running seems pretty clear.

From openshift-ingress-operator/ingresscontrollers/default:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-03-25T22:16:23Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "13300"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: 2ce57c12-ba2d-4173-902e-25ac6a1719b3
spec:
  replicas: 2
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2020-03-25T22:16:24Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-03-25T22:16:30Z"
    message: 'The deployment is unavailable: Deployment does not have minimum availability.'
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-03-25T22:16:30Z"
    message: 'The deployment has Available status condition set to False (reason:
      MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.'
    reason: DeploymentUnavailable
    status: "True"
    type: DeploymentDegraded

So, ingress is reporting degraded because the deployment isn't available. Looking at the deployment in openshift-ingress/deployments/router-default and the events in openshift-ingress, you can see that the deployment is stuck because the router pods are pending scheduling. They're pending scheduling because there are no worker nodes on which to schedule them.

Looking at the node resources, I see only three master nodes and no workers.

Without ready workers, ingress can't roll out.
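
A few commands that would confirm this picture from the cluster side (illustrative, using the admin kubeconfig as described in comment 27):

oc get nodes                              # expect only the three masters, no workers
oc -n openshift-ingress get pods -o wide  # expect router-default pods stuck in Pending
oc get csr                                # expect worker node CSRs in Pending state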

Comment 32 krapohl 2020-03-26 14:30:24 UTC
Why aren't the workers becoming ready? Is there something wrong with the worker ignition files? The VMs are definitely created and have IP addresses, and from my experience they are showing the correct CoreOS on the login screen. The boot sequence has completed successfully; we are just running "openshift-install wait-for install-complete" and waiting for the cluster to come up.

I can log into each worker node using ssh -i id_ocp_key core.68.61 for example. 

What additional information do you need to debug why the workers are not becoming ready?

Comment 33 krapohl 2020-03-26 14:34:25 UTC
Sorry, I didn't see the previous note on checking CSRs before entering my last comment. Working on that now.

Comment 34 krapohl 2020-03-26 14:42:32 UTC
Following the directions given previously about validating CSRs, the first thing I did was an oc get no, and I'm not seeing any worker nodes.

[root@walt-rc55555-keg ~]# oc get no --config=ocp/auth/kubeconfig 
Flag --config has been deprecated, use --kubeconfig instead
NAME                             STATUS   ROLES    AGE   VERSION
ip-9-46-68-58.cluster.internal   Ready    master   16h   v1.17.1
ip-9-46-68-59.cluster.internal   Ready    master   16h   v1.17.1
ip-9-46-68-60.cluster.internal   Ready    master   16h   v1.17.1

So it does not appear the cluster has recognized the worker nodes per the doc.

What is next?

Comment 35 Joseph Callen 2020-03-26 16:17:56 UTC
Did you approve the CSRs?
oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve

Comment 36 krapohl 2020-03-26 19:20:12 UTC
Ran the 

oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve

then 

oc get no

and started to see two of the worker nodes show up. I had to run the above approve-csr command multiple times before all the worker nodes showed up.

Then I re-ran "openshift-install wait-for install-complete" and it looks like OCP is now up.


Thank you for your help. It looks like we need some modifications in our Terraform automation to handle the unapproved CSRs. We never ran into this on 4.3 and 4.2.

Comment 37 krapohl 2020-03-26 19:29:07 UTC
Just another note: it seems like the documentation here, https://docs.openshift.com/container-platform/4.2/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere, has problems.

Procedure 1 seems to imply that you must get to the point where your master and worker nodes are all in the ready state before you go any further. 

In fact, the worker nodes do not go to the Ready state until you have approved all the CSRs first.

This section does not reflect how it really works and needs to be re-written, based on my experience. I would not have progressed past procedure 1, thinking there was some other problem I needed to solve to get all the workers visible and in ready state before I could approve the CSRs.

Comment 38 Scott Dodson 2020-03-30 13:30:03 UTC
This is likely due to a change in kubelet behavior. Re-opening and assigning to the Node team; they'll likely dupe this, but I think we need clarity on the node approval workflow too, which I hope they can deliver.

Comment 42 krapohl 2020-03-31 20:22:26 UTC
Duping this issue to 1811225 makes no sense. Its resolution was to dupe it to this issue, 1811221. So in effect there is no resolution, because the two issues are duped to each other in a cycle.

Can we please have a resolution of the problem and an explanation, per Scott Dodson's request?

Comment 46 krapohl 2020-04-06 17:15:13 UTC
Do we have an explanation for why, in OCP 4.4, the command "openshift-install wait-for bootstrap-complete" no longer waits for all pending CSRs to be approved and all worker nodes to show the Ready state before it completes?

In OCP 4.2 and 4.3, this is not how it worked. 

Seems like a new bug in the OCP 4.4 VMware UPI code.

Comment 48 Filip Brychta 2020-04-20 15:53:41 UTC
I hit this issue too.
The problem is that this issue occurs even when following the documentation.

We have a script which does the following:
./openshift-install --dir "${CLUSTER_DIR}" wait-for bootstrap-complete
oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve || true
./openshift-install --dir "${CLUSTER_DIR}" wait-for install-complete

and we still hit the issue.

It's necessary to run 'oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve'
multiple times when './openshift-install --dir "${CLUSTER_DIR}" wait-for install-complete' is in progress.
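
One way to avoid that race (a sketch only; the 30-minute window and 30-second interval are arbitrary assumptions) is to keep the approval loop running in the background while the install-complete wait is in progress:

# Approve any pending CSRs every 30 seconds for up to 30 minutes, in the background.
(
  for i in $(seq 1 60); do
    oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve || true
    sleep 30
  done
) &
./openshift-install --dir "${CLUSTER_DIR}" wait-for install-complete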

Comment 50 errata-xmlrpc 2020-05-04 11:45:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 51 krapohl 2020-05-04 12:58:00 UTC
Can you point to the fix within the advisory that fixes the problem?

Comment 52 Scott Dodson 2020-05-04 13:07:49 UTC
We do not re-open bugs that were closed as part of the errata process under any circumstances. You're welcome to post a question and set needinfo but do not re-open the bug.

Comment 53 Joseph Callen 2020-05-04 13:31:45 UTC
(In reply to krapohl from comment #51)
> Please point to the fix within the advisory which fixes the problem?

This issue was only in our CI environment. We were destroying the bootstrap VM and releasing its address within IPAM, and a job would come after and reuse that address. The problem was that the api A record still contained that bootstrap IP. This was causing random (depending on the DNS query) failures. In the meantime, until a fix can be determined, we removed deletion of the bootstrap node, at least in CI.

This past weekend, every OCP version that was tested passed CI:

https://prow.svc.ci.openshift.org/?job=*vsphere*
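
If you suspect the same stale-record problem, a quick check (hypothetical hostname; substitute your cluster's API name) is to verify the api A record no longer resolves to the bootstrap node's IP:

dig +short api.walt-rc4.brown-chesterfield.com
# Compare the returned addresses against the master and bootstrap IPs in your environment.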

Comment 54 mkumatag 2020-05-06 09:49:39 UTC
I have seen this issue on the ppc64le platform as well, via UPI installation on the latest 4.4 builds, so I'm wondering where the fix went in and in what build?

Comment 55 krapohl 2020-05-06 12:11:50 UTC
The fact remains that the issue is still happening, whether or not you want the issue re-opened. You did not solve the problem.

Comment 56 Anthony Chen 2020-05-07 21:41:50 UTC
(In reply to krapohl from comment #46)
(In reply to Filip Brychta from comment #48)
> Do we have an explanation on why in OCP 4.4 that the command
> "openshift-install wait-for bootstrap-complete" is now not waiting for all
> pending CSRs to be approved and all worker nodes are not showing Ready
> state, before it completes.
> 
> In OCP 4.2 and 4.3, this is not how it worked. 
> 
> Seems like a new bug in OCP 4.4 Vmware UPI code.

On the bootstrap node, there is an approve-csr.service constantly running to monitor and approve any pending CSRs from the worker nodes via the API endpoint. This works flawlessly when the worker nodes generate CSRs >>BEFORE<< the bootstrapping completes, which is the case in pre-4.4 OCP deployments. With OCP 4.4 (at least 4.4.3 on our baremetal deployment), however, this is no longer the case - either it takes longer for the worker nodes to reach the stage of submitting the CSRs, or the 4.4 bootstrapping ends sooner than before. This causes the CSRs to get stuck in Pending forever, and installation eventually fails if you don't catch the moment to manually approve the CSRs with the oc command. This is not acceptable for our automated cluster build process. Knowing the root cause, I've come up with this simple patch, which modifies the approve-csr.sh script to stay active for another 20 minutes after bootstrap ends:

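# Per the description above: decode the stock approve-csr.sh from bootstrap.ign, rewrite its
# exit test so the loop keeps approving CSRs until the bootstrap "done" marker file is at
# least 20 minutes old (find ... -type f -mmin +20), then base64-encode the patched script.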
CSR_PATCH=$(jq -r '.storage.files[]|select(.path=="/usr/local/bin/approve-csr.sh")|.contents.source|split(",")[1]' bootstrap.ign | base64 -d | sed -e 's/-f/$(find/' -e 's/-f/$(find/' -e 's/]/-type f -mmin +20) ]/' -e 's/{1}/{1%-loopback}/' | base64 -w0)

IGN=$(cat << EOF
{
  "overwrite" : true,
  "mode" : 365,
  "filesystem" : "root",
  "path" : "/usr/local/bin/approve-csr.sh",
  "contents" : { "source" : "data:text/plain;charset=utf-8;base64,$CSR_PATCH" },
  "user" : {
  "name" : "root"
  }
}
EOF
)

jq '.storage.files += '"[$IGN]" bootstrap.ign > bootstrap-new.ign

cp bootstrap-new.ign bootstrap.ign

(You can include this patch in your cluster build process easily. Use at your own risk.)

