Description of problem:

1. For the configuration of a 3 node converged cluster for OCP 4.8 on Z for KVM, the installation does not complete because a number of cluster operators never finish rolling out.
2. The same configuration of a 3 node converged cluster for OCP 4.8 on Z for zVM consistently completes installation.
3. This installation issue does not appear to be a KVM resource issue, as multiple CPU, memory, and storage configurations have been tested for the 3 master nodes on KVM, including environments where these resources are significantly greater than the 3 node minimum configuration resource recommendations.
4. This installation issue does appear to be tied to requiring at least 2 nodes dedicated to the worker role (only); it does not work when the master nodes fulfill both the master and worker roles.

Version-Release number of selected component (if applicable):

Multiple 4.8.0 nightly builds, including current builds, over the past 2+ weeks.

How reproducible:

1. Given extensive testing on different KVM LPARs/hypervisors on 2 different z15 servers, this is straightforward to reproduce with any sufficient CPU/memory configuration with only 3 master nodes that are schedulable (as worker nodes).
2. This issue looks to be specific to the KVM environment/configuration and not the CPU, memory, and/or storage (disk) resources involved, as our testing has been conducted with significant amounts of these resources (including above the minimum required).
3. This issue appears to be tied to some factor(s) involved in having dedicated worker nodes fulfill the "worker" role, as opposed to the 3 master nodes fulfilling both the "master" and "worker" roles.

Steps to Reproduce:

1. Configure an OCP 4.8 on Z KVM environment with 3 master nodes and 0 worker nodes.
2. Initiate the OCP 4.8 on Z installation with the 3 master nodes as "mastersSchedulable" (a brief illustrative sketch of this topology follows the command output at the end of this report).
3. The installation will not successfully complete. Please see some specifics under the "Additional info" section.

Actual results:

1. The OCP 4.8 on Z KVM install does not successfully complete: the status is stuck forever at "some cluster operators have not yet rolled out", per the "oc get clusterversion" command executed on the KVM bastion node.
2. For KVM environments, when installing with 3 master nodes and 0 worker nodes, where the master nodes fulfill both the master and worker roles:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and are in a degraded state.
   3. The ingress cluster operator is available but in a degraded state.
   4. The monitoring cluster operator successfully rolls out and is available and not in a degraded state. (It appears that the monitoring cluster operator needs a minimum of 2 nodes fulfilling the worker role to successfully roll out, whether the worker role is fulfilled by the master nodes and/or dedicated worker nodes.)
3. For KVM environments, when installing with 3 master nodes and 1 worker node, with dedicated roles per node so that no node fulfills both the master and worker roles:
   1. The OCP 4.8 installation does not complete (as is expected).
   2. The authentication and console cluster operators become available.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator does not roll out (does not become available).
4. For KVM environments, when installing with 3 master nodes and 1 worker node, where the master nodes fulfill both the master and worker roles and there is 1 worker node dedicated to the worker role:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and are in a degraded state.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator successfully rolls out.
   Note:
   1. This configuration seems to reinforce that for the authentication and console cluster operators to become available, and for the ingress operator to not be degraded, 2 nodes dedicated to the worker role (only) are required.
5. For KVM environments, when installing with 3 master nodes and 2 worker nodes, where the master nodes fulfill both the master and worker roles and there are 2 worker nodes dedicated to the worker role:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and are in a degraded state.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator successfully rolls out.
   Notes:
   1. This configuration seems to reinforce that for the authentication and console cluster operators to become available, and for the ingress operator to not be degraded, 2 nodes dedicated to the worker role (only) are required.
   2. This configuration also seems to suggest that when the master nodes fulfill both the master and worker roles, even when there are 2 other nodes fulfilling the worker role only, the master nodes interfere with or prevent the proper functioning of the dedicated worker nodes' role for the authentication, console, and ingress cluster operators.
6. For KVM environments, when installing with 3 master nodes and 2 worker nodes, with dedicated roles per node so that no node fulfills both the master and worker roles:
   1. The OCP 4.8 installation completes successfully (as expected).
   2. The authentication and console cluster operators become available.
   3. The ingress cluster operator status is not degraded.
   4. The monitoring cluster operator successfully rolls out.
7. Please let me know if the #4 and #5 install scenarios listed above should constitute a separate bugzilla defect.

Expected results:

1. The OCP 4.8 on Z KVM install should successfully complete, as it does in a standard KVM configuration with 3 dedicated master nodes and 2 dedicated worker nodes, and in an OCP 4.8 on Z zVM environment with 3 master nodes only.

Additional info:

1. Here is the output of the "oc get clusterversion", "oc get nodes", and "oc get co" commands from the KVM bastion node for a 3 node converged cluster.
# oc get clusterversion
NAME   VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version           False   True   176m   Unable to apply 4.8.0-0.nightly-s390x-2021-04-26-154342: some cluster operators have not yet rolled out

# oc get nodes
NAME   STATUS   ROLES   AGE   VERSION
master-0.pok-242.ocptest.pok.stglabs.ibm.com   Ready   master,worker   173m   v1.21.0-rc.0+6143dea
master-1.pok-242.ocptest.pok.stglabs.ibm.com   Ready   master,worker   173m   v1.21.0-rc.0+6143dea
master-2.pok-242.ocptest.pok.stglabs.ibm.com   Ready   master,worker   173m   v1.21.0-rc.0+6143dea

# oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-s390x-2021-04-26-154342   False   False   True   170m
baremetal   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   169m
cloud-credential   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   174m
cluster-autoscaler   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   169m
config-operator   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
console   4.8.0-0.nightly-s390x-2021-04-26-154342   False   True   True   164m
csi-snapshot-controller   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
dns   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   169m
etcd   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   168m
image-registry   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   165m
ingress   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   True   117m
insights   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   164m
kube-apiserver   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   164m
kube-controller-manager   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   168m
kube-scheduler   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   168m
kube-storage-version-migrator   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
machine-api   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   169m
machine-approver   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
machine-config   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   169m
marketplace   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
monitoring   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   116m
network   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
node-tuning   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
openshift-apiserver   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   87m
openshift-controller-manager   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   169m
openshift-samples   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   164m
operator-lifecycle-manager   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   169m
operator-lifecycle-manager-catalog   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   85m
service-ca   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m
storage   4.8.0-0.nightly-s390x-2021-04-26-154342   True   False   False   170m

Thank you.
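As an illustrative aside (not taken from this report), the converged topology described in the "Steps to Reproduce" is normally expressed by setting compute replicas to 0 at install time, which is what leaves the masters schedulable; the install-config excerpt and file path below are hypothetical, and the second command confirms the resulting setting on the running cluster:

$ grep -A2 '^compute:' install-config.yaml     # illustrative path; replicas: 0 is what makes the masters carry the worker role
compute:
- name: worker
  replicas: 0
$ oc get schedulers.config.openshift.io cluster -o jsonpath='{.spec.mastersSchedulable}'
true

With compute replicas at 0, the installer marks the control-plane nodes schedulable, which matches the master,worker roles shown in the node listing above.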
Folks,

Please note that we have now also determined that for OCP 4.8 on Z for zVM, scenarios #4 and #5 listed above in the "Actual results:" section also apply. Specifically:

1. For zVM environments, when installing with 3 master nodes and 1 worker node, where the master nodes fulfill both master and worker roles, and there is 1 worker node dedicated to the worker role:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and in degraded state.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator successfully rolls out.
   5. The network cluster operator is not available.
   Note:
   1. This configuration seems to reinforce that for the authentication and console cluster operators to become available, and the ingress operator to not be degraded, 2 nodes dedicated to the worker role (only) are required.

# oc get clusterversion
NAME   VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version           False   True   27m   Unable to apply 4.8.0-0.nightly-s390x-2021-04-26-125220: some cluster operators have not yet rolled out

# oc get nodes
NAME   STATUS   ROLES   AGE   VERSION
master-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready   master,worker   26m   v1.21.0-rc.0+6143dea
master-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready   master,worker   26m   v1.21.0-rc.0+6143dea
master-2.pok-25.ocptest.pok.stglabs.ibm.com   Ready   master,worker   26m   v1.21.0-rc.0+6143dea
worker-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready   worker   16m   v1.21.0-rc.0+6143dea

# oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-s390x-2021-04-26-125220   False   False   True   22m
baremetal   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   21m
cloud-credential   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   26m
cluster-autoscaler   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   21m
config-operator   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   22m
console   4.8.0-0.nightly-s390x-2021-04-26-125220   False   True   True   15m
csi-snapshot-controller   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   20m
dns   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   20m
etcd   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   20m
image-registry   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   16m
ingress   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   True   16m
insights   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   15m
kube-apiserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True   True   False   18m
kube-controller-manager   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   18m
kube-scheduler   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   20m
kube-storage-version-migrator   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   21m
machine-api   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   21m
machine-approver   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   21m
machine-config   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   20m
marketplace   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   21m
monitoring   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   7m
network              False   True   True   26m
node-tuning   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   16m
openshift-apiserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   16m
openshift-controller-manager   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   19m
openshift-samples   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   15m
operator-lifecycle-manager   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   20m
operator-lifecycle-manager-catalog   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   20m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   16m
service-ca   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   21m
storage   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   22m

2. For zVM environments, when installing with 3 master nodes and 2 worker nodes, where the master nodes fulfill both master and worker roles, and there are 2 worker nodes dedicated to the worker role:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and in degraded state.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator successfully rolls out.

# oc get clusterversion
NAME   VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version           False   True   4h44m   Unable to apply 4.8.0-0.nightly-s390x-2021-04-26-125220: some cluster operators have not yet rolled out

# oc get nodes
NAME   STATUS   ROLES   AGE   VERSION
master-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready   master,worker   4h43m   v1.21.0-rc.0+6143dea
master-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready   master,worker   4h43m   v1.21.0-rc.0+6143dea
master-2.pok-25.ocptest.pok.stglabs.ibm.com   Ready   master,worker   4h43m   v1.21.0-rc.0+6143dea
worker-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready   worker   4h30m   v1.21.0-rc.0+6143dea
worker-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready   worker   4h30m   v1.21.0-rc.0+6143dea

# oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-s390x-2021-04-26-125220   False   False   True   4h39m
baremetal   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m
cloud-credential   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h43m
cluster-autoscaler   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h38m
config-operator   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m
console   4.8.0-0.nightly-s390x-2021-04-26-125220   False   True   True   4h34m
csi-snapshot-controller   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h34m
dns   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h38m
etcd   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h38m
image-registry   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h34m
ingress   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   True   4h36m
insights   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h32m
kube-apiserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h37m
kube-controller-manager   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h37m
kube-scheduler   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h34m
kube-storage-version-migrator   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m
machine-api   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m
machine-approver   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m
machine-config   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h37m
marketplace   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m
monitoring   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h29m
network   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m
node-tuning   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h29m
openshift-apiserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h34m
openshift-controller-manager   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h38m
openshift-samples   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h32m
operator-lifecycle-manager   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h38m
operator-lifecycle-manager-catalog   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h38m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h34m
service-ca   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m
storage   4.8.0-0.nightly-s390x-2021-04-26-125220   True   False   False   4h39m

Thank you,
Kyle
Folks,

Given the above-described scenarios where the OCP 4.8 on Z installations do not complete, for both KVM and zVM environments, for configurations of 3 master nodes and 1-2 worker nodes, where the master nodes fulfill both master and worker roles and there are 1-2 worker nodes dedicated to the worker role, should this be a separate defect from this defect ("Converged 3 node cluster does not install/function for OCP 4.8 on Z for KVM environment")?

Thank you,
Kyle
Setting "reviewed-in-sprint" to "+" as it is unlikely to be resolved before the end of the week.
Folks,

The issue with the 3 node converged cluster for OCP 4.8 on Z for KVM may now be resolved as configuration-related, and additional tests are underway to confirm. I am reducing the severity of this bugzilla to medium and will provide an update by the end of today.

It does appear, however, that the issues still exist when the worker role is fulfilled by both the master nodes and 1-2 dedicated worker nodes at the same time (as documented above). This has been seen on both the KVM and zVM environments. Additional tests are underway with different OCP 4.8 builds to confirm.

Thank you,
Kyle
I have seen the exact same behavior when testing the 3-node install on zKVM libvirt IPI: https://issues.redhat.com/browse/MULTIARCH-935 For that, it turned out that the load balancer rules needed to be tweaked for the 3-node setup. Network configuration could be the cause for both of the setups you mention here.
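For illustration, the kind of load-balancer tweak referenced above typically amounts to making sure the *.apps frontends (ports 80 and 443) balance to every node that carries the worker role, which in a converged 3-node cluster means the masters themselves. A minimal sketch, assuming an haproxy-based external load balancer; the section name and IP addresses below are placeholders, not taken from this environment, and port 80 would be handled the same way:

$ cat /etc/haproxy/haproxy.cfg     # excerpt only; addresses are placeholders
listen ingress-https
    bind *:443
    mode tcp
    balance source
    server master-0 192.0.2.11:443 check
    server master-1 192.0.2.12:443 check
    server master-2 192.0.2.13:443 check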
Dylan,

Thank you for the information.

1. For the 3 node converged cluster for OCP 4.8 on Z for KVM for UPI, we seem to have a somewhat different resolution that I'm continuing to test, and it looks promising.
2. For the coexistence issues of 3 master nodes fulfilling both the master and worker roles while also having 1-2 dedicated nodes fulfill the worker role, would you have any insight or information?

Thank you,
Kyle
Got it, good to hear! I unfortunately have not deployed with a hybrid of schedulable masters and dedicated workers and so do not have much input there.
Hi Kyle,

Would you be able to provide some must-gather logs for the clusters/scenarios in question?

At a glance, the 3 cluster operators you are talking about are linked: without the ingress CO the console won't come up, and the authentication CO won't come up without the console being up - so it could be that you are really seeing issues with the ingress CO. That could be, for example, because the ingress pods are on hosts that the load balancer isn't sending traffic to (the load balancer issues that Dylan suggested). With some more detail/logs we'll be able to have a bit more insight.
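For completeness, one way to collect that data and take a quick first look at the operator conditions (the destination directory name below is illustrative):

$ oc adm must-gather --dest-dir=./must-gather-3node
$ oc get co authentication console ingress
$ oc get co ingress -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'

The last command prints each condition of the ingress ClusterOperator with its message, which is usually enough to see whether route/canary checks are what is holding the rollout back.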
Andy,

Thanks for the information. I've recreated the following 2 problem scenarios and will be providing the corresponding must-gather logs.

1. 3 master nodes and 1 worker node, where the 3 master nodes fulfill both master and worker roles, and 1 worker node dedicated to the worker role, and not all operators complete rollout.
2. 3 master nodes and 2 worker nodes, where the 3 master nodes fulfill both master and worker roles, and 2 worker nodes dedicated to the worker role, and not all operators complete rollout.

Here are the "oc adm must-gather" summaries for each of the above 2 problem scenarios:

1. 3 master nodes and 1 worker node, where the 3 master nodes fulfill both master and worker roles, and 1 worker node dedicated to the worker role.
====================================================================================================================================================
ClusterID: 0b153142-ba7a-4e15-b632-6f6a2fe4736b
ClusterVersion: Installing "4.8.0-0.nightly-s390x-2021-04-28-215101" for About an hour: Unable to apply 4.8.0-0.nightly-s390x-2021-04-28-215101: some cluster operators have not yet rolled out
ClusterOperators:
    clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF
    clusteroperator/console is not available (DeploymentAvailable: 0 pods available for console deployment) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF
    clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

2. 3 master nodes and 2 worker nodes, where the 3 master nodes fulfill both master and worker roles, and 2 worker nodes dedicated to the worker role.
=====================================================================================================================================================
ClusterID: 99ebe958-3f04-4e69-b7ec-a2a31a64056b
ClusterVersion: Installing "4.8.0-0.nightly-s390x-2021-04-28-215101" for 42 minutes: Unable to apply 4.8.0-0.nightly-s390x-2021-04-28-215101: some cluster operators have not yet rolled out
ClusterOperators:
    clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF
    clusteroperator/console is not available (DeploymentAvailable: 0 pods available for console deployment) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF
    clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

Thank you,
Kyle
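As a side note, the canary-check failures called out in these summaries can also be located directly in the gathered data; a sketch, assuming the default must-gather output directory naming:

$ grep -rl "CanaryChecksRepetitiveFailures" must-gather.local.*/ | head -3
$ grep -ri "CanaryChecksSucceeding" must-gather.local.*/ | head -3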
Created attachment 1777109 [details] Issue with 3 master nodes fulfilling both master and worker roles, and 1 worker node fulfilling worker role Must-gather data for issue with 3 master nodes fulfilling both master and worker roles, and 1 worker node fulfilling worker role. OCP build 4.8.0-0.nightly-s390x-2021-04-28-215101. Thank you.
Created attachment 1777110 [details] Issue with 3 master nodes fulfilling both master and worker roles, and 2 worker nodes fulfilling worker role. Must-gather data for issue with 3 master nodes fulfilling both master and worker roles, and 2 worker nodes fulfilling worker role. OCP build 4.8.0-0.nightly-s390x-2021-04-28-215101. Thank you.
Hi Kyle,

Thanks for the logs - I can see the following:

For the ingress-operator the pods are up and seem normal, but the operator can't come up fully because the canary-openshift-ingress-canary route is not reachable (it returns an EOF):

error performing canary route check {"error": "error sending canary HTTP request to \"canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com\": Get \"https://canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com\": EOF"}

Similarly, for the console we can see what you linked to above:

err: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF

The same issue is happening with the authentication operator:

OAuthServerRouteEndpointAccessibleController reconciliation failed: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF

This means the cluster is unable to reach the ingress pods (and as such the *.apps addresses) even though the ingress/console/auth pods are up and running. This is likely to be one of 3 things:

* The DNS is pointing to the wrong address within the cluster - how is your DNS set up for *.apps, and can you confirm it's resolving to the correct address? We know it is resolving, since it attempts to connect to the address. You could check that via curl or something similar to see if it resolves correctly.
* The load balancer has no available backends - is the LB configured correctly and pointing to ALL hosts? Since the ingress pods can exist on any worker (and will move hosts, for example if a host with an ingress pod on it were to reboot/fail), and here all the hosts are workers, every host needs to be in the backend list.
* There are firewall rules preventing the traffic on 80/443 from reaching the ingress pods.

Andy
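As an illustrative aside, the three checks above could look like this, reusing the host names from the earlier outputs (the load-balancer IP is a placeholder):

$ dig +short oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com     # should return the LB/ingress address serving *.apps
$ curl -kI https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz
$ nc -vz 192.0.2.10 443                                                  # placeholder LB IP; confirms 443 is not filtered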
Andy,

Thanks very much for the information. I'll check a bit later and let you know. Here's some additional information and then a question to help with further understanding.

1. For both KVM and zVM environments, standard OCP 4.8 clusters with 3 master nodes and 2 worker nodes (and more) have been tested with the same OCP 4.8 builds, and all install completely with all operators rolled out.
2. For both KVM and zVM environments, the converged 3 node OCP 4.8 cluster with only the 3 master nodes fulfilling both the master and worker roles has been tested with the same OCP 4.8 builds, and all installs complete with all operators rolled out.

If any of the 3 conditions you have listed were impacting the install of the configurations with 3 master nodes and 1-2 worker nodes, where the 3 master nodes fulfill both master and worker roles and 1-2 worker nodes are dedicated to the worker role, would they not also impact the OCP cluster installs listed for the environments in #1 and #2 above in this comment?

Thank you,
Kyle
Folks,

FYI. For the converged 3 node cluster, the following 11 OCP 4.8 nightly builds have successfully installed for both KVM and zVM environments (in addition to a good number of previous builds).

1. 4.8.0-0.nightly-s390x-2021-04-28-005546
2. 4.8.0-0.nightly-s390x-2021-04-28-061949
3. 4.8.0-0.nightly-s390x-2021-04-28-090300
4. 4.8.0-0.nightly-s390x-2021-04-28-102839
5. 4.8.0-0.nightly-s390x-2021-04-28-120918
6. 4.8.0-0.nightly-s390x-2021-04-28-133218
7. 4.8.0-0.nightly-s390x-2021-04-28-144752
8. 4.8.0-0.nightly-s390x-2021-04-28-175339
9. 4.8.0-0.nightly-s390x-2021-04-28-202733
10. 4.8.0-0.nightly-s390x-2021-04-28-215101
11. 4.8.0-0.nightly-s390x-2021-04-28-231853

Thank you,
Kyle
Hi Kyle,

It is odd that it worked in the instances you mentioned, but the actual resolution of the *.apps addresses and the connection to the ingress pods is not handled by OpenShift itself; that is handled by the DNS and potentially the LB servicing the cluster. Given the errors, either the ingress pods are broken (I think you would get a different error in that case), or we are unable to reach them. We can test that theory further by checking against the individual ingress pods directly to see if that works.

First, get the IPs of the ingress pods:

$ oc get pods -n openshift-ingress -o wide | grep -v ^NAME | awk '{print $6}'
192.168.124.51
192.168.124.52

Second, curl against those IPs specifically, for example:

$ curl -Ik https://192.168.124.51/ -H "HOST:canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
HTTP/1.1 200 OK
...

$ curl -Ik https://192.168.124.52/ -H "HOST:canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
HTTP/1.1 200 OK
...

As a "prove it fails vs non-ingress hosts" check:

$ curl -Ik https://192.168.124.53/ -H "HOST:canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
curl: (7) Failed to connect to 192.168.124.53 port 443: No route to host

We can compare that to the address itself, e.g.:

$ curl -Ik https://canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com/
HTTP/1.1 200 OK
...

You should get the same results on the 2 ingress pods and the address itself. If the ingress pods are reachable (and work) but the address itself doesn't, then it has to be a resolution/LB configuration issue. I've subbed in the IP addresses from a KVM cluster I have up, but the approach is the same for zVM.
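It can also help to confirm which hosts the router pods actually landed on and which nodes carry the worker role, since every such node is a potential backend the load balancer needs to cover:

$ oc get pods -n openshift-ingress -o wide     # the NODE column shows the hosts the LB must include for 80/443
$ oc get nodes -l node-role.kubernetes.io/worker= -o name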
Since Kyle has set the severity of this bug to "low", I am setting the "Blocker?" flag to "-" to triage the bug, since low-severity bugs should not be blocker bugs. Also, based on Comment 14, there have been successful installs.
Hi @Muhammad, should this bug still be assigned to you? If so, do you think it will be resolved before the end of this sprint (May 22nd)?
Also @Kyle - I see from Comment 14 that the builds have successfully installed; can this bug be closed out, or is there other work involved?
Dan,

Thanks for the note. Yes, this bug can be closed for the converged 3 node cluster install issue with the OCP 4.8 on Z KVM environment.

FYI. For the converged 3 node cluster, the following 9 OCP 4.8 nightly builds from May 17, 2021 have successfully installed for KVM (in addition to a good number of previous builds).

1. 4.8.0-0.nightly-s390x-2021-05-17-075233
2. 4.8.0-0.nightly-s390x-2021-05-17-091104
3. 4.8.0-0.nightly-s390x-2021-05-17-113817
4. 4.8.0-0.nightly-s390x-2021-05-17-141027
5. 4.8.0-0.nightly-s390x-2021-05-17-153235
6. 4.8.0-0.nightly-s390x-2021-05-17-180853
7. 4.8.0-0.nightly-s390x-2021-05-17-195805
8. 4.8.0-0.nightly-s390x-2021-05-17-213847
9. 4.8.0-0.nightly-s390x-2021-05-17-233356

Thank you,
Kyle
Thank you Kyle. Closing