Bug 1954237
| Summary: | Converged 3 node cluster does not install/function for OCP 4.8 on Z for KVM environment | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | krmoser |
| Component: | Multi-Arch | Assignee: | Muhammad Adeel (IBM) <madeel> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Barry Donahue <bdonahue> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | amccrae, chanphil, christian.lapolt, danili, dorzel, Holger.Wolf, jschinta, wolfgang.voesch |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | s390x | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-18 11:39:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1934148 | ||
| Attachments: | |||
Description
krmoser
2021-04-27 18:55:41 UTC
Folks,

Please note that we have now also determined that for OCP 4.8 on Z for zVM, scenarios #4 and #5 listed above in the "Actual results:" section also apply. Specifically:

1. For zVM environments, when installing with 3 master nodes and 1 worker node, where the master nodes fulfill both the master and worker roles and there is 1 worker node dedicated to the worker role:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and are in a degraded state.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator successfully rolls out.
   5. The network cluster operator is not available.

Note: This configuration seems to reinforce that for the authentication and console cluster operators to become available, and for the ingress operator not to be degraded, 2 nodes dedicated to the worker role (only) are required.

```
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          27m     Unable to apply 4.8.0-0.nightly-s390x-2021-04-26-125220: some cluster operators have not yet rolled out

# oc get nodes
NAME                                          STATUS   ROLES           AGE   VERSION
master-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   26m   v1.21.0-rc.0+6143dea
master-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   26m   v1.21.0-rc.0+6143dea
master-2.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   26m   v1.21.0-rc.0+6143dea
worker-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker          16m   v1.21.0-rc.0+6143dea

# oc get co
NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-s390x-2021-04-26-125220   False       False         True       22m
baremetal                                  4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
cloud-credential                           4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      26m
cluster-autoscaler                         4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
config-operator                            4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      22m
console                                    4.8.0-0.nightly-s390x-2021-04-26-125220   False       True          True       15m
csi-snapshot-controller                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
dns                                        4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
etcd                                       4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
image-registry                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      16m
ingress                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         True       16m
insights                                   4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      15m
kube-apiserver                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        True          False      18m
kube-controller-manager                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      18m
kube-scheduler                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
kube-storage-version-migrator              4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
machine-api                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
machine-approver                           4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
machine-config                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
marketplace                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
monitoring                                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      7m
network                                                                              False       True          True       26m
node-tuning                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      16m
openshift-apiserver                        4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      16m
openshift-controller-manager               4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      19m
openshift-samples                          4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      15m
operator-lifecycle-manager                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      16m
service-ca                                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
storage                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      22m
```

2. For zVM environments, when installing with 3 master nodes and 2 worker nodes, where the master nodes fulfill both the master and worker roles and there are 2 worker nodes dedicated to the worker role:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and are in a degraded state.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator successfully rolls out.

```
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          4h44m   Unable to apply 4.8.0-0.nightly-s390x-2021-04-26-125220: some cluster operators have not yet rolled out

# oc get nodes
NAME                                          STATUS   ROLES           AGE     VERSION
master-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   4h43m   v1.21.0-rc.0+6143dea
master-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   4h43m   v1.21.0-rc.0+6143dea
master-2.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   4h43m   v1.21.0-rc.0+6143dea
worker-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker          4h30m   v1.21.0-rc.0+6143dea
worker-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker          4h30m   v1.21.0-rc.0+6143dea

# oc get co
NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-s390x-2021-04-26-125220   False       False         True       4h39m
baremetal                                  4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
cloud-credential                           4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h43m
cluster-autoscaler                         4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
config-operator                            4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
console                                    4.8.0-0.nightly-s390x-2021-04-26-125220   False       True          True       4h34m
csi-snapshot-controller                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
dns                                        4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
etcd                                       4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
image-registry                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
ingress                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         True       4h36m
insights                                   4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h32m
kube-apiserver                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h37m
kube-controller-manager                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h37m
kube-scheduler                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
kube-storage-version-migrator              4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
machine-api                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
machine-approver                           4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
machine-config                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h37m
marketplace                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
monitoring                                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h29m
network                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
node-tuning                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h29m
openshift-apiserver                        4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
openshift-controller-manager               4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
openshift-samples                          4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h32m
operator-lifecycle-manager                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
service-ca                                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
storage                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
```

Thank you,
Kyle

Folks,

Given the above described scenarios where the OCP 4.8 on Z installations do not complete, for both KVM and zVM environments, for configurations of 3 master nodes and 1-2 worker nodes (where the master nodes fulfill both the master and worker roles and 1-2 worker nodes are dedicated to the worker role), should this be tracked as a separate defect from this one ("Converged 3 node cluster does not install/function for OCP 4.8 on Z for KVM environment")?

Thank you,
Kyle

Setting "reviewed-in-sprint" to "+" as it is unlikely to be resolved before the end of the week.

Folks,

The issue with the 3 node converged cluster for OCP 4.8 on Z for KVM may now be resolved as configuration related, and additional tests are underway to confirm. I am reducing the severity of this bugzilla to medium and will provide an update by the end of today.

It does appear, however, that the issues still exist when the worker role is fulfilled by both the master nodes and 1-2 dedicated worker nodes at the same time (as documented above). This has been seen on both the KVM and zVM environments. Additional tests are underway with different OCP 4.8 builds to confirm.

Thank you,
Kyle

I have seen the exact same behavior when testing the 3-node install on zKVM libvirt IPI: https://issues.redhat.com/browse/MULTIARCH-935. For that, it turned out that the load balancer rules needed to be tweaked for the 3-node setup. Network configuration could be the cause for both of the setups you mention here.

Dylan,

Thank you for the information.

1. For the 3 node converged cluster for OCP 4.8 on Z for KVM for UPI, we seem to have a somewhat different resolution that I'm continuing to test, and it looks promising.
2. For the coexistence issues of 3 master nodes fulfilling both the master and worker roles while also having 1-2 dedicated nodes fulfill the worker role, would you have any insight or information?

Thank you,
Kyle

Got it, good to hear! I unfortunately have not deployed with a hybrid of schedulable masters and dedicated workers, and so do not have much input there.

Hi Kyle,

Would you be able to provide some must-gather logs for the clusters/scenarios in question? At a glance, the 3 cluster operators you are talking about are linked: without the ingress CO, the console won't come up, and the authentication CO won't come up without the console being up - so it could be that you are seeing issues with the ingress CO. That could be, for example, that the ingress pods are on hosts that traffic isn't being balanced to appropriately (the load balancer issues that Dylan suggested). With some more detail/logs we'll be able to have a bit more insight.

Andy,
Thanks for the information. I've recreated the following 2 problem scenarios and will be providing the corresponding must-gather logs.
1. 3 master nodes and 1 worker node, where the 3 master nodes fulfill both the master and worker roles and 1 worker node is dedicated to the worker role; not all operators complete rollout.
2. 3 master nodes and 2 worker nodes, where the 3 master nodes fulfill both the master and worker roles and 2 worker nodes are dedicated to the worker role; not all operators complete rollout.
Here are the "oc adm must-gather" summaries for each of the above 2 problem scenarios:
1. 3 master nodes and 1 worker node, where the 3 master nodes fulfill both master and worker roles, and 1 worker node dedicated to the worker role.
====================================================================================================================================================
```
ClusterID: 0b153142-ba7a-4e15-b632-6f6a2fe4736b
ClusterVersion: Installing "4.8.0-0.nightly-s390x-2021-04-28-215101" for About an hour: Unable to apply 4.8.0-0.nightly-s390x-2021-04-28-215101: some cluster operators have not yet rolled out
ClusterOperators:
  clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF
  clusteroperator/console is not available (DeploymentAvailable: 0 pods available for console deployment) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF
  clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
```
2. 3 master nodes and 2 worker nodes, where the 3 master nodes fulfill both master and worker roles, and 2 worker nodes dedicated to the worker role.
=====================================================================================================================================================
```
ClusterID: 99ebe958-3f04-4e69-b7ec-a2a31a64056b
ClusterVersion: Installing "4.8.0-0.nightly-s390x-2021-04-28-215101" for 42 minutes: Unable to apply 4.8.0-0.nightly-s390x-2021-04-28-215101: some cluster operators have not yet rolled out
ClusterOperators:
  clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF
  clusteroperator/console is not available (DeploymentAvailable: 0 pods available for console deployment) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF
  clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
```
Thank you,
Kyle
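For reference, a minimal sketch of how summaries like the two above are produced - `oc adm must-gather` prints this ClusterID/ClusterVersion/ClusterOperators digest at the end of its run (the destination directory name here is illustrative):

```bash
# Collect cluster state into a local directory; the digest is printed when the run completes.
oc adm must-gather --dest-dir=./must-gather-3m1w
```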
Created attachment 1777109 [details]
Issue with 3 master nodes fulfilling both master and worker roles, and 1 worker node fulfilling worker role
Must-gather data for issue with 3 master nodes fulfilling both master and worker roles, and 1 worker node fulfilling worker role.
OCP build 4.8.0-0.nightly-s390x-2021-04-28-215101.
Thank you.
Created attachment 1777110 [details]
Issue with 3 master nodes fulfilling both master and worker roles, and 2 worker nodes fulfilling worker role.
Must-gather data for issue with 3 master nodes fulfilling both master and worker roles, and 2 worker nodes fulfilling worker role.
OCP build 4.8.0-0.nightly-s390x-2021-04-28-215101.
Thank you.
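Background on how masters come to fulfill the worker role in these scenarios: in OpenShift 4.x this is controlled by the `mastersSchedulable` field of the cluster-scoped Scheduler config resource. A minimal sketch for checking and setting it (the resource and field are standard; whether the reporter's clusters set it explicitly or via the installer is not known from this report):

```bash
# Check whether control-plane nodes are currently schedulable for regular workloads.
oc get scheduler cluster -o jsonpath='{.spec.mastersSchedulable}'

# Allow masters to also act as workers, as in a converged 3-node cluster.
oc patch scheduler cluster --type merge -p '{"spec":{"mastersSchedulable":true}}'
```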
Hi Kyle,
Thanks for the logs - I can see the following:
For the ingress-operator the pods are up and seem normal; the operator can't come up fully because the canary-openshift-ingress-canary route is not reachable (it returns an EOF):

```
error performing canary route check {"error": "error sending canary HTTP request to \"canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com\": Get \"https://canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com\": EOF"}
```

Similarly for the console we can see what you linked to above:

```
err: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF
```

The same issue is happening with the authentication operator:

```
OAuthServerRouteEndpointAccessibleController reconciliation failed: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF
```

This means the cluster is unable to reach the ingress pods (and as such the *.apps addresses) even though the ingress/console/auth pods are up and running. This is likely to be one of 3 things:

* DNS is pointing to the wrong address within the cluster - how is your DNS set up for *.apps, and can you confirm it's resolving to the correct address? We know it is resolving, since it attempts to connect to the address. You could check that via curl or some other similar tool to see if it resolves correctly.
* The load balancer has no available backends - is the LB configured correctly and pointing to ALL hosts? The ingress pods can exist on any worker (and will move hosts, for example if a host with an ingress pod on it were to reboot/fail), and here all the hosts are workers.
* There are firewall rules preventing traffic on ports 80/443 from reaching the ingress pods.
Andy
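A minimal sketch of the first check above (the hostname is taken from the thread; `dig` and `curl` are standard tools, and the correct addresses depend on the site's DNS/load balancer setup):

```bash
# Confirm the wildcard *.apps record resolves, and note which address it returns.
dig +short canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com

# See whether that endpoint answers on 443 at all (an immediate EOF points at the LB/firewall).
curl -kIs https://canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com
```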
Andy,

Thanks very much for the information. I'll check a bit later and let you know. Here's some additional information, and then a question to help with further understanding:

1. For both KVM and zVM environments, standard OCP 4.8 clusters of 3 master nodes and 2 (or more) worker nodes have been tested with the same OCP 4.8 builds, and all install completely with all operators rolled out.
2. For both KVM and zVM environments, the converged 3 node OCP 4.8 cluster, with only the 3 master nodes fulfilling both the master and worker roles, has been tested with the same OCP 4.8 builds, and all install completely with all operators rolled out.

If any of the 3 conditions you listed were impacting the install of the configurations with 3 master nodes and 1-2 worker nodes (where the 3 master nodes fulfill both the master and worker roles and 1-2 worker nodes are dedicated to the worker role), would they not also impact the OCP cluster installs for the environments in #1 and #2 above?

Thank you,
Kyle

Folks,

FYI: for the converged 3 node cluster, the following 11 OCP 4.8 nightly builds have successfully installed for both KVM and zVM environments (in addition to a good number of previous builds):

1. 4.8.0-0.nightly-s390x-2021-04-28-005546
2. 4.8.0-0.nightly-s390x-2021-04-28-061949
3. 4.8.0-0.nightly-s390x-2021-04-28-090300
4. 4.8.0-0.nightly-s390x-2021-04-28-102839
5. 4.8.0-0.nightly-s390x-2021-04-28-120918
6. 4.8.0-0.nightly-s390x-2021-04-28-133218
7. 4.8.0-0.nightly-s390x-2021-04-28-144752
8. 4.8.0-0.nightly-s390x-2021-04-28-175339
9. 4.8.0-0.nightly-s390x-2021-04-28-202733
10. 4.8.0-0.nightly-s390x-2021-04-28-215101
11. 4.8.0-0.nightly-s390x-2021-04-28-231853

Thank you,
Kyle

Hi Kyle,
It is odd that it worked in the instances you mentioned, but the actual resolution of the *.apps addresses and the connection to the ingress pods is not handled by OpenShift itself; that is handled by DNS and potentially the LB servicing the cluster. Given the errors, either the ingress pods are broken (I think you would get a different error in that case) or we are unable to reach them.
We can test that theory further by checking against the individual ingress pods directly to see if that works.
First, get the IPs of the ingress pods:
```
$ oc get pods -n openshift-ingress -o wide | grep -v ^NAME | awk '{print $6}'
192.168.124.51
192.168.124.52
```

Second, curl against those IPs specifically, for example:

```
$ curl -Ik https://192.168.124.51/ -H "Host: canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
HTTP/1.1 200 OK
...

$ curl -Ik https://192.168.124.52/ -H "Host: canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
HTTP/1.1 200 OK
...
```

And, to prove it fails against non-ingress hosts:

```
$ curl -Ik https://192.168.124.53/ -H "Host: canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
curl: (7) Failed to connect to 192.168.124.53 port 443: No route to host
```

We can compare that to the address itself, for example:

```
$ curl -Ik https://canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com/
HTTP/1.1 200 OK
...
```

You should get the same results from the 2 ingress pods and from the address itself. If the ingress pods are reachable (and work) but the address itself doesn't, then it has to be a resolution/LB configuration issue. I've subbed in the IP addresses from a KVM cluster I have up, but the approach is the same for zVM.
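A small loop that automates the per-pod check above (the same commands, just iterated over every ingress pod IP; the Host header value comes from the thread):

```bash
# Curl each ingress pod directly, bypassing DNS/LB, with the canary route's Host header.
for ip in $(oc get pods -n openshift-ingress -o wide --no-headers | awk '{print $6}'); do
  echo "== ${ip} =="
  curl -Ik "https://${ip}/" -H "Host: canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
done
```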
Since Kyle has set the severity of this bug to "low", I am setting the "Blocker?" flag to "-" to triage the bug, since low-severity bugs should not be blocker bugs. Also, based on Comment 14, there have been successful installs.

Hi @Muhammad, should this bug still be assigned to you? If so, do you think it will be resolved before the end of this sprint (May 22nd)?

Also, @Kyle - I see from Comment 14 that the builds have successfully installed; can this bug be closed out, or is there other work involved?

Dan,

Thanks for the note. Yes, this bug can be closed for the converged 3 node cluster install issue with the OCP 4.8 on Z KVM environment.

FYI: for the converged 3 node cluster, the following 9 OCP 4.8 nightly builds from May 17, 2021 have successfully installed for KVM (in addition to a good number of previous builds):

1. 4.8.0-0.nightly-s390x-2021-05-17-075233
2. 4.8.0-0.nightly-s390x-2021-05-17-091104
3. 4.8.0-0.nightly-s390x-2021-05-17-113817
4. 4.8.0-0.nightly-s390x-2021-05-17-141027
5. 4.8.0-0.nightly-s390x-2021-05-17-153235
6. 4.8.0-0.nightly-s390x-2021-05-17-180853
7. 4.8.0-0.nightly-s390x-2021-05-17-195805
8. 4.8.0-0.nightly-s390x-2021-05-17-213847
9. 4.8.0-0.nightly-s390x-2021-05-17-233356

Thank you,
Kyle

Thank you Kyle. Closing.