Bug 1954237
| Summary: | Converged 3 node cluster does not install/function for OCP 4.8 on Z for KVM environment | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | krmoser |
| Component: | Multi-Arch | Assignee: | Muhammad Adeel (IBM) <madeel> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Barry Donahue <bdonahue> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | amccrae, chanphil, christian.lapolt, danili, dorzel, Holger.Wolf, jschinta, wolfgang.voesch |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | s390x | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-18 11:39:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1934148 | ||
| Attachments: | |||
Description
krmoser
2021-04-27 18:55:41 UTC
Folks,

Please note that we have now also determined that for OCP 4.8 on Z for zVM, scenarios #4 and #5 listed above in the "Actual results:" section also apply. Specifically:

1. For zVM environments, when installing with 3 master nodes and 1 worker node, where the master nodes fulfill both the master and worker roles and there is 1 worker node dedicated to the worker role:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and are in a degraded state.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator successfully rolls out.
   5. The network cluster operator is not available.

Note: This configuration seems to reinforce that for the authentication and console cluster operators to become available, and for the ingress operator not to be degraded, 2 nodes dedicated to the worker role (only) are required.

```
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          27m     Unable to apply 4.8.0-0.nightly-s390x-2021-04-26-125220: some cluster operators have not yet rolled out

# oc get nodes
NAME                                          STATUS   ROLES           AGE   VERSION
master-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   26m   v1.21.0-rc.0+6143dea
master-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   26m   v1.21.0-rc.0+6143dea
master-2.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   26m   v1.21.0-rc.0+6143dea
worker-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker          16m   v1.21.0-rc.0+6143dea

# oc get co
NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-s390x-2021-04-26-125220   False       False         True       22m
baremetal                                  4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
cloud-credential                           4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      26m
cluster-autoscaler                         4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
config-operator                            4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      22m
console                                    4.8.0-0.nightly-s390x-2021-04-26-125220   False       True          True       15m
csi-snapshot-controller                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
dns                                        4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
etcd                                       4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
image-registry                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      16m
ingress                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         True       16m
insights                                   4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      15m
kube-apiserver                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        True          False      18m
kube-controller-manager                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      18m
kube-scheduler                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
kube-storage-version-migrator              4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
machine-api                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
machine-approver                           4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
machine-config                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
marketplace                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
monitoring                                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      7m
network                                                                              False       True          True       26m
node-tuning                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      16m
openshift-apiserver                        4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      16m
openshift-controller-manager               4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      19m
openshift-samples                          4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      15m
operator-lifecycle-manager                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      20m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      16m
service-ca                                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      21m
storage                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      22m
```

2. For zVM environments, when installing with 3 master nodes and 2 worker nodes, where the master nodes fulfill both the master and worker roles and there are 2 worker nodes dedicated to the worker role:
   1. The OCP 4.8 installation does not complete.
   2. The authentication and console cluster operators are not available and are in a degraded state.
   3. The ingress cluster operator status is degraded.
   4. The monitoring cluster operator successfully rolls out.

```
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          4h44m   Unable to apply 4.8.0-0.nightly-s390x-2021-04-26-125220: some cluster operators have not yet rolled out

# oc get nodes
NAME                                          STATUS   ROLES           AGE     VERSION
master-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   4h43m   v1.21.0-rc.0+6143dea
master-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   4h43m   v1.21.0-rc.0+6143dea
master-2.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master,worker   4h43m   v1.21.0-rc.0+6143dea
worker-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker          4h30m   v1.21.0-rc.0+6143dea
worker-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker          4h30m   v1.21.0-rc.0+6143dea

# oc get co
NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-s390x-2021-04-26-125220   False       False         True       4h39m
baremetal                                  4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
cloud-credential                           4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h43m
cluster-autoscaler                         4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
config-operator                            4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
console                                    4.8.0-0.nightly-s390x-2021-04-26-125220   False       True          True       4h34m
csi-snapshot-controller                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
dns                                        4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
etcd                                       4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
image-registry                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
ingress                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         True       4h36m
insights                                   4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h32m
kube-apiserver                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h37m
kube-controller-manager                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h37m
kube-scheduler                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
kube-storage-version-migrator              4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
machine-api                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
machine-approver                           4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
machine-config                             4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h37m
marketplace                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
monitoring                                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h29m
network                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
node-tuning                                4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h29m
openshift-apiserver                        4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
openshift-controller-manager               4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
openshift-samples                          4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h32m
operator-lifecycle-manager                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h38m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h34m
service-ca                                 4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
storage                                    4.8.0-0.nightly-s390x-2021-04-26-125220   True        False         False      4h39m
```

Thank you,
Kyle

Folks,

Given the above described scenarios where the OCP 4.8 on Z installations do not complete, for both KVM and zVM environments, for configurations of 3 master nodes and 1-2 worker nodes (where the master nodes fulfill both the master and worker roles and 1-2 worker nodes are dedicated to the worker role), should this be tracked as a separate defect from this one ("Converged 3 node cluster does not install/function for OCP 4.8 on Z for KVM environment")?

Thank you,
Kyle

Setting "reviewed-in-sprint" to "+" as it is unlikely to be resolved before the end of the week.

Folks,

The issue with the 3 node converged cluster for OCP 4.8 on Z for KVM may now be resolved as configuration related, and additional tests are underway to confirm. I am reducing the severity of this bugzilla to medium and will provide an update by the end of today.

It does appear, however, that the issues still exist when the worker role is fulfilled by both the master nodes and 1-2 dedicated worker nodes at the same time (as documented above). This has been seen on both the KVM and zVM environments. Additional tests are underway with different OCP 4.8 builds to confirm.

Thank you,
Kyle

I have seen the exact same behavior when testing the 3-node install on zKVM libvirt IPI: https://issues.redhat.com/browse/MULTIARCH-935. For that, it turned out that the load balancer rules needed to be tweaked for the 3-node setup. Network configuration could be the cause for both of the setups you mention here.

Dylan,

Thank you for the information.

1. For the 3 node converged cluster for OCP 4.8 on Z for KVM for UPI, we seem to have a somewhat different resolution that I'm continuing to test, and it looks promising.
2. For the coexistence issues of 3 master nodes fulfilling both the master and worker roles while also having 1-2 dedicated nodes fulfill the worker role, would you have any insight or information?

Thank you,
Kyle

Got it, good to hear! I unfortunately have not deployed with a hybrid of schedulable masters and dedicated workers, and so do not have much input there.

Hi Kyle,

Would you be able to provide some must-gather logs for the clusters/scenarios in question? At a glance, the 3 cluster operators you are talking about are linked: without the ingress CO, the console won't come up, and the authentication CO won't come up without the console being up - so it could be that you are seeing issues with the ingress CO. That could be, for example, that the ingress pods are on hosts that traffic isn't being balanced to appropriately (the load balancer issues that Dylan suggested). With some more detail/logs we'll be able to have a bit more insight.

Andy,
Thanks for the information. I've recreated the following 2 problem scenarios and will be providing the corresponding must-gather logs.
1. 3 master nodes and 1 worker node, where the 3 master nodes fulfill both the master and worker roles and 1 worker node is dedicated to the worker role; not all operators complete rollout.
2. 3 master nodes and 2 worker nodes, where the 3 master nodes fulfill both the master and worker roles and 2 worker nodes are dedicated to the worker role; not all operators complete rollout.
Here are the "oc adm must-gather" summaries for each of the above 2 problem scenarios:
1. 3 master nodes and 1 worker node, where the 3 master nodes fulfill both master and worker roles, and 1 worker node dedicated to the worker role.
====================================================================================================================================================
```
ClusterID: 0b153142-ba7a-4e15-b632-6f6a2fe4736b
ClusterVersion: Installing "4.8.0-0.nightly-s390x-2021-04-28-215101" for About an hour: Unable to apply 4.8.0-0.nightly-s390x-2021-04-28-215101: some cluster operators have not yet rolled out
ClusterOperators:
  clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF
  clusteroperator/console is not available (DeploymentAvailable: 0 pods available for console deployment) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF
  clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
```
2. 3 master nodes and 2 worker nodes, where the 3 master nodes fulfill both master and worker roles, and 2 worker nodes dedicated to the worker role.
=====================================================================================================================================================
```
ClusterID: 99ebe958-3f04-4e69-b7ec-a2a31a64056b
ClusterVersion: Installing "4.8.0-0.nightly-s390x-2021-04-28-215101" for 42 minutes: Unable to apply 4.8.0-0.nightly-s390x-2021-04-28-215101: some cluster operators have not yet rolled out
ClusterOperators:
  clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF
  clusteroperator/console is not available (DeploymentAvailable: 0 pods available for console deployment) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF
  clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
```
Thank you,
Kyle
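For reference, a minimal sketch of how summaries like the two above are produced - `oc adm must-gather` prints this ClusterID/ClusterVersion/ClusterOperators digest at the end of its run (the destination directory name here is illustrative):

```bash
# Collect cluster state into a local directory; the digest is printed when the run completes.
oc adm must-gather --dest-dir=./must-gather-3m1w
```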
Created attachment 1777109 [details]
Issue with 3 master nodes fulfilling both master and worker roles, and 1 worker node fulfilling worker role
Must-gather data for issue with 3 master nodes fulfilling both master and worker roles, and 1 worker node fulfilling worker role.
OCP build 4.8.0-0.nightly-s390x-2021-04-28-215101.
Thank you.
Created attachment 1777110 [details]
Issue with 3 master nodes fulfilling both master and worker roles, and 2 worker nodes fulfilling worker role.
Must-gather data for issue with 3 master nodes fulfilling both master and worker roles, and 2 worker nodes fulfilling worker role.
OCP build 4.8.0-0.nightly-s390x-2021-04-28-215101.
Thank you.
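Background on how masters come to fulfill the worker role in these scenarios: in OpenShift 4.x this is controlled by the `mastersSchedulable` field of the cluster-scoped Scheduler config resource. A minimal sketch for checking and setting it (the resource and field are standard; whether the reporter's clusters set it explicitly or via the installer is not known from this report):

```bash
# Check whether control-plane nodes are currently schedulable for regular workloads.
oc get scheduler cluster -o jsonpath='{.spec.mastersSchedulable}'

# Allow masters to also act as workers, as in a converged 3-node cluster.
oc patch scheduler cluster --type merge -p '{"spec":{"mastersSchedulable":true}}'
```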
Hi Kyle,
Thanks for the logs - I can see the following:
For the ingress-operator the pods are up and seem normal; the operator can't come up fully because the canary-openshift-ingress-canary route is not reachable (it returns an EOF):

```
error performing canary route check {"error": "error sending canary HTTP request to \"canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com\": Get \"https://canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com\": EOF"}
```

Similarly for the console we can see what you linked to above:

```
err: failed to GET route (https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health): Get "https://console-openshift-console.apps.pok-71.ocptest.pok.stglabs.ibm.com/health": EOF
```

The same issue is happening with the authentication operator:

```
OAuthServerRouteEndpointAccessibleController reconciliation failed: Get "https://oauth-openshift.apps.pok-71.ocptest.pok.stglabs.ibm.com/healthz": EOF
```

This means the cluster is unable to reach the ingress pods (and as such the *.apps addresses) even though the ingress/console/auth pods are up and running. This is likely to be one of 3 things:

* DNS is pointing to the wrong address within the cluster - how is your DNS set up for *.apps, and can you confirm it's resolving to the correct address? We know it is resolving, since it attempts to connect to the address. You could check that via curl or some other similar tool to see if it resolves correctly.
* The load balancer has no available backends - is the LB configured correctly and pointing to ALL hosts? The ingress pods can exist on any worker (and will move hosts, for example if a host with an ingress pod on it were to reboot/fail), and here all the hosts are workers.
* There are firewall rules preventing traffic on ports 80/443 from reaching the ingress pods.
Andy
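A minimal sketch of the first check above (the hostname is taken from the thread; `dig` and `curl` are standard tools, and the correct addresses depend on the site's DNS/load balancer setup):

```bash
# Confirm the wildcard *.apps record resolves, and note which address it returns.
dig +short canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com

# See whether that endpoint answers on 443 at all (an immediate EOF points at the LB/firewall).
curl -kIs https://canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com
```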
Andy,

Thanks very much for the information. I'll check a bit later and let you know. Here's some additional information, and then a question to help with further understanding:

1. For both KVM and zVM environments, standard OCP 4.8 clusters of 3 master nodes and 2 (or more) worker nodes have been tested with the same OCP 4.8 builds, and all install completely with all operators rolled out.
2. For both KVM and zVM environments, the converged 3 node OCP 4.8 cluster, with only the 3 master nodes fulfilling both the master and worker roles, has been tested with the same OCP 4.8 builds, and all install completely with all operators rolled out.

If any of the 3 conditions you listed were impacting the install of the configurations with 3 master nodes and 1-2 worker nodes (where the 3 master nodes fulfill both the master and worker roles and 1-2 worker nodes are dedicated to the worker role), would they not also impact the OCP cluster installs for the environments in #1 and #2 above?

Thank you,
Kyle

Folks,

FYI: for the converged 3 node cluster, the following 11 OCP 4.8 nightly builds have successfully installed for both KVM and zVM environments (in addition to a good number of previous builds):

1. 4.8.0-0.nightly-s390x-2021-04-28-005546
2. 4.8.0-0.nightly-s390x-2021-04-28-061949
3. 4.8.0-0.nightly-s390x-2021-04-28-090300
4. 4.8.0-0.nightly-s390x-2021-04-28-102839
5. 4.8.0-0.nightly-s390x-2021-04-28-120918
6. 4.8.0-0.nightly-s390x-2021-04-28-133218
7. 4.8.0-0.nightly-s390x-2021-04-28-144752
8. 4.8.0-0.nightly-s390x-2021-04-28-175339
9. 4.8.0-0.nightly-s390x-2021-04-28-202733
10. 4.8.0-0.nightly-s390x-2021-04-28-215101
11. 4.8.0-0.nightly-s390x-2021-04-28-231853

Thank you,
Kyle

Hi Kyle,
It is odd that it worked in the instances you mentioned, but the actual resolution of the *.apps addresses and the connection to the ingress pods is not handled by OpenShift itself; that is handled by DNS and potentially the LB servicing the cluster. Given the errors, either the ingress pods are broken (I think you would get a different error in that case) or we are unable to reach them.
We can test that theory further by checking against the individual ingress pods directly to see if that works.
First, get the IPs of the ingress pods:
```
$ oc get pods -n openshift-ingress -o wide | grep -v ^NAME | awk '{print $6}'
192.168.124.51
192.168.124.52
```

Second, curl against those IPs specifically, for example:

```
$ curl -Ik https://192.168.124.51/ -H "Host: canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
HTTP/1.1 200 OK
...

$ curl -Ik https://192.168.124.52/ -H "Host: canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
HTTP/1.1 200 OK
...
```

And, to prove it fails against non-ingress hosts:

```
$ curl -Ik https://192.168.124.53/ -H "Host: canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
curl: (7) Failed to connect to 192.168.124.53 port 443: No route to host
```

We can compare that to the address itself, for example:

```
$ curl -Ik https://canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com/
HTTP/1.1 200 OK
...
```

You should get the same results from the 2 ingress pods and from the address itself. If the ingress pods are reachable (and work) but the address itself doesn't, then it has to be a resolution/LB configuration issue. I've subbed in the IP addresses from a KVM cluster I have up, but the approach is the same for zVM.
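A small loop that automates the per-pod check above (the same commands, just iterated over every ingress pod IP; the Host header value comes from the thread):

```bash
# Curl each ingress pod directly, bypassing DNS/LB, with the canary route's Host header.
for ip in $(oc get pods -n openshift-ingress -o wide --no-headers | awk '{print $6}'); do
  echo "== ${ip} =="
  curl -Ik "https://${ip}/" -H "Host: canary-openshift-ingress-canary.apps.pok-71.ocptest.pok.stglabs.ibm.com"
done
```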
Since Kyle has set the severity of this bug to "low", I am setting the "Blocker?" flag to "-" to triage the bug, since low-severity bugs should not be blocker bugs. Also, based on Comment 14, there have been successful installs.

Hi @Muhammad, should this bug still be assigned to you? If so, do you think it will be resolved before the end of this sprint (May 22nd)?

Also, @Kyle - I see from Comment 14 that the builds have successfully installed; can this bug be closed out, or is there other work involved?

Dan,

Thanks for the note. Yes, this bug can be closed for the converged 3 node cluster install issue with the OCP 4.8 on Z KVM environment.

FYI: for the converged 3 node cluster, the following 9 OCP 4.8 nightly builds from May 17, 2021 have successfully installed for KVM (in addition to a good number of previous builds):

1. 4.8.0-0.nightly-s390x-2021-05-17-075233
2. 4.8.0-0.nightly-s390x-2021-05-17-091104
3. 4.8.0-0.nightly-s390x-2021-05-17-113817
4. 4.8.0-0.nightly-s390x-2021-05-17-141027
5. 4.8.0-0.nightly-s390x-2021-05-17-153235
6. 4.8.0-0.nightly-s390x-2021-05-17-180853
7. 4.8.0-0.nightly-s390x-2021-05-17-195805
8. 4.8.0-0.nightly-s390x-2021-05-17-213847
9. 4.8.0-0.nightly-s390x-2021-05-17-233356

Thank you,
Kyle

Thank you Kyle. Closing.