Description of problem:
The ingresscontroller has 2 replicas in a freshly installed SNO private cluster, and co/ingress reports an error:

NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.10.0-0.nightly-2022-01-19-150530   True        False         True       5h45m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6bf4954d74-n4pvg" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)

OpenShift release version:
4.10.0-0.nightly-2022-01-19-150530

Cluster Platform:
AWS/GCP

How reproducible:
100%

Steps to Reproduce (in detail):
1. Fresh install a SNO private cluster.

Actual results:
$ oc get infrastructures.config.openshift.io cluster -oyaml
status:
  apiServerInternalURI: https://api-int.hongli-sno.qe.gcp.devcluster.openshift.com:6443
  apiServerURL: https://api.hongli-sno.qe.gcp.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: hongli-sno-sdhj9
  infrastructureTopology: SingleReplica    <------SNO cluster
  platform: GCP
  platformStatus:
    gcp:
      projectID: openshift-qe
      region: us-central1
    type: GCP

#### only one node (master+worker)
$ oc get node
NAME                                                STATUS   ROLES           AGE    VERSION
hongli-sno-sdhj9-master-0.c.openshift-qe.internal   Ready    master,worker   7h2m   v1.23.0+60f5a1c

$ oc get co/ingress
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.10.0-0.nightly-2022-01-19-150530   True        False         True       5h45m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6bf4954d74-n4pvg" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)

Expected results:
co/ingress should not be Degraded, and the replica count should be 1 for an SNO private cluster.

Impact of the problem:

Additional info:
1. Deleting ingresscontroller/default and waiting for the ingress operator to recreate it updates the replicas to 1, and co/ingress returns to normal.
2. The same issue was not found on a non-private cluster.
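To illustrate why one router pod stays unscheduled, here is a toy model in Python. It assumes, based only on the scheduler message above, that the router pods carry a required anti-affinity rule allowing at most one router pod per node. The function and pod names are hypothetical, and this is not the Kubernetes scheduler's actual algorithm.

```python
from typing import Dict, List


def schedule_with_anti_affinity(replicas: int, nodes: List[str]) -> Dict[str, object]:
    """Toy placement: each node accepts at most one router pod (assumed rule)."""
    free_nodes = list(nodes)
    placed: Dict[str, str] = {}
    pending: List[str] = []
    for i in range(replicas):
        pod = f"router-{i}"  # hypothetical pod name
        if free_nodes:
            # A node without a router pod is still available.
            placed[pod] = free_nodes.pop(0)
        else:
            # Every node already hosts a router pod, so this one cannot be placed.
            pending.append(pod)
    return {"placed": placed, "pending": pending}


# One node, two requested replicas, as in the SNO cluster above.
result = schedule_with_anti_affinity(2, ["master-0"])
print(f"{len(result['placed'])}/2 of replicas are available")
```

With a single node, the second replica can never be placed, which matches the "1/2 of replicas are available" condition; with one replica, nothing is left pending.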
Setting blocker+ because this breaks the install for SNO+private. The issue probably lies in the generation of the default ingresscontroller manifest that the installer uses when the install-config specifies that a private cluster is desired.
Is this a regression from 4.9, or is this also broken on 4.9 (and probably earlier releases too)?
The issue can be reproduced in 4.9.0-0.nightly-2022-01-20-172411

1.
% oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-01-20-172411   True        False         59m     Error while reconciling 4.9.0-0.nightly-2022-01-20-172411: the cluster operator ingress is degraded
%

2.
% oc get node
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-62-116.us-east-2.compute.internal   Ready    master,worker   70m   v1.22.3+e790d7f
%

3.
% oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.9.0-0.nightly-2022-01-20-172411   True        False         True       65m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6b6fbf7f7f-qfkzs" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
%
Thanks! Because this is not a regression, I am clearing the blocker flag. However, I have already posted a fix for review anyway.
After discussing with the installer team, we've decided that the appropriate way to resolve the issue is to change the operator's defaulting behavior when spec.replicas is omitted on an IngressController.
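As a rough sketch of what topology-aware defaulting could look like (written in Python for illustration only; the function name, the constant, and the fallback of 2 are assumptions drawn from the behavior described in this report, not the operator's actual code):

```python
from typing import Optional

# Topology value as shown in the infrastructure status in the description.
SINGLE_REPLICA = "SingleReplica"


def default_replicas(spec_replicas: Optional[int], infrastructure_topology: str) -> int:
    """Hypothetical defaulting rule for an IngressController's replica count."""
    if spec_replicas is not None:
        # An explicitly set spec.replicas is always respected.
        return spec_replicas
    if infrastructure_topology == SINGLE_REPLICA:
        # Single-node cluster: a second router pod could never be scheduled.
        return 1
    # Assumed default for other topologies (the value observed in this report).
    return 2


print(default_replicas(None, "SingleReplica"))
print(default_replicas(None, "HighlyAvailable"))
```

The point of the design is that the decision moves out of the generated manifest and into the operator: an IngressController created without spec.replicas would get a count that fits the cluster topology, while an explicit value is left untouched.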
Blocked on getting a reviewer for <https://github.com/openshift/api/pull/1103>. Moving this BZ off of 4.10.0; we'll get it in a later release.
This BZ is somewhat related to this proposed enhancement: <https://github.com/openshift/enhancements/pull/1041>. I'll keep this BZ on the backlog for now.
https://github.com/openshift/cluster-ingress-operator/pull/728/commits/d52a837623d29d8b265bf3fa9e395a37be778f78 for https://issues.redhat.com/browse/MGMT-9797 should have fixed the issue. Please verify and let me know if there is still an issue.
Verified it with 4.11.0-0.nightly-2022-10-26-170309 on an SNO cluster

1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-10-26-170309   True        False         12m     Cluster version is 4.11.0-0.nightly-2022-10-26-170309
%

2.
% oc get infrastructures.config.openshift.io cluster -oyaml
status:
  apiServerInternalURI: https://api-int.shudi-411snop12.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.shudi-411snop12.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: shudi-411snop12-2tc8f
  infrastructureTopology: SingleReplica    <---
  platform: AWS
  platformStatus:
    aws:
      region: us-east-2
    type: AWS
%

3.
% oc get node
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-54-255.us-east-2.compute.internal   Ready    master,worker   31m   v1.24.6+5157800
%

4. Check the router pod; only one pod, as expected
shudi@Shudis-MacBook-Pro ~ % oc -n openshift-ingress get pods
NAME                             READY   STATUS    RESTARTS      AGE
router-default-c86b8754f-jkj8m   1/1     Running   3 (22m ago)   29m
%

5.
% oc get co/ingress
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.11.0-0.nightly-2022-10-26-170309   True        False         False      21m
%
The change mentioned in comment 11 shipped in the 4.11.0 GA release, so I am changing the resolution of this BZ to "CURRENTRELEASE".