Bug 2042826 - [SNO] the replicas of ingresscontroller/default is 2 on new installed SNO private cluster
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Shudi Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-20 08:25 UTC by Hongan Li
Modified: 2023-06-28 13:18 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-02 01:38:10 UTC
Target Upstream Version:
Embargoed:



Description Hongan Li 2022-01-20 08:25:37 UTC
Description of problem:
The default ingresscontroller has 2 replicas on a freshly installed SNO private cluster, and co/ingress reports an error:

NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.10.0-0.nightly-2022-01-19-150530   True        False         True       5h45m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6bf4954d74-n4pvg" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)



OpenShift release version:
4.10.0-0.nightly-2022-01-19-150530

Cluster Platform:
AWS/GCP

How reproducible:
100%

Steps to Reproduce (in detail):
1. Perform a fresh install of an SNO private cluster.


Actual results:

$ oc get infrastructures.config.openshift.io cluster -oyaml
status:
  apiServerInternalURI: https://api-int.hongli-sno.qe.gcp.devcluster.openshift.com:6443
  apiServerURL: https://api.hongli-sno.qe.gcp.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: hongli-sno-sdhj9
  infrastructureTopology: SingleReplica             <------SNO cluster
  platform: GCP
  platformStatus:
    gcp:
      projectID: openshift-qe
      region: us-central1
    type: GCP

#### only one node (master+worker)
$ oc get node
NAME                                                STATUS   ROLES           AGE    VERSION
hongli-sno-sdhj9-master-0.c.openshift-qe.internal   Ready    master,worker   7h2m   v1.23.0+60f5a1c

$ oc get co/ingress
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.10.0-0.nightly-2022-01-19-150530   True        False         True       5h45m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6bf4954d74-n4pvg" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)


Expected results:
co/ingress should not be Degraded, and replicas should be 1 on an SNO private cluster.

Impact of the problem:


Additional info:
1. After deleting ingresscontroller/default and waiting for the ingress operator to recreate it, replicas is set to 1 and co/ingress returns to normal.
2. The same issue was not found on a non-private cluster.



Comment 1 Miciah Dashiel Butler Masters 2022-01-20 17:12:21 UTC
Setting blocker+ because this breaks the install for SNO+private.  

The issue probably lies in the generation of the default ingresscontroller manifest that the installer uses when the install-config specifies that a private cluster is desired.

Comment 2 Miciah Dashiel Butler Masters 2022-01-20 19:01:08 UTC
Is this a regression from 4.9, or is this also broken on 4.9 (and probably earlier releases too)?

Comment 3 Shudi Li 2022-01-21 03:58:28 UTC
The issue can be reproduced in 4.9.0-0.nightly-2022-01-20-172411

1.
% oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-01-20-172411   True        False         59m     Error while reconciling 4.9.0-0.nightly-2022-01-20-172411: the cluster operator ingress is degraded
% 

2.
% oc get node
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-62-116.us-east-2.compute.internal   Ready    master,worker   70m   v1.22.3+e790d7f
%

3.
% oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.9.0-0.nightly-2022-01-20-172411   True        False         True       65m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6b6fbf7f7f-qfkzs" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
%

Comment 4 Miciah Dashiel Butler Masters 2022-01-21 04:20:22 UTC
Thanks!  Because this is not a regression, I am clearing the blocker flag.  However, I have already posted a fix for review anyway.

Comment 6 Miciah Dashiel Butler Masters 2022-01-21 21:23:31 UTC
After discussing with the installer team, we've decided that the appropriate way to resolve the issue is to change the operator's defaulting behavior when spec.replicas is omitted on an IngressController.

Comment 9 Miciah Dashiel Butler Masters 2022-01-27 17:17:55 UTC
Blocked on getting a reviewer for <https://github.com/openshift/api/pull/1103>.  Moving this BZ off of 4.10.0; we'll get it in a later release.

Comment 10 Miciah Dashiel Butler Masters 2022-03-09 19:17:01 UTC
This BZ is somewhat related to this proposed enhancement: <https://github.com/openshift/enhancements/pull/1041>.  I'll keep this BZ on the backlog for now.

Comment 11 Miciah Dashiel Butler Masters 2022-10-24 20:41:47 UTC
https://github.com/openshift/cluster-ingress-operator/pull/728/commits/d52a837623d29d8b265bf3fa9e395a37be778f78 for https://issues.redhat.com/browse/MGMT-9797 should have fixed the issue.  Please verify and let me know if there is still an issue.

Comment 13 Shudi Li 2022-10-27 02:30:10 UTC
Verified with 4.11.0-0.nightly-2022-10-26-170309 on an SNO cluster.
1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-10-26-170309   True        False         12m     Cluster version is 4.11.0-0.nightly-2022-10-26-170309
%

2.
% oc get infrastructures.config.openshift.io cluster -oyaml
status:
  apiServerInternalURI: https://api-int.shudi-411snop12.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.shudi-411snop12.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: shudi-411snop12-2tc8f
  infrastructureTopology: SingleReplica       <---
  platform: AWS
  platformStatus:
    aws:
      region: us-east-2
    type: AWS
% 

3.
% oc get node
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-54-255.us-east-2.compute.internal   Ready    master,worker   31m   v1.24.6+5157800
% 

4. Check the router pods; only one pod, as expected.
% oc -n openshift-ingress get pods
NAME                             READY   STATUS    RESTARTS      AGE
router-default-c86b8754f-jkj8m   1/1     Running   3 (22m ago)   29m
% 

5.
% oc get co/ingress
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.11.0-0.nightly-2022-10-26-170309   True        False         False      21m     
%

Comment 15 Miciah Dashiel Butler Masters 2022-11-07 20:31:30 UTC
The change mentioned in comment 11 shipped in the 4.11.0 GA release, so I am changing the resolution of this BZ to "CURRENTRELEASE".

