Bug 2042826

Summary:	[SNO] ingresscontroller/default has 2 replicas on a newly installed SNO private cluster
Product:	OpenShift Container Platform	Reporter:	Hongan Li <hongli>
Component:	Networking	Assignee:	Miciah Dashiel Butler Masters <mmasters>
Networking sub component:	router	QA Contact:	Shudi Li <shudili>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	high
Priority:	medium	CC:	aos-bugs, gpei, mmasters, shudili, yunjiang
Version:	4.10
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-11-02 01:38:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Hongan Li 2022-01-20 08:25:37 UTC

Description of problem:
On a freshly installed SNO private cluster, the default ingresscontroller has 2 replicas, and co/ingress reports an error:

NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.10.0-0.nightly-2022-01-19-150530   True        False         True       5h45m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6bf4954d74-n4pvg" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)



OpenShift release version:
4.10.0-0.nightly-2022-01-19-150530

Cluster Platform:
AWS/GCP

How reproducible:
100%

Steps to Reproduce (in detail):
1. Install a fresh SNO private cluster.


Actual results:

$ oc get infrastructures.config.openshift.io cluster -oyaml
status:
  apiServerInternalURI: https://api-int.hongli-sno.qe.gcp.devcluster.openshift.com:6443
  apiServerURL: https://api.hongli-sno.qe.gcp.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: hongli-sno-sdhj9
  infrastructureTopology: SingleReplica             <------SNO cluster
  platform: GCP
  platformStatus:
    gcp:
      projectID: openshift-qe
      region: us-central1
    type: GCP

#### only one node (master+worker)
$ oc get node
NAME                                                STATUS   ROLES           AGE    VERSION
hongli-sno-sdhj9-master-0.c.openshift-qe.internal   Ready    master,worker   7h2m   v1.23.0+60f5a1c

$ oc get co/ingress
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.10.0-0.nightly-2022-01-19-150530   True        False         True       5h45m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6bf4954d74-n4pvg" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)


Expected results:
co/ingress should not be Degraded, and the replica count should be 1 on an SNO private cluster.

Impact of the problem:


Additional info:
1. Deleting ingresscontroller/default and waiting for the ingress operator to recreate it updates the replica count to 1, and co/ingress returns to normal.
2. The issue was not observed on a non-private cluster.
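
The workaround in item 1 can be sketched as a small shell helper. This is only an illustration: the function name, the retry count, and the POLL_INTERVAL variable are hypothetical, and it assumes the default ingresscontroller lives in the openshift-ingress-operator namespace and that the caller has permission to delete it.

```shell
# Workaround sketch (hypothetical helper, not an official procedure):
# delete the default ingresscontroller, then wait until the ingress
# operator has recreated it.
recreate_default_ingresscontroller() {
    # Number of polling attempts; defaults to 60.
    tries="${1:-60}"

    # "oc delete" waits for the object to be removed before returning.
    oc -n openshift-ingress-operator delete ingresscontroller/default || return 1

    # Poll until the operator has recreated the object.
    while [ "$tries" -gt 0 ]; do
        if oc -n openshift-ingress-operator get ingresscontroller/default >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-5}"
    done

    # The object did not reappear within the allotted attempts.
    return 1
}
```

After the helper returns successfully, `oc get co/ingress` and `oc -n openshift-ingress get pods` can be used to confirm that the operator is no longer Degraded and that a single router pod is running.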



Comment 1 Miciah Dashiel Butler Masters 2022-01-20 17:12:21 UTC

Setting blocker+ because this breaks the install for SNO+private.  

The issue probably lies in the generation of the default ingresscontroller manifest that the installer uses when the install-config specifies that a private cluster is desired.

Comment 2 Miciah Dashiel Butler Masters 2022-01-20 19:01:08 UTC

Is this a regression from 4.9, or is this also broken on 4.9 (and probably earlier releases too)?

Comment 3 Shudi Li 2022-01-21 03:58:28 UTC

The issue can be reproduced in 4.9.0-0.nightly-2022-01-20-172411

1.
% oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-01-20-172411   True        False         59m     Error while reconciling 4.9.0-0.nightly-2022-01-20-172411: the cluster operator ingress is degraded
% 

2.
% oc get node
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-62-116.us-east-2.compute.internal   Ready    master,worker   70m   v1.22.3+e790d7f
%

3.
% oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.9.0-0.nightly-2022-01-20-172411   True        False         True       65m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6b6fbf7f7f-qfkzs" cannot be scheduled: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
%

Comment 4 Miciah Dashiel Butler Masters 2022-01-21 04:20:22 UTC

Thanks!  Because this is not a regression, I am clearing the blocker flag.  However, I have already posted a fix for review anyway.

Comment 6 Miciah Dashiel Butler Masters 2022-01-21 21:23:31 UTC

After discussing with the installer team, we've decided that the appropriate way to resolve the issue is to change the operator's defaulting behavior when spec.replicas is omitted on an IngressController.
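
As a rough illustration of the defaulting idea described above (not the operator's actual code), the choice could be modeled as follows. The helper name is hypothetical, and the sketch assumes that the decision is driven by the infrastructureTopology field shown in the description, with 2 as the default for any topology other than SingleReplica.

```shell
# Simplified model of replica defaulting (hypothetical helper, an
# assumption about the intended behavior rather than the real fix).
default_router_replicas() {
    spec_replicas="$1"   # value of spec.replicas; empty if omitted
    topology="$2"        # status.infrastructureTopology of the cluster

    if [ -n "$spec_replicas" ]; then
        # An explicit spec.replicas always wins.
        echo "$spec_replicas"
    elif [ "$topology" = "SingleReplica" ]; then
        # Single-node cluster: a second router pod could never be scheduled.
        echo 1
    else
        echo 2
    fi
}
```

Under these assumptions, `default_router_replicas "" SingleReplica` prints 1, which matches the expected result in the description, while an explicitly set value is passed through unchanged.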

Comment 9 Miciah Dashiel Butler Masters 2022-01-27 17:17:55 UTC

Blocked on getting a reviewer for <https://github.com/openshift/api/pull/1103>.  Moving this BZ off of 4.10.0; we'll get it in a later release.

Comment 10 Miciah Dashiel Butler Masters 2022-03-09 19:17:01 UTC

This BZ is somewhat related to this proposed enhancement: <https://github.com/openshift/enhancements/pull/1041>.  I'll keep this BZ on the backlog for now.

Comment 11 Miciah Dashiel Butler Masters 2022-10-24 20:41:47 UTC

https://github.com/openshift/cluster-ingress-operator/pull/728/commits/d52a837623d29d8b265bf3fa9e395a37be778f78 for https://issues.redhat.com/browse/MGMT-9797 should have fixed the issue.  Please verify and let me know if there is still an issue.

Comment 13 Shudi Li 2022-10-27 02:30:10 UTC

Verified with 4.11.0-0.nightly-2022-10-26-170309 on an SNO cluster:
1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-10-26-170309   True        False         12m     Cluster version is 4.11.0-0.nightly-2022-10-26-170309
%

2.
% oc get infrastructures.config.openshift.io cluster -oyaml
status:
  apiServerInternalURI: https://api-int.shudi-411snop12.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.shudi-411snop12.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: shudi-411snop12-2tc8f
  infrastructureTopology: SingleReplica       <---
  platform: AWS
  platformStatus:
    aws:
      region: us-east-2
    type: AWS
% 

3.
% oc get node
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-54-255.us-east-2.compute.internal   Ready    master,worker   31m   v1.24.6+5157800
% 

4. Check the router pods; only one pod is running, as expected.
% oc -n openshift-ingress get pods
NAME                             READY   STATUS    RESTARTS      AGE
router-default-c86b8754f-jkj8m   1/1     Running   3 (22m ago)   29m
% 

5.
% oc get co/ingress
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.11.0-0.nightly-2022-10-26-170309   True        False         False      21m     
%

Comment 15 Miciah Dashiel Butler Masters 2022-11-07 20:31:30 UTC

The change mentioned in comment 11 shipped in the 4.11.0 GA release, so I am changing the resolution of this BZ to "CURRENTRELEASE".