Description of problem: When deploying a cluster with only 1 worker, the ingress operator can't meet its desired replica count of 2, but the operator does not go into a degraded state. Because of this, installation succeeds when it should not.

Version-Release number of selected component (if applicable): 4.5

How reproducible: Always

Steps to Reproduce:
1. Deploy a cluster with 3 masters and 1 worker

Actual results: Install succeeds

Expected results: Install should fail with ingress marked as degraded.

Additional info:

    NAME                              READY   STATUS    RESTARTS   AGE
    router-default-7b9df87dc5-dctc7   0/1     Pending   0          28m
    router-default-7b9df87dc5-jnwzf   1/1     Running   0          28m

Operator is not degraded:

    NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
    ingress   4.5.0-0.ci-2020-04-30-121321   True        False         False      2m4
Can you attach the output of `oc -n openshift-ingress get deploy/router-default -o yaml` (or the deployment spec from must-gather if that is easier)?
Sorry, didn't mean to re-assign at this time.
Moving to 4.6.
Created attachment 1694842 [details] oc -n openshift-ingress get deploy/router-default -o yaml
It seems like the "router-default" deployment is reporting incorrect status. If the deployment reported Available=False in its status conditions, then the ingress operator would set Degraded=True in the clusteroperator's status conditions.

The "router-default" deployment specifies 2 replicas with 25% maximum unavailable:

    spec:
      replicas: 2
      # ...
      strategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 25%
        type: RollingUpdate

The deployment has 1 available replica yet reports Available=True:

    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: "2020-06-03T14:21:40Z"
        lastUpdateTime: "2020-06-03T14:21:40Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      - lastTransitionTime: "2020-06-03T14:31:40Z"
        lastUpdateTime: "2020-06-03T14:31:40Z"
        message: ReplicaSet "router-default-f4c5b8bdd" has timed out progressing.
        reason: ProgressDeadlineExceeded
        status: "False"
        type: Progressing
      observedGeneration: 1
      readyReplicas: 1
      replicas: 2
      unavailableReplicas: 1
      updatedReplicas: 2

The deployment's maxUnavailable parameter has the following meaning:

    // The maximum number of pods that can be unavailable during the update.
    // Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%).
    // Absolute number is calculated from percentage by rounding down.

The deployment's "Available" condition has the following meaning:

    // Available means the deployment is available, ie. at least the minimum available
    // replicas required are up and running for at least minReadySeconds.

If the maximum unavailable is ⌊maxUnavailable% × spec.replicas⌋ = ⌊25% × 2⌋ = 0, then that implies that the minimum available is spec.replicas - 0 = 2. This minimum is not met, so the deployment controller should set Available=False per the API documentation.
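The arithmetic above can be sketched in a few lines of Go. This is a simplified stand-in for the percentage resolution the deployment controller performs (the real code uses the intstr utilities in Kubernetes); the function name `maxUnavailable` is hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// maxUnavailable resolves a percentage maxUnavailable against the
// desired replica count, rounding DOWN per the API documentation.
// Simplified sketch; not the actual Kubernetes intstr implementation.
func maxUnavailable(percent float64, replicas int) int {
	return int(math.Floor(percent / 100.0 * float64(replicas)))
}

func main() {
	replicas := 2
	unavail := maxUnavailable(25, replicas) // ⌊25% × 2⌋ = 0
	minAvailable := replicas - unavail      // 2 - 0 = 2
	fmt.Printf("maxUnavailable=%d minAvailable=%d\n", unavail, minAvailable)
}
```

With 2 replicas and 25% maxUnavailable, the rounded-down result is 0 unavailable pods allowed, so the minimum available is the full 2 replicas, which the deployment in this bug does not meet.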
Based on the above analysis, I am re-assigning this Bugzilla report to kube-controller-manager, which manages the deployment controller, which sets the deployment's status.
Per your configuration:

    strategy:
      rollingUpdate:
        maxSurge: 0
        maxUnavailable: 25%
      type: RollingUpdate

one replica of the router fulfils the minimum requirement, and that is expressed in the status:

    - lastTransitionTime: "2020-06-03T14:21:40Z"
      lastUpdateTime: "2020-06-03T14:21:40Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"

For now I'd suggest looking at the above, which will give you the availability of your deployment, and then for full availability compare .spec.replicas with .status.readyReplicas.
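The suggested check can be sketched as follows. The struct is a hypothetical, trimmed-down view of the Deployment fields involved, not the real client-go types:

```go
package main

import "fmt"

// DeploymentView is a hypothetical stand-in for the Deployment fields
// relevant to the suggested check (illustration only).
type DeploymentView struct {
	SpecReplicas  int32 // .spec.replicas
	ReadyReplicas int32 // .status.readyReplicas
	Available     bool  // the Available status condition
}

// fullyAvailable applies the suggestion above: the deployment must
// report Available=True AND every desired replica must be ready.
func fullyAvailable(d DeploymentView) bool {
	return d.Available && d.ReadyReplicas == d.SpecReplicas
}

func main() {
	// The state from this bug: 2 desired, 1 ready, Available=True.
	d := DeploymentView{SpecReplicas: 2, ReadyReplicas: 1, Available: true}
	fmt.Println(fullyAvailable(d)) // false: only 1 of 2 replicas is ready
}
```

Checking readiness against .spec.replicas distinguishes "minimum availability" (what the Available condition reports today) from full availability, which is what an operator would want before declaring itself healthy.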
I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1844502 to fix the Progressing state in the meantime, until we get the new status conditions we'll be working on in the upcoming months.
Handy code for my previous suggestion lives here: https://github.com/kubernetes/kubernetes/blob/442a69c3bdf6fe8e525b05887e57d89db1e2f3a5/pkg/controller/deployment/util/deployment_util.go#L738-L745
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
I'll work on getting the posted fix reviewed and merged in the upcoming sprint.
Verified with 4.6.0-0.nightly-2020-07-25-091217; the issue has been fixed. The following three condition types are added:

    - lastTransitionTime: "2020-07-27T07:56:49Z"
      message: 'The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.'
      reason: DeploymentUnavailable
      status: "False"
      type: DeploymentAvailable
    - lastTransitionTime: "2020-07-27T07:56:48Z"
      message: 1/3 of replicas are available, max unavailable is 1
      reason: DeploymentMinimumReplicasNotMet
      status: "False"
      type: DeploymentReplicasMinAvailable
    - lastTransitionTime: "2020-07-27T07:56:48Z"
      message: 1/3 of replicas are available
      reason: DeploymentReplicasNotAvailable
      status: "False"
      type: DeploymentReplicasAllAvailable

and one type is removed:

    - lastTransitionTime: "2020-07-27T01:59:34Z"
      message: The deployment has Available status condition set to True
      reason: DeploymentAvailable
      status: "False"
      type: DeploymentDegraded
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196