Bug 1830271 - cluster-ingress-operator is not marked degraded when replicas not met
Summary: cluster-ingress-operator is not marked degraded when replicas not met
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-01 12:15 UTC by Stephen Benjamin
Modified: 2022-08-04 22:27 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 15:58:32 UTC
Target Upstream Version:
Embargoed:


Attachments
oc -n openshift-ingress get deploy/router-default -o yaml (13.99 KB, text/plain)
2020-06-03 14:48 UTC, Stephen Benjamin


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 420 0 None closed Bug 1830271: status: Replace "DeploymentDegraded" condition 2021-01-21 20:12:37 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:58:56 UTC

Description Stephen Benjamin 2020-05-01 12:15:25 UTC
Description of problem:

When deploying a cluster with only 1 worker, the ingress operator can't meet its desired replica count of 2, but the operator does not go into a degraded state. Because of this, installation succeeds when it should not.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Deploy a cluster with 3 masters, and 1 worker

Actual results:
Install succeeds

Expected results:
Installation should fail, with the ingress operator marked as degraded.

Additional info:

NAME                              READY   STATUS    RESTARTS   AGE
router-default-7b9df87dc5-dctc7   0/1     Pending   0          28m
router-default-7b9df87dc5-jnwzf   1/1     Running   0          28m

Operator is not degraded:

NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.ci-2020-04-30-121321   True        False         False      2m4

Comment 1 Miciah Dashiel Butler Masters 2020-05-08 19:22:41 UTC
Can you attach the output of `oc -n openshift-ingress get deploy/router-default -o yaml` (or the deployment spec from must-gather if that is easier)?

Comment 2 Miciah Dashiel Butler Masters 2020-05-08 19:23:47 UTC
Sorry, didn't mean to re-assign at this time.

Comment 3 Andrew McDermott 2020-05-19 15:11:49 UTC
Moving to 4.6.

Comment 4 Stephen Benjamin 2020-06-03 14:48:06 UTC
Created attachment 1694842 [details]
oc -n openshift-ingress get deploy/router-default -o yaml

Comment 5 Miciah Dashiel Butler Masters 2020-06-04 17:31:46 UTC
It seems like the "router-default" deployment is reporting incorrect status.  If the deployment reported Available=False in its status conditions, then the ingress operator would set Degraded=True in the clusteroperator's status conditions.
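
For illustration, here is a rough Go sketch of the mapping described above (not the actual operator code; shouldMarkDegraded is a hypothetical helper built on the upstream k8s.io/api types): if the deployment reports Available=False, the operator would report Degraded=True.

    package status

    import (
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
    )

    // shouldMarkDegraded returns true when the Deployment reports
    // Available=False, which is the signal the ingress operator would
    // translate into Degraded=True on its clusteroperator.
    func shouldMarkDegraded(deploy *appsv1.Deployment) bool {
        for _, cond := range deploy.Status.Conditions {
            if cond.Type == appsv1.DeploymentAvailable {
                return cond.Status == corev1.ConditionFalse
            }
        }
        // No Available condition reported yet; err on the side of not degraded.
        return false
    }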

The "router-default" deployment specifies 2 replicas with 25% maximum unavailable:

    spec:
      replicas: 2
      # ...
      strategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 25%
        type: RollingUpdate

The deployment has 1 available replica yet reports Available=True:

    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: "2020-06-03T14:21:40Z"
        lastUpdateTime: "2020-06-03T14:21:40Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      - lastTransitionTime: "2020-06-03T14:31:40Z"
        lastUpdateTime: "2020-06-03T14:31:40Z"
        message: ReplicaSet "router-default-f4c5b8bdd" has timed out progressing.
        reason: ProgressDeadlineExceeded
        status: "False"
        type: Progressing
      observedGeneration: 1
      readyReplicas: 1
      replicas: 2
      unavailableReplicas: 1
      updatedReplicas: 2

The deployment's maxUnavailable parameter has the following meaning:

    	// The maximum number of pods that can be unavailable during the update.
    	// Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%).
    	// Absolute number is calculated from percentage by rounding down.

The deployment's "Available" condition has the following meaning:

    	// Available means the deployment is available, ie. at least the minimum available
    	// replicas required are up and running for at least minReadySeconds.

The maximum unavailable is ⌊ %maxUnavailable * spec.replicas ⌋ = ⌊ 25% * 2 ⌋ = 0, which implies that the minimum available is spec.replicas - 0 = 2.  With only 1 available replica, this minimum is not met, so the deployment controller should set Available=False per the API documentation.
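
A minimal sketch of that arithmetic (illustrative only; the values mirror the deployment above, and the percentage is rounded down as the API documentation states):

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        replicas := 2
        maxUnavailablePercent := 0.25 // maxUnavailable: 25%

        // Absolute number is calculated from the percentage by rounding down.
        maxUnavailable := int(math.Floor(maxUnavailablePercent * float64(replicas))) // ⌊25% * 2⌋ = 0
        minAvailable := replicas - maxUnavailable                                    // 2 - 0 = 2

        availableReplicas := 1 // from the status above
        fmt.Printf("maxUnavailable=%d minAvailable=%d available=%d ok=%v\n",
            maxUnavailable, minAvailable, availableReplicas, availableReplicas >= minAvailable)
        // Output: maxUnavailable=0 minAvailable=2 available=1 ok=false
    }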

Based on the above analysis, I am re-assigning this Bugzilla report to kube-controller-manager, which manages the deployment controller, which sets the deployment's status.

Comment 6 Maciej Szulik 2020-06-05 14:35:11 UTC
Per your configuration:

  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate

one replica of the router fulfils the minimum requirements, and that's expressed in the status:

      - lastTransitionTime: "2020-06-03T14:21:40Z"
        lastUpdateTime: "2020-06-03T14:21:40Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"

For now I'd suggest looking at the above, which will give you the availability of your deployment;
then, for full availability, compare .spec.replicas with .status.readyReplicas.
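
A minimal sketch of the suggested full-availability check (fullyAvailable is a hypothetical helper, assuming the caller already has the Deployment object from the upstream k8s.io/api types):

    package status

    import appsv1 "k8s.io/api/apps/v1"

    // fullyAvailable reports whether every desired replica is ready,
    // i.e. .spec.replicas is met by .status.readyReplicas.
    func fullyAvailable(deploy *appsv1.Deployment) bool {
        desired := int32(1) // Kubernetes defaults .spec.replicas to 1 when unset
        if deploy.Spec.Replicas != nil {
            desired = *deploy.Spec.Replicas
        }
        return deploy.Status.ReadyReplicas >= desired
    }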

Comment 7 Maciej Szulik 2020-06-05 14:38:25 UTC
I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1844502 to fix the Progressing state in the meantime, until the new status reporting we'll be working on in the upcoming months lands.

Comment 9 Andrew McDermott 2020-06-17 10:23:01 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 10 Miciah Dashiel Butler Masters 2020-07-09 05:04:18 UTC
I'll work on getting the posted fix reviewed and merged in the upcoming sprint.

Comment 13 Hongan Li 2020-07-27 08:25:13 UTC
Verified with 4.6.0-0.nightly-2020-07-25-091217; the issue has been fixed.

The following three condition types are added:
  - lastTransitionTime: "2020-07-27T07:56:49Z"
    message: 'The deployment has Available status condition set to False (reason:
      MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.'
    reason: DeploymentUnavailable
    status: "False"
    type: DeploymentAvailable
  - lastTransitionTime: "2020-07-27T07:56:48Z"
    message: 1/3 of replicas are available, max unavailable is 1
    reason: DeploymentMinimumReplicasNotMet
    status: "False"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2020-07-27T07:56:48Z"
    message: 1/3 of replicas are available
    reason: DeploymentReplicasNotAvailable
    status: "False"
    type: DeploymentReplicasAllAvailable

and one condition type is removed:
  - lastTransitionTime: "2020-07-27T01:59:34Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded

Comment 15 errata-xmlrpc 2020-10-27 15:58:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

