Bug 1830271 - cluster-ingress-operator is not marked degraded when replicas not met
Summary: cluster-ingress-operator is not marked degraded when replicas not met
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-01 12:15 UTC by Stephen Benjamin
Modified: 2022-08-04 22:27 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 15:58:32 UTC
Target Upstream Version:
Embargoed:


Attachments
oc -n openshift-ingress get deploy/router-default -o yaml (13.99 KB, text/plain)
2020-06-03 14:48 UTC, Stephen Benjamin


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 420 0 None closed Bug 1830271: status: Replace "DeploymentDegraded" condition 2021-01-21 20:12:37 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:58:56 UTC

Description Stephen Benjamin 2020-05-01 12:15:25 UTC
Description of problem:

When deploying a cluster with only 1 worker, the ingress operator can't meet its desired replica count of 2, but the operator does not go into a degraded state. Because of this, installation succeeds when it should not.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Deploy a cluster with 3 masters, and 1 worker

Actual results:
Install succeeds

Expected results:
Installation should fail, with the ingress operator marked as degraded.

Additional info:

NAME                              READY   STATUS    RESTARTS   AGE
router-default-7b9df87dc5-dctc7   0/1     Pending   0          28m
router-default-7b9df87dc5-jnwzf   1/1     Running   0          28m

Operator is not degraded:

NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.ci-2020-04-30-121321   True        False         False      2m4

Comment 1 Miciah Dashiel Butler Masters 2020-05-08 19:22:41 UTC
Can you attach the output of `oc -n openshift-ingress get deploy/router-default -o yaml` (or the deployment spec from must-gather if that is easier)?

Comment 2 Miciah Dashiel Butler Masters 2020-05-08 19:23:47 UTC
Sorry, didn't mean to re-assign at this time.

Comment 3 Andrew McDermott 2020-05-19 15:11:49 UTC
Moving to 4.6.

Comment 4 Stephen Benjamin 2020-06-03 14:48:06 UTC
Created attachment 1694842 [details]
oc -n openshift-ingress get deploy/router-default -o yaml

Comment 5 Miciah Dashiel Butler Masters 2020-06-04 17:31:46 UTC
It seems like the "router-default" deployment is reporting incorrect status.  If the deployment reported Available=False in its status conditions, then the ingress operator would set Degraded=True in the clusteroperator's status conditions.
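
For illustration, here is a rough Go sketch of the mapping described above (not the actual operator code; shouldMarkDegraded is a hypothetical helper built on the upstream k8s.io/api types): if the deployment reports Available=False, the operator would report Degraded=True.

    package status

    import (
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
    )

    // shouldMarkDegraded returns true when the Deployment reports
    // Available=False, which is the signal the ingress operator would
    // translate into Degraded=True on its clusteroperator.
    func shouldMarkDegraded(deploy *appsv1.Deployment) bool {
        for _, cond := range deploy.Status.Conditions {
            if cond.Type == appsv1.DeploymentAvailable {
                return cond.Status == corev1.ConditionFalse
            }
        }
        // No Available condition reported yet; err on the side of not degraded.
        return false
    }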

The "router-default" deployment specifies 2 replicas with 25% maximum unavailable:

    spec:
      replicas: 2
      # ...
      strategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 25%
        type: RollingUpdate

The deployment has 1 available replica yet reports Available=True:

    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: "2020-06-03T14:21:40Z"
        lastUpdateTime: "2020-06-03T14:21:40Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      - lastTransitionTime: "2020-06-03T14:31:40Z"
        lastUpdateTime: "2020-06-03T14:31:40Z"
        message: ReplicaSet "router-default-f4c5b8bdd" has timed out progressing.
        reason: ProgressDeadlineExceeded
        status: "False"
        type: Progressing
      observedGeneration: 1
      readyReplicas: 1
      replicas: 2
      unavailableReplicas: 1
      updatedReplicas: 2

The deployment's maxUnavailable parameter has the following meaning:

    	// The maximum number of pods that can be unavailable during the update.
    	// Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%).
    	// Absolute number is calculated from percentage by rounding down.

The deployment's "Available" condition has the following meaning:

    	// Available means the deployment is available, ie. at least the minimum available
    	// replicas required are up and running for at least minReadySeconds.

The maximum unavailable is ⌊ %maxUnavailable * spec.replicas ⌋ = ⌊ 25% * 2 ⌋ = 0, which implies that the minimum available is spec.replicas - 0 = 2.  With only 1 available replica, this minimum is not met, so the deployment controller should set Available=False per the API documentation.
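
A minimal sketch of that arithmetic (illustrative only; the values mirror the deployment above, and the percentage is rounded down as the API documentation states):

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        replicas := 2
        maxUnavailablePercent := 0.25 // maxUnavailable: 25%

        // Absolute number is calculated from the percentage by rounding down.
        maxUnavailable := int(math.Floor(maxUnavailablePercent * float64(replicas))) // ⌊25% * 2⌋ = 0
        minAvailable := replicas - maxUnavailable                                    // 2 - 0 = 2

        availableReplicas := 1 // from the status above
        fmt.Printf("maxUnavailable=%d minAvailable=%d available=%d ok=%v\n",
            maxUnavailable, minAvailable, availableReplicas, availableReplicas >= minAvailable)
        // Output: maxUnavailable=0 minAvailable=2 available=1 ok=false
    }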

Based on the above analysis, I am re-assigning this Bugzilla report to kube-controller-manager, which manages the deployment controller, which sets the deployment's status.

Comment 6 Maciej Szulik 2020-06-05 14:35:11 UTC
Per your configuration:

  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate

one replica of the router fulfils the minimum requirements, and that's expressed in the status:

      - lastTransitionTime: "2020-06-03T14:21:40Z"
        lastUpdateTime: "2020-06-03T14:21:40Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"

For now I'd suggest looking at the above, which will give you the availability of your deployment;
then, for full availability, compare .spec.replicas with .status.readyReplicas.
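
A minimal sketch of the suggested full-availability check (fullyAvailable is a hypothetical helper, assuming the caller already has the Deployment object from the upstream k8s.io/api types):

    package status

    import appsv1 "k8s.io/api/apps/v1"

    // fullyAvailable reports whether every desired replica is ready,
    // i.e. .spec.replicas is met by .status.readyReplicas.
    func fullyAvailable(deploy *appsv1.Deployment) bool {
        desired := int32(1) // Kubernetes defaults .spec.replicas to 1 when unset
        if deploy.Spec.Replicas != nil {
            desired = *deploy.Spec.Replicas
        }
        return deploy.Status.ReadyReplicas >= desired
    }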

Comment 7 Maciej Szulik 2020-06-05 14:38:25 UTC
I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1844502 to fix the Progressing state in the meantime, until the new status reporting we'll be working on in the upcoming months lands.

Comment 9 Andrew McDermott 2020-06-17 10:23:01 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 10 Miciah Dashiel Butler Masters 2020-07-09 05:04:18 UTC
I'll work on getting the posted fix reviewed and merged in the upcoming sprint.

Comment 13 Hongan Li 2020-07-27 08:25:13 UTC
Verified with 4.6.0-0.nightly-2020-07-25-091217; the issue has been fixed.

The following three condition types are added:
  - lastTransitionTime: "2020-07-27T07:56:49Z"
    message: 'The deployment has Available status condition set to False (reason:
      MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.'
    reason: DeploymentUnavailable
    status: "False"
    type: DeploymentAvailable
  - lastTransitionTime: "2020-07-27T07:56:48Z"
    message: 1/3 of replicas are available, max unavailable is 1
    reason: DeploymentMinimumReplicasNotMet
    status: "False"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2020-07-27T07:56:48Z"
    message: 1/3 of replicas are available
    reason: DeploymentReplicasNotAvailable
    status: "False"
    type: DeploymentReplicasAllAvailable

and one condition type is removed:
  - lastTransitionTime: "2020-07-27T01:59:34Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded

Comment 15 errata-xmlrpc 2020-10-27 15:58:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

