Bug 1830271

Summary:

cluster-ingress-operator is not marked degraded when replicas not met

Product:

OpenShift Container Platform

Reporter:

Stephen Benjamin <stbenjam>

Component:

Networking

Assignee:

Miciah Dashiel Butler Masters <mmasters>

Networking sub component:

router

QA Contact:

Hongan Li <hongli>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

medium

Priority:

unspecified

CC:

amcdermo, aos-bugs, maszulik, mfojtik, mmasters

Version:

4.5

Target Milestone:

---

Target Release:

4.6.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2020-10-27 15:58:32 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
oc -n openshift-ingress get deploy/router-default -o yaml	none

Description Stephen Benjamin 2020-05-01 12:15:25 UTC

Description of problem:

When deploying a cluster with only 1 worker, the ingress operator can't meet it's desire for replicas = 2, but the operator does not go into a degraded state. Installation succeeds because of this, when it should not.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Deploy a cluster with 3 masters, and 1 worker

Actual results:
Install succeeds

Expected results:
It should fail with ingress being marked as degraded.

Additional info:

NAME                              READY   STATUS    RESTARTS   AGE
router-default-7b9df87dc5-dctc7   0/1     Pending   0          28m
router-default-7b9df87dc5-jnwzf   1/1     Running   0          28m

Operator is not degraded:

NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.ci-2020-04-30-121321   True        False         False      2m4

Comment 1 Miciah Dashiel Butler Masters 2020-05-08 19:22:41 UTC

Can you attach the output of `oc -n openshift-ingress get deploy/router-default -o yaml` (or the deployment spec from must-gather if that is easier)?

Comment 2 Miciah Dashiel Butler Masters 2020-05-08 19:23:47 UTC

Sorry, didn't mean to re-assign at this time.

Comment 3 Andrew McDermott 2020-05-19 15:11:49 UTC

Moving to 4.6.

Comment 4 Stephen Benjamin 2020-06-03 14:48:06 UTC

Created attachment 1694842 [details]
oc -n openshift-ingress get deploy/router-default -o yaml

Comment 5 Miciah Dashiel Butler Masters 2020-06-04 17:31:46 UTC

It seems like the "router-default" deployment is reporting incorrect status.  If the deployment reported Available=False in its status conditions, then the ingress operator would set Degraded=True in the clusteroperator's status conditions.

The "router-default" deployment specifies 2 replicas with 25% maximum unavailable:

    spec:
      replicas: 2
      # ...
      strategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 25%
        type: RollingUpdate

The deployment has 1 available replica yet reports Available=True:

    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: "2020-06-03T14:21:40Z"
        lastUpdateTime: "2020-06-03T14:21:40Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      - lastTransitionTime: "2020-06-03T14:31:40Z"
        lastUpdateTime: "2020-06-03T14:31:40Z"
        message: ReplicaSet "router-default-f4c5b8bdd" has timed out progressing.
        reason: ProgressDeadlineExceeded
        status: "False"
        type: Progressing
      observedGeneration: 1
      readyReplicas: 1
      replicas: 2
      unavailableReplicas: 1
      updatedReplicas: 2

The deployment's maxUnavailable parameter has the following meaning:

    	// The maximum number of pods that can be unavailable during the update.
    	// Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%).
    	// Absolute number is calculated from percentage by rounding down.

The deployment's "Available" condition has the following meaning:

    	// Available means the deployment is available, ie. at least the minimum available
    	// replicas required are up and running for at least minReadySeconds.

If the maximum unavailable is ⌊ %maxUnavailable * spec.replicas ⌋ = ⌊ 25% * 2 ⌋ = 0, then that implies that the minimum available is spec.replicas - 0 = 2.  This minimum is not met, so the deployment controller should set Available=False per the API documentation.

Based on the above analysis, I am re-assigning this Bugzilla report to kube-controller-manager, which manages the deployment controller, which sets the deployment's status.

Comment 6 Maciej Szulik 2020-06-05 14:35:11 UTC

Per your configuration:

  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate

that one replica of the router fulfils minimal requirements, and that's expressed in status:

      - lastTransitionTime: "2020-06-03T14:21:40Z"
        lastUpdateTime: "2020-06-03T14:21:40Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"

For now I'd suggest looking at the above, which will give you the availability of your deployment
and then for full availability compare .spec.replicas with .spec.readyReplicas.

Comment 7 Maciej Szulik 2020-06-05 14:38:25 UTC

I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1844502 to fix the Progressing state in the mean time, until we'll get the new status we'll be working in the upcoming months.

Comment 8 Maciej Szulik 2020-06-05 14:41:03 UTC

Handy code for my previous suggestion lives here: https://github.com/kubernetes/kubernetes/blob/442a69c3bdf6fe8e525b05887e57d89db1e2f3a5/pkg/controller/deployment/util/deployment_util.go#L738-L745

Comment 9 Andrew McDermott 2020-06-17 10:23:01 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 10 Miciah Dashiel Butler Masters 2020-07-09 05:04:18 UTC

I'll work on getting the posted fix reviewed and merged in the upcoming sprint.

Comment 13 Hongan Li 2020-07-27 08:25:13 UTC

Verified with 4.6.0-0.nightly-2020-07-25-091217 and issue has been fixed.

Below three types are added
  - lastTransitionTime: "2020-07-27T07:56:49Z"
    message: 'The deployment has Available status condition set to False (reason:
      MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.'
    reason: DeploymentUnavailable
    status: "False"
    type: DeploymentAvailable
  - lastTransitionTime: "2020-07-27T07:56:48Z"
    message: 1/3 of replicas are available, max unavailable is 1
    reason: DeploymentMinimumReplicasNotMet
    status: "False"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2020-07-27T07:56:48Z"
    message: 1/3 of replicas are available
    reason: DeploymentReplicasNotAvailable
    status: "False"
    type: DeploymentReplicasAllAvailable

and one type is removed:
  - lastTransitionTime: "2020-07-27T01:59:34Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded

Comment 15 errata-xmlrpc 2020-10-27 15:58:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196