Bug 1830271
| Summary: | cluster-ingress-operator is not marked degraded when replicas not met | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Stephen Benjamin <stbenjam> |
| Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | amcdermo, aos-bugs, maszulik, mfojtik, mmasters |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 15:58:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Stephen Benjamin
2020-05-01 12:15:25 UTC
Can you attach the output of `oc -n openshift-ingress get deploy/router-default -o yaml` (or the deployment spec from must-gather if that is easier)?

Sorry, didn't mean to re-assign at this time.

Moving to 4.6.

Created attachment 1694842 [details]
oc -n openshift-ingress get deploy/router-default -o yaml
It seems like the "router-default" deployment is reporting incorrect status. If the deployment reported Available=False in its status conditions, then the ingress operator would set Degraded=True in the clusteroperator's status conditions.
The "router-default" deployment specifies 2 replicas with 25% maximum unavailable:
```yaml
spec:
  replicas: 2
  # ...
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate
```
The deployment has 1 available replica yet reports Available=True:
```yaml
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-06-03T14:21:40Z"
    lastUpdateTime: "2020-06-03T14:21:40Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2020-06-03T14:31:40Z"
    lastUpdateTime: "2020-06-03T14:31:40Z"
    message: ReplicaSet "router-default-f4c5b8bdd" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 2
  unavailableReplicas: 1
  updatedReplicas: 2
```
The deployment's maxUnavailable parameter has the following meaning:
```go
// The maximum number of pods that can be unavailable during the update.
// Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%).
// Absolute number is calculated from percentage by rounding down.
```
The deployment's "Available" condition has the following meaning:
```go
// Available means the deployment is available, ie. at least the minimum available
// replicas required are up and running for at least minReadySeconds.
```
Since the maximum unavailable is ⌊ %maxUnavailable * spec.replicas ⌋ = ⌊ 25% * 2 ⌋ = 0, the minimum available is spec.replicas - 0 = 2. Only 1 replica is available, so this minimum is not met, and the deployment controller should set Available=False per the API documentation.
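The rounding arithmetic above can be sketched in Go. This is illustrative only; `minAvailable` is a hypothetical helper, and the real computation lives in the Kubernetes deployment controller's util package, which uses intstr scaling helpers.

```go
package main

import "fmt"

// minAvailable is a hypothetical helper mirroring the documented behavior:
// a percentage maxUnavailable is converted to an absolute count by rounding
// down, and the minimum available is the replica count minus that count.
func minAvailable(replicas, maxUnavailablePercent int) int {
	maxUnavailable := replicas * maxUnavailablePercent / 100 // integer division rounds down
	return replicas - maxUnavailable
}

func main() {
	// 25% of 2 replicas rounds down to 0 unavailable, so both replicas
	// must be available for the deployment to report Available=True.
	fmt.Println(minAvailable(2, 25)) // prints 2
}
```

With 2 replicas and 25% maxUnavailable, zero pods may be unavailable, which is exactly why the one-replica state quoted above should not count as available.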
Based on the above analysis, I am re-assigning this Bugzilla report to kube-controller-manager, which manages the deployment controller, which sets the deployment's status.
Per your configuration:
```yaml
strategy:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 25%
  type: RollingUpdate
```
one replica of the router fulfils the minimum availability requirement, and that is expressed in the status:
```yaml
- lastTransitionTime: "2020-06-03T14:21:40Z"
  lastUpdateTime: "2020-06-03T14:21:40Z"
  message: Deployment has minimum availability.
  reason: MinimumReplicasAvailable
  status: "True"
```
For now I'd suggest looking at the condition above, which gives you the minimum availability of your deployment; for full availability, compare .spec.replicas with .status.readyReplicas.
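That comparison can be sketched minimally, without client-go; the struct and the `fullyAvailable` helper below are illustrative stand-ins for the deployment's `.spec.replicas` and `.status.readyReplicas` fields.

```go
package main

import "fmt"

// deploymentStatus is a hypothetical, trimmed-down view of a deployment,
// holding only the two counts the suggestion above compares.
type deploymentStatus struct {
	SpecReplicas  int32 // .spec.replicas (desired)
	ReadyReplicas int32 // .status.readyReplicas (actually ready)
}

// fullyAvailable reports full availability only when every desired
// replica is ready, as opposed to mere minimum availability.
func fullyAvailable(d deploymentStatus) bool {
	return d.ReadyReplicas >= d.SpecReplicas
}

func main() {
	// Counts from the status block quoted earlier: 2 desired, 1 ready.
	fmt.Println(fullyAvailable(deploymentStatus{SpecReplicas: 2, ReadyReplicas: 1})) // prints false
}
```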
I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1844502 to fix the Progressing state in the meantime, until the new status work we'll be doing in the upcoming months lands.

Handy code for my previous suggestion lives here: https://github.com/kubernetes/kubernetes/blob/442a69c3bdf6fe8e525b05887e57d89db1e2f3a5/pkg/controller/deployment/util/deployment_util.go#L738-L745

I'm adding UpcomingSprint because I was occupied with fixing bugs of higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I'll work on getting the posted fix reviewed and merged in the upcoming sprint.

Verified with 4.6.0-0.nightly-2020-07-25-091217, and the issue has been fixed.
The following three condition types are added:
```yaml
- lastTransitionTime: "2020-07-27T07:56:49Z"
  message: 'The deployment has Available status condition set to False (reason:
    MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.'
  reason: DeploymentUnavailable
  status: "False"
  type: DeploymentAvailable
- lastTransitionTime: "2020-07-27T07:56:48Z"
  message: 1/3 of replicas are available, max unavailable is 1
  reason: DeploymentMinimumReplicasNotMet
  status: "False"
  type: DeploymentReplicasMinAvailable
- lastTransitionTime: "2020-07-27T07:56:48Z"
  message: 1/3 of replicas are available
  reason: DeploymentReplicasNotAvailable
  status: "False"
  type: DeploymentReplicasAllAvailable
```
and one condition type is removed:
```yaml
- lastTransitionTime: "2020-07-27T01:59:34Z"
  message: The deployment has Available status condition set to True
  reason: DeploymentAvailable
  status: "False"
  type: DeploymentDegraded
```
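As a hedged sketch, a condition like DeploymentReplicasMinAvailable could be derived from the counts that appear in its message; the function name and signature below are assumptions for illustration, and the actual ingress-operator implementation may differ.

```go
package main

import "fmt"

// replicasMinAvailableCondition is a hypothetical helper: it builds a
// message in the style quoted above and reports "False" when more
// replicas are unavailable than the configured maximum allows.
func replicasMinAvailableCondition(available, desired, maxUnavailable int32) (status, message string) {
	message = fmt.Sprintf("%d/%d of replicas are available, max unavailable is %d",
		available, desired, maxUnavailable)
	if desired-available > maxUnavailable {
		return "False", message
	}
	return "True", message
}

func main() {
	// Counts from the verified condition above: 1 of 3 available, max unavailable 1.
	status, msg := replicasMinAvailableCondition(1, 3, 1)
	fmt.Println(status, msg) // prints: False 1/3 of replicas are available, max unavailable is 1
}
```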
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196