Bug 1813895
| Summary: | 4.3.1 to 4.3.5 upgrade fails, blocking on syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lili Cosic <lcosic> |
| Component: | Networking | Assignee: | Daneyon Hansen <dhansen> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | amurdaca, aos-bugs, bbennett, evb, jack.ottofaro, kgarriso, mmasters, sagrawal, vrutkovs, wking |
| Version: | 4.3.z | Keywords: | Upgrades |
| Target Milestone: | --- | ||
| Target Release: | 4.5.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-31 01:34:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Lili Cosic
2020-03-16 12:45:43 UTC
Cluster ID -> a5f82f8f-2ded-4303-b0ff-8afb726db1b1

Concerning strings from Lili's must-gather, in case it helps other folks searching for bugs find this one:
cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml
- lastTransitionTime: "2020-03-16T10:03:39Z"
message: 'Cluster operator ingress is reporting a failure: Some ingresscontrollers
are degraded: default'
reason: ClusterOperatorDegraded
status: "True"
type: Failing
cluster-scoped-resources/config.openshift.io/clusteroperators/ingress.yaml
- lastTransitionTime: "2020-03-16T09:55:00Z"
message: 'Some ingresscontrollers are degraded: default'
reason: IngressControllersDegraded
status: "True"
type: Degraded
cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml
- lastTransitionTime: "2020-03-16T09:44:09Z"
message: 'Unable to apply 4.3.5: timed out waiting for the condition during syncRequiredMachineConfigPools:
pool master has not progressed to latest configuration: controller version mismatch
for rendered-master-9145931d893186501c29af8922cf9dc7 expected d5599de7a6b86ec385e0f9c849e93977fcb4eeb8
has 25bb6aeb58135c38a667e849edf5244871be4992, retrying'
reason: RequiredPoolsFailed
status: "True"
type: Degraded
I don't actually see the "refusing to read images.json" line in the must-gather? Not sure where that is from. I'd have expected it to be in the machine-config-operator logs based on its location in the 4.3 source [1]. But the only MCO logs in the must-gather are from machine-config-operator-97f8dbdbc-4fsx2, created at 2020-03-16T09:53:44Z, which is almost 10 minutes after that operator went Degraded=True.
[1]: https://github.com/openshift/machine-config-operator/blob/ab4d62a3bf3774b77b6f9b04a2028faec1568aca/pkg/operator/sync.go#L117
So I'm not clear on what's going on with this bug; hopefully the MCO folks can figure it out.

Here's our usual boilerplate impact-statement template to give the updates team a better feel for whether we need to pull update edges to avoid this bug, and if so, which edges need pulling. We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
- Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time

What is the impact?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.2.z and 4.3.1

I don't see anything here from the MCO. The controller-mismatch error is transient and resolves once all the MCO components finish upgrading to the latest version. I see a bit of delay on one of the master nodes in updating to the latest config because it still had a pending config update. Perhaps some failure occurred during a previous upgrade? It doesn't seem to be the core issue, however, as the delay is only about 5 minutes. All the nodes finish updating within about 30 minutes of the update start time. I think something is going on with the ingress.
Assigning to routing to take a look as per EVB's findings above.

Please provide an update. Having been labeled an UpgradeBlocker means this bug is blocking at least one upgrade path.

Excerpts from the Router Deployment YAML status from comment 0's must-gather [1]:

spec:
  progressDeadlineSeconds: 600
  replicas: 2
  ...
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  ...
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-02-07T11:53:48Z"
    lastUpdateTime: "2020-03-16T09:32:56Z"
    message: ReplicaSet "router-default-57cdd994dd" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-03-16T09:54:30Z"
    lastUpdateTime: "2020-03-16T09:54:30Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  observedGeneration: 3
  readyReplicas: 1
  replicas: 2
  unavailableReplicas: 1
  updatedReplicas: 2

So we're failing to surge? But we should probably bump maxUnavailable to one, because although having only a single router pod is risky, it's not quite unavailable.

[1]: quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f1ffcdcbd684afd61ff1874b47e1c61a0f7adab93b7a21123a1e29b041d3dabf/namespaces/openshift-ingress/apps/deployments.yaml

Bumping maxUnavailable to 1 would increase the likelihood that we would drop traffic during a rolling update.

Only 1 of 2 pods from the router Deployment is running. The router pod that fails to run shows the following status:
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2020-03-16T10:58:05Z"
message: '0/6 nodes are available: 1 Insufficient cpu, 2 node(s) didn''t match
pod affinity/anti-affinity, 2 node(s) didn''t satisfy existing pods anti-affinity
rules, 3 node(s) didn''t match node selector.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable
A router can only be scheduled to worker nodes. If 1 of the worker nodes does not have enough CPU, the router Deployment will not achieve the desired number of replicas, and the Ingress Operator will report Available=False.
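To make the surge arithmetic concrete, here is a minimal sketch (not OpenShift or Kubernetes code; `rollout_bounds` is a hypothetical helper) of what the excerpted strategy (`replicas: 2`, `maxSurge: 1`, `maxUnavailable: 0`) implies during a rolling update:

```python
# Hypothetical helper illustrating Deployment rolling-update bounds.
# Values mirror the router Deployment excerpt above: replicas=2,
# maxSurge=1, maxUnavailable=0.
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int):
    """Return (max pods allowed during rollout, min pods that must stay available)."""
    return replicas + max_surge, replicas - max_unavailable

max_total, min_available = rollout_bounds(2, 1, 0)
# max_total == 3, min_available == 2: the scheduler must be able to place a
# third (surge) router pod before any old pod can be removed. If no worker
# can host that extra pod, the rollout stalls exactly as described above.
```

With maxUnavailable bumped to 1, min_available drops to 1, which is the traffic-drop risk raised in the discussion: a rolling update could then proceed with only a single router pod serving.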
This looks like a duplicate of bug 1817769: the pod has a node selector for node-role.kubernetes.io/worker, the pod's anti-affinity rule is keyed on kubernetes.io/hostname, and all worker nodes have the label kubernetes.io/hostname=localhost (2 of the 3 masters also have the label kubernetes.io/hostname=localhost, which is wrong but does not affect the router pods). The workaround is to fix the kubernetes.io/hostname labels as described in bug 1817769, comment 1. The solution would be to make sure nodes are correctly labeled in the first place.

*** This bug has been marked as a duplicate of bug 1817769 ***

Filling in comment 11 based on the original must-gather here:

$ grep -r kubernetes.io/hostname cluster-scoped-resources/core/nodes/
cluster-scoped-resources/core/nodes/ip-10-0-153-246.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost
cluster-scoped-resources/core/nodes/ip-10-0-169-151.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost
cluster-scoped-resources/core/nodes/ip-10-0-133-10.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: ip-10-0-133-10
cluster-scoped-resources/core/nodes/ip-10-0-162-171.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost
cluster-scoped-resources/core/nodes/ip-10-0-155-181.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost
cluster-scoped-resources/core/nodes/ip-10-0-133-149.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
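The label collision described above can be sketched with a toy model (not scheduler code; `schedulable_nodes` and the node data are hypothetical): with required anti-affinity keyed on kubernetes.io/hostname, every node sharing the same label value counts as the same topology domain, so the second router pod has nowhere to land.

```python
# Toy model of hostname-keyed pod anti-affinity (hypothetical helper).
def schedulable_nodes(nodes, used_hostnames):
    """Nodes whose kubernetes.io/hostname label is not already claimed by a peer pod."""
    return [name for name, labels in nodes.items()
            if labels["kubernetes.io/hostname"] not in used_hostnames]

# All workers mislabeled "localhost", as in the grep output above.
workers = {
    "worker-a": {"kubernetes.io/hostname": "localhost"},
    "worker-b": {"kubernetes.io/hostname": "localhost"},
}
# One router pod already runs on a "localhost"-labeled node, so the
# second replica cannot be scheduled anywhere:
assert schedulable_nodes(workers, {"localhost"}) == []

# With correct, unique hostname labels (the fix from bug 1817769,
# comment 1), the second pod has a home again:
workers["worker-b"]["kubernetes.io/hostname"] = "worker-b"
assert schedulable_nodes(workers, {"localhost"}) == ["worker-b"]
```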