Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1813895

Summary: 4.3.1 to 4.3.5 upgrade fails, blocking on syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch
Product: OpenShift Container Platform
Reporter: Lili Cosic <lcosic>
Component: Networking
Assignee: Daneyon Hansen <dhansen>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: amurdaca, aos-bugs, bbennett, evb, jack.ottofaro, kgarriso, mmasters, sagrawal, vrutkovs, wking
Version: 4.3.z
Keywords: Upgrades
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-31 01:34:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Description Lili Cosic 2020-03-16 12:45:43 UTC
Description of problem:

In our long-lived cluster we upgraded from 4.3.1 to 4.3.5 and the cluster did not finish the upgrade because the machine config operator was degraded:
> Failed to resync 4.3.1 because: refusing to read images.json version "4.3.5", operator version "4.3.1"

That caused two nodes to not be schedulable and in turn degraded the ingress operator as well, as it could not schedule pods.

Version-Release number of selected component (if applicable):
4.3.1->4.3.5

Right now I marked some nodes as schedulable, which fixed some issues, but the ingress pod and one etcd pod still cannot be scheduled.

Expected results:

Smooth upgrade.

Additional info:
must-gather -> https://drive.google.com/file/d/19NYn7kCNubH6hfASzSikM1jnwU0NUvJZ/view?usp=sharing

We can also provide access to the cluster.

Comment 1 Lili Cosic 2020-03-16 12:47:51 UTC
Cluster ID -> a5f82f8f-2ded-4303-b0ff-8afb726db1b1

Comment 4 W. Trevor King 2020-03-18 04:20:07 UTC
Concerning strings from Lili's must-gather, in case it helps other folks searching for bugs find this one:

cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml

  - lastTransitionTime: "2020-03-16T10:03:39Z"
    message: 'Cluster operator ingress is reporting a failure: Some ingresscontrollers
      are degraded: default'
    reason: ClusterOperatorDegraded
    status: "True"
    type: Failing

cluster-scoped-resources/config.openshift.io/clusteroperators/ingress.yaml

  - lastTransitionTime: "2020-03-16T09:55:00Z"
    message: 'Some ingresscontrollers are degraded: default'
    reason: IngressControllersDegraded
    status: "True"
    type: Degraded

cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml

  - lastTransitionTime: "2020-03-16T09:44:09Z"
    message: 'Unable to apply 4.3.5: timed out waiting for the condition during syncRequiredMachineConfigPools:
      pool master has not progressed to latest configuration: controller version mismatch
      for rendered-master-9145931d893186501c29af8922cf9dc7 expected d5599de7a6b86ec385e0f9c849e93977fcb4eeb8
      has 25bb6aeb58135c38a667e849edf5244871be4992, retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded

I don't actually see the "refusing to read images.json" line in the must-gather?  Not sure where that is from.  I'd have expected it to be in the machine-config-operator logs based on its location in the 4.3 source [1].  But the only MCO logs in the must-gather are machine-config-operator-97f8dbdbc-4fsx2 created 2020-03-16T09:53:44Z, which is almost 10m after that operator went Degraded=True.

[1]: https://github.com/openshift/machine-config-operator/blob/ab4d62a3bf3774b77b6f9b04a2028faec1568aca/pkg/operator/sync.go#L117
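The guard behind that "refusing to read images.json" message can be sketched as follows. This is an illustrative reconstruction, not the actual MCO code; the function name and shape are hypothetical, but the logic mirrors the version check at the linked location in sync.go:

```python
# Hypothetical sketch of the MCO's payload-version guard: the operator
# refuses to consume an images.json whose embedded version differs from
# the operator's own running version.
def can_read_images_json(operator_version: str, images_json_version: str) -> bool:
    """Return True only when the payload version matches the operator."""
    return operator_version == images_json_version

# During the 4.3.1 -> 4.3.5 upgrade window, the still-running 4.3.1
# operator pod sees the new 4.3.5 payload and rejects it; the message
# clears once the operator pod itself has been updated to 4.3.5.
print(can_read_images_json("4.3.1", "4.3.5"))  # False -> "refusing to read images.json"
print(can_read_images_json("4.3.5", "4.3.5"))  # True after the operator updates
```

That would also explain why the message is transient and why it doesn't appear in the newer machine-config-operator pod's logs.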

Comment 5 W. Trevor King 2020-03-18 04:33:06 UTC
So I'm not clear on what's going on with this bug; hopefully the MCO folks can figure it out.  Here's our usual boilerplate impact-statement template to give the updates team a better feel for whether we need to pull update edges to avoid this bug, and if so, which edges need pulling.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.2.z and 4.3.1

Comment 6 Erica von Buelow 2020-03-18 19:14:35 UTC
I don't see anything here from the MCO. The controller-mismatch error is transient and resolves once all the MCO components finish upgrading to the latest version. I see a bit of delay on one of the master nodes in updating to the latest config because it still had a pending config update; perhaps some failure occurred during a previous upgrade? It doesn't seem to be the core issue, however, as the delay is only about 5 min, and all the nodes finish updating within about 30 min of the update start time. I think something is going on with the ingress.

Comment 7 Kirsten Garrison 2020-03-18 19:19:50 UTC
Assigning to routing to take a look as per EVB's findings above.

Comment 8 Jack Ottofaro 2020-03-30 21:23:49 UTC
Please provide an update. Having been labeled an UpgradeBlocker means this bug is blocking at least one upgrade path.

Comment 9 W. Trevor King 2020-03-30 22:14:07 UTC
Excerpts from the Router Deployment YAML status from comment 0's must-gather [1]:

  spec:
    progressDeadlineSeconds: 600
    replicas: 2
    ...
    strategy:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      type: RollingUpdate
    ...
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2020-02-07T11:53:48Z"
      lastUpdateTime: "2020-03-16T09:32:56Z"
      message: ReplicaSet "router-default-57cdd994dd" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-03-16T09:54:30Z"
      lastUpdateTime: "2020-03-16T09:54:30Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    observedGeneration: 3
    readyReplicas: 1
    replicas: 2
    unavailableReplicas: 1
    updatedReplicas: 2

So we're failing to surge?  But we should probably bump maxUnavailable to 1, because although having only a single router pod is risky, it's not quite unavailable.

[1]: quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f1ffcdcbd684afd61ff1874b47e1c61a0f7adab93b7a21123a1e29b041d3dabf/namespaces/openshift-ingress/apps/deployments.yaml
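For context on why that Deployment is stuck, the rollout arithmetic can be sketched like this (an illustrative sketch, not Kubernetes code; the function name is made up, but the bounds follow the Deployment rolling-update semantics):

```python
# Hypothetical helper: the pod-count bounds a Deployment rollout must
# respect, given its rollingUpdate strategy parameters.
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int):
    """Return (max pods that may exist, min pods that must stay available)."""
    return replicas + max_surge, replicas - max_unavailable

upper, lower = rollout_bounds(replicas=2, max_surge=1, max_unavailable=0)
print(upper, lower)  # 3 2
# With maxUnavailable=0 no old pod may be taken down until a surge pod
# (the third) is scheduled and ready. If no node can host that surge pod,
# the rollout stalls exactly as the status above shows: replicas 2,
# updatedReplicas 2, but availableReplicas 1.
```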

Comment 10 Daneyon Hansen 2020-03-31 00:11:41 UTC
Bumping maxUnavailable to 1 would increase the likelihood that we would drop traffic during a rolling update. Only 1 of 2 pods from the router Deployment is running. The router pod that fails to run shows the following status:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-16T10:58:05Z"
    message: '0/6 nodes are available: 1 Insufficient cpu, 2 node(s) didn''t match
      pod affinity/anti-affinity, 2 node(s) didn''t satisfy existing pods anti-affinity
      rules, 3 node(s) didn''t match node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

A router can only be scheduled to worker nodes. If one of the worker nodes does not have enough CPU, the router Deployment will not achieve the desired number of replicas and the ingress operator will report Available=False.

Comment 11 Miciah Dashiel Butler Masters 2020-03-31 01:20:17 UTC
This looks like a duplicate of bug 1817769:  The pod has a node selector for node-role.kubernetes.io/worker, the pod's anti-affinity rule is keyed on kubernetes.io/hostname, and all worker nodes have the label kubernetes.io/hostname=localhost (2 of the 3 masters also have the label kubernetes.io/hostname=localhost, which is wrong but does not affect the router pods).  The workaround is to fix the kubernetes.io/hostname labels as described in bug 1817769, comment 1.  The solution would be to make sure nodes are correctly labeled in the first place.
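The effect of the duplicate labels can be sketched as follows. This is an illustrative sketch (the helper name and node records are made up, not real Kubernetes API objects): anti-affinity keyed on kubernetes.io/hostname allows at most one router pod per distinct label value, not per node, so three workers that all report hostname=localhost can host only one router pod:

```python
# Hypothetical helper: how many router replicas a hostname-keyed
# anti-affinity rule can spread across the worker nodes.
def schedulable_replicas(nodes, topology_key="kubernetes.io/hostname"):
    workers = [n for n in nodes if n["role"] == "worker"]
    # Anti-affinity on the topology key permits at most one pod per
    # distinct label value, so duplicate values collapse to one slot.
    return len({w["labels"][topology_key] for w in workers})

mislabeled = [
    {"role": "worker", "labels": {"kubernetes.io/hostname": "localhost"}},
    {"role": "worker", "labels": {"kubernetes.io/hostname": "localhost"}},
    {"role": "worker", "labels": {"kubernetes.io/hostname": "localhost"}},
]
print(schedulable_replicas(mislabeled))  # 1 -> the second router pod stays Pending
```

With correct per-node hostname labels, the same three workers would yield three distinct topology values and the surge pod could schedule.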

Comment 12 Daneyon Hansen 2020-03-31 01:34:32 UTC

*** This bug has been marked as a duplicate of bug 1817769 ***

Comment 13 W. Trevor King 2020-03-31 01:50:09 UTC
Filling in comment 11 based on the original must-gather here:

$ grep -r kubernetes.io/hostname cluster-scoped-resources/core/nodes/
cluster-scoped-resources/core/nodes/ip-10-0-153-246.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost
cluster-scoped-resources/core/nodes/ip-10-0-169-151.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost
cluster-scoped-resources/core/nodes/ip-10-0-133-10.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: ip-10-0-133-10
cluster-scoped-resources/core/nodes/ip-10-0-162-171.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost
cluster-scoped-resources/core/nodes/ip-10-0-155-181.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost
cluster-scoped-resources/core/nodes/ip-10-0-133-149.us-east-2.compute.internal.yaml:    kubernetes.io/hostname: localhost

Comment 14 W. Trevor King 2021-04-05 17:46:40 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475