Bug 1829116 - Can't upgrade from 4.3.0 to 4.3.13 with error "pool master has not progressed to latest configuration: controller version mismatch"
Summary: Can't upgrade from 4.3.0 to 4.3.13 with error "pool master has not progressed...
Keywords:
Status: CLOSED DUPLICATE of bug 1840881
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Yu Qi Zhang
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-28 22:23 UTC by Hugo Cisneiros (Eitch)
Modified: 2023-10-06 19:48 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-27 19:15:52 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/machine-config-operator pull 1754 (open): WIP: Bug 1840881: KubeletConfigController: rework to sync overall state for pool (last updated 2021-02-09 00:26:02 UTC)

Description Hugo Cisneiros (Eitch) 2020-04-28 22:23:43 UTC
Description of problem:

The customer couldn't upgrade from 4.3.0 to 4.3.13. The upgrade gets stuck in the machine-config-operator with the error:

    message: 'Unable to apply 4.3.13: timed out waiting for the condition during syncRequiredMachineConfigPools:
      pool master has not progressed to latest configuration: controller version mismatch
      for 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet expected f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54
      has 25bb6aeb58135c38a667e849edf5244871be4992, retrying'

Version-Release number of selected component (if applicable):

OCP 4.3.0 -> 4.3.13

How reproducible:

Install 4.3.0, then upgrade to 4.3.13

Upgrading to 4.3.12 didn't work because of a prior bug: https://access.redhat.com/solutions/4972291

Actual results:

Upgrade gets stuck in machine-config-operator.

Expected results:

Upgrade finishes.

Additional info:

* The machine-config-daemon logs show:

  2reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host

  curl inside the pod worked fine (Update #24 and #25)

* Tried deleting all pods in openshift-sdn to ensure services and cluster IPs were working fine (they were).

* Tried creating the file /run/machine-config-daemon-force on the nodes and restarting the machine-config-daemons. Didn't work. (Update #26 and #27)

* Tried https://access.redhat.com/solutions/4967301 and it didn't work. Pods were ready too. (Update #29 and #30)
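
* For anyone checking the state, the stuck condition is visible with the standard status commands (just a sketch; exact output varies per cluster):

    oc get clusterversion                    # MESSAGE column shows the 'Unable to apply 4.3.13 ...' condition
    oc get clusteroperator machine-config    # the operator reporting that condition
    oc get machineconfigpool                 # the master pool stays on the old rendered config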

Comment 2 Yu Qi Zhang 2020-05-01 21:00:20 UTC
Hi Hugo,

I looked through the latest must-gather as noted. It seems that the main error is this:

2020-04-27T17:57:38.20799777Z I0427 17:57:38.207925       1 container_runtime_config_controller.go:368] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update policy json with new changes: invalid images config: only one of AllowedRegistries or BlockedRegistries may be specified
2020-04-27T17:59:00.151924198Z E0427 17:59:00.151868       1 container_runtime_config_controller.go:373] could not Create/Update MachineConfig: could not update policy json with new changes: invalid images config: only one of AllowedRegistries or BlockedRegistries may be specified
2020-04-27T17:59:00.151988172Z I0427 17:59:00.151906       1 container_runtime_config_controller.go:374] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not update policy json with new changes: invalid images config: only one of AllowedRegistries or BlockedRegistries may be specified

You can see this in the machine-config-controller logs. The error is repeating, and given the timestamps it seems to be the main error blocking progress.

Now what's interesting is the error we see:

- lastTransitionTime: "2020-04-27T17:26:46Z"
      message: 'Unable to apply 4.3.13: timed out waiting for the condition during
        syncRequiredMachineConfigPools: pool master has not progressed to latest configuration:
        controller version mismatch for 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet
        expected f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54 has 25bb6aeb58135c38a667e849edf5244871be4992,
        retrying'

That error is timestamped at 17:26. But if we look at the machineconfigpools, we actually see: 

- lastTransitionTime: "2020-04-27T17:43:48Z"
    message: All nodes are updated with rendered-master-4856d2624f633186b478ff1e06257964
    reason: ""
    status: "True"
    type: Updated

So at 17:43 it seems that the machines themselves have been properly updated (I think the version from that rendered-config is from 04-27, which should be 4.3.13 as the customer wanted).

Let's try fixing the allowedRegistries config first and see where that gets us.
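
For reference, the "controller version mismatch" in that status comes from the generated-by-controller-version annotation on the MachineConfig, so a quick way to see which controller rendered a given config is something like this (just a sketch, using the config named in the error above):

    oc get machineconfig 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet -o yaml \
      | grep generated-by-controller-version

If that hash is still the old controller's version, that config simply hasn't been re-rendered yet.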

Comment 5 Yu Qi Zhang 2020-05-05 15:01:37 UTC
So the other thing to note is that of the base configs in the cluster:
00-master.yaml
00-worker.yaml
01-master-container-runtime.yaml
01-master-kubelet.yaml
01-worker-container-runtime.yaml
01-worker-kubelet.yaml
99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet.yaml
99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-registries.yaml
99-master-ssh.yaml
99-worker-bdf115da-b2d0-49dd-9800-d61558d5b384-kubelet.yaml
99-worker-bdf115da-b2d0-49dd-9800-d61558d5b384-registries.yaml
99-worker-ssh.yaml

3 of them are rendered by the old controller:
99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet.yaml
99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-registries.yaml
99-worker-bdf115da-b2d0-49dd-9800-d61558d5b384-registries.yaml

I think the re-renders didn't happen because of the above error. One thing you can check is whether that error has gone away, and whether there are any other errors in the MCC (something like: oc logs -f -n openshift-machine-config-operator machine-config-controller-59947d965-cncvk)

If there aren't, you can delete the machine-config-controller pod to restart it. It should then try to re-render all the configs.
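
If it's easier, the label-selector form avoids having to look up the exact pod name (a sketch, assuming the stock k8s-app labels on the MCO pods):

    oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-controller --tail=100
    oc -n openshift-machine-config-operator delete pod -l k8s-app=machine-config-controller

The deployment recreates the pod, and on startup it should re-sync and re-render the configs.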

If you have an updated must-gather that would be good as well.

Comment 6 Hugo Cisneiros (Eitch) 2020-05-05 21:15:29 UTC
Hi Yu, thanks for looking at this. The updated must-gather is uploaded on the case.

Comment 7 Yu Qi Zhang 2020-05-06 17:47:10 UTC
Hi,

I can see that in the config, there is still:

  registrySources:
    allowedRegistries:
    - quay.io
    blockedRegistries:
    - docker.io

Which can be seen at:
cluster-scoped-resources/config.openshift.io/images.yaml
cluster-scoped-resources/config.openshift.io/images/cluster.yaml

So I can still see the looping error reported before by the machine-config-controller pod.

See the docs on this: https://docs.openshift.com/container-platform/4.3/openshift_images/image-configuration.html
which state: "Only one of blockedRegistries or allowedRegistries may be set."

It's a whitelist/blacklist, so just having
    allowedRegistries:
    - quay.io

already blocks all other registries and should be good if you only want quay images.

In fact, the MCO is not the only component reporting this. The MCO just happens to bubble the error up inside another error.

Removing the blockedRegistries should allow the MCO (and thus the upgrade) to progress.
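
Concretely, the cluster image config should end up looking something like this (just a sketch of the intended end state; it can be edited with oc edit image.config.openshift.io/cluster):

    spec:
      registrySources:
        allowedRegistries:
        - quay.io
        # blockedRegistries removed entirely; only one of the two may be set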

@Hugo if you want to take a look, the .gz file for the most recent must-gather is actually a tarball, so you'd need to rename the .gz to .tar and extract.

Also, would you like me to comment on the case directly, so as to communicate faster?

Comment 8 Hugo Cisneiros (Eitch) 2020-05-06 19:01:54 UTC
Hi @Yu,

Thanks for the follow-up, and sorry that I didn't see the change after the updated must-gather - my bad for overlooking it. I updated the case regarding this, pointing to the specific lines that should be removed.

> Also, would you like me to comment on the case directly, so as to communicate faster?

It's up to you; if you could, we would really appreciate it :) But I'll always answer if needed.

I'll update this bug as soon as I get another response from the customer.

Thank you

Comment 9 Pablo Alonso Rodriguez 2020-05-09 12:05:29 UTC
Hello,

Restarting the machine config controller had no effect. I see that the only machineconfig at a bad revision is owned by a kubeletconfig. Maybe the kubeletconfig was expected to render that machineconfig in a previous version but not in this one, so that machineconfig ended up being stale.

I am going to get more data.

Regards

Comment 10 Yu Qi Zhang 2020-05-11 14:00:46 UTC
Ok, I took a look at the new must-gather. This is very odd, as I can see that the machine-config-controller progressed properly now that the customer has fixed the original bug. The worker config, which is exactly the same as the master config, applied properly, but the master config did not get re-rendered. I'm not sure why it didn't, since the machine-config-controller is no longer reporting any errors.

Basically, 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet did not see an update.

I will attempt to reproduce this. My guesses so far are:
1. somehow the master kubelet config is considered to be erroneous somewhere else
2. some rare race condition that occurred because of the deletion of the previous pod (very unlikely)

Note that since the customer is using the exact same config for both master and worker:
worker:
      metadata:
        name: master-burst-qps
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-type: worker
        kubeletConfig:
          kubeAPIBurst: 7000
          kubeAPIQPS: 4000

master:
      metadata:
        name: master-burst-qps
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-type: master
        kubeletConfig:
          kubeAPIBurst: 7000
          kubeAPIQPS: 4000

There should really be no difference.

One suggested method of unblocking this is to delete the bad config from masters, and then re-apply it.

Comment 11 Yu Qi Zhang 2020-05-11 18:42:46 UTC
Oh I think this is what happened:

The customer created two custom resources with the same name:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: master-burst-qps
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-type: worker
        kubeletConfig:
          kubeAPIBurst: 7000
          kubeAPIQPS: 4000

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: master-burst-qps
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-type: master
        kubeletConfig:
          kubeAPIBurst: 7000
          kubeAPIQPS: 4000


See how they are both named `master-burst-qps` but have different labels to match master and worker. What I think happened is that the customer first created the master config, then applied the worker config on top of it, thus overwriting the master config. This means the master config no longer exists, but in the MCO I think we don't "delete" the machineconfig it had generated, so that stale machineconfig shows up as an error. The issue doesn't manifest until we upgrade and cannot create the new master config.

This is an MCO bug. I will try to verify that this is what happened. In the meantime, the workaround is to delete the bad kubeletconfig and create new ones correctly named "master-burst-qps" and "worker-burst-qps". We may need to delete the bad machineconfig as well.
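
For the record, the cleanup would look roughly like this (a sketch only; the MachineConfig name is the one from this case, and whether it also has to be deleted by hand is exactly what I still need to verify):

    oc delete kubeletconfig master-burst-qps
    # possibly also the stale generated config, if it is not cleaned up automatically:
    oc delete machineconfig 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet

    oc apply -f - <<'EOF'
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: master-burst-qps
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-type: master
      kubeletConfig:
        kubeAPIBurst: 7000
        kubeAPIQPS: 4000
    ---
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: worker-burst-qps
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-type: worker
      kubeletConfig:
        kubeAPIBurst: 7000
        kubeAPIQPS: 4000
    EOF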

Comment 12 Yu Qi Zhang 2020-05-11 18:43:43 UTC
To be clear, this is also the wrong way to create custom resources, so this is both a user error and the MCO failing to detect it. Kubernetes custom resource names must be unique.

Comment 13 Yu Qi Zhang 2020-05-11 20:02:15 UTC
Ok I've verified this to be the case, see the comment in: https://access.redhat.com/support/cases/#/case/02637878?commentId=a0a2K00000UhNUlQAN

This bug should actually be titled: overwritten KubeletConfigs do not delete the corresponding MachineConfig. As far as I can tell this has always been the behavior. Bumping the severity a bit since this can break upgrades, as shown in this customer case.

Comment 14 Hugo Cisneiros (Eitch) 2020-05-13 18:14:47 UTC
Hey Yu, thanks a lot for your effort in identifying this! The problem is resolved, but let me know if you need any more information to help with this bug. I'll create a solution in the RH Customer Portal describing this problem with duplicated custom resources.

Comment 17 Yu Qi Zhang 2020-05-27 19:15:52 UTC
Closing as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1840881 so we can focus on the last issue (of kubeletconfig syncing). The other issues were resolved and workarounds are available.

*** This bug has been marked as a duplicate of bug 1840881 ***

