Description of problem:

The customer couldn't upgrade from 4.3.0 to 4.3.13. The upgrade gets stuck in the machine-config-operator with this error:

  message: 'Unable to apply 4.3.13: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet expected f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54 has 25bb6aeb58135c38a667e849edf5244871be4992, retrying'

Version-Release number of selected component (if applicable):

OCP 4.3.0 -> 4.3.13

How reproducible:

Install 4.3.0, then upgrade to 4.3.13. Upgrading to 4.3.12 didn't work because of a prior bug: https://access.redhat.com/solutions/4972291

Actual results:

The upgrade gets stuck in the machine-config-operator.

Expected results:

The upgrade finishes.

Additional info:

* The machine-config-daemon logs show:

  reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host

  curl inside the pod worked fine (Update #24 and #25).
* Tried deleting all pods in openshift-sdn to make sure services and cluster IPs were working fine (they were).
* Tried creating the file /run/machine-config-daemon-force on the nodes and restarting the machine-config-daemons. Didn't work. (Update #26 and #27)
* Tried https://access.redhat.com/solutions/4967301 and it didn't work. Pods were ready too. (Update #29 and #30)
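For reference, a rough sketch of the connectivity check and the forced-update workaround described above (pod and node names are placeholders; the service IP is the one from the error message):

# check API service reachability from inside a machine-config-daemon pod
oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon
oc -n openshift-machine-config-operator rsh -c machine-config-daemon <machine-config-daemon-pod> \
    curl -k 'https://172.30.0.1:443/api/v1/nodes?limit=500'

# force the MCD on a node to reapply the current machineconfig
oc debug node/<node-name> -- chroot /host touch /run/machine-config-daemon-force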
Hi Hugo, I looked through the latest must-gather as noted. It seems that the main error is this:

2020-04-27T17:57:38.20799777Z I0427 17:57:38.207925 1 container_runtime_config_controller.go:368] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update policy json with new changes: invalid images config: only one of AllowedRegistries or BlockedRegistries may be specified
2020-04-27T17:59:00.151924198Z E0427 17:59:00.151868 1 container_runtime_config_controller.go:373] could not Create/Update MachineConfig: could not update policy json with new changes: invalid images config: only one of AllowedRegistries or BlockedRegistries may be specified
2020-04-27T17:59:00.151988172Z I0427 17:59:00.151906 1 container_runtime_config_controller.go:374] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not update policy json with new changes: invalid images config: only one of AllowedRegistries or BlockedRegistries may be specified

which you can see in the machine-config-controller logs. That error keeps repeating, and given the timestamps it seems to be the main error blocking progress.

Now, what's interesting is that the error we see:

- lastTransitionTime: "2020-04-27T17:26:46Z"
  message: 'Unable to apply 4.3.13: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet expected f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54 has 25bb6aeb58135c38a667e849edf5244871be4992, retrying'

is timestamped at 17:26. But if we look at the machineconfigpools, we actually see:

- lastTransitionTime: "2020-04-27T17:43:48Z"
  message: All nodes are updated with rendered-master-4856d2624f633186b478ff1e06257964
  reason: ""
  status: "True"
  type: Updated

So at 17:43 it seems the machines themselves have been updated properly (I think the version from that rendered config is from 04-27, which should be the 4.3.13 config the customer wanted). Let's try fixing the allowedRegistries config first and see where that gets us.
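In case it helps, a quick sketch of how to pull the same status live instead of from a must-gather (the resource and deployment names are the defaults; the pool conditions are under .status.conditions):

# machineconfigpool conditions (Updated/Updating/Degraded with their timestamps)
oc get machineconfigpool master -o yaml

# the upgrade-blocking message reported by the operator
oc get clusteroperator machine-config -o yaml

# the repeating error in the controller
oc -n openshift-machine-config-operator logs deployment/machine-config-controller | grep -i 'invalid images config'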
So the other thing to note is that of the base configs in the cluster:

00-master.yaml
00-worker.yaml
01-master-container-runtime.yaml
01-master-kubelet.yaml
01-worker-container-runtime.yaml
01-worker-kubelet.yaml
99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet.yaml
99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-registries.yaml
99-master-ssh.yaml
99-worker-bdf115da-b2d0-49dd-9800-d61558d5b384-kubelet.yaml
99-worker-bdf115da-b2d0-49dd-9800-d61558d5b384-registries.yaml
99-worker-ssh.yaml

3 of them are still rendered by the old controller:

99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet.yaml
99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-registries.yaml
99-worker-bdf115da-b2d0-49dd-9800-d61558d5b384-registries.yaml

I think the re-renders didn't happen because of the above error. One thing you can check is whether that error went away, and whether there are any other errors in the MCC (something like: oc logs -f -n openshift-machine-config-operator machine-config-controller-59947d965-cncvk). If there aren't, you can kill the machine-config-controller pod to restart it; it should then try to re-render all the configs (see the sketch below). An updated must-gather would be good as well.
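A minimal sketch of that check-and-restart, assuming default component names (the controller pod name will differ per cluster):

# find the controller pod and look for remaining errors
oc -n openshift-machine-config-operator get pods | grep machine-config-controller
oc -n openshift-machine-config-operator logs <machine-config-controller-pod> | grep -iE 'error|fail'

# if the logs look clean, delete the pod; the deployment recreates it
# and the new controller should re-render the 99-* configs
oc -n openshift-machine-config-operator delete pod <machine-config-controller-pod>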
Hi Yu, thanks for looking at this. The updated must-gather has been uploaded to the case.
Hi, I can see that in the config there is still:

registrySources:
  allowedRegistries:
  - quay.io
  blockedRegistries:
  - docker.io

which can be seen in:

cluster-scoped-resources/config.openshift.io/images.yaml
cluster-scoped-resources/config.openshift.io/images/cluster.yaml

So I can still see the looping error reported before by the machine-config-controller pod.

See the docs on this: https://docs.openshift.com/container-platform/4.3/openshift_images/image-configuration.html, which state:

  Only one of blockedRegistries or allowedRegistries may be set.

It's a whitelist/blacklist, so having only:

allowedRegistries:
- quay.io

already blocks every other registry, and should be enough if you only want quay.io images. In fact, the MCO is not the only component reporting this; the MCO just happens to bubble the error up through another error. Removing blockedRegistries should allow the MCO (and thus the upgrade) to progress.

@Hugo, if you want to take a look, the .gz file for the most recent must-gather is actually a tarball, so you'd need to rename the .gz to .tar and extract it. Also, would you like me to comment on the case directly, so we can communicate faster?
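A minimal sketch of that change (the registry names are the ones from the customer's config; whether an allow-list alone matches their policy is for them to confirm):

# edit the cluster-wide image config and remove the blockedRegistries stanza
oc edit images.config.openshift.io cluster

# the resulting spec should carry only one of the two lists, e.g.:
#   spec:
#     registrySources:
#       allowedRegistries:
#       - quay.io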
Hi @Yu,

Thanks for the follow-up, and sorry I didn't catch that in the updated must-gather - my bad for overlooking it. I updated the case, pointing to the specific lines that should be removed.

> Also, would you like me to comment on the case directly, so as to communicate faster?

It's up to you; if you could, we would really appreciate it :) But I'll always answer if needed. I'll update this bug as soon as I get another response from the customer.

Thank you
Hello,

Restarting the machine-config-controller had no effect. I see that the only machineconfig at a bad revision is owned by a kubeletconfig. Maybe the kubeletconfig was expected to render that machineconfig in a previous version but not in this one, so that machineconfig ended up being stale. I am going to gather more data.

Regards
OK, I took a look at the new must-gather. This is very odd: I can see that the machine-config-controller progressed properly now that the customer has fixed the original problem. The worker config, which is exactly the same as the master config, applied properly, but the master config did not get re-rendered. I'm not sure why it didn't, since the machine-config-controller is no longer reporting any errors. Basically, 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet did not see an update.

I will attempt to reproduce this. My guesses so far are:

1. somehow the master kubelet config is considered to be erroneous somewhere else
2. some rare race condition that occurred because of the deletion of the previous pod (very unlikely)

Note that the customer is using the exact same config for both master and worker:

worker:
  metadata:
    name: master-burst-qps
  spec:
    machineConfigPoolSelector:
      matchLabels:
        custom-type: worker
    kubeletConfig:
      kubeAPIBurst: 7000
      kubeAPIQPS: 4000

master:
  metadata:
    name: master-burst-qps
  spec:
    machineConfigPoolSelector:
      matchLabels:
        custom-type: master
    kubeletConfig:
      kubeAPIBurst: 7000
      kubeAPIQPS: 4000

so there should really be no difference. One way to unblock this would be to delete the bad config from the masters and then re-apply it.
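A quick way to see which machineconfigs were re-rendered and which are stale is to compare the controller-version annotation on each one against the hashes from the error message (the stale 99-master-*-kubelet config should still show the old 25bb6aeb... hash instead of the expected f6d1fe75...); a sketch:

for mc in $(oc get machineconfig -o name); do
  echo -n "$mc: "
  oc get "$mc" -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/generated-by-controller-version}{"\n"}'
done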
Oh, I think this is what happened: the customer created 2 custom resources with the same name:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: master-burst-qps
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-type: worker
  kubeletConfig:
    kubeAPIBurst: 7000
    kubeAPIQPS: 4000

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: master-burst-qps
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-type: master
  kubeletConfig:
    kubeAPIBurst: 7000
    kubeAPIQPS: 4000

See how they are both named `master-burst-qps` but have different labels to match master and worker. What I think happened is that the customer first applied the master config, then updated it with the worker config, thus overwriting the master config. This means the master kubeletconfig no longer exists, but the MCO doesn't "delete" the machineconfig it had generated from it, so that machineconfig goes stale and shows up as an error. The issue doesn't manifest until we upgrade and cannot create the new master config. This is an MCO bug; I will try to verify that this is the case.

If it is, then for now we need to delete the bad kubeletconfig and create new ones correctly named, e.g. "master-burst-qps" and "worker-burst-qps" (see the sketch below). We potentially need to delete the bad machineconfig as well.
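A minimal sketch of that workaround, reusing the values and pool-selector labels from the customer's CRs (the exact resources to delete should be double-checked against the cluster first):

# remove the overwritten kubeletconfig and the stale machineconfig it left behind
oc delete kubeletconfig master-burst-qps
oc delete machineconfig 99-master-17ea7334-7c15-42b3-83f9-b272c1cc48e8-kubelet

# re-create the two configs under distinct names
oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: master-burst-qps
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-type: master
  kubeletConfig:
    kubeAPIBurst: 7000
    kubeAPIQPS: 4000
---
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-burst-qps
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-type: worker
  kubeletConfig:
    kubeAPIBurst: 7000
    kubeAPIQPS: 4000
EOF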
To be clear, this is also the wrong way to create custom resources, so this is both a user error and the MCO failing to detect it. Kubernetes custom resources of the same kind must have unique names.
OK, I've verified this to be the case; see the comment in https://access.redhat.com/support/cases/#/case/02637878?commentId=a0a2K00000UhNUlQAN

This bug should actually be titled: overwritten KubeletConfigs do not delete the corresponding MachineConfig. As far as I can tell this has always been the behavior. Bumping the severity a bit, since this can break upgrades as shown in this customer case.
Hey Yu, thanks a lot for your effort in identifying this! The problem is resolved, but let me know if you need any more information to help with this bug. I'll create a solution in the RH Customer Portal describing this problem with duplicated custom resources.
Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1840881 so we can focus on the last remaining issue (the kubeletconfig syncing). The other issues were resolved and workarounds are available.

*** This bug has been marked as a duplicate of bug 1840881 ***