Bug 1826150
Summary: | Timed out waiting for the condition during syncRequiredMachineConfigPools: "rendered-master-1130dc81212ead470ba89260f56aabb8" not found | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Yurii Prokulevych <yprokule> |
Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.5 | CC: | amurdaca, knarra, lmohanty, palonsor, pdhamdhe, sasha, smilner, stbenjam, vvoronko, wking |
Target Milestone: | --- | Keywords: | Upgrades |
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Last Closed: | 2020-10-27 15:58:26 UTC | Type: | Bug |
Description
Yurii Prokulevych
2020-04-21 05:50:49 UTC
[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-04-18-184707   True        False         False      160m
cloud-credential                           4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h20m
cluster-autoscaler                         4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h1m
config-operator                            4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h2m
console                                    4.5.0-0.nightly-2020-04-18-184707   True        False         False      163m
csi-snapshot-controller                    4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h3m
dns                                        4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h8m
etcd                                       4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h8m
image-registry                             4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h4m
ingress                                    4.5.0-0.nightly-2020-04-18-184707   True        False         False      167m
insights                                   4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h4m
kube-apiserver                             4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h6m
kube-controller-manager                    4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h8m
kube-scheduler                             4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h8m
kube-storage-version-migrator              4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h9m
machine-api                                4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h1m
machine-config                                                                 False       True          True       172m
marketplace                                4.5.0-0.nightly-2020-04-18-184707   True        False         False      167m
monitoring                                 4.5.0-0.nightly-2020-04-18-184707   True        False         False      165m
network                                    4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h10m
node-tuning                                4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h9m
openshift-apiserver                        4.5.0-0.nightly-2020-04-18-184707   True        False         False      169m
openshift-controller-manager               4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h4m
openshift-samples                          4.5.0-0.nightly-2020-04-18-184707   True        True          True       166m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h9m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h9m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-04-18-184707   True        False         False      168m
service-ca                                 4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h9m
storage                                    4.5.0-0.nightly-2020-04-18-184707   True        False         False      3h3m

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS   ROLES    AGE     VERSION
master-0-0   Ready    master   3h11m   v1.18.0-rc.1
master-0-1   Ready    master   3h12m   v1.18.0-rc.1
master-0-2   Ready    master   3h11m   v1.18.0-rc.1
worker-0-0   Ready    worker   168m    v1.18.0-rc.1
worker-0-1   Ready    worker   168m    v1.18.0-rc.1

[kni@provisionhost-0-0 ~]$ oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-04-21T02:38:36Z
  Generation:          1
  Resource Version:    83029
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 6a4e3278-14dc-4648-8c47-fc3f4937e687
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-04-21T02:49:23Z
    Message:               Working towards 4.5.0-0.nightly-2020-04-18-184707
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-04-21T03:06:25Z
    Message:               Unable to apply 4.5.0-0.nightly-2020-04-18-184707: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node master-0-1 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1130dc81212ead470ba89260f56aabb8\\\" not found\", Node master-0-2 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1130dc81212ead470ba89260f56aabb8\\\" not found\", Node master-0-0 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1130dc81212ead470ba89260f56aabb8\\\" not found\"", retrying
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-04-21T03:06:25Z
    Message:               Cluster not available for 4.5.0-0.nightly-2020-04-18-184707
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-04-21T03:06:25Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
Events:  <none>

This happens when something provided at install isn't available in cluster - or there are differences between the two.

Antonio, is there some troubleshooting guide for this?

I believe the way to debug this is take a diff of /etc/mcs-machine-config-content.json and /etc/machine-config-daemon/currentconfig and see what's different, based on the functionality in https://github.com/openshift/machine-config-operator/pull/1376.

@Antonio - I think bugs where the rendered MCO manifests in bootstrap and the cluster differ should stay in the MCO component until it's determined what caused it. I'm not sure it makes sense to put it on the platform installer subcomponent until you know it's something specific to the platform. The last time I saw this happen it was caused by CNO and the installer writing conflicting Proxy resources.

Same issue with 4.4.0-0.nightly-2020-04-22-215658
...
time="2020-04-23T04:00:38Z" level=info msg="Cluster operator insights Disabled is True with Disabled: Health reporting is disabled"
time="2020-04-23T04:00:38Z" level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.4.0-0.nightly-2020-04-22-215658"
time="2020-04-23T04:00:38Z" level=info msg="Cluster operator machine-config Progressing is True with : Cluster is bootstrapping 4.4.0-0.nightly-2020-04-22-215658"
time="2020-04-23T04:00:38Z" level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Failed to resync 4.4.0-0.nightly-2020-04-22-215658 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node master-0-1.ocp-edge07-0.qe.lab.redhat.com is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-201420f461bd94fff617db1277f1d6ad\\\\\\\" not found\\\", Node master-0-0.ocp-edge07-0.qe.lab.redhat.com is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-201420f461bd94fff617db1277f1d6ad\\\\\\\" not found\\\", Node master-0-2.ocp-edge07-0.qe.lab.redhat.com is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-201420f461bd94fff617db1277f1d6ad\\\\\\\" not found\\\"\", retrying"
time="2020-04-23T04:00:38Z" level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.4.0-0.nightly-2020-04-22-215658 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node master-0-1.ocp-edge07-0.qe.lab.redhat.com is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-201420f461bd94fff617db1277f1d6ad\\\\\\\" not found\\\", Node master-0-0.ocp-edge07-0.qe.lab.redhat.com is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-201420f461bd94fff617db1277f1d6ad\\\\\\\" not found\\\", Node master-0-2.ocp-edge07-0.qe.lab.redhat.com is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-201420f461bd94fff617db1277f1d6ad\\\\\\\" not found\\\"\", retrying"

(In reply to Stephen Benjamin from comment #5)
>
> I believe the way to debug this is take a diff of
> /etc/mcs-machine-config-content.json and
> /etc/machine-config-daemon/currentconfig and see what's different, based on
> the functionality in
> https://github.com/openshift/machine-config-operator/pull/1376.

Can you compute this diff and attach it here? That could tell us what went wrong at installation.

>
> @Antonio - I think bugs where the rendered MCO manifests in bootstrap and
> the cluster differ should stay in the MCO component until it's determined
> what caused it. I'm not sure it makes sense to put it on the platform
> installer subcomponent until you know it's something specific to the
> platform. The last time I saw this happen it was caused by CNO and the
> installer writing conflicting Proxy resources.

Why is needinfo on me?
I believe Antonio is asking for the reporter to generate the diff, see comment #5 for instructions.

Another failure on 4.4:

time="2020-05-15T16:14:26Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-05-15-095335: 98% complete"
time="2020-05-15T16:16:11Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed to resync 4.4.0-0.nightly-2020-05-15-095335 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node master-0-1 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-466ec4217b8bfe415418fc11c84cc8aa\\\\\\\" not found\\\", Node master-0-2 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-466ec4217b8bfe415418fc11c84cc8aa\\\\\\\" not found\\\", Node master-0-0 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-466ec4217b8bfe415418fc11c84cc8aa\\\\\\\" not found\\\"\", retrying"
time="2020-05-15T16:18:26Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed to resync 4.4.0-0.nightly-2020-05-15-095335 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node master-0-1 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-466ec4217b8bfe415418fc11c84cc8aa\\\\\\\" not found\\\", Node master-0-2 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-466ec4217b8bfe415418fc11c84cc8aa\\\\\\\" not found\\\", Node master-0-0 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-466ec4217b8bfe415418fc11c84cc8aa\\\\\\\" not found\\\"\", retrying"
time="2020-05-15T16:19:41Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-05-15-095335: 99% complete"

I've assessed what's happening here: https://github.com/openshift/machine-config-operator/pull/1756#issue-423782041

The issue isn't a regression and the fix might require some more eyes and testing as well as some planned fixes I have in mind. There's a workaround for the specific issue: any custom MachineConfig has to be named with a "98-" prefix as opposed to "99-". The workaround seems to be acceptable as the bug is manifesting only at installation. Thus, I'm pushing the target of this BZ to be 4.6, but I'm actively working on it and we should be able to fully fix this (and get back to 99- prefixes) shortly.

Wow, nice find - thanks.
> The issue isn't a regression and the fix might require some more eyes and testing as well as some planned fixes I have in mind.
Is it not a regression? We've used the 99_ prefixes without issue since 4.2.
> Is it not a regression? We've used the 99_ prefixes without issue since 4.2.
Never mind, I understand. I guess it depends on luck and the registry UUID. Some percentage of installs with custom 99_ manifests probably fail with this, and people just retry and get a UUID whose first character is in the range 0-e.
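
The dependence on the UUID's first character can be illustrated with a small sketch. It assumes that MachineConfigs are merged in plain lexicographic order of their names and that the pre-fix registries MC carried the UUID in its name. Both manifest names below are hypothetical: `99-master-<uuid>-registries` stands in for the old generated name and `99-master-example` for a custom manifest; neither is taken from this bug.

```python
def merge_order(names):
    """Return the names in plain lexicographic order (the assumed merge order)."""
    return sorted(names)


def custom_sorts_last(uuid_first_char, custom="99-master-example"):
    """Report whether the custom manifest sorts after the generated one.

    The generated name is a made-up stand-in that embeds a UUID starting
    with the given hex character.
    """
    generated = f"99-master-{uuid_first_char}0000000-registries"
    return merge_order([custom, generated])[-1] == custom


if __name__ == "__main__":
    # Walk through every possible leading hex character of the UUID.
    for first_char in "0123456789abcdef":
        print(first_char, custom_sorts_last(first_char))

    # A fixed generated name contains no UUID, so its position relative to
    # the custom manifest is the same on every install.
    print(merge_order(["99-master-example", "99-master-generated-registries"]))
```

With these made-up names the relative order flips only when the UUID starts with `f`, which shows how the outcome can vary from one install to the next. A "98-" prefix on the custom manifest, or a generated name without a UUID, removes that variation.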
Verified on 4.6.0-0.nightly-2020-06-24-071932

On the bootstrap node, the registries MCs are now named 99-master-generated-registries and 99-worker-generated-registries.

On Bootstrap
------------------
[root@ip-10-0-24-228 ~]# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          0f33cc64182f6cb69609064e235235214411b648   2.2.0             22s
00-worker                                          0f33cc64182f6cb69609064e235235214411b648   2.2.0             22s
01-master-container-runtime                        0f33cc64182f6cb69609064e235235214411b648   2.2.0             22s
01-master-kubelet                                  0f33cc64182f6cb69609064e235235214411b648   2.2.0             22s
01-worker-container-runtime                        0f33cc64182f6cb69609064e235235214411b648   2.2.0             22s
01-worker-kubelet                                  0f33cc64182f6cb69609064e235235214411b648   2.2.0             22s
99-master-generated-registries                     0f33cc64182f6cb69609064e235235214411b648   2.2.0             21s
99-master-ssh                                                                                 2.2.0             9m32s
99-worker-generated-registries                     0f33cc64182f6cb69609064e235235214411b648   2.2.0             22s
99-worker-ssh                                                                                 2.2.0             9m32s
rendered-master-173a2d656092e81fa8e69f133396b98c   0f33cc64182f6cb69609064e235235214411b648   2.2.0             18s
rendered-worker-239e761ff77184768e5b8c345726e169   0f33cc64182f6cb69609064e235235214411b648   2.2.0             18s

[root@ip-10-0-24-228 ~]# oc get mc/99-master-generated-registries -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  annotations:
    machineconfiguration.openshift.io/generated-by-controller-version: 0f33cc64182f6cb69609064e235235214411b648
  creationTimestamp: "2020-06-24T22:23:14Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: master
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:machineconfiguration.openshift.io/generated-by-controller-version: {}
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/role: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"7b691ae8-6ccb-41c5-8e59-3e66a537d643"}:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        .: {}
        f:config:
          .: {}
          f:ignition:
            .: {}
            f:version: {}
        f:fips: {}
        f:kernelArguments: {}
        f:kernelType: {}
        f:osImageURL: {}
    manager: machine-config-controller
    operation: Update
    time: "2020-06-24T22:23:14Z"
  name: 99-master-generated-registries
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: Image
    name: cluster
    uid: 7b691ae8-6ccb-41c5-8e59-3e66a537d643
  resourceVersion: "5843"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigs/99-master-generated-registries
  uid: 51d5da45-27d7-45ab-84ef-ef17fa514c7c
spec:
  config:
    ignition:
      version: 2.2.0
  fips: false
  kernelArguments: null
  kernelType: ""
  osImageURL: ""

[root@ip-10-0-24-228 ~]# oc get mc/99-worker-generated-registries -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  annotations:
    machineconfiguration.openshift.io/generated-by-controller-version: 0f33cc64182f6cb69609064e235235214411b648
  creationTimestamp: "2020-06-24T22:23:13Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:machineconfiguration.openshift.io/generated-by-controller-version: {}
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/role: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"7b691ae8-6ccb-41c5-8e59-3e66a537d643"}:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        .: {}
        f:config:
          .: {}
          f:ignition:
            .: {}
            f:version: {}
        f:fips: {}
        f:kernelArguments: {}
        f:kernelType: {}
        f:osImageURL: {}
    manager: machine-config-controller
    operation: Update
    time: "2020-06-24T22:23:13Z"
  name: 99-worker-generated-registries
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: Image
    name: cluster
    uid: 7b691ae8-6ccb-41c5-8e59-3e66a537d643
  resourceVersion: "5829"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigs/99-worker-generated-registries
  uid: a78f204f-9fb3-4b1a-b8f2-536a0ef15152
spec:
  config:
    ignition:
      version: 2.2.0
  fips: false
  kernelArguments: null
  kernelType: ""
  osImageURL: ""

On Actual Cluster
------------------------
$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
00-worker                                          0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
01-master-container-runtime                        0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
01-master-kubelet                                  0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
01-worker-container-runtime                        0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
01-worker-kubelet                                  0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
99-master-generated-registries                     0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
99-master-ssh                                                                                 2.2.0             98m
99-worker-generated-registries                     0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
99-worker-ssh                                                                                 2.2.0             98m
rendered-master-173a2d656092e81fa8e69f133396b98c   0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m
rendered-worker-239e761ff77184768e5b8c345726e169   0f33cc64182f6cb69609064e235235214411b648   2.2.0             89m

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-24-071932   True        False         71m     Cluster version is 4.6.0-0.nightly-2020-06-24-071932

*** Bug 1851965 has been marked as a duplicate of this bug. ***

*** Bug 1859161 has been marked as a duplicate of this bug. ***

Hi Antonio,

One quick question: do we have any plans for backporting the fix to 4.5.z? I'm asking because I tried an upgrade today from 4.5.3 -> 4.5.0-0.nightly-2020-08-06-062632 on matrix "ipi-on-osp/versioned-installer-https_proxy-etcd_encryption-ci" and I hit the issue, but I could not find any 4.5.z backport bug.
[ramakasturinarra@dhcp35-60 ~]$ oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-08-06T07:22:19Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
      f:spec:
      f:status:
        .:
        f:relatedObjects:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-08-06T07:22:19Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
        f:versions:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2020-08-06T14:59:59Z
  Resource Version:  188936
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:               66ba90fe-50ce-468b-a237-71ac656b38bf
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-08-06T10:29:07Z
    Message:               Working towards 4.5.0-0.nightly-2020-08-06-062632
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-08-06T10:47:39Z
    Message:               Unable to apply 4.5.0-0.nightly-2020-08-06-062632: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-baac985457ac10f2f91931cf3fe32ba9 expected 2d538f4ce27c9e5b51c70c9291d6aef768443f86 has 4173030d89fbf4a7a0976d1665491a4d9a6e54f1, retrying
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-08-06T10:47:39Z
    Message:               Cluster not available for 4.5.0-0.nightly-2020-08-06-062632
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-08-06T07:27:55Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.5.3
Events:  <none>

Thanks
kasturi

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days