Description of problem:
The machine-config cluster operator reports Degraded=True with this error:

Failed to resync 4.2.0-0.nightly-2019-08-21-235427 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node control-plane-0 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-2e89b2a4a4231cd3f4071302f9478eec\\\" not found\"", retrying

# oc get machineconfigpool
NAME     CONFIG   UPDATED   UPDATING   DEGRADED
master            False     True       True
worker            False     True       True

# oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   CREATED
00-master                                                   66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
00-worker                                                   66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
01-master-container-runtime                                 66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
01-master-kubelet                                           66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
01-worker-container-runtime                                 66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
01-worker-kubelet                                           66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
99-master-588d6ce1-c4a6-11e9-8cf0-0050568b2275-registries   66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
99-master-ssh                                                                                          2.2.0             4h30m
99-worker-588e2512-c4a6-11e9-8cf0-0050568b2275-registries   66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
99-worker-ssh                                                                                          2.2.0             4h30m
rendered-master-fb8d673690e8a72c4c75cd1fbc0b2beb            66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
rendered-worker-ef7934eab8903cd8cafd32ef72cd69df            66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m

The must-gather log can be found here: http://virt-openshift-05.lab.eng.nay.redhat.com/xiuwang/

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-21-235427

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This likely happened during installation. If so, can you provide the install-config that you're using, please?
First:

2019-08-22T08:19:52.610826696Z W0822 08:19:52.610794 1 render.go:137] Warning: the controller config referenced an unsupported platform: vsphere

Second: we don't support a single-master setup, as that goes against the etcd-quorum-guard and we cannot apply any configuration. Please re-run the test using a supported platform (I guess?) and at least a 3-master setup.
can you grab and share /etc/machine-config-daemon/currentconfig from the masters, I need to identify what's diffing
It seems no such file exists on the masters:

[core@control-plane-2 ~]$ sudo ls /etc/machine-config-daemon/
node-annotations.json
[core@control-plane-2 ~]$ cat /etc/machine-config-daemon/node-annotations.json
{"machineconfiguration.openshift.io/currentConfig":"rendered-master-497c8d34be41019278ce560778ca6e0e","machineconfiguration.openshift.io/desiredConfig":"rendered-master-497c8d34be41019278ce560778ca6e0e","machineconfiguration.openshift.io/state":"Done"}

I will try a setup using the same payload without a proxy to make a comparison.
Which version of the installer is being used? Are you extracting the installer from the payload that you're using to install the cluster?
The questions above have been answered in https://coreos.slack.com/archives/CLJSH16J0/p1566547827097400 .
Maybe this helps? https://github.com/openshift/installer/pull/2257
I see a similar issue doing an install with proxy enabled:

FATAL failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.2.0-0.okd-2019-08-22-211349 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-145-47.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\", Node ip-10-0-128-7.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\", Node ip-10-0-170-126.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\"", retrying

I am going to test a proxy-enabled cluster install since https://github.com/openshift/installer/pull/2257 has merged.
Verified this bug using payload 4.2.0-0.nightly-2019-08-24-002347; a proxy-enabled cluster could be set up successfully.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-24-002347   True        False         4m20s   Cluster version is 4.2.0-0.nightly-2019-08-24-002347

# oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-aa73b2191c3f4149b16b79efc166917c   True      False      False
worker   rendered-worker-c700f3dfcfd0fb2831d9e53dd3818a67   True      False      False
I am hitting this bug even though https://github.com/openshift/installer/pull/2257 has merged.

time="2019-08-26T13:58:03-07:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-26-193559: 99% complete"
time="2019-08-26T14:00:33-07:00" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed to resync 4.2.0-0.okd-2019-08-26-193559 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-138-24.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\", Node ip-10-0-174-132.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\"\", retrying"
Additional info related to comment #18

$ oc get cm/cluster-config-v1 -n kube-system -o yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundle: |
      -----BEGIN CERTIFICATE-----
      <SNIP>
      -----END CERTIFICATE-----
    apiVersion: v1
    baseDomain: devcluster.openshift.com
    compute:
    - hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 3
    controlPlane:
      hyperthreading: Enabled
      name: master
      platform:
        aws:
          rootVolume:
            iops: 0
            size: 120
            type: gp2
          type: m5.xlarge
          zones:
          - us-west-2a
          - us-west-2b
          - us-west-2c
          - us-west-2d
      replicas: 3
    metadata:
      creationTimestamp: null
      name: jcallen-proxy
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineCIDR: 10.0.0.0/16
      networkType: OpenShiftSDN
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      aws:
        region: us-west-2
    proxy:
      httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
      httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
      noProxy: .s3.us-west-2.amazonaws.com,.us-west-2.compute.internal
    pullSecret: ""
    sshKey: |
      ssh-rsa <SNIP>

$ oc get clusteroperator/machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-08-26T19:54:09Z"
  generation: 1
  name: machine-config
  resourceVersion: "36638"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 44f27d51-c83b-11e9-b007-02f9ee0624ec
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-08-26T19:54:09Z"
    message: Cluster not available for 4.2.0-0.okd-2019-08-26-193559
    status: "False"
    type: Available
  - lastTransitionTime: "2019-08-26T19:54:09Z"
    message: Cluster is bootstrapping 4.2.0-0.okd-2019-08-26-193559
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-08-26T20:04:37Z"
    message: 'Failed to resync 4.2.0-0.okd-2019-08-26-193559 because: timed out waiting
      for the condition during syncRequiredMachineConfigPools: pool master has not
      progressed to latest configuration: configuration status for pool master is
      empty: pool is degraded because nodes fail with "3 nodes are reporting degraded
      status on sync": "Node ip-10-0-138-24.us-west-2.compute.internal is reporting:
      \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting:
      \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\", Node ip-10-0-174-132.us-west-2.compute.internal is reporting:
      \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\"", retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-08-26T20:04:37Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    lastSyncError: 'pool master has not progressed to latest configuration: configuration
      status for pool master is empty: pool is degraded because nodes fail with "3
      nodes are reporting degraded status on sync": "Node ip-10-0-138-24.us-west-2.compute.internal
      is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting:
      \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\", Node ip-10-0-174-132.us-west-2.compute.internal is reporting:
      \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\"", retrying'
    worker: all 3 nodes are at latest configuration rendered-worker-5afcba28b5a21c885daeb4599215b1fc
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: cluster
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.2.0-0.okd-2019-08-26-193559
Additional info for comment #18

$ oc get proxy/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2019-08-26T19:52:52Z"
  generation: 1
  name: cluster
  resourceVersion: "1721"
  selfLink: /apis/config.openshift.io/v1/proxies/cluster
  uid: 16ef3dab-c83b-11e9-b007-02f9ee0624ec
spec:
  httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
  httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
  noProxy: .s3.us-west-2.amazonaws.com,.us-west-2.compute.internal
  trustedCA:
    name: user-ca-bundle
status:
  httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
  httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
  noProxy: .cluster.local,.s3.us-west-2.amazonaws.com,.svc,.us-west-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jcallen-proxy.devcluster.openshift.com,api.jcallen-proxy.devcluster.openshift.com,etcd-0.jcallen-proxy.devcluster.openshift.com,etcd-1.jcallen-proxy.devcluster.openshift.com,etcd-2.jcallen-proxy.devcluster.openshift.com,localhost
Note: I was seeing the same bug before adding .s3.us-west-2.amazonaws.com and .us-west-2.compute.internal to noProxy.
(In reply to Daneyon Hansen from comment #21)
> Note: I was seeing the same bug before adding .s3.us-west-2.amazonaws.com
> and .us-west-2.compute.internal to noProxy.

I think the issue could be in the trustedCA. As discussed on Slack, if you can't reproduce without the trustedCA, it's probably a different bug, which I'll look into.
The issue here is again a skew between bootstrap and cluster:

- the cluster config does contain 10.0.0.0/16 in noProxy
- the bootstrap config doesn't (for whatever reason)

This is similar to https://github.com/openshift/installer/pull/2257

When there's a skew like the above in the configs (noProxy here), the bootstrap generates a rendered machineconfig that the cluster won't be able to regenerate exactly once the cluster is running. When that happens, the MCO notices and fails to find the correct machineconfig (at installation time, the bootstrap one is the source of truth for us).

So again, I'm not sure why 10.0.0.0/16 is injected into the cluster's noProxy but doesn't get injected at bootstrap. We should have the bootstrap inject it as well, so the MCO is able to grab the correct rendered machineconfig.

Moving to the installer, I guess, but I wonder if this is maybe a network task. Also, as Trevor pointed out in the PR linked above, maybe we should have a unified way of generating noProxy (and whatever else) to avoid such drifts (maybe?).
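To illustrate the "unified generation" idea: the skew exists because two code paths build the noProxy string independently. A minimal sketch of a single shared function (names here are hypothetical, not the actual installer or cluster-network-operator code) that deduplicates and sorts the entries, so any component computing the list from the same inputs produces a byte-identical string:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildNoProxy merges noProxy entries from any source (defaults,
// install-config, machineCIDR, service/cluster networks), drops
// duplicates and blanks, and sorts the result. Because the output is
// deterministic, a bootstrap process and an in-cluster controller
// calling it with the same inputs render the same string — avoiding
// the rendered-machineconfig mismatch described above.
func buildNoProxy(entries ...string) string {
	seen := map[string]struct{}{}
	var out []string
	for _, e := range entries {
		e = strings.TrimSpace(e)
		if e == "" {
			continue
		}
		if _, ok := seen[e]; !ok {
			seen[e] = struct{}{}
			out = append(out, e)
		}
	}
	sort.Strings(out)
	return strings.Join(out, ",")
}

func main() {
	// machineCIDR (10.0.0.0/16) is passed in alongside the defaults,
	// mirroring the fix discussed in this bug.
	fmt.Println(buildNoProxy(
		"localhost", "127.0.0.1", "10.0.0.0/16",
		".cluster.local", ".svc", "172.30.0.0/16", "10.128.0.0/14",
	))
	// prints: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,localhost
}
```

The sorted, comma-joined form matches the ordering visible in the `status.noProxy` of the Proxy object quoted earlier in this bug.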
https://github.com/openshift/installer/pull/2286 fixes this bug.
> In the proxy-enabled cluster on vsphere and baremetal, it doesn't have "10.0.0.0/16" added into noProxy.

The proxy controller reads machineCIDR from the cluster config. If the machine CIDR is present, the controller will add the CIDR to noProxy:

$ oc get cm/cluster-config-v1 -n kube-system -o yaml | grep machine
      machineCIDR: 10.0.0.0/16

Gaoyun Pei, can you confirm that your vsphere and baremetal installs do not contain a machineCIDR?
Previously, the installer did not add machineCIDR to the default noProxy list. PR https://github.com/openshift/installer/pull/2286 adds this support. Previously, the proxy controller was only adding machineCIDR for cloud provider installation types. PR https://github.com/openshift/cluster-network-operator/pull/305 adds machineCIDR for all installation types.
Now that the PRs from comment #30 have merged, please make sure the IPs used for cluster nodes are within the install-config machineCIDR range. If they are not, the machineCIDR should be updated before the install, or the machine IPs or CIDR should be added to the proxy noProxy list.
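The "node IP within machineCIDR" check described above can be verified locally. A small sketch using the standard library (the helper name `inMachineCIDR` is illustrative, not part of any OpenShift codebase):

```go
package main

import (
	"fmt"
	"net"
)

// inMachineCIDR reports whether a node IP falls inside the
// install-config machineCIDR range. If it does not, the node's
// traffic will not match the machineCIDR entry in noProxy and
// may be routed through the proxy unexpectedly.
func inMachineCIDR(nodeIP, machineCIDR string) (bool, error) {
	_, cidr, err := net.ParseCIDR(machineCIDR)
	if err != nil {
		return false, fmt.Errorf("invalid machineCIDR %q: %v", machineCIDR, err)
	}
	ip := net.ParseIP(nodeIP)
	if ip == nil {
		return false, fmt.Errorf("invalid node IP %q", nodeIP)
	}
	return cidr.Contains(ip), nil
}

func main() {
	// 10.0.138.24 is one of the master IPs from the logs in this bug;
	// 192.168.1.5 stands in for a node outside the range.
	for _, ip := range []string{"10.0.138.24", "192.168.1.5"} {
		ok, err := inMachineCIDR(ip, "10.0.0.0/16")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s in 10.0.0.0/16: %v\n", ip, ok)
	}
	// prints:
	// 10.0.138.24 in 10.0.0.0/16: true
	// 192.168.1.5 in 10.0.0.0/16: false
}
```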
Verified this bug with payload 4.2.0-0.nightly-2019-09-08-180038. The machine-config operator is running well, and machineCIDR was added into noProxy both on the bootstrap node and in the cluster.

UPI-on-AWS:

In-cluster noProxy:
.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-09091.qe.devcluster.openshift.com,api.gpei-09091.qe.devcluster.openshift.com,etcd-0.gpei-09091.qe.devcluster.openshift.com,etcd-1.gpei-09091.qe.devcluster.openshift.com,etcd-2.gpei-09091.qe.devcluster.openshift.com,localhost,test.no-proxy.com

On the bootstrap node:
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-09091.qe.devcluster.openshift.com,api.gpei-09091.qe.devcluster.openshift.com,etcd-0.gpei-09091.qe.devcluster.openshift.com,etcd-1.gpei-09091.qe.devcluster.openshift.com,etcd-2.gpei-09091.qe.devcluster.openshift.com,localhost,test.no-proxy.com

UPI-on-Baremetal:

# oc get cm/cluster-config-v1 -n kube-system -o yaml | grep machine
      machineCIDR: 10.0.0.0/16

In-cluster noProxy:
.cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,api-int.wzheng-0909.qe.devcluster.openshift.com,api.wzheng-0909.qe.devcluster.openshift.com,etcd-0.wzheng-0909.qe.devcluster.openshift.com,etcd-1.wzheng-0909.qe.devcluster.openshift.com,etcd-2.wzheng-0909.qe.devcluster.openshift.com,localhost,test.no-proxy.com

On the bootstrap node:
NO_PROXY=.cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,api-int.wzheng-0909.qe.devcluster.openshift.com,api.wzheng-0909.qe.devcluster.openshift.com,etcd-0.wzheng-0909.qe.devcluster.openshift.com,etcd-1.wzheng-0909.qe.devcluster.openshift.com,etcd-2.wzheng-0909.qe.devcluster.openshift.com,localhost,test.no-proxy.com
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 1000 days.