Bug 1744532
| Field | Value |
|---|---|
| Summary | [Proxy] machine-config reporting Degraded is true |
| Product | OpenShift Container Platform |
| Component | Installer |
| Installer sub component | openshift-installer |
| Version | 4.2.0 |
| Target Release | 4.2.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Keywords | Reopened |
| Reporter | XiuJuan Wang <xiuwang> |
| Assignee | Joseph Callen <jcallen> |
| QA Contact | Gaoyun Pei <gpei> |
| CC | cdc, dhansen, gpei, lmeyer, wzheng |
| Type | Bug |
| Last Closed | 2019-10-16 06:37:16 UTC |
Description (XiuJuan Wang, 2019-08-22 11:04:35 UTC)
This likely happens on installation. If so, can you provide the install config that you're using, please?

```
2019-08-22T08:19:52.610826696Z W0822 08:19:52.610794 1 render.go:137] Warning: the controller config referenced an unsupported platform: vsphere
```

First, see the warning above. Second: we don't support a 1-master setup, as that goes against the etcd-quorum-guard and we cannot apply any configuration. Please re-run the test using a supported platform (I guess?) and a setup with at least 3 masters.

Can you grab and share /etc/machine-config-daemon/currentconfig from the masters? I need to identify what's diffing.

Seems no such file is found on the masters:

```
[core@control-plane-2 ~]$ sudo ls /etc/machine-config-daemon/
node-annotations.json
[core@control-plane-2 ~]$ cat /etc/machine-config-daemon/node-annotations.json
{"machineconfiguration.openshift.io/currentConfig":"rendered-master-497c8d34be41019278ce560778ca6e0e","machineconfiguration.openshift.io/desiredConfig":"rendered-master-497c8d34be41019278ce560778ca6e0e","machineconfiguration.openshift.io/state":"Done"}
```

Will try to set up a cluster using the same payload without proxy to make a comparison.

Which version of the installer is being used? Are you extracting the installer from the payload that you're using to install the cluster?

The questions above have been answered in https://coreos.slack.com/archives/CLJSH16J0/p1566547827097400.

Maybe this helps? https://github.com/openshift/installer/pull/2257

I see a similar issue doing an install with proxy enabled:

```
FATAL failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.2.0-0.okd-2019-08-22-211349 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-145-47.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\", Node ip-10-0-128-7.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\", Node ip-10-0-170-126.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\"", retrying
```

I am going to test a proxy-enabled cluster install since https://github.com/openshift/installer/pull/2257 has merged.

Verified this bug using payload 4.2.0-0.nightly-2019-08-24-002347; a proxy-enabled cluster could be set up successfully.

```
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-24-002347   True        False         4m20s   Cluster version is 4.2.0-0.nightly-2019-08-24-002347

# oc get machineconfigpool
NAME     CONFIG                                              UPDATED   UPDATING   DEGRADED
master   rendered-master-aa73b2191c3f4149b16b79efc166917c    True      False      False
worker   rendered-worker-c700f3dfcfd0fb2831d9e53dd3818a67    True      False      False
```

I am hitting this bug even though https://github.com/openshift/installer/pull/2257 has merged.
time="2019-08-26T13:58:03-07:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-26-193559: 99% complete" time="2019-08-26T14:00:33-07:00" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed to resync 4.2.0-0.okd-2019-08-26-193559 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-138-24.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\", Node ip-10-0-174-132.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\"\", retrying" Additional info related to comment #18 $ oc get cm/cluster-config-v1 -n kube-system -o yaml apiVersion: v1 data: install-config: | additionalTrustBundle: | -----BEGIN CERTIFICATE----- <SNIP> -----END CERTIFICATE----- apiVersion: v1 baseDomain: devcluster.openshift.com compute: - hyperthreading: Enabled name: worker platform: {} replicas: 3 controlPlane: hyperthreading: Enabled name: master platform: aws: rootVolume: iops: 0 size: 120 type: gp2 type: m5.xlarge zones: - us-west-2a - us-west-2b - us-west-2c - us-west-2d replicas: 3 metadata: creationTimestamp: null name: jcallen-proxy networking: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 machineCIDR: 10.0.0.0/16 networkType: OpenShiftSDN serviceNetwork: - 172.30.0.0/16 platform: aws: region: us-west-2 proxy: httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129 httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129 noProxy: .s3.us-west-2.amazonaws.com,.us-west-2.compute.internal pullSecret: "" sshKey: | ssh-rsa <SNIP> $ oc get clusteroperator/machine-config -o yaml apiVersion: config.openshift.io/v1 kind: ClusterOperator metadata: creationTimestamp: "2019-08-26T19:54:09Z" generation: 1 name: machine-config resourceVersion: "36638" selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config uid: 44f27d51-c83b-11e9-b007-02f9ee0624ec spec: {} status: conditions: - lastTransitionTime: "2019-08-26T19:54:09Z" message: Cluster not available for 4.2.0-0.okd-2019-08-26-193559 status: "False" type: Available - lastTransitionTime: "2019-08-26T19:54:09Z" message: Cluster is bootstrapping 4.2.0-0.okd-2019-08-26-193559 status: "True" type: Progressing - lastTransitionTime: "2019-08-26T20:04:37Z" message: 'Failed to resync 4.2.0-0.okd-2019-08-26-193559 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-138-24.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\" not found\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting: 
\"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\" not found\", Node ip-10-0-174-132.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\" not found\"", retrying' reason: RequiredPoolsFailed status: "True" type: Degraded - lastTransitionTime: "2019-08-26T20:04:37Z" reason: AsExpected status: "True" type: Upgradeable extension: lastSyncError: 'pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-138-24.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\" not found\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\" not found\", Node ip-10-0-174-132.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\" not found\"", retrying' worker: all 3 nodes are at latest configuration rendered-worker-5afcba28b5a21c885daeb4599215b1fc relatedObjects: - group: "" name: openshift-machine-config-operator resource: namespaces - group: machineconfiguration.openshift.io name: master resource: machineconfigpools - group: machineconfiguration.openshift.io name: worker resource: machineconfigpools - group: machineconfiguration.openshift.io name: cluster resource: controllerconfigs versions: - name: operator version: 4.2.0-0.okd-2019-08-26-193559 Add for Comment #18 $ oc get proxy/cluster -o yaml apiVersion: config.openshift.io/v1 kind: Proxy metadata: creationTimestamp: "2019-08-26T19:52:52Z" generation: 1 name: cluster resourceVersion: "1721" selfLink: /apis/config.openshift.io/v1/proxies/cluster uid: 16ef3dab-c83b-11e9-b007-02f9ee0624ec spec: httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129 httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129 noProxy: .s3.us-west-2.amazonaws.com,.us-west-2.compute.internal trustedCA: name: user-ca-bundle status: httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129 httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129 noProxy: .cluster.local,.s3.us-west-2.amazonaws.com,.svc,.us-west-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jcallen-proxy.devcluster.openshift.com,api.jcallen-proxy.devcluster.openshift.com,etcd-0.jcallen-proxy.devcluster.openshift.com,etcd-1.jcallen-proxy.devcluster.openshift.com,etcd-2.jcallen-proxy.devcluster.openshift.com,localhost Note: I was seeing the same bug before adding .s3.us-west-2.amazonaws.com and .us-west-2.compute.internal to noProxy. (In reply to Daneyon Hansen from comment #21) > Note: I was seeing the same bug before adding .s3.us-west-2.amazonaws.com > and .us-west-2.compute.internal to noProxy. I think the issue could be in the trustedCA - as discussed on Slack, if you can't reproduce w/o the trustedCA, it's prob a different bug which I'll look into it. The issue here is again a skew between bootstrap and cluster. 
- the cluster does contain 10.0.0.0/16 in noProxy
- the bootstrap doesn't (for whatever reason)

This is similar to https://github.com/openshift/installer/pull/2257. When there's a skew like the above in the configs (noProxy here), the bootstrap generates a rendered machineconfig that the cluster won't be able to exactly regenerate once the cluster is running. When that happens, the MCO notices and fails to get the correct machineconfig (and at installation time the bootstrap one is the source of truth for us). A sketch of this naming mechanism follows below.

So again, I'm not sure why 10.0.0.0/16 is injected into the cluster's noProxy but doesn't get injected at bootstrap. We should have the bootstrap inject it as well, so the MCO is able to grab the correct rendered machineconfig. Moving to installer, I guess, but I wondered if this is maybe a network task. Also, as Trevor pointed out in the PR linked above, maybe we should have a unified way of generating noProxy and whatever else to avoid such drifts (maybe?).

https://github.com/openshift/installer/pull/2286 fixes this bug.
> In the proxy-enabled cluster on vsphere and baremetal, it doesn't have "10.0.0.0/16" added into noProxy.

The proxy controller reads machineCIDR from the cluster config. If the machine CIDR is present, the controller will add the CIDR to noProxy (see the sketch after the output below):
```
$ oc get cm/cluster-config-v1 -n kube-system -o yaml | grep machine
      machineCIDR: 10.0.0.0/16
```
Gaoyun Pei, can you confirm that your vsphere and baremetal installs do not contain a machineCIDR?
Previously, the installer did not add machineCIDR to the default noProxy list; PR https://github.com/openshift/installer/pull/2286 adds this support. Previously, the proxy controller was only adding machineCIDR for cloud-provider installation types; PR https://github.com/openshift/cluster-network-operator/pull/305 adds machineCIDR for all installation types.

Now that the PRs from comment #30 have merged, please make sure the IPs used for cluster nodes are within the install-config machineCIDR range. If not, either update the machineCIDR before the install, or add the machine IPs or CIDR to the proxy noProxy (a simple containment check is sketched at the end of this report).

Verified this bug with payload 4.2.0-0.nightly-2019-09-08-180038: the machine-config operator is running well, and machineCIDR was added into noProxy both on the bootstrap node and in the cluster.

UPI-on-AWS:

In-cluster noProxy:
```
.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-09091.qe.devcluster.openshift.com,api.gpei-09091.qe.devcluster.openshift.com,etcd-0.gpei-09091.qe.devcluster.openshift.com,etcd-1.gpei-09091.qe.devcluster.openshift.com,etcd-2.gpei-09091.qe.devcluster.openshift.com,localhost,test.no-proxy.com
```

On the bootstrap node:
```
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-09091.qe.devcluster.openshift.com,api.gpei-09091.qe.devcluster.openshift.com,etcd-0.gpei-09091.qe.devcluster.openshift.com,etcd-1.gpei-09091.qe.devcluster.openshift.com,etcd-2.gpei-09091.qe.devcluster.openshift.com,localhost,test.no-proxy.com
```

UPI-on-Baremetal:
```
# oc get cm/cluster-config-v1 -n kube-system -o yaml | grep machine
      machineCIDR: 10.0.0.0/16
```

In-cluster noProxy:
```
.cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,api-int.wzheng-0909.qe.devcluster.openshift.com,api.wzheng-0909.qe.devcluster.openshift.com,etcd-0.wzheng-0909.qe.devcluster.openshift.com,etcd-1.wzheng-0909.qe.devcluster.openshift.com,etcd-2.wzheng-0909.qe.devcluster.openshift.com,localhost,test.no-proxy.com
```

On the bootstrap node:
```
NO_PROXY=.cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,api-int.wzheng-0909.qe.devcluster.openshift.com,api.wzheng-0909.qe.devcluster.openshift.com,etcd-0.wzheng-0909.qe.devcluster.openshift.com,etcd-1.wzheng-0909.qe.devcluster.openshift.com,etcd-2.wzheng-0909.qe.devcluster.openshift.com,localhost,test.no-proxy.com
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.