Bug 1744532 - [Proxy]machine-config reporting Degraded is true [NEEDINFO]
Summary: [Proxy]machine-config reporting Degraded is true
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Joseph Callen
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-22 11:04 UTC by XiuJuan Wang
Modified: 2019-10-16 06:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:37:16 UTC
Target Upstream Version:
gpei: needinfo? (lmeyer)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 305 'None' closed Bug 1744532: Adds machineCIDR noProxy support for all platforms 2020-09-09 17:35:13 UTC
Github openshift installer pull 2257 'None' closed Bug 1744532: proxy: add .svc and .cluster.local to default noProxy 2020-09-09 17:35:13 UTC
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:37:27 UTC

Description XiuJuan Wang 2019-08-22 11:04:35 UTC
Description of problem:
machine-config cluster operator reports Degraded true with error:

Failed to resync 4.2.0-0.nightly-2019-08-21-235427 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node control-plane-0 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-2e89b2a4a4231cd3f4071302f9478eec\\\" not found\"", retrying

# oc get machineconfigpool
NAME     CONFIG   UPDATED   UPDATING   DEGRADED
master            False     True       True
worker            False     True       True

# oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   CREATED
00-master                                                   66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
00-worker                                                   66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
01-master-container-runtime                                 66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
01-master-kubelet                                           66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
01-worker-container-runtime                                 66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
01-worker-kubelet                                           66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
99-master-588d6ce1-c4a6-11e9-8cf0-0050568b2275-registries   66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
99-master-ssh                                                                                          2.2.0             4h30m
99-worker-588e2512-c4a6-11e9-8cf0-0050568b2275-registries   66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
99-worker-ssh                                                                                          2.2.0             4h30m
rendered-master-fb8d673690e8a72c4c75cd1fbc0b2beb            66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
rendered-worker-ef7934eab8903cd8cafd32ef72cd69df            66f991220e810d8be5e5793581aaa65e80b636e6   2.2.0             4h30m
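The mismatch between the error and the listing above can be checked mechanically. The following is a minimal Python sketch, not part of any OpenShift tooling: the helper name `missing_rendered_configs` is hypothetical, and the regular expression assumes rendered names have the form `rendered-<master|worker>-<32 hex characters>`, as they do in this report.

```python
import re

# Matches rendered MachineConfig names such as rendered-master-<32 hex chars>.
# Assumption: only the default "master" and "worker" pools are of interest.
RENDERED_RE = re.compile(r"rendered-(?:master|worker)-[0-9a-f]{32}")


def missing_rendered_configs(error_message, existing_names):
    """Return rendered config names cited in the error but absent from the cluster."""
    wanted = set(RENDERED_RE.findall(error_message))
    return sorted(wanted - set(existing_names))


if __name__ == "__main__":
    # Shortened copy of the operator message from this report.
    message = (
        'Node control-plane-0 is reporting: '
        '"machineconfig.machineconfiguration.openshift.io '
        '\\"rendered-master-2e89b2a4a4231cd3f4071302f9478eec\\" not found"'
    )
    # Rendered names taken from the `oc get machineconfig` listing.
    existing = [
        "rendered-master-fb8d673690e8a72c4c75cd1fbc0b2beb",
        "rendered-worker-ef7934eab8903cd8cafd32ef72cd69df",
    ]
    print(missing_rendered_configs(message, existing))
```

Here the node asks for a rendered-master config whose hash differs from the only rendered-master config present, which is the symptom discussed in the comments below.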

The must-gather log can be found here http://virt-openshift-05.lab.eng.nay.redhat.com/xiuwang/

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-21-235427

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Antonio Murdaca 2019-08-22 11:40:02 UTC
This likely happens during installation. If so, can you please provide the install config that you're using?

Comment 2 Antonio Murdaca 2019-08-22 12:15:42 UTC
2019-08-22T08:19:52.610826696Z W0822 08:19:52.610794       1 render.go:137] Warning: the controller config referenced an unsupported platform: vsphere

First ^^

Second: we don't support a single-master setup, because it conflicts with the etcd-quorum-guard and we cannot apply any configuration. Please re-run the test using a supported platform (I guess?) and a setup with at least 3 masters.

Comment 6 Antonio Murdaca 2019-08-22 17:20:36 UTC
Can you grab and share /etc/machine-config-daemon/currentconfig from the masters? I need to identify what differs.

Comment 7 Gaoyun Pei 2019-08-23 00:41:16 UTC
It seems no such file exists on the masters.

[core@control-plane-2 ~]$ sudo ls /etc/machine-config-daemon/
node-annotations.json
[core@control-plane-2 ~]$ cat /etc/machine-config-daemon/node-annotations.json 
{"machineconfiguration.openshift.io/currentConfig":"rendered-master-497c8d34be41019278ce560778ca6e0e","machineconfiguration.openshift.io/desiredConfig":"rendered-master-497c8d34be41019278ce560778ca6e0e","machineconfiguration.openshift.io/state":"Done"}
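As a rough aid for reading this file, here is a small Python sketch that compares the current and desired config annotations. The function name `node_config_state` is hypothetical; only the annotation keys come from the output above.

```python
import json

# Annotation key prefix as it appears in node-annotations.json above.
PREFIX = "machineconfiguration.openshift.io/"


def node_config_state(annotations_json):
    """Summarise the current/desired config annotations of one node."""
    annotations = json.loads(annotations_json)
    current = annotations.get(PREFIX + "currentConfig")
    desired = annotations.get(PREFIX + "desiredConfig")
    return {
        "current": current,
        "desired": desired,
        "state": annotations.get(PREFIX + "state"),
        # In sync only when both annotations exist and name the same config.
        "in_sync": current is not None and current == desired,
    }


if __name__ == "__main__":
    raw = json.dumps({
        PREFIX + "currentConfig": "rendered-master-497c8d34be41019278ce560778ca6e0e",
        PREFIX + "desiredConfig": "rendered-master-497c8d34be41019278ce560778ca6e0e",
        PREFIX + "state": "Done",
    })
    print(node_config_state(raw))
```

For the node above, both annotations name the same rendered config and the state is Done, so the node itself considers its configuration applied.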


I will try to set up a cluster using the same payload without a proxy to make a comparison.

Comment 11 Antonio Murdaca 2019-08-23 08:07:31 UTC
Which version of the installer is being used? Are you extracting the installer from the payload that you're using to install the cluster?

Comment 12 XiuJuan Wang 2019-08-23 09:46:35 UTC
The questions above have been answered in https://coreos.slack.com/archives/CLJSH16J0/p1566547827097400 .

Comment 14 Antonio Murdaca 2019-08-23 11:07:43 UTC
Maybe this helps? https://github.com/openshift/installer/pull/2257

Comment 15 Daneyon Hansen 2019-08-23 15:34:16 UTC
I see a similar issue doing an install with proxy enabled:

FATAL failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.2.0-0.okd-2019-08-22-211349 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-145-47.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\", Node ip-10-0-128-7.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\", Node ip-10-0-170-126.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-1c883daad8b31710487eb70aebdb6c02\\\" not found\"", retrying 

I am going to test a proxy-enabled cluster install since https://github.com/openshift/installer/pull/2257 has merged.

Comment 17 Gaoyun Pei 2019-08-24 07:43:02 UTC
Verified this bug using payload 4.2.0-0.nightly-2019-08-24-002347; a proxy-enabled cluster could be set up successfully.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-24-002347   True        False         4m20s   Cluster version is 4.2.0-0.nightly-2019-08-24-002347
# oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-aa73b2191c3f4149b16b79efc166917c   True      False      False
worker   rendered-worker-c700f3dfcfd0fb2831d9e53dd3818a67   True      False      False

Comment 18 Daneyon Hansen 2019-08-26 21:24:53 UTC
I am hitting this bug even though https://github.com/openshift/installer/pull/2257 has merged.

time="2019-08-26T13:58:03-07:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-26-193559: 99% complete"
time="2019-08-26T14:00:33-07:00" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed to resync 4.2.0-0.okd-2019-08-26-193559 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-138-24.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\", Node ip-10-0-174-132.us-west-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\\\\\" not found\\\"\", retrying"

Comment 19 Daneyon Hansen 2019-08-26 21:29:53 UTC
Additional info related to comment #18

$ oc get cm/cluster-config-v1 -n kube-system -o yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundle: |
      -----BEGIN CERTIFICATE-----
            <SNIP>
      -----END CERTIFICATE-----
    apiVersion: v1
    baseDomain: devcluster.openshift.com
    compute:
    - hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 3
    controlPlane:
      hyperthreading: Enabled
      name: master
      platform:
        aws:
          rootVolume:
            iops: 0
            size: 120
            type: gp2
          type: m5.xlarge
          zones:
          - us-west-2a
          - us-west-2b
          - us-west-2c
          - us-west-2d
      replicas: 3
    metadata:
      creationTimestamp: null
      name: jcallen-proxy
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineCIDR: 10.0.0.0/16
      networkType: OpenShiftSDN
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      aws:
        region: us-west-2
    proxy:
      httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
      httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
      noProxy: .s3.us-west-2.amazonaws.com,.us-west-2.compute.internal
    pullSecret: ""
    sshKey: |
      ssh-rsa <SNIP>


$ oc get clusteroperator/machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-08-26T19:54:09Z"
  generation: 1
  name: machine-config
  resourceVersion: "36638"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 44f27d51-c83b-11e9-b007-02f9ee0624ec
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-08-26T19:54:09Z"
    message: Cluster not available for 4.2.0-0.okd-2019-08-26-193559
    status: "False"
    type: Available
  - lastTransitionTime: "2019-08-26T19:54:09Z"
    message: Cluster is bootstrapping 4.2.0-0.okd-2019-08-26-193559
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-08-26T20:04:37Z"
    message: 'Failed to resync 4.2.0-0.okd-2019-08-26-193559 because: timed out waiting
      for the condition during syncRequiredMachineConfigPools: pool master has not
      progressed to latest configuration: configuration status for pool master is
      empty: pool is degraded because nodes fail with "3 nodes are reporting degraded
      status on sync": "Node ip-10-0-138-24.us-west-2.compute.internal is reporting:
      \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io
      \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\" not found\", Node ip-10-0-174-132.us-west-2.compute.internal
      is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\"", retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-08-26T20:04:37Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    lastSyncError: 'pool master has not progressed to latest configuration: configuration
      status for pool master is empty: pool is degraded because nodes fail with "3
      nodes are reporting degraded status on sync": "Node ip-10-0-138-24.us-west-2.compute.internal
      is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\", Node ip-10-0-151-191.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io
      \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\" not found\", Node ip-10-0-174-132.us-west-2.compute.internal
      is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-68f2a90e8db959a3b5a1971f707fea04\\\"
      not found\"", retrying'
    worker: all 3 nodes are at latest configuration rendered-worker-5afcba28b5a21c885daeb4599215b1fc
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: cluster
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.2.0-0.okd-2019-08-26-193559

Comment 20 Daneyon Hansen 2019-08-26 21:33:03 UTC
Additional info for comment #18

$ oc get proxy/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2019-08-26T19:52:52Z"
  generation: 1
  name: cluster
  resourceVersion: "1721"
  selfLink: /apis/config.openshift.io/v1/proxies/cluster
  uid: 16ef3dab-c83b-11e9-b007-02f9ee0624ec
spec:
  httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
  httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
  noProxy: .s3.us-west-2.amazonaws.com,.us-west-2.compute.internal
  trustedCA:
    name: user-ca-bundle
status:
  httpProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
  httpsProxy: http://jcallen:6cpbEH6uCepwEhNr2iB05ixP@52.73.102.120:3129
  noProxy: .cluster.local,.s3.us-west-2.amazonaws.com,.svc,.us-west-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jcallen-proxy.devcluster.openshift.com,api.jcallen-proxy.devcluster.openshift.com,etcd-0.jcallen-proxy.devcluster.openshift.com,etcd-1.jcallen-proxy.devcluster.openshift.com,etcd-2.jcallen-proxy.devcluster.openshift.com,localhost

Comment 21 Daneyon Hansen 2019-08-26 21:34:18 UTC
Note: I was seeing the same bug before adding .s3.us-west-2.amazonaws.com and .us-west-2.compute.internal to noProxy.

Comment 22 Antonio Murdaca 2019-08-26 21:48:41 UTC
(In reply to Daneyon Hansen from comment #21)
> Note: I was seeing the same bug before adding .s3.us-west-2.amazonaws.com
> and .us-west-2.compute.internal to noProxy.

I think the issue could be in the trustedCA. As discussed on Slack, if you can't reproduce it without the trustedCA, it's probably a different bug, which I'll look into.

Comment 24 Antonio Murdaca 2019-08-27 21:26:11 UTC
The issue here is again a skew between bootstrap and cluster.

- cluster does contain 10.0.0.0/16 in noProxy
- bootstrap doesn't (for whatever reason)

This is similar to https://github.com/openshift/installer/pull/2257

When there's a skew like the above in configs (noProxy here), the bootstrap generates a rendered machineconfig that the cluster cannot regenerate exactly once it is running. When that happens, the MCO notices and fails to get the correct machineconfig (at installation time, the bootstrap one is the source of truth for us).

So again, I'm not sure why 10.0.0.0/16 is injected into the cluster's noProxy but not at bootstrap. The bootstrap should inject it as well so the MCO is able to grab the correct rendered machineconfig.

Moving to installer, I guess, though I wonder whether this is a network task.

Also, as Trevor pointed out in the PR linked above, maybe we should have a unified way of generating noProxy and whatever else to avoid such drifts (maybe?)
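The skew described in this comment can be illustrated with a small Python sketch. It assumes, purely for illustration, that a rendered config name is the pool name plus a digest of the rendered content. The `rendered_name` helper and its input format are hypothetical and do not reproduce the MCO's actual naming code.

```python
import hashlib


def rendered_name(pool, no_proxy_entries):
    """Hypothetical naming scheme: pool name plus a digest of the proxy input."""
    # Sort so that ordering differences alone never change the digest.
    payload = "NO_PROXY=" + ",".join(sorted(no_proxy_entries))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]
    return f"rendered-{pool}-{digest}"


if __name__ == "__main__":
    bootstrap = [".cluster.local", ".svc", "127.0.0.1", "localhost"]
    cluster = bootstrap + ["10.0.0.0/16"]  # the extra entry seen in the cluster
    # The two names differ, so a node asking for the bootstrap name would
    # not find it among the configs rendered inside the running cluster.
    print(rendered_name("master", bootstrap))
    print(rendered_name("master", cluster))
```

Under this assumption, any single-entry difference in noProxy yields a different name, which matches the "not found" errors reported by the nodes.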

Comment 26 Daneyon Hansen 2019-08-28 22:32:23 UTC
https://github.com/openshift/installer/pull/2286 fixes this bug.

Comment 27 Daneyon Hansen 2019-08-28 22:36:24 UTC
> In the proxy-enabled cluster on vsphere and baremetal, it doesn't have "10.0.0.0/16" added into noProxy.

The proxy controller reads machineCIDR from the cluster config. If machineCIDR is present, the controller adds the CIDR to noProxy:

$ oc get cm/cluster-config-v1 -n kube-system -o yaml | grep machine
      machineCIDR: 10.0.0.0/16

Gaoyun Pei, can you confirm that your vsphere and baremetal installs do not contain a machineCIDR?
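The merge described in this comment might look roughly like the Python sketch below. It is not the proxy controller's code: the function `effective_no_proxy`, its parameters, and the small default list are assumptions, chosen to resemble the sorted status.noProxy shown in comment #20.

```python
def effective_no_proxy(user_no_proxy, machine_cidr=None, extra=()):
    """Merge user noProxy entries with defaults and an optional machineCIDR."""
    # Assumed defaults; the real controller adds more (API and etcd hosts, etc.).
    entries = {"127.0.0.1", "localhost", ".svc", ".cluster.local"}
    # Split the user-supplied comma-separated string, dropping empty items.
    entries.update(e.strip() for e in user_no_proxy.split(",") if e.strip())
    entries.update(extra)
    if machine_cidr:
        # Only present when the install config defines a machineCIDR.
        entries.add(machine_cidr)
    # A sorted, de-duplicated list keeps the result stable across runs.
    return ",".join(sorted(entries))


if __name__ == "__main__":
    print(effective_no_proxy(
        ".s3.us-west-2.amazonaws.com,.us-west-2.compute.internal",
        machine_cidr="10.0.0.0/16",
        extra=["10.128.0.0/14", "172.30.0.0/16"],
    ))
```

If the bootstrap and the cluster both derived noProxy from one function like this, with the same inputs, the two lists could not drift apart.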

Comment 30 Daneyon Hansen 2019-08-29 04:47:33 UTC
Previously, the installer did not add machineCIDR to the default noProxy list. PR https://github.com/openshift/installer/pull/2286 adds this support. Previously, the proxy controller was only adding machineCIDR for cloud provider installation types. PR https://github.com/openshift/cluster-network-operator/pull/305 adds machineCIDR for all installation types.

Comment 33 Daneyon Hansen 2019-09-03 17:56:59 UTC
Now that the PRs from comment #30 have merged, please make sure the IPs used for cluster nodes are within the install-config machineCIDR range. If they are not, either update machineCIDR before the install or add the machine IPs or CIDR to the proxy's noProxy.
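A quick way to run this check before an install is sketched below in Python, using the standard `ipaddress` module. The helper name `ips_outside_cidr` and the 192.168.1.10 address are hypothetical; the 10.0.x.x addresses are derived from node hostnames quoted earlier in this bug.

```python
import ipaddress


def ips_outside_cidr(node_ips, machine_cidr):
    """Return the node IPs that do not fall inside machineCIDR."""
    network = ipaddress.ip_network(machine_cidr)
    return [ip for ip in node_ips if ipaddress.ip_address(ip) not in network]


if __name__ == "__main__":
    nodes = ["10.0.138.24", "10.0.151.191", "192.168.1.10"]
    outside = ips_outside_cidr(nodes, "10.0.0.0/16")
    if outside:
        # These nodes need a wider machineCIDR or explicit noProxy entries.
        print("outside machineCIDR:", ", ".join(outside))
```

Any address reported by this check would have its traffic sent through the proxy unless it is covered by another noProxy entry.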

Comment 37 Gaoyun Pei 2019-09-09 03:50:34 UTC
Verified this bug with payload 4.2.0-0.nightly-2019-09-08-180038. The machine-config operator is running well, and machineCIDR was added to noProxy both on the bootstrap node and in the cluster.


UPI-on-AWS:

In cluster
noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-09091.qe.devcluster.openshift.com,api.gpei-09091.qe.devcluster.openshift.com,etcd-0.gpei-09091.qe.devcluster.openshift.com,etcd-1.gpei-09091.qe.devcluster.openshift.com,etcd-2.gpei-09091.qe.devcluster.openshift.com,localhost,test.no-proxy.com

On bootstrap
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-09091.qe.devcluster.openshift.com,api.gpei-09091.qe.devcluster.openshift.com,etcd-0.gpei-09091.qe.devcluster.openshift.com,etcd-1.gpei-09091.qe.devcluster.openshift.com,etcd-2.gpei-09091.qe.devcluster.openshift.com,localhost,test.no-proxy.com



UPI-on-Baremetal:

# oc get cm/cluster-config-v1 -n kube-system -o yaml | grep machine
      machineCIDR: 10.0.0.0/16

In cluster
noProxy: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,api-int.wzheng-0909.qe.devcluster.openshift.com,api.wzheng-0909.qe.devcluster.openshift.com,etcd-0.wzheng-0909.qe.devcluster.openshift.com,etcd-1.wzheng-0909.qe.devcluster.openshift.com,etcd-2.wzheng-0909.qe.devcluster.openshift.com,localhost,test.no-proxy.com

On bootstrap
NO_PROXY=.cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,api-int.wzheng-0909.qe.devcluster.openshift.com,api.wzheng-0909.qe.devcluster.openshift.com,etcd-0.wzheng-0909.qe.devcluster.openshift.com,etcd-1.wzheng-0909.qe.devcluster.openshift.com,etcd-2.wzheng-0909.qe.devcluster.openshift.com,localhost,test.no-proxy.com
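The comparison done by eye in this comment can also be scripted. The Python sketch below parses one `noProxy:` line and one `NO_PROXY=` line and reports entries present on only one side. The helper names `parse_no_proxy` and `no_proxy_skew` are hypothetical, and the sample values are shortened from the output above.

```python
def parse_no_proxy(line):
    """Accept 'noProxy: a,b' (Proxy status) or 'NO_PROXY=a,b' (bootstrap env)."""
    text = line.strip()
    for prefix in ("noProxy:", "NO_PROXY="):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return {entry.strip() for entry in text.split(",") if entry.strip()}


def no_proxy_skew(cluster_line, bootstrap_line):
    """Return (only_in_cluster, only_in_bootstrap) as sorted lists."""
    cluster = parse_no_proxy(cluster_line)
    bootstrap = parse_no_proxy(bootstrap_line)
    return sorted(cluster - bootstrap), sorted(bootstrap - cluster)


if __name__ == "__main__":
    in_cluster = "noProxy: .cluster.local,.svc,10.0.0.0/16,127.0.0.1,localhost"
    on_bootstrap = "NO_PROXY=.cluster.local,.svc,10.0.0.0/16,127.0.0.1,localhost"
    only_cluster, only_bootstrap = no_proxy_skew(in_cluster, on_bootstrap)
    print("only in cluster:", only_cluster)
    print("only on bootstrap:", only_bootstrap)
```

Two empty lists mean the bootstrap and the cluster agree, which is the verified state after the fix. Before the fix, 10.0.0.0/16 would have shown up as present only in the cluster.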

Comment 38 errata-xmlrpc 2019-10-16 06:37:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

