Bug 1901034 - NO_PROXY is not matched between bootstrap and global cluster setting which lead to desired master machineconfig is not found
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Matthew Staebler
QA Contact: Johnny Liu
URL:
Whiteboard:
Duplicates: 1899979 1901577 1904231 1906321 1906620 1916904
Depends On:
Blocks:
 
Reported: 2020-11-24 10:47 UTC by Johnny Liu
Modified: 2021-01-27 14:26 UTC
CC: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-sippy] install should work
Last Closed: 2021-01-21 07:11:21 UTC
Target Upstream Version:
Embargoed:



Description Johnny Liu 2020-11-24 10:47:55 UTC
When installing a cluster behind a proxy, the installation fails:
level=info msg=Cluster operator machine-config Progressing is True with : Working towards 4.7.0-0.nightly-2020-11-23-195308
level=error msg=Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.7.0-0.nightly-2020-11-23-195308: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-52-10.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\", Node ip-10-0-77-145.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\", Node ip-10-0-48-13.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\"", retrying
level=info msg=Cluster operator machine-config Available is False with : Cluster not available for 4.7.0-0.nightly-2020-11-23-195308
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is still updating

Digging into the failure, we found the root cause is a mismatch of the NO_PROXY setting between the bootstrap node and the cluster-wide proxy configuration:
<       Environment=NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-471.qe.devcluster.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,test.no-proxy.com
---
>       Environment=NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-471.qe.devcluster.openshift.com,etcd-0.gpei-471.qe.devcluster.openshift.com,etcd-1.gpei-471.qe.devcluster.openshift.com,etcd-2.gpei-471.qe.devcluster.openshift.com,localhost,test.no-proxy.com

$ oc get proxies.config.openshift.io cluster  -o yaml
<--snip-->
spec:
  httpProxy: http://user:password@10.0.99.4:3128
  httpsProxy: http://user:password@10.0.99.4:3128
  noProxy: test.no-proxy.com
  trustedCA:
    name: ""
status:
  httpProxy: http://user:password@10.0.99.4:3128
  httpsProxy: http://user:password@10.0.99.4:3128
  noProxy: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.miyadav24azur.qe.azure.devcluster.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,test.no-proxy.com

It seems the cluster setting is missing the domain in the etcd FQDNs, while the bootstrap setting has it.
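To spot which entries diverge between the two NO_PROXY values, a small helper can diff the comma-separated lists. This is a diagnostic sketch only (the function and variable names are ours, not part of any OpenShift component):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// missingEntries returns the NO_PROXY entries present in want but absent
// from got, so mismatched values stand out immediately.
func missingEntries(got, want string) []string {
	have := map[string]bool{}
	for _, e := range strings.Split(got, ",") {
		have[strings.TrimSpace(e)] = true
	}
	var missing []string
	for _, e := range strings.Split(want, ",") {
		e = strings.TrimSpace(e)
		if e != "" && !have[e] {
			missing = append(missing, e)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Truncated entries as seen in the cluster proxy status vs. bootstrap.
	cluster := ".cluster.local,.svc,etcd-0.,etcd-1.,etcd-2.,localhost"
	bootstrap := ".cluster.local,.svc,etcd-0.example.com,etcd-1.example.com,etcd-2.example.com,localhost"
	fmt.Println(missingEntries(cluster, bootstrap))
	// Prints: [etcd-0.example.com etcd-1.example.com etcd-2.example.com]
}
```

Running this against the two values above immediately surfaces the fully-qualified etcd entries that the cluster-wide setting is missing.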


Version:
4.7.0-0.nightly-2020-11-23-195308


Platform:
gcp/azure/vsphere/aws


both IPI and UPI

What happened?
When running the installation behind a proxy, the installation fails.


What did you expect to happen?
The installation should succeed.


How to reproduce it (as minimally and precisely as possible)?
1. inject proxy setting into install-config.yaml, such as:
---
apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
metadata:
  name: gpei-472
platform:
  aws:
    region: us-east-2
    subnets:
    - subnet-0c56d13268c3b8d24
    - subnet-02c73c48c59eca6b0
    - subnet-0908bd983fe473287
    - subnet-080c64ca63b15fb2d
pullSecret: HIDDEN
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
publish: External
proxy:
  httpProxy: http://user:password@proxy.example.com:3128
  httpsProxy: http://user:password@proxy.example.com:3128
  noProxy: test.no-proxy.com
baseDomain: qe.devcluster.openshift.com
2. Run the installation


Anything else we need to know?
1. We did not hit this issue on 4.7.0-0.nightly-2020-11-18-085225.
2. We first saw this issue on 4.7.0-0.nightly-2020-11-20-234717.

Comment 1 Johnny Liu 2020-11-24 11:20:45 UTC
$ oc get infrastructures.config.openshift.io cluster -o yaml
<--snip-->
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: Azure
status:
  apiServerInternalURI: https://api-int.miyadav24azur.qe.azure.devcluster.openshift.com:6443
  apiServerURL: https://api.miyadav24azur.qe.azure.devcluster.openshift.com:6443
  etcdDiscoveryDomain: ""
  infrastructureName: miyadav24azur-t8cqb
  platform: Azure
  platformStatus:
    azure:
      cloudName: AzurePublicCloud
      networkResourceGroupName: miyadav24azur-rg
      resourceGroupName: miyadav24azur-t8cqb-rg
    type: Azure

The etcdDiscoveryDomain is empty; I think that is what leads to the etcd FQDNs missing the domain.

Comment 2 jamo luhrsen 2020-11-25 04:08:40 UTC
This is causing some jobs to fail very frequently now and bringing down some of the release indicator percentages we track.
Here is an example job:
  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy

I'm assuming it's the same problem because I see this in a failed job [0]:

  cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/00-master.yaml:            Environment=NO_PROXY=.cluster.local,.ec2.internal,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.ci-op-q4qfig05-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-0.,etcd-1.,etcd-2.,localhost



which seems to be missing some values like we see in the last job [1] that did not fail (it ran a week ago):

  cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/00-master.yaml:            Environment=NO_PROXY=.cluster.local,.svc,.us-west-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-0.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-1.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-2.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,localhost



[0] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331391826971594752
[1] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1328922989076418560

Comment 4 Kirsten Garrison 2020-12-04 20:59:25 UTC
Can we up the priority on this? I've found two bugs related to it:

https://bugzilla.redhat.com/show_bug.cgi?id=1899979
https://bugzilla.redhat.com/show_bug.cgi?id=1904231

Also the entire aws-proxy job is red: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy?buildId=

I'm going to dupe the bzs into this so we can have this bug as the tracking bug.

Comment 5 Kirsten Garrison 2020-12-04 21:01:53 UTC
*** Bug 1899979 has been marked as a duplicate of this bug. ***

Comment 6 Kirsten Garrison 2020-12-04 21:03:06 UTC
*** Bug 1904231 has been marked as a duplicate of this bug. ***

Comment 7 Kirsten Garrison 2020-12-04 21:13:23 UTC
*** Bug 1901577 has been marked as a duplicate of this bug. ***

Comment 8 Matthew Staebler 2020-12-04 21:47:46 UTC
(In reply to Kirsten Garrison from comment #4)
> Can we up the priority on this I've found 2 bugs related to this:

The installer team will be looking at this early next sprint.

Comment 9 Matthew Staebler 2020-12-09 20:17:59 UTC
This regression was introduced in https://github.com/openshift/installer/pull/4067, which removed the code that sets status.etcdDiscoveryDomain in infrastructure.config.openshift.io. The cluster-network-operator relies on that field to fill out the status.noProxy field in proxy.config.openshift.io [1].

[1] https://github.com/openshift/cluster-network-operator/blob/c23495cf6e6ffeffc0290c85ee4608102f7b47d1/pkg/util/proxyconfig/no_proxy.go#L113
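For illustration, the failure mode can be sketched as follows: per-member etcd hostnames are derived by joining a member name with the discovery domain, so an empty etcdDiscoveryDomain yields the truncated "etcd-0." entries observed in the NO_PROXY diff. This is a simplified sketch with our own names, not the operator's actual function:

```go
package main

import "fmt"

// etcdNoProxyEntries mimics, in simplified form, deriving per-member etcd
// hostnames from the infrastructure's etcdDiscoveryDomain.
func etcdNoProxyEntries(discoveryDomain string, replicas int) []string {
	entries := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		// With discoveryDomain == "", this yields "etcd-0.", "etcd-1.",
		// etc. -- entries that never match a real host.
		entries = append(entries, fmt.Sprintf("etcd-%d.%s", i, discoveryDomain))
	}
	return entries
}

func main() {
	fmt.Println(etcdNoProxyEntries("", 3))
	// Prints: [etcd-0. etcd-1. etcd-2.]
	fmt.Println(etcdNoProxyEntries("gpei-471.qe.devcluster.openshift.com", 3))
}
```

This matches the symptom in the bug: once the installer stopped populating status.etcdDiscoveryDomain, the joined hostnames lost their domain suffix, and the cluster-wide noProxy diverged from the bootstrap value.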

Comment 10 Matthew Staebler 2020-12-11 00:22:01 UTC
*** Bug 1906620 has been marked as a duplicate of this bug. ***

Comment 11 Dan Li 2021-01-07 13:35:19 UTC
*** Bug 1906321 has been marked as a duplicate of this bug. ***

Comment 12 lmcfadde 2021-01-12 14:28:32 UTC
Is there any update or proposed fix for this one yet?  Since it is blocking and I don't see any update since mid dec, checking on status here.

Comment 13 Matthew Staebler 2021-01-12 14:47:15 UTC
(In reply to lmcfadde from comment #12)
> Is there any update or proposed fix for this one yet?  Since it is blocking
> and I don't see any update since mid dec, checking on status here.

This bug will be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1909502.

*** This bug has been marked as a duplicate of bug 1909502 ***

Comment 14 Gaoyun Pei 2021-01-21 07:11:21 UTC
This issue is fixed in nightly payload 4.7.0-0.nightly-2021-01-21-012810; closing it now.
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1909502#c21 for detailed verification steps.

Comment 15 egarcia 2021-01-27 14:26:29 UTC
*** Bug 1916904 has been marked as a duplicate of this bug. ***

