Bug 1901034

Summary:	NO_PROXY does not match between bootstrap and global cluster setting, so the desired master machineconfig is not found
Product:	OpenShift Container Platform	Reporter:	Johnny Liu <jialiu>
Component:	Installer	Assignee:	Matthew Staebler <mstaeble>
Installer sub component:	openshift-installer	QA Contact:	Johnny Liu <jialiu>
Status:	CLOSED DEFERRED	Docs Contact:
Severity:	high
Priority:	high	CC:	adam.kaplan, akashem, aprabhu, bleanhar, ecordell, esimard, gpei, jluhrsen, kgarriso, lmcfadde, lsm5, mstaeble, mtarsel, rheinzma, sgreene, tsze, wduan, wking, yanyang
Version:	4.7	Keywords:	Regression, Reopened, TestBlocker
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:	[sig-sippy] install should work
Last Closed:	2021-01-21 07:11:21 UTC	Type:	Bug

Description Johnny Liu 2020-11-24 10:47:55 UTC

When installing a cluster behind a proxy, the installation fails.
level=info msg=Cluster operator machine-config Progressing is True with : Working towards 4.7.0-0.nightly-2020-11-23-195308
level=error msg=Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.7.0-0.nightly-2020-11-23-195308: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-52-10.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\", Node ip-10-0-77-145.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\", Node ip-10-0-48-13.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\"", retrying
level=info msg=Cluster operator machine-config Available is False with : Cluster not available for 4.7.0-0.nightly-2020-11-23-195308
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is still updating

After digging into the failure, we found that the root cause is a mismatch in the NO_PROXY setting between the bootstrap and the global cluster setting:
<       Environment=NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-471.qe.devcluster.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,test.no-proxy.com
---
>       Environment=NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-471.qe.devcluster.openshift.com,etcd-0.gpei-471.qe.devcluster.openshift.com,etcd-1.gpei-471.qe.devcluster.openshift.com,etcd-2.gpei-471.qe.devcluster.openshift.com,localhost,test.no-proxy.com

$ oc get proxies.config.openshift.io cluster  -o yaml
<--snip-->
spec:
  httpProxy: http://user:password@10.0.99.4:3128
  httpsProxy: http://user:password@10.0.99.4:3128
  noProxy: test.no-proxy.com
  trustedCA:
    name: ""
status:
  httpProxy: http://user:password@10.0.99.4:3128
  httpsProxy: http://user:password@10.0.99.4:3128
  noProxy: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.miyadav24azur.qe.azure.devcluster.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,test.no-proxy.com

It seems the cluster setting is missing the domain for the etcd FQDNs, while the bootstrap has it.
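
The difference between the two NO_PROXY values can be isolated with a simple set comparison. The following is a minimal sketch in Python, not part of the installer or any OpenShift tooling; the function name diff_no_proxy is hypothetical, and the two strings are shortened copies of the values shown above.

```python
from typing import Set, Tuple


def diff_no_proxy(cluster: str, bootstrap: str) -> Tuple[Set[str], Set[str]]:
    """Return (only_in_cluster, only_in_bootstrap) for two comma-separated NO_PROXY values."""
    cluster_entries = {e.strip() for e in cluster.split(",") if e.strip()}
    bootstrap_entries = {e.strip() for e in bootstrap.split(",") if e.strip()}
    return cluster_entries - bootstrap_entries, bootstrap_entries - cluster_entries


# Shortened copies of the two values from the diff above (assumption: the
# omitted entries are identical on both sides and do not affect the result).
cluster_value = (
    ".cluster.local,.svc,api-int.gpei-471.qe.devcluster.openshift.com,"
    "etcd-0.,etcd-1.,etcd-2.,localhost,test.no-proxy.com"
)
bootstrap_value = (
    ".cluster.local,.svc,api-int.gpei-471.qe.devcluster.openshift.com,"
    "etcd-0.gpei-471.qe.devcluster.openshift.com,"
    "etcd-1.gpei-471.qe.devcluster.openshift.com,"
    "etcd-2.gpei-471.qe.devcluster.openshift.com,"
    "localhost,test.no-proxy.com"
)

only_cluster, only_bootstrap = diff_no_proxy(cluster_value, bootstrap_value)
print(sorted(only_cluster))
print(sorted(only_bootstrap))
```

Only the etcd entries differ: the cluster side carries the bare etcd-N. prefixes, while the bootstrap side carries the full host names.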


Version:
4.7.0-0.nightly-2020-11-23-195308


Platform:
gcp/azure/vsphere/aws


both IPI and UPI

What happened?
After


What did you expect to happen?
When running the installation behind a proxy, the installation fails.


How to reproduce it (as minimally and precisely as possible)?
1. Inject proxy settings into install-config.yaml, for example:
---
apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
metadata:
  name: gpei-472
platform:
  aws:
    region: us-east-2
    subnets:
    - subnet-0c56d13268c3b8d24
    - subnet-02c73c48c59eca6b0
    - subnet-0908bd983fe473287
    - subnet-080c64ca63b15fb2d
pullSecret: HIDDEN
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
publish: External
proxy:
  httpProxy: http://user:password@proxy.example.com:3128
  httpsProxy: http://user:password@proxy.example.com:3128
  noProxy: test.no-proxy.com
baseDomain: qe.devcluster.openshift.com
2. Run the installation


Anything else we need to know?
1. We did not hit this issue on 4.7.0-0.nightly-2020-11-18-085225
2. We first saw this issue on 4.7.0-0.nightly-2020-11-20-234717

Comment 1 Johnny Liu 2020-11-24 11:20:45 UTC

$ oc get infrastructures.config.openshift.io cluster -o yaml
<--snip-->
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: Azure
status:
  apiServerInternalURI: https://api-int.miyadav24azur.qe.azure.devcluster.openshift.com:6443
  apiServerURL: https://api.miyadav24azur.qe.azure.devcluster.openshift.com:6443
  etcdDiscoveryDomain: ""
  infrastructureName: miyadav24azur-t8cqb
  platform: Azure
  platformStatus:
    azure:
      cloudName: AzurePublicCloud
      networkResourceGroupName: miyadav24azur-rg
      resourceGroupName: miyadav24azur-t8cqb-rg
    type: Azure

The etcdDiscoveryDomain is empty; I think that leads to the etcd FQDNs missing their domain.

Comment 2 jamo luhrsen 2020-11-25 04:08:40 UTC

This is causing some jobs to fail very frequently now and bringing down some of the release indicator percentages we track.
Here is an example job:
  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy

I'm assuming it's the same problem because I see this in a failed job [0]:

  cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/00-master.yaml:            Environment=NO_PROXY=.cluster.local,.ec2.internal,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.ci-op-q4qfig05-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-0.,etcd-1.,etcd-2.,localhost



which seems to be missing some values like we see in the last job [1] that did not fail (it ran a week ago):

  cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/00-master.yaml:            Environment=NO_PROXY=.cluster.local,.svc,.us-west-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-0.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-1.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-2.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,localhost



[0] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331391826971594752
[1] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1328922989076418560
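
One rough way to spot this symptom in a must-gather line like the ones quoted above is to flag NO_PROXY entries that end in a dot. This is only a sketch in Python under that assumption; the helper name truncated_entries is hypothetical, and the sample lines are abbreviated from the two job outputs.

```python
from typing import List


def truncated_entries(line: str) -> List[str]:
    """Return NO_PROXY entries in `line` that end with a dot (host name cut off before the domain)."""
    marker = "NO_PROXY="
    idx = line.find(marker)
    if idx == -1:
        return []  # no NO_PROXY setting on this line
    value = line[idx + len(marker):].strip()
    return [entry for entry in value.split(",") if entry.endswith(".")]


# Abbreviated from the failed job [0].
failed_line = (
    "Environment=NO_PROXY=.cluster.local,.ec2.internal,.svc,10.0.0.0/16,"
    "api-int.ci-op-q4qfig05-2659c.origin-ci-int-aws.dev.rhcloud.com,"
    "etcd-0.,etcd-1.,etcd-2.,localhost"
)
# Abbreviated from the passing job [1].
passing_line = (
    "Environment=NO_PROXY=.cluster.local,.svc,.us-west-2.compute.internal,"
    "etcd-0.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,localhost"
)

print(truncated_entries(failed_line))
print(truncated_entries(passing_line))
```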

Comment 4 Kirsten Garrison 2020-12-04 20:59:25 UTC

Can we up the priority on this? I've found 2 bugs related to it:

https://bugzilla.redhat.com/show_bug.cgi?id=1899979
https://bugzilla.redhat.com/show_bug.cgi?id=1904231

Also the entire aws-proxy job is red: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy?buildId=

I'm going to dupe the bzs into this so we can have this bug as the tracking bug.

Comment 5 Kirsten Garrison 2020-12-04 21:01:53 UTC

*** Bug 1899979 has been marked as a duplicate of this bug. ***

Comment 6 Kirsten Garrison 2020-12-04 21:03:06 UTC

*** Bug 1904231 has been marked as a duplicate of this bug. ***

Comment 7 Kirsten Garrison 2020-12-04 21:13:23 UTC

*** Bug 1901577 has been marked as a duplicate of this bug. ***

Comment 8 Matthew Staebler 2020-12-04 21:47:46 UTC

(In reply to Kirsten Garrison from comment #4)
> Can we up the priority on this? I've found 2 bugs related to it:

The installer team will be looking at this early next sprint.

Comment 9 Matthew Staebler 2020-12-09 20:17:59 UTC

This regression was introduced in https://github.com/openshift/installer/pull/4067 with the removal of the code that sets status.etcdDiscoveryDomain in infrastructure.config.openshift.io. The cluster-network-operator relies on that field to fill out the status.noProxy field in proxy.config.openshift.io [1].

[1] https://github.com/openshift/cluster-network-operator/blob/c23495cf6e6ffeffc0290c85ee4608102f7b47d1/pkg/util/proxyconfig/no_proxy.go#L113
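
To illustrate why an empty status.etcdDiscoveryDomain yields entries such as etcd-0., here is a minimal sketch in Python. It is not the cluster-network-operator code (which is written in Go); it assumes only that each etcd host name is built by joining an etcd-<index> prefix and the discovery domain with a dot. The function name etcd_no_proxy_hosts and the default replica count are assumptions for illustration.

```python
from typing import List


def etcd_no_proxy_hosts(discovery_domain: str, replicas: int = 3) -> List[str]:
    """Build one etcd host entry per control-plane replica (assumed naming scheme)."""
    # With an empty domain the format string still emits the trailing dot,
    # which gives the malformed "etcd-N." entries seen in status.noProxy.
    return [f"etcd-{i}.{discovery_domain}" for i in range(replicas)]


with_domain = etcd_no_proxy_hosts("gpei-471.qe.devcluster.openshift.com")
without_domain = etcd_no_proxy_hosts("")

print(with_domain)
print(without_domain)
```

Under this assumption, the empty field alone is enough to explain the mismatch with the bootstrap value, which still carried the full host names.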

Comment 10 Matthew Staebler 2020-12-11 00:22:01 UTC

*** Bug 1906620 has been marked as a duplicate of this bug. ***

Comment 11 Dan Li 2021-01-07 13:35:19 UTC

*** Bug 1906321 has been marked as a duplicate of this bug. ***

Comment 12 lmcfadde 2021-01-12 14:28:32 UTC

Is there any update or proposed fix for this one yet?  Since it is blocking and I don't see any update since mid dec, checking on status here.

Comment 13 Matthew Staebler 2021-01-12 14:47:15 UTC

(In reply to lmcfadde from comment #12)
> Is there any update or proposed fix for this one yet?  Since it is blocking
> and I don't see any update since mid dec, checking on status here.

This bug will be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1909502.

*** This bug has been marked as a duplicate of bug 1909502 ***

Comment 14 Gaoyun Pei 2021-01-21 07:11:21 UTC

This issue is fixed in nightly payload 4.7.0-0.nightly-2021-01-21-012810, so closing it now.
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1909502#c21 for detailed verification steps.

Comment 15 egarcia 2021-01-27 14:26:29 UTC

*** Bug 1916904 has been marked as a duplicate of this bug. ***