1909502 – NO_PROXY is not matched between bootstrap and global cluster setting which lead to desired master machineconfig is not found


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Sam Batschelet
QA Contact:	Gaoyun Pei
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1919386 (view as bug list)
Depends On:
Blocks:

Reported:	2020-12-20 14:16 UTC by Ori Amizur
Modified:	2021-02-24 15:47 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:47:16 UTC
Target Upstream Version:
Embargoed:

Attachments
Must gather in one of the master nodes (8.89 MB, application/gzip) 2020-12-20 14:16 UTC, Ori Amizur	no flags	Details
Log bundle from bootstrap (4.41 MB, application/gzip) 2020-12-20 14:20 UTC, Ori Amizur	no flags	Details

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 930	None	closed	Bug 1909502: pkg/util/proxyconfig: remove unused etcd records from proxy config	2021-02-16 17:22:58 UTC
Github	openshift installer pull 4518	None	closed	Bug 1909502: pkg/asset/manifests: remove etcd records from proxy config	2021-02-16 17:22:57 UTC
Github	openshift machine-config-operator pull 2315	None	closed	Bug 1909502: pkg/operator: tolerate removal of etcd records from proxy config	2021-02-16 17:22:58 UTC
Github	openshift machine-config-operator pull 2346	None	closed	Revert "Bug 1909502: pkg/operator: tolerate removal of etcd records from proxy config"	2021-02-16 17:22:58 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:47:59 UTC

Description Ori Amizur 2020-12-20 14:16:06 UTC

Created attachment 1740665 [details]
Must gather in one of the master nodes

Description of problem:
This is a follow up issue to https://bugzilla.redhat.com/show_bug.cgi?id=1907786

When attempting to install 4.7 with a proxy configuration, the installation fails.
This happens for both IPv4 and IPv6 installations, regardless of the network type used.

After the bootstrap node reboots, it fails to pull its Ignition config. The master logs show that it requests a master machine config that does not exist among the MCs.

Version-Release number of selected component (if applicable):

4.7

How reproducible:

Install OpenShift 4.7 with the Assisted Installer and a proxy configuration.


Actual results:

Installation fails.

Expected results:

Successful installation.

Additional info:

Comment 1 Ori Amizur 2020-12-20 14:20:43 UTC

Created attachment 1740666 [details]
Log bundle from bootstrap

Comment 2 Ori Amizur 2020-12-20 14:53:44 UTC

Some additional info:

The possible differences between the master and the bootstrap are the file
/etc/NetworkManager/dispatcher.d/30-resolv-prepender
and the drop-ins for 10-mco-default-env.conf and a number of the MCO services.
The difference is:
==================================================================================
<                                 "contents": "[Unit]\n[Service]\nEnvironment=HTTP_PROXY=http://[1001:db8::1]:3128\nEnvironment=HTTPS_PROXY=http://[1001:db8::1]:3128\nEnvironment=NO_PROXY=.cluster.local,.svc,.test-infra-cluster-assisted-installer.redhat.com,1001:db8::/120,127.0.0.1,2002:db8::/53,2003:db8::/112,api-int.test-infra-cluster-assisted-installer.redhat.com,etcd-0.test-infra-cluster-assisted-installer.redhat.com,etcd-1.test-infra-cluster-assisted-installer.redhat.com,etcd-2.test-infra-cluster-assisted-installer.redhat.com,localhost\n",
---
>                                 "contents": "[Unit]\n[Service]\nEnvironment=HTTP_PROXY=http://[1001:db8::1]:3128\nEnvironment=HTTPS_PROXY=http://[1001:db8::1]:3128\nEnvironment=NO_PROXY=.cluster.local,.svc,.test-infra-cluster-assisted-installer.redhat.com,1001:db8::/120,127.0.0.1,2002:db8::/53,2003:db8::/112,api-int.test-infra-cluster-assisted-installer.redhat.com,etcd-0.,etcd-1.,etcd-2.,localhost\n",
==================================================================================

One has the domain suffix ".test-infra-cluster-assisted-installer.redhat.com" for the etcd-%d hosts, and the other does not.

The rendered-master in "oc get mc" has a different hash than the one the rebooted host attempts to pull. The error message can be seen in the machine config daemon log.
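
The comparison above can be reproduced with a small script. The following is a minimal Python sketch, not part of the original report: it extracts the NO_PROXY value from two systemd drop-in bodies and lists the entries that appear on only one side. The helper names (no_proxy_entries, diff_no_proxy) and the shortened drop-in bodies are illustrative assumptions; only the etcd entries and the domain are taken from the diff.

```python
def no_proxy_entries(unit_contents):
    """Return the set of NO_PROXY entries found in a systemd drop-in body."""
    prefix = "Environment=NO_PROXY="
    for line in unit_contents.splitlines():
        if line.startswith(prefix):
            return {entry for entry in line[len(prefix):].split(",") if entry}
    return set()


def diff_no_proxy(left, right):
    """Return (entries only in left, entries only in right), both sorted."""
    a, b = no_proxy_entries(left), no_proxy_entries(right)
    return sorted(a - b), sorted(b - a)


# Shortened drop-in bodies (hypothetical, trimmed for readability).
DOMAIN = "test-infra-cluster-assisted-installer.redhat.com"
with_domain = (
    "[Unit]\n[Service]\nEnvironment=NO_PROXY=.svc,"
    + ",".join("etcd-%d.%s" % (i, DOMAIN) for i in range(3))
    + ",localhost\n"
)
without_domain = (
    "[Unit]\n[Service]\nEnvironment=NO_PROXY=.svc,etcd-0.,etcd-1.,etcd-2.,localhost\n"
)

only_with, only_without = diff_no_proxy(with_domain, without_domain)
print(only_with)     # etcd entries carrying the domain suffix
print(only_without)  # etcd entries with the dangling trailing dot
```

Comparing sets rather than raw strings ignores ordering, so only entries that truly differ are reported.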

Comment 3 yevgeny shnaidman 2020-12-27 15:45:31 UTC

The reason for the failure is https://github.com/openshift/installer/commit/24e2573b119d10698a71fcf55b9ef439bedb109e.
This commit removes the EtcdDiscoveryDomain from the Infrastructure CR produced by openshift-installer.
The network operator uses the EtcdDiscoveryDomain value from the Infrastructure CR to update the NoProxy field in the Status of the Proxy CR, and therefore sets the etcd entries to etcd-0., etcd-1., etcd-2.
The MCO uses the NoProxy field in the Status of the Proxy CR to update the NoProxy field in the Status of its ControllerConfig.
In turn, the MCO controller running on the masters uses this field to render the systemd units, which therefore come out different.
The NoProxy field in the Status of the ControllerConfig produced by openshift-installer does contain the domain, so the hashes calculated by the bootstrap node and by the master nodes differ, and the bootstrap node cannot pull its Ignition config from the master once it reboots.
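
To illustrate the mechanism described in this comment, here is a minimal Python sketch. It is not the actual installer, network operator, or MCO code: the etcd_records and render_hash helpers, the "etcd-%d.%s" format, and the use of SHA-256 are assumptions made for illustration only. It shows how an empty discovery domain yields "etcd-0."-style entries, and how any difference in NO_PROXY changes a hash computed over the rendered content.

```python
import hashlib


def etcd_records(discovery_domain, count=3):
    # Assumed record format; an empty domain leaves a dangling trailing dot.
    return ["etcd-%d.%s" % (i, discovery_domain) for i in range(count)]


def render_hash(no_proxy_entries):
    # Stand-in for hashing a rendered config: hash the drop-in body.
    body = "[Unit]\n[Service]\nEnvironment=NO_PROXY=%s\n" % ",".join(
        sorted(no_proxy_entries)
    )
    return hashlib.sha256(body.encode()).hexdigest()


base = [".cluster.local", ".svc", "127.0.0.1", "localhost"]
domain = "test-infra-cluster-assisted-installer.redhat.com"

installer_side = base + etcd_records(domain)  # domain still known
operator_side = base + etcd_records("")       # EtcdDiscoveryDomain is empty

print(etcd_records(""))
print(render_hash(installer_side) == render_hash(operator_side))
```

Because the two sides hash different content, a host asking for the config by one hash cannot find it under the other.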

Comment 4 Sam Batschelet 2020-12-28 14:23:09 UTC

Since OCP 4.4 etcd no longer has a dependency on DNS. These records (etcd-0, etcd-1, etcd-2) should be removed from the proxy config.
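
A minimal Python sketch of what removing these records from a noProxy list could look like. This is not the code from the linked PRs; the strip_etcd_records helper, the regular expression, and the example.com domain are assumptions chosen to match the "etcd-N." entries shown earlier in this report.

```python
import re

# Matches "etcd-<index>." optionally followed by a domain (assumed pattern).
ETCD_RECORD = re.compile(r"^etcd-\d+\.")


def strip_etcd_records(no_proxy):
    """Drop etcd-N records from a comma-separated noProxy string."""
    entries = [entry for entry in no_proxy.split(",") if entry]
    return ",".join(e for e in entries if not ETCD_RECORD.match(e))


# Hypothetical input mixing both forms of etcd record.
before = (
    ".cluster.local,.svc,127.0.0.1,api-int.example.com,"
    "etcd-0.example.com,etcd-1.,etcd-2.example.com,localhost"
)
print(strip_etcd_records(before))
```

The pattern requires a numeric index after "etcd-", so unrelated hosts whose names merely start with "etcd" are left in place.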

Comment 5 Matthew Staebler 2021-01-04 19:40:16 UTC

This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1901034.

Comment 6 Michal Fojtik 2021-01-07 08:16:57 UTC

Matthew, agreed. Moving this one over to the installer team; since there are PRs linked to this BZ, I don't want to close it as a duplicate.

Comment 7 Matthew Staebler 2021-01-07 16:58:27 UTC

(In reply to Michal Fojtik from comment #6)
> Matthew, agreed. Moving this one over to the installer team; since there are
> PRs linked to this BZ, I don't want to close it as a duplicate.

I was actually thinking of closing https://bugzilla.redhat.com/show_bug.cgi?id=1901034 as a duplicate of this bug instead. The PR linked in that bug is obsoleted by the PRs linked for this bug.

Comment 9 Matthew Staebler 2021-01-12 14:46:35 UTC

*** Bug 1901034 has been marked as a duplicate of this bug. ***

Comment 10 Sam Batschelet 2021-01-12 15:38:22 UTC

MCO#2315 is ready for review. As outlined in the PR [1], I think a possible path here is to tolerate the old etcd records in the MCO and add release documentation with details on their removal. Meanwhile, we will clean up the installer and network-operator so that new clusters do not have these records.

[1]https://github.com/openshift/machine-config-operator/pull/2315

Comment 11 Johnny Liu 2021-01-12 16:00:34 UTC

Hi Sam, 

How can I help here?

Comment 12 Sam Batschelet 2021-01-12 17:31:56 UTC

> How can I help here?

Sorry for the noise. I meant to ping Matthew for the review of https://github.com/openshift/machine-config-operator/pull/2315. If you have input, it is welcome.

Comment 13 Matthew Staebler 2021-01-12 20:37:18 UTC

(In reply to Sam Batschelet from comment #12)
> > How can I help here?
> 
> Sorry for the noise. I meant to ping Matthew for the review of
> https://github.com/openshift/machine-config-operator/pull/2315. If you have
> input, it is welcome.

@sbatsche The MCO changes look fine to me, but I am not an owner in that repo.

Comment 21 Gaoyun Pei 2021-01-21 07:08:25 UTC

Verified this bug on payload 4.7.0-0.nightly-2021-01-21-012810.

1. For a fresh install with global proxy enabled on 4.7.0-0.nightly-2021-01-21-012810, after the installation completed successfully, check the noProxy list set in proxy/cluster:

# oc get proxy cluster -o yaml
status:
...
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-0121.qe.devcluster.openshift.com,localhost,test.no-proxy.com

The same noProxy value is set on the bootstrap node; the etcd records were removed from the noProxy list.
[root@ip-10-0-10-163 core]# env |grep -i proxy
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-0121.qe.devcluster.openshift.com,localhost,test.no-proxy.com

MCO is running well.
# oc get co |grep machine-config
machine-config 4.7.0-0.nightly-2021-01-21-012810 True False False 163m

2. For a 4.6.13 cluster with global proxy enabled, upgrade the cluster to 4.7.0-0.nightly-2021-01-21-012810.

Before upgrade, check the noProxy list in proxy/cluster:
# oc get proxy cluster -o yaml
...
status:
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-4613.qe.devcluster.openshift.com,etcd-0.gpei-4613.qe.devcluster.openshift.com,etcd-1.gpei-4613.qe.devcluster.openshift.com,etcd-2.gpei-4613.qe.devcluster.openshift.com,localhost,test.no-proxy.com

After the cluster upgraded to 4.7.0-0.nightly-2021-01-21-012810 successfully, check the noProxy list again; the etcd records were removed.
# oc get proxy cluster -o yaml
...
status:
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-4613.qe.devcluster.openshift.com,localhost,test.no-proxy.com

The MCO is not degraded.

Comment 22 Matthew Staebler 2021-01-22 19:18:40 UTC

*** Bug 1919386 has been marked as a duplicate of this bug. ***

Comment 23 Satwinder Singh 2021-01-25 13:32:09 UTC

Verified the bug on 4.7.0-0.nightly-ppc64le-2021-01-24-004926

For a fresh install with global proxy enabled on 4.7.0-0.nightly-ppc64le-2021-01-24-004926, after the installation completed successfully, check the noProxy list set in proxy/cluster:

Co status:
---
machine-config                             4.7.0-0.nightly-ppc64le-2021-01-24-004926   True        False         False      143m

oc get proxy cluster -o yaml
---
status:
  noProxy: .cluster.local,.satwsin1-proxy.redhat.com,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,9.114.96.0/22,api-int.satwsin1-proxy.redhat.com,localhost

Comment 26 errata-xmlrpc 2021-02-24 15:47:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
