Created attachment 1740665 [details]
Must gather in one of the master nodes

Description of problem:
This is a follow-up issue to https://bugzilla.redhat.com/show_bug.cgi?id=1907786

When attempting to install on 4.7 with proxy configuration, the installation fails. This happens for both IPv4 and IPv6 installations, regardless of the network type used. After the bootstrap reboots, it fails to pull the ignition. In the master logs, it can be seen that it attempts to pull ignition for a master configuration that does not exist in the MCs.

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Install OpenShift with Assisted Installer on 4.7 with proxy configuration.

Actual results:
Installation fails.

Expected results:
Successful installation.

Additional info:
Created attachment 1740666 [details] Log bundle from bootstrap
Some additional info:

The possible differences between the master and the bootstrap are the file /etc/NetworkManager/dispatcher.d/30-resolv-prepender and also the drop-ins for 10-mco-default-env.conf and a bunch of the MCO services.

The difference is:
==================================================================================
< "contents": "[Unit]\n[Service]\nEnvironment=HTTP_PROXY=http://[1001:db8::1]:3128\nEnvironment=HTTPS_PROXY=http://[1001:db8::1]:3128\nEnvironment=NO_PROXY=.cluster.local,.svc,.test-infra-cluster-assisted-installer.redhat.com,1001:db8::/120,127.0.0.1,2002:db8::/53,2003:db8::/112,api-int.test-infra-cluster-assisted-installer.redhat.com,etcd-0.test-infra-cluster-assisted-installer.redhat.com,etcd-1.test-infra-cluster-assisted-installer.redhat.com,etcd-2.test-infra-cluster-assisted-installer.redhat.com,localhost\n",
---
> "contents": "[Unit]\n[Service]\nEnvironment=HTTP_PROXY=http://[1001:db8::1]:3128\nEnvironment=HTTPS_PROXY=http://[1001:db8::1]:3128\nEnvironment=NO_PROXY=.cluster.local,.svc,.test-infra-cluster-assisted-installer.redhat.com,1001:db8::/120,127.0.0.1,2002:db8::/53,2003:db8::/112,api-int.test-infra-cluster-assisted-installer.redhat.com,etcd-0.,etcd-1.,etcd-2.,localhost\n",
==================================================================================

One has the domain suffix ".test-infra-cluster-assisted-installer.redhat.com" for the etcd-%d hosts, and the other one does not.

The rendered-master in "oc get mc" has a different hash than the one that the rebooted host attempts to pull. The error message can be seen in the machine config daemon log.
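For anyone comparing the two drop-ins by hand, a small sketch like the following can list the NO_PROXY entries that differ. The helper names are hypothetical and the two values are shortened stand-ins modelled on the drop-ins quoted above, not the full strings.

```python
from __future__ import annotations


def split_no_proxy(value: str) -> set[str]:
    """Split a comma-separated NO_PROXY value into a set of entries."""
    return {entry.strip() for entry in value.split(",") if entry.strip()}


def diff_no_proxy(left: str, right: str) -> tuple[list[str], list[str]]:
    """Return (entries only in left, entries only in right), each sorted."""
    a, b = split_no_proxy(left), split_no_proxy(right)
    return sorted(a - b), sorted(b - a)


# Shortened example values (assumption: only the differing entries matter here).
DOMAIN = "test-infra-cluster-assisted-installer.redhat.com"
with_domain = (
    f".cluster.local,.svc,etcd-0.{DOMAIN},etcd-1.{DOMAIN},etcd-2.{DOMAIN},localhost"
)
without_domain = ".cluster.local,.svc,etcd-0.,etcd-1.,etcd-2.,localhost"

only_first, only_second = diff_no_proxy(with_domain, without_domain)
print("only in first: ", only_first)
print("only in second:", only_second)
```

Only the etcd entries show up in either list; everything else is common to both values.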
The reason for the failure is https://github.com/openshift/installer/commit/24e2573b119d10698a71fcf55b9ef439bedb109e. This commit removes the EtcdDiscoveryDomain from the Infrastructure CR produced by openshift-installer.

The network operator uses the EtcdDiscoveryDomain value from Infrastructure to update the NoProxy field in the Status of the Proxy CR, and therefore updates it to etcd-0., etcd-1., etcd-2.

The NoProxy field of the Status of Proxy is used by the MCO to update the NoProxy of the Status of the ControllerConfig of the MCO. In turn, this field is used by the MCO controller running on the masters to render a different systemd unit.

The NoProxy field of the Status of the ControllerConfig produced by openshift-installer does contain the domain, and therefore the hashes calculated by the bootstrap node and by the master nodes are different, and the bootstrap node cannot pull ignition from a master once it reboots.
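To illustrate why a small NoProxy difference is enough to break the ignition pull, here is a minimal sketch of a content-derived config name. It is an assumption-laden model, not the MCO's actual rendering or hashing code: the function name, the JSON layout, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import json


def rendered_name(role: str, no_proxy: str) -> str:
    """Derive a config name from a hash of the rendered contents (illustrative only)."""
    # Hypothetical stand-in for the systemd drop-in that carries the proxy environment.
    unit = "[Unit]\n[Service]\nEnvironment=NO_PROXY=" + no_proxy + "\n"
    config = {
        "role": role,
        "dropins": [{"name": "10-mco-default-env.conf", "contents": unit}],
    }
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"rendered-{role}-{digest[:32]}"


# One side sees the etcd entries with the domain, the other without it.
bootstrap_view = rendered_name("master", ".svc,etcd-0.example.com,localhost")
master_view = rendered_name("master", ".svc,etcd-0.,localhost")

print(bootstrap_view)
print(master_view)
print("names match:", bootstrap_view == master_view)
```

Because the name depends on the contents, the two sides end up asking for and serving differently named configs, which matches the symptom described above.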
Since OCP 4.4 etcd no longer has a dependency on DNS. These records (etcd-0, etcd-1, etcd-2) should be removed from the proxy config.
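As a rough sketch of what removing these records from a noProxy value amounts to, the snippet below drops any entry that starts with etcd-<number> followed by a dot. The function name and the pattern are assumptions for illustration; this is not the change made in the installer, network operator, or MCO.

```python
import re

# Matches entries such as "etcd-0.example.com" and the truncated "etcd-0." form.
ETCD_RECORD = re.compile(r"^etcd-\d+\.")


def strip_etcd_records(no_proxy: str) -> str:
    """Return the noProxy value with etcd-N.* entries removed, order preserved."""
    entries = [entry for entry in no_proxy.split(",") if entry]
    return ",".join(entry for entry in entries if not ETCD_RECORD.match(entry))


example = ".cluster.local,.svc,127.0.0.1,etcd-0.example.com,etcd-1.,etcd-2.example.com,localhost"
print(strip_etcd_records(example))
```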
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1901034.
Matthew agree, moving this one over to installer team as there are PR's linked to this BZ I don't want to close it as duplicate.
(In reply to Michal Fojtik from comment #6)
> Matthew agree, moving this one over to installer team as there are PR's
> linked to this BZ I don't want to close it as duplicate.

I was actually thinking of closing https://bugzilla.redhat.com/show_bug.cgi?id=1901034 as a duplicate of this bug instead. The PR linked in that bug is obsoleted by the PRs linked for this bug.
*** Bug 1901034 has been marked as a duplicate of this bug. ***
MCO#2315 is ready for review. As outlined in the PR[1] I feel a possible path here is to tolerate the old etcd records in MCO and add docs to release with details on removal? Meanwhile, we will clean up installer and network-operator so new clusters do not have these records. [1]https://github.com/openshift/machine-config-operator/pull/2315
Hi Sam, What I can help here?
> What I can help here? sorry for the noise I meant to ping Matthew on the review of https://github.com/openshift/machine-config-operator/pull/2315 if you have input it is welcome.
(In reply to Sam Batschelet from comment #12)
> > What I can help here?
>
> sorry for the noise I meant to ping Matthew on the review of
> https://github.com/openshift/machine-config-operator/pull/2315 if you have
> input it is welcome.

@sbatsche The MCO changes look fine to me, but I am not an owner in that repo.
Verified this bug on payload 4.7.0-0.nightly-2021-01-21-012810.

1. For a fresh install with global proxy enabled on 4.7.0-0.nightly-2021-01-21-012810, after installation completed successfully, check the noProxy list set in proxy/cluster:

# oc get proxy cluster -o yaml
status:
...
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-0121.qe.devcluster.openshift.com,localhost,test.no-proxy.com

The same noProxy is set on the bootstrap node; the etcd records were removed from the noProxy list.

[root@ip-10-0-10-163 core]# env |grep -i proxy
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-0121.qe.devcluster.openshift.com,localhost,test.no-proxy.com

MCO is running well.

# oc get co |grep machine-config
machine-config 4.7.0-0.nightly-2021-01-21-012810 True False False 163m

2. For a 4.6.13 cluster with global proxy enabled, upgrade the cluster to 4.7.0-0.nightly-2021-01-21-012810.

Before the upgrade, check the noProxy list in proxy/cluster:

# oc get proxy cluster -o yaml
...
status:
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-4613.qe.devcluster.openshift.com,etcd-0.gpei-4613.qe.devcluster.openshift.com,etcd-1.gpei-4613.qe.devcluster.openshift.com,etcd-2.gpei-4613.qe.devcluster.openshift.com,localhost,test.no-proxy.com

After the cluster upgraded to 4.7.0-0.nightly-2021-01-21-012810 successfully, check the noProxy list again; the etcd records were removed.

# oc get proxy cluster -o yaml
...
status:
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-4613.qe.devcluster.openshift.com,localhost,test.no-proxy.com

No MCO degraded issue.
*** Bug 1919386 has been marked as a duplicate of this bug. ***
Verified the bug on 4.7.0-0.nightly-ppc64le-2021-01-24-004926

For fresh install with global proxy enabled on 4.7.0-0.nightly-ppc64le-2021-01-24-004926, after installation completed successfully, check the noProxy list set in proxy/cluster:

Co status:
---
machine-config 4.7.0-0.nightly-ppc64le-2021-01-24-004926 True False False 143m

oc get proxy cluster -o yaml
---
status:
  noProxy: .cluster.local,.satwsin1-proxy.redhat.com,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,9.114.96.0/22,api-int.satwsin1-proxy.redhat.com,localhost
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633