Bug 1909502
Field | Value
---|---
Summary | NO_PROXY is not matched between bootstrap and global cluster setting which lead to desired master machineconfig is not found
Product | OpenShift Container Platform
Component | Installer
Installer sub component | openshift-installer
Status | CLOSED ERRATA
Severity | urgent
Priority | urgent
Version | 4.7
Target Release | 4.7.0
Hardware | Unspecified
OS | Unspecified
Type | Bug
Keywords | TestBlocker
Reporter | Ori Amizur <oamizur>
Assignee | Sam Batschelet <sbatsche>
QA Contact | Gaoyun Pei <gpei>
CC | aos-bugs, aygarg, bleanhar, jialiu, kgarriso, mfojtik, mstaeble, mtarsel, ohochman, satwsing, sbatsche, tsze, wking, xtian
Last Closed | 2021-02-24 15:47:16 UTC

Description
Ori Amizur
2020-12-20 14:16:06 UTC
Created attachment 1740666: Log bundle from bootstrap
Some additional info:

The possible differences between the master and the bootstrap are the file /etc/NetworkManager/dispatcher.d/30-resolv-prepender, the drop-ins for 10-mco-default-env.conf, and a number of the MCO services. The difference is:

==================================================================================
< "contents": "[Unit]\n[Service]\nEnvironment=HTTP_PROXY=http://[1001:db8::1]:3128\nEnvironment=HTTPS_PROXY=http://[1001:db8::1]:3128\nEnvironment=NO_PROXY=.cluster.local,.svc,.test-infra-cluster-assisted-installer.redhat.com,1001:db8::/120,127.0.0.1,2002:db8::/53,2003:db8::/112,api-int.test-infra-cluster-assisted-installer.redhat.com,etcd-0.test-infra-cluster-assisted-installer.redhat.com,etcd-1.test-infra-cluster-assisted-installer.redhat.com,etcd-2.test-infra-cluster-assisted-installer.redhat.com,localhost\n",
---
> "contents": "[Unit]\n[Service]\nEnvironment=HTTP_PROXY=http://[1001:db8::1]:3128\nEnvironment=HTTPS_PROXY=http://[1001:db8::1]:3128\nEnvironment=NO_PROXY=.cluster.local,.svc,.test-infra-cluster-assisted-installer.redhat.com,1001:db8::/120,127.0.0.1,2002:db8::/53,2003:db8::/112,api-int.test-infra-cluster-assisted-installer.redhat.com,etcd-0.,etcd-1.,etcd-2.,localhost\n",
==================================================================================

One has the domain suffix ".test-infra-cluster-assisted-installer.redhat.com" for the etcd-%d hosts, and the other one does not.

The rendered-master in "oc get mc" has a different hash than the one that the rebooted host attempts to pull. The error message can be seen in the machine config daemon log.

The reason for the failure is https://github.com/openshift/installer/commit/24e2573b119d10698a71fcf55b9ef439bedb109e. This commit removes the EtcdDiscoveryDomain from the Infrastructure CR produced by openshift-installer.

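The two NO_PROXY values in the diff above are long enough that the mismatch is hard to spot by eye. As a rough illustration only (this helper is not part of the report, the installer, or the MCO, and the names `split_no_proxy`, `diff_no_proxy`, `WITH_DOMAIN`, and `WITHOUT_DOMAIN` are hypothetical), the differing entries can be isolated with a set comparison over the values quoted in the diff:

```python
from __future__ import annotations

# Values copied from the diff in this report. Everything else here is illustrative.
DOMAIN = "test-infra-cluster-assisted-installer.redhat.com"
COMMON = [
    ".cluster.local", ".svc", "." + DOMAIN, "1001:db8::/120", "127.0.0.1",
    "2002:db8::/53", "2003:db8::/112", "api-int." + DOMAIN, "localhost",
]

# "<" side of the diff: etcd entries carry the cluster domain.
WITH_DOMAIN = ",".join(COMMON + [f"etcd-{i}.{DOMAIN}" for i in range(3)])
# ">" side of the diff: etcd entries end in a bare dot.
WITHOUT_DOMAIN = ",".join(COMMON + [f"etcd-{i}." for i in range(3)])


def split_no_proxy(value: str) -> set[str]:
    """Split a comma-separated NO_PROXY value into a set of non-empty entries."""
    return {entry.strip() for entry in value.split(",") if entry.strip()}


def diff_no_proxy(left: str, right: str) -> tuple[list[str], list[str]]:
    """Return (entries only in left, entries only in right), each sorted."""
    a, b = split_no_proxy(left), split_no_proxy(right)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    only_left, only_right = diff_no_proxy(WITH_DOMAIN, WITHOUT_DOMAIN)
    print("only in '<' side:", only_left)
    print("only in '>' side:", only_right)
```

Comparing sets rather than raw strings ignores ordering, so only the three etcd entries on each side are reported as different.
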
The network operator uses the EtcdDiscoveryDomain value from the Infrastructure CR to update the NoProxy field in the Status of the Proxy CR. Because that value is now empty, it updates the field to etcd-0., etcd-1., etcd-2.

The MCO uses the NoProxy field in the Status of the Proxy CR to update the NoProxy field in the Status of its ControllerConfig. In turn, the MCO controller running on the masters uses this field to render the systemd units, which therefore come out different.

The NoProxy field in the Status of the ControllerConfig produced by openshift-installer does contain the domain. The hashes calculated by the bootstrap node and by the master nodes are therefore different, and the bootstrap node cannot pull ignition from the master once it reboots.

Since OCP 4.4, etcd no longer has a dependency on DNS. These records (etcd-0, etcd-1, etcd-2) should be removed from the proxy config.

This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1901034.

Matthew, agreed. Moving this one over to the installer team; as there are PRs linked to this BZ, I don't want to close it as a duplicate.

(In reply to Michal Fojtik from comment #6)
> Matthew, agreed. Moving this one over to the installer team; as there are PRs
> linked to this BZ, I don't want to close it as a duplicate.

I was actually thinking of closing https://bugzilla.redhat.com/show_bug.cgi?id=1901034 as a duplicate of this bug instead. The PR linked in that bug is obsoleted by the PRs linked for this bug.

*** Bug 1901034 has been marked as a duplicate of this bug. ***

MCO#2315 is ready for review. As outlined in the PR [1], I feel a possible path here is to tolerate the old etcd records in MCO and add docs to the release with details on removal? Meanwhile, we will clean up installer and network-operator so new clusters do not have these records.

[1] https://github.com/openshift/machine-config-operator/pull/2315

Hi Sam, what can I help with here?

> What can I help with here?

Sorry for the noise, I meant to ping Matthew on the review of https://github.com/openshift/machine-config-operator/pull/2315. If you have input, it is welcome.

(In reply to Sam Batschelet from comment #12)
> > What can I help with here?
>
> Sorry for the noise, I meant to ping Matthew on the review of
> https://github.com/openshift/machine-config-operator/pull/2315. If you have
> input, it is welcome.

@sbatsche The MCO changes look fine to me, but I am not an owner in that repo.

Verified this bug on payload 4.7.0-0.nightly-2021-01-21-012810.

1. For a fresh install with global proxy enabled on 4.7.0-0.nightly-2021-01-21-012810, after the installation completed successfully, check the noProxy list set in proxy/cluster:

# oc get proxy cluster -o yaml
status:
...
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-0121.qe.devcluster.openshift.com,localhost,test.no-proxy.com

The same noProxy is set on the bootstrap node; the etcd records were removed from the noProxy list.

[root@ip-10-0-10-163 core]# env |grep -i proxy
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-0121.qe.devcluster.openshift.com,localhost,test.no-proxy.com

MCO is running well.

# oc get co |grep machine-config
machine-config 4.7.0-0.nightly-2021-01-21-012810 True False False 163m

2. For a 4.6.13 cluster with global proxy enabled, upgrade the cluster to 4.7.0-0.nightly-2021-01-21-012810. Before the upgrade, check the noProxy list in proxy/cluster:

# oc get proxy cluster -o yaml
...
status:
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-4613.qe.devcluster.openshift.com,etcd-0.gpei-4613.qe.devcluster.openshift.com,etcd-1.gpei-4613.qe.devcluster.openshift.com,etcd-2.gpei-4613.qe.devcluster.openshift.com,localhost,test.no-proxy.com

After the cluster upgraded to 4.7.0-0.nightly-2021-01-21-012810 successfully, check the noProxy list again; the etcd records were removed.

# oc get proxy cluster -o yaml
...
status:
  noProxy: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-4613.qe.devcluster.openshift.com,localhost,test.no-proxy.com

No MCO degraded issue.

*** Bug 1919386 has been marked as a duplicate of this bug. ***

Verified the bug on 4.7.0-0.nightly-ppc64le-2021-01-24-004926.

For a fresh install with global proxy enabled on 4.7.0-0.nightly-ppc64le-2021-01-24-004926, after the installation completed successfully, check the noProxy list set in proxy/cluster:

Co status:
---
machine-config 4.7.0-0.nightly-ppc64le-2021-01-24-004926 True False False 143m

oc get proxy cluster -o yaml
---
status:
  noProxy: .cluster.local,.satwsin1-proxy.redhat.com,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,9.114.96.0/22,api-int.satwsin1-proxy.redhat.com,localhost

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
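The verification comments above confirm by inspection that no etcd-N entries remain in the noProxy list. A minimal sketch of the same check as a script follows. It is an assumption-laden illustration rather than anything used by QE: the function name `etcd_records` and the regular expression are hypothetical, and the sample values are shortened to the relevant entries from the 4.6.13 upgrade test.

```python
from __future__ import annotations

import re

# Matches entries such as "etcd-0.example.com" and the bare "etcd-0." form.
ETCD_RECORD = re.compile(r"^etcd-\d+\.")


def etcd_records(no_proxy: str) -> list[str]:
    """Return the etcd-N entries found in a comma-separated noProxy value."""
    entries = (entry.strip() for entry in no_proxy.split(","))
    return [entry for entry in entries if ETCD_RECORD.match(entry)]


# Shortened samples based on the upgrade verification in this report.
BEFORE_UPGRADE = (
    ".cluster.local,.svc,api-int.gpei-4613.qe.devcluster.openshift.com,"
    "etcd-0.gpei-4613.qe.devcluster.openshift.com,"
    "etcd-1.gpei-4613.qe.devcluster.openshift.com,"
    "etcd-2.gpei-4613.qe.devcluster.openshift.com,localhost,test.no-proxy.com"
)
AFTER_UPGRADE = (
    ".cluster.local,.svc,api-int.gpei-4613.qe.devcluster.openshift.com,"
    "localhost,test.no-proxy.com"
)

if __name__ == "__main__":
    print("before upgrade:", etcd_records(BEFORE_UPGRADE))
    print("after upgrade:", etcd_records(AFTER_UPGRADE))
```

On a live cluster, the input string could presumably be taken from the `status.noProxy` field shown in the `oc get proxy cluster -o yaml` output; an empty result corresponds to the fixed behavior.
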