When installing a cluster behind a proxy, the installation fails:

level=info msg=Cluster operator machine-config Progressing is True with : Working towards 4.7.0-0.nightly-2020-11-23-195308
level=error msg=Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.7.0-0.nightly-2020-11-23-195308: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-52-10.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\", Node ip-10-0-77-145.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\", Node ip-10-0-48-13.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-c12bd4e2165c598321688af24bf672b9\\\" not found\"", retrying
level=info msg=Cluster operator machine-config Available is False with : Cluster not available for 4.7.0-0.nightly-2020-11-23-195308
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is still updating

After the failure we dug into it and found that the root cause is a mismatch of the NO_PROXY setting between the bootstrap node and the cluster-wide setting:

< Environment=NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-471.qe.devcluster.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,test.no-proxy.com
---
> Environment=NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-471.qe.devcluster.openshift.com,etcd-0.gpei-471.qe.devcluster.openshift.com,etcd-1.gpei-471.qe.devcluster.openshift.com,etcd-2.gpei-471.qe.devcluster.openshift.com,localhost,test.no-proxy.com

$ oc get proxies.config.openshift.io cluster -o yaml
<--snip-->
spec:
  httpProxy: http://user:password@10.0.99.4:3128
  httpsProxy: http://user:password@10.0.99.4:3128
  noProxy: test.no-proxy.com
  trustedCA:
    name: ""
status:
  httpProxy: http://user:password@10.0.99.4:3128
  httpsProxy: http://user:password@10.0.99.4:3128
  noProxy: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.miyadav24azur.qe.azure.devcluster.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,test.no-proxy.com

It looks like the etcd FQDN entries are missing the cluster domain, while the bootstrap node has it.

Version: 4.7.0-0.nightly-2020-11-23-195308
Platform: gcp/azure/vsphere/aws, both IPI and UPI

What happened?
When running the installation behind a proxy, the installation fails.

What did you expect to happen?
The installation completes successfully.

How to reproduce it (as minimally and precisely as possible)?
1. Inject proxy settings into install-config.yaml, such as:

---
apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
metadata:
  name: gpei-472
platform:
  aws:
    region: us-east-2
    subnets:
    - subnet-0c56d13268c3b8d24
    - subnet-02c73c48c59eca6b0
    - subnet-0908bd983fe473287
    - subnet-080c64ca63b15fb2d
pullSecret: HIDDEN
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
publish: External
proxy:
  httpProxy: http://user:password@proxy.example.com:3128
  httpsProxy: http://user:password@proxy.example.com:3128
  noProxy: test.no-proxy.com
baseDomain: qe.devcluster.openshift.com

2. Run the installation.

Anything else we need to know?
1. Did not hit such issues on 4.7.0-0.nightly-2020-11-18-085225.
2. We see this issue for the first time on 4.7.0-0.nightly-2020-11-20-234717.
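To confirm the mismatch described above on a live cluster, one quick check (a rough sketch using the resources shown in this report; the grep pattern is only an assumption about how the value appears in the rendered machineconfig) is to compare the cluster-wide proxy status with the NO_PROXY the MCO rendered for the masters:

$ oc get proxies.config.openshift.io cluster -o jsonpath='{.status.noProxy}'
$ oc get machineconfig 00-master -o yaml | grep 'Environment=NO_PROXY'

On an affected cluster both outputs show bare "etcd-0.,etcd-1.,etcd-2." entries without the cluster domain.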
$ oc get infrastructures.config.openshift.io cluster -o yaml
<--snip-->
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: Azure
status:
  apiServerInternalURI: https://api-int.miyadav24azur.qe.azure.devcluster.openshift.com:6443
  apiServerURL: https://api.miyadav24azur.qe.azure.devcluster.openshift.com:6443
  etcdDiscoveryDomain: ""
  infrastructureName: miyadav24azur-t8cqb
  platform: Azure
  platformStatus:
    azure:
      cloudName: AzurePublicCloud
      networkResourceGroupName: miyadav24azur-rg
      resourceGroupName: miyadav24azur-t8cqb-rg
    type: Azure

The etcdDiscoveryDomain is empty; I think that is what leads to the etcd FQDNs missing their domain.
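For a one-line check of the field that appears to be at fault, something like the following (standard jsonpath output, nothing cluster-specific assumed) prints the discovery domain, which comes back empty on affected clusters:

$ oc get infrastructures.config.openshift.io cluster -o jsonpath='{.status.etcdDiscoveryDomain}'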
This is causing some jobs to fail very frequently now and bringing down some of the release indicator percentages we track. Here is an example job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy

I'm assuming it's the same problem because I see this in a failed job [0]:

cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/00-master.yaml: Environment=NO_PROXY=.cluster.local,.ec2.internal,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.ci-op-q4qfig05-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-0.,etcd-1.,etcd-2.,localhost

which seems to be missing some values we see in the last job [1] that did not fail (it ran a week ago):

cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/00-master.yaml: Environment=NO_PROXY=.cluster.local,.svc,.us-west-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-0.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-1.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,etcd-2.ci-op-llgw500b-2659c.origin-ci-int-aws.dev.rhcloud.com,localhost

[0] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331391826971594752
[1] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1328922989076418560
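For triaging CI runs, one option (an assumption about the gathered-artifact layout, based on the paths quoted above) is to grep the collected machineconfig dump directly and check whether the etcd entries carry the cluster domain:

$ grep 'Environment=NO_PROXY' cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/00-master.yaml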
Can we up the priority on this? I've found two bugs related to this:

https://bugzilla.redhat.com/show_bug.cgi?id=1899979
https://bugzilla.redhat.com/show_bug.cgi?id=1904231

Also the entire aws-proxy job is red: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy?buildId=

I'm going to dupe the BZs into this one so we can have this bug as the tracking bug.
*** Bug 1899979 has been marked as a duplicate of this bug. ***
*** Bug 1904231 has been marked as a duplicate of this bug. ***
*** Bug 1901577 has been marked as a duplicate of this bug. ***
(In reply to Kirsten Garrison from comment #4)
> Can we up the priority on this? I've found two bugs related to this:

The installer team will be looking at this early next sprint.
This regression was introduced in https://github.com/openshift/installer/pull/4067 with the removal of the code that sets status.etcdDiscoveryDomain in infrastructure.config.openshift.io. The cluster-network-operator relies on that field to fill out the status.noProxy field in proxy.config.openshift.io [1].

[1] https://github.com/openshift/cluster-network-operator/blob/c23495cf6e6ffeffc0290c85ee4608102f7b47d1/pkg/util/proxyconfig/no_proxy.go#L113
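A minimal sketch (in shell, not the operator's actual Go code) of why the bare entries show up: the noProxy computation appears to build one "etcd-<n>." entry per control-plane replica by appending the discovery domain, so with status.etcdDiscoveryDomain empty the generated names have no domain at all:

$ domain=$(oc get infrastructures.config.openshift.io cluster -o jsonpath='{.status.etcdDiscoveryDomain}')
$ for i in 0 1 2; do echo "etcd-${i}.${domain}"; done
etcd-0.
etcd-1.
etcd-2.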
*** Bug 1906620 has been marked as a duplicate of this bug. ***
*** Bug 1906321 has been marked as a duplicate of this bug. ***
Is there any update or proposed fix for this one yet? Since it is blocking and I don't see any update since mid-December, I'm checking on the status here.
(In reply to lmcfadde from comment #12)
> Is there any update or proposed fix for this one yet? Since it is blocking
> and I don't see any update since mid-December, I'm checking on the status here.

This bug will be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1909502.

*** This bug has been marked as a duplicate of bug 1909502 ***
This issue is fixed in nightly payload 4.7.0-0.nightly-2021-01-21-012810, closing it now. Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1909502#c21 for detailed verification steps.
*** Bug 1916904 has been marked as a duplicate of this bug. ***