Description of problem:
In a restricted network with a cluster-wide proxy configured, the GCP PD CSI driver pods do not get the proxy environment variables injected, so they cannot reach the Google Cloud API and the installation fails.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-21-131655

How reproducible:
Always

Steps to Reproduce:
1. Prepare a restricted network on GCP where even the Google Cloud API cannot be reached directly.
2. Inject a proxy into install-config.yaml:

  proxy:
    httpProxy: http://proxy-user1:xxxx@QE_PROXY_PLACEHOLDER:3128
    httpsProxy: http://proxy-user1:xxxx@QE_PROXY_PLACEHOLDER:3128
    noProxy: test.no-proxy.com

3. Trigger an install.

Actual results:
Installation failed.

level=info msg=Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
level=info msg=Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-csi-drivers gcp-pd-csi-driver-controller-59d6cc788d-kk94s
NAME                                            READY   STATUS             RESTARTS   AGE
gcp-pd-csi-driver-controller-59d6cc788d-kk94s   5/6     CrashLoopBackOff   12         78m

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-csi-drivers logs gcp-pd-csi-driver-controller-59d6cc788d-kk94s -c csi-driver
I1224 09:16:22.494469       1 main.go:68] Driver vendor version v4.7.0-202012190243.p0-0-gded384e-dirty
I1224 09:16:22.494585       1 gce.go:83] Using GCE provider config <nil>
I1224 09:16:22.494741       1 gce.go:134] GOOGLE_APPLICATION_CREDENTIALS env var set /etc/cloud-sa/service_account.json
I1224 09:16:22.494752       1 gce.go:138] Using DefaultTokenSource &oauth2.reuseTokenSource{new:jwt.jwtSource{ctx:(*context.cancelCtx)(0xc0003aa8c0), conf:(*jwt.Config)(0xc0001d4aa0)}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
E1224 09:16:52.496867       1 gce.go:195] error fetching initial token:
oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp 172.217.214.95:443: i/o timeout
E1224 09:17:27.499041       1 gce.go:195] error fetching initial token: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp 74.125.129.95:443: i/o timeout
E1224 09:17:57.501518       1 gce.go:195] error fetching initial token: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp 64.233.181.95:443: i/o timeout
F1224 09:17:57.501562       1 main.go:84] Failed to get cloud provider: timed out waiting for the condition

The csi-driver container is trying to connect to the Google Cloud API and fails. In this restricted network, all outgoing traffic has to go through the proxy, but no proxy environment variables are injected into these pods.

[root@preserve-jialiu-ansible ~]# oc get proxies.config.openshift.io cluster -o yaml
<--snip-->
spec:
  httpProxy: http://proxy-user1:xxx@10.0.0.2:3128
  httpsProxy: http://proxy-user1:xxx@10.0.0.2:3128
  noProxy: test.no-proxy.com
  trustedCA:
    name: ""
<--snip-->

[root@preserve-jialiu-ansible ~]# oc describe po gcp-pd-csi-driver-operator-7b84b8ff44-dxzk7 -n openshift-cluster-csi-drivers |grep -i proxy
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-csi-drivers rsh -c csi-liveness-probe gcp-pd-csi-driver-controller-59d6cc788d-kk94s
sh-4.4# env|grep -i proxy
sh-4.4# exit

Compare with other components, taking machine-api as an example:

[root@preserve-jialiu-ansible ~]# oc describe po machine-api-controllers-fc7687f9b-xgksl -n openshift-machine-api |grep -i proxy
      HTTP_PROXY:   http://proxy-user1:xxx@10.0.0.2:3128
      HTTPS_PROXY:  http://proxy-user1:xxx@10.0.0.2:3128
      NO_PROXY:     .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jialiu-12538.qe.gcp.devcluster.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,metadata,metadata.google.internal,metadata.google.internal.,test.no-proxy.com

Expected results:
Installation completes successfully behind the proxy.
Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
This needs to be fixed in several storage operators:
- library-go itself (where the common CSI driver functionality is).
- all CSI driver operators (!).
- vsphere-problem-detector.
- maybe cluster-storage-operator, to start the Manila operator with the proxy too.
- csi-snapshot-controller-operator (snapshot-controller does not talk to anything but the API server, but to be on the safe side...).

Continuing with library-go in this BZ, will clone the rest to separate bugs.
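The shape of the fix is the same everywhere: read the cluster-scoped proxies.config.openshift.io object and inject HTTP_PROXY/HTTPS_PROXY/NO_PROXY into the managed Deployments, the way machine-api already does. A minimal sketch of that env-var injection, using simplified plain-Go stand-ins for the Kubernetes EnvVar type and the Proxy spec (this is illustrative; the function and type names are assumptions, not the actual library-go API):

```go
package main

import "fmt"

// EnvVar is a simplified stand-in for the Kubernetes container env var type.
type EnvVar struct {
	Name, Value string
}

// ProxySpec mirrors, in simplified form, the spec of the cluster-scoped
// proxies.config.openshift.io object shown in the bug report.
type ProxySpec struct {
	HTTPProxy, HTTPSProxy, NoProxy string
}

// injectProxyEnv appends HTTP_PROXY/HTTPS_PROXY/NO_PROXY to a container's
// env list for every proxy field that is set. Empty fields are skipped,
// so a cluster without a proxy keeps its env unchanged.
func injectProxyEnv(env []EnvVar, p ProxySpec) []EnvVar {
	for _, kv := range []EnvVar{
		{"HTTP_PROXY", p.HTTPProxy},
		{"HTTPS_PROXY", p.HTTPSProxy},
		{"NO_PROXY", p.NoProxy},
	} {
		if kv.Value != "" {
			env = append(env, kv)
		}
	}
	return env
}

func main() {
	// Values taken from the proxy spec in this report (password elided as "xxx").
	env := injectProxyEnv(nil, ProxySpec{
		HTTPProxy:  "http://proxy-user1:xxx@10.0.0.2:3128",
		HTTPSProxy: "http://proxy-user1:xxx@10.0.0.2:3128",
		NoProxy:    "test.no-proxy.com",
	})
	for _, e := range env {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

In the real operators this runs as part of deployment sync, with the Proxy object watched via an informer so changes roll out to the driver pods.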
I pruned the TestBlocker keyword in all clones except for https://bugzilla.redhat.com/show_bug.cgi?id=1912946 - that's the GCE CSI driver one that blocks testing.
Status update: I have a PoC for the library-go bits here: https://github.com/openshift/aws-ebs-csi-driver-operator/pull/106 It's implemented in the AWS operator for testing purposes, but I'll move it to library-go tomorrow.
Status update: library-go changes are under review: https://github.com/openshift/library-go/pull/976 Once that's merged, CSI driver operators can bump library-go and make a small code change (along with RBAC adjustments to read the proxy resource): https://github.com/openshift/aws-ebs-csi-driver-operator/pull/106/commits/303cf12fd9525a2b12e65bdb2e4f69e0eb838062
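The "RBAC adjustments to read the proxy resource" boil down to granting the operator's service account read access to proxies in the config.openshift.io group. A sketch of such a ClusterRole fragment (the metadata name here is illustrative, not the exact manifest from the PR):

```yaml
# Illustrative ClusterRole letting a CSI driver operator watch the
# cluster Proxy configuration. The name is an example only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-driver-operator-proxy-reader
rules:
- apiGroups:
  - config.openshift.io
  resources:
  - proxies
  verbs:
  - get
  - list
  - watch
```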
Moving to MODIFIED because the PR has been merged.
Verified with: 4.7.0-0.nightly-2021-01-21-235301 and the AWS EBS CSI driver.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633