Bug 1910581 - library-go: proxy ENV is not injected into csi-driver-controller, which leads to the storage operator never becoming ready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Fabio Bertinatto
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks: 1912942 1912944 1912945 1912946 1912947 1912948 1912949 1912950
 
Reported: 2020-12-24 09:48 UTC by Johnny Liu
Modified: 2021-02-24 15:49 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1912942 1912944 1912945 1912946 1912947 1912948 1912949 1912950
Environment:
Last Closed: 2021-02-24 15:48:24 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-storage-operator pull 131 | 0 | None | closed | Bug 1910581: CSO shouldn't overwrite clustercsidriver objects | 2021-02-03 06:31:39 UTC
Github | openshift library-go pull 976 | 0 | None | closed | csi: add configobserver controller and hook helpers | 2021-02-02 09:19:24 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:49:18 UTC

Description Johnny Liu 2020-12-24 09:48:55 UTC
Description of problem:

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-21-131655

How reproducible:
Always

Steps to Reproduce:
1. Prepare a restricted network on GCP, one that cannot even reach the Google Cloud API directly.
2. Inject the proxy settings into install-config.yaml:
proxy:
  httpProxy: http://proxy-user1:xxxx@QE_PROXY_PLACEHOLDER:3128
  httpsProxy: http://proxy-user1:xxxx@QE_PROXY_PLACEHOLDER:3128
  noProxy: test.no-proxy.com
3. Trigger an install

Actual results:
Installation failed. 
level=info msg=Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
level=info msg=Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service

[root@preserve-jialiu-ansible ~]#  oc get po -n openshift-cluster-csi-drivers gcp-pd-csi-driver-controller-59d6cc788d-kk94s
NAME                                            READY   STATUS             RESTARTS   AGE
gcp-pd-csi-driver-controller-59d6cc788d-kk94s   5/6     CrashLoopBackOff   12         78m

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-csi-drivers logs gcp-pd-csi-driver-controller-59d6cc788d-kk94s -c csi-driver
I1224 09:16:22.494469       1 main.go:68] Driver vendor version v4.7.0-202012190243.p0-0-gded384e-dirty
I1224 09:16:22.494585       1 gce.go:83] Using GCE provider config <nil>
I1224 09:16:22.494741       1 gce.go:134] GOOGLE_APPLICATION_CREDENTIALS env var set /etc/cloud-sa/service_account.json
I1224 09:16:22.494752       1 gce.go:138] Using DefaultTokenSource &oauth2.reuseTokenSource{new:jwt.jwtSource{ctx:(*context.cancelCtx)(0xc0003aa8c0), conf:(*jwt.Config)(0xc0001d4aa0)}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
E1224 09:16:52.496867       1 gce.go:195] error fetching initial token: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp 172.217.214.95:443: i/o timeout
E1224 09:17:27.499041       1 gce.go:195] error fetching initial token: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp 74.125.129.95:443: i/o timeout
E1224 09:17:57.501518       1 gce.go:195] error fetching initial token: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp 64.233.181.95:443: i/o timeout
F1224 09:17:57.501562       1 main.go:84] Failed to get cloud provider: timed out waiting for the condition

The csi-driver container tries to connect to the Google Cloud API and fails. In this restricted network all outgoing traffic must go through the proxy, but no proxy ENV is injected into these pods.
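
The direct dials in the log above are consistent with how Go HTTP clients behave by default: the standard transport takes its proxy from HTTP_PROXY, HTTPS_PROXY and NO_PROXY in the process environment, and connects directly when they are absent. The following is a minimal sketch of that lookup using only the Go standard library. It is not code from the GCP PD CSI driver; the helper names applyProxyEnv and proxyFor are made up for illustration, and the values are copied from the proxy spec in this report.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
)

// applyProxyEnv exports the given variables into the current process.
// Hypothetical helper: it stands in for what the pod spec would have to
// provide to the container.
func applyProxyEnv(env map[string]string) error {
	for name, value := range env {
		if err := os.Setenv(name, value); err != nil {
			return err
		}
	}
	return nil
}

// proxyFor reports which proxy Go's default HTTP transport would use for
// target, or nil when it would dial the target directly.
func proxyFor(target string) (*url.URL, error) {
	req, err := http.NewRequest(http.MethodPost, target, nil)
	if err != nil {
		return nil, err
	}
	// net/http reads the proxy variables once per process and caches them,
	// so they have to be set before the first call.
	return http.ProxyFromEnvironment(req)
}

func main() {
	// Values modelled on the report; the password is a placeholder.
	err := applyProxyEnv(map[string]string{
		"HTTPS_PROXY": "http://proxy-user1:xxx@10.0.0.2:3128",
		"NO_PROXY":    "test.no-proxy.com",
	})
	if err != nil {
		fmt.Println("setenv failed:", err)
		return
	}

	targets := []string{
		"https://oauth2.googleapis.com/token",
		"https://test.no-proxy.com/",
	}
	for _, target := range targets {
		proxy, err := proxyFor(target)
		switch {
		case err != nil:
			fmt.Printf("%s: error: %v\n", target, err)
		case proxy == nil:
			fmt.Printf("%s: direct connection\n", target)
		default:
			// Print only the host to keep credentials out of the output.
			fmt.Printf("%s: via proxy %s\n", target, proxy.Host)
		}
	}
}
```

With the variables set, the token endpoint resolves to the proxy and the NO_PROXY host stays direct. Without them every lookup returns nil, which matches the dial timeouts seen in the csi-driver log.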

[root@preserve-jialiu-ansible ~]# oc get proxies.config.openshift.io cluster -o yaml
<--snip-->
spec:
  httpProxy: http://proxy-user1:xxx@10.0.0.2:3128
  httpsProxy: http://proxy-user1:xxx@10.0.0.2:3128
  noProxy: test.no-proxy.com
  trustedCA:
    name: ""
<--snip-->

[root@preserve-jialiu-ansible ~]# oc describe po gcp-pd-csi-driver-operator-7b84b8ff44-dxzk7 -n openshift-cluster-csi-drivers |grep -i proxy

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-csi-drivers rsh -c csi-liveness-probe gcp-pd-csi-driver-controller-59d6cc788d-kk94s 
sh-4.4# env|grep -i proxy
sh-4.4# exit


Compare with other components, taking machine-api as an example:
[root@preserve-jialiu-ansible ~]# oc describe po machine-api-controllers-fc7687f9b-xgksl -n openshift-machine-api |grep -i proxy
      HTTP_PROXY:   http://proxy-user1:xxx@10.0.0.2:3128
      HTTPS_PROXY:  http://proxy-user1:xxx@10.0.0.2:3128
      NO_PROXY:     .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jialiu-12538.qe.gcp.devcluster.openshift.com,etcd-0.,etcd-1.,etcd-2.,localhost,metadata,metadata.google.internal,metadata.google.internal.,test.no-proxy.com


Expected results:
Installation completes successfully behind the proxy.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 2 Jan Safranek 2021-01-05 15:50:43 UTC
This needs to be fixed in several storage operators:

- library-go itself (where common CSI driver functionality is).
- all CSI driver operators (!).
- vsphere-problem-detector.
- maybe cluster-storage-operator to start Manila operator with the proxy too.
- csi-snapshot-controller-operator (snapshot-controller does not talk to anything but API server, but to be on the safe side...)

Continuing with library-go in this BZ, will clone the rest to separate bugs.

Comment 3 Jan Safranek 2021-01-06 16:07:30 UTC
I pruned the TestBlocker keyword from all clones except for https://bugzilla.redhat.com/show_bug.cgi?id=1912946 - that's the GCE CSI driver one that blocks testing.

Comment 4 Fabio Bertinatto 2021-01-06 22:26:29 UTC
Status update: I have a PoC for the library-go bits here: https://github.com/openshift/aws-ebs-csi-driver-operator/pull/106

It's implemented in the AWS operator for testing purposes, but I'll move it to library-go tomorrow.

Comment 5 Fabio Bertinatto 2021-01-11 15:19:36 UTC
Status update: library-go changes under review: https://github.com/openshift/library-go/pull/976

Once that's merged, CSI driver operators can bump library-go and make a small code change (along with RBAC adjustments to read the proxy resource): https://github.com/openshift/aws-ebs-csi-driver-operator/pull/106/commits/303cf12fd9525a2b12e65bdb2e4f69e0eb838062
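
For readers who want a picture of what such a change amounts to, here is a rough sketch of a hook that copies the cluster proxy settings into every container of a controller Deployment. It is not the code from the library-go PR or the AWS operator commit. EnvVar, Container, ProxyConfig, setEnv and injectProxyEnv are simplified, hypothetical stand-ins so the example runs without Kubernetes dependencies; a real operator would work on the Kubernetes Deployment type and read the values from the cluster Proxy resource.

```go
package main

import "fmt"

// EnvVar and Container are simplified, hypothetical stand-ins for the
// corresponding Kubernetes API types.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

// ProxyConfig holds the values an operator would read from the cluster
// Proxy resource (hypothetical type for this sketch).
type ProxyConfig struct {
	HTTPProxy  string
	HTTPSProxy string
	NoProxy    string
}

// setEnv updates name in env if it is already present, otherwise appends it.
func setEnv(env []EnvVar, name, value string) []EnvVar {
	for i := range env {
		if env[i].Name == name {
			env[i].Value = value
			return env
		}
	}
	return append(env, EnvVar{Name: name, Value: value})
}

// injectProxyEnv sets HTTP_PROXY, HTTPS_PROXY and NO_PROXY on every
// container in place. Empty proxy values are skipped.
func injectProxyEnv(containers []Container, p ProxyConfig) {
	vars := []EnvVar{
		{Name: "HTTP_PROXY", Value: p.HTTPProxy},
		{Name: "HTTPS_PROXY", Value: p.HTTPSProxy},
		{Name: "NO_PROXY", Value: p.NoProxy},
	}
	for i := range containers {
		for _, v := range vars {
			if v.Value == "" {
				continue
			}
			containers[i].Env = setEnv(containers[i].Env, v.Name, v.Value)
		}
	}
}

func main() {
	// Container names taken from the commands in this report.
	containers := []Container{
		{Name: "csi-driver"},
		{Name: "csi-liveness-probe"},
	}
	injectProxyEnv(containers, ProxyConfig{
		HTTPProxy:  "http://10.0.0.2:3128",
		HTTPSProxy: "http://10.0.0.2:3128",
		NoProxy:    "test.no-proxy.com",
	})
	for _, c := range containers {
		for _, e := range c.Env {
			fmt.Printf("%s: %s=%s\n", c.Name, e.Name, e.Value)
		}
	}
}
```

Overwriting an existing variable rather than appending a duplicate keeps the result stable when the hook runs on every sync, which is one reasonable design choice for this kind of helper.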

Comment 6 Fabio Bertinatto 2021-01-13 20:28:44 UTC
Moving to MODIFIED because the PR has been merged.

Comment 11 Qin Ping 2021-01-22 06:24:08 UTC
Verified with 4.7.0-0.nightly-2021-01-21-235301 and the AWS EBS CSI driver.

Comment 14 errata-xmlrpc 2021-02-24 15:48:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

