Bug 1769759

Summary:	oc can not work behind https proxy which is using some self-signed certificates
Product:	OpenShift Container Platform
Component:	oc
Version:	4.3.0
Status:	CLOSED NOTABUG
Severity:	high
Priority:	high
Target Release:	4.4.0
Hardware:	Unspecified
OS:	Unspecified
Reporter:	Johnny Liu <jialiu>
Assignee:	Maciej Szulik <maszulik>
QA Contact:	zhou ying <yinzhou>
CC:	aos-bugs, jokerman, mfojtik, sdodson, sople
Last Closed:	2020-02-21 14:33:29 UTC
Type:	Bug

Description Johnny Liu 2019-11-07 11:55:53 UTC

Description of problem:


Version-Release number of selected component (if applicable):
openshift-clients-4.3.0-201910250623.git.1.4c88e02.el8.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Enable the https proxy in install-config.yaml:
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  CA CONTENTS
  -----END CERTIFICATE-----
proxy:
  httpProxy: http://AA:BB@10.0.76.148:3128
  httpsProxy: https://AA:BB@10.0.76.148:3130
  noProxy: test.no-proxy.com
2. Trigger an install.
3. Bootstrap fails.

Actual results:
$ openshift-install wait-for bootstrap-complete

level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.jialiu-377.qe.devcluster.openshift.com:6443..."

level=info msg="API v1.16.2 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."

E1107 05:44:55.764255   30866 reflector.go:280] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=5159&timeoutSeconds=341&watch=true: net/http: TLS handshake timeout

E1107 05:45:06.662764   30866 reflector.go:123] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF

I logged into the bootstrap node to do some debugging.
The bootkube service has completed:
# journalctl -f -u bootkube
<--snip-->
Nov 07 10:45:30 jialiu-377-mzdnm-bootstrap-0 bootkube.sh[1940]: bootkube.service complete

But report-progress.sh reports failures:
# journalctl -f --system
Nov 07 11:38:39 jialiu-377-mzdnm-bootstrap-0 report-progress.sh[1941]: error: unable to recognize "STDIN": Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority
Nov 07 11:38:44 jialiu-377-mzdnm-bootstrap-0 report-progress.sh[1941]: error: unable to recognize "STDIN": Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority

report-progress.sh tries to run an oc command to create the boot-completed event that notifies the installer to move on, but the command fails.

I ran some checks:
1. ca cert for https proxy already installed on bootstrap node.
2. Ran `oc get node` behind the https proxy and got the same failure.
# oc get node --loglevel=8
I1107 11:24:03.404801   14968 loader.go:375] Config loaded from file:  ./kubeconfig
I1107 11:24:03.406214   14968 round_trippers.go:420] GET https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s
I1107 11:24:03.406230   14968 round_trippers.go:427] Request Headers:
I1107 11:24:03.406238   14968 round_trippers.go:431]     Accept: application/json, */*
I1107 11:24:03.406250   14968 round_trippers.go:431]     User-Agent: oc/v0.0.0 (linux/amd64) kubernetes/$Format
I1107 11:24:03.409424   14968 round_trippers.go:446] Response Status:  in 3 milliseconds
I1107 11:24:03.409457   14968 round_trippers.go:449] Response Headers:
I1107 11:24:03.409509   14968 cached_discovery.go:121] skipped caching discovery info due to Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority
I1107 11:24:03.409862   14968 round_trippers.go:420] GET https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s
I1107 11:24:03.409938   14968 round_trippers.go:427] Request Headers:
I1107 11:24:03.409977   14968 round_trippers.go:431]     Accept: application/json, */*
I1107 11:24:03.410030   14968 round_trippers.go:431]     User-Agent: oc/v0.0.0 (linux/amd64) kubernetes/$Format
I1107 11:24:03.412697   14968 round_trippers.go:446] Response Status:  in 2 milliseconds
I1107 11:24:03.412716   14968 round_trippers.go:449] Response Headers:
I1107 11:24:03.412744   14968 cached_discovery.go:121] skipped caching discovery info due to Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority
I1107 11:24:03.412776   14968 shortcut.go:89] Error loading discovery information: Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority

The proxy server log shows the following line:
1573126987.413      0 10.0.96.66 NONE/000 0 NONE error:transaction-end-before-headers - HIER_NONE/- -
3. Tried 'curl 7.64.0' behind the same https proxy: curl loaded the self-signed cert files and accessed the API URL successfully.
4. Tried 'curl 7.29.0' behind the same https proxy: curl cannot access the API URL, and the proxy server log shows the same error, "error:transaction-end-before-headers".

Expected results:
Installation behind an https proxy succeeds.

Additional info:
Installation behind an http proxy succeeds.

This is blocking QE's testing behind https proxy.

Comment 1 Michal Fojtik 2019-11-12 10:47:03 UTC

Installer team: My suspicion is that the proxy CA is not making it to bootkube.sh container? oc needs to see the CA in system trust (/etc/ssl/...)

Comment 2 Scott Dodson 2019-11-12 15:07:47 UTC

This is `openshift-install` and `oc` utilizing proxy settings. This is managed by setting appropriate HTTP_PROXY HTTPS_PROXY and NO_PROXY environment variables when executing `openshift-install` and `oc`.

`oc` specifically has flags to set a certificate authority however `openshift-install` does not so you'd need to add the authority to your host trust store.

In my opinion, this is not a bug. We cannot assume that because a cluster is being installed behind a proxy, the public API is only accessible via a proxy. We have to rely on the admin configuring their client tools appropriately.

Comment 3 Scott Dodson 2019-11-12 15:14:38 UTC

ON_QA to confirm this is possible as described in comment 2, please close if confirmed.

Comment 4 Johnny Liu 2019-11-13 01:56:30 UTC

> Installer team: My suspicion is that the proxy CA is not making it to
> bootkube.sh container? oc needs to see the CA in system trust (/etc/ssl/...)

Please note the checklist in comment 0; I already confirmed that.
> 1. ca cert for https proxy already installed on bootstrap node.

> `oc` specifically has flags to set a certificate authority however `openshift-install` does not so you'd need to add the authority to your host trust store.
1. The flag for setting a certificate authority is for the API certs, right? Here the issue is that the oc client cannot connect to the proxy server over SSL.
2. As in the comments above, I already confirmed that the https proxy cert files are installed in the bootstrap trust store. The `additionalTrustBundle` setting in install-config.yaml holds the CA content for the https proxy, and `additionalTrustBundle` is indeed meant for installing the user's custom CA into the host trust store. Is my understanding wrong?
3. The `oc` command is executed on the bootstrap node, and bootstrap is a fully self-booting process. How can the user intervene in that process? I do not think that is acceptable.

Comment 5 Scott Dodson 2019-11-13 20:26:34 UTC

Sorry, it wasn't clear to me which of the output was from the host where the installer was running versus debugging efforts on the bootstrap host.

There's a few things going on here, the report-progress.sh errors should've been cleaned up in https://bugzilla.redhat.com/show_bug.cgi?id=1762618 so those should no longer exist.

I agree that `oc` on the bootstrap host is expected to work properly both in terms of having the environment variables populated for proxy and having the CA added to the host's trust store. If it doesn't we run the risk of additional either existing or new components relying on `oc` breaking. Can you check on the status of `coreos-update-ca-trust.service` which is responsible for updating the host's trust store? When you say curl was used, did you supply certificate related arguments or was it loading the bootstrap host's trust store?

Comment 6 Johnny Liu 2019-11-14 04:42:04 UTC

(In reply to Scott Dodson from comment #5)
> Sorry, it wasn't clear to me which of the output was from the host where the
> installer was running versus debugging efforts on the bootstrap host.
> 
> There's a few things going on here, the report-progress.sh errors should've
> been cleaned up in https://bugzilla.redhat.com/show_bug.cgi?id=1762618 so
> those should no longer exist.
Yeah, the fix for 1762618 would hide the `oc` issue behind https proxy, but the install would then hit another issue with cloud provider initialization behind https proxy. I will launch a new env later to reproduce.
> 
> I agree that `oc` on the bootstrap host is expected to work properly both in
> terms of having the environment variables populated for proxy and having the
> CA added to the host's trust store. If it doesn't we run the risk of
> additional either existing or new components relying on `oc` breaking.
+1. 
On bootstrap, we can use no_proxy to skip this issue. If a customer runs oc on their own host behind an https proxy, they would hit this issue again.

> Can you check on the status of `coreos-update-ca-trust.service` which is
> responsible for updating the host's trust store? 
Later I will create a new env to reproduce it, and attach output.

> When you say curl was used, did you supply certificate related arguments 
> or was it loading the bootstrap host's trust store?
Whether curl works behind an https proxy depends on the curl version. An old version of curl does NOT work even when it loads the bootstrap host's trust store.
Unfortunately, the curl version on bootstrap is "7.29.0", which does not support https proxy. It is a known issue.
So I had to move my testbed to another RHEL host and install a newer curl, version "7.64.0". I copied the https proxy CA file from the bootstrap's trust store to my test machine and loaded it into that host's trust store, and the new version of curl works. Of course, I also tried the curl command with certificate-related arguments supplied; that works too.
So I am almost 100% sure the https proxy certs are loaded into the host trust store without any problem. The issue is probably whether the oc command really loads these installed certs for its connections to the https proxy.

To reproduce only the oc-behind-https-proxy issue, there is no need to run a real install. Just create an https proxy, load the proxy server's CA into your host's trust store, then run an oc command against a working cluster with the https_proxy environment variable exported.
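
A minimal Go sketch of that reproduction, without a cluster or a real proxy. It assumes that Go's net/http verifies the TLS handshake with an https:// proxy against the roots in Transport.TLSClientConfig, which is what the "proxyconnect tcp: x509" error above suggests. The helper names and the target host api.example.invalid are placeholders, and the fake proxy refuses every CONNECT, so only the handshake with the proxy is exercised.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"time"
)

// startFakeHTTPSProxy starts a local TLS listener that stands in for the
// https proxy. It answers every request, including CONNECT, with 403.
func startFakeHTTPSProxy() *httptest.Server {
	return httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "tunnelling disabled in this sketch", http.StatusForbidden)
	}))
}

// getViaProxy sends one GET through the proxy. The client trusts either
// nothing at all or only the proxy's own certificate.
func getViaProxy(proxy *httptest.Server, trustProxyCA bool) error {
	proxyURL, err := url.Parse(proxy.URL)
	if err != nil {
		return err
	}
	roots := x509.NewCertPool() // deliberately not the system pool
	if trustProxyCA {
		roots.AddCert(proxy.Certificate())
	}
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			Proxy:           http.ProxyURL(proxyURL),
			TLSClientConfig: &tls.Config{RootCAs: roots},
		},
	}
	resp, err := client.Get("https://api.example.invalid:6443/api")
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// isProxyCertError reports whether err looks like the certificate failure
// from the report: raised while connecting to the proxy, with an x509 cause.
func isProxyCertError(err error) bool {
	return err != nil &&
		strings.Contains(err.Error(), "proxyconnect") &&
		strings.Contains(err.Error(), "x509")
}

func main() {
	proxy := startFakeHTTPSProxy()
	defer proxy.Close()

	fmt.Println("without proxy CA:", getViaProxy(proxy, false))
	fmt.Println("with proxy CA:   ", getViaProxy(proxy, true))
}
```

With the proxy CA missing from the client's pool, the request should fail during the handshake with the proxy. With it present, the handshake should succeed and the request should fail later on the refused CONNECT instead.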

Comment 7 Scott Dodson 2019-11-14 16:04:30 UTC

(In reply to Johnny Liu from comment #6)

> To reproduce only the oc-behind-https-proxy issue, there is no need to run a
> real install. Just create an https proxy, load the proxy server's CA into
> your host's trust store, then run an oc command against a working cluster
> with the https_proxy environment variable exported.

Yeah, I did that yesterday and got the errors below, though not the exact same errors you were getting. I'm going to move this back to CLI under the presumption that the trust store and proxy are properly configured. Having access to an environment that matches your reproducer will be key to debugging this. Is there any way you can preserve this environment so that both the CLI and Installer teams can validate things?

Here's what I get when I attempt to use a proxy that's trusted.

sdodson@t460: ~/clusters/test$ KUBECONFIG=$(pwd)/auth/kubeconfig ./oc get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-137-196.ec2.internal   Ready    worker   31m   v1.14.6+c07e432da
ip-10-0-142-151.ec2.internal   Ready    master   37m   v1.14.6+c07e432da
ip-10-0-148-123.ec2.internal   Ready    worker   31m   v1.14.6+c07e432da
ip-10-0-155-112.ec2.internal   Ready    master   36m   v1.14.6+c07e432da
ip-10-0-166-58.ec2.internal    Ready    master   37m   v1.14.6+c07e432da
ip-10-0-169-176.ec2.internal   Ready    worker   31m   v1.14.6+c07e432da

sdodson@t460: ~/clusters/test$ HTTPS_PROXY=https://squid.corp.redhat.com:3128 KUBECONFIG=$(pwd)/auth/kubeconfig ./oc get nodes --loglevel=9
I1114 11:01:34.423336   16257 loader.go:375] Config loaded from file:  /home/rdu/sdodson/clusters/test/auth/kubeconfig
I1114 11:01:34.436669   16257 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json" -H "User-Agent: oc/v0.0.0 (linux/amd64) kubernetes/$Format" 'https://api.sdodson.devcluster.openshift.com:6443/api/v1/nodes?limit=500'
I1114 11:01:34.466851   16257 round_trippers.go:443] GET https://api.sdodson.devcluster.openshift.com:6443/api/v1/nodes?limit=500  in 30 milliseconds
I1114 11:01:34.466881   16257 round_trippers.go:449] Response Headers:
I1114 11:01:34.466976   16257 helpers.go:217] Connection error: Get https://api.sdodson.devcluster.openshift.com:6443/api/v1/nodes?limit=500: proxyconnect tcp: tls: first record does not look like a TLS handshake
F1114 11:01:34.467003   16257 helpers.go:114] Unable to connect to the server: proxyconnect tcp: tls: first record does not look like a TLS handshake

Proof that the proxy is trusted, via curl to google.com

sdodson@t460: ~/clusters/test$ HTTPS_PROXY=https://squid.corp.redhat.com:3128 curl https://google.com/ -v
* About to connect() to proxy squid.corp.redhat.com port 3128 (#0)
*   Trying 10.11.5.35...
* Connected to squid.corp.redhat.com (10.11.5.35) port 3128 (#0)
* Establish HTTP proxy tunnel to google.com:443
> CONNECT google.com:443 HTTP/1.1
> Host: google.com:443
> User-Agent: curl/7.29.0
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.0 200 Connection established
< 
* Proxy replied OK to CONNECT request
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* SSL connection using TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=*.google.com,O=Google LLC,L=Mountain View,ST=California,C=US
*       start date: Oct 16 12:36:57 2019 GMT
*       expire date: Jan 08 12:36:57 2020 GMT
*       common name: *.google.com
*       issuer: CN=GTS CA 1O1,O=Google Trust Services,C=US
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: google.com
> Accept: */*

Comment 10 Johnny Liu 2019-11-15 05:47:30 UTC

I opened 1772756 to track the cloud provider initialization issue behind proxy. Since the fix for 1762618 will let my installation complete, I am removing the testblocker keyword.

Comment 11 Maciej Szulik 2019-12-10 12:59:18 UTC

*** Bug 1779989 has been marked as a duplicate of this bug. ***

Comment 12 Maciej Szulik 2019-12-12 11:24:25 UTC

Given this is not a testblocker anymore I'm moving this to 4.4, with the possibility of backporting a fix to 4.3.z

Comment 17 Maciej Szulik 2020-02-20 17:25:10 UTC

From my current investigation it looks like client-go creates its own CertPool with just a single CA,
see https://github.com/kubernetes/kubernetes/blob/8fb66ae9655da110f921d97c7e4d7f27e8a88bb5/staging/src/k8s.io/client-go/transport/transport.go#L79-L81
where I think it should instead read the system-wide cert pool and add the necessary bits on top.
In other words, https://github.com/kubernetes/kubernetes/blob/8fb66ae9655da110f921d97c7e4d7f27e8a88bb5/staging/src/k8s.io/client-go/transport/transport.go#L171
should use x509.SystemCertPool instead.

Need to confirm my theory, still.
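
To make the two approaches concrete, here is a minimal Go sketch using only crypto/x509. It is not client-go's code. The helper names are hypothetical, the two CAs are throwaway self-signed certificates generated on the spot, and the proxy CA is added to the system pool by hand to stand in for a host trust store that already contains it.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// newSelfSignedCA (hypothetical helper) creates a throwaway self-signed CA
// that stands in for a real proxy or cluster CA.
func newSelfSignedCA(name string) ([]byte, *x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), cert, nil
}

// addCA appends the certificates in each PEM input to base and returns it.
// A nil base means "start from an empty pool", which is the single-CA case.
func addCA(base *x509.CertPool, caPEMs ...[]byte) (*x509.CertPool, error) {
	if base == nil {
		base = x509.NewCertPool()
	}
	for _, caPEM := range caPEMs {
		if !base.AppendCertsFromPEM(caPEM) {
			return nil, errors.New("no certificate found in CA data")
		}
	}
	return base, nil
}

// systemOrEmpty returns a copy of the host trust store, or an empty pool
// when the platform cannot provide one.
func systemOrEmpty() *x509.CertPool {
	if pool, err := x509.SystemCertPool(); err == nil && pool != nil {
		return pool
	}
	return x509.NewCertPool()
}

// trusts reports whether cert verifies against the roots in pool.
func trusts(pool *x509.CertPool, cert *x509.Certificate) bool {
	_, err := cert.Verify(x509.VerifyOptions{Roots: pool})
	return err == nil
}

func main() {
	clusterPEM, _, err1 := newSelfSignedCA("cluster-ca")
	proxyPEM, proxyCert, err2 := newSelfSignedCA("proxy-ca")
	if err1 != nil || err2 != nil {
		panic("could not create the throwaway CAs")
	}

	// Single-CA pool: holds only the cluster CA.
	single, err := addCA(nil, clusterPEM)
	if err != nil {
		panic(err)
	}
	fmt.Println("single-CA pool trusts proxy CA:", trusts(single, proxyCert))

	// Layered pool: system store (plus the stand-in proxy CA) and the cluster CA.
	layered, err := addCA(systemOrEmpty(), proxyPEM, clusterPEM)
	if err != nil {
		panic(err)
	}
	fmt.Println("layered pool trusts proxy CA:  ", trusts(layered, proxyCert))
}
```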

Comment 18 Maciej Szulik 2020-02-21 14:33:29 UTC

After talking with other developers it was brought to my attention that client-go is intentionally written so that it
uses only a single root CA. Thus the same requirement applies to oc and kubectl. In other words, you need to combine
all of the root CAs from both the proxy and the cluster (ideally they are the same) and pass them as a single bundle using
the --certificate-authority flag.

Having said that, I'm closing this as this is not a bug, but intentional behaviour.
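
A minimal Go sketch of building such a combined bundle, assuming the proxy and cluster CAs are available as PEM data. The function name is hypothetical and the payloads in main are placeholders, not real certificates. Plain concatenation of the two PEM files would also give a bundle; this version additionally drops anything that is not a CERTIFICATE block.

```go
package main

import (
	"bytes"
	"encoding/pem"
	"errors"
	"fmt"
)

// mergeCABundles copies every CERTIFICATE block from the given PEM inputs
// into one bundle and reports how many blocks were kept. Other block types
// (for example private keys) and text outside PEM blocks are dropped.
func mergeCABundles(bundles ...[]byte) ([]byte, int, error) {
	var out bytes.Buffer
	kept := 0
	for _, rest := range bundles {
		for {
			var block *pem.Block
			block, rest = pem.Decode(rest)
			if block == nil {
				break // no further PEM blocks in this input
			}
			if block.Type != "CERTIFICATE" {
				continue
			}
			if err := pem.Encode(&out, block); err != nil {
				return nil, 0, err
			}
			kept++
		}
	}
	if kept == 0 {
		return nil, 0, errors.New("no CERTIFICATE blocks found")
	}
	return out.Bytes(), kept, nil
}

func main() {
	// Placeholder payloads that only exercise the PEM handling.
	proxyCA := []byte("-----BEGIN CERTIFICATE-----\ncHJveHk=\n-----END CERTIFICATE-----\n")
	clusterCA := []byte("-----BEGIN CERTIFICATE-----\nY2x1c3Rlcg==\n-----END CERTIFICATE-----\n")

	bundle, n, err := mergeCABundles(proxyCA, clusterCA)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d certificates in bundle:\n%s", n, bundle)
}
```

The bundle would then be written to a file and its path passed to oc via --certificate-authority.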