Description of problem:

Version-Release number of selected component (if applicable):
openshift-clients-4.3.0-201910250623.git.1.4c88e02.el8.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Trigger an installation with https proxy enabled:

additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  CA CONTENTS
  -----END CERTIFICATE-----
proxy:
  httpProxy: http://AA:BB@10.0.76.148:3128
  httpsProxy: https://AA:BB@10.0.76.148:3130
  noProxy: test.no-proxy.com

2. Trigger an install.
3. Bootstrap failed.

Actual results:
$ openshift-install wait-for bootstrap-complete
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.jialiu-377.qe.devcluster.openshift.com:6443..."
level=info msg="API v1.16.2 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
E1107 05:44:55.764255 30866 reflector.go:280] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=5159&timeoutSeconds=341&watch=true: net/http: TLS handshake timeout
E1107 05:45:06.662764 30866 reflector.go:123] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: EOF

I logged into the bootstrap node and did some debugging.

The bootkube service completed:
# journalctl -f -u bootkube
<--snip-->
Nov 07 10:45:30 jialiu-377-mzdnm-bootstrap-0 bootkube.sh[1940]: bootkube.service complete

But report-progress.sh reports failures:
# journalctl -f --system
Nov 07 11:38:39 jialiu-377-mzdnm-bootstrap-0 report-progress.sh[1941]: error: unable to recognize "STDIN": Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority
Nov 07 11:38:44 jialiu-377-mzdnm-bootstrap-0 report-progress.sh[1941]: error: unable to recognize "STDIN": Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority

report-progress.sh runs an oc command to create the bootstrap-complete event that tells the installer to move on, and that command fails.

I did some checks:
1. The CA cert for the https proxy is already installed on the bootstrap node.
2. Running `oc get node` behind the https proxy gives the same failure.
# oc get node --loglevel=8
I1107 11:24:03.404801 14968 loader.go:375] Config loaded from file: ./kubeconfig
I1107 11:24:03.406214 14968 round_trippers.go:420] GET https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s
I1107 11:24:03.406230 14968 round_trippers.go:427] Request Headers:
I1107 11:24:03.406238 14968 round_trippers.go:431] Accept: application/json, */*
I1107 11:24:03.406250 14968 round_trippers.go:431] User-Agent: oc/v0.0.0 (linux/amd64) kubernetes/$Format
I1107 11:24:03.409424 14968 round_trippers.go:446] Response Status: in 3 milliseconds
I1107 11:24:03.409457 14968 round_trippers.go:449] Response Headers:
I1107 11:24:03.409509 14968 cached_discovery.go:121] skipped caching discovery info due to Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority
I1107 11:24:03.409862 14968 round_trippers.go:420] GET https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s
I1107 11:24:03.409938 14968 round_trippers.go:427] Request Headers:
I1107 11:24:03.409977 14968 round_trippers.go:431] Accept: application/json, */*
I1107 11:24:03.410030 14968 round_trippers.go:431] User-Agent: oc/v0.0.0 (linux/amd64) kubernetes/$Format
I1107 11:24:03.412697 14968 round_trippers.go:446] Response Status: in 2 milliseconds
I1107 11:24:03.412716 14968 round_trippers.go:449] Response Headers:
I1107 11:24:03.412744 14968 cached_discovery.go:121] skipped caching discovery info due to Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority
I1107 11:24:03.412776 14968 shortcut.go:89] Error loading discovery information: Get https://api.jialiu-377.qe.devcluster.openshift.com:6443/api?timeout=32s: proxyconnect tcp: x509: certificate signed by unknown authority

The proxy server log shows the following line:
1573126987.413 0 10.0.96.66 NONE/000 0 NONE error:transaction-end-before-headers - HIER_NONE/- -
3. Tried 'curl 7.64.0' behind the same https proxy: curl loaded the self-signed cert files successfully and could access the api url through the https proxy.
4. Tried 'curl 7.29.0' behind the same https proxy: curl could not access the api url, and the proxy server log showed the same error, "error:transaction-end-before-headers".

Expected results:
Install behind https proxy succeeds.

Additional info:
Install behind http proxy succeeds.
This is blocking QE's testing behind https proxy.
Installer team: My suspicion is that the proxy CA is not making it to bootkube.sh container? oc needs to see the CA in system trust (/etc/ssl/...)
This is `openshift-install` and `oc` utilizing proxy settings. This is managed by setting the appropriate HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables when executing `openshift-install` and `oc`. `oc` specifically has flags to set a certificate authority; `openshift-install` does not, so you'd need to add the authority to your host trust store.

In my opinion, this is not a bug. We cannot assume that because a cluster is being installed behind a proxy, the public API is only accessible via a proxy. We have to rely on the admin configuring their client tools appropriately.
ON_QA to confirm this is possible as described in comment 2, please close if confirmed.
> Installer team: My suspicion is that the proxy CA is not making it to
> bootkube.sh container? oc needs to see the CA in system trust (/etc/ssl/...)

Please note my checklist in comment 0; I already confirmed that:

> 1. ca cert for https proxy already installed on bootstrap node.

> `oc` specifically has flags to set a certificate authority however `openshift-install` does not so you'd need to add the authority to your host trust store.

1. The flag for setting a certificate authority is for the api certs, right? The issue here is that the oc client cannot establish the TLS connection to the proxy server.
2. As noted above, I already confirmed that the https proxy cert files are installed in the bootstrap trust store. The `additionalTrustBundle` setting in install-config.yaml holds the CA content for the https proxy, and `additionalTrustBundle` is indeed meant for installing the user's custom CA into the host trust store. Is my understanding wrong?
3. The `oc` command is being executed on the bootstrap node, and bootstrap is a fully self-booting process. The user has no way to intervene in that process, so I do not think requiring them to do so is acceptable.
Sorry, it wasn't clear to me which of the output was from the host where the installer was running versus debugging efforts on the bootstrap host.

There are a few things going on here. The report-progress.sh errors should've been cleaned up in https://bugzilla.redhat.com/show_bug.cgi?id=1762618, so those should no longer exist.

I agree that `oc` on the bootstrap host is expected to work properly, both in terms of having the environment variables populated for the proxy and having the CA added to the host's trust store. If it doesn't, we run the risk of other components, existing or new, that rely on `oc` breaking.

Can you check on the status of `coreos-update-ca-trust.service`, which is responsible for updating the host's trust store? When you say curl was used, did you supply certificate related arguments or was it loading the bootstrap host's trust store?
(In reply to Scott Dodson from comment #5)
> Sorry, it wasn't clear to me which of the output was from the host where the
> installer was running versus debugging efforts on the bootstrap host.
>
> There's a few things going on here, the report-progress.sh errors should've
> been cleaned up in https://bugzilla.redhat.com/show_bug.cgi?id=1762618 so
> those should no longer exist.

Yeah, the fix for 1762618 would hide the `oc` issue behind the https proxy, but the install would then hit another issue with cloudprovider initialization behind the https proxy. I will launch a new env later to reproduce.

> I agree that `oc` on the bootstrap host is expected to work properly both in
> terms of having the environment variables populated for proxy and having the
> CA added to the host's trust store. If it doesn't we run the risk of
> additional either existing or new components relying on `oc` breaking.

+1. On bootstrap we can use no_proxy to skip this issue, but if a customer runs the oc command on their own host behind an https proxy, they would hit the issue again.

> Can you check on the status of `coreos-update-ca-trust.service` which is
> responsible for updating the host's trust store?

Later I will create a new env to reproduce it, and attach the output.

> When you say curl was used, did you supply certificate related arguments
> or was it loading the bootstrap host's trust store?

Whether curl works behind an https proxy depends on the curl version. An old version of curl does NOT work even when it loads the bootstrap host's trust store. Unfortunately, the curl version on bootstrap is "7.29.0", which does not support https proxies. It is a known issue. So I had to move my testbed to another rhel host and install a newer curl, version "7.64.0". I copied the https proxy CA file from bootstrap's trust store to my test machine and loaded the cert files into that host's trust store, and the new version of curl works. Of course, I also tried the curl command with certificate related arguments supplied. That also works.

So I am almost 100% sure the https proxy certs are loaded into the host trust store without any problem. The issue is probably whether the oc command really loads these installed certs when it connects to the https proxy.

To reproduce only the oc-behind-https-proxy issue, there is no need to run a real install. Just create an https proxy, load the https proxy server's CA into your host's trust store, then run the oc command against a working cluster with the https_proxy ENV exported.
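The symptom can also be sketched without oc or a real proxy. The Go program below is only an illustration, assuming that a Go client applies one TLS config (and so one set of root CAs) to the handshake with an https:// proxy as well as to the target. A local `httptest` TLS server stands in for the proxy, and `api.example.invalid`, `tryThroughProxy`, `runDemo` and `isTrustFailure` are hypothetical names, not anything taken from oc.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"time"
)

// tryThroughProxy sends one HTTPS request through the given https:// proxy,
// trusting only the CAs in roots, and returns the resulting error.
func tryThroughProxy(proxy *url.URL, roots *x509.CertPool) error {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			Proxy: http.ProxyURL(proxy),
			// One TLS config for the whole transport, proxy hop included.
			TLSClientConfig: &tls.Config{RootCAs: roots},
		},
	}
	// Placeholder API URL; it is never resolved locally because the
	// request is handed to the proxy.
	resp, err := client.Get("https://api.example.invalid:6443/api")
	if err == nil {
		resp.Body.Close()
	}
	return err
}

// runDemo returns the error seen without and with the proxy CA trusted.
func runDemo() (withoutCA, withCA error) {
	// Stand-in for the HTTPS proxy: a local TLS server with a self-signed
	// certificate. It refuses CONNECT, so no request fully succeeds.
	proxy := httptest.NewTLSServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			http.Error(w, "no tunnel in this sketch", http.StatusMethodNotAllowed)
		}))
	defer proxy.Close()

	proxyURL, err := url.Parse(proxy.URL)
	if err != nil {
		return err, err
	}

	// Pool that lacks the proxy CA (think: cluster CA only).
	withoutCA = tryThroughProxy(proxyURL, x509.NewCertPool())

	// Pool that also contains the proxy's certificate.
	pool := x509.NewCertPool()
	pool.AddCert(proxy.Certificate())
	withCA = tryThroughProxy(proxyURL, pool)
	return withoutCA, withCA
}

// isTrustFailure reports whether err is the "unknown authority" failure.
func isTrustFailure(err error) bool {
	return err != nil &&
		strings.Contains(err.Error(), "certificate signed by unknown authority")
}

func main() {
	withoutCA, withCA := runDemo()
	fmt.Println("proxy CA missing:", withoutCA, "| trust failure:", isTrustFailure(withoutCA))
	fmt.Println("proxy CA present:", withCA, "| trust failure:", isTrustFailure(withCA))
}
```

In this sketch the first attempt should fail at the proxy handshake with an unknown-authority error, while the second gets past TLS and fails only because the stand-in proxy refuses to tunnel.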
(In reply to Johnny Liu from comment #6)
> If only reproduce oc behind https proxy issue, actually no need run a real
> install. Just create a https proxy, load https proxy server's CA into your
> host's trust store, then run oc command with https_proxy ENV is exported
> against a working cluster.

Yeah, I did that yesterday and I got the errors below, though not the exact same errors you were getting. I'm going to move this back to CLI under the presumption that the trust store and proxy are properly configured. Having access to an environment that matches your reproducer will be key to debugging this. Is there any way you can preserve this environment so that both the CLI and Installer teams can validate things?

Here's what I get when I attempt to use a proxy that's trusted.

sdodson@t460: ~/clusters/test$ KUBECONFIG=$(pwd)/auth/kubeconfig ./oc get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-137-196.ec2.internal   Ready    worker   31m   v1.14.6+c07e432da
ip-10-0-142-151.ec2.internal   Ready    master   37m   v1.14.6+c07e432da
ip-10-0-148-123.ec2.internal   Ready    worker   31m   v1.14.6+c07e432da
ip-10-0-155-112.ec2.internal   Ready    master   36m   v1.14.6+c07e432da
ip-10-0-166-58.ec2.internal    Ready    master   37m   v1.14.6+c07e432da
ip-10-0-169-176.ec2.internal   Ready    worker   31m   v1.14.6+c07e432da

sdodson@t460: ~/clusters/test$ HTTPS_PROXY=https://squid.corp.redhat.com:3128 KUBECONFIG=$(pwd)/auth/kubeconfig ./oc get nodes --loglevel=9
I1114 11:01:34.423336 16257 loader.go:375] Config loaded from file: /home/rdu/sdodson/clusters/test/auth/kubeconfig
I1114 11:01:34.436669 16257 round_trippers.go:423] curl -k -v -XGET -H "Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json" -H "User-Agent: oc/v0.0.0 (linux/amd64) kubernetes/$Format" 'https://api.sdodson.devcluster.openshift.com:6443/api/v1/nodes?limit=500'
I1114 11:01:34.466851 16257 round_trippers.go:443] GET https://api.sdodson.devcluster.openshift.com:6443/api/v1/nodes?limit=500 in 30 milliseconds
I1114 11:01:34.466881 16257 round_trippers.go:449] Response Headers:
I1114 11:01:34.466976 16257 helpers.go:217] Connection error: Get https://api.sdodson.devcluster.openshift.com:6443/api/v1/nodes?limit=500: proxyconnect tcp: tls: first record does not look like a TLS handshake
F1114 11:01:34.467003 16257 helpers.go:114] Unable to connect to the server: proxyconnect tcp: tls: first record does not look like a TLS handshake

Proof that the proxy is trusted, via curl to google.com:

sdodson@t460: ~/clusters/test$ HTTPS_PROXY=https://squid.corp.redhat.com:3128 curl https://google.com/ -v
* About to connect() to proxy squid.corp.redhat.com port 3128 (#0)
*   Trying 10.11.5.35...
* Connected to squid.corp.redhat.com (10.11.5.35) port 3128 (#0)
* Establish HTTP proxy tunnel to google.com:443
> CONNECT google.com:443 HTTP/1.1
> Host: google.com:443
> User-Agent: curl/7.29.0
> Proxy-Connection: Keep-Alive
>
< HTTP/1.0 200 Connection established
<
* Proxy replied OK to CONNECT request
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* SSL connection using TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*   subject: CN=*.google.com,O=Google LLC,L=Mountain View,ST=California,C=US
*   start date: Oct 16 12:36:57 2019 GMT
*   expire date: Jan 08 12:36:57 2020 GMT
*   common name: *.google.com
*   issuer: CN=GTS CA 1O1,O=Google Trust Services,C=US
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: google.com
> Accept: */*
I opened 1772756 to track the cloud provider initialization issue behind the proxy. Since the fix for 1762618 will let my installation complete, I am removing the testblocker keyword.
*** Bug 1779989 has been marked as a duplicate of this bug. ***
Given this is not a testblocker anymore I'm moving this to 4.4, with the possibility of backporting a fix to 4.3.z
From my current investigation it looks like client-go creates its own CertPool with just a single CA, see https://github.com/kubernetes/kubernetes/blob/8fb66ae9655da110f921d97c7e4d7f27e8a88bb5/staging/src/k8s.io/client-go/transport/transport.go#L79-L81, where I think it should rather read the system-wide cert pool and add the necessary bits on top. In other words, https://github.com/kubernetes/kubernetes/blob/8fb66ae9655da110f921d97c7e4d7f27e8a88bb5/staging/src/k8s.io/client-go/transport/transport.go#L171 should use x509.SystemCertPool instead. I still need to confirm my theory.
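To make the idea concrete, here is a minimal Go sketch of "system pool plus extra CAs". It is not client-go's code, only an illustration of the approach using the standard library; `systemPoolWith` is a hypothetical helper name and the optional PEM file argument is an assumption for the demo.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// systemPoolWith returns the host trust store extended with the given PEM
// bundles (for example a cluster CA). Hypothetical helper, not client-go API.
func systemPoolWith(extraPEM ...[]byte) (*x509.CertPool, error) {
	pool, err := x509.SystemCertPool()
	if err != nil || pool == nil {
		// Fall back to an empty pool if the system store cannot be loaded.
		pool = x509.NewCertPool()
	}
	for i, pemBytes := range extraPEM {
		if !pool.AppendCertsFromPEM(pemBytes) {
			return nil, fmt.Errorf("bundle %d: no certificates found", i)
		}
	}
	return pool, nil
}

func main() {
	// Optional argument: path to a PEM file holding an additional CA.
	var extras [][]byte
	if len(os.Args) > 1 {
		caPEM, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		extras = append(extras, caPEM)
	}

	pool, err := systemPoolWith(extras...)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// A client built on this config would trust both the host store
	// (where the proxy CA was installed) and the extra CA.
	cfg := &tls.Config{RootCAs: pool}
	fmt.Println("RootCAs configured:", cfg.RootCAs != nil)
}
```

With a pool built this way, a proxy CA installed in the host trust store would be honoured alongside the CA passed explicitly, which is the behaviour the theory above expects.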
After talking with other developers, it was brought to my attention that client-go is intentionally written so that it uses only a single root CA bundle. The same requirement therefore applies to oc and kubectl. In other words, you need to combine all of the root CAs for both the proxy and the cluster (ideally they are the same) and pass them as a single bundle using the --certificate-authority flag. Having said that, I'm closing this, as it is not a bug but intentional behaviour.