Description of problem:
Cannot log in to an OCP 4.x cluster (reproduced on 4.2.x & 4.3.x) with --insecure-skip-tls-verify=true from macOS when the keychain contains more than 250 certificates with trust policies
Version-Release number of selected component (if applicable):
oc Client Version 4.5.2 or above
Steps to Reproduce:
1. Load around 250 certificates with a trust policy in your macOS keychain
2. oc login -u kubeadmin -p <mypwd> api.myocp.mydomain.local:6443 --insecure-skip-tls-verify=true
Login fails with error message: "error: net/http: TLS handshake timeout"
Login should succeed. Since the user has explicitly asked for server certificates not to be checked by using the option '--insecure-skip-tls-verify=true', the error above, which comes from an attempt to verify the server certificate, should not occur.
- See the following URL for an analysis of this bug (part 1 of the issue): https://github.com/openshift/oc/issues/496
- See the following URL for a proposed PR to fix this bug: https://github.com/openshift/oc/pull/514
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
I saw it was parked for 4.7 so I stopped pushing for the fix. The alternative I'm using is to compile my own version of oc to benefit from the fix...
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.
& happy new year, all the best for 2021 !
I see that you're re-opening this; are you willing/planning to get it integrated in 4.7, or am I getting it wrong?
I had closed the related issue & PR based on my understanding of a previous comment in the PR.
Hello, Happy New Year to you too. I am happy to have it fixed in 4.7 and backported to 4.6 as well. However, I would like to know if there is any workaround available as of now.
I don't know if you looked at the analysis of the problem, in particular the inputs provided in the PR (https://github.com/openshift/oc/pull/514) on Sept 9, 2020. The reason for the problem is that even when the user passes the option "--insecure-skip-tls-verify=true", oc makes a first HTTPS request to the cluster and tries to validate the certificate. That validation should not happen, as the user explicitly requested by adding insecure-skip-tls-verify=true (which is the reason for this bug report), and it can be fixed with exactly one line of code.
The root cause of the bug is that on macOS, certificate verification as implemented in the Go libraries can take a very long time; as written in the PR updates from Sep 9 (https://github.com/openshift/oc/pull/514#issuecomment-689348366): "...the duration of the checking is mostly coming from certificates with trust policies..."
- So one workaround is to delete some (unused) certificates with trust policies from the keychain on the end user's Mac, so that the checking takes less time, the timeout does not occur, and the connection can be made.
Two other workarounds are also mentioned in the initial issue that led to the PR mentioned above and to this bug report, see here: https://github.com/openshift/oc/issues/496
- Run 'sudo oc ...' from the Mac; in this case certificate verification only applies to a subset of the certificates in the keychain (the ones owned by root) and the timeout does not occur (post from July 22nd: https://github.com/openshift/oc/issues/496#issuecomment-662164068)
- Or pass a cacert.pem file as a parameter when invoking oc (even a wrong one) to get past the "hurdle" of the first HTTPS HEAD request that fails, then continue by answering 'Use insecure connections? (y/n): y' (see here: https://github.com/openshift/oc/issues/496#issuecomment-662177056)
Hope this helps.
Lowering the priority to reflect the customer case.
Thanks Frank. We tried the following, but it still did not work. We will try the others and update here.
oc login -u kubeadmin -p ##### --server=https://api.ocp1.example.com:6443 --certificate-authority=./ca.pem --loglevel=9
I0112 14:09:19.747241 11299 loader.go:375] Config loaded from file: /Users/a600000/.kube/config
I0112 14:09:19.747911 11299 round_trippers.go:423] curl -k -v -XHEAD 'https://api.ocp1.example.com:6443/'
I0112 14:09:29.912408 11299 round_trippers.go:443] HEAD https://api.ocp1.example.com:6443/ in 10164 milliseconds
I0112 14:09:29.912478 11299 round_trippers.go:449] Response Headers:
F0112 14:09:29.912563 11299 helpers.go:115] error: net/http: TLS handshake timeout
Do you have the problem at a customer site, or are you trying to reproduce the problem on your test system?
Have you tried:
GODEBUG=x509roots=1 oc login -u kubeadmin -p <myPassword> api.<myCluster>.<myDomain>:6443
to make sure this is the same problem?
(In reply to Franck Boudinet from comment #15)
> do you have the problem at a customer site, or are you trying to reproduce
> the problem on your test system ?
> Have you tried:
> GODEBUG=x509roots=1 oc login -u kubeadmin -p <myPassword>
> to make sure this is the same problem ?
Frank, the problem is at a customer site. I do not have a Mac to test this.
Reading through https://github.com/golang/go/issues/18203#issuecomment-273005127 I found
a suggestion to try CGO. We have another issue with Mac + VPN (https://bugzilla.redhat.com/show_bug.cgi?id=1838512)
for which we have an open fix: https://github.com/openshift/oc/pull/644.
Unfortunately, we can't merge that yet; see the PR comments.
Can either of you verify if compiling oc with CGO_ENABLED=1 helps with this issue?
A similar issue popped up in https://bugzilla.redhat.com/show_bug.cgi?id=1917756
*** Bug 1917756 has been marked as a duplicate of this bug. ***
(In reply to Maciej Szulik from comment #18)
> Reading through
> https://github.com/golang/go/issues/18203#issuecomment-273005127 I found
> a suggestion to try CGO, we have another issue with mac + vpn
> for which we have an open fix https://github.com/openshift/oc/pull/644.
> Unfortunately, we can't merge that yet, see the PR comments.
> Can either of you verify if compiling oc with CGO_ENABLED=1 helps with this
I don't see any difference. It still fails with the same error.
oc login --token=##### --server=https://api.os4.xxx.yyy.com:6443 --loglevel=10
I0120 09:32:09.274984 84808 loader.go:375] Config loaded from file: /Users/a6002246/.kube/config
I0120 09:32:09.275441 84808 round_trippers.go:423] curl -k -v -XHEAD 'https://api.os4.xxx.yyy.com:6443/'
I0120 09:32:20.029254 84808 round_trippers.go:443] HEAD https://api.os4.xxx.yyy.com:6443/ in 10753 milliseconds
I0120 09:32:20.029311 84808 round_trippers.go:449] Response Headers:
F0120 09:32:20.029379 84808 helpers.go:115] error: net/http: TLS handshake timeout
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc000639d40, 0x55, 0xb7)
k8s.io/klog/v2.(*loggingT).output(0x53defc0, 0xc000000003, 0x0, 0x0, 0xc0015b44d0, 0x5183380, 0xa, 0x73, 0x100b300)
k8s.io/klog/v2.(*loggingT).printDepth(0x53defc0, 0x3, 0x0, 0x0, 0x2, 0xc00123f5e0, 0x1, 0x1)
k8s.io/kubectl/pkg/cmd/util.fatal(0xc0005b80f0, 0x26, 0x1)
k8s.io/kubectl/pkg/cmd/util.checkErr(0x3e29860, 0x54137c0, 0x3b84f40)
github.com/openshift/oc/pkg/cli/login.NewCmdLogin.func1(0xc00110edc0, 0xc0015819b0, 0x0, 0x3)
github.com/spf13/cobra.(*Command).execute(0xc00110edc0, 0xc001581980, 0x3, 0x3, 0xc00110edc0, 0xc001581980)
github.com/spf13/cobra.(*Command).ExecuteC(0xc00110eb00, 0x2, 0xc00110eb00, 0x2)
goroutine 6 [chan receive]:
created by k8s.io/klog/v2.init.0
goroutine 49 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x3b84e68, 0x3e24d60, 0xc0005a2000, 0x1, 0xc000060420)
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x3b84e68, 0x12a05f200, 0x0, 0xc000d05101, 0xc000060420)
created by k8s.io/component-base/logs.InitLogs
goroutine 50 [select]:
io.(*pipe).Read(0xc001391920, 0xc0005ae000, 0x1000, 0x1000, 0x349ec20, 0x1, 0xc0005ae000)
io.(*PipeReader).Read(0xc0002fa250, 0xc0005ae000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
github.com/openshift/oc/pkg/cli/admin/mustgather.newPrefixWriter.func1(0xc00147e400, 0x3e26720, 0xc00000e018, 0x39a8830, 0x17)
created by github.com/openshift/oc/pkg/cli/admin/mustgather.newPrefixWriter
Did you try with export GODEBUG=x509roots=1 to really make sure that the problem you're facing is similar to the one I described in my issue? You should see the certificates being checked one by one, and at some point the timeout occurring because there are too many of them.
Also, may I ask which version of oc you're using? I had added a lot of traces at the time I investigated; I can check whether I still have the VM where I did the cross-compile to generate the Mac oc build with those traces, and provide it to you if that helps.
Yes, we tried this already and it fails too.
GODEBUG=x509roots=1 oc login --token=### --server=https://api.os4.xxx.yyy.com:6443 --loglevel=9
I0115 09:45:13.504639 84481 loader.go:375] Config loaded from file: /Users/000000/.kube/config
I0115 09:45:13.506198 84481 round_trippers.go:423] curl -k -v -XHEAD 'https://api.os4.xxx.yyy.com:6443/'
I0115 09:45:24.032962 84481 round_trippers.go:443] HEAD https://api.os4.test.xxx.yyy.com:6443/ in 10526 milliseconds
I0115 09:45:24.032993 84481 round_trippers.go:449] Response Headers:
F0115 09:45:24.033054 84481 helpers.go:115] error: net/http: TLS handshake timeout
We are using oc 4.6.9
Based on what I read above, my conclusion is that:
- either you ran the test with GODEBUG=x509roots=1 exported on Linux (I confirm that no additional traces are displayed on Linux with this option, and there is no point doing that, because the problem I reported was on Mac only),
- or the problem you are having is different from the one I reported.
I just double-checked with the oc client 4.6.9 on both Linux & Mac and:
- on Linux the option GODEBUG=x509roots=1 does not display any additional traces, as mentioned above
- but on Mac it does; see below an example of the trace your customer should get:
GODEBUG=x509roots=1 oc login -u <username> -p <password> https://<myclusterapiurl>:6443 --loglevel=9
I0122 18:23:13.269377 98842 loader.go:375] Config loaded from file: /Users/fb/.kube/config
I0122 18:23:13.269766 98842 round_trippers.go:423] curl -k -v -XHEAD 'https://<myclusterapiurl>:6443/'
crypto/x509: trust settings for CN=localhost,OU=defaultServer,O=ibm,C=us: 4
crypto/x509: trust settings for CN=localhost:2998,OU=garage,O=hal,L=nice,ST=paca,C=fr,1.2.840.113518.104.22.168=#0c0f66636b626f407961686f6f2e636f6d: 1
crypto/x509: trust settings for CN=sma-unleash.ratl.swg.usma.ibm.com,O=IBM: 4
BTW, I can also see that, compared to when I raised the issue using 4.5.x, the oc client behavior has changed in that respect, most probably due to updates in the Go libraries used by oc 4.6.9. Here is the trace using oc client 4.5.2, trying to connect to the same OCP server from exactly the same Mac and user:
oc.4.5.2 login -u <username> -p <password> https://<myclusterapiurl>:6443 --loglevel=9
I0122 18:32:31.529731 99156 loader.go:375] Config loaded from file: /Users/fb/.kube/config
I0122 18:32:31.531035 99156 round_trippers.go:423] curl -k -v -XHEAD 'https://<myclusterapiurl>:6443/'
crypto/x509: 189 certs have a trust policy
crypto/x509: verify-cert approved CN=IBM JSS Built-in Certificate Authority
crypto/x509: verify-cert approved CN=IBM Internal Root CA,O=International Business Machines Corporation,C=US
crypto/x509: verify-cert approved CN=ucopia.zonezei,OU=DGSI,O=DA,L=St Cloud,ST=St Cloud,C=FR
crypto/x509: verify-cert approved CN=icp-router,O=IBM Cloud Private,L=Xi'an,ST=Shaanxi,C=CN
crypto/x509: verify-cert approved CN=http://openvpn.net/localca.html #1567088347
crypto/x509: verify-cert rejected CN=22.214.171.124,OU=DeviceSSL (WebAuth),O=Cisco Systems Inc.,C=US: "Cert Verify Result: CSSMERR_TP_NOT_TRUSTED"
crypto/x509: verify-cert approved CN=www.ibm.com,O=IBM Cloud Private,L=Armonk,ST=New York,C=US
crypto/x509: verify-cert approved CN=IBM CA,O=International Business Machines Corporation,C=US
You can see that the output is different.
So please confirm that you're getting the 'crypto/x509:' kind of traces when running oc login from a Mac with GODEBUG=x509roots=1 exported. If yes, then hopefully this will help confirm that the issue is due to TLS certificate checking taking too long because of too many certificates; if not, then the problem you're having is different from the one I raised.
Have a good weekend.
I spoke with other folks, we'll try to fix the bug in oc in 4.7
Due to other commitments we weren't able to get that resolved for 4.7; I'm pushing this over to 4.8.
See the PR from the initial comment, I spoke with Standa and we've decided to go with that fix.
I have opened a new PR to make this change at github.com/openshift/oc/pull/746
Confirmed with the latest oc version:
Zhous-MacBook-Pro:~ zhouying$ oc version --client -o yaml
Zhous-MacBook-Pro:~ zhouying$ time oc login https://api.yinzhou8aws.qe.devcluster.openshift.com:6443 -u pm305 --insecure-skip-tls-verify
WARNING: Using insecure TLS client config. Setting this option is not supported!
Logged into "https://api.yinzhou8aws.qe.devcluster.openshift.com:6443" as "pm305" using existing credentials.
You don't have any projects. You can try to create a new project, by running
oc new-project <projectname>
I still have not had a chance to check with the upstream auth folks about the proper way (if any) that we can allow both of these flags to be set.
See https://bugzilla.redhat.com/show_bug.cgi?id=1935102#c2 for more context on why this was reverted. It appears the limitation is not just oc (as previously thought in this bugzilla) but an internal k8s client restriction that will require more work.