Description of problem:
In https://github.com/openshiftio/openshift.io/issues/2755 users reported seeing ERR_CERT_COMMON_NAME_INVALID occasionally. These requests are for deployments sitting on the dsaas.openshift.com OSD cluster. Upon investigation I hit one of our endpoints (api.openshift.io) 1264 times with a 1s delay. 1257 times the openshift.io wildcard certificate was returned; the remaining 7 served the openshiftapps.com wildcard.

Version-Release number of selected component (if applicable):
OpenShift Master: v3.7.23
Kubernetes Master: v1.7.6+a08f5eeb62

How reproducible:
while true; do echo "Q" | openssl s_client -servername api.openshift.io -connect api.openshift.io:443 | grep subject >> certlog ; sleep 1; done

Steps to Reproduce:
1. Run the above
2. grep through certlog for the returned certificate.

Actual results:
openshiftapps certificate returned 7/1264 times

Expected results:
openshift.io certificate returned 1264/1264 times

Additional info:
Did a new run today. Out of 3042 requests on a 3s sleep we got 12 bad replies.
Can this be reproduced on your side too? Any ideas on the cause? Router reload, etc.?
We suspect that browser side caching of certificates is causing the impact to be wider than first thought.
How is the route for api.openshift.io set up? What cert should it be serving? I ran it for 2,997 tests and only saw *.openshift.io. Are you still running OpenShift 3.7?
-----------
apiVersion: v1
kind: Route
metadata:
  labels:
    service: core
  name: core-api-route
spec:
  host: api.openshift.io
  port:
    targetPort: 8080
  tls:
    caCertificate: |
      REDACTED
    certificate: |
      REDACTED
    key: |
      REDACTED
    insecureEdgeTerminationPolicy: Redirect
    termination: edge
  to:
    kind: Service
    name: core
    weight: 100
  wildcardPolicy: None
-----------

Nothing unusual. This is the dsaas OpenShift Dedicated cluster. It is running v3.7.23 / v1.7.6+a08f5eeb62.

Could there be a very short time window when the routers deploy/reload during which the haproxy mappings are not available, which would cause a request to go to the default route/backend? The fact that we sometimes see the openshiftapps.com certificate is suspicious.
The cluster has 3 infra nodes but only 2 router replicas. Could that be a problem?
No, as long as the loadbalancer is only hitting the two routers it is fine. It _could_ be the router reload, but that usually results in connections that fail to connect, not cert errors. (BTW that is fixed in 3.9) But it is more likely that the pod is failing health checks which may result in haproxy removing the endpoint, and then you fall through and return a 404 with the default certificate. We could turn off health checks on that route to see if that makes a difference... Try setting the annotation router.openshift.io/haproxy.health.check.interval to 7d (i.e. 7 days) so that it is effectively disabled. https://docs.openshift.com/container-platform/3.7/architecture/networking/routes.html has the relevant docs.
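If it helps, setting that annotation should be as simple as something like this (a sketch; core-api-route is the route name from the YAML above, and the namespace placeholder is whatever project it lives in):

# disable the route health check by stretching the interval to 7 days
oc annotate route core-api-route -n <namespace> \
    router.openshift.io/haproxy.health.check.interval=7d --overwrite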
I tried to reproduce this today using the command in the description and I was unable to. I ran it both from my local workstation and another machine in us-east-1. Between the two machines I ran the check 16203 times and saw the *.openshift.io CN every time. To rule out a transient issue with the router pods, I rescaled them at UTC 20:11:29 today.
I ran it (openssl) overnight (20:30-09:30 CEST) and hit it 66 / 13188 times.

I've had a hard time reproducing with curl, for whatever reason. I wanted to see what status code was actually served when doing a full HTTP request (and not just an SSL handshake). I had a go at disabling client connection reuse etc. to see if that helped, but to no avail. This led me to think that possibly the server is reusing the connection.

So just to make sure with regards to your reproduction efforts, are you doing full HTTP requests here or only the SSL handshake?
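For reference, something like this is what I mean by a full HTTP request - a rough sketch; each curl invocation opens a fresh connection, and I send Connection: close to be explicit:

while true; do
  # -v prints both the negotiated certificate subject and the response status line
  curl -sv -H 'Connection: close' -o /dev/null https://api.openshift.io/ 2>&1 \
    | grep -E 'subject:|< HTTP/' >> curllog
  sleep 1
done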
I ran:

ab -n 50000 -c 100 https://api.openshift.io/

Out of the 50,000 runs all of them returned 404s, but none returned the wrong cert. Is there a URL I can hit that doesn't return 404?

For funsies, I tried both of the ELB addresses:

ab -n 10000 -c 100 -H 'Host: api.openshift.io' https://52.86.181.249/
ab -n 10000 -c 100 -H 'Host: api.openshift.io' https://54.164.160.217/

And they also returned the same :-(

I am not using keepalive.
You could hit https://openshift.io/_gettingstarted. Got the openshiftapps certificate in 2/693 attempts today - but not doing full HTTP requests. Will give that a try too.
From testing it doesn't look like my version of ab would catch certificate errors. Running the following produces no error for me. I do, however, see results from people wanting to disable the certificate checks, so that's a bit confusing.

ab -e file -n 1 -c 1 https://wrong.host.badssl.com/

Using httpd-tools-2.4.6-67.el7.centos.x86_64
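For comparison, a plain curl against the same host should fail with an SSL verification error (the exact exit code and message vary by curl version):

curl -sS -o /dev/null https://wrong.host.badssl.com/ ; echo "exit: $?"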
Still nothing :-(

Overnight, I ran:

$ while true; do echo "Q" | openssl s_client -servername api.openshift.io -connect api.openshift.io:443 | grep subject >> certlog ; sleep .1; done

Then I see:

$ uniq -c certlog
168092 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io

So... 168,092 clean responses. What is different between my test case (or the network that my test machine is on) and yours? I'm stumped.
I'm baffled as well. I have run 110660 requests on another machine and 315870 on yet another - all return the correct certificate. I thought I had a good reproducer here, but apparently that only works from my laptop (!?). There is nothing really extraordinary about my connection: 50mbit fiber with a single layer of NAT, in use exclusively by myself. Tried the exact same command you ran overnight and hit it 2 times in 629. The github issue speaks to the fact that others are seeing it too, but getting it consistently reproduced is proving to be tricky.
I just got it from another machine which is bare metal with a public IP address. Also located in Norway, but on a separate ISP. Common to the two machines I've now seen it on is CentOS 7. Nothing really makes sense at this point, but at least it's data.
I've refactored the test script a bit to add timestamps and to output only when a bad cert is returned. I can observe this issue a LOT when running from centos7, centos6 and fedora21 containers (openssl-1.0.z), but no issues at all when running from the fedora27 host, a fedora27 container or an alpine container, which are on newer openssl versions...

From the user side, I believe Paul has asked folks to provide more info about their OS version, browser and openssl version if possible. It looks like what we are seeing may be down to an openssl issue which is maybe triggered by something on the router side? I have a feeling this is SNI related, but looking at the haproxy config template I don't see anything that stands out to me as a red flag. My haproxy expertise is also fairly limited.

I've made notes here: https://gist.github.com/jfchevrette/3fe297e7df68abeed7937326dc52c541
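For reference, a minimal sketch of that kind of loop (the actual script and notes are in the gist above; hostname, expected CN and sleep interval are whatever you are testing):

#!/bin/bash
# Print a timestamp whenever the returned certificate is NOT the expected wildcard.
HOST=api.openshift.io
EXPECTED='*.openshift.io'   # substring of the expected subject; works across openssl output formats
while true; do
  subject=$(echo "Q" | openssl s_client -servername "$HOST" -connect "$HOST:443" 2>/dev/null \
            | openssl x509 -noout -subject)
  case "$subject" in
    *"$EXPECTED"*) ;;                                   # expected cert, stay quiet
    *) echo "$(date -u +%FT%TZ) BAD CERT: $subject" ;;  # anything else, log it with a timestamp
  esac
  sleep 0.1
done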
I can also observe the same pattern when sending requests to a route on the rh-idev cluster.
Paul: Since you can reproduce it, can you run tcpdump to capture the traffic and get it to me please? If you can set port and ip filters so tcpdump doesn't grab everything, that would be even better. Thanks
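Something along these lines would do - assuming eth0 is your outbound interface, and filtering on the two ELB addresses from my earlier ab runs so the capture stays small (the output filename is arbitrary):

tcpdump -i eth0 -s 0 -w openshiftio.pcap 'port 443 and (host 52.86.181.249 or host 54.164.160.217)'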
Ok, so I ran:

docker run --rm -it fedora:21 bash
yum install -y openssl
while true; do echo "Q" | openssl s_client -servername api.openshift.io -connect api.openshift.io:443 | grep subject >> certlog ; sleep .1; done
uniq -c certlog

And got:

684 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io
1 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.09b5.dsaas.openshiftapps.com
350 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io

So... it is something about the older version of openssl that makes it happen... No earthly clue what. But it's a lead.
Glad you see the same Ben. The reporters on the github issue are all seeing this with OpenSSL 0.9.8zh on various versions of Chrome on macOS 10.12.6. I am not entirely sure Chrome is linked/built against openssl on macOS, but I will have access to one I can poke around on. Posting these as leads regardless.
We're going to compare requests across different versions to see if we can see why they would behave differently. But we're grasping at straws, so yeah, any info you have is helpful.
Chrome on macOS uses BoringSSL, which is a fork of OpenSSL.
Argh. So I can get it to happen about 1/1000 times against api.openshift.io with Fedora 21. I can see no difference between a working request and a broken one. I can not reproduce it if I run a local router and hit it with Fedora 21 :-( Anyway, I just wanted to post a status update, even though I don't have much to show yet.
Weibin, can you try to reproduce this please?
I saw the same issue:

openssl.x86_64 1:1.0.1k-12.fc21
Fedora 21

bash-4.3# uniq -c certlog
3027 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io
1 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.09b5.dsaas.openshiftapps.com
1785 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io
bash-4.3#
I can not reproduce the problem when running a local router and hitting it with Fedora 21; no luck on either 3.7 or 3.10.

## v3.7
bash-4.3# uniq -c certlog
18924 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
bash-4.3# uniq -c certlog
294392 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
bash-4.3# uniq -c certlog
502396 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com

## v3.10
bash-4.3# uniq -c certlog
386478 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
bash-4.3# uniq -c certlog
647481 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
bash-4.3# uniq -c certlog
3183091 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
Still observed against OpenShift 3.11.43 (different cluster) on CentOS Linux release 7.6.1810 using OpenSSL 1.0.2k-fips.

7 subject=/CN=*.<redacted>.<redacted>.openshiftapps.com
47124 subject=/CN=<redacted>.com

Removing the actual URLs because we don't want this (prod) endpoint to be used for reproducing.
It seems clear we don't intend to do anything else with this report. I'm going to close it due to age unless somebody has an argument why we should keep it open. Thanks.
For anyone else that comes across this in the future, it looks like this was likely caused by the issue that has since been documented here: https://access.redhat.com/solutions/5603871

I was able to reproduce it far more reliably by simulating packet loss at the client with:

tc qdisc add dev <device_name> root netem loss 60%

I could see this issue in 3-4% of the requests with that simulated packet loss (note this is on OCP 4.x).
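For completeness, the rough sequence was something like the following - a sketch; substitute your own interface name and the route hostname you are testing against:

# add 60% packet loss on the client's outbound interface
tc qdisc add dev eth0 root netem loss 60%

# run the same openssl loop from the description against the route under test
while true; do echo "Q" | openssl s_client -servername <route-hostname> -connect <route-hostname>:443 2>/dev/null | grep subject >> certlog ; sleep 1; done

# remove the simulated packet loss when finished
tc qdisc del dev eth0 root netem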