Bug 1566595 - dsaas routers sometimes returning the wrong certificate
Summary: dsaas routers sometimes returning the wrong certificate
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.z
Assignee: Weibin Liang
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-12 15:10 UTC by Paul Bergene
Modified: 2022-08-04 22:20 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-11 00:08:04 UTC
Target Upstream Version:
Embargoed:



Description Paul Bergene 2018-04-12 15:10:11 UTC
Description of problem:
In https://github.com/openshiftio/openshift.io/issues/2755 users reported seeing ERR_CERT_COMMON_NAME_INVALID occasionally.

These requests are for deployments hosted on the dsaas.openshift.com OSD cluster.

Upon investigation I hit one of our endpoints (api.openshift.io) 1264 times with a 1s delay. 1257 times the openshift.io wildcard certificate was returned; the remaining 7 served the openshiftapps.com wildcard.


Version-Release number of selected component (if applicable):

OpenShift Master: v3.7.23
Kubernetes Master: v1.7.6+a08f5eeb62

How reproducible:

while true; do echo "Q" | openssl s_client -servername api.openshift.io -connect api.openshift.io:443 | grep subject >> certlog ; sleep 1; done

Steps to Reproduce:
1. Run the above
2. grep through certlog for the returned certificates (see the example below).
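
For example, a quick way to summarize which certificate subjects were served (a sketch):

  sort certlog | uniq -c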

Actual results:
openshiftapps certificate returned 7/1264 times


Expected results:
openshift.io certificate returned 1264/1264 times

Additional info:

Comment 1 Paul Bergene 2018-04-13 10:07:09 UTC
Did a new run today.  Out of 3042 requests on a 3s sleep we got 12 bad replies.

Comment 2 Paul Bergene 2018-04-26 13:20:00 UTC
Can this be reproduced on your side too? Any ideas on the cause? Router reload, etc.?

Comment 3 Paul Bergene 2018-04-26 13:35:42 UTC
We suspect that browser side caching of certificates is causing the impact to be wider than first thought.

Comment 4 Ben Bennett 2018-04-26 19:45:48 UTC
How is the route for api.openshift.io set up?  What cert should it be serving?

I ran it for 2,997 tests and only saw *.openshift.io.

Are you still running OpenShift 3.7?

Comment 5 jchevret 2018-04-26 20:01:52 UTC
-----------
apiVersion: v1
kind: Route
metadata:
  labels:
    service: core
  name: core-api-route
spec:
  host: api.openshift.io
  port:
    targetPort: 8080
  tls:
    caCertificate: |
      REDACTED
    certificate: |
      REDACTED
    key: |
      REDACTED
    insecureEdgeTerminationPolicy: Redirect
    termination: edge
  to:
    kind: Service
    name: core
    weight: 100
  wildcardPolicy: None
-----------

Nothing unusual. This is the dsaas OpenShift Dedicated cluster. It is running v3.7.23 / v1.7.6+a08f5eeb62.

Could there be a very short time window during router deploy/reload when the haproxy mappings are not available, which would cause a request to go to the default route/backend?

The fact we sometimes see the openshiftapps.com certificate is suspicious.

Comment 6 jchevret 2018-04-26 20:09:09 UTC
The cluster has 3 infra nodes but only 2 router replicas. Could that be a problem?

Comment 7 Ben Bennett 2018-04-26 20:31:00 UTC
No, as long as the load balancer is only hitting the two routers it is fine.

It _could_ be the router reload, but that usually results in connections that fail to connect, not cert errors.  (BTW that is fixed in 3.9)

But it is more likely that the pod is failing health checks, which may result in haproxy removing the endpoint; the request then falls through to the default backend and returns a 404 with the default certificate.

We could turn off health checks on that route to see if that makes a difference... Try setting the annotation router.openshift.io/haproxy.health.check.interval to 7d (i.e. 7 days) so that it is effectively disabled.
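
A minimal sketch of setting that annotation with oc, assuming the core-api-route route from comment 5 (run against its namespace):

  oc annotate route core-api-route router.openshift.io/haproxy.health.check.interval=7d --overwrite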

https://docs.openshift.com/container-platform/3.7/architecture/networking/routes.html has the relevant docs.

Comment 8 bmorriso 2018-04-26 21:08:31 UTC
I tried to reproduce this today using the command in the description and I was unable to. I ran it both from my local workstation and another machine in us-east-1. Between the two machines I ran the check 16203 times and saw the *.openshift.io CN every time. 

To rule out a transient issue with the router pods, I rescaled them at UTC 20:11:29 today.

Comment 9 Paul Bergene 2018-04-27 08:08:55 UTC
I ran it (openssl) overnight (20:30-09:30 CEST) and hit it 66 / 13188 times. 

I've had a hard time reproducing with curl, for whatever reason.  I wanted to see what status code was actually served when doing a full HTTP request (and not just an SSL handshake).
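
Roughly the kind of check I have in mind (a sketch, not the exact invocation): one full HTTPS request per iteration, logging the served certificate subject and the HTTP status line taken from curl's verbose output.

  while true; do
    date -u >> curl-certlog
    curl -sv -o /dev/null https://api.openshift.io/ 2>&1 \
      | grep -E 'subject:|^< HTTP' >> curl-certlog
    sleep 1
  done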

Had a go at disabling client connection reuse etc. to see if that helped, but to no avail. This led me to think that possibly the server is reusing the connection.

So just to make sure with regards to your reproduction efforts, are you doing full http requests here or only the SSL handshake?

Comment 10 Ben Bennett 2018-04-30 14:16:30 UTC
I ran: ab -n 50000 -c 100 https://api.openshift.io/

Out of the 50,000 runs all of them returned 404s, but none returned the wrong cert.

Is there a URL I can hit that doesn't return 404?

For funsies, I tried both of the ELB addresses:
 ab -n 10000 -c 100 -H 'Host: api.openshift.io' https://52.86.181.249/
 ab -n 10000 -c 100 -H 'Host: api.openshift.io' https://54.164.160.217/

And they also returned the same :-(

I am not using keepalive.

Comment 11 Paul Bergene 2018-04-30 14:32:21 UTC
You could hit https://openshift.io/_gettingstarted.  Got the openshiftapps certificate in 2/693 attempts today - but not doing full HTTP requests.  Will give that a try too.

Comment 12 Paul Bergene 2018-04-30 15:15:35 UTC
From testing, it doesn't look like my version of ab would catch certificate errors.  Running the following produces no error for me.  I do, however, see search results from people wanting to disable the certificate checks, so that's a bit confusing.

ab -e file -n 1 -c 1 https://wrong.host.badssl.com/

Using httpd-tools-2.4.6-67.el7.centos.x86_64

Comment 13 Ben Bennett 2018-05-01 14:19:59 UTC
Still nothing :-(

Overnight, I ran:

 $ while true; do echo "Q" | openssl s_client -servername api.openshift.io -connect api.openshift.io:443 | grep subject >> certlog ; sleep .1; done

Then I see:

 $ uniq -c certlog 
 168092 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io

So... 168,092 clean responses.

What is different between my test case (or the network that my test machine is on) and yours?  I'm stumped.

Comment 14 Paul Bergene 2018-05-02 07:52:58 UTC
I'm baffled as well.  I have run 110660 requests on another machine and 315870 on yet another - all return the correct certificate.

I thought I had a good reproducer here, but apparently that only works from my laptop (!?).

There is nothing really extraordinary about my connection: 50 Mbit fiber with a single layer of NAT, used exclusively by myself.

Tried the exact same command you ran overnight and hit it 2 times in 629.

The github issue speaks to the fact that others are seeing it too, but getting it consistently reproduced is proving to be tricky.

Comment 15 Paul Bergene 2018-05-02 14:43:26 UTC
I just got it from another machine which is bare metal with a public IP address. Also located in Norway, but on a separate ISP.

What the two machines I've now seen it on have in common is CentOS 7.  Nothing really makes sense at this point, but at least it's data.

Comment 16 jchevret 2018-05-02 16:07:42 UTC
I've refactored the test script a bit to add timestamps and output only when bad cert is returned.
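
A rough sketch of that kind of check (the full script is in the gist linked at the end of this comment), assuming the expected CN is *.openshift.io:

  while true; do
    subj=$(echo "Q" | openssl s_client -servername api.openshift.io \
             -connect api.openshift.io:443 2>/dev/null | grep subject)
    case "$subj" in
      *'CN=*.openshift.io'*) ;;                             # expected wildcard, stay quiet
      *) echo "$(date -u +%FT%TZ) $subj" >> badcertlog ;;   # timestamp + unexpected subject
    esac
    sleep 1
  done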

I can observe this issue a LOT when running from centos7, centos6 and fedora21 containers (openssl-1.0.z), but no issues at all when running from the fedora27 host, a fedora27 container or an alpine container, which are on newer openssl versions...

From the user side I believe Paul has asked folks to provide more info about their OS version, Browser and openssl version if possible.

It looks like what we are seeing may be down to an openssl issue, perhaps triggered by something on the router side?

I have a feeling this is SNI related, but looking at the haproxy config template I don't see anything that stands out to me as a red flag. My haproxy expertise is also fairly limited.

I've made notes here:
https://gist.github.com/jfchevrette/3fe297e7df68abeed7937326dc52c541

Comment 17 jchevret 2018-05-02 16:19:28 UTC
I can also observe the same pattern when sending requests to a route on the rh-idev cluster.

Comment 18 Ben Bennett 2018-05-03 15:24:35 UTC
Paul: Since you can reproduce it, can you run tcpdump to capture the traffic and get it to me please?  If you can set port and ip filters so tcpdump doesn't grab everything, that would be even better.
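
For example, something along these lines (a sketch; the interface name is a placeholder):

  tcpdump -i <interface> -s 0 -w api-openshift-io.pcap 'host api.openshift.io and tcp port 443'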

Thanks

Comment 20 Ben Bennett 2018-05-08 20:13:54 UTC
Ok, so I ran:

docker run --rm -it fedora:21 bash
yum install -y openssl
while true; do echo "Q" | openssl s_client -servername api.openshift.io -connect api.openshift.io:443 | grep subject >> certlog ; sleep .1; done
uniq -c certlog 

And got:
    684 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io
      1 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.09b5.dsaas.openshiftapps.com
    350 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io

So... it is something about the older version of openssl that makes it happen...  No earthly clue what.  But it's a lead.

Comment 21 Paul Bergene 2018-05-09 07:14:39 UTC
Glad you see the same, Ben.  The reporters on the github issue are all seeing this with OpenSSL 0.9.8zh on various versions of Chrome on Mac OS 10.12.6.

I am not entirely sure Chrome is linked / built with openssl on mac os, but I will have access to one I can poke around on. 

Posting as leads regardless.

Comment 22 Ben Bennett 2018-05-09 17:06:07 UTC
We're going to compare requests across different versions to see if we can see why they would behave differently.  But we're grasping at straws, so yeah, any info you have is helpful.

Comment 23 Paul Bergene 2018-05-14 07:15:43 UTC
Chrome on macOS is using BoringSSL, which is a fork of OpenSSL.

Comment 24 Ben Bennett 2018-05-15 14:20:56 UTC
Argh.  So I can get it to happen about 1/1000 times against api.openshift.io with Fedora 21.  I can see no difference between a working request and a broken one.

I can not reproduce it if I run a local router and hit it with Fedora 21 :-(

Anyway, I just wanted to post a status update, even though I don't have much to show yet.

Comment 25 Ben Bennett 2018-05-23 13:42:02 UTC
Weibin, can you try to reproduce this please?

Comment 26 Weibin Liang 2018-05-23 19:24:06 UTC
I saw the same issue:

openssl.x86_64 1:1.0.1k-12.fc21  
Fedora 21

bash-4.3# uniq -c certlog 
   3027 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io
      1 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.09b5.dsaas.openshiftapps.com
   1785 subject=/C=US/ST=North Carolina/L=Raleigh/O=Red Hat Inc./CN=*.openshift.io
bash-4.3#

Comment 27 Weibin Liang 2018-05-30 13:28:56 UTC
I cannot reproduce the problem when running a local router and hitting it with Fedora 21; no luck on either 3.7 or 3.10.


## v3.7
bash-4.3# uniq -c certlog 
  18924 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
bash-4.3# uniq -c certlog 
 294392 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
bash-4.3# uniq -c certlog 
 502396 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com


## v3.10
bash-4.3# uniq -c certlog 
 386478 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
bash-4.3# uniq -c certlog 
 647481 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com
bash-4.3# uniq -c certlog 
3183091 subject=/C=US/ST=CA/L=Mountain View/O=OS3/OU=Eng/CN=red.example.com

Comment 28 Paul Bergene 2019-08-29 07:40:36 UTC
Still observed against OpenShift 3.11.43 (a different cluster) on CentOS Linux release 7.6.1810 using OpenSSL 1.0.2k-fips.

7 subject=/CN=*.<redacted>.<redacted>.openshiftapps.com
47124 subject=/CN=<redacted>.com

Removing the actual URLs because we don't want this (prod) endpoint to be used for reproducing.

Comment 29 Dan Mace 2019-10-11 00:08:04 UTC
It seems clear we don't intend to do anything else with this report. I'm going to close it due to age unless somebody has an argument why we should keep it open. Thanks.

Comment 30 Steve Teahan 2022-01-28 16:50:43 UTC
For anyone else that comes across this in the future, it looks like this was likely caused by the issue that has since been documented here: https://access.redhat.com/solutions/5603871

I was able to reproduce it far more reliably by simulating packet loss at the client with: tc qdisc add dev <device_name> root netem loss 60%

I could see this issue in 3-4% of the requests with that simulated packet loss (note this is on OCP 4.x).
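
For reference, the matching cleanup once the test is done (same placeholder device name):

tc qdisc del dev <device_name> root netem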

