Bug 1420121 - Java console link fails with TLS handshake timeout error
Summary: Java console link fails with TLS handshake timeout error
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Assignee: Scott Dodson
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-07 21:57 UTC by Travis Rogers
Modified: 2017-09-07 19:37 UTC (History)
15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-07 19:30:38 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift openshift-ansible issues 3017 0 None closed dnsmasq does not serve PTR records, making nodes not compliant with Kube DNS 2020-12-08 19:40:55 UTC
Red Hat Bugzilla 1489581 0 medium CLOSED Provide negative responses for services and pod CIDR PTR requests 2021-02-22 00:41:40 UTC
Red Hat Issue Tracker CLOUD-1322 0 Major Closed Java console link fails for A-MQ based pod with TLS handshake timeout error 2020-12-08 19:40:54 UTC
Red Hat Issue Tracker OSFUSE-554 0 Major Closed Java console is not accessible for FIS quickstart 2020-12-08 19:40:26 UTC

Internal Links: 1489581

Description Travis Rogers 2017-02-07 21:57:38 UTC
Description of problem:
After creating an A-MQ based pod using an xPaaS-provided template, clicking the Open Java Console link on the Pod details screen of the OpenShift web console fails with a TLS handshake timeout error.

Version-Release number of selected component (if applicable):
OCP 3.3
A-MQ xPaaS image 1.3-5 to latest

How reproducible:
Consistent in some environments, intermittent in others.

Steps to Reproduce:
Refer to steps in https://issues.jboss.org/browse/CLOUD-1322

Actual results:
TLS handshake timeout error

Expected results:
Java console is displayed from running container

Comment 1 Travis Rogers 2017-02-07 22:16:01 UTC
An example from troubleshooting shows a successful curl directly from the master to the pod.  A subsequent curl from the master to the pod via the kube-proxy fails with a TLS handshake timeout error.

The following does work:
 
[rkieley@node1-ocp33 test]$ oc describe pods broker-amq-3-anaj7
Name:                   broker-amq-3-anaj7
.
.
.
Status:                 Running
IP:                     10.1.0.2
 
 
[root@master-ocp33 rkieley]# curl -v -k -s -L --cert /etc/origin/master/master.proxy-client.crt --key /etc/origin/master/master.proxy-client.key https://10.1.0.2:8778/jolokia/search/org.apache.activemq.broker:*                             
* About to connect() to 10.1.0.2 port 8778 (#0)
*   Trying 10.1.0.2...
* Connected to 10.1.0.2 (10.1.0.2) port 8778 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate from file
*       subject: CN=system:master-proxy
*       start date: Oct 07 18:16:40 2016 GMT
*       expire date: Oct 07 18:16:41 2018 GMT
*       common name: system:master-proxy
*       issuer: CN=openshift-signer@1475864196
* SSL connection using TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*       subject: CN=Jolokia Agent 1.3.2,OU=JVM,O=jolokia.org,L=Pegnitz,ST=Franconia,C=DE
*       start date: Feb 03 13:12:51 2017 GMT
*       expire date: Feb 01 13:12:51 2027 GMT
*       common name: Jolokia Agent 1.3.2
*       issuer: CN=Jolokia Agent 1.3.2,OU=JVM,O=jolokia.org,L=Pegnitz,ST=Franconia,C=DE
> GET /jolokia/search/org.apache.activemq.broker:* HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.1.0.2:8778
> Accept: */*
> 
< HTTP/1.1 200 OK
< Pragma: no-cache
< Date: Tue, 07 Feb 2017 13:05:47 GMT
< Content-type: text/plain; charset=utf-8
< Expires: Tue, 07 Feb 2017 12:05:46 GMT
< Content-length: 115
< Cache-control: no-cache
< 
* Connection #0 to host 10.1.0.2 left intact
{"request":{"mbean":"org.apache.activemq.broker:*","type":"search"},"value":[],"timestamp":1486472747,"status":200}[root@master-ocp33 rkieley]#
 
 
While the following does not:
 
[root@master-ocp33 rkieley]# curl -v -k -s -L --cert /etc/origin/master/master.proxy-client.crt --key /etc/origin/master/master.proxy-client.key  https://master-ocp33.kieley.ca:8443/api/v1/namespaces/017108385/pods/https:broker-amq-3-anaj7:8778/proxy/jolokia/search/org.apache.activemq.broker:* -H "Authorization: Bearer $token"                                                                                                                                                      
* About to connect() to master-ocp33.kieley.ca port 8443 (#0)
*   Trying 192.168.2.231...
* Connected to master-ocp33.kieley.ca (192.168.2.231) port 8443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate from file
*       subject: CN=system:master-proxy
*       start date: Oct 07 18:16:40 2016 GMT
*       expire date: Oct 07 18:16:41 2018 GMT
*       common name: system:master-proxy
*       issuer: CN=openshift-signer@1475864196
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=172.30.0.1
*       start date: Oct 07 18:16:39 2016 GMT
*       expire date: Oct 07 18:16:40 2018 GMT
*       common name: 172.30.0.1
*       issuer: CN=openshift-signer@1475864196
> GET /api/v1/namespaces/017108385/pods/https:broker-amq-3-anaj7:8778/proxy/jolokia/search/org.apache.activemq.broker:* HTTP/1.1
> User-Agent: curl/7.29.0
> Host: master-ocp33.kieley.ca:8443
> Accept: */*
> Authorization: Bearer lWZmpw4ixefQ-6oeMmg-5ZLjKXGn9ThYYDeRcVq680c
> 
< HTTP/1.1 503 Service Unavailable
< Cache-Control: no-store
< Date: Tue, 07 Feb 2017 13:11:40 GMT
< Content-Length: 127
< Content-Type: text/plain; charset=utf-8
< 
Error: 'net/http: TLS handshake timeout'
* Connection #0 to host master-ocp33.kieley.ca left intact
Trying to reach: 'https://10.1.0.2:8778/jolokia/search/org.apache.activemq.broker:%2A'

Comment 2 Marek Schmidt 2017-02-08 14:53:51 UTC
As discussed in https://issues.jboss.org/browse/OSFUSE-554, one of the likely causes is a delay from blocked DNS resolution when Jolokia performs a reverse DNS query to translate the master IP address into a hostname.

(The symptom: "nslookup <master node IP>" from inside the pod takes more than 5s.)

Not sure why that happens, though; could it be a skydns issue?
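
A quick way to run that check from inside the pod (a minimal sketch; it assumes nslookup is available in the image, and 10.1.5.1 stands in for the master IP as seen from the pod):

sh-4.2$ time nslookup 10.1.5.1

If this takes more than ~5s or times out, the TLS handshake through the API proxy will exceed its timeout and the console link will fail.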

Comment 3 Jessica Forrester 2017-02-08 15:03:52 UTC
We probably need someone from either Cluster Infra or Networking to help with this.

Comment 6 Maru Newby 2017-02-08 16:43:35 UTC
(In reply to Marek Schmidt from comment #2)
> As discussed in https://issues.jboss.org/browse/OSFUSE-554, one of the
> likely causes is a delay caused by blocked DNS resolution when the jolokia
> tries to do the reverse DNS query to translate the master IP address into
> hostname.
> 
> (if "nslookup <master node IP>" from inside the pod takes more than 5s)
> 
> Not sure why that happens, though,  could be a skydns issue?

Does reverse lookup on the kubernetes service work (e.g. nslookup 172.30.0.1)?  If so, the problem may not be with skydns.  Skydns forwards requests for which it is not authoritative, and since skydns is not responsible for maintaining DNS for the nodes, requests would be forwarded to the DNS server(s) configured on the master (assuming the openshift deployment is not containerized).  What happens when the reverse lookup that exhibits problems in the pod is run on the master?

Comment 7 Marek Schmidt 2017-02-08 16:52:35 UTC
nslookup 172.30.0.1 returns immediately; it is just the master's pod-network IP that makes nslookup hang...

Reproducing this on https://open.paas.redhat.com 

sh-4.2$ nslookup 10.1.5.1                                                                                                                                                                                                                                                   
;; connection timed out; trying next origin                                                                                                                                                                                                                                 
;; connection timed out; trying next origin                                                                                                                                                                                                                                 
;; connection timed out; trying next origin                                                                                                                                                                                                                                 
;; connection timed out; trying next origin                                                                                                                                                                                                                                 
;; connection timed out; no servers could be reached                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
sh-4.2$ nslookup 172.30.0.1                                                                                                                                                                                                                                                 
Server:         10.29.66.115                                                                                                                                                                                                                                                
Address:        10.29.66.115#53                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                            
** server can't find 1.0.30.172.in-addr.arpa.: NXDOMAIN 


I don't control this environment, so unfortunately I cannot try it on the master node.

Comment 8 Maru Newby 2017-02-08 17:15:38 UTC
(In reply to Marek Schmidt from comment #7)
> nslookup 172.30.0.1  ends immediately, it is just the master pod IP that
> makes nslookup to hang...
> 
> Reproducing this on https://open.paas.redhat.com 
> 
> sh-4.2$ nslookup 10.1.5.1                                                   
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; no servers could be reached                        

It appears that skydns is attempting to forward the request but none of the configured servers are reachable.  What happens when you try to resolve a random external name (e.g. nslookup google.com)?

>                                                                             
> 
> sh-4.2$ nslookup 172.30.0.1                                                 
> 
> Server:         10.29.66.115                                                
> 
> Address:        10.29.66.115#53                                             
> 
>                                                                             
> 
> ** server can't find 1.0.30.172.in-addr.arpa.: NXDOMAIN 

172.30.0.1 is the default cluster ip for the api service, but it can be configured.  What is KUBERNETES_SERVICE_HOST in a pod on the cluster?  I'm not entirely sure why this request wouldn't have been forwarded if skydns didn't have a name to resolve it to.

> 
> 
> I don't control this environment, so unfortunately cannot do try it on the
> master node.


Is this bug report specifically about open.paas.redhat.com?  Reverse lookup in an openshift pod for a given deployment should work provided the dns server(s) that skydns is forwarding to are configured properly, but it is entirely dependent on the dns configuration of the deployed environment.

Comment 9 Travis Rogers 2017-02-08 18:56:27 UTC
"Is this bug report specifically about open.paas.redhat.com?  Reverse lookup in an openshift pod for a given deployment should work provided the dns server(s) that skydns is forwarding to are configured properly, but it is entirely dependent on the dns configuration of the deployed environment."

No.  This issue stems from an open support issue reported by an end user.  I believe that open.paas.redhat.com is in play because it is a ready place to try to reproduce the root issue of TLS handshake timeouts when navigating to the Java console via the OpenShift web console.  See comment #1.

Comment 10 Travis Rogers 2017-02-08 19:09:09 UTC
So far for reproducing we have focused on the Jolokia endpoint since that is where the original report stems from.  Research on the Fuse JIRA [1] has started to show DNS lag times, but still in relation to Jolokia as the endpoint.  

I think it would be beneficial to isolate Jolokia and DNS as much as possible to confirm some of the observations noticed so far.  With this in mind, is there another URL that could be tested that uses the kube-proxy and a different endpoint?  IOW, something other than Jolokia.  Might be good to see how DNS is working with this different endpoint that is still being directed via kube-proxy.

Example curl of a Jolokia URL similar to the one the OpenShift web console would generate and make available for an end user:

$ oc login -u user-name
$ export token=`oc whoami -t`
$ curl -v https://master.example.com/api/v1/namespaces/example-project/pods/https:broker-amq-example-pod:8778/proxy/jolokia/version -H "Authorization: Bearer $token" --insecure


It would be good to have another URL that the OpenShift web console generates that uses the kube-proxy, but with an endpoint to a service other than Jolokia. Anyone have an example?



[1]
https://issues.jboss.org/browse/OSFUSE-554

Comment 11 Maru Newby 2017-02-08 19:34:20 UTC
My reading of the comments on the jboss issue is that this doesn't have anything to do with either the api proxy (kube proxy is something else entirely) or the web console.  I think dns misconfiguration is the most likely culprit at this point, and my last comment identifies some suggestions for how to confirm that diagnosis.

Comment 13 Aurélien Pupier 2017-02-20 12:36:53 UTC
Is it possible to keep the severity at "High" until we have a clear workaround?
We have several cases where it fails only "sometimes", but in other cases it does not work at all, so the Java console and Jolokia endpoint are not accessible.

thanks for your help.

Comment 14 Maru Newby 2017-02-20 15:09:52 UTC
(In reply to Aurélien Pupier from comment #13)
> Is it possible to keep the severity to "High" until we have a clear
> workaround?
> We have several cases where it is failing only "sometimes" but for some
> other cases it is not working at all so the Java console and Jolokia
> endpoint are not accessible.
> 
> thanks for your help.

Please address my comments to rule out DNS misconfiguration.

Comment 15 Aurélien Pupier 2017-02-20 15:50:58 UTC
@Maru Can you specify which comment you are talking about, and/or provide more specific steps to work around the issue? I'm familiar with neither OpenShift nor DNS, and I don't understand from your previous comments what needs to be done to work around the issue.

Comment 16 Maru Newby 2017-02-20 16:02:59 UTC
(In reply to Aurélien Pupier from comment #15)
> @Maru Can you specify which comment you are talking about? and/or provide
> more specific steps to workaround the issue. I'm not familiar with neither
> OpenShift nor DNS. I don't understand what need to be done to workaround the
> issue from your previous comments.

My reading of the jboss issue is that reverse dns lookup is being attempted by the java app for the IP address of the API server.  Since skydns is not responsible for that address, the request would be passed to the dns servers configured on the host running skydns.  I believe that those dns servers (separate from openshift and skydns) are either not accessible or not configured to provide reverse lookup for the IP of the api server, and comment #8 on this bug describes ways to determine if that is the case.
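
Put together, the checks from comment #8 look like this from inside the pod (a sketch; nslookup must be available in the image, and <master node IP> is whatever IP the master has on the pod network):

sh-4.2$ echo $KUBERNETES_SERVICE_HOST   # should print the API service IP, e.g. 172.30.0.1
sh-4.2$ nslookup google.com             # external forward lookup: should answer quickly
sh-4.2$ nslookup 172.30.0.1             # reverse lookup of the API service IP: skydns should answer
sh-4.2$ nslookup <master node IP>       # reverse lookup of the master: the suspect step

If only the last lookup hangs, the DNS servers that skydns forwards to are the likely problem.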

Comment 17 Maru Newby 2017-02-20 16:04:03 UTC
(In reply to Maru Newby from comment #16)
> (In reply to Aurélien Pupier from comment #15)
> > @Maru Can you specify which comment you are talking about? and/or provide
> > more specific steps to workaround the issue. I'm not familiar with neither
> > OpenShift nor DNS. I don't understand what need to be done to workaround the
> > issue from your previous comments.
> 
> My reading of the jboss issue is that reverse dns lookup is being attempted
> by the java app for the IP address of the API server.  Since skydns is not

s/IP address of the API server/IP address of the node running the API server/

> responsible for that address, the request would be passed to the dns servers
> configured on the host running skydns.  I believe that those dns servers
> (separate from openshift and skydns) are either not accessible or not
> configured to provide reverse lookup for the IP of the api server, and
> comment #8 on this bug describes ways to determine if that is the case.

Comment 18 Aurélien Pupier 2017-02-21 09:19:08 UTC
> What happens when you try to resolve a random external name (e.g. nslookup google.com)?

Unfortunately I get "sh: sudo: command not found", and even when connected as the admin user I cannot install the command:

sh-4.2$ yum install bind-utils                                                                                                                                                                                                                                       
Loaded plugins: ovl, product-id, search-disabled-repos, subscription-manager                                                                                                                                                                                         
ovl: Error while doing RPMdb copy-up:                                                                                                                                                                                                                                
[Errno 13] Permission denied: '/var/lib/rpm/.rpm.lock'                                                                                                                                                                                                               
You need to be root to perform this command.

sh-4.2$ sudo yum install bind-utils                                                                                                                                                                                                                                  
sh: sudo: command not found

> What is KUBERNETES_SERVICE_HOST in a pod on the cluster?

sh-4.2$ echo $KUBERNETES_SERVICE_HOST
172.30.0.1

> Is this bug report specifically about open.paas.redhat.com?

No, it occurs on open.paas.redhat.com and it also occurs when using the CDK.

Comment 19 Maru Newby 2017-02-22 04:55:23 UTC
(In reply to Aurélien Pupier from comment #18)
> > What happens when you try to resolve a random external name (e.g. nslookup google.com)?
> 
> unfortunately, I have a "sh: sudo: command not found" and even when
> connected with admin user, I cannot install the command
> 
> sh-4.2$ yum install bind-utils                                              
> 
> Loaded plugins: ovl, product-id, search-disabled-repos, subscription-manager
> 
> ovl: Error while doing RPMdb copy-up:                                       
> 
> [Errno 13] Permission denied: '/var/lib/rpm/.rpm.lock'                      
> 
> You need to be root to perform this command.
> 
> sh-4.2$ sudo yum install bind-utils                                         
> 
> sh: sudo: command not found

You'll need to find an image that supports nslookup/dig/etc. to perform the check.  It would be surprising if external resolution was broken.

> > What is KUBERNETES_SERVICE_HOST in a pod on the cluster?
> 
> sh-4.2$ echo $KUBERNETES_SERVICE_HOST
> 172.30.0.1

I'm confused as to why comment #7 reported that this IP wasn't resolvable.  I would expect skydns to return something like kubernetes.default.cluster.local.

> 
> > Is this bug report specifically about open.paas.redhat.com?
> 
> No, it occurs on open.paas.redhat.com and it also occurs when using the CDK.

And the symptoms are identical?

From a brief search on the CDK, I'm assuming it runs locally?  It should be easier to diagnose the issue via the CDK, since you can discover what address the API is bound to without having to connect to it and then perform reverse lookup on that IP.

Comment 20 Aurélien Pupier 2017-02-22 13:36:34 UTC
(In reply to Maru Newby from comment #19)
> (In reply to Aurélien Pupier from comment #18)
> > > What happens when you try to resolve a random external name (e.g. nslookup google.com)?
> > 
> > unfortunately, I have a "sh: sudo: command not found" and even when
> > connected with admin user, I cannot install the command
> > 
> > sh-4.2$ yum install bind-utils                                              
> > 
> > Loaded plugins: ovl, product-id, search-disabled-repos, subscription-manager
> > 
> > ovl: Error while doing RPMdb copy-up:                                       
> > 
> > [Errno 13] Permission denied: '/var/lib/rpm/.rpm.lock'                      
> > 
> > You need to be root to perform this command.
> > 
> > sh-4.2$ sudo yum install bind-utils                                         
> > 
> > sh: sudo: command not found
> 
> You'll need to find an image that supports nslookup/dig/etc. to perform the
> check.  It would be surprising if external resolution was broken.

So I need an image with a Java application and nslookup/dig/etc.?
Do you know where I can find one?


I have curl available; is it enough to investigate?

sh-4.2$ curl 172.30.0.1
curl: (7) Failed connect to 172.30.0.1:80; Connection timed out                                                                                                                                                                                                      
sh-4.2$ curl www.google.com                                                                                                                                                                                                                                          
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">                                                                                                                                                                                       
<TITLE>302 Moved</TITLE></HEAD><BODY>                                                                                                                                                                                                                                
<H1>302 Moved</H1>                                                                                                                                                                                                                                                   
The document has moved                                                                                                                                                                                                                                               
<A HREF="http://www.google.fr/?gfe_rd=cr&amp;ei=qZOtWNnkO6Hc8AeR9b-ACQ">here</A>.                                                                                                                                                                                    
</BODY></HTML>


> > > What is KUBERNETES_SERVICE_HOST in a pod on the cluster?
> > 
> > sh-4.2$ echo $KUBERNETES_SERVICE_HOST
> > 172.30.0.1
> 
> I'm confused as to why comment #7 reported that this IP wasn't resolvable. 
> I would expect skydns to return something like
> kubernetes.default.cluster.local.
> 
> > 
> > > Is this bug report specifically about open.paas.redhat.com?
> > 
> > No, it occurs on open.paas.redhat.com and it also occurs when using the CDK.
> 
> And the symptoms are identical?

The symptoms are a bit different among all the reported issues that I have read, but it is not an open.paas.redhat.com vs. CDK split.
Some show no Java console at all; others have it partially displayed.


> 
> From a brief search on the CDK, I'm assuming it runs locally?  It should be
> easier to diagnose the issue via the CDK, since you can discover what
> address the API is bound to without having to connect to it and then perform
> reverse lookup on that IP.

Yes, it runs locally.

Comment 21 Aurélien Pupier 2017-02-22 13:51:55 UTC
I managed to install nslookup by:
vagrant ssh
sudo su
docker ps (to see the docker id of my pod)
docker exec -t -i --user=root 386f5bf8e6d4 /bin/sh
yum install /usr/bin/nslookup


and so the results of the nslookup are:

sh-4.2$ nslookup google.com                                                                                                                                                         
Server:         10.1.2.2                                                                                                                                                            
Address:        10.1.2.2#53                                                                                                                                                         
                                                                                                                                                                                    
Non-authoritative answer:                                                                                                                                                           
Name:   google.com                                                                                                                                                                  
Address: 172.217.22.78                                                                                                                                                              
                                                                                                                                                                                    
sh-4.2$ nslookup 172.30.0.1                                                                                                                                                         
Server:         10.1.2.2                                                                                                                                                            
Address:        10.1.2.2#53                                                                                                                                                         
                                                                                                                                                                                    
Non-authoritative answer:                                                                                                                                                           
1.0.30.172.in-addr.arpa name = kubernetes.default.svc.cluster.local.                                                                                                                
                                                                                                                                                                                    
Authoritative answers can be found from:                                                                                                                                            
                                                                                                                                                                                    

The Jolokia agent is started with: "Jolokia: Agent started with URL https://172.17.0.4:8778/jolokia/"

For this address I get connection timeouts:
sh-4.2$ nslookup 172.17.0.4                                                                                                                                                         
;; connection timed out; trying next origin                                                                                                                                         
;; connection timed out; trying next origin                                                                                                                                         
;; connection timed out; trying next origin                                                                                                                                         
;; connection timed out; no servers could be reached

Comment 22 Aurélien Pupier 2017-02-22 13:53:40 UTC
If I specify the protocol, I get an error returned immediately instead of a timeout:

sh-4.2$ nslookup https://172.17.0.4:8778/jolokia                                                                                                                                    
Server:         10.1.2.2                                                                                                                                                            
Address:        10.1.2.2#53                                                                                                                                                         
                                                                                                                                                                                    
** server can't find https://172.17.0.4:8778/jolokia: NXDOMAIN                                                                                                                      
                                                                                                                                                                                    
sh-4.2$ nslookup https://172.17.0.4                                                                                                                                                 
Server:         10.1.2.2                                                                                                                                                            
Address:        10.1.2.2#53                                                                                                                                                         
                                                                                                                                                                                    
** server can't find https://172.17.0.4: NXDOMAIN                                                                                                                                   
                                                                                                                                                                                    
sh-4.2$ nslookup http://172.17.0.4                                                                                                                                                  
Server:         10.1.2.2                                                                                                                                                            
Address:        10.1.2.2#53                                                                                                                                                         
                                                                                                                                                                                    
** server can't find http://172.17.0.4: NXDOMAIN

Comment 23 Maru Newby 2017-02-23 14:59:07 UTC
(In reply to Aurélien Pupier from comment #21)
> I managed to install nslookup by:
> vagrant ssh
> sudo su
> docker ps (to see the docker id of my pod)
> docker exec -t -i --user=root 386f5bf8e6d4 /bin/sh
> yum install /usr/bin/nslookup
> 
> 
> and so the result of the nslookup are:
> 
> sh-4.2$ nslookup google.com                                                 
> 
> Server:         10.1.2.2                                                    
> 
> Address:        10.1.2.2#53                                                 
> 
>                                                                             
> 
> Non-authoritative answer:                                                   
> 
> Name:   google.com                                                          
> 
> Address: 172.217.22.78                                                      
> 
>                                                                             
> 
> sh-4.2$ nslookup 172.30.0.1                                                 
> 
> Server:         10.1.2.2                                                    
> 
> Address:        10.1.2.2#53                                                 
> 
>                                                                             
> 
> Non-authoritative answer:                                                   
> 
> 1.0.30.172.in-addr.arpa name = kubernetes.default.svc.cluster.local.        


That's what I would expect.
 
>                                                                             
> 
> Authoritative answers can be found from:                                    
> 
>                                                                             
> 
> 
> I have Jolokia agent which is started on: Jolokia: Agent started with URL
> https://172.17.0.4:8778/jolokia/    
> 
> For this address i have some connection timed out:
> sh-4.2$ nslookup 172.17.0.4                                                 
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; no servers could be reached

It is expected that reverse lookup would fail for the pod ip.  Have you had any luck finding the ip that the api proxy is connecting to the Jolokia pod from?  I'm assuming that attempting to nslookup that IP from the pod will have the same result as for 172.17.0.4.  If that is the case, then the next step would be to contact an administrator that can ensure the reverse dns is correctly configured for the master node(s).

Comment 24 Maru Newby 2017-02-23 15:00:46 UTC
(In reply to Aurélien Pupier from comment #22)
> If I specify the protocol, I have an error directly returned instead of a
> timed out:

That's to be expected.  The protocol should not be specified when making DNS queries.

Comment 25 Aurélien Pupier 2017-02-23 15:11:11 UTC
(In reply to Maru Newby from comment #23)
> (In reply to Aurélien Pupier from comment #21)

> > I have Jolokia agent which is started on: Jolokia: Agent started with URL
> > https://172.17.0.4:8778/jolokia/    
> > 
> > For this address i have some connection timed out:
> > sh-4.2$ nslookup 172.17.0.4                                                 
> > 
> > ;; connection timed out; trying next origin                                 
> > 
> > ;; connection timed out; trying next origin                                 
> > 
> > ;; connection timed out; trying next origin                                 
> > 
> > ;; connection timed out; no servers could be reached
> 
> It is expected that reverse lookup would fail for the pod ip.  Have you had
> any luck finding the ip that the api proxy is connecting to the Jolokia pod
> from?

How can I find this IP in the CDK?

> I'm assuming that attempting to nslookup that IP from the pod will
> have the same result as for 172.17.0.4.  If that is the case, then the next
> step would be to contact an administrator that can ensure the reverse dns is
> correctly configured for the master node(s).

For open.paas.redhat.com, why not contact its administrator, if you know what to tell him/her to modify?
For the CDK, I suppose we should contact the CDK team to modify something in the CDK?

Comment 26 Maru Newby 2017-02-23 16:49:30 UTC
(In reply to Aurélien Pupier from comment #25)
> (In reply to Maru Newby from comment #23)
> > (In reply to Aurélien Pupier from comment #21)
> 
> > > I have Jolokia agent which is started on: Jolokia: Agent started with URL
> > > https://172.17.0.4:8778/jolokia/    
> > > 
> > > For this address i have some connection timed out:
> > > sh-4.2$ nslookup 172.17.0.4                                                 
> > > 
> > > ;; connection timed out; trying next origin                                 
> > > 
> > > ;; connection timed out; trying next origin                                 
> > > 
> > > ;; connection timed out; trying next origin                                 
> > > 
> > > ;; connection timed out; no servers could be reached
> > 
> > It is expected that reverse lookup would fail for the pod ip.  Have you had
> > any luck finding the ip that the api proxy is connecting to the Jolokia pod
> > from?
> 
> How can I find this IP in the CDK?

For the CDK it will likely be the same IP as the API (check 'oc status'). 

> 
> > I'm assuming that attempting to nslookup that IP from the pod will
> > have the same result as for 172.17.0.4.  If that is the case, then the next
> > step would be to contact an administrator that can ensure the reverse dns is
> > correctly configured for the master node(s).
> 
> For open.paas.redhat.com, why not contacting him if you know what to tell
> him he/she has to modify.
> For the CDK, I suppose that we should contact the CDK team to modify
> something in the CDK?

This might be resolvable for a deployed cluster, but I'm not sure what the fix would be for the CDK.  It's always possible that reverse dns lookups will induce latency into an application, and I think it would be a lot simpler to default to not doing the lookups than trying to ensure that the lookups won't fail.  There is too much environmental dependency involved in reverse lookup for a general solution to exist.

Comment 27 Aurélien Pupier 2017-02-24 14:32:08 UTC
(In reply to Maru Newby from comment #26)
> (In reply to Aurélien Pupier from comment #25)
> > (In reply to Maru Newby from comment #23)
> > > (In reply to Aurélien Pupier from comment #21)
> > 
> > > > I have Jolokia agent which is started on: Jolokia: Agent started with URL
> > > > https://172.17.0.4:8778/jolokia/    
> > > > 
> > > > For this address i have some connection timed out:
> > > > sh-4.2$ nslookup 172.17.0.4                                                 
> > > > 
> > > > ;; connection timed out; trying next origin                                 
> > > > 
> > > > ;; connection timed out; trying next origin                                 
> > > > 
> > > > ;; connection timed out; trying next origin                                 
> > > > 
> > > > ;; connection timed out; no servers could be reached
> > > 
> > > It is expected that reverse lookup would fail for the pod ip.  Have you had
> > > any luck finding the ip that the api proxy is connecting to the Jolokia pod
> > > from?
> > 
> > How can I find this IP in the CDK?
> 
> For the CDK it will likely be the same IP as the API (check 'oc status'). 

C:\Users\Aurelien Pupier>oc status
In project default on server https://10.1.2.2:8443

https://hub.openshift.rhel-cdk.10.1.2.2.xip.io (passthrough) to pod port 5000-tcp (svc/docker-registry)
  dc/docker-registry deploys docker.io/openshift3/ose-docker-registry:v3.4.1.2
    deployment #4 deployed 3 days ago - 1 pod
    deployment #3 failed 3 days ago: newer deployment was found running
    deployment #2 failed 3 days ago: newer deployment was found running

svc/kube ports 443, 53->8053, 53->8053

svc/router - 172.30.133.243 ports 80, 443, 1936, 9101
  dc/router deploys
    docker.io/openshift3/ose-haproxy-router:v3.4.1.2
    deployment #1 deployed 3 days ago - 1 pod
    docker.io/prom/haproxy-exporter:latest
    deployment #1 deployed 3 days ago - 1 pod

3 warnings identified, use 'oc status -v' to see details.

> > 
> > > I'm assuming that attempting to nslookup that IP from the pod will
> > > have the same result as for 172.17.0.4.

There are 2 IPs available; I tried both. Which one is the API proxy?

sh-4.2$ nslookup 10.1.2.2                                                                                                                                                                                                        
;; connection timed out; trying next origin                                                                                                                                                                                      
;; connection timed out; trying next origin                                                                                                                                                                                      
;; connection timed out; trying next origin                                                                                                                                                                                      
;; connection timed out; no servers could be reached                                                                                                                                                                             
                                                                                                                                                                                                                                 
sh-4.2$ nslookup 172.30.133.243                                                                                                                                                                                                  
Server:         10.1.2.2                                                                                                                                                                                                         
Address:        10.1.2.2#53                                                                                                                                                                                                      
                                                                                                                                                                                                                                 
Non-authoritative answer:                                                                                                                                                                                                        
243.133.30.172.in-addr.arpa     name = router.default.svc.cluster.local.                                                                                                                                                         
                                                                                                                                                                                                                                 
Authoritative answers can be found from:                                                                                                                                                                                         
                                                                                                                                                                                                                                 
sh-4.2$                                                                                                                                                                                                                          

> > > If that is the case, then the next
> > > step would be to contact an administrator that can ensure the reverse dns is
> > > correctly configured for the master node(s).
> > 
> > For open.paas.redhat.com, why not contacting him if you know what to tell
> > him he/she has to modify.
> > For the CDK, I suppose that we should contact the CDK team to modify
> > something in the CDK?
> 
> This might be resolvable for a deployed cluster, but I'm not sure what the
> fix would be for the CDK.  It's always possible that reverse dns lookups
> will induce latency into an application, and I think it would be a lot
> simpler to default to not doing the lookups than trying to ensure that the
> lookups won't fail.  There is too much environmental dependency involved in
> reverse lookup for a general solution to exist.

I really don't understand why it depends on the applications. We are in a CDK, so inside the same OpenShift instance, and it seems that some pods are not able to communicate with themselves. Why is this not a pure OpenShift/CDK configuration issue?

Comment 28 Maru Newby 2017-02-24 17:01:02 UTC
(In reply to Maru Newby from comment #26)

> > > How can I find this IP in the CDK?
> > 
> > For the CDK it will likely be the same IP as the API (check 'oc status'). 
> 
> C:\Users\Aurelien Pupier>oc status
> In project default on server https://10.1.2.2:8443

10.1.2.2 is the IP of the API, and I would expect the API proxy to connect from it.


> 
> sh-4.2$ nslookup 10.1.2.2                                                   
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; trying next origin                                 
> 
> ;; connection timed out; no servers could be reached                        
> 

Reverse DNS lookup is failing for the IP of the API proxy.


> > This might be resolvable for a deployed cluster, but I'm not sure what the
> > fix would be for the CDK.  It's always possible that reverse dns lookups
> > will induce latency into an application, and I think it would be a lot
> > simpler to default to not doing the lookups than trying to ensure that the
> > lookups won't fail.  There is too much environmental dependency involved in
> > reverse lookup for a general solution to exist.
> 
> I really don't understand why it depends on the applications. We are in a
> CDk so inside the same OpenShift instance and it seems that some pods are
> not able to communicate with themselves, why it is not a pure OpenShift/CDK
> configuration issue?

It depends on the applications because reverse dns lookup is done by the application.  It's not done automatically by the system.  The latency induced by a reverse dns lookup failure is usually ~10s.  Since the api proxy's tls handshake timeout is 10s, it won't be possible to connect via tls through the proxy to applications that insist on doing reverse dns lookup in an environment where reverse lookup will fail.
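
One way to see that budget being blown from the client side (a sketch, not from the original report; <master>, <project> and <pod> are placeholders, and %{http_code}/%{time_total} are standard curl --write-out variables):

$ curl -k -s -o /dev/null \
    -H "Authorization: Bearer $(oc whoami -t)" \
    -w 'http: %{http_code}  total: %{time_total}s\n' \
    https://<master>:8443/api/v1/namespaces/<project>/pods/https:<pod>:8778/proxy/jolokia/

A 503 arriving after roughly 10 seconds points at the proxy-to-pod TLS handshake timing out, rather than a slow client connection.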

Comment 29 Maru Newby 2017-02-24 17:11:55 UTC
Is there a reason it's preferable to connect through the API proxy vs exposing via a service?

Comment 30 Aurélien Pupier 2017-02-27 08:16:23 UTC
(In reply to Maru Newby from comment #29)
> Is there a reason it's preferable to connect through the API proxy vs
> exposing via a service?

I have no idea, I'm just a consumer of these images.

Comment 31 Aurélien Pupier 2017-02-27 08:38:25 UTC
(In reply to Maru Newby from comment #28)

> It depends on the applications because reverse dns lookup is done by the
> application.  It's not done automatically by the system.  The latency
> induced by a reverse dns lookup failure is usually ~10s.  Since the api
> proxy's tls handshake timeout is 10s, it won't be possible to connect via
> tls through the proxy to applications that insist on doing reverse dns
> lookup in an environment where reverse lookup will fail.

"in an environment where reverse dns lookup will fail", what is the "environment"? it is the OpenShift instance? it is the CDK? it is the Host system on which the CDK is installed? A combination of them?

Comment 32 Marek Schmidt 2017-02-27 08:43:18 UTC
The Java console (part of the OpenShift console) does use the API proxy for the user's browser to connect to the Jolokia agent running in the pod.

The Jolokia agent uses the internal Java HTTPS server implementation (sun.net.httpserver.SSLStreams), which always calls addr.getHostName() [1], which tries to translate the requestor's IP address into a hostname.

The API proxy usually runs on the master node, and when the master node opens a connection to the pod, it appears to the pod as the host-network IP address (e.g. 10.1.5.1); on the CDK it would probably be 10.1.2.2.


[1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/net/httpserver/SSLStreams.java/#64
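
The blocking lookup the JVM performs can be reproduced from inside the pod with the system resolver (a sketch; 10.1.5.1 stands in for the host-network source address mentioned above, and getent is part of glibc so it is usually present even in minimal images):

sh-4.2$ time getent hosts 10.1.5.1

getent hosts on an IP goes through a comparable gethostbyaddr() path to the one addr.getHostName() ultimately uses, so a multi-second hang here matches the delay seen during the TLS handshake.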

Comment 33 Maru Newby 2017-03-02 21:49:27 UTC
(In reply to Aurélien Pupier from comment #31)
> (In reply to Maru Newby from comment #28)
> 
> > It depends on the applications because reverse dns lookup is done by the
> > application.  It's not done automatically by the system.  The latency
> > induced by a reverse dns lookup failure is usually ~10s.  Since the api
> > proxy's tls handshake timeout is 10s, it won't be possible to connect via
> > tls through the proxy to applications that insist on doing reverse dns
> > lookup in an environment where reverse lookup will fail.
> 
> "in an environment where reverse dns lookup will fail", what is the
> "environment"? it is the OpenShift instance? it is the CDK? it is the Host
> system on which the CDK is installed? A combination of them?

The 'environment' is any kubernetes or openshift deployment where the host running skydns cannot resolve the API's IP address.

Comment 34 Aurélien Pupier 2017-03-03 12:33:03 UTC
(In reply to Maru Newby from comment #33)
> (In reply to Aurélien Pupier from comment #31)
> > (In reply to Maru Newby from comment #28)
> > 
> > > It depends on the applications because reverse dns lookup is done by the
> > > application.  It's not done automatically by the system.  The latency
> > > induced by a reverse dns lookup failure is usually ~10s.  Since the api
> > > proxy's tls handshake timeout is 10s, it won't be possible to connect via
> > > tls through the proxy to applications that insist on doing reverse dns
> > > lookup in an environment where reverse lookup will fail.
> > 
> > "in an environment where reverse dns lookup will fail", what is the
> > "environment"? it is the OpenShift instance? it is the CDK? it is the Host
> > system on which the CDK is installed? A combination of them?
> 
> The 'environment' is any kubernetes or openshift deployment where the host
> running skydns cannot resolve the API's IP address.

I'm on Windows 10 and I have never personally added anything related to SkyDNS or a specific DNS.
Can it come from a specific application?
How can I check that SkyDNS is installed? How do I configure it?

Comment 35 Aurélien Pupier 2017-03-07 13:28:30 UTC
This issue in the openshift-ansible project, https://github.com/openshift/openshift-ansible/issues/3017, is pointed out as possibly being responsible for this bug (see https://issues.jboss.org/browse/OSFUSE-554?focusedCommentId=13361064&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13361064, visibility restricted to Red Hat employees).

Comment 37 Travis Rogers 2017-06-22 15:35:00 UTC
Script to test the curl request to the Jolokia Java console.

jolokia-dns.sh:

#!/bin/bash
# Usage: jolokia-dns.sh <master-host> <bearer-token> <namespace> <pod>
# Makes 11 requests to the pod's Jolokia endpoint through the API proxy
# and prints OK/NOK with the elapsed time in seconds for each attempt.

host=$1
bearer=$2
namespace=$3
pod=$4
for i in $(seq 0 10)
do
  start=$(date +'%s')
  # -f makes curl exit non-zero on HTTP errors (e.g. the 503 returned when
  # the proxy's TLS handshake times out); -k skips certificate verification.
  curl -f -k \
     -H "Authorization: Bearer $bearer" \
     https://${host}/api/v1/namespaces/${namespace}/pods/https:${pod}:8778/proxy/jolokia/ \
    >/dev/null 2>&1
  rc=$?
  end=$(date +'%s')
  duration=$(( end - start ))
  if [ $rc -gt 0 ]; then
    echo "$i: NOK $duration s"
  else
    echo "$i: OK $duration s"
  fi
done


Example command is:
jolokia-dns.sh openshift.master.example.com bearer-token project pod

Retrieve a bearer token:
oc whoami -t

Comment 38 Scott Dodson 2017-06-26 19:16:15 UTC
If the host's normal resolver cannot look up reverse DNS for the master's IP address, then none of the changes coming in 3.6 would affect this.

In 3.5, reverse lookups from pods go to dnsmasq on the node, and dnsmasq then routes those queries to the host's default nameservers; kube reverse DNS won't work. The only thing we're fixing in 3.6 is that kube service IPs will have working reverse DNS, but this sounds like a problem resolving the IP address of the master rather than kube.
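
To see which servers a pod's reverse queries ultimately reach under that 3.5 flow, one can walk the chain (a sketch; <pod>, <upstream-server> and <master IP> are placeholders, and the upstream file path is the one shown later in comment #56):

# 1. Inside the pod, resolv.conf should point at dnsmasq on the node.
oc rsh <pod> cat /etc/resolv.conf

# 2. On the node, dnsmasq's upstreams are the host's default nameservers.
cat /etc/dnsmasq.d/origin-upstream-dns.conf

# 3. Query an upstream directly to see whether the PTR request dies there.
dig @<upstream-server> -x <master IP>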

Comment 39 Scott Dodson 2017-06-26 19:25:29 UTC
To debug this, I'd turn on query logging for dnsmasq on the host where the problematic pod is running and restart dnsmasq; that should highlight the host that dnsmasq is forwarding the request to.
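
A minimal way to do that (a sketch, assuming dnsmasq is managed by systemd on the node; log-queries is a standard dnsmasq option):

# Enable query logging on the node running the problematic pod.
echo 'log-queries' > /etc/dnsmasq.d/log-queries.conf
systemctl restart dnsmasq

# Follow the log to see where the PTR request gets forwarded.
journalctl -u dnsmasq -f

Remember to remove the drop-in and restart dnsmasq again when done, since query logging is noisy.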

Comment 51 Scott Dodson 2017-08-14 19:50:25 UTC
Marek, Aurélien,

Are you suggesting that DNS resolution failures or timeouts are problematic? In testing this today in Travis's environment, we see that Jolokia works fine even though the DNS server is returning NXDOMAIN for the (IPv4) source IP address.

Thanks,
Scott

Comment 53 Aurélien Pupier 2017-08-28 11:17:08 UTC
Hi Scott,

Sorry, but here I just reported the issue and provided information from related bugs to try to help; personally, I have no idea what the root cause is.

Comment 54 Marek Schmidt 2017-08-28 12:52:09 UTC
Scott,

For Jolokia it does not matter what the result of the reverse DNS query is, as long as it is instantaneous. It is problematic if the reverse DNS lookup times out, as then the API proxy times out as well with the "net/http: TLS handshake timeout".
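
The distinction is easy to check with dig (a sketch; <source IP> stands for the proxy's source address discussed above):

# A fast NXDOMAIN is harmless for Jolokia; what breaks it is no answer at all.
dig -x <source IP>
# Compare the "status:" field and the ";; Query time:" line in the output:
# NXDOMAIN after a few milliseconds is fine, a timeout is not.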

Comment 56 Scott Dodson 2017-09-07 19:30:38 UTC
It was determined that the DNS servers in the customer's environment time out all PTR requests rather than returning NXDOMAIN as expected. Adding static entries to dnsmasq on all nodes, like `host-record=master01,10.0.224.1`, can work around such a broken DNS environment.

You may find other Java-based applications that suffer the same behavior, as the reverse lookup is deep inside common Java libraries. Incoming requests via the router will similarly originate from the tun0 address of the node the router is running on, and you can add entries for that too. However, if requests are made between two pods, there's really no sane way to define entries for each pod IP; hopefully this isn't common.

This was debugged by first verifying the 503 from Jolokia. Since the API server proxies to the pod IP address, the request originates from the API server's tun0 interface. If the device is not tun0, you can find it by looking at the hostsubnet assigned: the first IP address in that range will be the source IP for requests routed via that API server, i.e. 10.128.0.1 for ose3-master.example.com below.

[root@ose3-master ~]# oc get hostsubnets
NAME                      HOST                      HOST IP          SUBNET
ose3-master.example.com   ose3-master.example.com   192.168.122.52   10.128.0.0/23

You'd then attempt to resolve that address via dnsmasq; in this case it was timing out:

# dig -x 10.128.0.1

Next, bypass dnsmasq and go directly to the upstream DNS servers; these are defined in /etc/dnsmasq.d/origin-upstream-dns.conf. If that similarly times out, then we've determined that the upstream DNS servers are at fault.

# cat /etc/dnsmasq.d/origin-upstream-dns.conf 
server=192.168.122.1

# dig @192.168.122.1 -x 10.128.0.1


I had suggested that the DNS changes in 3.6 would have improved this, but as it turns out the new node resolver does not provide PTR records for pod IPs at all, so it wouldn't have a PTR for the host's SDN IP either. However, discussing with Clayton, it sounds as if it would be desirable for the node resolver to provide negative responses for both the pod CIDR and the services CIDR ranges, so I'll open a bug for that.
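
Spelled out, the workaround amounts to a small dnsmasq drop-in on every node (a sketch; the first entry is the one from the comment above, the second is illustrative):

# /etc/dnsmasq.d/openshift-reverse-dns.conf
# Static entries give dnsmasq an immediate A and PTR answer for each
# master, so pod-side reverse lookups no longer wait on broken upstreams.
host-record=master01,10.0.224.1
host-record=master02,10.0.224.2

Then restart dnsmasq on each node (systemctl restart dnsmasq) so the entries take effect.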

