Bug 1275003

Summary: router reports 503 for seemingly properly identified route when endpoints accessible
Product: OpenShift Container Platform
Reporter: Erik M Jacobs <ejacobs>
Component: Networking
Networking sub component: router
Assignee: Michail Kargakis <mkargaki>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
CC: aos-bugs, ejacobs, pweil, sdodson
Version: 3.1.0
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Regression: ---
Last Closed: 2015-11-23 14:26:31 UTC

Description Erik M Jacobs 2015-10-24 19:11:44 UTC
[joe@ose3-master ~]$ oc version
oc v3.0.2.903
kubernetes v1.2.0-alpha.1-1107-g4c8e6f4

atomic-openshift-3.0.2.903-0.git.0.a4ff36b.el7aos.x86_64
atomic-openshift-clients-3.0.2.903-0.git.0.a4ff36b.el7aos.x86_64
atomic-openshift-master-3.0.2.903-0.git.0.a4ff36b.el7aos.x86_64
atomic-openshift-node-3.0.2.903-0.git.0.a4ff36b.el7aos.x86_64
atomic-openshift-sdn-ovs-3.0.2.903-0.git.0.a4ff36b.el7aos.x86_64
tuned-profiles-atomic-openshift-node-3.0.2.903-0.git.0.a4ff36b.el7aos.x86_64

The master node can reach a pod on another node:
[root@ose3-master training]# oc get endpoints hello-service -n demo
NAME            ENDPOINTS                                   AGE
hello-service   10.1.1.6:8080,10.1.1.7:8080,10.1.2.2:8080   10m
[root@ose3-master training]# curl 10.1.2.2:8080
Hello OpenShift!

router configuration:
[root@ose3-master training]# oc exec router-2-2ty6y -- cat /var/lib/containers/router/routes.json
{
  "default/kubernetes": {
    "Name": "default/kubernetes",
    "EndpointTable": [
      {
        "ID": "192.168.133.2:8443",
        "IP": "192.168.133.2",
        "Port": "8443",
        "TargetName": "192.168.133.2",
        "PortName": "https"
      }
    ],
    "ServiceAliasConfigs": {}
  },
  "default/router": {
    "Name": "default/router",
    "EndpointTable": [
      {
        "ID": "192.168.133.2:80",
        "IP": "192.168.133.2",
        "Port": "80",
        "TargetName": "router-2-2ty6y",
        "PortName": "80-tcp"
      }
    ],
    "ServiceAliasConfigs": {}
  },
  "demo/hello-service": {
    "Name": "demo/hello-service",
    "EndpointTable": [
      {
        "ID": "10.1.1.6:8080",
        "IP": "10.1.1.6",
        "Port": "8080",
        "TargetName": "hello-openshift-1",
        "PortName": ""
      },
      {
        "ID": "10.1.1.7:8080",
        "IP": "10.1.1.7",
        "Port": "8080",
        "TargetName": "hello-openshift-3",
        "PortName": ""
      },
      {
        "ID": "10.1.2.2:8080",
        "IP": "10.1.2.2",
        "Port": "8080",
        "TargetName": "hello-openshift-2",
        "PortName": ""
      }
    ],
    "ServiceAliasConfigs": {
      "demo_hello-service": {
        "Host": "hello-service-demo.cloudapps.example.com",
        "Path": "",
        "TLSTermination": "",
        "Certificates": null,
        "Status": "",
        "PreferPort": "8888"
      }
    }
  }
}

router reports 503:
[root@ose3-master training]# curl hello-service-demo.cloudapps.example.com
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

router reports 503 inside router:
[root@ose3-master training]# oc exec router-2-2ty6y -it -- bash
[root@ose3-master conf]# curl hello-service-demo.cloudapps.example.com
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

the router can reach the pod on the other node, so the 503 isn't caused by endpoint reachability:
[root@ose3-master conf]# curl 10.1.2.2:8080
Hello OpenShift!

What's wrong here?
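
A quick way to see what haproxy itself has for this route is to dump the generated backend and the host map inside the router pod; the config path and backend name are assumed here, though both turn up later in this thread (comments 3 and 5):

# backend generated for the route (an empty server list would explain the 503)
oc exec router-2-2ty6y -- grep -A 20 'backend be_http_demo_hello-service' /var/lib/haproxy/conf/haproxy.config
# host-to-backend map used by haproxy
oc exec router-2-2ty6y -- cat /var/lib/haproxy/conf/os_http_be.map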

Comment 2 Erik M Jacobs 2015-10-24 20:08:45 UTC
All endpoints are reachable, in case anyone was wondering (to be thorough):

[root@ose3-master training]# curl 10.1.1.6:8080
Hello OpenShift!
[root@ose3-master training]# curl 10.1.1.7:8080                                                          
Hello OpenShift!
[root@ose3-master training]# curl 10.1.2.2:8080                                                          
Hello OpenShift!
[root@ose3-master training]# oc exec router-2-2ty6y -it -- bash
[root@ose3-master conf]# curl 10.1.1.6:8080
Hello OpenShift!
[root@ose3-master conf]# curl 10.1.1.7:8080
Hello OpenShift!
[root@ose3-master conf]# curl 10.1.2.2:8080
Hello OpenShift!

Comment 3 Erik M Jacobs 2015-10-24 20:44:02 UTC
inside the router:
[root@ose3-master /]# find -name '*.map'
./var/lib/haproxy/conf/os_http_be.map
./var/lib/haproxy/conf/os_sni_passthrough.map
./var/lib/haproxy/conf/os_reencrypt.map
./var/lib/haproxy/conf/os_edge_http_be.map
./var/lib/haproxy/conf/os_tcp_be.map
./usr/share/groff/1.22.2/font/devps/generate/dingbats.map
[root@ose3-master /]# cd /var/lib/haproxy/conf/
[root@ose3-master conf]# cat os_http_be.map 
hello-service-demo.cloudapps.example.com demo_hello-service










----

Yes, everything above ---- is whitespace/blank

[root@ose3-master training]# oc logs router-1-vyia4
I1024 16:35:35.754370       1 router.go:122] Router is including routes in all namespaces

Logs don't indicate anything...

Should I not see something like:
backend be_http_tim-test-2_nodejs-example-http

  mode http
  option redispatch
  balance leastconn
  timeout check 5000ms
 
    cookie OPENSHIFT_tim-test-2_nodejs-example-http_SERVERID insert indirect nocache httponly

  server 10.1.3.6:8080 10.1.3.6:8080 check inter 5000ms cookie 10.1.3.6:8080

Comment 4 zhaozhanqi 2015-10-26 08:42:27 UTC
@Erik

QE cannot reproduce this issue. Here are the steps:

1. Create 3 pods with the label 'name: hello-nginx-docker'
   
# oc get pod -n zzhao
NAME                   READY     STATUS    RESTARTS   AGE
hello-nginx-docker     1/1       Running   0          19m
hello-nginx-docker-2   1/1       Running   0          18m
hello-nginx-docker-3   1/1       Running   0          17m
   
2. Create a service mapping to those three pods

[root@openshift-140 ~]# oc get endpoints -n zzhao
NAME          ENDPOINTS                                AGE
hello-nginx   10.1.1.37:80,10.1.1.38:80,10.1.1.39:80   18m


3. oc expose svc hello-nginx
  
# oc get route
NAME          HOST/PORT                              PATH      SERVICE       LABELS             TLS TERMINATION
hello-nginx   hello-nginx-zzhao.ose-appoxza.com.cn             hello-nginx   name=hello-nginx  

4. curl the route
# curl hello-nginx-zzhao.ose-appoxza.com.cn
Hello World

I wonder whether your haproxy.conf is wrong here:

<------snip---->
 cookie OPENSHIFT_tim-test-2_nodejs-example-http_SERVERID insert indirect nocache httponly

  server 10.1.3.6:8080 10.1.3.6:8080 check inter 5000ms cookie 10.1.3.6:8080

<-----snip---->

I don't know why the pod IP is '10.1.3.6'; normally these should be your three pods' IPs.

For comparison, here is my side:

######haproxy.conf######
backend be_http_zzhao_hello-nginx
                
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout check 5000ms
  http-request set-header X-Forwarded-Host %[req.hdr(host)]
  http-request set-header X-Forwarded-Port %[dst_port]
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  
    cookie OPENSHIFT_zzhao_hello-nginx_SERVERID insert indirect nocache httponly
    http-request set-header X-Forwarded-Proto http
  
  http-request set-header Forwarded for=%[src],host=%[req.hdr(host)],proto=%[req.hdr(X-Forwarded-Proto)]
                
  server 10.1.1.37:80 10.1.1.37:80 check inter 5000ms cookie 10.1.1.37:80
                
  server 10.1.1.38:80 10.1.1.38:80 check inter 5000ms cookie 10.1.1.38:80
                
  server 10.1.1.39:80 10.1.1.39:80 check inter 5000ms cookie 10.1.1.39:80

#######################################

Could you double-check whether your service actually works? I suspect the service is not working.

thanks.

Comment 5 Erik M Jacobs 2015-10-27 07:13:12 UTC
There is no file 'haproxy.conf' in my router:

[root@ose3-master /]# find -name '*.map*'
./var/lib/haproxy/conf/os_http_be.map
./var/lib/haproxy/conf/os_sni_passthrough.map
./var/lib/haproxy/conf/os_reencrypt.map
./var/lib/haproxy/conf/os_edge_http_be.map
./var/lib/haproxy/conf/os_tcp_be.map
./usr/share/groff/1.22.2/font/devps/generate/dingbats.map
[root@ose3-master /]# find -name 'haproxy.conf'

There is this file:
./var/lib/haproxy/conf/haproxy.config

It has the following:

##-------------- app level backends ----------------
         
                
backend be_http_demo_hello-service
                
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout check 5000ms
  http-request set-header X-Forwarded-Host %[req.hdr(host)]
  http-request set-header X-Forwarded-Port %[dst_port]
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  
    cookie OPENSHIFT_demo_hello-service_SERVERID insert indirect nocache httponly
    http-request set-header X-Forwarded-Proto http
  
  http-request set-header Forwarded for=%[src],host=%[req.hdr(host)],proto=%[req.hdr(X-Forwarded-Proto)]
                
  
###

There are no IPs listed.

The service works:
[root@ose3-master training]# oc get service -n demo
NAME            CLUSTER_IP       EXTERNAL_IP   PORT(S)    SELECTOR               AGE
hello-service   172.30.176.104   <none>        8888/TCP   name=hello-openshift   3m
[root@ose3-master training]# curl 172.30.176.104:8888
Hello OpenShift!
[root@ose3-master training]# oc get endpoints hello-service -n demo
NAME            ENDPOINTS                                   AGE
hello-service   10.1.1.2:8080,10.1.2.2:8080,10.1.2.3:8080   4m

Note that the router can still reach the pods:
[root@ose3-master training]# oc exec router-2-4rjez -it -- curl 10.1.2.3:8080
Hello OpenShift!

Is something wrong with how the router obtains the endpoints?
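
A quick way to compare the port the route prefers against the ports the service and endpoints actually carry (assuming the route kept the service's name, which is the oc expose default):

# service port vs. targetPort
oc get service hello-service -n demo -o yaml
# ports the endpoints actually serve on
oc get endpoints hello-service -n demo -o yaml
# port carried by the generated route (PreferPort in routes.json)
oc get route hello-service -n demo -o yaml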

Comment 6 zhaozhanqi 2015-10-27 07:30:16 UTC
That's weird; I have never seen this kind of issue before. Can you try deleting the service and re-creating it, preferably with a different port instead of '8888'?

I did find that /var/lib/containers/router/routes.json is a little different: 'PreferPort' is empty on my side.

"ServiceAliasConfigs": {
      "demo_hello-service": {
        "Host": "hello-service-demo.cloudapps.example.com",
        "Path": "",
        "TLSTermination": "",
        "Certificates": null,
        "Status": "",
        "PreferPort": "8888"           #### here is nil in my site

Comment 7 Erik M Jacobs 2015-10-27 08:07:53 UTC
I will try it again (I had to rebuild with 3.0.2) but I do not think the port is the issue. I am using the exact same JSON objects to define the pods and service with 3.0.2 as I am using with 3.1:

https://github.com/thoraxe/training/blob/31-fixes/content/hello-service-pods.json
https://github.com/thoraxe/training/blob/31-fixes/content/hello-service.json

Followed by:
oc expose service hello-service -l name=hello-openshift

The thing that is strange to me is that the router's JSON file (the /var/lib/containers/router/routes.json) has the endpoints listed, but neither the map file nor the conf file have the endpoints.

Again, the two JSON definitions (pods, service) + oc expose works in 3.0.2 on port 8888.

Comment 8 Paul Weil 2015-10-27 18:12:35 UTC
Michalis,

It looks like this is being caused by having a service definition that is using a port that is not the same as the endpoint ports.  

The service in question exposes 8888, while the endpoints serve on 8080. When the oc expose command is run, it sets the route's target port to 8888, which results in no endpoints being chosen when the router plugin handles the route.

Related: https://github.com/openshift/origin/pull/5067
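
For illustration, a minimal sketch of a service definition that triggers this, using the names and ports from this report (the actual JSON in the training repo may differ): the service port is 8888 while the selected pods listen on 8080.

{
  "kind": "Service",
  "apiVersion": "v1",
  "metadata": {
    "name": "hello-service"
  },
  "spec": {
    "selector": {
      "name": "hello-openshift"
    },
    "ports": [
      {
        "port": 8888,
        "targetPort": 8080
      }
    ]
  }
}

oc expose then generates a route whose preferred port is the service port, 8888 (the PreferPort seen in routes.json above); since no endpoint serves 8888, the router plugin selects no servers and haproxy is left with an empty backend, hence the 503.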

Comment 9 zhaozhanqi 2015-10-29 08:12:07 UTC
(In reply to Paul Weil from comment #8)
> Michalis,
> 
> It looks like this is being caused by having a service definition that is
> using a port that is not the same as the endpoint ports. 

Does that mean that in the future the service port must be the same as the 'target port'? Or has the fixed code not been merged into AEP 3.1 yet? I reproduced this issue on AEP 3.1:

oc v3.0.2.903-29-g49953d6
kubernetes v1.2.0-alpha.1-1107-g4c8e6f4

Comment 10 Paul Weil 2015-10-29 12:35:12 UTC
(In reply to zhaozhanqi from comment #9)
> Does that mean that in the future the service port must be the same as the
> 'target port'? Or has the fixed code not been merged into AEP 3.1 yet? I
> reproduced this issue on AEP 3.1.

No. Michalis has fixed it; it just has not been merged yet.

Comment 11 openshift-github-bot 2015-10-29 15:40:45 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/af969529ab4af1776cd78dda61d142ee10d09cf9
Bug 1275003: expose: Set route port based on service target port

Route should be using the port used from the endpoints of the service
and not the port the service is exposing.
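
A sketch of what the generated route looks like with the fix, based on the commit message and the routes.json fragment in comment 12 (field names assumed from the v1 Route API): the route's target port now matches the endpoint port, 8080, rather than the service port, 8888.

{
  "kind": "Route",
  "apiVersion": "v1",
  "metadata": {
    "name": "hello-service"
  },
  "spec": {
    "host": "hello-service-demo.cloudapps.example.com",
    "to": {
      "kind": "Service",
      "name": "hello-service"
    },
    "port": {
      "targetPort": 8080
    }
  }
}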

Comment 12 zhaozhanqi 2015-11-02 10:34:06 UTC
This bug has been fixed in:
oc v3.0.2.905
kubernetes v1.2.0-alpha.1-1107-g4c8e6f4


# oc get endpoints
NAME          ENDPOINTS         AGE
hello-nginx   10.1.2.114:8080   2h


 "Path": "",
        "TLSTermination": "",
        "Certificates": null,
        "Status": "saved",
        "PreferPort": "8080",
        "InsecureEdgeTerminationPolicy": ""
      }

Comment 13 Brenton Leanhardt 2015-11-23 14:26:31 UTC
This fix is available in OpenShift Enterprise 3.1.