1343083 – not possible to start two router pods on same node

Summary: not possible to start two router pods on same node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ben Bennett
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1267746

Reported:	2016-06-06 13:12 UTC by Alexander Koksharov
Modified:	2022-08-04 22:20 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Feature: Added the ability to set the internal SNI port with an environment variable. This allows all ports to be changed so that multiple routers can be run on a single node. Reason: Multiple routers may be needed to support different features (sharding). Result:
Clone Of:
Environment:
Last Closed:	2016-09-27 09:33:43 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:1933	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.3 Release Advisory	2016-09-27 13:24:36 UTC

Description Alexander Koksharov 2016-06-06 13:12:56 UTC

Description of problem:

When two router pods are scheduled to run on the same node (with different listen IPs set through environment variables), the majority of requests to either router fail with error 503.

Version-Release number of selected component (if applicable):


How reproducible:
Start two router pods on a node

Steps to Reproduce:
1.
2.
3.

Actual results:
lots of requests fail

Expected results:
all requests forwarded to the application

Additional info:

- I saw this behavior when two router pods were running on the same node:
> for i in {1..10}; do curl -sSLko /dev/null -w '%{http_code}\n' https://hello-world-cake.apps.alko.lab:11443/; done | grep  200 |  wc -l
2
> for i in {1..10}; do curl -sSLko /dev/null -w '%{http_code}\n' https://hello-world-cake.apps.alko.lab:11443/; done | grep  200 |  wc -l
7

- The output below suggests a possible race between the two routers when binding to ports 10443 and 10444: only one haproxy process (PID 59318) is listening on them.
# netstat -nlp4 | grep haproxy | sort
tcp        0      0 0.0.0.0:11080           0.0.0.0:*               LISTEN      2468/haproxy        
tcp        0      0 0.0.0.0:11443           0.0.0.0:*               LISTEN      2468/haproxy        
tcp        0      0 0.0.0.0:1936            0.0.0.0:*               LISTEN      59318/haproxy       
tcp        0      0 0.0.0.0:1938            0.0.0.0:*               LISTEN      2468/haproxy        
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      59318/haproxy       
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      59318/haproxy       
tcp        0      0 127.0.0.1:10443         0.0.0.0:*               LISTEN      59318/haproxy       
tcp        0      0 127.0.0.1:10444         0.0.0.0:*               LISTEN      59318/haproxy       

- I found that the standard template has these ports hardcoded, so I altered the config for one of my routers:
root@master1 # oc get pods
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-2-qy76m   1/1       Running   0          8d
router-4-2lxqd            1/1       Running   0          8d
router-two-4-v0i7b        1/1       Running   0          9m

root@master1 # oc exec router-two-4-v0i7b  -- cat haproxy.config| grep -P "^\s+bind"
  bind :11080
  bind :11443
  bind 127.0.0.1:20444 ssl no-sslv3 crt /var/lib/haproxy/conf/default_pub_keys.pem crt /var/lib/containers/router/certs accept-proxy
  bind 127.0.0.1:20443 ssl no-sslv3 crt /var/lib/haproxy/conf/default_pub_keys.pem accept-proxy

root@master1 # oc exec router-4-2lxqd  -- cat haproxy.config| grep -P "^\s+bind"
  bind :80
  bind :443
  bind 127.0.0.1:10444 ssl no-sslv3 crt /var/lib/haproxy/conf/default_pub_keys.pem crt /var/lib/containers/router/certs accept-proxy
  bind 127.0.0.1:10443 ssl no-sslv3 crt /var/lib/haproxy/conf/default_pub_keys.pem accept-proxy

- As a result, I now have:
root@worknode1 # netstat -nlp4 | grep haproxy
tcp        0      0 0.0.0.0:11080           0.0.0.0:*               LISTEN      76799/haproxy       
tcp        0      0 127.0.0.1:10443         0.0.0.0:*               LISTEN      76777/haproxy       
tcp        0      0 127.0.0.1:10444         0.0.0.0:*               LISTEN      76777/haproxy       
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      76777/haproxy       
tcp        0      0 0.0.0.0:1936            0.0.0.0:*               LISTEN      76777/haproxy       
tcp        0      0 0.0.0.0:1938            0.0.0.0:*               LISTEN      76799/haproxy       
tcp        0      0 0.0.0.0:11443           0.0.0.0:*               LISTEN      76799/haproxy       
tcp        0      0 127.0.0.1:20443         0.0.0.0:*               LISTEN      76799/haproxy       
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      76777/haproxy       
tcp        0      0 127.0.0.1:20444         0.0.0.0:*               LISTEN      76799/haproxy       

alko@localhost > for i in {1..10}; do curl -sSLko /dev/null -w '%{http_code}\n' https://hello-world-cake.apps.alko.lab/; done | grep  200 |  wc -l
10
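
The comparison of the `bind` lines above can be automated. The following is a minimal sketch, not part of the original report: it assumes each router's haproxy.config has been saved to a local file, and the function names bind_ports and shared_ports are hypothetical helpers. It prints every port that both configs try to bind, which on a shared host network is a port the two routers would fight over.

```shell
# Print the port of every HAProxy "bind" line read from stdin, one per line.
bind_ports() {
  sed -n -E 's/^[[:space:]]*bind[[:space:]]+[^[:space:]]*:([0-9]+).*/\1/p' | sort -u
}

# Print the ports that appear in both config files ($1 and $2).
shared_ports() {
  a=$(bind_ports < "$1")
  # Nothing to compare if the first file has no bind lines.
  [ -n "$a" ] || return 0
  # -F: fixed strings, -x: whole-line match, so 443 does not match 10443.
  bind_ports < "$2" | grep -Fx -e "$a" || true
}

# Example (file names are placeholders):
#   oc exec router-4-2lxqd     -- cat haproxy.config > router.cfg
#   oc exec router-two-4-v0i7b -- cat haproxy.config > router-two.cfg
#   shared_ports router.cfg router-two.cfg
```

With the default template on both routers, this would be expected to list 10443 and 10444; after the change shown above it should print nothing.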

Comment 1 Aleks Lazic 2016-06-06 13:16:49 UTC

Hi.

Florian has fixed it like this.

https://github.com/git001/openshift_custom_haproxy_ext/pull/1

BR Aleks

Comment 2 Aleks Lazic 2016-06-06 13:43:40 UTC

(In reply to Aleks Lazic from comment #1)
> Hi.
> 
> Florian has fixed it like this.
> 
> https://github.com/git001/openshift_custom_haproxy_ext/pull/1
> 
> BR Aleks

I have added a PR to origin.

https://github.com/openshift/origin/pull/9175

BR Aleks

Comment 3 Josep 'Pep' Turro Mauri 2016-06-07 07:41:05 UTC

There is a similar report in bug 1268904: it's for a different pair of ports, but essentially the same thing I believe. Wondering if we should mark this as a duplicate and make 1268904 handle all the hardcoded values.

Comment 4 Aleks Lazic 2016-06-07 08:04:02 UTC

Well, if it gets fixed faster that way, I'm in.

Comment 5 Ben Bennett 2016-06-08 17:55:01 UTC

No, 1268904 has already merged and is slightly different.  This PR has been reviewed and should be merged shortly, so let's keep this as a separate bug for now.

Comment 7 Eric Rich 2016-07-06 12:50:14 UTC

Ben / Ram, 

I think we can move this to POST? As https://github.com/openshift/origin/commit/5d25a1da3da43bdb74decf641e91ce0245490438 is merged upstream, and is designed to fix this?

Comment 9 Ben Bennett 2016-07-08 17:46:48 UTC

(In reply to Eric Rich from comment #7)
> Ben / Ram, 
> 
> I think we can move this to POST? As
> https://github.com/openshift/origin/commit/
> 5d25a1da3da43bdb74decf641e91ce0245490438 is merged upstream, and is designed
> to fix this?

That is correct.

Comment 10 Aleks Lazic 2016-07-11 19:15:03 UTC

Does this mean that we can expect this template in OpenShift Enterprise with the next update?

A more concrete question:
What does POST mean for end users such as the RH OSE customers out there?

Comment 11 Jaspreet Kaur 2016-07-22 10:19:38 UTC

Hello,

Can we have an ETA for when this is expected to be fixed?

Regards,
Jaspreet

Comment 13 Ben Bennett 2016-07-22 13:59:42 UTC

It should be in 3.3.

As a work-around, on 3.2 you can replace the template in a router without rebuilding an image.  You can do that by making a ConfigMap that contains the changed template and then changing the router DC.  So, you'd pull the current router image and then apply the change in https://github.com/openshift/origin/commit/5d25a1da3da43bdb74decf641e91ce0245490438 to the new template.

A guide:
https://github.com/openshift/openshift-docs/blob/master/install_config/install/deploy_router.adoc#using-configmap-replace-template
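
The workaround above might look roughly like the sketch below. It is not taken from this report: the ConfigMap name customrouter, the volume name, the mount path, the TEMPLATE_FILE variable, and the function name replace_router_template are assumptions based on my reading of the linked guide, so check them against the guide for your version. The script only prints the commands unless OC is set to a real client.

```shell
# Dry run by default: the commands are printed, not executed.
# Set OC=oc to run them against a cluster.
OC=${OC:-echo oc}

# Step 0 (manual): copy the template out of a running router pod and apply the
# change from the upstream commit to it, for example:
#   oc exec router-4-2lxqd -- cat haproxy-config.template > haproxy-config.template

# Usage: replace_router_template <dc-name>
replace_router_template() {
  # Store the edited template in a ConfigMap (name is an assumption).
  $OC create configmap customrouter --from-file=haproxy-config.template
  # Mount the ConfigMap into the router pods.
  $OC volume "dc/$1" --add --overwrite \
    --name=config-volume \
    --mount-path=/var/lib/haproxy/conf/custom \
    --source='{"configMap": {"name": "customrouter"}}'
  # Point the router at the mounted template.
  $OC env "dc/$1" \
    TEMPLATE_FILE=/var/lib/haproxy/conf/custom/haproxy-config.template
}

replace_router_template router
```

Changing the DC triggers a new router deployment, so the modified template should take effect once the new pod is running.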

Comment 14 zhaozhanqi 2016-08-16 09:08:33 UTC

Verified this bug in:

# openshift version
openshift v3.3.0.21
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git


$ for i in {1..10} ; do curl --resolve test-service-default.0816-j34.qe.rhcloud.com:10443:172.18.7.237 https://test-service-default.0816-j34.qe.rhcloud.com:10443 -k ; done
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
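
For reference, moving a second router onto its own set of ports might look like the sketch below. This report does not spell out the variable names: ROUTER_SERVICE_HTTP_PORT, ROUTER_SERVICE_HTTPS_PORT, ROUTER_SERVICE_SNI_PORT, ROUTER_SERVICE_NO_SNI_PORT and STATS_PORT are assumptions to verify against the router documentation for your release, and the DC name router-two and the helper set_router_ports are hypothetical. The port values are the ones used for the second router in the description. The script only prints the command unless OC is set to a real client.

```shell
# Dry run by default: the command is printed, not executed.
# Set OC=oc to apply it to a cluster.
OC=${OC:-echo oc}

# Usage: set_router_ports <dc-name> <http> <https> <sni> <no-sni> <stats>
# Variable names are assumptions; confirm them for your router image.
set_router_ports() {
  $OC env "dc/$1" \
    "ROUTER_SERVICE_HTTP_PORT=$2" \
    "ROUTER_SERVICE_HTTPS_PORT=$3" \
    "ROUTER_SERVICE_SNI_PORT=$4" \
    "ROUTER_SERVICE_NO_SNI_PORT=$5" \
    "STATS_PORT=$6"
}

# Ports taken from the second router in the description of this bug.
set_router_ports router-two 11080 11443 20444 20443 1938
```

The point of the fix is that the two internal ports (10443 and 10444 by default) no longer have to be edited in the template by hand.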

Comment 16 errata-xmlrpc 2016-09-27 09:33:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933
