Bug 968334

Summary: Head gear haproxy configs hard code the IP address of the sub gears...
Product: OpenShift Online
Reporter: Thomas Wiest <twiest>
Component: Containers
Assignee: Mrunal Patel <mpatel>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 2.x
CC: agoldste, bhatiam, bmeng, ccoleman, jhonce, mpatel
Target Milestone: ---
Keywords: UpcomingRelease
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2014-01-24 03:22:28 UTC
Type: Bug

Description Thomas Wiest 2013-05-29 13:27:34 UTC
Description of problem:
The head gear haproxy configs hard code the IP address of the sub gears.

For example:
    server gear-51a5133903ef64c23f000122-REDACTED 10.154.132.226:36121 check fall 2 rise 3 inter 2000 cookie 51a5133903ef64c23f000122-REDACTED

The IP address 10.154.132.226 should really be a DNS entry for the gear.

In other words, that line should look like this:
    server gear-51a5133903ef64c23f000122-REDACTED 51a5133903ef64c23f000122-REDACTED.rhcloud.com:36121 check fall 2 rise 3 inter 2000 cookie 51a5133903ef64c23f000122-REDACTED


Hard-coding the IP address is bad for a number of reasons, but primarily because the IP address of a gear can change. When that happens, the haproxy config is broken and won't work.

The IP address of a gear can change for a couple of reasons:
1. The ex-node has problems and must be stopped and started so that it's placed on a different physical VM host.
2. We move the gear from one ex-node to another.

Both of these happen quite often in PROD.

Once either of these happens, the haproxy configs are wrong until some event causes the haproxy config file to be rewritten (like a scale-up or scale-down event).

However, many users lock their apps to a fixed number of gears, which means they will never receive those config file updates.


Version-Release number of selected component (if applicable):
openshift-origin-cartridge-haproxy-0.4.7-1.el6oso.noarch
openshift-origin-cartridge-haproxy-1.4-1.9.3-1.el6oso.noarch


How reproducible:
Very


Steps to Reproduce:
1. Create a scaled app
2. Look at the head gear's haproxy config file, which is located here: /var/lib/openshift/$UUID/haproxy/conf/haproxy.cfg
3. Notice that it has IP addresses in it (see the sketch below).
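
For illustration, a rough sketch along these lines (none of this exists in the cartridge; the path reuses the example gear UUID from above, and deriving the gear DNS by stripping the "gear-" prefix and appending ".rhcloud.com" is only inferred from the example lines earlier in this report) will list the server entries that are hard-coded to an IP and show what the gear's DNS currently resolves to:

    import re
    import socket

    # Example head gear config path from the reproduction steps above.
    cfg = "/var/lib/openshift/51a5133903ef64c23f000122/haproxy/conf/haproxy.cfg"

    # Matches lines like:
    #   server gear-<uuid>-<namespace> <address>:<port> check ...
    server_re = re.compile(r"^\s*server\s+gear-(\S+)\s+(\S+):(\d+)\s")
    ipv4_re = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")

    with open(cfg) as f:
        for line in f:
            m = server_re.match(line)
            if not m:
                continue
            gear, addr = m.group(1), m.group(2)
            if not ipv4_re.match(addr):
                continue  # already a hostname, nothing to flag
            # Assumed gear DNS form, inferred from the example above.
            dns_name = "%s.rhcloud.com" % gear
            try:
                resolved = socket.gethostbyname(dns_name)
            except socket.gaierror:
                resolved = "unresolvable"
            print("%s is hard-coded to %s (DNS %s currently resolves to %s)"
                  % (gear, addr, dns_name, resolved))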


Actual results:
haproxy config file uses IP addresses


Expected results:
The haproxy config file should use the gear's DNS entry, for exactly the same reason that we tell end users not to use IP addresses and to use the gear DNS instead.

Comment 1 Mrunal Patel 2013-05-29 15:23:13 UTC
The reason we went with IPs was that DNS took time to resolve. Also, AFAIK
the connection hooks are run after a move, and they should fix up the haproxy configuration. If they aren't being run, then they should be.

Comment 2 Thomas Wiest 2013-05-29 15:34:27 UTC
I see, ok, well that still doesn't address the problem of stopping and starting an ex-node in AWS.

Unfortunately, for a number of reasons, stopping and starting instances in AWS is something we have to be able to do without breaking a bunch of apps / gears.

According to our internal monitoring, Dyn now averages 10.15 seconds between when we register a new DNS entry and when it propagates to AWS.

It does sometimes take much longer (like minutes), but that's pretty rare these days.

Comment 3 Mrunal Patel 2013-05-29 18:03:38 UTC
It might be possible to use the validate configuration feature to fix the haproxy config.

Comment 4 Clayton Coleman 2013-06-17 15:53:23 UTC
We've talked about this a few times, and per-gear DNS records may be the only way to do this across all the use cases we are going to need to support.

Comment 5 Andy Goldstein 2013-11-04 15:05:38 UTC
In 2.0.35 we now use the public_hostname instead of the IP address.

Comment 6 Meng Bo 2013-11-05 06:17:36 UTC
Checked on latest STG (devenv-stage_552); the issue has been fixed.


$ cat haproxy.cfg
<--->
    server gear-52788aaadbd93cc3ab00004b-bmeng1stg ex-std-node1.stg.rhcloud.com:43401 check fall 2 rise 3 inter 2000 cookie 52788aaadbd93cc3ab00004b-bmeng1stg


Moving bug to verified.