Description of problem:
The head gear haproxy configs hard-code the IP address of the sub gears. For example:

server gear-51a5133903ef64c23f000122-REDACTED 10.154.132.226:36121 check fall 2 rise 3 inter 2000 cookie 51a5133903ef64c23f000122-REDACTED

The IP address 10.154.132.226 should really be a DNS entry for the gear. In other words, that line should look like this:

server gear-51a5133903ef64c23f000122-REDACTED 51a5133903ef64c23f000122-REDACTED.rhcloud.com:36121 check fall 2 rise 3 inter 2000 cookie 51a5133903ef64c23f000122-REDACTED

Hard-coding the IP address is bad for a number of reasons, but primarily it's bad when the IP address of the gear changes. When this happens, the haproxy config is broken and won't work. The IP address of a gear can change for a couple of reasons:

1. The ex-node has problems and must be stopped and started so that it's placed on a different physical VM host.
2. We move the gear from one ex-node to another.

Both of these happen quite often in PROD. Once they happen, the haproxy configs are wrong until some event causes the haproxy config file to be rewritten (like a scale-up or scale-down event). However, many users lock the number of gears they run to a certain number, which means they will never receive the config file updates.

Version-Release number of selected component (if applicable):
openshift-origin-cartridge-haproxy-0.4.7-1.el6oso.noarch
openshift-origin-cartridge-haproxy-1.4-1.9.3-1.el6oso.noarch

How reproducible:
Very

Steps to Reproduce:
1. Create a scaled app.
2. Look at the head gear's haproxy config file, which is located here: /var/lib/openshift/$UUID/haproxy/conf/haproxy.cfg
3. Notice that it has IP addresses in it.

Actual results:
The haproxy config file uses IP addresses.

Expected results:
The haproxy config file should use the gear's DNS name, for exactly the same reason that we tell end users not to use IP addresses and to use the gear DNS instead.
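A quick way to confirm step 3 of the reproduction is to grep the config for dotted-quad addresses on "server" lines. This is just an illustrative check, not part of the product; the default path and the $UUID placeholder come from the reproduction steps above.

```shell
#!/bin/sh
# Sketch: flag "server" lines in a haproxy.cfg that point at a raw IPv4
# address instead of a DNS name. Pass the config path as $1, or rely on
# the default path from the reproduction steps ($UUID is a placeholder).
CFG="${1:-/var/lib/openshift/$UUID/haproxy/conf/haproxy.cfg}"

if grep -E 'server[[:space:]]+[^ ]+[[:space:]]+([0-9]{1,3}\.){3}[0-9]{1,3}:' "$CFG"; then
    echo "WARNING: $CFG contains hard-coded IP addresses"
else
    echo "OK: $CFG uses hostnames"
fi
```

A healthy config (after the fix described below) should print the "OK" line, since every backend is addressed by hostname rather than IP.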
The reason we went with IPs was that DNS took time to resolve. Also, AFAIK the connection hooks are run after a move, and they should fix up the haproxy configuration. If they aren't being run, they should be.
I see, OK, but that still doesn't address the problem of stopping and starting an ex-node in AWS. Unfortunately, for a number of reasons, stopping and starting instances in AWS is something we have to be able to do without breaking a bunch of apps/gears. According to our internal monitoring, Dyn now takes 10.15 seconds on average between when we register a new DNS record and when it propagates to AWS. It does sometimes take much longer (minutes), but that's pretty rare these days.
It might be possible to use the validate-configuration feature to fix the haproxy config.
We've talked about this a few times, and per-gear DNS records may be the only way to do this across all the use cases we are going to need to support.
In 2.0.35 we now use the public_hostname instead of the IP address.
Checked on latest STG (devenv-stage_552); the issue has been fixed.

$ cat haproxy.cfg
<--->
server gear-52788aaadbd93cc3ab00004b-bmeng1stg ex-std-node1.stg.rhcloud.com:43401 check fall 2 rise 3 inter 2000 cookie 52788aaadbd93cc3ab00004b-bmeng1stg

Moving bug to VERIFIED.