Description of problem:
The 3.6 router introduces a new port (1935) named "router-stats". When multiple routers are deployed on the same host, only one router is deployed successfully; the others fail because port 1935 is already in use by the first router.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.84-1.git.0.72b2d74.el7.noarch
openshift v3.6.74
kubernetes v1.6.1+5115d708d7

How reproducible:
Always

Steps to Reproduce:
1. Set the following openshift_hosted_routers option in the inventory host file, and trigger a cluster install that has only a single node.
openshift_hosted_routers=[{"name": "router1", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1936, "ports": ["80:80", "443:443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": []}, {"name": "router2", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1939, "ports": ["8080:8080", "6443:6443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": [{"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "443"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "6443"}}, {"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "80"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "8080"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "NAMESPACE_LABELS", "value": "n=install-test"}}]}]
2. The router2 deployment fails because the default port 1935 is already in use by router1.
3. Try to update the openshift_hosted_routers option as follows to change the default 1935 to 1940:
openshift_hosted_routers=[{"name": "router1", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1936, "ports": ["80:80", "443:443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": []}, {"name": "router2", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1939, "ports": ["8080:8080", "6443:6443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": [{"action": "update", "curr_value": {"name": "ROUTER_LISTEN_ADDR", "value": "0.0.0.0:1935"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_LISTEN_ADDR", "value": "0.0.0.0:1940"}},{"action": "update", "curr_value": {"containerPort": 1935, "hostPort": 1935, "name": "router-stats", "protocol": "TCP"}, "key": "spec.template.spec.containers[0].ports", "value": {"containerPort": 1940, "hostPort": 1940, "name": "router-stats", "protocol": "TCP"}},{"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "443"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "6443"}}, {"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "80"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "8080"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "NAMESPACE_LABELS", "value": "n=install-test"}}]}]

Actual results:
router2 still fails to deploy. Although I added an update action to change the default port 1935 to 1940, only spec.template.spec.containers[0].env is updated; spec.template.spec.containers[0].ports is not.
After installation, router2's dc shows the following:
# oc get dc router2 -o yaml
<--snip-->
        - name: ROUTER_LISTEN_ADDR
          value: 0.0.0.0:1940
<--snip-->
        - containerPort: 1935
          hostPort: 1935
          name: router-stats
          protocol: TCP
<--snip-->

After the above failure, I edited router2 manually to change 1935 to 1940, and router2 was deployed successfully.

Expected results:
The installer should allow the user to update spec.template.spec.containers[0].ports when installing multiple router shards on the same host.

Additional info:
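The port collision described in the steps above can be illustrated with a small Python sketch. It assumes, as reported here, that every router on this build claims host port 1935 in addition to its configured ports. The helper names (host_ports, find_conflicts, IMPLICIT_ROUTER_STATS_PORT) are hypothetical and are not part of openshift-ansible; the sketch also assumes the host port is the right-hand side of each "ports" pair, which makes no difference here because both sides are equal in every mapping in this report.

```python
from collections import defaultdict

# Assumption taken from this report: each router also claims hostPort 1935
# for the "router-stats" port. This constant is illustrative only.
IMPLICIT_ROUTER_STATS_PORT = 1935


def host_ports(router, implicit_port=IMPLICIT_ROUTER_STATS_PORT):
    """Return the set of host ports one router definition would claim."""
    # Assumed pair layout: "<containerPort>:<hostPort>".
    ports = {int(mapping.split(":")[-1]) for mapping in router.get("ports", [])}
    ports.add(int(router["stats_port"]))
    if implicit_port is not None:
        ports.add(implicit_port)
    return ports


def find_conflicts(routers, implicit_port=IMPLICIT_ROUTER_STATS_PORT):
    """Map each host port claimed by more than one router to the router names."""
    claimed = defaultdict(list)
    for router in routers:
        for port in sorted(host_ports(router, implicit_port)):
            claimed[port].append(router["name"])
    return {port: names for port, names in claimed.items() if len(names) > 1}


# Trimmed copies of the two router definitions from step 1.
routers = [
    {"name": "router1", "stats_port": 1936, "ports": ["80:80", "443:443"]},
    {"name": "router2", "stats_port": 1939, "ports": ["8080:8080", "6443:6443"]},
]

print(find_conflicts(routers))                      # {1935: ['router1', 'router2']}
print(find_conflicts(routers, implicit_port=None))  # {}
```

The configured ports alone do not overlap; only the implicit 1935 port makes the two routers unschedulable on the same node.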
Need to plumb through the ability to set router stats port.
BTW, I also opened an RFE bug (BZ#1452019) for the router command line to allow the user to set a customized router stats port. Once BZ#1452019 is fixed, the installer only needs a small code change to support a "router_stats_port" setting, just like the "stats_port" setting we have now. Before the RFE bug is fixed, I thought I could work around it via a spec.template.spec.containers[0].ports update, but that does NOT work. Now I am not sure: does the installer not support spec.template.spec.containers[0].ports updates and only support spec.template.spec.containers[0].env updates, or is my JSON syntax wrong?
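One way to rule out the JSON-syntax possibility is to parse the ports edit stanza on its own. The Python sketch below does only that: it confirms the stanza is well-formed JSON and carries the same keys as the other edits in this report. The check_edit helper and its list of required keys are assumptions inferred from the edits shown here, not from installer documentation, and the sketch says nothing about whether the installer actually applies the edit.

```python
import json

# The ports edit from the second inventory attempt, copied from this report.
ports_edit = """
{"action": "update",
 "curr_value": {"containerPort": 1935, "hostPort": 1935, "name": "router-stats", "protocol": "TCP"},
 "key": "spec.template.spec.containers[0].ports",
 "value": {"containerPort": 1940, "hostPort": 1940, "name": "router-stats", "protocol": "TCP"}}
"""


def check_edit(text):
    """Parse one edit stanza and return the names of any missing keys.

    Hypothetical helper: the required keys are inferred from the edits in
    this report, not from the installer's own validation.
    """
    edit = json.loads(text)  # raises json.JSONDecodeError on a syntax error
    required = {"action", "key", "value"}
    if edit.get("action") == "update":
        required.add("curr_value")
    return sorted(required - set(edit))


print(check_edit(ports_edit))  # [] means the stanza parsed and no key is missing
```

If this prints an empty list, the stanza itself is syntactically fine, which points toward the installer or the router command rather than the inventory text.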
I have run the installer and tried every conceivable permutation of the EDIT params. The closest I could get to a working dual-router-single-host deployment was using this syntax [1].

The problem I am running into, which causes the PodFitsHostPorts condition to fail in the second router deployment, is editing the 'router-stats' value in the spec.template.spec.containers[0].ports list. Even with an explicit edit->update->new-ports stanza (see gist [0]), the resources are both ultimately created with:

> - containerPort: 1935
>   hostPort: 1935
>   name: router-stats
>   protocol: TCP

Stranger still, the other item in that list, the one with 'name: stats', is mutable. I can create two routers at the same time and modify the 'router-stats' port as I want.

This matches up pretty well with what you experienced. Something is fundamentally hard-coding that 'stats' port to 1935.

If I 'oc edit dc router2' I can change the 'stats' port value to 1940 (or anything else I wish), then do an 'oc rollout retry dc router3', and both routers come online and healthy.

I suppose that if we wanted to support this in openshift-ansible we would have to edit the DC for router n+1 to ensure each router has unique ports set correctly. But this looks like an ugly hack. Just look at the length of the `openshift_hosted_routers` string in the gist [0].

Another thing I noticed: when I edit 'router2' in the gist, the 'ROUTER_LISTEN_ADDR' field is actually missing from the deployment. However, the unmodified 'router3' has the field present with its default value.

I am not sure what the best thing to do here is. Is this some kind of upstream bug? Some strange hard-coded behavior?

[0] https://gist.github.com/tbielawa/ae4a1f4163748094589ec6a839c97bd7
(In reply to Tim Bielawa from comment #3)
> Stranger is that the other item in that list, with "name: 'stats" is
> mutable.

I do not think that takes effect because of your explicit edit->update->new-ports stanza; it is more likely the '"stats_port": 1941' setting that achieves it. That probably means that whether you are trying to update "router-stats" or "stats", an edit->update->new-ports update has no effect at all.
According to this comment [1] by Clayton, the `stats-port` that gets created on port 1935 is part of the haproxy prometheus stats. It seems that, by design [2], the new "stats-port" will always listen on port 1935, regardless of your specified --stats-port value, as long as the router type is "haproxy".

[1] https://github.com/openshift/origin/issues/14759#issuecomment-309573659
[2] https://github.com/openshift/origin/blob/master/pkg/cmd/admin/router/router.go#L685
This is being changed in https://github.com/openshift/origin/pull/14790

1. Asking for the new style metrics port (on by default) completely overrides the old stats port, and the old haproxy stats page will not be available on that port anymore, just /healthz and /metrics.
2. --listen-addr now overrides stats-port.
3. The new 1935 port is completely removed.
4. A new health check test for reload-haproxy and from the openshift-router process will check against the primary public port (80) and verify that when no HOST is set a 503 is returned. Someone customizing the template is also expected to customize the reload behavior. It also handles PROXY mode with a different health check style.
5. A new env var ROUTER_METRICS_READY_HTTP_URL may be passed for template customizers who want to run the stats port in a different fashion.
This should be fixed in 3.6.128 of OCP.
I verified this was fixed in a recent build:
atomic-openshift-master-3.6.129-1.git.0.f01d244.el7.x86_64

oc get pods
NAME                      READY     STATUS    RESTARTS   AGE
router-external-1-18njj   1/1       Running   0          40s
router-external-1-cd8b7   1/1       Running   0          40s
router-internal-1-1x72g   1/1       Running   0          23s
router-internal-1-z4sgz   1/1       Running   0          23s

oc adm router -o yaml --dry-run:
  spec:
    ports:
    - name: 80-tcp
      port: 80
      targetPort: 80
    - name: 443-tcp
      port: 443
      targetPort: 443
    - name: 1936-tcp
      port: 1936
      protocol: TCP
      targetPort: 1936

Thanks Juan and Clayton for the work on this!
Verified this bug with openshift v3.6.131, and it PASSES.

# oc get po -o wide
NAME              READY     STATUS    RESTARTS   AGE       IP              NODE
router-1-86kgd    1/1       Running   0          5m        172.18.3.255    ip-172-18-3-255.ec2.internal
router-1-nv1vf    1/1       Running   0          5m        172.18.22.240   ip-172-18-22-240.ec2.internal
router1-1-c8wfp   1/1       Running   0          5m        172.18.3.255    ip-172-18-3-255.ec2.internal

Port 1935 is now completely removed. Deploying multiple router shards still uses the original inventory syntax:

openshift_hosted_routers=[{"name": "router", "replicas": 2, "serviceaccount": "router", "namespace": "default", "stats_port": 1936, "ports": ["80:80", "443:443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {"certfile": "{{ lookup(\"env\", \"WORKSPACE\") }}/private-openshift-misc/v3-launch-templates/functionality-testing/aos-36/extra-ansible/files/custom_router_1.crt", "keyfile": "{{ lookup(\"env\", \"WORKSPACE\") }}/private-openshift-misc/v3-launch-templates/functionality-testing/aos-36/extra-ansible/files/custom_router_1.key", "cafile": "{{ lookup(\"env\", \"WORKSPACE\") }}/private-openshift-misc/v3-launch-templates/functionality-testing/aos-36/extra-ansible/files/custom_router_1_rootca.crt"}, "selector": "role=node,router=enabled", "edits": [{"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTE_LABELS", "value": "route=external"}}]}, {"name": "router1", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1937, "ports": ["7080:7080", "7443:7443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": [{"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "443"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "7443"}}, {"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "80"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "7080"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "NAMESPACE_LABELS", "value": "n=install-test"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_SNI_PORT", "value": "10445"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_NO_SNI_PORT", "value": "10442"}}]}]
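Because the openshift_hosted_routers value is long and easy to mistype, it may be simpler to generate it than to write it by hand. The Python sketch below rebuilds a value of the same shape as the one above using only the standard library. The helper names (update_env, append_env, make_router) are hypothetical, and the sketch leaves "certificate" empty for both routers rather than reproducing the certificate paths used in the verification run.

```python
import json

ENV_KEY = "spec.template.spec.containers[0].env"
IMAGE = "registry.xxx.com/openshift3/ose-${component}:${version}"


def update_env(name, old, new):
    """Build an 'update' edit that replaces one env var value."""
    return {"action": "update", "curr_value": {"name": name, "value": old},
            "key": ENV_KEY, "value": {"name": name, "value": new}}


def append_env(name, value):
    """Build an 'append' edit that adds one env var."""
    return {"action": "append", "key": ENV_KEY,
            "value": {"name": name, "value": value}}


def make_router(name, stats_port, http=80, https=443, replicas=1, extra_edits=()):
    """Build one router entry; non-default ports get matching env updates."""
    edits = []
    if https != 443:
        edits.append(update_env("ROUTER_SERVICE_HTTPS_PORT", "443", str(https)))
    if http != 80:
        edits.append(update_env("ROUTER_SERVICE_HTTP_PORT", "80", str(http)))
    edits.extend(extra_edits)
    return {
        "name": name, "replicas": replicas, "serviceaccount": "router",
        "namespace": "default", "stats_port": stats_port,
        "ports": [f"{http}:{http}", f"{https}:{https}"],
        "images": IMAGE, "certificate": {},
        "selector": "role=node,router=enabled", "edits": edits,
    }


routers = [
    make_router("router", 1936, replicas=2,
                extra_edits=[append_env("ROUTE_LABELS", "route=external")]),
    make_router("router1", 1937, http=7080, https=7443,
                extra_edits=[append_env("NAMESPACE_LABELS", "n=install-test"),
                             append_env("ROUTER_SERVICE_SNI_PORT", "10445"),
                             append_env("ROUTER_SERVICE_NO_SNI_PORT", "10442")]),
]

# One line suitable for pasting into the inventory file.
inventory_line = "openshift_hosted_routers=" + json.dumps(routers)
print(inventory_line)
```

Generating the value this way guarantees well-formed JSON and keeps the port numbers in "ports" and in the ROUTER_SERVICE_* edits in sync.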
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716