Bug 1455459 - multiple router shards deployment failed on the same host due to newly introduced router-stats port.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Kenny Woodson
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-25 09:00 UTC by Johnny Liu
Modified: 2017-08-16 19:51 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-10 05:25:32 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Description Johnny Liu 2017-05-25 09:00:28 UTC
Description of problem:
The 3.6 router now introduces a new port (1935) named "router-stats". When deploying multiple routers on the same host, only one router is deployed successfully; the others fail because port 1935 is already in use by the first router.
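
One way to see the failure up close is to inspect the second router's pending pod; a sketch (the pod name is a placeholder, and the exact event wording depends on the scheduler version):

# oc get pods -n default                          (the second router's pod never leaves Pending)
# oc describe pod router2-1-<pod-id> -n default   (its events report a host-port scheduling failure, e.g. PodFitsHostPorts, because 1935 is already claimed)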

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.84-1.git.0.72b2d74.el7.noarch
openshift v3.6.74
kubernetes v1.6.1+5115d708d7

How reproducible:
Always

Steps to Reproduce:
1. Set the following openshift_hosted_routers option in the inventory host file, and trigger a cluster install with only a single node.

openshift_hosted_routers=[{"name": "router1", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1936, "ports": ["80:80", "443:443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": []}, {"name": "router2", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1939, "ports": ["8080:8080", "6443:6443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": [{"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "443"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "6443"}}, {"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "80"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "8080"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "NAMESPACE_LABELS", "value": "n=install-test"}}]}]

2. The router2 deployment fails because the default port 1935 is already in use by router1.
3. Try to update the openshift_hosted_routers option as follows, changing the default 1935 to 1940:
openshift_hosted_routers=[{"name": "router1", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1936, "ports": ["80:80", "443:443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": []}, {"name": "router2", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1939, "ports": ["8080:8080", "6443:6443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": [{"action": "update", "curr_value": {"name": "ROUTER_LISTEN_ADDR", "value": "0.0.0.0:1935"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_LISTEN_ADDR", "value": "0.0.0.0:1940"}},{"action": "update", "curr_value": {"containerPort": 1935, "hostPort": 1935, "name": "router-stats", "protocol": "TCP"}, "key": "spec.template.spec.containers[0].ports", "value": {"containerPort": 1940, "hostPort": 1940, "name": "router-stats", "protocol": "TCP"}},{"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "443"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "6443"}}, {"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "80"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "8080"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "NAMESPACE_LABELS", "value": "n=install-test"}}]}]
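
For readability, the two port-related edits added in step 3 are shown below exactly as they appear in the inventory line above, just pretty-printed:

  {"action": "update",
   "key": "spec.template.spec.containers[0].env",
   "curr_value": {"name": "ROUTER_LISTEN_ADDR", "value": "0.0.0.0:1935"},
   "value":      {"name": "ROUTER_LISTEN_ADDR", "value": "0.0.0.0:1940"}}

  {"action": "update",
   "key": "spec.template.spec.containers[0].ports",
   "curr_value": {"containerPort": 1935, "hostPort": 1935, "name": "router-stats", "protocol": "TCP"},
   "value":      {"containerPort": 1940, "hostPort": 1940, "name": "router-stats", "protocol": "TCP"}}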


Actual results:
router2 still fails to deploy. Although I added an update action to change the default 1935 port to 1940, only spec.template.spec.containers[0].env is updated; spec.template.spec.containers[0].ports is not.

After installation, router2's dc shows the following:
# oc get dc router2 -o yaml
<--snip-->
        - name: ROUTER_LISTEN_ADDR
          value: 0.0.0.0:1940
<--snip-->
        - containerPort: 1935
          hostPort: 1935
          name: router-stats
          protocol: TCP
<--snip-->

After the above failure, I edited router2 manually to change 1935 to 1940, and router2 deployed successfully.
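
The same manual fix can be scripted with a JSON patch; a minimal sketch, assuming the router-stats entry is the last item (index 3) of the container's ports list (verify the index with 'oc get dc router2 -o yaml' first):

# oc patch dc/router2 --type=json \
    -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/ports/3/containerPort", "value": 1940},
         {"op": "replace", "path": "/spec/template/spec/containers/0/ports/3/hostPort", "value": 1940}]'

A config-change trigger should then roll out the new settings; if the earlier deployment is stuck, 'oc rollout retry dc/router2' can be used.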

Expected results:
The installer should allow the user to update spec.template.spec.containers[0].ports when installing multiple router shards on the same host.

Additional info:

Comment 1 Scott Dodson 2017-05-25 12:52:48 UTC
Need to plumb through the ability to set router stats port.

Comment 2 Johnny Liu 2017-05-26 02:43:24 UTC
BTW, I also opened an RFE bug (BZ#1452019) for the router command line to allow the user to set a customized router stats port.

Once BZ#1452019 is fixed, the installer only needs a small code change to support a "router_stats_port" setting, just like the "stats_port" setting we have now.
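
Once that lands, the inventory could express both ports directly. A purely hypothetical sketch (the "router_stats_port" key does not exist yet; its name and shape are assumptions modeled on the existing "stats_port" key, and unrelated keys are omitted for brevity):

openshift_hosted_routers=[{"name": "router1", "replicas": 1, "namespace": "default", "stats_port": 1936, "router_stats_port": 1935, "ports": ["80:80", "443:443"], "edits": []}, {"name": "router2", "replicas": 1, "namespace": "default", "stats_port": 1939, "router_stats_port": 1940, "ports": ["8080:8080", "6443:6443"], "edits": []}]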

Before the RFE bug is fixed, I thought I could work around it via a spec.template.spec.containers[0].ports update, but that does NOT work. Now I am not sure whether the installer does not support spec.template.spec.containers[0].ports updates and only supports spec.template.spec.containers[0].env updates, or whether my JSON syntax is wrong.

Comment 3 Tim Bielawa 2017-06-06 17:58:33 UTC
I have run the installer and tried every conceivable permutation of the EDIT params. The closest I could get to a working dual-router-single-host deployment was using this syntax [0].

The problem I am running into, which is causing the PodFitsHostPorts condition to fail in the second router deployment, is the editing of the 'router-stats' value in the spec.template.spec.containers[0].ports list.

Even with an explicit edit->update->new-ports stanza (see gist [0]), the resources are both ultimately created with:

>     - containerPort: 1935
>       hostPort: 1935
>       name: router-stats
>       protocol: TCP

Stranger still is that the other item in that list, with name 'stats', is mutable. I can create two routers at the same time and modify the 'stats' port as I want.

This matches up pretty well with what you experienced. Something is fundamentally hard-coding that 'router-stats' port to 1935.

If I 'oc edit dc router2' I can change the 'stats' port value to 1940 (or anything else if I wish), and then do an 'oc rollout retry dc router3' and they will both be online and healthy.
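
Spelled out as commands, that workaround is roughly (router2 and router3 are the router names from the gist; the edit is interactive):

# oc edit dc/router2            (change the hard-coded 1935 stats entry to 1940)
# oc rollout retry dc/router3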

I suppose that if we wanted to support this in openshift-ansible we would have to do some hacks to edit the DC for router n+1 to ensure they have unique ports set correctly.

But this looks like an ugly hack. Just look at the length of the `openshift_hosted_routers` string in the gist [0].

Another thing I noticed: when I edit 'router2' in the gist, the 'ROUTER_LISTEN_ADDR' field is actually missing from the deployment. However, the unmodified 'router3' has the field present with its default value.

I am not sure what the best thing to do here is. Is this some kind of upstream bug? Some strange hard-coded behavior?


[0] https://gist.github.com/tbielawa/ae4a1f4163748094589ec6a839c97bd7

Comment 4 Johnny Liu 2017-06-07 05:57:18 UTC
(In reply to Tim Bielawa from comment #3)
> Stranger still is that the other item in that list, with name 'stats', is
> mutable.
I do not think that takes effect because of your explicit edit->update->new-ports stanza; it should be '"stats_port": 1941' that achieves that. That means that, probably, no matter whether you try to update "router-stats" or "stats", any edit->update->new-ports update has no effect at all.

Comment 5 Juan Vallejo 2017-06-19 21:20:41 UTC
According to this comment [1] by Clayton, the `stats-port` that gets created on port 1935 is part of the haproxy prometheus stats. It seems that, by design [2], the new "stats-port" will always listen on port 1935, regardless of your specified --stats-port value, as long as the router type is "haproxy".

1. https://github.com/openshift/origin/issues/14759#issuecomment-309573659

2. https://github.com/openshift/origin/blob/master/pkg/cmd/admin/router/router.go#L685
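
A quick way to see that hard-coded entry on an existing deployment config (router2 is the name used earlier in this bug):

# oc get dc router2 -o yaml | grep -B 2 -A 1 'name: router-stats'

The containerPort/hostPort lines still show 1935 no matter what --stats-port was passed.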

Comment 6 Clayton Coleman 2017-06-21 02:11:45 UTC
This is being changed in https://github.com/openshift/origin/pull/14790

1. Asking for the new style metrics port (on by default) completely overrides the old stats port, and the old haproxy stats page will not be available on that port anymore, just /healthz and /metrics
2. --listen-addr now overrides stats-port
3. The new 1935 port is completely removed
4. A new health check test for reload-haproxy and from the openshift-router process will check against the primary public port (80) and verify that when no HOST is set a 503 is returned. Someone customizing the template is also expected to customize the reload behavior. It also handles PROXY mode with a different health check style.
5. A new env var ROUTER_METRICS_READY_HTTP_URL may be passed for template customizers who want to run the stats port in a different fashion.
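
For installer-driven deployments, item 5 could presumably be set through the same "edits" mechanism used elsewhere in this bug. A sketch only; the URL value is an illustration, not a documented default:

  {"action": "append",
   "key": "spec.template.spec.containers[0].env",
   "value": {"name": "ROUTER_METRICS_READY_HTTP_URL", "value": "http://localhost:1936/healthz"}}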

Comment 7 Scott Dodson 2017-06-30 00:40:50 UTC
This should be fixed in 3.6.128 of OCP.

Comment 9 Kenny Woodson 2017-06-30 13:49:51 UTC
I verified this was fixed in a recent build: atomic-openshift-master-3.6.129-1.git.0.f01d244.el7.x86_64

oc get pods
NAME                      READY     STATUS    RESTARTS   AGE
router-external-1-18njj   1/1       Running   0          40s
router-external-1-cd8b7   1/1       Running   0          40s
router-internal-1-1x72g   1/1       Running   0          23s
router-internal-1-z4sgz   1/1       Running   0          23s

oc adm router -o yaml --dry-run:
  spec:
    ports:
    - name: 80-tcp
      port: 80
      targetPort: 80
    - name: 443-tcp
      port: 443
      targetPort: 443
    - name: 1936-tcp
      port: 1936
      protocol: TCP
      targetPort: 1936
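
A quick check that the old 1935 router-stats host port no longer appears in a freshly generated router (the name router-test is arbitrary):

# oc adm router router-test -o yaml --dry-run | grep 1935

No output is expected; only the 1936 stats port remains, as shown above.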


Thanks Juan and Clayton for the work on this!

Comment 10 Johnny Liu 2017-07-03 10:32:47 UTC
Verified this bug with openshift v3.6.131, and it PASSED.

# oc get po  -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP              NODE
router-1-86kgd             1/1       Running   0          5m        172.18.3.255    ip-172-18-3-255.ec2.internal
router-1-nv1vf             1/1       Running   0          5m        172.18.22.240   ip-172-18-22-240.ec2.internal
router1-1-c8wfp            1/1       Running   0          5m        172.18.3.255    ip-172-18-3-255.ec2.internal


Port 1935 is completely removed now. Deployment of multiple router shards still follows the original approach.

openshift_hosted_routers=[{"name": "router", "replicas": 2, "serviceaccount": "router", "namespace": "default", "stats_port": 1936, "ports": ["80:80", "443:443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {"certfile": "{{ lookup(\"env\", \"WORKSPACE\") }}/private-openshift-misc/v3-launch-templates/functionality-testing/aos-36/extra-ansible/files/custom_router_1.crt", "keyfile": "{{ lookup(\"env\", \"WORKSPACE\") }}/private-openshift-misc/v3-launch-templates/functionality-testing/aos-36/extra-ansible/files/custom_router_1.key", "cafile": "{{ lookup(\"env\", \"WORKSPACE\") }}/private-openshift-misc/v3-launch-templates/functionality-testing/aos-36/extra-ansible/files/custom_router_1_rootca.crt"}, "selector": "role=node,router=enabled", "edits": [{"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTE_LABELS", "value": "route=external"}}]}, {"name": "router1", "replicas": 1, "serviceaccount": "router", "namespace": "default", "stats_port": 1937, "ports": ["7080:7080", "7443:7443"], "images": "registry.xxx.com/openshift3/ose-${component}:${version}", "certificate": {}, "selector": "role=node,router=enabled", "edits": [{"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "443"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTPS_PORT", "value": "7443"}}, {"action": "update", "curr_value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "80"}, "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_HTTP_PORT", "value": "7080"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "NAMESPACE_LABELS", "value": "n=install-test"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_SNI_PORT", "value": "10445"}}, {"action": "append", "key": "spec.template.spec.containers[0].env", "value": {"name": "ROUTER_SERVICE_NO_SNI_PORT", "value": "10442"}}]}]
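
On the deployed cluster, neither router's deployment config should reference the old port any more; a quick check (router and router1 are the names from the inventory above):

# oc get dc router router1 -o yaml | grep 1935

No output is expected.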

Comment 12 errata-xmlrpc 2017-08-10 05:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

