Bug 1464740 - Router can't bind to port 1936 in 3.6.124 - openshift-ansible and oc adm deploy of router fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.6.z
Assignee: Justin Pierce
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-06-25 09:50 UTC by Mike Fiedler
Modified: 2022-08-04 22:20 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-23 17:57:29 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift-ansible inventory (5.68 KB, text/plain)
2017-06-25 09:50 UTC, Mike Fiedler
Router node syslog with debug=5 (72.39 KB, application/x-gzip)
2017-06-26 01:45 UTC, Mike Fiedler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0113 0 normal SHIPPED_LIVE OpenShift Container Platform 3.7 and 3.6 bug fix and enhancement update 2018-01-23 22:55:59 UTC

Description Mike Fiedler 2017-06-25 09:50:13 UTC
Created attachment 1291671 [details]
openshift-ansible inventory

Description of problem:

Not sure if this should go to Routing or Install.   Starting with Routing.

Router pods are failing to deploy on installation with openshift-ansible in 3.6.124.   They are also failing the same way when created with oadm router.  Router pod logs show:

root@ip-172-31-5-243: ~/openshift-ansible # oc logs -f router-5-n6x13
I0625 09:37:40.597755       1 router.go:239] Router is including routes in all namespaces
E0625 09:37:40.669162       1 ratelimiter.go:52] error reloading router: exit status 1
[ALERT] 175/093740 (47) : Starting proxy stats: cannot bind socket [0.0.0.0:1936]

netstat -tunapl | grep 1936 on the router node:

root@ip-172-31-25-241: ~ # netstat -tunapl | grep 1936
root@ip-172-31-25-241: ~ # 


Ansible install log shows:

TASK [openshift_hosted : Poll for OpenShift router deployment success] *********
Sunday 25 June 2017  09:11:09 +0000 (0:00:01.739)       0:42:30.540 *********** 
failed: [ec2-54-187-91-73.us-west-2.compute.amazonaws.com] (item=[{u'name': u'router', u'certificate': {}, u'replicas': u'1', u'serviceaccount': u'router', u'namespace': u'default', u'stats_port': 1936, u'edits': [{u'action': u'put', u'key': u'spec.strategy.rollingParams.intervalSeconds', u'value': 1}, {u'action': u'put', u'key': u'spec.strategy.rollingParams.updatePeriodSeconds', u'value': 1}, {u'action': u'put', u'key': u'spec.strategy.activeDeadlineSeconds', u'value': 21600}], u'images': u'registry.ops.openshift.com/openshift3/ose-${component}:${version}', u'selector': u'region=infra,zone=default', u'ports': [u'80:80', u'443:443']}, {'_ansible_parsed': True, u'changed': True, u'stdout': u'3', '_ansible_no_log': False, 'stdout_lines': [u'3'], u'warnings': [], '_ansible_item_result': True, u'start': u'2017-06-25 05:11:08.495028', u'delta': u'0:00:00.300996', u'cmd': [u'oc', u'get', u'deploymentconfig', u'router', u'--namespace', u'default', u'--config', u'/etc/origin/master/admin.kubeconfig', u'-o', u'jsonpath={ .status.latestVersion }'], 'item': {u'name': u'router', u'certificate': {}, u'replicas': u'1', u'serviceaccount': u'router', u'namespace': u'default', u'selector': u'region=infra,zone=default', u'edits': [{u'action': u'put', u'value': 1, u'key': u'spec.strategy.rollingParams.intervalSeconds'}, {u'action': u'put', u'value': 1, u'key': u'spec.strategy.rollingParams.updatePeriodSeconds'}, {u'action': u'put', u'value': 21600, u'key': u'spec.strategy.activeDeadlineSeconds'}], u'images': u'registry.ops.openshift.com/openshift3/ose-${component}:${version}', u'stats_port': 1936, u'ports': [u'80:80', u'443:443']}, u'rc': 0, 'invocation': {'module_name': u'command', u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': False, u'_raw_params': u"oc get deploymentconfig router --namespace default --config /etc/origin/master/admin.kubeconfig -o jsonpath='{ .status.latestVersion }'", u'removes': None, u'creates': None, u'chdir': None}}, u'end': 
u'2017-06-25 05:11:08.796024', u'stderr': u''}]) => {"attempts": 1, "changed": true, "cmd": ["oc", "get", "replicationcontroller", "router-3", "--namespace", "default", "--config", "/etc/origin/master/admin.kubeconfig", "-o", "jsonpath={ .metadata.annotations.openshift\\.io/deployment\\.phase }"], "delta": "0:00:00.227702", "end": "2017-06-25 05:11:10.447300", "failed": true, "failed_when_result": true, "item": [{"certificate": {}, "edits": [{"action": "put", "key": "spec.strategy.rollingParams.intervalSeconds", "value": 1}, {"action": "put", "key": "spec.strategy.rollingParams.updatePeriodSeconds", "value": 1}, {"action": "put", "key": "spec.strategy.activeDeadlineSeconds", "value": 21600}], "images": "registry.ops.openshift.com/openshift3/ose-${component}:${version}", "name": "router", "namespace": "default", "ports": ["80:80", "443:443"], "replicas": "1", "selector": "region=infra,zone=default", "serviceaccount": "router", "stats_port": 1936}, {"_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": true, "cmd": ["oc", "get", "deploymentconfig", "router", "--namespace", "default", "--config", "/etc/origin/master/admin.kubeconfig", "-o", "jsonpath={ .status.latestVersion }"], "delta": "0:00:00.300996", "end": "2017-06-25 05:11:08.796024", "invocation": {"module_args": {"_raw_params": "oc get deploymentconfig router --namespace default --config /etc/origin/master/admin.kubeconfig -o jsonpath='{ .status.latestVersion }'", "_uses_shell": false, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true}, "module_name": "command"}, "item": {"certificate": {}, "edits": [{"action": "put", "key": "spec.strategy.rollingParams.intervalSeconds", "value": 1}, {"action": "put", "key": "spec.strategy.rollingParams.updatePeriodSeconds", "value": 1}, {"action": "put", "key": "spec.strategy.activeDeadlineSeconds", "value": 21600}], "images": "registry.ops.openshift.com/openshift3/ose-${component}:${version}", "name": 
"router", "namespace": "default", "ports": ["80:80", "443:443"], "replicas": "1", "selector": "region=infra,zone=default", "serviceaccount": "router", "stats_port": 1936}, "rc": 0, "start": "2017-06-25 05:11:08.495028", "stderr": "", "stdout": "3", "stdout_lines": ["3"], "warnings": []}], "rc": 0, "start": "2017-06-25 05:11:10.219598", "stderr": "", "stdout": "Failed", "stdout_lines": ["Failed"], "warnings": []}

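The failing task polls the deployment phase annotation on the router's replication controller. A minimal Python sketch of that check, reusing the oc arguments that appear in the log above. The function names, the injected run callable, and the retry parameters are hypothetical; this is not the openshift-ansible implementation.

```python
import subprocess
import time

# jsonpath expression copied from the "cmd" field in the Ansible log.
PHASE_JSONPATH = r"jsonpath={ .metadata.annotations.openshift\.io/deployment\.phase }"


def phase_cmd(rc_name, namespace="default",
              kubeconfig="/etc/origin/master/admin.kubeconfig"):
    """Build the `oc get replicationcontroller` command seen in the log."""
    return ["oc", "get", "replicationcontroller", rc_name,
            "--namespace", namespace, "--config", kubeconfig,
            "-o", PHASE_JSONPATH]


def run_oc(cmd):
    """Run the command and return its stripped stdout (needs `oc` on PATH)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def wait_for_deployment(rc_name, run=run_oc, retries=60, delay=10):
    """Poll until the phase is Complete or Failed; return the last phase seen.

    `retries` and `delay` are illustrative defaults, not values from the report.
    """
    phase = ""
    for _ in range(retries):
        phase = run(phase_cmd(rc_name))
        if phase in ("Complete", "Failed"):
            break
        time.sleep(delay)
    return phase
```

In the log above, the command returned Failed for router-3, which is what caused the task to fail.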

Version-Release number of selected component (if applicable):  3.6.124 with HEAD of openshift-ansible master (commit 88690667342bca0e7df75bc90bb1846b63d6d78a)

 

How reproducible:  Always - I've had 3 installs fail like this.   Going to dig deeper into the current one to see if it can be rescued


Steps to Reproduce:
1.  Install a cluster (1 master, 1 infra, 3 nodes) on AWS with openshift-ansible (inventory below)
2.  Alternatively, create a router with oadm router --images registry.ops.openshift.com/openshift3/ose-haproxy-router:v3.6.123


Actual results:

Router fails to start with error in log above

Expected results:

Router starts

Additional info:

Inventory attached

Comment 1 Mike Fiedler 2017-06-25 12:02:10 UTC
While the router is trying to start, the infra node does show activity on 1936:

root@ip-172-31-25-241: ~ # while true; do netstat -tunapl | grep 1936; sleep 2; done                     
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38630               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38628               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38630               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38628               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38630               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38628               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38630               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38628               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -                   
tcp6       0      0 :::1936                 :::*                    LISTEN      92206/openshift-rou 
tcp6       0      0 ::1:1936                ::1:38630               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38604               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38628               TIME_WAIT   -                   
tcp6       0      0 ::1:1936                ::1:38606               TIME_WAIT   -
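
The listing shows an openshift-router process (PID 92206) already listening on port 1936, while the HAProxy alert reports that it cannot bind 0.0.0.0:1936. A small Python sketch of that general failure mode, a second bind to a port that already has a listener. It uses the IPv4 loopback address and an OS-assigned port rather than 1936, the function names are illustrative, and it is not a statement of this bug's root cause.

```python
import errno
import socket


def try_bind(host, port):
    """Try to bind a TCP socket; return None on success or the errno on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as exc:
            return exc.errno
    return None


def demo():
    """Bind a port twice and return the result of the second attempt."""
    # The first socket stands in for the process that already holds the port.
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    holder.listen(1)
    port = holder.getsockname()[1]
    try:
        # The second bind plays the role of the process that fails to start.
        return try_bind("127.0.0.1", port)
    finally:
        holder.close()


if __name__ == "__main__":
    print(demo() == errno.EADDRINUSE)
```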

Comment 2 Mike Fiedler 2017-06-26 01:25:31 UTC
Unless we discover a workaround, this blocks any testing requiring routes.   Marking TestBlocker

Comment 3 Mike Fiedler 2017-06-26 01:45:38 UTC
Created attachment 1291804 [details]
Router node syslog with debug=5

Comment 4 Mike Fiedler 2017-06-26 01:47:32 UTC
Correction, oadm command in description should be:

oadm router --images registry.ops.openshift.com/openshift3/ose-haproxy-router:v3.6.124  (not 123)

Node syslogs with debug=5 attached.  Only entries in router pod logs:

root@ip-172-31-2-110: ~ # oc logs -f router-4-9vzt5
I0626 01:45:14.155571       1 router.go:239] Router is including routes in all namespaces
E0626 01:45:14.227408       1 ratelimiter.go:52] error reloading router: exit status 1
[ALERT] 176/014514 (48) : Starting proxy stats: cannot bind socket [0.0.0.0:1936]

Pod status from the adjacent terminal pane:

router-4-9vzt5   0/1       CrashLoopBackOff   6         3m
router-4-9vzt5   0/1       Running   7         4m
router-4-9vzt5   0/1       Running   8         5m
router-4-9vzt5

Comment 6 Mike Fiedler 2017-06-26 06:46:43 UTC
Removing TestBlocker. Editing the router DC and changing the port back to 1935 lets the router start.

Comment 8 zhaozhanqi 2017-06-26 08:23:32 UTC


From this commit: https://github.com/openshift/ose/commit/ff44d2379747677c50f8563c39a3ba4f2acc991f

I guess line 685 should use cfg.StatsPort-1 instead of cfg.StatsPort:

 685 + env["ROUTER_LISTEN_ADDR"] = fmt.Sprintf("0.0.0.0:%d", cfg.StatsPort)
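
To make the guess concrete, here is a short Python sketch of the address arithmetic. It assumes HAProxy's stats listener keeps using the stats port (1936 in the inventory and in the alert). The helper name listen_addr is hypothetical, and the sketch only illustrates the suggestion in this comment, not a confirmed diagnosis.

```python
def listen_addr(port, host="0.0.0.0"):
    """Format host:port like the quoted fmt.Sprintf("0.0.0.0:%d", ...) call."""
    return "%s:%d" % (host, port)


stats_port = 1936                         # stats_port value from the Ansible log
haproxy_stats = listen_addr(stats_port)   # address HAProxy reports it cannot bind

as_committed = listen_addr(stats_port)      # cfg.StatsPort, as in the commit
as_suggested = listen_addr(stats_port - 1)  # cfg.StatsPort-1, the guess above

print(as_committed == haproxy_stats)  # True: both would want 0.0.0.0:1936
print(as_suggested == haproxy_stats)  # False: 0.0.0.0:1935 does not collide
```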

Comment 9 Clayton Coleman 2017-06-26 16:48:03 UTC
This may just be a change to those environments that have the old env var to remove it.

Comment 10 Clayton Coleman 2017-06-26 17:33:03 UTC
There isn't a custom router template in play here, is there?

Comment 11 Clayton Coleman 2017-06-26 18:34:19 UTC
When testing this locally, registry.ops.openshift.com/openshift3/ose-haproxy-router:v3.6.124 does not have the expected matching haproxy-config.template version.

Comment 12 Clayton Coleman 2017-06-26 19:02:34 UTC
Something is wrong with our ose -> distgit process - this change was not in the dist-git repo in 124.  Assigning to smunilla

Comment 13 Justin Pierce 2017-06-26 20:51:21 UTC
When ose was in stagecut, the master branch was still being used to synchronize with dist-git (when it should have been synchronizing with the stage branch). Changes to image content that were in ose/stage but not ose/master were therefore being missed. 
Testing a fix.

Comment 14 Mike Fiedler 2017-06-26 21:47:30 UTC
This comment may be late (currently I'm TZ shifted).   In response to comment 9 and comment 10:  These are fresh environments and no custom router templates are involved.

Comment 15 Justin Pierce 2017-06-27 09:32:06 UTC
The haproxy conf file issue has been addressed in OCP build 3.6.126.

Comment 16 Mike Fiedler 2017-06-27 09:44:09 UTC
Verified on 3.6.126

Comment 19 errata-xmlrpc 2018-01-23 17:57:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0113

