Created attachment 1291671 [details]
openshift-ansible inventory

Description of problem:

Not sure if this should go to Routing or Install. Starting with Routing.

Router pods are failing to deploy on installation with openshift-ansible in 3.6.124. They are also failing the same way when created with oadm router.

Router pod logs show:

root@ip-172-31-5-243: ~/openshift-ansible # oc logs -f router-5-n6x13
I0625 09:37:40.597755 1 router.go:239] Router is including routes in all namespaces
E0625 09:37:40.669162 1 ratelimiter.go:52] error reloading router: exit status 1
[ALERT] 175/093740 (47) : Starting proxy stats: cannot bind socket [0.0.0.0:1936]

netstat -tunapl | grep 1936 on the router node:

root@ip-172-31-25-241: ~ # netstat -tunapl | grep 1936
root@ip-172-31-25-241: ~ #

Ansible install log shows:

TASK [openshift_hosted : Poll for OpenShift router deployment success] *********
Sunday 25 June 2017 09:11:09 +0000 (0:00:01.739) 0:42:30.540 ***********
failed: [ec2-54-187-91-73.us-west-2.compute.amazonaws.com] (item=[{u'name': u'router', u'certificate': {}, u'replicas': u'1', u'serviceaccount': u'router', u'namespace': u'default', u'stats_port': 1936, u'edits': [{u'action': u'put', u'key': u'spec.strategy.rollingParams.intervalSeconds', u'value': 1}, {u'action': u'put', u'key': u'spec.strategy.rollingParams.updatePeriodSeconds', u'value': 1}, {u'action': u'put', u'key': u'spec.strategy.activeDeadlineSeconds', u'value': 21600}], u'images': u'registry.ops.openshift.com/openshift3/ose-${component}:${version}', u'selector': u'region=infra,zone=default', u'ports': [u'80:80', u'443:443']}, {'_ansible_parsed': True, u'changed': True, u'stdout': u'3', '_ansible_no_log': False, 'stdout_lines': [u'3'], u'warnings': [], '_ansible_item_result': True, u'start': u'2017-06-25 05:11:08.495028', u'delta': u'0:00:00.300996', u'cmd': [u'oc', u'get', u'deploymentconfig', u'router', u'--namespace', u'default', u'--config', u'/etc/origin/master/admin.kubeconfig', u'-o', u'jsonpath={ 
.status.latestVersion }'], 'item': {u'name': u'router', u'certificate': {}, u'replicas': u'1', u'serviceaccount': u'router', u'namespace': u'default', u'selector': u'region=infra,zone=default', u'edits': [{u'action': u'put', u'value': 1, u'key': u'spec.strategy.rollingParams.intervalSeconds'}, {u'action': u'put', u'value': 1, u'key': u'spec.strategy.rollingParams.updatePeriodSeconds'}, {u'action': u'put', u'value': 21600, u'key': u'spec.strategy.activeDeadlineSeconds'}], u'images': u'registry.ops.openshift.com/openshift3/ose-${component}:${version}', u'stats_port': 1936, u'ports': [u'80:80', u'443:443']}, u'rc': 0, 'invocation': {'module_name': u'command', u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': False, u'_raw_params': u"oc get deploymentconfig router --namespace default --config /etc/origin/master/admin.kubeconfig -o jsonpath='{ .status.latestVersion }'", u'removes': None, u'creates': None, u'chdir': None}}, u'end': u'2017-06-25 05:11:08.796024', u'stderr': u''}]) => {"attempts": 1, "changed": true, "cmd": ["oc", "get", "replicationcontroller", "router-3", "--namespace", "default", "--config", "/etc/origin/master/admin.kubeconfig", "-o", "jsonpath={ .metadata.annotations.openshift\\.io/deployment\\.phase }"], "delta": "0:00:00.227702", "end": "2017-06-25 05:11:10.447300", "failed": true, "failed_when_result": true, "item": [{"certificate": {}, "edits": [{"action": "put", "key": "spec.strategy.rollingParams.intervalSeconds", "value": 1}, {"action": "put", "key": "spec.strategy.rollingParams.updatePeriodSeconds", "value": 1}, {"action": "put", "key": "spec.strategy.activeDeadlineSeconds", "value": 21600}], "images": "registry.ops.openshift.com/openshift3/ose-${component}:${version}", "name": "router", "namespace": "default", "ports": ["80:80", "443:443"], "replicas": "1", "selector": "region=infra,zone=default", "serviceaccount": "router", "stats_port": 1936}, {"_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": 
true, "changed": true, "cmd": ["oc", "get", "deploymentconfig", "router", "--namespace", "default", "--config", "/etc/origin/master/admin.kubeconfig", "-o", "jsonpath={ .status.latestVersion }"], "delta": "0:00:00.300996", "end": "2017-06-25 05:11:08.796024", "invocation": {"module_args": {"_raw_params": "oc get deploymentconfig router --namespace default --config /etc/origin/master/admin.kubeconfig -o jsonpath='{ .status.latestVersion }'", "_uses_shell": false, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true}, "module_name": "command"}, "item": {"certificate": {}, "edits": [{"action": "put", "key": "spec.strategy.rollingParams.intervalSeconds", "value": 1}, {"action": "put", "key": "spec.strategy.rollingParams.updatePeriodSeconds", "value": 1}, {"action": "put", "key": "spec.strategy.activeDeadlineSeconds", "value": 21600}], "images": "registry.ops.openshift.com/openshift3/ose-${component}:${version}", "name": "router", "namespace": "default", "ports": ["80:80", "443:443"], "replicas": "1", "selector": "region=infra,zone=default", "serviceaccount": "router", "stats_port": 1936}, "rc": 0, "start": "2017-06-25 05:11:08.495028", "stderr": "", "stdout": "3", "stdout_lines": ["3"], "warnings": []}], "rc": 0, "start": "2017-06-25 05:11:10.219598", "stderr": "", "stdout": "Failed", "stdout_lines": ["Failed"], "warnings": []}

Version-Release number of selected component (if applicable):
3.6.124 with HEAD of openshift-ansible master (commit 88690667342bca0e7df75bc90bb1846b63d6d78a)

How reproducible:
Always - I've had 3 installs fail like this. Going to dig deeper into the current one to see if it can be rescued.

Steps to Reproduce:
1. Install a cluster (1 master, 1 infra, 3 nodes) on AWS with openshift-ansible (inventory below)
2. 
Alternatively, create a router with:
oadm router --images registry.ops.openshift.com/openshift3/ose-haproxy-router:v3.6.123

Actual results:
Router fails to start with the error in the log above.

Expected results:
Router starts.

Additional info:
Inventory attached.
While the router is trying to start, the infra node does show activity on 1936:

root@ip-172-31-25-241: ~ # while true; do netstat -tunapl | grep 1936; sleep 2; done
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38630 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38628 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38630 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38628 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38630 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38628 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38630 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38628 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
tcp6 0 0 :::1936 :::* LISTEN 92206/openshift-rou
tcp6 0 0 ::1:1936 ::1:38630 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38604 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38628 TIME_WAIT -
tcp6 0 0 ::1:1936 ::1:38606 TIME_WAIT -
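The netstat output shows the openshift-router process itself holding a listener on 1936, which is consistent with the haproxy "cannot bind socket [0.0.0.0:1936]" alert: a second process cannot listen on a port that already has a listener. A minimal Go sketch of that failure mode follows. It is an illustration only, not router code: the function name secondBindFails is hypothetical, and it uses a free loopback port rather than 1936 so it can run anywhere.

```go
package main

import (
	"fmt"
	"net"
)

// secondBindFails opens a TCP listener on a free loopback port and then tries
// to open a second listener on the same address while the first is still open.
// It reports whether the second attempt failed. The returned error is non-nil
// only if the first listener could not be created at all.
func secondBindFails() (bool, error) {
	first, err := net.Listen("tcp", "127.0.0.1:0") // port 0: let the OS pick a free port
	if err != nil {
		return false, fmt.Errorf("first listen failed: %w", err)
	}
	defer first.Close()

	second, err := net.Listen("tcp", first.Addr().String())
	if err != nil {
		// Expected path: the address is already in use by the first listener.
		return true, nil
	}
	second.Close()
	return false, nil
}

func main() {
	failed, err := secondBindFails()
	if err != nil {
		fmt.Println("could not run the check:", err)
		return
	}
	fmt.Println("second bind failed:", failed)
}
```

In the router pod the two contenders would be the openshift-router process and the haproxy stats listener, both configured for 1936.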
Unless we discover a workaround, this blocks any testing requiring routes. Marking TestBlocker.
Created attachment 1291804 [details] Router node syslog with debug=5
Correction, the oadm command in the description should be:
oadm router --images registry.ops.openshift.com/openshift3/ose-haproxy-router:v3.6.124 (not 123)

Node syslogs with debug=5 attached.

Only entries in router pod logs:

root@ip-172-31-2-110: ~ # oc logs -f router-4-9vzt5
I0626 01:45:14.155571 1 router.go:239] Router is including routes in all namespaces
E0626 01:45:14.227408 1 ratelimiter.go:52] error reloading router: exit status 1
[ALERT] 176/014514 (48) : Starting proxy stats: cannot bind socket [0.0.0.0:1936]

Pod status lines from the same terminal capture:

router-4-9vzt5 0/1 CrashLoopBackOff 6 3m
router-4-9vzt5 0/1 Running 7 4m
router-4-9vzt5 0/1 Running 8 5m
Possibly due to https://github.com/openshift/ose/commit/ff44d2379747677c50f8563c39a3ba4f2acc991f
Removing TestBlocker. Editing the router DC and changing the port back to 1935 lets the router start.
From this commit: https://github.com/openshift/ose/commit/ff44d2379747677c50f8563c39a3ba4f2acc991f

I guess line 685 should use cfg.StatsPort-1 instead of cfg.StatsPort:

685 + env["ROUTER_LISTEN_ADDR"] = fmt.Sprintf("0.0.0.0:%d", cfg.StatsPort)
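To make the suggestion concrete, here is a small Go sketch comparing the address produced by the committed line with the one produced by the proposed StatsPort-1 variant. Only the fmt.Sprintf format string and the cfg.StatsPort field name come from the commit line quoted above; the routerConfig type and both function names are hypothetical stand-ins, not the actual ose code.

```go
package main

import "fmt"

// routerConfig is a hypothetical stand-in for the router options struct;
// only the StatsPort field name is taken from the quoted commit line.
type routerConfig struct {
	StatsPort int
}

// listenAddrAsCommitted mirrors the quoted line: the listen address reuses
// the stats port itself.
func listenAddrAsCommitted(cfg routerConfig) string {
	return fmt.Sprintf("0.0.0.0:%d", cfg.StatsPort)
}

// listenAddrAsSuggested is the variant guessed at in this comment: listen one
// port below the stats port.
func listenAddrAsSuggested(cfg routerConfig) string {
	return fmt.Sprintf("0.0.0.0:%d", cfg.StatsPort-1)
}

func main() {
	cfg := routerConfig{StatsPort: 1936} // stats_port value from the install log
	fmt.Println("committed:", listenAddrAsCommitted(cfg))
	fmt.Println("suggested:", listenAddrAsSuggested(cfg))
}
```

With stats_port 1936 the suggested variant yields 0.0.0.0:1935, the same port that let the router start when the DC was edited by hand.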
The fix may just be to remove the old env var from the environments that still have it.
There isn't a custom router template in play here, is there?
When testing this locally, registry.ops.openshift.com/openshift3/ose-haproxy-router:v3.6.124 does not have the expected matching haproxy-config.template version.
Something is wrong with our ose -> distgit process - this change was not in the dist-git repo in 124. Assigning to smunilla
When ose was in stagecut, the master branch was still being used to synchronize with dist-git (when it should have been synchronizing with the stage branch). Changes to image content that were in ose/stage but not ose/master were therefore being missed. Testing a fix.
This comment may be late (currently I'm TZ shifted). In response to comment 9 and comment 10: These are fresh environments and no custom router templates are involved.
The haproxy conf file issue has been addressed in OCP build 3.6.126.
Verified on 3.6.126
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0113