Description of problem:
While troubleshooting an overcloud compute scale-out, we noticed that all resources for the existing compute node remained in PROGRESS and later failed. We confirmed that the os-collect-config process is running but is not polling Director. I suspect that os-collect-config reaches a state where it stops polling after running for a long time.

Version-Release number of selected component (if applicable):
os-collect-config-0.1.35-2.el7ost.noarch

How reproducible:
No.

Steps to Reproduce:
1.
2.
3.

Actual results:
The last os-collect-config poll was seen on Dec 17. The overcloud scale-out fails because the compute node does not run its resources once os-collect-config has stopped polling.

Expected results:
os-collect-config keeps polling.

Additional info:
*** Bug 1306139 has been marked as a duplicate of this bug. ***
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
There are reports of os-collect-config stopping polling, and other reports of the os-collect-config process disappearing. Until we know more I propose that this bug be used to track both situations. For disappearing process, we could easily improve the situation by specifying Restart in the systemd unit file. For stopping polling, currently it is not possible to set a kill timeout for os-refresh-config so there is nothing stopping a misbehaving script from running indefinitely. The kill timeout value should be passed to os-refresh-config and left to os-refresh-config to exit itself. If it turns out that os-collect-config stops polling for reasons other than os-refresh-config, a final recommended change could be to have os-collect-config send watchdog pings to systemd and set WatchdogSec in the systemd unit.
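For illustration only, a minimal Python sketch of the kill-timeout idea described above. It is not the actual os-refresh-config change: the helper name run_script and the timeout values are hypothetical. It relies on subprocess.run(), which kills the child and raises TimeoutExpired when the timeout elapses.

```python
import subprocess
import sys


def run_script(cmd, timeout):
    """Run cmd; return its exit code, or None if it was killed on timeout.

    Hypothetical helper for illustration, not part of os-refresh-config.
    """
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return None


if __name__ == "__main__":
    # A script that would otherwise block for 30 seconds is cut off early.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_script(hung, timeout=1))
```

Note that this only kills the direct child. A script that spawns its own long-running children would need process-group handling, which this sketch leaves out.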
The os-collect-config service file already has Restart=on-failure, and to be fair most reports have been about os-collect-config running but stalled. An upstream bug has been raised for an os-refresh-config --timeout feature.
The upstream changes for this are now ready for review. https://review.openstack.org/#/q/topic:bug/1595722
The above os-refresh-config timeout changes are worthwhile, but I no longer believe that os-refresh-config is the prime cause of this issue. A comment in bug #1353716 shows a nova metadata server becoming temporarily unavailable, and the ec2 collectors failing as a result: http://pastebin.test.redhat.com/390534 The ec2, cfn and request collectors issue a Python requests get() call without any timeout argument. This implies a timeout of None, which will wait *forever* if necessary. This could wedge os-collect-config if a collector source is unavailable for some reason (network brownout, service down) during a collection attempt.
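To make the failure mode concrete, here is a small self-contained sketch. It is not the os-collect-config collector code, and it uses the standard library's urllib instead of requests so it runs without extra packages. With requests, the equivalent is passing a timeout= argument to requests.get(). The names start_silent_server and collect are made up for this example. The server accepts connections but never answers, which is roughly what a browned-out source looks like to the client.

```python
import socket
import threading
import urllib.error
import urllib.request


def start_silent_server():
    """Listen on a free local port, accept connections, never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    held = []  # keep accepted sockets open so the client is left waiting

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]


def collect(url, timeout):
    """Return the response body, or None if the source is unavailable."""
    # Ignore any proxy settings from the environment for this local test.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        # With a timeout the stalled request gives up and we can retry
        # on the next poll; without one this call would block forever.
        return None


if __name__ == "__main__":
    srv, port = start_silent_server()
    print(collect("http://127.0.0.1:%d/" % port, timeout=2))
    srv.close()
```

Without the timeout argument the same call would sit in the read indefinitely, which matches the "running but stalled" reports.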
Upstream bug has a fix associated with it.
Any connectivity issue from the overcloud to the undercloud could cause os-collect-config to stall indefinitely on HTTP requests. The workaround when this occurs is to restart os-collect-config on all nodes. A permanent fix would be to backport https://review.openstack.org/#/c/340179/ and upgrade all nodes to the latest os-collect-config package.
Can we get a backport of this patch as it's affecting many customers? My current customer is using RHOSP 8.0 so we'd need to backport that patch to RHOSP 8.0 as well.
I think a backport is desirable but it's not my call. Setting needinfo to mburns.
Can we update which version of os-collect-config this is fixed in and move it to modified? Backports of this bug up to OSP7 will be approved - clones need to be created. Steve, will you be able to handle the backports too, please?
Hi Steve,

I tried blocking one overcloud node's access to the undercloud and watched the os-collect-config journal for changes. Below are my results. Can we consider these enough to verify this bug? Thank you.

journalctl -fl -u os-collect-config

=========== Block overcloud node connectivity to the undercloud ===========
Nov 24 10:11:12 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/ (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x34fc510>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))
Nov 24 10:11:12 overcloud-controller-0.localdomain os-collect-config[4021]: Source [ec2] Unavailable.
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='192.168.0.1', port=8080): Max retries exceeded with url: /v1/AUTH_ce297a53c20f4c46ba71c87eb8fe01b0/ov-6aavlgpnccx-0-cyszwjfzt6ki-Controller-kiih4dysq5v4/302be806-b4d3-4536-82b7-03fff809e08d?temp_url_sig=5575407b8acbde2c7fdc44db52f63b774abd7870&temp_url_expires=2147483586 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x3615090>, 'Connection to 192.168.0.1 timed out. (connect timeout=10.0)'))
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: Source [request] Unavailable.
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])
Nov 24 10:12:02 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/ (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x3615690>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))
Nov 24 10:12:02 overcloud-controller-0.localdomain os-collect-config[4021]: Source [ec2] Unavailable.
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='192.168.0.1', port=8080): Max retries exceeded with url: /v1/AUTH_ce297a53c20f4c46ba71c87eb8fe01b0/ov-6aavlgpnccx-0-cyszwjfzt6ki-Controller-kiih4dysq5v4/302be806-b4d3-4536-82b7-03fff809e08d?temp_url_sig=5575407b8acbde2c7fdc44db52f63b774abd7870&temp_url_expires=2147483586 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x36151d0>, 'Connection to 192.168.0.1 timed out. (connect timeout=10.0)'))
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: Source [request] Unavailable.
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])

=========== Resume overcloud node connectivity to the undercloud ===========
Nov 24 10:12:43 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:12:43 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])
Nov 24 10:13:13 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:13:13 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])
Yes, this looks like the fix is working as expected.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html