Bug 1306140

Summary: os-collect-config stops polling after a long time
Product: Red Hat OpenStack
Reporter: Jaison Raju <jraju>
Component: os-collect-config
Assignee: Steve Baker <sbaker>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: medium
Priority: high
Version: 8.0 (Liberty)
CC: apevec, dhill, ipetrova, jcoufal, jraju, jschluet, lhh, mburns, mcornea, mlopes, nlevinki, ohochman, pneedle, rhel-osp-director-maint, romano.silva, sbaker, scorcora, srevivo
Target Milestone: rc
Keywords: Triaged
Target Release: 10.0 (Newton)
Hardware: All
OS: Linux
Fixed In Version: os-collect-config-5.0.0-1.el7ost
Doc Type: Bug Fix
Doc Text:
Prior to this update, HTTP requests made by `os-collect-config` to fetch configuration did not specify a request timeout. Consequently, polling for data while the undercloud was inaccessible (for example, during an undercloud reboot or network connectivity issues) resulted in `os-collect-config` stalling, performing no polling or configuration. This often only became apparent when an overcloud stack operation was performed and software configuration operations timed out. With this update, `os-collect-config` HTTP requests now always specify a timeout period. As a result, polling for data will fail when the undercloud is unavailable, and then resume when it is available again.
Last Closed: 2016-12-14 15:22:42 UTC
Type: Bug

Description Jaison Raju 2016-02-10 06:58:11 UTC
Description of problem:
While troubleshooting an overcloud compute scale-out, we noticed that
all resources for the existing compute node remained in PROGRESS and later failed.
We confirmed that the os-collect-config process is running but is not polling
the Director.
I suspect that os-collect-config reaches a state where it stops polling after running for a long time.

Version-Release number of selected component (if applicable):
os-collect-config-0.1.35-2.el7ost.noarch

How reproducible:
No.

Actual results:
The last os-collect-config poll was noticed on Dec 17.
The overcloud scale-out fails because the compute node does not run the resources,
as os-collect-config polling has stopped.

Expected results:
os-collect-config keeps polling.

Comment 3 Mike Burns 2016-02-10 12:52:21 UTC
*** Bug 1306139 has been marked as a duplicate of this bug. ***

Comment 4 Mike Burns 2016-04-07 21:07:13 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 6 Steve Baker 2016-06-20 20:49:29 UTC
There are reports of os-collect-config stopping polling, and other reports of the os-collect-config process disappearing. Until we know more I propose that this bug be used to track both situations.

For the disappearing process, we could easily improve the situation by specifying Restart in the systemd unit file.

For the stalled polling, it is currently not possible to set a kill timeout for os-refresh-config, so nothing stops a misbehaving script from running indefinitely. The kill timeout value should be passed to os-refresh-config, which would then be responsible for exiting itself.
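
As a rough illustration of the kill timeout idea (not the upstream os-refresh-config change), a runner could bound a script's runtime as sketched below. The helper name run_with_timeout is hypothetical, and the sketch only covers the direct child process.

```python
import subprocess
import sys


def run_with_timeout(command, timeout):
    """Run command and return its exit code, or None if it ran too long.

    Hypothetical helper for illustration only. Note that only the direct
    child is killed; processes it spawned itself may survive.
    """
    try:
        completed = subprocess.run(command, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before
        # re-raising TimeoutExpired, so no extra cleanup is needed here.
        return None
    return completed.returncode


if __name__ == "__main__":
    # A script that would otherwise block for 30 seconds is cut off after 1.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, timeout=1))
```

With a bound like this, one misbehaving script costs at most the timeout, and the next polling cycle can still run.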

If it turns out that os-collect-config stops polling for reasons other than os-refresh-config, a final recommended change could be to have os-collect-config send watchdog pings to systemd and set WatchdogSec in the systemd unit.
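
A minimal sketch of what such a watchdog ping could look like, assuming the unit sets WatchdogSec and systemd exports NOTIFY_SOCKET to the service. The function names sd_notify and ping_watchdog are hypothetical here and are not taken from os-collect-config.

```python
import os
import socket


def sd_notify(message, address=None):
    """Send one notification datagram to systemd.

    Returns False when no notification socket is configured, which is the
    case when the process is not started by systemd.
    """
    address = address or os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket, which is
        # addressed with a leading NUL byte instead.
        address = b"\0" + address[1:].encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("ascii"), address)
    return True


def ping_watchdog():
    """Tell systemd the service is still alive."""
    return sd_notify("WATCHDOG=1")
```

The ping would need to be sent from the polling loop itself, once per completed cycle, so that a stalled loop stops pinging and systemd can restart the service.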

Comment 13 Steve Baker 2016-06-23 21:34:39 UTC
The os-collect-config service file already has Restart=on-failure, and to be fair most reports have been about os-collect-config running but stalled.

An upstream bug has been raised for an os-refresh-config --timeout feature.

Comment 15 Steve Baker 2016-07-04 23:59:38 UTC
The upstream changes for this are now ready for review.

https://review.openstack.org/#/q/topic:bug/1595722

Comment 16 Steve Baker 2016-07-10 22:51:29 UTC
The above os-refresh-config timeout changes are worth it, but I no longer believe that os-refresh-config is the prime cause of this issue.

A comment in bug #1353716 shows a nova metadata server becoming temporarily unavailable, and the ec2 collectors failing as a result:

  http://pastebin.test.redhat.com/390534

The ec2, cfn and request collectors call Python's requests.get() without any timeout argument. This implies a timeout of None, which will wait *forever* if necessary.

This could wedge os-collect-config if a collector source is unavailable for some reason (network brownout, service down) during a collection attempt.
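
The effect of an explicit timeout can be sketched as follows. With the requests library the change amounts to passing timeout= to requests.get(); the sketch below uses the standard library's urllib.request instead so that it runs without extra packages. The function name fetch_metadata and the 10 second default are assumptions for illustration, not the collector code.

```python
import urllib.request


def fetch_metadata(url, timeout=10.0):
    """Return the response body, or None if the source is unavailable.

    Hypothetical helper. With the requests library the equivalent call
    would be requests.get(url, timeout=timeout).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError
        # subclasses, so refused connections, unreachable hosts and stalled
        # responses all end up here instead of blocking forever.
        print("Source unavailable: %s" % exc)
        return None
```

Because the call now returns within a bounded time, the caller can log the failure, skip that source, and try again on the next polling cycle.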

Comment 17 Steve Baker 2016-07-10 23:17:43 UTC
Upstream bug has a fix associated with it.

Comment 19 Steve Baker 2016-09-05 22:57:19 UTC
Any connectivity issues from the overcloud to the undercloud could cause os-collect-config to stall indefinitely on http requests.

The workaround for when this occurs is to restart os-collect-config on all nodes. A permanent fix would be to backport https://review.openstack.org/#/c/340179/ and upgrade all nodes to the latest os-collect-config package.

Comment 20 David Hill 2016-09-05 23:03:55 UTC
Can we get a backport of this patch as it's affecting many customers?  My current customer is using RHOSP 8.0 so we'd need to backport that patch to RHOSP 8.0 as well.

Comment 21 Steve Baker 2016-09-06 01:48:53 UTC
I think a backport is desirable but it's not my call. Setting needinfo to mburns.

Comment 25 Jaromir Coufal 2016-10-05 18:15:31 UTC
Can we update which version of os-collect-config this is fixed in and move it to modified?

Backports of this bug up to OSP7 will be approved - clones need to be created.

Steve, will you be able to handle the backports too, please?

Comment 28 Marius Cornea 2016-11-24 10:28:04 UTC
Hi Steve,

I tried blocking one overcloud node's access to the undercloud and watched the os-collect-config journal for changes. Below are my results. Can we consider these enough to verify this bug? Thank you.

journalctl -fl -u os-collect-config

=========== Block overcloud node connectivity to the undercloud ===========

Nov 24 10:11:12 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/ (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x34fc510>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))
Nov 24 10:11:12 overcloud-controller-0.localdomain os-collect-config[4021]: Source [ec2] Unavailable.
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='192.168.0.1', port=8080): Max retries exceeded with url: /v1/AUTH_ce297a53c20f4c46ba71c87eb8fe01b0/ov-6aavlgpnccx-0-cyszwjfzt6ki-Controller-kiih4dysq5v4/302be806-b4d3-4536-82b7-03fff809e08d?temp_url_sig=5575407b8acbde2c7fdc44db52f63b774abd7870&temp_url_expires=2147483586 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x3615090>, 'Connection to 192.168.0.1 timed out. (connect timeout=10.0)'))
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: Source [request] Unavailable.
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])

Nov 24 10:12:02 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/ (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x3615690>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))
Nov 24 10:12:02 overcloud-controller-0.localdomain os-collect-config[4021]: Source [ec2] Unavailable.
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='192.168.0.1', port=8080): Max retries exceeded with url: /v1/AUTH_ce297a53c20f4c46ba71c87eb8fe01b0/ov-6aavlgpnccx-0-cyszwjfzt6ki-Controller-kiih4dysq5v4/302be806-b4d3-4536-82b7-03fff809e08d?temp_url_sig=5575407b8acbde2c7fdc44db52f63b774abd7870&temp_url_expires=2147483586 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x36151d0>, 'Connection to 192.168.0.1 timed out. (connect timeout=10.0)'))
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: Source [request] Unavailable.
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])

=========== Resume overcloud node connectivity to the undercloud ===========

Nov 24 10:12:43 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:12:43 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])
Nov 24 10:13:13 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:13:13 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])

Comment 29 Steve Baker 2016-11-24 19:55:11 UTC
Yes, this looks like the fix is working as expected.

Comment 31 errata-xmlrpc 2016-12-14 15:22:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html