Bug 1306140
Summary: | os-collect-config stop polling after a long time | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Jaison Raju <jraju> |
Component: | os-collect-config | Assignee: | Steve Baker <sbaker> |
Status: | CLOSED ERRATA | QA Contact: | Marius Cornea <mcornea> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 8.0 (Liberty) | CC: | apevec, dhill, ipetrova, jcoufal, jraju, jschluet, lhh, mburns, mcornea, mlopes, nlevinki, ohochman, pneedle, rhel-osp-director-maint, romano.silva, sbaker, scorcora, srevivo |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | 10.0 (Newton) | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | os-collect-config-5.0.0-1.el7ost | Doc Type: | Bug Fix |
Doc Text: |
Prior to this update, HTTP requests made by `os-collect-config` for configuration data did not specify a request timeout. Consequently, polling for data while the undercloud was inaccessible (for example, during an undercloud reboot or network connectivity issues) resulted in `os-collect-config` stalling, performing no polling or configuration. This often only became apparent when an overcloud stack operation was performed and software configuration operations timed out.
With this update, `os-collect-config` HTTP requests now always specify a timeout period.
As a result, polling for data will fail when the undercloud is unavailable, and then resume when it is available again.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2016-12-14 15:22:42 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Jaison Raju
2016-02-10 06:58:11 UTC
*** Bug 1306139 has been marked as a duplicate of this bug. ***

This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.

There are reports of os-collect-config stopping polling, and other reports of the os-collect-config process disappearing. Until we know more I propose that this bug be used to track both situations. For the disappearing process, we could easily improve the situation by specifying Restart in the systemd unit file. For stopping polling, currently it is not possible to set a kill timeout for os-refresh-config so there is nothing stopping a misbehaving script from running indefinitely. The kill timeout value should be passed to os-refresh-config and left to os-refresh-config to exit itself. If it turns out that os-collect-config stops polling for reasons other than os-refresh-config, a final recommended change could be to have os-collect-config send watchdog pings to systemd and set WatchdogSec in the systemd unit.

The os-collect-config service file already has Restart=on-failure, and to be fair most reports have been about os-collect-config running but stalled.

An upstream bug has been raised for an os-refresh-config --timeout feature.

The upstream changes for this are now ready for review. https://review.openstack.org/#/q/topic:bug/1595722

The above os-refresh-config timeout changes are worth it but I no longer believe that os-refresh-config is the prime cause of this issue. A comment in bug #1353716 shows a nova metadata server becoming temporarily unavailable, and the ec2 collectors failing as a result: http://pastebin.test.redhat.com/390534 The ec2, cfn and request collectors do a Python requests get without any timeout argument. This implies a timeout of None which will wait *forever* if necessary. This could wedge os-collect-config if a collector source is unavailable for some reason (network brownout, service down) during a collection attempt.

Upstream bug has a fix associated with it.
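To illustrate the missing-timeout problem described above, here is a minimal sketch of a collector fetch that always passes a timeout. It is not the os-collect-config code: the function name `collect` and the constant `DEFAULT_TIMEOUT` are hypothetical, and the sketch uses the standard library's `urllib.request` instead of the requests library so that it runs without extra packages. With requests, the corresponding change would be passing `timeout=` to `requests.get`.

```python
import urllib.request

# Assumed default, chosen to mirror the "connect timeout=10.0" seen in the
# verification log; the real default may differ.
DEFAULT_TIMEOUT = 10.0


def collect(url, timeout=DEFAULT_TIMEOUT):
    """Return the response body, or None if the source is unavailable.

    Without a timeout the call could block indefinitely when the remote
    end accepts the connection but never answers.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError as exc:
        # URLError, connection errors and socket timeouts are all
        # OSError subclasses, so an unreachable source ends up here.
        print("Source unavailable: %s" % exc)
        return None
```

With this shape, a source that is down costs at most roughly one timeout period per attempt, and the caller can simply try again on the next poll.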
Any connectivity issues from the overcloud to the undercloud could cause os-collect-config to stall indefinitely on HTTP requests. The workaround for when this occurs is to restart os-collect-config on all nodes. A permanent fix would be to backport https://review.openstack.org/#/c/340179/ and upgrade all nodes to the latest os-collect-config package.

Can we get a backport of this patch as it's affecting many customers? My current customer is using RHOSP 8.0 so we'd need to backport that patch to RHOSP 8.0 as well.

I think a backport is desirable but it's not my call. Setting needinfo to mburns

Can we update which version of os-collect-config this is fixed in and move it to modified?

Backports of this bug up to OSP7 will be approved - clones need to be created. Steve, will you be able to handle the backports too, please?

Hi Steve, I tried blocking one of the overcloud nodes access to the undercloud and watched the os-collect-config journal for changes. Below are my results. Can we consider these enough to verify this bug? Thank you.

journalctl -fl -u os-collect-config

=========== Block overcloud node connectivity to the undercloud ===========
Nov 24 10:11:12 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/ (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x34fc510>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))
Nov 24 10:11:12 overcloud-controller-0.localdomain os-collect-config[4021]: Source [ec2] Unavailable.
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='192.168.0.1', port=8080): Max retries exceeded with url: /v1/AUTH_ce297a53c20f4c46ba71c87eb8fe01b0/ov-6aavlgpnccx-0-cyszwjfzt6ki-Controller-kiih4dysq5v4/302be806-b4d3-4536-82b7-03fff809e08d?temp_url_sig=5575407b8acbde2c7fdc44db52f63b774abd7870&temp_url_expires=2147483586 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x3615090>, 'Connection to 192.168.0.1 timed out. (connect timeout=10.0)'))
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: Source [request] Unavailable.
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:11:22 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])
Nov 24 10:12:02 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/ (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x3615690>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))
Nov 24 10:12:02 overcloud-controller-0.localdomain os-collect-config[4021]: Source [ec2] Unavailable.
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: HTTPConnectionPool(host='192.168.0.1', port=8080): Max retries exceeded with url: /v1/AUTH_ce297a53c20f4c46ba71c87eb8fe01b0/ov-6aavlgpnccx-0-cyszwjfzt6ki-Controller-kiih4dysq5v4/302be806-b4d3-4536-82b7-03fff809e08d?temp_url_sig=5575407b8acbde2c7fdc44db52f63b774abd7870&temp_url_expires=2147483586 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x36151d0>, 'Connection to 192.168.0.1 timed out. (connect timeout=10.0)'))
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: Source [request] Unavailable.
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:12:12 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])

=========== Resume overcloud node connectivity to the undercloud ===========
Nov 24 10:12:43 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:12:43 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])
Nov 24 10:13:13 overcloud-controller-0.localdomain os-collect-config[4021]: /var/lib/os-collect-config/local-data not found. Skipping
Nov 24 10:13:13 overcloud-controller-0.localdomain os-collect-config[4021]: No local metadata found (['/var/lib/os-collect-config/local-data'])

Yes, this looks like the fix is working as expected.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html |
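The journal output above shows each unavailable source being reported and skipped, with polling carrying on once connectivity returns. A rough sketch of that behaviour follows. It is an illustration only, not the os-collect-config implementation: `poll_once`, `poll`, the 30 second default interval and the injectable `sleep` parameter are all hypothetical names and values chosen for the example.

```python
import time


def poll_once(collectors):
    """Try every collector once; report and skip any source that fails."""
    collected = {}
    for name, collect in collectors:
        try:
            collected[name] = collect()
        except OSError as exc:
            # A timed-out or refused request lands here instead of hanging.
            print(exc)
            print("Source [%s] Unavailable." % name)
    return collected


def poll(collectors, iterations, interval=30.0, sleep=time.sleep):
    """Poll a fixed number of times (hypothetical helper for illustration).

    The interval is an assumed value, and sleep is injectable so the
    loop can be exercised without waiting.
    """
    history = []
    for i in range(iterations):
        history.append(poll_once(collectors))
        if i + 1 < iterations:
            sleep(interval)
    return history
```

The point of the design is that a failure in one source never ends the loop: the next iteration simply tries every source again.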