Bug 1304878
Summary: | rhel-osp-director: Update overcloud 7.0->7.3 times out on controller UpdateDeployment | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> | ||||||||||
Component: | openstack-tripleo-heat-templates | Assignee: | Jiri Stransky <jstransk> | ||||||||||
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha> | ||||||||||
Severity: | unspecified | Docs Contact: | |||||||||||
Priority: | high | ||||||||||||
Version: | 7.0 (Kilo) | CC: | athomas, dbecker, dmacpher, dyasny, ggillies, jcoufal, jslagle, mburns, morazi, ohochman, rhel-osp-director-maint, shardy, zbitter | ||||||||||
Target Milestone: | y3 | ||||||||||||
Target Release: | 7.0 (Kilo) | ||||||||||||
Hardware: | Unspecified | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | openstack-tripleo-heat-templates-0.8.6-115.el7ost | Doc Type: | Bug Fix | ||||||||||
Doc Text: |
Heat used a 1MB payload size for returned output. YUM's output exceeded this limit if updating a high number of packages on the Overcloud, which caused 'openstack overcloud update' to fail. This fix adds the '-q' option to YUM during an Overcloud update. This option sets the output to quiet mode, which reduced the output. YUM's output no longer exceeds Heat's limit and 'openstack overcloud update' succeeds.
|
Story Points: | --- | ||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2016-02-18 16:52:35 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Attachments: |
|
Description
Alexander Chuzhoy
2016-02-04 21:31:58 UTC
Created attachment 1121206 [details]
os-collect-config journal log from Controller-1
The only resources in the failed state are the Controller group, its constituent nested stacks, and the UpdateDeployment resource on Controller-1.
# heat stack-list -n | grep -v COMPLETE
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------+
| id | stack_name | stack_status | creation_time | parent |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------+
| 17d43d0f-b2fb-4134-ab1f-ac8609d1307c | overcloud | UPDATE_FAILED | 2016-02-03T17:14:27Z | None |
| 773eac1c-84ec-4c89-8034-09d3d1ac5114 | overcloud-Controller-keqvaphuzsnz | UPDATE_FAILED | 2016-02-03T17:14:46Z | 17d43d0f-b2fb-4134-ab1f-ac8609d1307c |
| 902aa56c-7f82-4d76-9764-e8733e9476bf | overcloud-Controller-keqvaphuzsnz-1-3336hvdojdsw | UPDATE_FAILED | 2016-02-03T17:14:50Z | 773eac1c-84ec-4c89-8034-09d3d1ac5114 |
| ef5268eb-47c0-42c0-84e3-4e07fdc17c57 | overcloud-Controller-keqvaphuzsnz-0-wlu5pzdbkmfl | UPDATE_FAILED | 2016-02-03T17:14:54Z | 773eac1c-84ec-4c89-8034-09d3d1ac5114 |
| f4334bce-e0a6-4e14-b952-4399080d5f10 | overcloud-Controller-keqvaphuzsnz-2-wu3su4rcs4uy | UPDATE_FAILED | 2016-02-03T17:15:00Z | 773eac1c-84ec-4c89-8034-09d3d1ac5114 |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------+
The nested stack resources all show "Timed Out" as their status_reason. The UpdateDeployment on Controller-1 shows "UPDATE aborted", and all of its attributes are null.
[root@instack ~]# heat resource-list 902aa56c-7f82-4d76-9764-e8733e9476bf
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+
| resource_name | physical_resource_id | resource_type | resource_status | updated_time |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+
| Controller | 8d1fcdf9-b197-4d8c-8dd5-e97ae1fb4999 | OS::Nova::Server | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| ControllerDeployment | 7b79aabf-96da-4436-b25a-5a246abbe196 | OS::TripleO::SoftwareDeployment | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| ControllerExtraConfigPre | e72c2ae8-65a3-4bc5-8385-e4bf1cf61750 | OS::TripleO::ControllerExtraConfigPre | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| UpdateConfig | 39fdf1e5-7cb3-4906-938a-85054bd64a97 | OS::TripleO::Tasks::PackageUpdate | UPDATE_COMPLETE | 2016-02-03T20:31:50Z |
| NodeUserData | 1c247ff1-a8f9-4c12-84e6-1c247efde2c8 | OS::TripleO::NodeUserData | UPDATE_COMPLETE | 2016-02-03T20:31:53Z |
| InternalApiPort | c16bef3e-ec34-4094-8395-15251fa656f1 | OS::TripleO::Controller::Ports::InternalApiPort | UPDATE_COMPLETE | 2016-02-03T20:32:01Z |
| ExternalPort | d81499b1-2585-4560-ae74-f73f99c9e9f4 | OS::TripleO::Controller::Ports::ExternalPort | UPDATE_COMPLETE | 2016-02-03T20:32:14Z |
| StoragePort | 9a0fe13d-0d9d-4887-b3a1-80e4d75fea85 | OS::TripleO::Controller::Ports::StoragePort | UPDATE_COMPLETE | 2016-02-03T20:32:22Z |
| TenantPort | 4c536c54-6301-4e26-8c46-8f637fb9e762 | OS::TripleO::Controller::Ports::TenantPort | UPDATE_COMPLETE | 2016-02-03T20:32:30Z |
| StorageMgmtPort | 305db570-ac9d-4165-b412-4a0418ca1c18 | OS::TripleO::Controller::Ports::StorageMgmtPort | UPDATE_COMPLETE | 2016-02-03T20:32:39Z |
| NetIpMap | 49e537c1-7fcc-4e6c-bb61-30bd879c9750 | OS::TripleO::Network::Ports::NetIpMap | UPDATE_COMPLETE | 2016-02-03T20:32:52Z |
| NetIpSubnetMap | 904a5d59-eb61-4b16-929d-7bfbb29f2fea | OS::TripleO::Network::Ports::NetIpMap | UPDATE_COMPLETE | 2016-02-03T20:33:02Z |
| NetworkConfig | 140c4610-5632-4a83-a563-fd0e8947cc01 | OS::TripleO::Controller::Net::SoftwareConfig | UPDATE_COMPLETE | 2016-02-03T20:33:10Z |
| NetworkDeployment | 4670f82e-d74f-4e5d-9d43-5007257a97f8 | OS::TripleO::SoftwareDeployment | UPDATE_COMPLETE | 2016-02-03T20:33:33Z |
| NodeTLSCAData | 8057d408-76fd-476c-b54a-44f00348da31 | OS::TripleO::NodeTLSCAData | CREATE_COMPLETE | 2016-02-03T20:33:37Z |
| NodeTLSData | 60f97e90-411e-4a79-9427-1c698af971e6 | OS::TripleO::NodeTLSData | CREATE_COMPLETE | 2016-02-03T20:33:44Z |
| ControllerConfig | 813d1ac3-3a5c-45fe-9e8d-5407f25f33d2 | OS::Heat::StructuredConfig | CREATE_COMPLETE | 2016-02-03T20:34:00Z |
| UpdateDeployment | b1358eea-5cec-47a7-bc3a-26f33dc6d307 | OS::Heat::SoftwareDeployment | UPDATE_FAILED | 2016-02-03T20:34:32Z |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+
So it looks like the UpdateDeployment was just hanging, until the stack timed out.
The os-collect-config journal on Controller-1 shows something happening 34s after UpdateDeployment starts running (for convenience, I've output the journal timestamps in UTC):
Feb 03 20:35:06 overcloud-controller-1.localdomain os-collect-config[4478]: [2016-02-03 15:35:06,231] (os-refresh-config) [INFO] Starting phase pre-configure
and it completes successfully. However the UpdateDeployment is *not* one of those run:
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: dib-run-parts Wed Feb 3 15:48:02 EST 2016 Running /usr/libexec/os-refresh-config/configure.d/55-heat-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,397] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,397] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping config 7e3d1f2e-e397-46e9-aa98-edb88d4b48ca, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/7e3d1f2e-e397-46e9-aa98-edb88d4b48ca.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping group Heat::Ungrouped with no hook script /var/lib/heat-config/hooks/HeatUngrouped
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping config 02e762c0-1232-420d-a99c-668b1c4e7dfa, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/02e762c0-1232-420d-a99c-668b1c4e7dfa.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] Skipping config 00fcec8b-bc50-41d6-b91d-f9aabf4a89ef, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/00fcec8b-bc50-41d6-b91d-f9aabf4a89ef.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] Skipping config 6c7d0e51-43e3-4881-95da-a0fa87da0491, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/6c7d0e51-43e3-4881-95da-a0fa87da0491.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] Skipping config 715037a4-d334-4762-947d-7b76bd2764db, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/715037a4-d334-4762-947d-7b76bd2764db.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping config 038b27a0-0190-452a-a37d-d6cf35bdfdf6, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/038b27a0-0190-452a-a37d-d6cf35bdfdf6.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping config 88d98fc1-3dd4-4078-9e96-e84ce081136c, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/88d98fc1-3dd4-4078-9e96-e84ce081136c.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping config 48398659-a33a-4700-b2b2-3abf5d4aa231, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/48398659-a33a-4700-b2b2-3abf5d4aa231.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,401] (heat-config) [WARNING] Skipping group Heat::Ungrouped with no hook script /var/lib/heat-config/hooks/HeatUngrouped
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: dib-run-parts Wed Feb 3 15:48:02 EST 2016 55-heat-config completed
before and after this burst of activity (heavily edited here), it just logs stuff of the form:
Feb 03 20:35:06 overcloud-controller-1.localdomain os-collect-config[4478]: 2016-02-03 15:35:06.168 4478 WARNING os_collect_config.local [-] /var/lib/os-collect-config/local-data not found. Skipping
Feb 03 20:35:06 overcloud-controller-1.localdomain os-collect-config[4478]: 2016-02-03 15:35:06.169 4478 WARNING os_collect_config.local [-] No local metadata found (['/var/lib/os-collect-config/local-data'])
The full os-collect-config log is attached.
Created attachment 1121228 [details]
heat-engine.log
Nothing of interest in the heat-engine log.
It appears that os-collect-config is behaving normally. It sees a metadata update at the appropriate time, but presumably the metadata it retrieves does not contain the data for the UpdateDeployment deployment that it is supposed to run. Previous bugs of this type have been caused by race conditions in various parts of the code: https://bugs.launchpad.net/heat/+bug/1488366 https://bugs.launchpad.net/heat/+bug/1477329 Both of those are now fixed, but it's possible there is another race somewhere. (Thanks to Thomas Hervé for this analysis.) I raised an upstream bug to note the fact that there definitely *are* races in metadata writes even though we don't know yet how (or, indeed, if) they are being triggered in this case: https://bugs.launchpad.net/heat/+bug/1542480 I also raised a bug (and submitted a patch) to stop us from ever recalculating metadata as a result of clearing a user hook, although that should not be the cause of this issue (since in this case the hook was set on a SoftwareDeployment resource, and signals to SoftwareDeployment resources were already exempt per bug 1255759): https://bugs.launchpad.net/heat/+bug/1542455 OK, it turns out all of the speculation above was a red herring. os-collect-config was running fine in the background but not logging (as documented in bug 1272254) because it itself had been updated. (When the new version started up, it didn't feel the need to re-run the deployment because it had already been run, just not logged to the journal.) The real problem is this, from the heat-api-cfn log on the undercloud (note: this was on a different setup, so the times don't match the ones above): 2016-02-05 13:08:25.079 25154 INFO eventlet.wsgi.server [req-3992ee77-eac5-4162-9538-68beb8c0bf38 b2144b17f14644ccbbd3da4c4f97e352 f56a0c07e08a4166b876bbb8076e3672] Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 454, in handle_one_response result = self.application(self.environ, start_response) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__ resp = self.call_func(req, *args, **self.kwargs) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func return self.func(req, *args, **kwargs) File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 410, in __call__ response = req.get_response(self.application) File "/usr/lib/python2.7/site-packages/webob/request.py", line 1296, in send application, catch_exc_info=False) File "/usr/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application app_iter = application(self.environ, start_response) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__ resp = self.call_func(req, *args, **self.kwargs) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func return self.func(req, *args, **kwargs) File "/usr/lib/python2.7/site-packages/osprofiler/web.py", line 99, in __call__ return request.get_response(self.application) File "/usr/lib/python2.7/site-packages/webob/request.py", line 1296, in send application, catch_exc_info=False) File "/usr/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application app_iter = application(self.environ, start_response) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__ return resp(environ, start_response) File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 634, in __call__ return self._call_app(env, start_response) File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 554, in _call_app return self._app(env, _fake_start_response) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__ resp = self.call_func(req, *args, **self.kwargs) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func return self.func(req, *args, **kwargs) File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 410, in __call__ response = req.get_response(self.application) File "/usr/lib/python2.7/site-packages/webob/request.py", line 1296, in send application, catch_exc_info=False) File "/usr/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application app_iter = application(self.environ, start_response) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__ return resp(environ, start_response) File "/usr/lib/python2.7/site-packages/routes/middleware.py", line 131, in __call__ response = self.app(environ, start_response) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__ return resp(environ, start_response) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__ resp = self.call_func(req, *args, **self.kwargs) File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func return self.func(req, *args, **kwargs) File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 699, in __call__ raise translate_exception(err, request.best_match_language()) RequestLimitExceeded: Request limit exceeded: JSON body size (1050539 bytes) exceeds maximum allowed size (1048576 bytes). 2016-02-05 13:08:25.079 25154 INFO eventlet.wsgi.server [req-3992ee77-eac5-4162-9538-68beb8c0bf38 b2144b17f14644ccbbd3da4c4f97e352 f56a0c07e08a4166b876bbb8076e3672] 192.0.2.8 - - [05/Feb/2016 13:08:25] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3Af75923fe9f1d46008171167c4887ff5f%3Astacks%2Fovercloud-Controller-7dgfo27yepjy-1-zjuybvpl5jha%2F956959d9-e24b-4543-b0c3-e77a5b3133bf%2Fresources%2FUpdateDeployment?Timestamp=2016-02-04T20%3A25%3A30Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=9a1a426c23d944df9fbe2b6aa78a32c5&SignatureVersion=2&Signature=GUOR5yyHPvTm8q1HWGap1rWNAcKg1lSBtjVFtpk%2By5s%3D HTTP/1.1" 500 3618 0.297531 The response from os-collect-config back to Heat is HUUUUUUGE - slightly over 1MB, which turns out to be the limit. The surfeit of data is mostly the progress output of yum, so it depends entirely on how many packages there are to update (older installations are likely to have more trouble). That's kind of unfortunate because that data is nice to have for debugging when things go wrong - it would be a shame to have to turn it off by running "yum -q". A lot of the data appears to be duplicated between stdout and stderr for reasons that I have not yet dug into, but there is an opportunity there to save probably around a quarter to a third - which would buy us some temporary breathing room. We may or may not also be able to tweak the debug level to find a happy medium. Created attachment 1121542 [details]
stdout from the UpdateDeployment
Created attachment 1121543 [details]
stderr from the UpdateDeployment
Thanks for the update Zane. Since we're in the process of switching to Swift for metadata polling (ref https://review.openstack.org/#/c/257657/) I guess we can look at moving to swift for the signal transport too, which would probably address the payload limit issue. I just noticed that https://review.openstack.org/#/c/188737/ was committed upstream as the fix for bug 1243884 (originally part of bug 1242396, before it was split into multiple bugs for different components), but never actually pulled in to the downstream openstack-heat-templates package. This would have prevented the issue. *** Bug 1304463 has been marked as a duplicate of this bug. *** Verified: Environment: openstack-tripleo-heat-templates-0.8.6-120.el7ost.noarch Successfully updated from 7.0 -> to 7.3. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0264.html *** Bug 1304463 has been marked as a duplicate of this bug. *** |