Bug 1304878

Summary: rhel-osp-director: Update overcloud 7.0->7.3 times out on controller UpdateDeployment
Product: Red Hat OpenStack Reporter: Alexander Chuzhoy <sasha>
Component: openstack-tripleo-heat-templatesAssignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA QA Contact: Alexander Chuzhoy <sasha>
Severity: unspecified Docs Contact:
Priority: high    
Version: 7.0 (Kilo)CC: athomas, dbecker, dmacpher, dyasny, ggillies, jcoufal, jslagle, mburns, morazi, ohochman, rhel-osp-director-maint, shardy, zbitter
Target Milestone: y3   
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-115.el7ost Doc Type: Bug Fix
Doc Text:
Heat used a 1MB payload size for returned output. YUM's output exceeded this limit if updating a high number of packages on the Overcloud, which caused 'openstack overcloud update' to fail. This fix adds the '-q' option to YUM during an Overcloud update. This option sets the output to quiet mode, which reduced the output. YUM's output no longer exceeds Heat's limit and 'openstack overcloud update' succeeds.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-02-18 16:52:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
os-collect-config journal log from Controller-1
none
heat-engine.log
none
stdout from the UpdateDeployment
none
stderr from the UpdateDeployment none

Description Alexander Chuzhoy 2016-02-04 21:31:58 UTC
rhel-osp-director: Update overcloud 7.0->7.3  times out on controller UpdateDeployment

Environment:
instack-undercloud-2.1.2-37.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-112.el7ost.noarch

Steps to reproduce:
1. Deploy overcloud 7.0 with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1   --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server x.x.x.x --timeout 90 -e network-environment.yaml
2. After populating the overcloud with some objects, update it with:
openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/en
vironments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-vip.yaml  -e network-environment.yaml


Result:
ERROR: openstack ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1
Authentication required

 heat stack-show overcloud|grep reason
| stack_status_reason   | Timed out   




[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+---------------------------------------------+
| resource_name                               | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                             |
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+---------------------------------------------+
| Controller                                  | 773eac1c-84ec-4c89-8034-09d3d1ac5114          | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2016-02-03T20:31:15Z |                                             |
| 1                                           | 902aa56c-7f82-4d76-9764-e8733e9476bf          | OS::TripleO::Controller                           | UPDATE_FAILED   | 2016-02-03T20:31:23Z | Controller                                  |
| 0                                           | ef5268eb-47c0-42c0-84e3-4e07fdc17c57          | OS::TripleO::Controller                           | UPDATE_FAILED   | 2016-02-03T20:31:43Z | Controller                                  |
| 2                                           | f4334bce-e0a6-4e14-b952-4399080d5f10          | OS::TripleO::Controller                           | UPDATE_FAILED   | 2016-02-03T20:32:32Z | Controller                                  |
| UpdateDeployment                            | b1358eea-5cec-47a7-bc3a-26f33dc6d307          | OS::Heat::SoftwareDeployment                      | UPDATE_FAILED   | 2016-02-03T20:34:32Z | 1                                           |
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+---------------------------------------------+




[stack@instack ~]$ heat resource-list 902aa56c-7f82-4d76-9764-e8733e9476bf
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+
| resource_name            | physical_resource_id                 | resource_type                                   | resource_status | updated_time         |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+
| Controller               | 8d1fcdf9-b197-4d8c-8dd5-e97ae1fb4999 | OS::Nova::Server                                | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| ControllerDeployment     | 7b79aabf-96da-4436-b25a-5a246abbe196 | OS::TripleO::SoftwareDeployment                 | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| ControllerExtraConfigPre | e72c2ae8-65a3-4bc5-8385-e4bf1cf61750 | OS::TripleO::ControllerExtraConfigPre           | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| UpdateConfig             | 39fdf1e5-7cb3-4906-938a-85054bd64a97 | OS::TripleO::Tasks::PackageUpdate               | UPDATE_COMPLETE | 2016-02-03T20:31:50Z |
| NodeUserData             | 1c247ff1-a8f9-4c12-84e6-1c247efde2c8 | OS::TripleO::NodeUserData                       | UPDATE_COMPLETE | 2016-02-03T20:31:53Z |
| InternalApiPort          | c16bef3e-ec34-4094-8395-15251fa656f1 | OS::TripleO::Controller::Ports::InternalApiPort | UPDATE_COMPLETE | 2016-02-03T20:32:01Z |
| ExternalPort             | d81499b1-2585-4560-ae74-f73f99c9e9f4 | OS::TripleO::Controller::Ports::ExternalPort    | UPDATE_COMPLETE | 2016-02-03T20:32:14Z |
| StoragePort              | 9a0fe13d-0d9d-4887-b3a1-80e4d75fea85 | OS::TripleO::Controller::Ports::StoragePort     | UPDATE_COMPLETE | 2016-02-03T20:32:22Z |
| TenantPort               | 4c536c54-6301-4e26-8c46-8f637fb9e762 | OS::TripleO::Controller::Ports::TenantPort      | UPDATE_COMPLETE | 2016-02-03T20:32:30Z |
| StorageMgmtPort          | 305db570-ac9d-4165-b412-4a0418ca1c18 | OS::TripleO::Controller::Ports::StorageMgmtPort | UPDATE_COMPLETE | 2016-02-03T20:32:39Z |
| NetIpMap                 | 49e537c1-7fcc-4e6c-bb61-30bd879c9750 | OS::TripleO::Network::Ports::NetIpMap           | UPDATE_COMPLETE | 2016-02-03T20:32:52Z |
| NetIpSubnetMap           | 904a5d59-eb61-4b16-929d-7bfbb29f2fea | OS::TripleO::Network::Ports::NetIpMap           | UPDATE_COMPLETE | 2016-02-03T20:33:02Z |
| NetworkConfig            | 140c4610-5632-4a83-a563-fd0e8947cc01 | OS::TripleO::Controller::Net::SoftwareConfig    | UPDATE_COMPLETE | 2016-02-03T20:33:10Z |
| NetworkDeployment        | 4670f82e-d74f-4e5d-9d43-5007257a97f8 | OS::TripleO::SoftwareDeployment                 | UPDATE_COMPLETE | 2016-02-03T20:33:33Z |
| NodeTLSCAData            | 8057d408-76fd-476c-b54a-44f00348da31 | OS::TripleO::NodeTLSCAData                      | CREATE_COMPLETE | 2016-02-03T20:33:37Z |
| NodeTLSData              | 60f97e90-411e-4a79-9427-1c698af971e6 | OS::TripleO::NodeTLSData                        | CREATE_COMPLETE | 2016-02-03T20:33:44Z |
| ControllerConfig         | 813d1ac3-3a5c-45fe-9e8d-5407f25f33d2 | OS::Heat::StructuredConfig                      | CREATE_COMPLETE | 2016-02-03T20:34:00Z |
| UpdateDeployment         | b1358eea-5cec-47a7-bc3a-26f33dc6d307 | OS::Heat::SoftwareDeployment                    | UPDATE_FAILED   | 2016-02-03T20:34:32Z |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+

Comment 2 Zane Bitter 2016-02-04 21:49:01 UTC
Created attachment 1121206 [details]
os-collect-config journal log from Controller-1

The only resources in the failed state are the Controller group, its constituent nested stacks, and the UpdateDeployment resource on Controller-1.

# heat stack-list -n | grep -v COMPLETE
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------+
| id                                   | stack_name                                                                                                    | stack_status    | creation_time        | parent                               |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------+
| 17d43d0f-b2fb-4134-ab1f-ac8609d1307c | overcloud                                                                                                     | UPDATE_FAILED   | 2016-02-03T17:14:27Z | None                                 |
| 773eac1c-84ec-4c89-8034-09d3d1ac5114 | overcloud-Controller-keqvaphuzsnz                                                                             | UPDATE_FAILED   | 2016-02-03T17:14:46Z | 17d43d0f-b2fb-4134-ab1f-ac8609d1307c |
| 902aa56c-7f82-4d76-9764-e8733e9476bf | overcloud-Controller-keqvaphuzsnz-1-3336hvdojdsw                                                              | UPDATE_FAILED   | 2016-02-03T17:14:50Z | 773eac1c-84ec-4c89-8034-09d3d1ac5114 |
| ef5268eb-47c0-42c0-84e3-4e07fdc17c57 | overcloud-Controller-keqvaphuzsnz-0-wlu5pzdbkmfl                                                              | UPDATE_FAILED   | 2016-02-03T17:14:54Z | 773eac1c-84ec-4c89-8034-09d3d1ac5114 |
| f4334bce-e0a6-4e14-b952-4399080d5f10 | overcloud-Controller-keqvaphuzsnz-2-wu3su4rcs4uy                                                              | UPDATE_FAILED   | 2016-02-03T17:15:00Z | 773eac1c-84ec-4c89-8034-09d3d1ac5114 |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+-----------------+----------------------+--------------------------------------+

The nested stack resources all show "Timed Out" as their status_reason. The UpdateDeployment on Controller-1 shows "UPDATE aborted", and all of its attributes are null.

[root@instack ~]# heat resource-list 902aa56c-7f82-4d76-9764-e8733e9476bf
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+
| resource_name            | physical_resource_id                 | resource_type                                   | resource_status | updated_time         |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+
| Controller               | 8d1fcdf9-b197-4d8c-8dd5-e97ae1fb4999 | OS::Nova::Server                                | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| ControllerDeployment     | 7b79aabf-96da-4436-b25a-5a246abbe196 | OS::TripleO::SoftwareDeployment                 | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| ControllerExtraConfigPre | e72c2ae8-65a3-4bc5-8385-e4bf1cf61750 | OS::TripleO::ControllerExtraConfigPre           | CREATE_COMPLETE | 2016-02-03T17:14:50Z |
| UpdateConfig             | 39fdf1e5-7cb3-4906-938a-85054bd64a97 | OS::TripleO::Tasks::PackageUpdate               | UPDATE_COMPLETE | 2016-02-03T20:31:50Z |
| NodeUserData             | 1c247ff1-a8f9-4c12-84e6-1c247efde2c8 | OS::TripleO::NodeUserData                       | UPDATE_COMPLETE | 2016-02-03T20:31:53Z |
| InternalApiPort          | c16bef3e-ec34-4094-8395-15251fa656f1 | OS::TripleO::Controller::Ports::InternalApiPort | UPDATE_COMPLETE | 2016-02-03T20:32:01Z |
| ExternalPort             | d81499b1-2585-4560-ae74-f73f99c9e9f4 | OS::TripleO::Controller::Ports::ExternalPort    | UPDATE_COMPLETE | 2016-02-03T20:32:14Z |
| StoragePort              | 9a0fe13d-0d9d-4887-b3a1-80e4d75fea85 | OS::TripleO::Controller::Ports::StoragePort     | UPDATE_COMPLETE | 2016-02-03T20:32:22Z |
| TenantPort               | 4c536c54-6301-4e26-8c46-8f637fb9e762 | OS::TripleO::Controller::Ports::TenantPort      | UPDATE_COMPLETE | 2016-02-03T20:32:30Z |
| StorageMgmtPort          | 305db570-ac9d-4165-b412-4a0418ca1c18 | OS::TripleO::Controller::Ports::StorageMgmtPort | UPDATE_COMPLETE | 2016-02-03T20:32:39Z |
| NetIpMap                 | 49e537c1-7fcc-4e6c-bb61-30bd879c9750 | OS::TripleO::Network::Ports::NetIpMap           | UPDATE_COMPLETE | 2016-02-03T20:32:52Z |
| NetIpSubnetMap           | 904a5d59-eb61-4b16-929d-7bfbb29f2fea | OS::TripleO::Network::Ports::NetIpMap           | UPDATE_COMPLETE | 2016-02-03T20:33:02Z |
| NetworkConfig            | 140c4610-5632-4a83-a563-fd0e8947cc01 | OS::TripleO::Controller::Net::SoftwareConfig    | UPDATE_COMPLETE | 2016-02-03T20:33:10Z |
| NetworkDeployment        | 4670f82e-d74f-4e5d-9d43-5007257a97f8 | OS::TripleO::SoftwareDeployment                 | UPDATE_COMPLETE | 2016-02-03T20:33:33Z |
| NodeTLSCAData            | 8057d408-76fd-476c-b54a-44f00348da31 | OS::TripleO::NodeTLSCAData                      | CREATE_COMPLETE | 2016-02-03T20:33:37Z |
| NodeTLSData              | 60f97e90-411e-4a79-9427-1c698af971e6 | OS::TripleO::NodeTLSData                        | CREATE_COMPLETE | 2016-02-03T20:33:44Z |
| ControllerConfig         | 813d1ac3-3a5c-45fe-9e8d-5407f25f33d2 | OS::Heat::StructuredConfig                      | CREATE_COMPLETE | 2016-02-03T20:34:00Z |
| UpdateDeployment         | b1358eea-5cec-47a7-bc3a-26f33dc6d307 | OS::Heat::SoftwareDeployment                    | UPDATE_FAILED   | 2016-02-03T20:34:32Z |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+----------------------+

So it looks like the UpdateDeployment was just hanging, until the stack timed out.


The os-collect-config journal on Controller-1 shows something happening 34s after UpdateDeployment starts running (for convenience, I've output the journal timestamps in UTC):

Feb 03 20:35:06 overcloud-controller-1.localdomain os-collect-config[4478]: [2016-02-03 15:35:06,231] (os-refresh-config) [INFO] Starting phase pre-configure

and it completes successfully. However the UpdateDeployment is *not* one of those run:

Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: dib-run-parts Wed Feb 3 15:48:02 EST 2016 Running /usr/libexec/os-refresh-config/configure.d/55-heat-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,397] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,397] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping config 7e3d1f2e-e397-46e9-aa98-edb88d4b48ca, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/7e3d1f2e-e397-46e9-aa98-edb88d4b48ca.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping group Heat::Ungrouped with no hook script /var/lib/heat-config/hooks/HeatUngrouped
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,398] (heat-config) [WARNING] Skipping config 02e762c0-1232-420d-a99c-668b1c4e7dfa, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/02e762c0-1232-420d-a99c-668b1c4e7dfa.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] Skipping config 00fcec8b-bc50-41d6-b91d-f9aabf4a89ef, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/00fcec8b-bc50-41d6-b91d-f9aabf4a89ef.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] Skipping config 6c7d0e51-43e3-4881-95da-a0fa87da0491, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/6c7d0e51-43e3-4881-95da-a0fa87da0491.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] Skipping config 715037a4-d334-4762-947d-7b76bd2764db, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,399] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/715037a4-d334-4762-947d-7b76bd2764db.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping config 038b27a0-0190-452a-a37d-d6cf35bdfdf6, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/038b27a0-0190-452a-a37d-d6cf35bdfdf6.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping config 88d98fc1-3dd4-4078-9e96-e84ce081136c, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/88d98fc1-3dd4-4078-9e96-e84ce081136c.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping config 48398659-a33a-4700-b2b2-3abf5d4aa231, already deployed
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/48398659-a33a-4700-b2b2-3abf5d4aa231.json
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,400] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: [2016-02-03 15:48:02,401] (heat-config) [WARNING] Skipping group Heat::Ungrouped with no hook script /var/lib/heat-config/hooks/HeatUngrouped
Feb 03 20:48:02 overcloud-controller-1.localdomain os-collect-config[26579]: dib-run-parts Wed Feb 3 15:48:02 EST 2016 55-heat-config completed

before and after this burst of activity (heavily edited here), it just logs stuff of the form:


Feb 03 20:35:06 overcloud-controller-1.localdomain os-collect-config[4478]: 2016-02-03 15:35:06.168 4478 WARNING os_collect_config.local [-] /var/lib/os-collect-config/local-data not found. Skipping
Feb 03 20:35:06 overcloud-controller-1.localdomain os-collect-config[4478]: 2016-02-03 15:35:06.169 4478 WARNING os_collect_config.local [-] No local metadata found (['/var/lib/os-collect-config/local-data'])

The full os-collect-config log is attached.

Comment 3 Zane Bitter 2016-02-04 22:42:26 UTC
Created attachment 1121228 [details]
heat-engine.log

Nothing of interest in the heat-engine log.

Comment 4 Zane Bitter 2016-02-05 16:29:35 UTC
It appears that os-collect-config is behaving normally. It sees a metadata update at the appropriate time, but presumably the metadata it retrieves does not contain the data for the UpdateDeployment deployment that it is supposed to run.

Previous bugs of this type have been caused by race conditions in various parts of the code:

https://bugs.launchpad.net/heat/+bug/1488366
https://bugs.launchpad.net/heat/+bug/1477329

Both of those are now fixed, but it's possible there is another race somewhere.

(Thanks to Thomas Hervé for this analysis.)

Comment 5 Zane Bitter 2016-02-05 21:35:08 UTC
I raised an upstream bug to note the fact that there definitely *are* races in metadata writes even though we don't know yet how (or, indeed, if) they are being triggered in this case:

https://bugs.launchpad.net/heat/+bug/1542480

I also raised a bug (and submitted a patch) to stop us from ever recalculating metadata as a result of clearing a user hook, although that should not be the cause of this issue (since in this case the hook was set on a SoftwareDeployment resource, and signals to SoftwareDeployment resources were already exempt per bug 1255759):

https://bugs.launchpad.net/heat/+bug/1542455

Comment 6 Zane Bitter 2016-02-06 01:17:46 UTC
OK, it turns out all of the speculation above was a red herring. os-collect-config was running fine in the background but not logging (as documented in bug 1272254) because it itself had been updated. (When the new version started up, it didn't feel the need to re-run the deployment because it had already been run, just not logged to the journal.)

The real problem is this, from the heat-api-cfn log on the undercloud (note: this was on a different setup, so the times don't match the ones above):

2016-02-05 13:08:25.079 25154 INFO eventlet.wsgi.server [req-3992ee77-eac5-4162-9538-68beb8c0bf38 b2144b17f14644ccbbd3da4c4f97e352 f56a0c07e08a4166b876bbb8076e3672] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 454, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 410, in __call__
    response = req.get_response(self.application)
  File "/usr/lib/python2.7/site-packages/webob/request.py", line 1296, in send
    application, catch_exc_info=False)
  File "/usr/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application
    app_iter = application(self.environ, start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/osprofiler/web.py", line 99, in __call__
    return request.get_response(self.application)
  File "/usr/lib/python2.7/site-packages/webob/request.py", line 1296, in send
    application, catch_exc_info=False)
  File "/usr/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application
    app_iter = application(self.environ, start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
    return resp(environ, start_response)
  File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 634, in __call__
    return self._call_app(env, start_response)
  File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 554, in _call_app
    return self._app(env, _fake_start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 410, in __call__
    response = req.get_response(self.application)
  File "/usr/lib/python2.7/site-packages/webob/request.py", line 1296, in send
    application, catch_exc_info=False)
  File "/usr/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application
    app_iter = application(self.environ, start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
    return resp(environ, start_response)
  File "/usr/lib/python2.7/site-packages/routes/middleware.py", line 131, in __call__
    response = self.app(environ, start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
    return resp(environ, start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/common/wsgi.py", line 699, in __call__
    raise translate_exception(err, request.best_match_language())
RequestLimitExceeded: Request limit exceeded: JSON body size (1050539 bytes) exceeds maximum allowed size (1048576 bytes).
2016-02-05 13:08:25.079 25154 INFO eventlet.wsgi.server [req-3992ee77-eac5-4162-9538-68beb8c0bf38 b2144b17f14644ccbbd3da4c4f97e352 f56a0c07e08a4166b876bbb8076e3672] 192.0.2.8 - - [05/Feb/2016 13:08:25] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3Af75923fe9f1d46008171167c4887ff5f%3Astacks%2Fovercloud-Controller-7dgfo27yepjy-1-zjuybvpl5jha%2F956959d9-e24b-4543-b0c3-e77a5b3133bf%2Fresources%2FUpdateDeployment?Timestamp=2016-02-04T20%3A25%3A30Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=9a1a426c23d944df9fbe2b6aa78a32c5&SignatureVersion=2&Signature=GUOR5yyHPvTm8q1HWGap1rWNAcKg1lSBtjVFtpk%2By5s%3D HTTP/1.1" 500 3618 0.297531

The response from os-collect-config back to Heat is HUUUUUUGE - slightly over 1MB, which turns out to be the limit.

The surfeit of data is mostly the progress output of yum, so it depends entirely on how many packages there are to update (older installations are likely to have more trouble). That's kind of unfortunate because that data is nice to have for debugging when things go wrong - it would be a shame to have to turn it off by running "yum -q". A lot of the data appears to be duplicated between stdout and stderr for reasons that I have not yet dug into, but there is an opportunity there to save probably around a quarter to a third - which would buy us some temporary breathing room. We may or may not also be able to tweak the debug level to find a happy medium.

Comment 7 Zane Bitter 2016-02-06 01:18:40 UTC
Created attachment 1121542 [details]
stdout from the UpdateDeployment

Comment 8 Zane Bitter 2016-02-06 01:19:26 UTC
Created attachment 1121543 [details]
stderr from the UpdateDeployment

Comment 9 Steven Hardy 2016-02-06 08:33:37 UTC
Thanks for the update Zane.  Since we're in the process of switching to Swift for metadata polling (ref https://review.openstack.org/#/c/257657/) I guess we can look at moving to swift for the signal transport too, which would probably address the payload limit issue.

Comment 11 Zane Bitter 2016-02-10 20:48:35 UTC
I just noticed that https://review.openstack.org/#/c/188737/ was committed upstream as the fix for bug 1243884 (originally part of bug 1242396, before it was split into multiple bugs for different components), but never actually pulled in to the downstream openstack-heat-templates package. This would have prevented the issue.

Comment 13 Zane Bitter 2016-02-12 16:00:08 UTC
*** Bug 1304463 has been marked as a duplicate of this bug. ***

Comment 18 Alexander Chuzhoy 2016-02-16 23:09:06 UTC
Verified:
Environment: openstack-tripleo-heat-templates-0.8.6-120.el7ost.noarch
Successfully updated from 7.0 -> to 7.3.

Comment 20 errata-xmlrpc 2016-02-18 16:52:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0264.html

Comment 21 Zane Bitter 2016-06-16 11:12:45 UTC
*** Bug 1304463 has been marked as a duplicate of this bug. ***