Description of problem:

Tried to deploy HA+ceph (3 controllers, 1 compute and 1 ceph) via the undercloud UI. One+ hours after the deployment started, got: CREATE_FAILED, Create timed out.

In heat-engine:

2015-09-16 05:50:02.029 29559 ERROR oslo_db.api [req-83818354-80bb-4c5d-b6c2-96bb69f5e006 admin admin] DB error.
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api Traceback (most recent call last):
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 131, in wrapper
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api     return f(*args, **kwargs)
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api   File "/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py", line 89, in _push_metadata_software_deployments
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api     exception.DeploymentConcurrentTransaction(server=server_id))
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api RetryRequest

Version-Release number of selected component (if applicable):
openstack-tuskar-ui-0.4.0-3.el7ost.noarch
python-tuskarclient-0.1.18-4.el7ost.noarch
openstack-tuskar-0.4.18-4.el7ost.noarch
openstack-tuskar-ui-extras-0.0.4-1.el7ost.noarch

How reproducible:

Steps to Reproduce:
1. Access the undercloud UI, register 5 nodes, create 1 flavor (the one suggested after the nodes were discovered), upload images, and assign images and the flavor to the roles. Set 3 for controller, 1 for compute and 1 for ceph.
2. In Simplify Configuration, change the snmp password to 'password'.
3. Verify the deployment type is "Virtualized".
4.
Set the Service net map for all resources to a JSON-format value:

{
  "GlanceRegistryNetwork": "internal_api",
  "NeutronTenantNetwork": "tenant",
  "NovaApiNetwork": "internal_api",
  "CeilometerApiNetwork": "internal_api",
  "CephStorageHostnameResolveNetwork": "storage",
  "SwiftMgmtNetwork": "storage_mgmt",
  "MemcachedNetwork": "internal_api",
  "RabbitMqNetwork": "internal_api",
  "KeystoneAdminApiNetwork": "internal_api",
  "SwiftProxyNetwork": "storage",
  "CinderApiNetwork": "internal_api",
  "CephClusterNetwork": "storage_mgmt",
  "NovaMetadataNetwork": "internal_api",
  "RedisNetwork": "internal_api",
  "NeutronApiNetwork": "internal_api",
  "GlanceApiNetwork": "storage",
  "ObjectStorageHostnameResolveNetwork": "internal_api",
  "KeystonePublicApiNetwork": "internal_api",
  "HeatApiNetwork": "internal_api",
  "NovaVncProxyNetwork": "internal_api",
  "ControllerHostnameResolveNetwork": "internal_api",
  "MysqlNetwork": "internal_api",
  "BlockStorageHostnameResolveNetwork": "internal_api",
  "ComputeHostnameResolveNetwork": "internal_api",
  "CephPublicNetwork": "storage",
  "MongoDbNetwork": "internal_api",
  "HorizonNetwork": "internal_api",
  "CinderIscsiNetwork": "storage"
}

Actual results:
Deployment failed; timed out after ~1.5 hours.

Expected results:
Deployment succeeds.

Additional info:

$ heat resource-show overcloud ComputeNodesPostDeployment

attributes             : {}
description            :
links                  : http://192.0.2.1:8004/v1/a91bf4578f7344b79314be8d208a79e1/stacks/overcloud/ab4652a4-4ff0-457f-a330-d95e6a3b7deb/resources/ComputeNodesPostDeployment (self)
                         http://192.0.2.1:8004/v1/a91bf4578f7344b79314be8d208a79e1/stacks/overcloud/ab4652a4-4ff0-457f-a330-d95e6a3b7deb (stack)
                         http://192.0.2.1:8004/v1/a91bf4578f7344b79314be8d208a79e1/stacks/overcloud-ComputeNodesPostDeployment-izwgx346q3ia/2beed1f4-ab2a-4c9c-a17a-092bf76daf59 (nested)
logical_resource_id    : ComputeNodesPostDeployment
physical_resource_id   : 2beed1f4-ab2a-4c9c-a17a-092bf76daf59
required_by            :
resource_name          : ComputeNodesPostDeployment
resource_status        : CREATE_FAILED
resource_status_reason : CREATE aborted
resource_type          : OS::TripleO::ComputePostDeployment
updated_time           : 2015-09-16T09:22:45Z

Heat logs and the messages log are attached.
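Since the ServiceNetMap value above has to be hand-pasted into the UI as JSON, a quick sanity check can rule out a malformed mapping as the failure cause. The sketch below is illustrative only; the set of allowed network names is an assumption drawn from the values used in this deployment, not an authoritative list.

```python
import json

# Assumed set of valid network names, inferred from the mapping above.
ALLOWED_NETWORKS = {"internal_api", "tenant", "storage", "storage_mgmt",
                    "external", "ctlplane"}

def check_service_net_map(raw):
    """Parse a ServiceNetMap string and flag unknown network names.

    Raises ValueError if the string is not valid JSON.
    """
    mapping = json.loads(raw)
    bad = {k: v for k, v in mapping.items() if v not in ALLOWED_NETWORKS}
    return mapping, bad

raw = '{"MysqlNetwork": "internal_api", "CephClusterNetwork": "storage_mgmt"}'
mapping, bad = check_service_net_map(raw)
print(len(mapping), bad)  # 2 {}
```

In this case the pasted JSON parsed cleanly, so the map itself can likely be ruled out.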
Created attachment 1073977 [details] heat-logs1.tar.gz
Created attachment 1073979 [details] heat-logs2.tar.gz
Created attachment 1073980 [details] messages.tar.gz
This is potentially fixed by the heat DB overflow error that was patched this week (bz1261512). Would you mind retesting this bug after the next puddle is generated? (It should be ready EOD today.)
*** Bug 1262124 has been marked as a duplicate of this bug. ***
This also seems to break with 3 controllers, 1 compute and zero ceph nodes. The deployment times out with the following deployment log message (also to be found in heat-engine.log):

resources.ControllerNodesPostDeployment.Property error: resources[1].properties.server: Expecting to find username or userId in passwordCredentials - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error.
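For context, that keystone error means a token request arrived whose passwordCredentials object carried neither a username nor a userId. A minimal sketch of the shape keystone v2 expects, with a hypothetical re-implementation of the check (all values are placeholders, not taken from this deployment):

```python
def validate_password_credentials(body):
    """Hypothetical version of the keystone v2 check: passwordCredentials
    must contain either 'username' or 'userId'."""
    creds = body.get("auth", {}).get("passwordCredentials", {})
    if "username" not in creds and "userId" not in creds:
        raise ValueError(
            "Expecting to find username or userId in passwordCredentials")
    return True

# Well-formed v2 token request body (placeholder values):
good = {"auth": {"tenantName": "admin",
                 "passwordCredentials": {"username": "admin",
                                         "password": "secret"}}}
# Malformed body that would trigger the error seen in the log:
bad = {"auth": {"passwordCredentials": {"password": "secret"}}}

validate_password_credentials(good)     # passes
try:
    validate_password_credentials(bad)  # raises ValueError
except ValueError as exc:
    print(exc)
```

So the error suggests heat sent keystone an auth payload with the username missing, rather than a problem with the nodes themselves.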
Tested with the most recent puddles on 2015-09-22 and 2015-09-23. So if the fix for the heat DB overflow error was included there, it is not the cause of this bug.
Adding to my previous comment: "$ heat stack-list" is unresponsive after the failed deployment.
Heat engine was unresponsive because of low available memory on the undercloud. I performed a keystone token flush and restarted the mariadb, keystone and heat services. That allowed me to use the system long enough to attempt another deployment, and heat-engine failed again. The relevant section of the log:

Sep 24 01:43:00 instack.localdomain kernel: Out of memory: Kill process 30969 (heat-engine) score 61 or sacrifice child
Sep 24 01:43:00 instack.localdomain kernel: Killed process 30969 (heat-engine) total-vm:633912kB, anon-rss:234132kB, file-rss:2052kB
Sep 24 01:43:00 instack.localdomain systemd[1]: openstack-heat-engine.service: main process exited, code=killed, status=9/KILL
Sep 24 01:43:00 instack.localdomain systemd[1]: Unit openstack-heat-engine.service entered failed state.

I suggest we increase the memory for the instack vm, add a swap disk [1], or both.

One of the things I noticed about the UI deployment is that it lacked an ntp server value in the service configuration. If you attempt a CLI deployment via the plugin without an ntp server, it stops and warns you that one is needed with HA. I didn't see whether anyone had tried adding that to the UI deployment steps yet; it's something to try once we can mitigate the low-memory issue.

[1] https://raymii.org/s/tutorials/KVM_add_disk_image_or_swap_image_to_virtual_machine_with_virsh.html
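To catch this condition before the OOM killer does, something along these lines could watch available memory on the undercloud. This is only a sketch: the 512 MB threshold is an arbitrary assumption, and it parses a sample string in the standard /proc/meminfo format so it stays self-contained.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def low_memory(meminfo_text, threshold_kb=512 * 1024):
    """Return True when MemAvailable (falling back to MemFree) is below
    the threshold -- the situation that got heat-engine OOM-killed."""
    info = parse_meminfo(meminfo_text)
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return available < threshold_kb

# Sample snapshot resembling a memory-starved undercloud vm:
sample = """MemTotal:        4046844 kB
MemFree:           82340 kB
MemAvailable:     120492 kB"""
print(low_memory(sample))  # True: well under the 512 MB threshold
```

On a live system the same functions would be fed the contents of /proc/meminfo.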
I removed and reinstalled all vms, giving the instack vm 8G of memory. Before doing that, I removed the rhos-release package on the host and reinstalled it: the current version is 0.72 (it was 0.69 before, on both the host and the instack vm). The overcloud deployment failed after ~1 hour:

StackValidationFailed: resources.ControllerNodesPostDeployment.resources.ControllerRingbuilderDeployment_Step3.resources[0]: Property error: [0].Properties.server: The server has either erred or is incapable of performing the requested operation. (HTTP 500)

Unlike before, the vm as well as heat and the UI were still responsive this time. The exception mentioned in bz1261512 can still be found in the heat-engine log. I'm not sure if this is related to the failure of ControllerRingbuilderDeployment_Step3 mentioned above.

$ grep DBAPIError /var/log/heat/heat-engine.log
2015-09-24 05:49:28.262 29357 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (_mysql_exceptions.DataError) (1406, "Data too long for column 'resource_properties' at row 1")
[...]
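For reference, MySQL error 1406 means the serialized value exceeded the column's maximum size. A hypothetical illustration of why a large resource-properties blob trips it, assuming the column is a MySQL TEXT column (64 KB limit); this is not heat's actual code:

```python
import json

# Maximum size of a MySQL TEXT column in bytes; an assumption made here
# for illustration of error 1406 ("Data too long for column").
MYSQL_TEXT_MAX = 65535

def fits_in_text_column(properties):
    """Return True if the JSON-serialized properties fit in a TEXT column."""
    payload = json.dumps(properties)
    return len(payload.encode("utf-8")) <= MYSQL_TEXT_MAX

small = {"server": "2beed1f4", "config": "x" * 100}
huge = {"blob": "x" * 70000}   # oversized payload, as in the log above
print(fits_in_text_column(small), fits_in_text_column(huge))  # True False
```

That would be consistent with the view that these DBAPIError lines are a separate, already-tracked issue (bz1261512) rather than this bug's cause.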
It's possible/likely those logs are a red herring. I believe that in bz1261512 the stack went to CREATE_FAILED immediately, rather than time out.
Yes, these exceptions appear in the log even on successful deployments. They can probably be ruled out as the cause of this bug.
On the same undercloud (no re-provisioning or new installation in between), I did 4 successive deployments:

1. FAILED: HA: 3 controllers, 1 compute, 1 ceph
2. SUCCESS: 1 controller, 1 compute
3. SUCCESS: HA: 3 controllers, 1 compute
4. SUCCESS: HA: 3 controllers, 1 compute, 1 ceph

Other than the nodes themselves, there were no changes between deployments (service config, roles, etc.).
First HA deployment attempt on a fresh undercloud installation succeeded (3 controllers, 1 compute). Undercloud memory: 8GB.

$ rhos-release -v
1.0.0
Note that the log in the first comment is also more or less expected: the error is raised because of a DB conflict (this is actually an improvement, because it means we are detecting a race and dealing with it). We appear to have a correct retry handler in that code (in fact we fixed a previous issue with it, which is not evident in the log). It is a little suspicious, though, because intermittently hanging forever is the kind of thing you would expect to see from an unresolved race condition.
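The retry pattern being described works roughly like the following sketch. This is plain Python standing in for the oslo_db wrapper visible in the traceback, not heat's actual code: a bounded loop that re-runs the DB operation whenever a conflict is signalled, and only surfaces the error once retries are exhausted.

```python
import time

class RetryRequest(Exception):
    """Hypothetical stand-in for oslo_db.exception.RetryRequest:
    tells the wrapper to re-run the whole DB transaction."""

def with_db_retry(func, max_retries=10, delay=0.0):
    """Sketch of a retry wrapper like the one in oslo_db/api.py."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except RetryRequest:
                if attempt == max_retries - 1:
                    raise  # exhausted: surface the conflict to the caller
                time.sleep(delay)  # real code backs off between attempts
    return wrapper

# Simulate a metadata push that hits a concurrent transaction twice,
# then succeeds -- the situation logged in the first comment:
attempts = []

def push_metadata(server_id):
    attempts.append(server_id)
    if len(attempts) < 3:
        raise RetryRequest("concurrent deployment transaction")
    return "metadata pushed for %s" % server_id

safe_push = with_db_retry(push_metadata)
print(safe_push("server-1"))  # succeeds on the third attempt
```

If such a handler were missing or broken, a conflict would either fail the stack immediately or leave it stuck, which is why an intermittent hang still points at a possible unresolved race.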
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days