Bug 1263651 - DB error while deploying HA+ceph via undercloud UI [NEEDINFO]
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tuskar-ui
Hardware: Unspecified Unspecified
Priority: high
Severity: urgent
Target Milestone: y1
Target Release: 7.0 (Kilo)
Assigned To: Florian Fuchs
Keywords: Triaged
Duplicates: 1262124
Depends On:
Blocks: 1250250
Reported: 2015-09-16 07:08 EDT by Ola Pavlenko
Modified: 2016-04-18 02:54 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2015-09-25 05:21:03 EDT
Type: Bug
calfonso: needinfo? (opavlenk)

Attachments
heat-logs1.tar.gz (16.88 MB, application/x-gzip)
2015-09-16 07:14 EDT, Ola Pavlenko
heat-logs2.tar.gz (2.74 MB, application/x-gzip)
2015-09-16 07:16 EDT, Ola Pavlenko
messages.tar.gz (11.27 MB, application/x-gzip)
2015-09-16 07:18 EDT, Ola Pavlenko

Description Ola Pavlenko 2015-09-16 07:08:01 EDT
Description of problem:
Tried to deploy HA+ceph (3 controllers, 1 compute and 1 ceph) via the undercloud UI.
More than an hour after the deployment started, it failed with: CREATE_FAILED, Create time out.

In heat-engine:

2015-09-16 05:50:02.029 29559 ERROR oslo_db.api [req-83818354-80bb-4c5d-b6c2-96bb69f5e006 admin admin] DB error.
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api Traceback (most recent call last):
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 131, in wrapper
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api     return f(*args, **kwargs)
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api   File "/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py", line 89, in _push_metadata_software_deployments
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api     exception.DeploymentConcurrentTransaction(server=server_id))
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api RetryRequest
2015-09-16 05:50:02.029 29559 TRACE oslo_db.api 

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Access the undercloud UI, register 5 nodes, create 1 flavor (the one suggested after the nodes were discovered), upload images, and assign images and the flavor to the roles. Set 3 nodes for controller, 1 for compute and 1 for ceph.
2. In Simplify configuration, change the snmp password to 'password'.
3. Verify that the deployment type is "Virtualized".
4. Set the Service net map for all resources to the following JSON:
{"GlanceRegistryNetwork": "internal_api", "NeutronTenantNetwork": "tenant", "NovaApiNetwork": "internal_api", "CeilometerApiNetwork": "internal_api", "CephStorageHostnameResolveNetwork": "storage", "SwiftMgmtNetwork": "storage_mgmt", "MemcachedNetwork": "internal_api", "RabbitMqNetwork": "internal_api", "KeystoneAdminApiNetwork": "internal_api", "SwiftProxyNetwork": "storage", "CinderApiNetwork": "internal_api", "CephClusterNetwork": "storage_mgmt", "NovaMetadataNetwork": "internal_api", "RedisNetwork": "internal_api", "NeutronApiNetwork": "internal_api", "GlanceApiNetwork": "storage", "ObjectStorageHostnameResolveNetwork": "internal_api", "KeystonePublicApiNetwork": "internal_api", "HeatApiNetwork": "internal_api", "NovaVncProxyNetwork": "internal_api", "ControllerHostnameResolveNetwork": "internal_api", "MysqlNetwork": "internal_api", "BlockStorageHostnameResolveNetwork": "internal_api", "ComputeHostnameResolveNetwork": "internal_api", "CephPublicNetwork": "storage", "MongoDbNetwork": "internal_api", "HorizonNetwork": "internal_api", "CinderIscsiNetwork": "storage"}
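
For reference, a minimal sketch of how the map in step 4 could be sanity-checked before pasting it into the UI. This is not part of the reported steps: the function name `check_service_net_map` is hypothetical, and the set of allowed network names is an assumption taken from the values that appear in the map above.

```python
import json

# Network names that appear as values in the map above.
# Assumption: this is the complete set for this deployment.
ALLOWED_NETWORKS = {"internal_api", "tenant", "storage", "storage_mgmt"}


def check_service_net_map(text):
    """Parse a service net map JSON string and return a list of problems found."""
    try:
        net_map = json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return ["not valid JSON: %s" % exc]
    if not isinstance(net_map, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    for service, network in sorted(net_map.items()):
        if not isinstance(network, str) or network not in ALLOWED_NETWORKS:
            problems.append("%s: unknown network %r" % (service, network))
    return problems


if __name__ == "__main__":
    sample = '{"MysqlNetwork": "internal_api", "CephClusterNetwork": "storage_mgmt"}'
    print(check_service_net_map(sample) or "ok")
```

An empty list means the text parsed as a JSON object and every value is one of the assumed network names.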

Actual results:
Deployment failed; it timed out after ~1.5 hours.

Expected results:
Deployment succeeds.

Additional info:

$ heat resource-show overcloud ComputeNodesPostDeployment 
| Property               | Value                                |
| attributes             | {}                                   |
| description            |                                      |
| links                  | (self)                               |
|                        | (stack)                              |
|                        | (nested)                             |
| logical_resource_id    | ComputeNodesPostDeployment           |
| physical_resource_id   | 2beed1f4-ab2a-4c9c-a17a-092bf76daf59 |
| required_by            |                                      |
| resource_name          | ComputeNodesPostDeployment           |
| resource_status        | CREATE_FAILED                        |
| resource_status_reason | CREATE aborted                       |
| resource_type          | OS::TripleO::ComputePostDeployment   |
| updated_time           | 2015-09-16T09:22:45Z                 |

Heat logs and messages log attached.
Comment 3 Ola Pavlenko 2015-09-16 07:14:17 EDT
Created attachment 1073977
Comment 4 Ola Pavlenko 2015-09-16 07:16:47 EDT
Created attachment 1073979
Comment 5 Ola Pavlenko 2015-09-16 07:18:21 EDT
Created attachment 1073980
Comment 7 chris alfonso 2015-09-17 12:36:59 EDT
This is potentially fixed by the patch for the heat db overflow error that landed this week (bz1261512).

Would you mind retesting once the next puddle is generated? (should be EOD today)
Comment 8 Ana Krivokapic 2015-09-23 12:09:07 EDT
*** Bug 1262124 has been marked as a duplicate of this bug. ***
Comment 9 Florian Fuchs 2015-09-23 12:46:13 EDT
This also seems to break with 3 controllers, 1 compute, zero ceph. 

The deployment times out with the deployment log message:

resources.ControllerNodesPostDeployment.Property error: resources[1].properties.server: Expecting to find username or userId in passwordCredentials - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error.

(Also to be found in the heat-engine.log)
Comment 10 Florian Fuchs 2015-09-23 12:54:28 EDT
Tested with the most recent puddles on 2015-09-22 and 2015-09-23. So if the fix for the heat db overflow error was included there, that error is not the cause of this bug.
Comment 11 Florian Fuchs 2015-09-23 12:58:07 EDT
Adding to my previous comment: 

"$ heat stack-list" is unresponsive after the failed deployment.
Comment 12 Ryan Brady 2015-09-24 02:09:52 EDT
Heat engine was unresponsive because of low available memory on the undercloud.  Performed keystone token flush, restarted mariadb, keystone and heat services.  It allowed me to use the system long enough to attempt another deployment and heat-engine failed again.  The relevant section of the log is:

Sep 24 01:43:00 instack.localdomain kernel: Out of memory: Kill process 30969 (heat-engine) score 61 or sacrifice child
Sep 24 01:43:00 instack.localdomain kernel: Killed process 30969 (heat-engine) total-vm:633912kB, anon-rss:234132kB, file-rss:2052kB
Sep 24 01:43:00 instack.localdomain systemd[1]: openstack-heat-engine.service: main process exited, code=killed, status=9/KILL
Sep 24 01:43:00 instack.localdomain systemd[1]: Unit openstack-heat-engine.service entered failed state.
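
For anyone checking their own undercloud for the same symptom, here is a minimal sketch that pulls OOM-kill records out of a messages log. The regular expression is written against the "Killed process" line format shown in the excerpt above; the function name `find_oom_kills` is hypothetical, and other kernel versions may format this line differently.

```python
import re

# Matches the kernel "Killed process" line format shown in the excerpt above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def find_oom_kills(log_text):
    """Return one dict per OOM-killed process found in log_text."""
    kills = []
    for line in log_text.splitlines():
        match = KILLED_RE.search(line)
        if match:
            kills.append({
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "total_vm_kb": int(match.group("total_vm")),
                "anon_rss_kb": int(match.group("anon_rss")),
            })
    return kills


if __name__ == "__main__":
    sample = (
        "Sep 24 01:43:00 instack.localdomain kernel: Killed process 30969 "
        "(heat-engine) total-vm:633912kB, anon-rss:234132kB, file-rss:2052kB"
    )
    for kill in find_oom_kills(sample):
        print(kill)
```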

I suggest we increase the memory for the instack vm, add a swap disk[1], or both.

One of the things I noticed about the UI deployment is that it lacked an ntp server value in the service configuration.  If you attempt a CLI deployment via the plugin without an ntp server it will stop and warn you that it's needed with HA.  I didn't see if anyone had tried adding that to the UI deployment steps yet and think it's something to try if we can mitigate the low memory issue.

[1] https://raymii.org/s/tutorials/KVM_add_disk_image_or_swap_image_to_virtual_machine_with_virsh.html
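
A rough sketch of a pre-deployment check along these lines, parsing `/proc/meminfo`-style text for total memory and swap. The function names and the threshold used in the example are assumptions for illustration, not documented requirements.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_warnings(meminfo, min_total_kb):
    """Return warnings if total memory is below min_total_kb or no swap exists."""
    warnings = []
    total = meminfo.get("MemTotal", 0)
    if total < min_total_kb:
        warnings.append("MemTotal %d kB is below %d kB" % (total, min_total_kb))
    if meminfo.get("SwapTotal", 0) == 0:
        warnings.append("no swap configured")
    return warnings


if __name__ == "__main__":
    # Sample text; on a real host, read the contents of /proc/meminfo instead.
    sample = "MemTotal:        4046820 kB\nSwapTotal:             0 kB\n"
    # 6 GiB is an arbitrary illustrative threshold.
    for warning in memory_warnings(parse_meminfo(sample), 6 * 1024 * 1024):
        print(warning)
```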
Comment 13 Florian Fuchs 2015-09-24 07:18:00 EDT
I removed and reinstalled all vms, giving the instack vm 8G of memory. 

Before doing that, I removed the rhos-release package on the host and reinstalled it. The current version is 0.72 (it was 0.69 before on both the host and the instack vm).

The overcloud deployment failed after ~1hr: 

StackValidationFailed: resources.ControllerNodesPostDeployment.resources.ControllerRingbuilderDeployment_Step3.resources[0]: Property error: [0].Properties.server: The server has either erred or is incapable of performing the requested operation. (HTTP 500)

Unlike before, the vm as well as heat and the ui were still responsive this time. 

The exception mentioned in bz1261512 can still be found in the heat-engine log. Not sure if this is related to the failure of the ControllerRingbuilderDeployment_Step3 mentioned above. 

$ grep DBAPIError /var/log/heat/heat-engine.log

2015-09-24 05:49:28.262 29357 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (_mysql_exceptions.DataError) (1406, "Data too long for column 'resource_properties' at row 1") [...]
Comment 14 Zane Bitter 2015-09-24 11:48:35 EDT
It's possible/likely those logs are a red herring. I believe that in bz1261512 the stack went to CREATE_FAILED immediately, rather than timing out.
Comment 15 Florian Fuchs 2015-09-24 12:11:52 EDT
Yes, these exceptions appear in the log even on successful deployments. They can probably be ruled out as the cause of this bug.
Comment 16 Florian Fuchs 2015-09-24 13:55:58 EDT
On the same undercloud (no re-provisioning or new installation in between), I did 4 successive deployments:

1. FAILED: HA: 3 controllers, 1 compute, 1 ceph
2. SUCCESS: 1 controller, 1 compute
3. SUCCESS: HA: 3 controllers, 1 compute
4. SUCCESS: HA: 3 controllers, 1 compute, 1 ceph

Other than the nodes themselves there were no changes between deployments (service config, roles etc.).
Comment 17 Florian Fuchs 2015-09-25 05:21:03 EDT
First HA deployment attempt on fresh undercloud installation succeeded. (3 controllers, 1 compute.)

Undercloud memory: 8GB

$ rhos-release -v
1.0.0
Comment 18 Zane Bitter 2015-09-25 18:28:32 EDT
Note that the log in the first comment is also sort-of expected - an error is raised because of a DB conflict (this is actually an improvement, because it means that we're detecting a race and dealing with it). We appear to have a correct retry handler in that code (and in fact we fixed a previous issue with that, which is not evident in the log). It is a little suspicious though, because intermittently hanging forever is the kind of thing you would expect to see from an unresolved race condition.
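
To illustrate the pattern described above (detect a conflicting write, then retry from the read), here is a minimal self-contained sketch. It is not Heat's code: `Row`, `ConcurrentUpdate`, `update_with_retry` and the `on_read` hook are all hypothetical names, and an in-memory version counter stands in for the database.

```python
class ConcurrentUpdate(Exception):
    """Raised when another writer changed the row first (hypothetical name)."""


class Row(object):
    """In-memory stand-in for a DB row guarded by a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def compare_and_set(self, expected_version, new_value):
        # Write only if nobody else has written since we read.
        if self.version != expected_version:
            raise ConcurrentUpdate()
        self.value = new_value
        self.version += 1


def update_with_retry(row, compute_new_value, max_retries=5, on_read=None):
    """Read, compute, write back; restart from the read if a conflict is detected."""
    for _attempt in range(max_retries):
        version, value = row.version, row.value
        if on_read is not None:
            on_read(row)  # hook that lets a caller simulate a competing writer
        try:
            row.compare_and_set(version, compute_new_value(value))
            return row.value
        except ConcurrentUpdate:
            continue  # someone else won the race; re-read and try again
    raise ConcurrentUpdate("gave up after %d attempts" % max_retries)


if __name__ == "__main__":
    row = Row(["deploy-a"])
    interfered = []

    def competing_writer(r):
        # Simulate exactly one concurrent update between our read and our write.
        if not interfered:
            interfered.append(True)
            r.compare_and_set(r.version, r.value + ["deploy-b"])

    print(update_with_retry(row, lambda v: v + ["deploy-c"], on_read=competing_writer))
```

Bounding the number of attempts in this sketch means a persistent conflict surfaces as an error instead of an indefinite wait.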
