Bug 1263651
| Summary: | DB error while deploying HA+ceph via undercloud UI | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ola Pavlenko <opavlenk> |
| Component: | openstack-tuskar-ui | Assignee: | Florian Fuchs <flfuchs> |
| Status: | CLOSED WORKSFORME | QA Contact: | yeylon <yeylon> |
| Severity: | urgent | Priority: | high |
| Version: | unspecified | CC: | akrivoka, calfonso, flfuchs, kbasil, mburns, opavlenk, rhel-osp-director-maint, sasha, sbaker, srevivo, zbitter |
| Target Milestone: | y1 | Keywords: | Triaged |
| Target Release: | 7.0 (Kilo) | Hardware: | Unspecified |
| OS: | Unspecified | Doc Type: | Bug Fix |
| Type: | Bug | Last Closed: | 2015-09-25 09:21:03 UTC |
| Bug Blocks: | 1250250 | | |
Description
Ola Pavlenko
2015-09-16 11:08:01 UTC
Created attachment 1073977 [details]: heat-logs1.tar.gz
Created attachment 1073979 [details]: heat-logs2.tar.gz
Created attachment 1073980 [details]: messages.tar.gz
This is potentially fixed by the heat DB overflow error that was patched this week (bz1261512). Would you mind retesting after the next puddle is generated? (Should be EOD today.)

*** Bug 1262124 has been marked as a duplicate of this bug. ***

This also seems to break with 3 controllers, 1 compute, zero ceph. The deployment times out with the following deployment log message (also to be found in heat-engine.log):

    resources.ControllerNodesPostDeployment.Property error: resources[1].properties.server: Expecting to find username or userId in passwordCredentials - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error.

Tested with the most recent puddles on 2015-09-22 and 2015-09-23, so if the fix for the heat DB overflow error was included there, it is not the cause of this bug.

Adding to my previous comment: `$ heat stack-list` is unresponsive after the failed deployment.

Heat engine was unresponsive because of low available memory on the undercloud. I performed a keystone token flush and restarted the mariadb, keystone and heat services. That allowed me to use the system long enough to attempt another deployment, but heat-engine failed again. The relevant section of the log is:

    Sep 24 01:43:00 instack.localdomain kernel: Out of memory: Kill process 30969 (heat-engine) score 61 or sacrifice child
    Sep 24 01:43:00 instack.localdomain kernel: Killed process 30969 (heat-engine) total-vm:633912kB, anon-rss:234132kB, file-rss:2052kB
    Sep 24 01:43:00 instack.localdomain systemd[1]: openstack-heat-engine.service: main process exited, code=killed, status=9/KILL
    Sep 24 01:43:00 instack.localdomain systemd[1]: Unit openstack-heat-engine.service entered failed state.

I suggest we increase the memory for the instack vm, add a swap disk [1], or both.

One of the things I noticed about the UI deployment is that it lacked an ntp server value in the service configuration.
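As a rough sketch, the recovery steps described above (token flush, service restarts) plus a stopgap swap file might look like this on the undercloud. Unit and command names are assumed for a RHEL 7 / OSP 7 instack vm; the swap size and path are illustrative, and a dedicated swap disk per [1] would be the more durable option:

    $ sudo keystone-manage token_flush
    $ sudo systemctl restart mariadb openstack-keystone openstack-heat-engine
    $ # stopgap swap file until the instack vm gets more memory or a real swap disk:
    $ sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
    $ sudo chmod 600 /swapfile
    $ sudo mkswap /swapfile
    $ sudo swapon /swapfile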
If you attempt a CLI deployment via the plugin without an ntp server, it will stop and warn you that one is needed with HA. I didn't see whether anyone had tried adding that to the UI deployment steps yet, and I think it's something to try once we can mitigate the low memory issue.

[1] https://raymii.org/s/tutorials/KVM_add_disk_image_or_swap_image_to_virtual_machine_with_virsh.html

I removed and reinstalled all vms, giving the instack vm 8G of memory. Before doing that, I removed the rhos-release package on the host and reinstalled it again: the current version is 0.72 (it was 0.69 before, on both the host and the instack vm). The overcloud deployment failed after ~1hr:

    StackValidationFailed: resources.ControllerNodesPostDeployment.resources.ControllerRingbuilderDeployment_Step3.resources[0]: Property error: [0].Properties.server: The server has either erred or is incapable of performing the requested operation. (HTTP 500)

Unlike before, the vm as well as heat and the UI were still responsive this time.

The exception mentioned in bz1261512 can still be found in the heat-engine log. I am not sure whether it is related to the failure of ControllerRingbuilderDeployment_Step3 mentioned above:

    $ grep DBAPIError /var/log/heat/heat-engine.log
    2015-09-24 05:49:28.262 29357 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (_mysql_exceptions.DataError) (1406, "Data too long for column 'resource_properties' at row 1")
    [...]

It's possible/likely those logs are a red herring. I believe that in bz1261512 the stack went to CREATE_FAILED immediately, rather than timing out.

Yes, these exceptions appear in the log even on successful deployments. They can probably be ruled out as the cause of this bug.

On the same undercloud (no re-provisioning or new installation in between), I did 4 successive deployments:

1. FAILED: HA: 3 controllers, 1 compute, 1 ceph
2. SUCCESS: 1 controller, 1 compute
3. SUCCESS: HA: 3 controllers, 1 compute
4. SUCCESS: HA: 3 controllers, 1 compute, 1 ceph

Other than the nodes themselves, there were no changes between deployments (service config, roles etc.).

First HA deployment attempt on a fresh undercloud installation succeeded (3 controllers, 1 compute). Undercloud memory: 8GB.

    $ rhos-release -v
    1.0.0

Note that the log in the first comment is also sort of expected: an error is raised because of a DB conflict. This is actually an improvement, because it means we're detecting a race and dealing with it. We appear to have a correct retry handler in that code (and in fact we fixed a previous issue with that, which is not evident in the log). It is a little suspicious, though, because intermittently hanging forever is the kind of thing you would expect to see from an unresolved race condition.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
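For reference, the CLI equivalent of the failing deployment, with the NTP server passed explicitly as discussed above, might look something like the following. The flag names are assumed from the OSP 7 director CLI; the NTP host and the scale counts (mirroring the HA+ceph topology in this bug) are illustrative:

    $ openstack overcloud deploy --templates \
        --control-scale 3 --compute-scale 1 --ceph-storage-scale 1 \
        --ntp-server clock.example.com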