| Summary: | Can't deploy a non-default plan | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Udi Kalifon <ukalifon> |
| Component: | openstack-tripleo-ui | Assignee: | Honza Pokorny <hpokorny> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ola Pavlenko <opavlenk> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 10.0 (Newton) | CC: | akrivoka, apannu, dtrainor, hpokorny, jjoyce, jpichon, jschluet, michele, sclewis, shardy, slinaber, tvignaud, ukalifon |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-07-06 12:44:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Could you provide the mistral environment information for both plans (especially the one that failed) as well, for comparison?

$ mistral environment-get overcloud     # for the default plan
$ mistral environment-get <plan-name>   # for the non-default one

I'm testing with a new plan now, but I know I have successfully deployed non-default plans in the past. Those were fairly simple deployments though, 2-3 nodes with mostly the default templates.

Right now I'm wondering if this may be related to bug 1395810, which the Mistral environment(s) would help confirm. Because of that other BZ, it's possible that the two deployments were not equivalent to start with (that bug includes a workaround, so this can be tried again).

I was able to get a non-default deployment to complete. However, my deployment specification was for three Controllers and one Compute, yet only one Controller and one Compute were deployed. All indications are that the plan was executed with the defaults (o-t-h-t/puppet/services/neutron-api.yaml, ControllerCount: 1) instead of what the plan in the UI was configured for. Comparing the Mistral environments between the Default plan and my Custom plan indicates that there is no ControllerCount nor NodeCount in parameter_defaults for my Custom plan.

Created attachment 1221313 [details]
mistral environment-get overcloud
Created attachment 1221314 [details]
mistral environment-get RHELOSP-18748
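For reference, one quick way to compare the two environments attached above is to dump each to a file and diff them. This is only a sketch: the plan names are taken from the attachment titles, and the raw client output format may differ between versions.

$ mistral environment-get overcloud > /tmp/env-default.txt      # default plan
$ mistral environment-get RHELOSP-18748 > /tmp/env-custom.txt   # non-default plan
$ diff -u /tmp/env-default.txt /tmp/env-custom.txt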
Just for kicks, I did a test where I unassigned all nodes and re-assigned them to roles, to see if that update would drop a ControllerCount or NodeCount into my non-default deployment plan. However, this had no effect. Previously, I had just re-purposed the existing node assignment from one plan to the other, without any explicit node reassignment.

FWIW this is the behaviour described in bug 1395810. If the cause of both bugs turns out to be the same we can mark one as a duplicate; however, because Dan's deployment was eventually successful while Udi's failed, I think there may also be something else going on here.

Created attachment 1221517 [details]
Mistral environment for the default plan
Attaching the output of the 2 environments. The only difference that I see between them is that my plan has "template": "overcloud.yaml", while the default plan has "root_template": "overcloud.yaml" instead. However, that doesn't seem able to explain why a deployment would hang and fail after many hours...
Created attachment 1221518 [details]
Mistral environment for the custom (though unmodified) plan
This is the attachment for the uploaded plan
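Regarding the "template" vs. "root_template" difference noted above, a quick way to check which key each plan's environment carries is sketched below; the plan names are taken from the earlier attachments and should be adjusted to match your own.

$ mistral environment-get overcloud | grep template
$ mistral environment-get RHELOSP-18748 | grep template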
Thank you for the information Udi! The ControllerCount, ComputeCount and CephStorageCount match in both cases, so it doesn't look related to the other bug after all. The only differences I can see are that the flavors for the roles aren't set (that can be done by editing each role using the pencil icon), and the NtpServer wasn't input either for the non-default plan. It may be worthwhile trying again with these set as well?

The counts match, but did you notice that the counters get integer values in the default plan, while in the uploaded plans they are strings (with quotes)? For example: "ControllerCount": 3, as opposed to "ControllerCount": "3". The NtpServer was also unset in the default plan when I first deployed successfully; it is only set now (when I did the mistral environment-get) because of another bug I was testing. I'll try it again anyway now, and also set the flavors as you pointed out.

I just successfully deployed a non-default, customised plan (overcloud SSL, 1 controller, 1 compute) in my upstream environment. I'm fairly sure this was working on OSP10 as well, but I am currently debugging separate hardware issues there; I'll report back when that is resolved and I can test again (unless someone else beats me to it). I think there is something else going on here, and that what is happening is not related to every non-default plan.

Udi, well spotted on the string vs. integer parameters. I'm not sure what kind of impact it may have; perhaps we can check by using the following commands and retrying the deployment?
$ openstack action execution run tripleo.parameters.update '{"container":"unmodified_from_directory", "parameters":{"ControllerCount":3}}'
$ openstack action execution run tripleo.parameters.update '{"container":"unmodified_from_directory", "parameters":{"ComputeCount":2}}'
$ openstack action execution run tripleo.parameters.update '{"container":"unmodified_from_directory", "parameters":{"CephStorageCount":1}}'
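If useful, a quick follow-up check to confirm the counts now come back as integers after running the updates above (a sketch; it assumes the plan name matches the "unmodified_from_directory" container used in the commands):

$ mistral environment-get unmodified_from_directory | grep -E '"(Controller|Compute|CephStorage)Count"'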
I just noticed I have a deployed environment where the CephStorageCount was a string yet was picked up fine, so I don't think it has much impact... Could you try some of the following to try and debug this?

- Try the non-default plan again but without pacemaker, since the errors seem related?
- Try to delete the overcloud plan, and create a plan called 'overcloud' with your custom templates (a command sketch follows below, after the attachments). Does the failure still happen then?

Thank you.

I just got a successful deployment: virtual environment with 1 controller, 1 compute and 1 ceph node, and a non-default plan.

[stack@instack ~]$ cat /var/lib/rhos-release/latest-installed
10 -p 2016-11-15.2

I just tried a non-HA deployment on bare metal and got the same failure.

Udi's failure list above seems to indicate cluster-related issues. I just ran an SSL-enabled, IPv6 network deployment (3 controllers, 1 compute, virt) and had essentially the same issue. I also just had a look at another one of Udi's deployments that he kicked off before leaving for the day, and that, too, had cluster-related issues. The line in question seems to be:

Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: /sbin/pcs cluster auth unmodified_from_directory-controller-0 unmodified_from_directory-controller-1 unmodified_from_directory-controller-2 -u hacluster -p 1rRZ7Pva5hFKoM0c --force returned 1 instead of one of [0]

Same issue on rdo-list: https://www.redhat.com/archives/rdo-list/2016-September/msg00034.html

It sounds like this is triggered by specifying more than one controller in the plan. Somehow, the additional controllers aren't created (or are created later), and the clustering auth code runs when only a single node is available. The heat process dies because it can't find the other controller nodes in the cluster.

I somehow got an HA deployment, 3 controllers + 1 compute, to complete successfully with the Nov 15 puddle, in a virt environment, with a non-default plan. I created the plan via the CLI (openstack overcloud plan create newplan, which uses the templates in /usr/share) but set the parameters and started the deployment via the UI. No complicated environments, only pacemaker. I'll attach my Mistral environment and the pcs status info from the controllers; let me know if any other information would be helpful before I recycle the deployment (probably by morning).

Created attachment 1222765 [details]
Mistral environment - 3+1 pacemaker non-default (successful)
Created attachment 1222766 [details]
Controller pcs status - 3+1 pacemaker non-default (successful)
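Regarding the earlier suggestion to recreate the 'overcloud' plan from the custom templates, a minimal command sketch. The directory name ~/custom-templates is a placeholder, and the --templates option of the plan create command is assumed to be available on this release.

$ openstack overcloud plan delete overcloud
$ openstack overcloud plan create --templates ~/custom-templates overcloud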
(In reply to Honza Pokorny from comment #21)
> It sounds like this is triggered by specifying more than one controller in
> the plan. Somehow, the additional controllers aren't created (or are created
> later), and the clustering auth code runs when only a single node is
> available. The heat process dies because it can't find the other controller
> nodes in the cluster.

This seems correct. The lp bug associated with this bz indicates that the CloudDomain property is not defined in a non-default plan, causing the nodes to be unable to resolve each other's hostnames. The bootstrap Controller node can see itself locally via information contained in /etc/hosts, so it sees fewer failures than the non-bootstrapped nodes. Testing a deployment right now with CloudDomain defined to confirm this.

I just managed to deploy a 3-controller, 1-compute plan with no other settings. The plan was created using "openstack overcloud plan create <name>". I created the plan via the CLI but triggered it via the GUI after assigning the nodes. This leads me to believe that the problem lies with the template upload logic, too. Summary:

* 1+ controllers
* Uploaded via GUI

So I took a look at the env at comment 27. The problem seems to be the failure of the following command:

# journalctl -u os-collect-config | grep "pcs cluster auth"
Nov 21 18:48:14 puma01-plan-controller-0.localdomain os-collect-config[10125]: Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Failed to call refresh: /sbin/pcs cluster auth puma01_plan-controller-0 -u hacluster -p i3kdKVeXzZzu4WYQ --force returned 1 instead of one of [0]
Nov 21 18:48:14 puma01-plan-controller-0.localdomain os-collect-config[10125]: Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: /sbin/pcs cluster auth puma01_plan-controller-0 -u hacluster -p i3kdKVeXzZzu4WYQ --force returned 1 instead of one of [0]

Now the reason the above is failing is the following:

[root@puma01-plan-controller-0 hieradata]# hostname
puma01-plan-controller-0.localdomain
[root@puma01-plan-controller-0 hieradata]# grep puma01-plan-controller-0 /etc/hosts
[root@puma01-plan-controller-0 hieradata]#

In /etc/hosts we have a slightly different hostname format (underscore vs. dash):

172.16.0.29 puma01_plan-controller-0.localdomain puma01_plan-controller-0

As a matter of fact, if I point pcs to localhost it all works:

/sbin/pcs cluster auth localhost -u hacluster -p i3kdKVeXzZzu4WYQ --force
localhost: Authorized

So you need to fix the hostname vs. what's in /etc/hosts.

I can reproduce the above pcs issue even when the plan name only contains alphanumeric characters.

Does https://review.openstack.org/#/c/400966/ fix this?

I managed to successfully deploy an SSL overcloud: 3 controllers, 1 compute. It's the same setup as before, with the exception of IPv6 network isolation. There must be something wrong with the way the network is set up, and the above cluster issues are the result. I'm trying to deploy an IPv4 network isolation plan to isolate (welp) the issue further. My first attempt at IPv4 ended up as a frozen deployment.

Workarounds used for the above:

* Manual node count setting
* Manual setting of CloudDomain to "localdomain"
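For reference, a sketch of how the CloudDomain workaround mentioned above could be applied through the same tripleo.parameters.update action shown earlier in this bug; the container/plan name is a placeholder to replace with your own.

$ openstack action execution run tripleo.parameters.update '{"container":"<plan-name>", "parameters":{"CloudDomain":"localdomain"}}'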
I ran a few more tests; here are additional data points. When I write 3+1+1, I mean "3 controllers + 1 Compute + 1 Ceph" with the Pacemaker and Storage environments enabled, as well as the workarounds for bug 1395810 (manually set node counts and flavors) and bug 1397570 (manually set CloudDomain). Every test was done with non-default plans.

* On his machine, Udi uploaded the tarball with his failing plan and deployed 1 Controller + 1 Compute. The deployment completed successfully.
* On Udi's machine, I created a new plan with Udi's problem tarball and tried a 3+1+1 deployment. It failed.
* On Udi's machine, I created a fresh new plan from the CLI (openstack overcloud plan create <new name>) and launched a 3+1+1 deployment. It worked.
* On Udi's machine, I made a tarball of the templates under /usr/share/* etc., uploaded them via the UI, and launched a 3+1+1 deployment. It worked.
* On my (virt) setup, I uploaded Udi's tarball and started a 3+1+1 deployment. It worked.
* On Udi's machine, I created yet another new plan using his tarball and launched a 3+1+1 deployment. This time, it worked.

Whatever is happening here, it is not happening with "every non-default plan" like the subject suggests. I think there is something else going on, and we should keep the bug open to get to the bottom of it, but I don't think the issue is that generalised.

I've run a number of deployments with non-default plans, and never hit the pcs issue.

Over the last week, I have tried to assess the status of this bug. I have run probably ten deployments, and have rebuilt the undercloud several times. The only deployment I was able to finish successfully is the one where, right after the undercloud is installed, I log in and hit "Deploy". The deployment fails under any other circumstance:

* Default plan, 1 compute, 1 controller
* Default plan, 1 compute, 1 controller, network isolation
* Default plan, 1 compute, 3 controllers
* Default plan, 1 compute, 3 controllers, 1 ceph
* Non-default plan, 1 compute, 3 controllers
* Non-default plan, 1 compute, 3 controllers, 1 ceph

The most common error is of the following type:

Resource CREATE failed: ResourceInError: resources.Compute.resources[0].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"

I checked the ComputeCount and ControllerCount settings in the Mistral environment. I tried setting the "CloudDomain" manually. I tried explicitly setting the NtpServer. This is using the latest stable undercloud, from packages.

This bug hasn't happened in a very long time. There is also no need to include the pacemaker environment file.
Created attachment 1221252 [details]
output of failures list

Description of problem:
The default plan is the one that is loaded in the UI by default when the undercloud is installed. It is created from the templates in /usr/share/openstack-tripleo-heat-templates. When I upload a new plan to the system, which is just the same templates that were used to make the default plan (unmodified), my plan fails to deploy while the default plan can be deployed successfully.

How reproducible:
100%

Steps to Reproduce:
1. Create a tgz of the templates: tar -czf default_templates.tgz -C /usr/share/openstack-tripleo-heat-templates .
2. Upload the plan to the GUI.
3. Deploy the plan. I deploy 3 controllers + 2 computes + 1 ceph on bare metal, and I use the "pacemaker" and "storage environment" environments.

Actual results:
Deployment hangs and fails after several hours. See attachment for the output of "failures list".

Expected results:
The same deployment succeeds when using the default plan, so it should succeed with the plan I uploaded as well, since they're identical.
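As a sanity check before uploading, a short sketch for confirming that the tarball contents really are identical to the default templates (assumes the tarball was created as in step 1 above; /tmp/plan-check is just a scratch directory):

$ mkdir /tmp/plan-check && tar -xzf default_templates.tgz -C /tmp/plan-check
$ diff -r /tmp/plan-check /usr/share/openstack-tripleo-heat-templates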