Bug 1395755

Summary: Can't deploy a non-default plan
Product: Red Hat OpenStack
Component: openstack-tripleo-ui
Version: 10.0 (Newton)
Target Milestone: ---
Target Release: 10.0 (Newton)
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Reporter: Udi Kalifon <ukalifon>
Assignee: Honza Pokorny <hpokorny>
QA Contact: Ola Pavlenko <opavlenk>
CC: akrivoka, apannu, dtrainor, hpokorny, jjoyce, jpichon, jschluet, michele, sclewis, shardy, slinaber, tvignaud, ukalifon
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-07-06 12:44:04 UTC
Attachments:
* output of failures list
* mistral environment-get overcloud
* mistral environment-get RHELOSP-18748
* Mistral environment for the default plan
* Mistral environment for the custom (though unmodified) plan
* Mistral environment - 3+1 pacemaker non-default (successful)
* Controller pcs status - 3+1 pacemaker non-default (successful)

Description Udi Kalifon 2016-11-16 15:31:03 UTC
Created attachment 1221252 [details]
output of failures list

Description of problem:
The default plan is the one that is loaded in the UI by default when the undercloud is installed. It is created from the templates in /usr/share/openstack-tripleo-heat-templates. When I upload a new plan to the system, built from the very same templates that were used to make the default plan (unmodified), my plan fails to deploy while the default plan deploys successfully.


How reproducible:
100%


Steps to Reproduce:
1. Create a tgz of the templates: tar -czf default_templates.tgz -C /usr/share/openstack-tripleo-heat-templates . (see the sketch below)
2. Upload the plan to the GUI
3. Deploy the plan. I deploy 3 controllers + 2 computes + 1 ceph on bare metal, and I use the "pacemaker" and "storage environment" environments.
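
A minimal pre-upload sanity check (a sketch; the tarball name follows step 1) is to list the archive and confirm the templates sit at its root:

$ tar -tzf default_templates.tgz | head    # expect entries like ./overcloud.yaml, ./environments/, ./puppet/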


Actual results:
Deployment hangs and fails after several hours. See attachment for the output of "failures list".


Expected results:
The same deployment succeeds when using the default plan, so it should succeed with the plan I uploaded as well since they're identical.

Comment 1 Julie Pichon 2016-11-16 16:12:11 UTC
Could you provide the mistral environment information for both plans (especially the one that failed) as well, for comparison?

$ mistral environment-get overcloud # for the default plan
$ mistral environment-get <plan-name> # for the non-default one

I'm testing with a new plan now, but I know I have successfully deployed non-default plans in the past. These were fairly simple deployments though, 2-3 nodes with mostly the default templates.
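
One hedged way to spot any drift between the two (the output file names here are illustrative) is to capture both environments and diff them:

$ mistral environment-get overcloud > default_env.txt
$ mistral environment-get <plan-name> > custom_env.txt
$ diff -u default_env.txt custom_env.txt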

Comment 2 Julie Pichon 2016-11-16 18:09:44 UTC
Right now I'm wondering if this may be related to bug 1395810, which the Mistral environment(s) would help confirm. I think because of that other BZ, it's possible that the two deployments were not equivalent to start with (the bug includes a workaround so that can be tried again).

Comment 3 Dan Trainor 2016-11-16 18:10:04 UTC
I was able to get a non-default deployment to complete successfully.  My deployment specification called for three Controllers and one Compute; however, only one Controller and one Compute were deployed.  All indications are that the plan was executed with the defaults (o-t-h-t/puppet/services/neutron-api.yaml, ControllerCount: 1) instead of what the plan was configured for in the UI.

Comment 4 Dan Trainor 2016-11-16 18:17:48 UTC
Comparing the Mistral environments of the Default plan and my Custom plan indicates that there are no ControllerCount or NodeCount parameter_defaults for my Custom plan.

Comment 5 Dan Trainor 2016-11-16 18:18:38 UTC
Created attachment 1221313 [details]
mistral environment-get overcloud

Comment 6 Dan Trainor 2016-11-16 18:19:07 UTC
Created attachment 1221314 [details]
mistral environment-get RHELOSP-18748

Comment 7 Dan Trainor 2016-11-16 18:23:17 UTC
Just for kicks, I did a test where I unassigned all nodes and re-assigned them to roles, to see if that update would drop a ControllerCount or NodeCount into my non-default deployment plan.  However, this had no effect.

Previously, I had just re-purposed the existing Node assignment from one plan to the other, without any explicit node reassignment.

Comment 8 Julie Pichon 2016-11-16 18:25:05 UTC
FWIW this is the behaviour described in bug 1395810, if the cause of both bugs turns out to be the same we can mark one as duplicate - however because Dan's deployment was eventually successful while Udi's failed, I think there may also be something else going on here.

Comment 9 Udi Kalifon 2016-11-17 09:45:39 UTC
Created attachment 1221517 [details]
Mistral environment for the default plan

Attaching the output for the two environments. The only difference I see between them is that my plan has "template": "overcloud.yaml" while the default plan has "root_template": "overcloud.yaml" instead. However, that doesn't seem to explain why a deployment would hang and fail after many hours...

Comment 10 Udi Kalifon 2016-11-17 09:46:50 UTC
Created attachment 1221518 [details]
Mistral environment for the custom (though unmodified) plan

This is the attachment for the uploaded plan.

Comment 11 Julie Pichon 2016-11-17 09:53:57 UTC
Thank you for the information, Udi! The ControllerCount, ComputeCount and CephStorageCount match in both cases, so it doesn't look related to the other bug after all.

The only differences I can see are that the flavors for the roles aren't set (that can be done by editing each role using the pencil icon), and that the NtpServer wasn't set for the non-default plan either. It may be worthwhile trying again with both of these set as well?

Comment 12 Udi Kalifon 2016-11-17 10:00:57 UTC
The counts match, but did you notice that the counts are integers in the default plan and strings (with quotes) in the uploaded plans? For example: "ControllerCount": 3, as opposed to "ControllerCount": "3".

The NtpServer was also unset in the default plan when I first deployed successfully. It is only set now (when I did the mistral environment-get) because of another bug I was testing. I'll try it again anyway now, and also set the flavors as you pointed out.

Comment 13 Julie Pichon 2016-11-17 17:25:17 UTC
I just successfully deployed a non-default, customised plan (overcloud SSL, 1 controller, 1 compute) in my upstream environment. I'm fairly sure this was working on OSP10 as well but am currently debugging separate hardware issues there, I'll report back when that is resolved and I can test again (unless someone else beats me to it).

I think there is something else going on here, and that what is happening is not related to every non-default plan.

Comment 14 Julie Pichon 2016-11-17 18:24:31 UTC
Udi, well spotted on the string vs integer parameters. I'm not sure what kind of impact it may have; perhaps we can check by using the following commands and retrying the deployment?

$ openstack action execution run tripleo.parameters.update '{"container":"unmodified_from_directory", "parameters":{"ControllerCount":3}}'
$ openstack action execution run tripleo.parameters.update '{"container":"unmodified_from_directory", "parameters":{"ComputeCount":2}}'
$ openstack action execution run tripleo.parameters.update '{"container":"unmodified_from_directory", "parameters":{"CephStorageCount":1}}'
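
If it helps, the merged values can be read back before retrying the deployment (assuming the tripleo.parameters.get action is available in this release of tripleo-common):

$ openstack action execution run tripleo.parameters.get '{"container":"unmodified_from_directory"}'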

Comment 15 Julie Pichon 2016-11-18 17:49:33 UTC
I just noticed I have a deployed environment where the CephStorageCount was a string yet was picked up fine, so I don't think it has much impact...

Could you try some of the following to help debug this?
- Try the non-default plan again but without pacemaker, since the errors seem related?
- Try deleting the overcloud plan and creating a plan called 'overcloud' with your custom templates (a possible CLI sequence is sketched at the end of this comment). Does the failure still happen then?

Thank you.
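
A possible CLI sequence for the second suggestion (just a sketch, assuming the Newton tripleoclient plan commands and the stock template location):

$ openstack overcloud plan delete overcloud
$ openstack overcloud plan create --templates /usr/share/openstack-tripleo-heat-templates overcloud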

Comment 16 Ana Krivokapic 2016-11-18 18:00:59 UTC
I just got a successful deployment: a virtual environment with 1 controller, 1 compute and 1 ceph node, using a non-default plan.

Comment 17 Ana Krivokapic 2016-11-18 18:07:41 UTC
[stack@instack ~]$ cat /var/lib/rhos-release/latest-installed
10   -p 2016-11-15.2

Comment 18 Udi Kalifon 2016-11-20 10:55:04 UTC
I just tried a non-HA deployment on bare metal and got the same failure.

Comment 19 Honza Pokorny 2016-11-21 19:41:08 UTC
Udi's failure list above seems to indicate cluster-related issues.

I just ran an SSL-enabled, IPv6 network isolation, 3-controller, 1-compute virt deployment and had essentially the same issue.

I also just had a look at another one of Udi's deployments, which he kicked off before leaving for the day, and that, too, had cluster-related issues.

The line in question seems to be:

Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: /sbin/pcs cluster auth unmodified_from_directory-controller-0 unmodified_from_directory-controller-1 unmodified_from_directory-controller-2 -u hacluster -p 1rRZ7Pva5hFKoM0c --force returned 1 instead of one of [0]

Comment 20 Honza Pokorny 2016-11-22 12:52:33 UTC
The same issue was reported on rdo-list:

https://www.redhat.com/archives/rdo-list/2016-September/msg00034.html

Comment 21 Honza Pokorny 2016-11-22 15:16:41 UTC
It sounds like this is triggered by specifying more than one controller in the plan. Somehow, the additional controllers aren't created (or are created later), and the clustering auth code runs when only a single node is available.  The heat process dies because it can't find the other controller nodes in the cluster.

Comment 22 Julie Pichon 2016-11-22 16:10:18 UTC
I somehow got an HA deployment (3 controllers + 1 compute) to complete successfully with the Nov 15 puddle, in a virt environment, with a non-default plan. I created the plan via the CLI (openstack overcloud plan create newplan, which uses the templates in /usr/share) but set the parameters and started the deployment via the UI. No complicated environments, only pacemaker.

I'll attach my Mistral environment and the pcs status info from the controllers, let me know if any other information would be helpful before I recycle the deployment (probably by morning).

Comment 23 Julie Pichon 2016-11-22 16:18:00 UTC
Created attachment 1222765 [details]
Mistral environment - 3+1 pacemaker non-default (successful)

Comment 24 Julie Pichon 2016-11-22 16:18:48 UTC
Created attachment 1222766 [details]
Controller pcs status - 3+1 pacemaker non-default (successful)

Comment 25 Dan Trainor 2016-11-22 16:27:13 UTC
(In reply to Honza Pokorny from comment #21)
> It sounds like this is triggered by specifying more than one controller in
> the plan. Somehow, the additional controllers aren't created (or are created
> later), and the clustering auth code runs when only a single node is
> available.  The heat process dies because it can't find the other controller
> nodes in the cluster.

This seems correct.  The lp bug associated with this bz indicates that the CloudDomain property is not defined in a non-default plan, which means the nodes cannot resolve each other's hostnames.  The bootstrap Controller node can see itself locally via information contained in /etc/hosts, so it sees fewer failures than the non-bootstrap nodes.

Testing a deployment right now with CloudDomain defined to confirm this.

Comment 26 Honza Pokorny 2016-11-22 17:25:27 UTC
I just managed to deploy a 3-controller, 1-compute plan with no other settings. The plan was created using "openstack overcloud plan create <name>". I created the plan via the CLI but triggered it via the GUI after assigning the nodes. This leads me to believe that the problem lies with the template upload logic, too.

Summary:

* 1+ controllers
* Uploaded via GUI

Comment 28 Michele Baldessari 2016-11-22 19:12:41 UTC
So I took a look at the env at comment 27:

The problem seems to be the failure of the following command:
# journalctl -u os-collect-config |grep "pcs cluster auth"
Nov 21 18:48:14 puma01-plan-controller-0.localdomain os-collect-config[10125]: Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Failed to call refresh: /sbin/pcs cluster auth puma01_plan-controller-0 -u hacluster -p i3kdKVeXzZzu4WYQ --force returned 1 instead of one of [0]
Nov 21 18:48:14 puma01-plan-controller-0.localdomain os-collect-config[10125]: Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: /sbin/pcs cluster auth puma01_plan-controller-0 -u hacluster -p i3kdKVeXzZzu4WYQ --force returned 1 instead of one of [0]


Now the reason the above is failing is the following:
[root@puma01-plan-controller-0 hieradata]# hostname
puma01-plan-controller-0.localdomain

[root@puma01-plan-controller-0 hieradata]# grep puma01-plan-controller-0 /etc/hosts
[root@puma01-plan-controller-0 hieradata]# 

In /etc/hosts we have a slightly different hostname format (underscore vs. dash):
172.16.0.29        puma01_plan-controller-0.localdomain        puma01_plan-controller-0

As a matter of fact, if I point pcs at localhost, it all works:
/sbin/pcs cluster auth localhost -u hacluster -p i3kdKVeXzZzu4WYQ --force 
localhost: Authorized

So the hostname and what's in /etc/hosts need to be brought into agreement.
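
As an illustrative stop-gap (not the real fix; the IP and names are copied from the output above), the dash-form hostname can be made resolvable by appending it to /etc/hosts:

$ echo "172.16.0.29 puma01-plan-controller-0.localdomain puma01-plan-controller-0" | sudo tee -a /etc/hosts
$ getent hosts puma01-plan-controller-0.localdomain    # should now resolve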

Comment 29 Honza Pokorny 2016-11-23 12:54:28 UTC
I can reproduce the above pcs issue even when the plan name only contains alphanumeric characters.

Comment 30 Steven Hardy 2016-11-23 13:40:54 UTC
Does https://review.openstack.org/#/c/400966/ fix this?

Comment 31 Honza Pokorny 2016-11-23 19:11:03 UTC
I managed to successfully deploy an SSL overcloud: 3 controllers, 1 compute.  It's the same setup as before with the exception of IPv6 network isolation.  There must be something wrong with the way the network is set up, and the above cluster issues are the result.  I'm trying to deploy an IPv4 network isolation plan to isolate (welp) the issue further.  My first attempt at IPv4 ended with a frozen deployment.

Comment 32 Honza Pokorny 2016-11-23 19:12:01 UTC
Workarounds used for the above:

* Manual node count setting
* Manual setting of CloudDomain to "localdomain"
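
For reference, both workarounds can be applied with the same action shown in comment 14 (the container name and counts here are illustrative):

$ openstack action execution run tripleo.parameters.update '{"container":"my_plan", "parameters":{"ControllerCount":3, "ComputeCount":1}}'
$ openstack action execution run tripleo.parameters.update '{"container":"my_plan", "parameters":{"CloudDomain":"localdomain"}}'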

Comment 33 Julie Pichon 2016-11-25 19:22:43 UTC
I ran a few more tests; here are additional data points. When I write 3+1+1, I mean "3 controllers + 1 Compute + 1 Ceph" with the Pacemaker and Storage environments enabled, as well as the workarounds for bug 1395810 (manually set node counts and flavors) and bug 1397570 (manually set CloudDomain). Every test was done with non-default plans.

* On his machine, Udi uploaded the tarball with his failing plan and deployed 1 Controller + 1 Compute. The deployment completed successfully.
* On Udi's machine, I created a new plan with Udi's problem tarball and tried a 3+1+1 deployment. It failed.
* On Udi's machine, I created a fresh new plan from the CLI (openstack overcloud plan create <new name>) and launched a 3+1+1 deployment. It worked.
* On Udi's machine, I made a tarball of the templates under /usr/share/* etc, uploaded them via the UI, and launched a 3+1+1 deployment. It worked.
* On my (virt) setup, I uploaded Udi's tarball and started a 3+1+1 deployment. It worked.
* On Udi's machine, I created yet another new plan using his tarball and launched a 3+1+1 deployment. This time, it worked.

Whatever is happening here, it is not happening with "every non-default plan" as the subject suggests. I think there is something else going on, and we should keep the bug open to get to the bottom of it, but I don't think the issue is that generalised. I've run a number of deployments with non-default plans and never hit the pcs issue.

Comment 36 Honza Pokorny 2017-02-03 13:47:28 UTC
Over the last week, I have tried to assess the status of this bug.  I have run probably ten deployments and have rebuilt the undercloud several times.  The only deployment I was able to finish successfully is the one where, right after the undercloud is installed, I log in and hit "Deploy".  The deployment fails under any other circumstances:

* Default plan, 1 compute, 1 controller
* Default plan, 1 compute, 1 controller, network isolation
* Default plan, 1 compute, 3 controller
* Default plan, 1 compute, 3 controller, 1 ceph
* Non-default plan, 1 compute, 3 controller
* Non-default plan, 1 compute, 3 controller, 1 ceph

The most common error is of the following type:

Resource CREATE failed: ResourceInError: resources.Compute.resources[0].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"

I checked the ComputeCount and ControllerCount settings in the Mistral environment.  I tried setting "CloudDomain" manually.  I tried explicitly setting the NtpServer.

This is using the latest stable undercloud, from packages.
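
For a "No valid host was found" failure, the usual first checks (a hedged sketch; the profiles command assumes tripleoclient is installed) are that the bare metal nodes are available and that the flavors match the node profiles:

$ openstack baremetal node list       # nodes should be "available" and not in maintenance
$ openstack overcloud profiles list   # node-to-flavor profile matching
$ openstack flavor list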

Comment 37 Udi Kalifon 2017-07-06 12:44:04 UTC
This bug hasn't happened in a very long time. There is also no need to include the pacemaker environment file.