Bug 1285551

Summary: [Director] rhel-osp-director: failing to replace controller on HA deployment.
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: documentation
Assignee: Dan Macpherson <dmacpher>
Status: CLOSED CURRENTRELEASE
QA Contact: RHOS Documentation Team <rhos-docs>
Severity: high
Priority: urgent
Version: 7.0 (Kilo)
CC: dmacpher, fdinitto, jcoufal, jstransk, mburns, rhel-osp-director-maint, sclewis, srevivo
Target Milestone: ga
Keywords: Documentation
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2017-06-23 05:29:40 UTC
Type: Bug

Description Alexander Chuzhoy 2015-11-25 23:09:52 UTC
rhel-osp-director: failing to replace controller on HA deployment.

Environment:
openstack-ceilometer-alarm-2015.1.0-10.el7ost.noarch
openstack-keystone-2015.1.0-4.el7ost.noarch
openstack-dashboard-theme-2015.1.0-10.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-6.el7ost.noarch
openstack-ironic-discoverd-1.1.0-5.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-16.el7ost.noarch
openstack-swift-object-2.3.0-1.el7ost.noarch
openstack-nova-common-2015.1.0-16.el7ost.noarch
openstack-neutron-openvswitch-2015.1.0-12.el7ost.noarch
openstack-nova-api-2015.1.0-16.el7ost.noarch
redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
openstack-heat-engine-2015.1.0-4.el7ost.noarch
openstack-ceilometer-common-2015.1.0-10.el7ost.noarch
openstack-tempest-kilo-20150708.2.el7ost.noarch
python-django-openstack-auth-1.2.0-3.el7ost.noarch
openstack-tuskar-ui-0.3.0-13.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-1.git49b57eb.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-4.el7ost.noarch
instack-undercloud-2.1.2-22.el7ost.noarch
openstack-ironic-common-2015.1.0-9.el7ost.noarch
openstack-nova-conductor-2015.1.0-16.el7ost.noarch
openstack-swift-account-2.3.0-1.el7ost.noarch
openstack-swift-proxy-2.3.0-1.el7ost.noarch
openstack-ceilometer-notification-2015.1.0-10.el7ost.noarch
openstack-ceilometer-collector-2015.1.0-10.el7ost.noarch
openstack-ceilometer-api-2015.1.0-10.el7ost.noarch
openstack-ironic-api-2015.1.0-9.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-tuskar-ui-extras-0.0.4-1.el7ost.noarch
openstack-glance-2015.1.0-6.el7ost.noarch
python-openstackclient-1.0.3-2.el7ost.noarch
openstack-heat-api-2015.1.0-4.el7ost.noarch
openstack-ceilometer-central-2015.1.0-10.el7ost.noarch
openstack-puppet-modules-2015.1.8-8.el7ost.noarch
openstack-nova-compute-2015.1.0-16.el7ost.noarch
openstack-neutron-ml2-2015.1.0-12.el7ost.noarch
openstack-nova-scheduler-2015.1.0-16.el7ost.noarch
openstack-dashboard-2015.1.0-10.el7ost.noarch
openstack-nova-cert-2015.1.0-16.el7ost.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-45.el7ost.noarch
openstack-neutron-common-2015.1.0-12.el7ost.noarch
openstack-heat-common-2015.1.0-4.el7ost.noarch
openstack-tuskar-0.4.18-3.el7ost.noarch
openstack-selinux-0.6.37-1.el7ost.noarch
openstack-swift-container-2.3.0-1.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-4.el7ost.noarch
openstack-nova-console-2015.1.0-16.el7ost.noarch
openstack-neutron-2015.1.0-12.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-4.el7ost.noarch
openstack-ironic-conductor-2015.1.0-9.el7ost.noarch
openstack-swift-2.3.0-1.el7ost.noarch



Steps to reproduce:
1. Deploy overcloud.
2. Attempt to replace one controller by following this procedure:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/Replacing_Controller_Nodes.html
Run this step from the above procedure:
"The manual configuration is complete. Rerun the Overcloud deployment command to continue the stack update:

[stack@director ~]$ openstack overcloud deploy --templates --control-scale 3"

Result:
The overcloud stack update fails.
The documented suggestion ("If the Overcloud stack update fails again, it might be due to an issue with the Keystone service. Log into the new node and restart the Keystone service again") doesn't help.
I managed to reach a state where all pcs resources were started on all controllers, but the overcloud update still fails.


[stack@instack ~]$ openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ntp-server 10.5.26.10 --timeout 90 -e network-environment.yaml
Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
ERROR: openstack Heat Stack update failed.
[stack@instack ~]$ heat resource-list -n 5 overcloud|grep -v COMPLETE
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+---------------------------------------------+                                                                                                                                                                                                
| resource_name                               | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                             |                                                                                                                                                                                                
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+---------------------------------------------+                                                                                                                                                                                                
| ControllerNodesPostDeployment               | 5fe7ad15-1b8d-43e3-b1c9-8532ad519ac8          | OS::TripleO::ControllerPostDeployment             | UPDATE_FAILED   | 2015-11-25T22:55:12Z |                                             |                                                                                                                                                                                                
| ControllerOvercloudServicesDeployment_Step6 | 346db6a9-6185-4dd2-8697-9d65650dcfc1          | OS::Heat::StructuredDeployments                   | UPDATE_FAILED   | 2015-11-25T22:56:08Z | ControllerNodesPostDeployment               |                                                                                                                                                                                                
| 0                                           | 4b41d03b-7bce-479b-bab1-6a47aab5ca48          | OS::Heat::StructuredDeployment                    | CREATE_FAILED   | 2015-11-25T22:56:13Z | ControllerOvercloudServicesDeployment_Step6 |                                                                                                                                                                                                
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+---------------------------------------------+                                                                                                                                                                                                
[stack@instack ~]$ heat resource-show overcloud ControllerNodesPostDeployment                                                   
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+                                                                                                                                                      
| Property               | Value                                                                                                                                                                                                                                                   |                                                                                                                                                      
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+                                                                                                                                                      
| attributes             | {}                                                                                                                                                                                                                                                      |                                                                                                                                                      
| description            |                                                                                                                                                                                                                                                         |                                                                                                                                                      
| links                  | http://192.0.2.1:8004/v1/de0dd82d60a949d1a17fc5b846fad5ed/stacks/overcloud/8c438299-1b2e-4800-b483-5b2bc546d80a/resources/ControllerNodesPostDeployment (self)                                                                                          |                                                                                                                                                      
|                        | http://192.0.2.1:8004/v1/de0dd82d60a949d1a17fc5b846fad5ed/stacks/overcloud/8c438299-1b2e-4800-b483-5b2bc546d80a (stack)                                                                                                                                 |                                                                                                                                                      
|                        | http://192.0.2.1:8004/v1/de0dd82d60a949d1a17fc5b846fad5ed/stacks/overcloud-ControllerNodesPostDeployment-ct6rlo5vyjyq/5fe7ad15-1b8d-43e3-b1c9-8532ad519ac8 (nested)                                                                                     |                                                                                                                                                      
| logical_resource_id    | ControllerNodesPostDeployment                                                                                                                                                                                                                           |                                                                                                                                                      
| physical_resource_id   | 5fe7ad15-1b8d-43e3-b1c9-8532ad519ac8                                                                                                                                                                                                                    |                                                                                                                                                      
| required_by            | CephStorageNodesPostDeployment                                                                                                                                                                                                                          |                                                                                                                                                      
|                        | BlockStorageNodesPostDeployment                                                                                                                                                                                                                         |                                                                                                                                                      
| resource_name          | ControllerNodesPostDeployment                                                                                                                                                                                                                           |                                                                                                                                                      
| resource_status        | UPDATE_FAILED                                                                                                                                                                                                                                           |                                                                                                                                                      
| resource_status_reason | ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6"" |                                                                                                                                                      
| resource_type          | OS::TripleO::ControllerPostDeployment                                                                                                                                                                                                                   |                                                                                                                                                      
| updated_time           | 2015-11-25T22:55:12Z                                                                                                                                                                                                                                    |                                                                                                                                                      
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Comment 2 Alexander Chuzhoy 2015-11-25 23:14:57 UTC
The logs on other controllers are too big to attach.

Comment 3 Jiri Stransky 2015-11-27 16:41:11 UTC
We investigated the environment with Marios. The root cause of the Step 6 failure is:

Notice: /Stage[main]/Heat::Keystone::Domain/Exec[heat_domain_create]/returns: keystoneclient.openstack.common.apiclient.exceptions.InternalServerError: An unexpected error prevented the server from fulfilling your request: [Errno 13] Permission denied: '/etc/keystone/policy.json' (Disable debug mode to suppress these details.) (HTTP 500) (Request-ID: req-b314b043-211b-48ff-8590-c5446abb359d)

This is caused by incorrect ownership of the config files on the newly added controller node (controller-0 is correct, controller-3 is incorrect):

[root@overcloud-controller-0 keystone]# ll
total 84
-rw-r-----. 1 root     keystone  1504 Apr 30  2015 default_catalog.templates
-rw-------. 1 keystone keystone 58431 Nov 25 11:40 keystone.conf
-rw-r-----. 1 root     keystone  1046 Apr 30  2015 logging.conf
-rw-r-----. 1 keystone keystone  8755 Apr 30  2015 policy.json
drwxr-xr-x. 4 keystone keystone    32 Nov 25 11:40 ssl
-rw-r-----. 1 keystone keystone   665 Apr 30  2015 sso_callback_template.html

[root@overcloud-controller-3 keystone]# ll
total 84
-rw-r-----. 1 root     root      1504 Nov 25 16:22 default_catalog.templates
-rw-------. 1 keystone keystone 58431 Nov 25 16:23 keystone.conf
-rw-r-----. 1 root     root      1046 Nov 25 16:22 logging.conf
-rw-r-----. 1 root     root      8755 Nov 25 16:22 policy.json
drwxr-xr-x. 4 keystone keystone    32 Nov 25 16:22 ssl
-rw-r-----. 1 root     root       665 Nov 25 16:22 sso_callback_template.html
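
The comparison above can also be done mechanically. The following is a minimal sketch, assuming the `ls -l` output from each node has been saved to a text file; the function name `ownership_diff` is hypothetical and not part of any director tooling.

```shell
# Compare owner:group between two saved `ls -l` listings.
# $1 = listing from a known-good controller, $2 = listing from the new controller.
# Prints one line per entry whose ownership differs.
ownership_diff() {
    awk '
        NF < 9 { next }                              # skip "total N" and short lines
        FNR == NR { want[$9] = $3 ":" $4; next }     # first file: reference ownership
        ($9 in want) && (want[$9] != ($3 ":" $4)) {  # second file: report mismatches
            print $9 ": expected " want[$9] ", found " $3 ":" $4
        }
    ' "$1" "$2"
}
```

Run against the two listings above, this would report default_catalog.templates, logging.conf, policy.json and sso_callback_template.html, and stay silent for keystone.conf and ssl.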


The solution is to add chown commands to the instructions for replacing a controller node. After the `scp -r stack.0.1:~/keystone /etc/.` step, run the following to fix the ownership:

chown -R keystone: /etc/keystone
chown root /etc/keystone/logging.conf /etc/keystone/default_catalog.templates

This should result in the same file ownership as on the original controller nodes.
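
To confirm the result on the new node, something like the following sketch could be used. It assumes GNU coreutils `stat` (for the `-c` format option); the helper name `check_owner` is hypothetical.

```shell
# Report files whose owner:group differs from the expected value.
# Usage: check_owner DIR OWNER:GROUP FILE...
# Returns non-zero if any file is missing or has unexpected ownership.
check_owner() {
    dir=$1; want=$2; shift 2
    rc=0
    for f in "$@"; do
        # stat -c is GNU-specific; %U and %G are owner and group names
        have=$(stat -c '%U:%G' "$dir/$f") || { rc=1; continue; }
        if [ "$have" != "$want" ]; then
            echo "$dir/$f: expected $want, found $have"
            rc=1
        fi
    done
    return $rc
}
```

Based on the controller-0 listing in this comment, `check_owner /etc/keystone keystone:keystone keystone.conf policy.json sso_callback_template.html ssl` and `check_owner /etc/keystone root:keystone logging.conf default_catalog.templates` should both print nothing once the chown commands have been applied.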