Bug 1392995

Summary: Replacing a Ceph storage node fails with StackValidationFailed: resources.CephStorageAllNodesDeployment: Property error: CephStorageAllNodesDeployment.Properties.input_values: The Referenced Attribute (CephStorage resource.0.hostname) is incorrect.
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Steven Hardy <shardy>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: unspecified
Version: 10.0 (Newton)
CC: bcrochet, brad, dbecker, jcoufal, jefbrown, jschluet, jslagle, mburns, mcornea, morazi, pgrist, rhel-osp-director-maint, sasha, sclewis, shardy
Target Milestone: rc
Keywords: Triaged
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-5.0.0-1.7.el7ost
Doc Type: If docs needed, set a value
Last Closed: 2016-12-14 16:31:00 UTC
Type: Bug

Attachments: Logs and templates

Description Marius Cornea 2016-11-08 16:20:19 UTC
Description of problem:

Following the Ceph storage node replacement procedure at
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/red-hat-ceph-storage-for-the-overcloud/#Replacing_Ceph_Storage_Nodes

The openstack overcloud node delete step fails with the following error:

overcloud.CephStorageAllNodesDeployment:
  resource_type: OS::Heat::StructuredDeployments
  physical_resource_id: 75b6c232-21c9-46e5-9b46-c07ef8d7b7af
  status: UPDATE_FAILED
  status_reason: |
    StackValidationFailed: resources.CephStorageAllNodesDeployment: Property error: CephStorageAllNodesDeployment.Properties.input_values: The Referenced Attribute (CephStorage resource.0.hostname) is incorrect.
overcloud.ComputeAllNodesDeployment:
  resource_type: OS::Heat::StructuredDeployments
  physical_resource_id: 7927559f-55f1-4c7f-b58d-1fe2fab9705c
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.0.0-1.3.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an overcloud with 3 Ceph storage nodes
2. Stop one of the Ceph storage nodes
3. Disable the OSDs running on the stopped node and remove them from the CRUSH map (see the rough sketch after these steps), following the procedure at
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/red-hat-ceph-storage-for-the-overcloud/#Replacing_Ceph_Storage_Nodes
4. Delete the Ceph node:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud node delete --stack overcloud --templates $THT \
-e $THT/environments/network-isolation.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
03915d83-6026-4a4f-9e93-a3807c9e0d8e
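
For reference, a minimal sketch of what step 3 typically involves for each OSD hosted on the stopped node. <OSD_ID> is a placeholder; the linked documentation is the authoritative procedure:

ceph osd out <OSD_ID>
systemctl stop ceph-osd@<OSD_ID>    # on the Ceph node, if it is still reachable
ceph osd crush remove osd.<OSD_ID>
ceph auth del osd.<OSD_ID>
ceph osd rm <OSD_ID>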


Actual results:
Stack update fails with:
    StackValidationFailed: resources.CephStorageAllNodesDeployment: Property error: CephStorageAllNodesDeployment.Properties.input_values: The Referenced Attribute (CephStorage resource.0.hostname) is incorrect.

Expected results:
Stack update completes successfully.

Additional info:

Comment 1 James Slagle 2016-11-08 19:13:09 UTC
Can you provide:

- all your custom templates
- heat-api.log and heat-engine.log from the undercloud
- the plan contents (download the overcloud container contents from Swift and tar/gzip them)

Comment 2 James Slagle 2016-11-08 22:32:25 UTC
I'd also be interested to know which Ceph node the UUID 03915d83-6026-4a4f-9e93-a3807c9e0d8e corresponds to. Is it the first one? Does the issue reproduce if you try to delete the last Ceph node instead?

Also, for OSP 10, I don't think you have to pass --templates and all the -e's to the node delete command.

Comment 3 James Slagle 2016-11-08 22:34:38 UTC
(In reply to James Slagle from comment #2)

> Also, for OSP 10, I don't think you have to pass --templates and all the
> -e's to the node delete command.

Brad, can you confirm this bit ^?

Comment 4 James Slagle 2016-11-09 01:11:00 UTC
(In reply to James Slagle from comment #3)
> (In reply to James Slagle from comment #2)
> 
> > Also, for OSP 10, I don't think you have to pass --templates and all the
> > -e's to the node delete command.
> 
> Brad, can you confirm this bit ^?

Checked with him on IRC and he confirmed that you no longer need to pass --templates or the -e's to the openstack overcloud node delete command.
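
For reference, a minimal sketch of the simplified invocation on OSP 10, assuming the same stack name and node UUID as in the description, with no --templates or -e arguments:

source ~/stackrc
openstack overcloud node delete --stack overcloud 03915d83-6026-4a4f-9e93-a3807c9e0d8e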

Comment 5 Marius Cornea 2016-11-09 08:18:41 UTC
Created attachment 1218844 [details]
Logs and templates

Comment 6 Steven Hardy 2016-11-09 08:21:48 UTC
This is because we now set the bootstrap node for all roles (to enable deployment of any puppet profile that expects to detect the first node in the cluster, a.k.a. the bootstrap node).

Previously only the Controller set this, but now we have a hard-coded reference to node "0" here in the overcloud template:

https://github.com/openstack/tripleo-heat-templates/blob/master/overcloud.j2.yaml#L234

      input_values:
        bootstrap_nodeid: {get_attr: [{{role.name}}, resource.0.hostname]}
        bootstrap_nodeid_ip: {get_attr: [{{role.name}}, resource.0.ip_address]}

We need some way for the node delete workflow to change this index when replacing node "0", or another way to detect the first node in the group without using the node name. This looks like an index, but I think it actually refers to the resource name within the resource group, so after this removal it would be e.g. "1". Ideally we'd use a list lookup here instead; that may be a possible way to fix this.
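
For illustration, a rough sketch of what such a list lookup could look like, assuming the role's ResourceGroup exposes the per-member hostname and ip_address attributes as lists and that Heat's yaql intrinsic function is available (this is only a sketch of the idea, not necessarily the exact change made upstream):

      input_values:
        # Take the first entry of the attribute list instead of addressing
        # group member "0" by name, so the lookup still works after node 0
        # has been removed (empty-group handling omitted in this sketch).
        bootstrap_nodeid:
          yaql:
            expression: $.data.first()
            data: {get_attr: [{{role.name}}, hostname]}
        bootstrap_nodeid_ip:
          yaql:
            expression: $.data.first()
            data: {get_attr: [{{role.name}}, ip_address]}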

Comment 7 Steven Hardy 2016-11-10 11:45:35 UTC
https://review.openstack.org/#/c/395699/ has been posted upstream, which I believe resolves this issue. I've done some local testing, but feedback is welcome.

Comment 8 Marius Cornea 2016-11-10 18:03:37 UTC
(In reply to Steven Hardy from comment #7)
> https://review.openstack.org/#/c/395699/ has been posted upstream, which I believe
> resolves this issue. I've done some local testing, but feedback is welcome.

Tested it in my environment as well and it looks good.

Comment 16 errata-xmlrpc 2016-12-14 16:31:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html