Bug 1269919 - Overcloud shows up with two controller-1 hosts
Summary: Overcloud shows up with two controller-1 hosts
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assignee: Lucas Alvares Gomes
QA Contact: Toure Dunnon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-10-08 13:35 UTC by Joe Talerico
Modified: 2016-08-18 13:42 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-18 13:42:20 UTC
Target Upstream Version:
Embargoed:



Description Joe Talerico 2015-10-08 13:35:44 UTC
Description of problem:
Overcloud deployment fails to set the proper hostname on a controller. 

https://gist.github.com/jtaleric/7324f258dc367992ecbd

Reviewing the os-collect-config:

https://gist.github.com/jtaleric/cc70810844fb71fcea2c

With the ec2 information (note the hostname mismatch with the Heat stack name below):

[root@overcloud-controller-1 heat-config]# grep -ri Contro /var/lib/os-collect-config
/var/lib/os-collect-config/heat_local.json.orig:   "stack_name": "overcloud-Controller-thhu62wu2afs-1-u4xkhezwigxr",
/var/lib/os-collect-config/heat_local.json.orig:   "path": "Controller.Metadata"
/var/lib/os-collect-config/heat_local.json:   "stack_name": "overcloud-Controller-thhu62wu2afs-1-u4xkhezwigxr",
/var/lib/os-collect-config/heat_local.json:   "path": "Controller.Metadata"
/var/lib/os-collect-config/ec2.json.orig: "local-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json.orig: "public-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json.orig: "hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json: "local-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json: "public-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json: "hostname": "overcloud-controller-2",
/var/lib/os-collect-config/heat_local.json.last:   "stack_name": "overcloud-Controller-thhu62wu2afs-1-u4xkhezwigxr",
/var/lib/os-collect-config/heat_local.json.last:   "path": "Controller.Metadata"
/var/lib/os-collect-config/ec2.json.last: "local-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json.last: "public-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json.last: "hostname": "overcloud-controller-2",

Comment 2 Steven Hardy 2015-10-08 16:54:19 UTC
So, I did some investigation, and it turns out this isn't a Heat issue.

The problem is that Nova/Ironic injects a config drive, but it isn't cleaned up between deployments, and the device name isn't consistent.

blkid shows there are two config drives: one contains the same data as the metadata API, the other is wrong (same as cfn-init-data).
/dev/sda1 is OK, /dev/sdd1 is wrong.


$ blkid | grep config-2
/dev/sdd1: UUID="2015-09-29-16-31-25-00" LABEL="config-2" TYPE="iso9660"
/dev/sda1: UUID="2015-10-08-09-45-12-00" LABEL="config-2" TYPE="iso9660"

In this case we see after mounting the partitions that one contains the correct data, and the other does not.

Unfortunately cloud-init uses the wrong one, and it takes precedence over the nova metadata API.
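
For reference, a quick way to confirm which config drive holds the stale metadata is to mount both config-2 partitions and compare the hostnames they carry against the nova metadata API. This is only a sketch; the device names follow the example above and the paths assume the standard config-drive layout:

$ mkdir -p /tmp/cd-sda1 /tmp/cd-sdd1
$ sudo mount -o ro /dev/sda1 /tmp/cd-sda1
$ sudo mount -o ro /dev/sdd1 /tmp/cd-sdd1
$ grep -o '"hostname": "[^"]*"' /tmp/cd-sda1/openstack/latest/meta_data.json
$ grep -o '"hostname": "[^"]*"' /tmp/cd-sdd1/openstack/latest/meta_data.json
$ curl -s http://169.254.169.254/latest/meta-data/hostname    # nova metadata API, for comparison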

Comment 3 Lucas Alvares Gomes 2015-10-08 17:11:19 UTC
Thanks for the investigation @Steven. Currently in OSP-7 we use the bash ramdisk generated with the DIB deploy-ironic element to deploy the image using Ironic. This ramdisk does not support erasing the disk devices before and after the deployment of the image, so the previous tenant data remains on the disk.

The long-term solution would be to use IPA (ironic-python-agent) so we can use its cleaning capabilities to wipe the data from the disks [1]

A short-term solution would perhaps be to update the "deploy-ironic" element in DIB [2] to find the local disks prior to deployment and clean their partition tables (a rough sketch of that idea follows the references below). Since we don't support attaching volumes or anything like that for bare metal, it should be a safe operation.

Let me know what you guys think about it.

[1] http://docs.openstack.org/developer/ironic/deploy/cleaning.html?highlight=cleaning

[2] https://github.com/openstack/diskimage-builder/blob/master/elements/deploy-ironic/init.d/80-deploy-ironic#L104
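
As a rough illustration of the short-term idea above (not the actual patch), the deploy ramdisk could wipe the partition tables of every local disk before imaging. The sketch below assumes lsblk, wipefs and sgdisk are available in the ramdisk:

# wipe partition tables of all local disks before deployment (sketch only)
for dev in $(lsblk -dn -o NAME,TYPE | awk '$2=="disk"{print "/dev/"$1}'); do
    wipefs --all "$dev"        # drop filesystem signatures, including stale config-2 labels
    sgdisk --zap-all "$dev"    # clear both GPT and MBR partition tables
done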

Comment 4 Tim Wilkinson 2015-10-08 18:50:14 UTC
Assuming you will implement the long-term solution, the even shorter-term fix for us was to delete the offending partition. Only one of the 19 Ironic nodes had multiple partitions of type iso9660. We deleted the overcloud stack and successfully redeployed.
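
For anyone hitting the same issue, the manual workaround boils down to finding the stale config-2 partition on the node and removing it; /dev/sdd1 here is just the example device from comment 2:

$ blkid | grep config-2          # the stale one has a UUID/timestamp predating the current deployment
$ sudo parted /dev/sdd rm 1      # delete the offending partition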

Comment 6 Hugh Brock 2016-02-05 11:15:33 UTC
Given the easy availability of a workaround, this is not a release blocker, especially since the cleaning capability is present in IPA.

Lucas, what is required to have the deployment code use IPA to clean up the disks in between deployments?

I've moved the bug to 8.0.z since it is an Ironic bug, and Ironic is part of core.

Comment 8 Dmitry Tantsur 2016-02-08 10:31:57 UTC
Hmm, wait, folks. Lucas, cleaning has nothing to do with it; our deploy procedure already includes wiping the partition table: https://github.com/openstack/ironic/blob/stable/kilo/ironic/drivers/modules/deploy_utils.py#L573-L574

Joe, do you see any errors in the ironic conductor log? I'm looking for something similar to https://github.com/openstack/ironic/blob/stable/kilo/ironic/drivers/modules/deploy_utils.py#L449-L450
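
(If the undercloud logs are still around, something along these lines should surface such errors; the path assumes the default log location:)

$ sudo grep -iE 'error|traceback' /var/log/ironic/ironic-conductor.log | tail -n 50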

Comment 9 Joe Talerico 2016-02-08 15:36:33 UTC
@Dmitry - I don't think we have sosreports of this, even from when things went south. Reviewing deploy_utils.py, I do not recall seeing that error.

Comment 10 Dmitry Tantsur 2016-02-19 11:52:51 UTC
Oh, I think I understand what happened. You have several physical disks, right? Probably one deployment ended up on /dev/sda, another one on /dev/sdd. Ironic won't (and should not) clean up partitions in this case, unless cleaning is enabled (and we have it disabled by default). As using a random disk is not the best option IMO, I suggest you use root device hints for specifying the same disk every time: http://docs.openstack.org/developer/ironic/liberty/deploy/install-guide.html#specifying-the-disk-for-deployment
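
For the record, a root device hint can be set per node with the ironic client, for example by pinning the deployment to a disk serial number or WWN (the values below are placeholders):

$ ironic node-update <NODE_UUID> add properties/root_device='{"serial": "<DISK_SERIAL>"}'
# or, pinning by WWN instead:
$ ironic node-update <NODE_UUID> add properties/root_device='{"wwn": "<DISK_WWN>"}'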

I think this is not a bug, but I do think we need to update the documentation to make this potential issue clear. What do you guys think?

Comment 11 Dmitry Tantsur 2016-02-19 11:59:35 UTC
Upstream documentation update: https://review.openstack.org/#/c/282298/

Comment 12 Lucas Alvares Gomes 2016-03-22 16:30:22 UTC
(In reply to Dmitry Tantsur from comment #10)
> Oh, I think I understand what happened. You have several physical disks,
> right? Probably one deployment ended up on /dev/sda, another one on
> /dev/sdd. Ironic won't (and should not) clean up partitions in this case,
> unless cleaning is enabled (and we have it disabled by default). As using a
> random disk is not the best option IMO, I suggest you use root device hints
> for specifying the same disk every time:
> http://docs.openstack.org/developer/ironic/liberty/deploy/install-guide.
> html#specifying-the-disk-for-deployment
> 
> I think this is not a bug, but I do think we need to update the
> documentation to make this potential issue clear. What do you guys think?

Yeah correct, that's why I think cleaning may solve the problem as well.

But I was checking here and we have a problem: the patch adding support for cleaning with the iSCSI drivers [0] is not included in stable/liberty.

[0] https://review.openstack.org/#/c/220898/
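
(For reference, on releases where cleaning is supported it is enabled in ironic.conf; the option name below is the one used in Kilo/Liberty, before the later rename to automated_clean:)

[conductor]
# enable automated cleaning between deployments (requires the IPA ramdisk)
clean_nodes = True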

Comment 13 Lucas Alvares Gomes 2016-08-18 13:42:20 UTC
Closing this bug because, unfortunately, cleaning is not available for Ironic in OSP7 since it does not use the IPA ramdisk.

