Description of problem:
Overcloud deployment fails to set the proper hostname on a controller.
https://gist.github.com/jtaleric/7324f258dc367992ecbd

Reviewing the os-collect-config: https://gist.github.com/jtaleric/cc70810844fb71fcea2c

With ec2 information:

[root@overcloud-controller-1 heat-config]# grep -ri Contro /var/lib/os-collect-config
/var/lib/os-collect-config/heat_local.json.orig: "stack_name": "overcloud-Controller-thhu62wu2afs-1-u4xkhezwigxr",
/var/lib/os-collect-config/heat_local.json.orig: "path": "Controller.Metadata"
/var/lib/os-collect-config/heat_local.json: "stack_name": "overcloud-Controller-thhu62wu2afs-1-u4xkhezwigxr",
/var/lib/os-collect-config/heat_local.json: "path": "Controller.Metadata"
/var/lib/os-collect-config/ec2.json.orig: "local-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json.orig: "public-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json.orig: "hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json: "local-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json: "public-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json: "hostname": "overcloud-controller-2",
/var/lib/os-collect-config/heat_local.json.last: "stack_name": "overcloud-Controller-thhu62wu2afs-1-u4xkhezwigxr",
/var/lib/os-collect-config/heat_local.json.last: "path": "Controller.Metadata"
/var/lib/os-collect-config/ec2.json.last: "local-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json.last: "public-hostname": "overcloud-controller-2",
/var/lib/os-collect-config/ec2.json.last: "hostname": "overcloud-controller-2",
So, I did some investigation, and it turns out this isn't a Heat issue. The problem is that Nova/Ironic inject a config drive, but it is not cleaned between deployments, and the device name isn't consistent.

blkid shows there are two config drives: one contains the same data as the metadata API, the other is wrong (same as cfn-init-data). /dev/sda1 is OK, /dev/sdd1 is wrong.

$ blkid | grep config-2
/dev/sdd1: UUID="2015-09-29-16-31-25-00" LABEL="config-2" TYPE="iso9660"
/dev/sda1: UUID="2015-10-08-09-45-12-00" LABEL="config-2" TYPE="iso9660"

After mounting the partitions, we see that one contains the correct data and the other does not. Unfortunately cloud-init uses the wrong one, and it takes precedence over the Nova metadata API.
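For anyone who wants to check other nodes for the same condition, here is a minimal sketch that parses blkid text output and lists every device carrying the config-2 label. The function names (parse_blkid, config_drives) and the SAMPLE constant are made up for illustration; the sample text is the blkid output quoted above. It only reports duplicates and does not decide which drive is stale; in this report the UUIDs look like creation timestamps and the older one (/dev/sdd1) was the wrong one, but that is an observation from this case, not a rule the code relies on.

```python
import re

# One line of blkid output looks like: <device>: KEY="value" KEY="value" ...
_LINE = re.compile(r'^(?P<dev>\S+):\s*(?P<attrs>.*)$')
_ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_blkid(output):
    """Return {device: {attribute: value}} parsed from blkid text output."""
    devices = {}
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            devices[match.group("dev")] = dict(_ATTR.findall(match.group("attrs")))
    return devices


def config_drives(output, label="config-2"):
    """Return the sorted list of devices whose LABEL equals the given label."""
    parsed = parse_blkid(output)
    return sorted(dev for dev, attrs in parsed.items() if attrs.get("LABEL") == label)


# Sample input copied from the blkid output in this comment.
SAMPLE = (
    '/dev/sdd1: UUID="2015-09-29-16-31-25-00" LABEL="config-2" TYPE="iso9660"\n'
    '/dev/sda1: UUID="2015-10-08-09-45-12-00" LABEL="config-2" TYPE="iso9660"\n'
)

if __name__ == "__main__":
    found = config_drives(SAMPLE)
    if len(found) > 1:
        print("multiple config drives found:", ", ".join(found))
```

On a real node the input would come from running blkid (as root) and passing its standard output to config_drives; more than one result suggests a leftover config drive from an earlier deployment.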
Thanks for the investigation @Steven.

Currently in OSP-7 we use the bash ramdisk generated with the DIB deploy-ironic element to deploy the image using Ironic. This ramdisk does not support erasing the disk devices before or after deploying the image, so the previous tenant's data remains on the disk.

The long-term solution would be to use IPA (ironic-python-agent) so we can use its cleaning capabilities to wipe the disks [1].

A short-term solution might be to update the "deploy-ironic" element in DIB [2] to find the local disks prior to the deployment and wipe their partition tables. Since we don't support attaching volumes or anything like that for bare metal, it should be a safe operation.

Let me know what you guys think about it.

[1] http://docs.openstack.org/developer/ironic/deploy/cleaning.html?highlight=cleaning
[2] https://github.com/openstack/diskimage-builder/blob/master/elements/deploy-ironic/init.d/80-deploy-ironic#L104
Assuming you will implement the long term solution, the even shorter term solution for us was to delete the offending partition. Only one of the 19 ironic nodes had multiple partitions of type iso9660. Deleted overcloud stack and successfully redeployed.
Given the easy availability of a workaround, this is not a release blocker, especially given that the cleaning capability is present in IPA. Lucas, what is required to have the deployment code use IPA to clean up the disks in between deployments? I've moved the bug to 8.0.z since it is an Ironic bug, and Ironic is part of core.
Hmm, wait folks. Lucas, cleaning has nothing to do with it, our deploy procedure already includes wiping the partition table: https://github.com/openstack/ironic/blob/stable/kilo/ironic/drivers/modules/deploy_utils.py#L573-L574 Joe, do you see any errors in the ironic conductor log? I'm looking for something similar to https://github.com/openstack/ironic/blob/stable/kilo/ironic/drivers/modules/deploy_utils.py#L449-L450
@Dmitry - I don't think we have sosreports of this even when things went south. Reviewing the deploy_utils.py, I do not recall seeing that error.
Oh, I think I understand what happened. You have several physical disks, right? Probably one deployment ended up on /dev/sda, another one on /dev/sdd. Ironic won't (and should not) clean up partitions in this case, unless cleaning is enabled (and we have it disabled by default). As using a random disk is not the best option IMO, I suggest you use root device hints for specifying the same disk every time: http://docs.openstack.org/developer/ironic/liberty/deploy/install-guide.html#specifying-the-disk-for-deployment I think this is not a bug, but I do think we need to update the documentation to make this potential issue clear. What do you guys think?
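To make the root device hint suggestion concrete, here is a rough sketch that builds the argument list for setting a hint on a node. It assumes the hint is stored as a JSON value under properties/root_device and set with "ironic node-update ... add", and that the accepted keys are model, vendor, serial, wwn and size; both assumptions should be checked against the install guide linked above for your release. The function name, NODE_UUID and the serial value are placeholders, not values from this bug.

```python
import json

# Hint keys assumed from the Liberty-era install guide; verify for your release.
ASSUMED_HINT_KEYS = {"model", "vendor", "serial", "wwn", "size"}


def root_device_update_args(node_uuid, hints):
    """Build the (assumed) ironic CLI arguments that set a root device hint."""
    if not hints:
        raise ValueError("at least one hint is required")
    unknown = set(hints) - ASSUMED_HINT_KEYS
    if unknown:
        raise ValueError("unsupported hint keys: %s" % ", ".join(sorted(unknown)))
    # sort_keys keeps the generated JSON stable between runs.
    value = json.dumps(hints, sort_keys=True)
    return ["ironic", "node-update", node_uuid, "add",
            "properties/root_device=" + value]


if __name__ == "__main__":
    # Placeholder node UUID and disk serial, for illustration only.
    print(" ".join(root_device_update_args("NODE_UUID", {"serial": "PLACEHOLDER_SERIAL"})))
```

Pinning the hint to a stable identifier such as a serial or WWN, rather than relying on a device name like /dev/sda, is the point of the suggestion: the device name may change between boots, which is how two deployments can land on different disks.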
Upstream documentation update: https://review.openstack.org/#/c/282298/
(In reply to Dmitry Tantsur from comment #10)
> Oh, I think I understand what happened. You have several physical disks,
> right? Probably one deployment ended up on /dev/sda, another one on
> /dev/sdd. Ironic won't (and should not) clean up partitions in this case,
> unless cleaning is enabled (and we have it disabled by default). As using a
> random disk is not the best option IMO, I suggest you use root device hints
> for specifying the same disk every time:
> http://docs.openstack.org/developer/ironic/liberty/deploy/install-guide.html#specifying-the-disk-for-deployment
>
> I think this is not a bug, but I do think we need to update the
> documentation to make this potential issue clear. What do you guys think?

Yeah, correct. That's why I think cleaning may solve the problem as well. But I was checking, and we have a problem: the patch adding support for cleaning with iSCSI drivers [0] is not included in stable/liberty.

[0] https://review.openstack.org/#/c/220898/
Closing this bug because, unfortunately, cleaning is not available for Ironic in OSP7, since it does not use the IPA ramdisk.