- When deploying Ceph with OSPd it may be necessary to format the Ceph Storage node disks to GPT with a first-boot script, as documented for OSP10 [1].
- The new default behavior in OSP11 is for Ironic to clean the disks [2] when a node is set to available.
- Thus the first-boot script _may_ no longer be necessary.

This BZ tracks testing done by DFG:Ceph to verify whether the first-boot script is still necessary given the new change in Ironic.

Footnotes:
[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT
[2] https://github.com/openstack/ironic-lib/blob/4ae48d0b212c16c8b49d4f1144c073b3a3206597/ironic_lib/disk_utils.py#L360
Setting status to ON_QA as the feature has been implemented and needs to be tested.
The scenario that OSPd users often run into when deploying Ceph OSDs is:

0. Ironic introspects hardware and sets the node to available
1. Deploy the overcloud with OSDs (a new FSID X is generated)
2. If the disks are factory clean, they are made into OSDs w/ FSID X
3. Run 'openstack stack delete overcloud' to test a new deploy
4. Deploy the overcloud with OSDs (a new FSID Y is generated)
5. Because X != Y, the deploy fails with "Exec[ceph-osd-check-fsid-mismatch-/dev/sde] has failures"

To work around this, users have used a first-boot script [1] so that the disks are wiped when step #1 above is run.

We expect that the steps will change in OSP11 as follows:

0. Ironic introspects hardware and sets the node to available (#0 invokes cleaning every time the node enters the pool of nodes ready to be scheduled, i.e. the available state)
1. Deploy the overcloud with OSDs (a new FSID X is generated)
2. The disks are clean and are made into OSDs w/ FSID X
3. Run 'openstack stack delete overcloud' to test a new deploy (#3 runs 'nova delete', which results in the disks getting cleaned before the next deploy)
4. Deploy the overcloud with OSDs (a new FSID Y is generated)
5. The disks are clean and are made into OSDs w/ FSID Y

and so on...

The next step in this bug is to verify that the steps above work as described.
More info on the Ironic change:

In OSP11, Ironic's automated_clean [1] should default to true, and it will run `wipefs --force --all` to delete the disk metadata [2]. This should get rid of previous GPT or other labels so 'ceph-disk prepare' can set a GPT label, but that is what needs to be tested.

We expect the following to be part of the new deployment cycle:

Introspection -> Ironic cleaning -> Nova boot (optionally Nova stop/start) -> Nova delete -> Ironic cleaning -> Nova boot -> Nova delete ...

During Ironic cleaning the node is booted on a RAM disk so that wipefs can be run, and then the node is shut down. By default the cleaning won't do a full shred of the disk to a security standard, so the wipefs command itself should be quick; however, there will be an extra boot between cycles, which takes time that wasn't taken before. See the docs [3] for additional details.

[1] https://github.com/openstack/ironic/blob/master/etc/ironic/ironic.conf.sample#L956-L969
[2] https://github.com/openstack/ironic-lib/blob/4ae48d0b212c16c8b49d4f1144c073b3a3206597/ironic_lib/disk_utils.py#L360
[3] http://docs.openstack.org/developer/ironic/deploy/cleaning.html
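The metadata wipe can be reproduced in miniature on an image file. A sketch of what the cleaning step does to each disk, using a swap signature (written by util-linux mkswap) as the stand-in for old Ceph/GPT metadata; the file path is made up:

```shell
# Create a small image file and give it a known filesystem signature.
truncate -s 8M /tmp/fake-disk.img
mkswap /tmp/fake-disk.img >/dev/null 2>&1   # leaves a swap signature behind

# The same command Ironic's cleaning runs per disk: erase all signatures.
wipefs --force --all /tmp/fake-disk.img

# With no signatures left, wipefs prints nothing.
wipefs /tmp/fake-disk.img
```

This only zaps signature bytes (partition table magic, filesystem superblock magic), not the whole disk, which is why the default clean is fast compared to a full shred.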
Hi all, I've verified that clean_nodes is working in OSP11. Nodes are wiped:

1) when first imported (openstack baremetal import --json /home/stack/instackenv.json)
2) when nodes are bulk introspected (openstack baremetal introspection bulk start)
3) when a node is deleted (openstack stack delete ospte --yes --wait)

I redeployed and confirmed that the wipe on delete is successful, with a working Ceph cluster w/ a new FSID after the deployment.

I've attached a full log of the steps. The CMD lines are what was run, with 'stack list', 'ironic node-list', and 'nova list' updating every 15 seconds. I've trimmed the log to only reflect changes in state.

Happy that this works! Wondering if it is really necessary to clean on both steps 1 and 2 though, since that is a little redundant and adds time to the overall deployment.
Created attachment 1248663 [details] commands and node state changes during a deployment
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.
*** Bug 1377867 has been marked as a duplicate of this bug. ***
Verified. Setting clean_nodes=true in undercloud.conf wipes the disks of the overcloud nodes clean.
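For reference, a minimal sketch of the relevant undercloud.conf fragment (the option name comes from this bug; the section placement and comment are illustrative):

```ini
[DEFAULT]
# Wipe disk metadata from overcloud nodes during Ironic cleaning, so a
# stale Ceph FSID cannot survive into the next deploy.
clean_nodes = true
```

Note that, as with other undercloud.conf changes, the setting is expected to take effect only after the undercloud installation is re-run.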
Published: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/red_hat_ceph_storage_for_the_overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT
*** Bug 1570584 has been marked as a duplicate of this bug. ***