Bug 1418040 - Verify impact of Ironic cleaning on Ceph as deployed by OSPd
Summary: Verify impact of Ironic cleaning on Ceph as deployed by OSPd
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 11.0 (Ocata)
Assignee: David Critch
QA Contact: Yogev Rabl
Docs Contact: Don Domingo
URL:
Whiteboard:
Duplicates: 1377867 1570584 (view as bug list)
Depends On:
Blocks: 1377867 1387433 ciscoosp11bugs 1432309
 
Reported: 2017-01-31 16:52 UTC by John Fulton
Modified: 2022-03-13 14:39 UTC
CC List: 19 users

Fixed In Version:
Doc Type: Release Note
Doc Text:
Clone Of:
Clones: 1432309 (view as bug list)
Environment:
Last Closed: 2017-07-17 06:59:50 UTC
Target Upstream Version:
Embargoed:


Attachments
commands and node state changes during a deployment (63.57 KB, text/plain), attached 2017-02-08 20:59 UTC by David Critch


Links
Red Hat Issue Tracker OSP-13561 (last updated 2022-03-13 14:39:37 UTC)

Internal Links: 1377867

Description John Fulton 2017-01-31 16:52:40 UTC
- When deploying Ceph with OSPd it may be necessary to format Ceph Storage node disks to GPT with a first-boot script as documented in OSP10 [1]. 

- The new default behavior in OSP11 is for Ironic to clean the disks [2] when a new node is set to available. 

- Thus the first-boot script _may_ not be necessary. 

This BZ tracks testing done by DFG:Ceph to verify if the first-boot script is still necessary given the new change in Ironic.  
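
For context, the first-boot approach in [1] boils down to zapping the data disks once on first boot. A minimal sketch of that kind of wipe (the device names and loop below are illustrative assumptions, not the documented template):

  # Illustrative only: per-disk wipe of the sort the first-boot script performs
  # (device names are assumptions for this sketch).
  for disk in /dev/sdb /dev/sdc /dev/sdd; do
      sgdisk -Z "$disk"      # zap GPT and MBR data structures
      partprobe "$disk"      # re-read the now-empty partition table
  done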

Footnotes:

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT

[2] https://github.com/openstack/ironic-lib/blob/4ae48d0b212c16c8b49d4f1144c073b3a3206597/ironic_lib/disk_utils.py#L360

Comment 1 John Fulton 2017-01-31 18:41:41 UTC
Setting status to ON_QA as the feature has been implemented and needs to be tested.

Comment 2 John Fulton 2017-01-31 20:13:27 UTC
The scenario that OSPd users often run into when deploying Ceph OSDs is: 

0. Ironic introspects hardware and sets node to available
1. Deploy overcloud with OSDs (new FSID X is generated)
2. If disks are factory clean, then they are made into OSDs w/ FSID X
3. Run 'openstack stack delete overcloud' to test a new deploy
4. Deploy overcloud with OSDs (new FSID Y is generated)
5. Because X!=Y the deploy fails with "Exec[ceph-osd-check-fsid-mismatch-/dev/sde] has failures"

To work around this, users have used a first-boot script [1] so that
when step #1 above is run the disks are wiped.

We expect that the steps will be changed in OSP11 as follows:

0. Ironic introspects hardware and sets node to available
   (#0 will invoke cleaning every time the node enters the
   'available' state, i.e. the pool of nodes ready to be scheduled)
1. Deploy overcloud with OSDs (new FSID X is generated)
2. Disks are clean and are made into OSDs w/ FSID X
3. Run 'openstack stack delete overcloud' to test a new deploy
   (#3 runs 'nova delete' which results in the disks getting cleaned before the next deploy)
4. Deploy overcloud with OSDs (new FSID Y is generated)
5. Disks are clean and are made into OSDs w/ FSID Y

and so on ....
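
A rough command-level sketch of the cycle above (template and environment arguments are abbreviated and deployment-specific):

  openstack overcloud deploy --templates ...      # steps 1-2: deploy, OSDs created with FSID X
  openstack stack delete overcloud --yes --wait   # step 3: triggers nova delete -> Ironic cleaning
  openstack baremetal node list                   # nodes should pass through 'cleaning' back to 'available'
  openstack overcloud deploy --templates ...      # steps 4-5: redeploy, OSDs created with FSID Y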

The next step in this bug is to verify that the steps above work as described.

Comment 3 John Fulton 2017-01-31 21:08:06 UTC
More info on the Ironic change: 

In OSP11, Ironic's automated_clean [1] should default to true and it
will run `wipefs --force --all` to delete the disk metadata [2]. This
should get rid of previous GPT or other labels so 'ceph-disk prepare'
can set a GPT label, but that is what needs to be tested. We expect
the following to be part of the new deployment cycle:

 Introspection -> Ironic cleaning -> Nova boot (optionally Nova stop/start)
 Nova delete -> Ironic cleaning -> Nova boot -> Nova delete ...

During Ironic cleaning the node is booted into a RAM disk so that
wipefs can be run, and then the node is shut down. By default the
cleaning does not do a full shred of the disk to a security standard,
so the wipefs command itself should be quick, but there will be an
extra boot between cycles which will take time that wasn't taken
before. See the docs [3] for additional details. 

[1] https://github.com/openstack/ironic/blob/master/etc/ironic/ironic.conf.sample#L956-L969
[2] https://github.com/openstack/ironic-lib/blob/4ae48d0b212c16c8b49d4f1144c073b3a3206597/ironic_lib/disk_utils.py#L360
[3] http://docs.openstack.org/developer/ironic/deploy/cleaning.html
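
For reference, a minimal sketch of the knob and of what the metadata erase amounts to (values are illustrative; check the sample config in [1] and the code in [2] for the authoritative behavior):

  # ironic.conf on the undercloud
  [conductor]
  automated_clean = true

  # the metadata erase in [2] boils down to, per target disk:
  wipefs --force --all /dev/sdX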

Comment 4 David Critch 2017-02-08 20:58:42 UTC
Hi all,

I've verified that clean_nodes is working in OSP11. 

Nodes are wiped:
1) when first imported (openstack baremetal import --json /home/stack/instackenv.json)
2) when nodes are bulk introspected (openstack baremetal introspection bulk start)
3) when a node is deleted (openstack stack delete ospte --yes --wait)

I redeployed and confirmed that the wipe on delete is successful, with a working ceph cluster w/ a new fsid after the deployment.

I've attached a full log of the steps. The CMD lines show what was run, with stack list, ironic node-list, and nova list output refreshed every 15 seconds. I've trimmed the log to only reflect changes in state.

Happy that this works! Wondering if it is really necessary to clean on both steps 1 and 2 though, since it is a little redundant and adds time to the overall deployment.
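
(For reference, the attached log was captured with something along the lines of the loop below; the exact script is an assumption, only the 15 second interval comes from the comment above.)

  while true; do
      date
      openstack stack list
      ironic node-list    # provision state moves through 'cleaning' -> 'available'
      nova list
      sleep 15
  done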

Comment 5 David Critch 2017-02-08 20:59:31 UTC
Created attachment 1248663 [details]
commands and node state changes during a deployment

Comment 9 Red Hat Bugzilla Rules Engine 2017-03-01 17:02:56 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 10 John Fulton 2017-03-10 18:55:24 UTC
*** Bug 1377867 has been marked as a duplicate of this bug. ***

Comment 21 Yogev Rabl 2017-04-07 17:04:16 UTC
Verified.
Setting clean_nodes=true in undercloud.conf wipes the disks of the overcloud nodes clean.
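
For anyone reproducing this: the setting lives in the [DEFAULT] section of undercloud.conf and is applied by re-running the undercloud install (a minimal illustration, not a complete config):

  # undercloud.conf
  [DEFAULT]
  clean_nodes = true

  # re-apply the undercloud configuration
  openstack undercloud install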

Comment 24 John Fulton 2018-05-02 14:28:37 UTC
*** Bug 1570584 has been marked as a duplicate of this bug. ***

