Bug 1418040

Summary: Verify impact of Ironic cleaning on Ceph as deployed by OSPd
Product: Red Hat OpenStack Reporter: John Fulton <johfulto>
Component: rhosp-director    Assignee: David Critch <dcritch>
Status: CLOSED CURRENTRELEASE QA Contact: Yogev Rabl <yrabl>
Severity: high Docs Contact: Don Domingo <ddomingo>
Priority: high    
Version: 11.0 (Ocata)    CC: akrzos, bengland, dbecker, ddomingo, dwilson, fhubik, gfidente, jefbrown, johfulto, jomurphy, jtaleric, mburns, morazi, rhel-osp-director-maint, sasha, sclewis, smalleni, twilkins, yrabl
Target Milestone: z1    Keywords: TestOnly, Triaged, ZStream
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Release Note
Doc Text:
Story Points: ---
Clone Of:
Clones: 1432309 Environment:
Last Closed: 2017-07-17 06:59:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1377867, 1387433, 1399824, 1432309    
Attachments:
  commands and node state changes during a deployment (flags: none)

Description John Fulton 2017-01-31 16:52:40 UTC
- When deploying Ceph with OSPd, it may be necessary to format Ceph Storage node disks to GPT with a first-boot script, as documented in OSP10 [1].

- The new default behavior in OSP11 is for Ironic to clean the disks [2] when a new node is set to available. 

- Thus the first-boot script _may_ not be necessary.

This BZ tracks testing done by DFG:Ceph to verify whether the first-boot script is still necessary given the new change in Ironic.
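
For context, a minimal sketch of the kind of disk wipe such a first-boot script performs (a hypothetical example, not the exact script from the OSP10 documentation in [1]; /dev/sdb and /dev/sdc are placeholder OSD disks):

  #!/bin/bash
  # Clear any existing partition table and re-label the OSD disks with
  # GPT so that 'ceph-disk prepare' can use them on first deployment.
  for disk in /dev/sdb /dev/sdc; do
      sgdisk --zap-all "$disk"        # destroy old GPT/MBR data structures
      parted -s "$disk" mklabel gpt   # write a fresh GPT label
  done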

Footnotes:

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT

[2] https://github.com/openstack/ironic-lib/blob/4ae48d0b212c16c8b49d4f1144c073b3a3206597/ironic_lib/disk_utils.py#L360

Comment 1 John Fulton 2017-01-31 18:41:41 UTC
Setting status to ON_QA as the feature has been implemented and needs to be tested.

Comment 2 John Fulton 2017-01-31 20:13:27 UTC
The scenario that OSPd users often run into when deploying Ceph OSDs is: 

0. Ironic introspects hardware and sets node to available
1. Deploy overcloud with OSDs (new FSID X is generated)
2. If disks are factory clean, then they are made into OSDs w/ FSID X
3. Run 'openstack stack delete overcloud' to test a new deploy
4. Deploy overcloud with OSDs (new FSID Y is generated)
5. Because X!=Y the deploy fails with "Exec[ceph-osd-check-fsid-mismatch-/dev/sde] has failures"

To work around this, users have used a first-boot script [1] so that
when step #1 above is run, the disks are wiped.

We expect that the steps will be changed in OSP11 as follows:

0. Ironic introspects hardware and sets node to available
   (#0 will invoke cleaning every time the node enters the
   pool of nodes ready to be scheduled, i.e. the 'available' state)
1. Deploy overcloud with OSDs (new FSID X is generated)
2. Disks are clean and are made into OSDs w/ FSID X
3. Run 'openstack stack delete overcloud' to test a new deploy
   (#3 runs 'nova delete' which results in the disks getting cleaned before the next deploy)
4. Deploy overcloud with OSDs (new FSID Y is generated)
5. Disks are clean and are made into OSDs w/ FSID Y

and so on ....

The next step in this bug is to verify that the steps above work as described.
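
For reference, a rough sketch of that redeploy cycle as run from the undercloud (standard OSP CLI commands; the environment file path is a placeholder for whatever storage environment the deployment actually uses):

  # Steps 1/2: deploy the overcloud with Ceph OSDs; a new FSID is generated
  openstack overcloud deploy --templates \
      -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml

  # Step 3: tear the deployment down; the nodes should pass through Ironic
  # cleaning on 'nova delete' before returning to 'available'
  openstack stack delete overcloud --yes --wait

  # Steps 4/5: redeploy; the freshly cleaned disks should become OSDs with
  # the new FSID, without the fsid-mismatch failure seen previously
  openstack overcloud deploy --templates \
      -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml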

Comment 3 John Fulton 2017-01-31 21:08:06 UTC
More info on the Ironic change: 

In OSP11, Ironic's automated_clean [1] should default to true and it
will run `wipefs --force --all` to delete the disk metadata [2]. This
should get rid of previous GPT or other labels so 'ceph-disk prepare'
can set a GPT label, but that is what needs to be tested. We expect
the following to be part of the new deployment cycle:

 Introspection -> Ironic cleaning -> Nova boot (optionally Nova stop/start)
 Nova delete -> Ironic cleaning -> Nova boot -> Nova delete ...

During Ironic cleaning the node is booted into a RAM disk so that wipefs
can be run, and then the node is shut down. By default the cleaning does
not do a full shred of the disk to a security standard, so the wipefs
command itself should be quick; still, there is an extra boot between
cycles, which takes time that wasn't needed before. See the docs [3]
for additional details.

[1] https://github.com/openstack/ironic/blob/master/etc/ironic/ironic.conf.sample#L956-L969
[2] https://github.com/openstack/ironic-lib/blob/4ae48d0b212c16c8b49d4f1144c073b3a3206597/ironic_lib/disk_utils.py#L360
[3] http://docs.openstack.org/developer/ironic/deploy/cleaning.html
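
To illustrate the metadata wipe itself, here is a standalone sketch on a scratch disk (not the Ironic code path; /dev/sdX is a placeholder and the commands are destructive):

  # Inspect existing signatures (GPT/MBR/filesystem superblocks) on the disk
  sudo wipefs /dev/sdX

  # The same operation Ironic cleaning performs per [2]: erase all signatures
  sudo wipefs --force --all /dev/sdX

  # Afterwards wipefs reports no signatures, so 'ceph-disk prepare' is free
  # to lay down a fresh GPT label on the next deployment
  sudo wipefs /dev/sdX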

Comment 4 David Critch 2017-02-08 20:58:42 UTC
Hi all,

I've verified that clean_nodes is working in OSP11. 

Nodes are wiped:
1) when first imported (openstack baremetal import --json /home/stack/instackenv.json)
2) when nodes are bulk introspected (openstack baremetal introspection bulk start)
3) when a node is deleted (openstack stack delete ospte --yes --wait)

I redeployed and confirmed that the wipe on delete is successful, with a working ceph cluster w/ a new fsid after the deployment.

I've attached a full log of the steps. The CMD lines show what was run, with 'stack list', 'ironic node-list', and 'nova list' output updating every 15 seconds. I've trimmed the log to only reflect changes in state.

Happy that this works! I'm wondering whether it is really necessary to clean in both steps 1 and 2, though, since that is a little redundant and adds time to the overall deployment.
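
For anyone reproducing this, a rough sketch of how the new fsid can be confirmed after a redeploy (hypothetical verification steps; 'controller-0' stands for any overcloud node address reachable as heat-admin):

  # Record the cluster fsid of the current deployment
  ssh heat-admin@controller-0 "sudo ceph fsid"

  # After 'openstack stack delete ... --yes --wait' and a redeploy, check
  # again: the fsid should differ and the cluster should be healthy,
  # showing the cleaned disks were reused without an fsid mismatch
  ssh heat-admin@controller-0 "sudo ceph fsid"
  ssh heat-admin@controller-0 "sudo ceph -s"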

Comment 5 David Critch 2017-02-08 20:59:31 UTC
Created attachment 1248663 [details]
commands and node state changes during a deployment

Comment 9 Red Hat Bugzilla Rules Engine 2017-03-01 17:02:56 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 10 John Fulton 2017-03-10 18:55:24 UTC
*** Bug 1377867 has been marked as a duplicate of this bug. ***

Comment 21 Yogev Rabl 2017-04-07 17:04:16 UTC
Verified.
Setting clean_nodes=true in undercloud.conf will wipe the disks of the overcloud nodes clean.
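
For reference, the relevant setting lives in undercloud.conf (a sketch; clean_nodes is the option verified above, and the undercloud needs to be re-installed with 'openstack undercloud install' for the change to take effect):

  [DEFAULT]
  # Wipe disk metadata on overcloud nodes whenever they are set to
  # 'available' (on import/introspection and after node deletion)
  clean_nodes = true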

Comment 24 John Fulton 2018-05-02 14:28:37 UTC
*** Bug 1570584 has been marked as a duplicate of this bug. ***