Bug 1251264

Summary: Unable to redeploy an overcloud if I interrupt the current stack deployment
Product: Red Hat OpenStack
Reporter: Gaëtan Trellu <gtrellu>
Component: openstack-ironic
Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED NOTABUG
QA Contact: Toure Dunnon <tdunnon>
Severity: low
Docs Contact:
Priority: urgent
Version: 8.0 (Liberty)
CC: chlong, emacchi, gchenuet, gtrellu, kbasil, lmartins, mburns, mcornea, nauvray, rhel-osp-director-maint, srevivo, vmindru
Target Milestone: ga
Keywords: FutureFeature, Triaged
Target Release: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-16 09:40:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Gaëtan Trellu 2015-08-06 20:47:04 UTC
Description of problem:

During a Heat stack deployment (an overcloud), if I interrupt the running deployment, for example with CTRL+C, I am no longer able to redeploy an overcloud.

Ironic raises an error about an iSCSI device that doesn't exist.

Version-Release number of selected component (if applicable):

rhos-release-0.65-1.noarch
python-rdomanager-oscplugin-0.0.8-43.el7ost.noarch
puddle images 2015-07-30.1

How reproducible:

Interrupt a Heat stack overcloud deployment by CTRL+C

Steps to Reproduce:

1. Interrupt the Heat stack deployment by CTRL+C
2. Delete the Heat stack in failed state
3. Redeploy the Heat stack

Actual results:

Unable to redeploy the Heat stack because Ironic complains about a missing iSCSI device.

Expected results:

Be able to redeploy the Heat stack.

Additional info:

Please find the error log output here: http://pastebin.test.redhat.com/303590

Comment 3 chris alfonso 2015-08-18 19:19:59 UTC
There is some upstream work going on to fix this but it might be a little further out than our A1 release.

Comment 4 Zane Bitter 2015-08-19 13:45:36 UTC
Oops, I led Chris astray here because I thought this was bug 1253773, but it isn't. Looks like something for the Ironic team to investigate... unfortunately it looks like that pastebin has expired (at least, it's empty). Any chance of reposting?

I'll clear the target milestone so it gets re-triaged.

Comment 5 Gaëtan Trellu 2015-08-19 13:49:03 UTC
Please find the paste without expiration: http://pastebin.test.redhat.com/306571

Comment 6 chris alfonso 2015-08-31 16:48:23 UTC
2015-08-04 14:44:42.398 11538 DEBUG ironic.common.utils [-] Command stderr is: "iscsiadm: invalid error code 65280
2015-08-04 14:44:46.865 11538 DEBUG ironic.common.utils [-] Command stderr is: "iscsiadm: invalid error code 65280
2015-08-04 14:44:48.016 11538 DEBUG ironic.common.utils [-] Command stderr is: "iscsiadm: invalid error code 65280
2015-08-04 14:44:49.446 11538 DEBUG ironic.drivers.modules.deploy_utils [-] Unable to stat device /dev/disk/by-path/ip-192.0.2.6:3260-iscsi-iqn-7e21bd17-922d-45b4-9baf-e7f780b7fbca-lun-1-part2. Attempt 1 out of 3. Error: [Errno 2] No such file or directory: '/dev/disk/by-path/ip-192.0.2.6:3260-iscsi-iqn-7e21bd17-922d-45b4-9baf-e7f780b7fbca-lun-1-part2' is_block_device /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:299
2015-08-04 14:44:50.446 11538 DEBUG ironic.drivers.modules.deploy_utils [-] Unable to stat device /dev/disk/by-path/ip-192.0.2.7:3260-iscsi-iqn-8301e620-b6af-42c8-be41-b168f7dbad34-lun-1-part2. Attempt 1 out of 3. Error: [Errno 2] No such file or directory: '/dev/disk/by-path/ip-192.0.2.7:3260-iscsi-iqn-8301e620-b6af-42c8-be41-b168f7dbad34-lun-1-part2' is_block_device /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:299
2015-08-04 14:48:48.128 11561 WARNING wsme.api [-] Client-side error: Node 23a87a60-35f1-4031-9e7a-60f04e8d3295 is locked by host localhost.localdomain, please retry after the current operation is completed.


just in case the paste expires again

Comment 8 Lucas Alvares Gomes 2016-02-03 16:26:42 UTC
Hi Gaëtan,

I will need some information to get my head around the issue. In Ironic we do not have a way to gracefully interrupt a deployment at an arbitrary moment [0]; regardless of the situation, the deployment will eventually error out (or time out).

So the problem you are seeing is:

2015-08-04 14:44:50.446 11538 DEBUG ironic.drivers.modules.deploy_utils [-] Unable to stat device /dev/disk/by-path/ip-192.0.2.7:3260-iscsi-iqn-8301e620-b6af-42c8-be41-b168f7dbad34-lun-1-part2. Attempt 1 out of 3. Error: [Errno 2] No such file or directory: '/dev/disk/by-path/ip-192.0.2.7:3260-iscsi-iqn-8301e620-b6af-42c8-be41-b168f7dbad34-lun-1-part2' is_block_device /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:299

This is coming from [1]. You can see in that method that after we exhaust the number of attempts to find the iSCSI block device, it raises an InstanceDeployFailure, so the deploy errors out.
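
To illustrate that retry-then-fail behaviour, here is a minimal sketch. It is not the Ironic source: the InstanceDeployFailure class below is a local stand-in, and the attempts and delay parameters are assumptions chosen to match the "Attempt 1 out of 3" message in the log.

```python
import os
import stat
import time


class InstanceDeployFailure(Exception):
    """Local stand-in for the Ironic exception of the same name."""


def is_block_device(dev, attempts=3, delay=1.0):
    """Return True if dev is a block device, retrying while it cannot be stat'ed."""
    for attempt in range(1, attempts + 1):
        try:
            mode = os.stat(dev).st_mode
        except OSError as err:
            # The device node may not exist yet (or anymore); log and retry.
            print("Unable to stat device %s. Attempt %d out of %d. Error: %s"
                  % (dev, attempt, attempts, err))
            time.sleep(delay)
        else:
            return stat.S_ISBLK(mode)
    # All attempts exhausted: the deployment of this node fails.
    raise InstanceDeployFailure(
        "Unable to stat device %s after %d attempts" % (dev, attempts))
```

If the iSCSI session was torn down by the interruption, the by-path device never appears, every attempt fails, and the exception marks the deploy as failed.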

Now a question: when you try to redeploy, what is the state of the nodes in Ironic? Please attach the output of the following command:

$ ironic node-list

I will assume that some nodes are in a "deploy failed" state (or some other "error" state). If that's the case, you will need to tear down the errored nodes until all nodes are in the "available" state again. You can do that by issuing the following command for each of them:

$ ironic node-set-provision-state <UUID or name> deleted
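
With several nodes this gets repetitive, so the commands could be generated by a small helper. This is only a sketch: the node list is assumed to be a list of dicts with "uuid" and "provision_state" keys (how you obtain it is up to you), and the set of states treated as failed is an assumption to adjust for your release.

```python
# States treated as "failed" here are an assumption, not an exhaustive list.
FAILED_STATES = {"deploy failed", "error"}


def teardown_commands(nodes):
    """Build one tear-down command per node that is in a failed state."""
    return [
        "ironic node-set-provision-state %s deleted" % node["uuid"]
        for node in nodes
        if node.get("provision_state") in FAILED_STATES
    ]


# Hypothetical sample data, for illustration only.
sample = [
    {"uuid": "node-a", "provision_state": "deploy failed"},
    {"uuid": "node-b", "provision_state": "available"},
]
for command in teardown_commands(sample):
    print(command)
```

Printing the commands instead of running them lets the operator review what will be torn down first.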

Let me know if that works for you.

...

A suggestion here would be to have a sanity check before trying to deploy the cloud, be it a custom script that the operator runs or a check in Heat. This check would fetch the list of nodes in Ironic and make sure that at least as many nodes are in the "available" state as the deployment requests (e.g. for 2 controllers and 2 computes, at least 4 nodes need to be "available" in Ironic).
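
A minimal sketch of such a check could look like the following. It assumes the node list has already been fetched and is represented as a list of dicts with a "provision_state" key; the function name and the role counts are hypothetical.

```python
def enough_available(nodes, requested_by_role):
    """Compare the number of "available" nodes with the number requested.

    Returns a (ok, available, requested) tuple.
    """
    available = sum(
        1 for node in nodes if node.get("provision_state") == "available"
    )
    requested = sum(requested_by_role.values())
    return available >= requested, available, requested


# Hypothetical inventory: three usable nodes and one left over from a failed deploy.
nodes = [
    {"provision_state": "available"},
    {"provision_state": "available"},
    {"provision_state": "available"},
    {"provision_state": "deploy failed"},
]
ok, available, requested = enough_available(
    nodes, {"controller": 2, "compute": 2})
if not ok:
    print("Only %d of %d required nodes are available" % (available, requested))
```

Running this before the deploy would flag the leftover failed node up front, instead of letting the new deployment fail later.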

...

There are some attempts to implement a way to abort the deployment of nodes in Ironic at any stage [2]. However, given the architecture of deployment in Ironic, that may not be an immediate abort: most likely we will implement it so that the abort is deferred until later, because that's the safest thing to do [3].


[0] We do allow the deployment to be interrupted when the node is in the "wait call-back" provision state. Once it moves to deploying (for example, when it is copying the image via iSCSI), the work is done on a background thread, which makes it hard to interrupt.

[1] https://github.com/openstack/ironic/blob/stable/liberty/ironic/drivers/modules/deploy_utils.py#L330-L347

[2] https://review.openstack.org/#/c/203660/

[3] We can't simply interrupt some actions on bare metal; e.g. if we are flashing firmware, interrupting may result in bricking the machine.

Hope that helps,
Lucas

Comment 9 Gaëtan Trellu 2016-02-04 14:12:36 UTC
Hi Lucas,

Thanks for your reply.
I don't remember the state of the nodes (it was six months ago).

We were in 7.0 GA, so maybe the behavior has changed since.

Thanks for all the links !

Gaëtan

Comment 10 Lucas Alvares Gomes 2016-05-16 09:40:50 UTC
Hi,

OK, since we can't reproduce this problem (it was 6 months ago), I will close this bug. Feel free to re-open it if the problem appears again.

Thanks