Bug 1251264
Summary: | Unable to redeploy an overcloud if I interrupt the current stack deployment | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Gaëtan Trellu <gtrellu>
Component: | openstack-ironic | Assignee: | Lucas Alvares Gomes <lmartins>
Status: | CLOSED NOTABUG | QA Contact: | Toure Dunnon <tdunnon>
Severity: | low | Docs Contact: |
Priority: | urgent | |
Version: | 8.0 (Liberty) | CC: | chlong, emacchi, gchenuet, gtrellu, kbasil, lmartins, mburns, mcornea, nauvray, rhel-osp-director-maint, srevivo, vmindru
Target Milestone: | ga | Keywords: | FutureFeature, Triaged
Target Release: | 9.0 (Mitaka) | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Enhancement
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-05-16 09:40:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Gaëtan Trellu
2015-08-06 20:47:04 UTC
There is some upstream work going on to fix this but it might be a little further out than our A1 release.

Oops, I led Chris astray here because I thought this was bug 1253773, but it isn't. Looks like something for the Ironic team to investigate... unfortunately it looks like that pastebin has expired (at least, it's empty). Any chance of reposting? I'll clear the target milestone so it gets re-triaged.

Please find the paste without expiration: http://pastebin.test.redhat.com/306571

2015-08-04 14:44:42.398 11538 DEBUG ironic.common.utils [-] Command stderr is: "iscsiadm: invalid error code 65280
2015-08-04 14:44:46.865 11538 DEBUG ironic.common.utils [-] Command stderr is: "iscsiadm: invalid error code 65280
2015-08-04 14:44:48.016 11538 DEBUG ironic.common.utils [-] Command stderr is: "iscsiadm: invalid error code 65280
2015-08-04 14:44:49.446 11538 DEBUG ironic.drivers.modules.deploy_utils [-] Unable to stat device /dev/disk/by-path/ip-192.0.2.6:3260-iscsi-iqn-7e21bd17-922d-45b4-9baf-e7f780b7fbca-lun-1-part2. Attempt 1 out of 3. Error: [Errno 2] No such file or directory: '/dev/disk/by-path/ip-192.0.2.6:3260-iscsi-iqn-7e21bd17-922d-45b4-9baf-e7f780b7fbca-lun-1-part2' is_block_device /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:299
2015-08-04 14:44:50.446 11538 DEBUG ironic.drivers.modules.deploy_utils [-] Unable to stat device /dev/disk/by-path/ip-192.0.2.7:3260-iscsi-iqn-8301e620-b6af-42c8-be41-b168f7dbad34-lun-1-part2. Attempt 1 out of 3. Error: [Errno 2] No such file or directory: '/dev/disk/by-path/ip-192.0.2.7:3260-iscsi-iqn-8301e620-b6af-42c8-be41-b168f7dbad34-lun-1-part2' is_block_device /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:299
2015-08-04 14:48:48.128 11561 WARNING wsme.api [-] Client-side error: Node 23a87a60-35f1-4031-9e7a-60f04e8d3295 is locked by host localhost.localdomain, please retry after the current operation is completed.
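The last warning above is Ironic's node lock: the API refuses the request while another operation holds the node, and tells the client to "please retry after the current operation is completed". The retry a caller would need can be sketched like this; `NodeLocked` and `call_with_retry` are hypothetical names for illustration, not part of python-ironicclient.

```python
import time


class NodeLocked(Exception):
    """Stand-in for the 'Node ... is locked by host ...' client-side error."""


def call_with_retry(request, attempts=5, delay=1.0, backoff=2.0):
    """Retry a callable that may fail while the node lock is held.

    NodeLocked failures are retried with exponential backoff; any other
    error propagates immediately. The last failure is re-raised once the
    attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except NodeLocked:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= backoff
```

This is only a sketch of the retry loop the warning asks for; a real client would match on the lock error returned by the Ironic API rather than a local exception class.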
(just in case the paste expires again)

Hi Gaëtan,

I will need some information to get my head around the issue. In Ironic we do not have a way to gracefully interrupt a deployment at any moment [0]; regardless of the situation, it will eventually error out (or time out). So the problem you are seeing is:

2015-08-04 14:44:50.446 11538 DEBUG ironic.drivers.modules.deploy_utils [-] Unable to stat device /dev/disk/by-path/ip-192.0.2.7:3260-iscsi-iqn-8301e620-b6af-42c8-be41-b168f7dbad34-lun-1-part2. Attempt 1 out of 3. Error: [Errno 2] No such file or directory: '/dev/disk/by-path/ip-192.0.2.7:3260-iscsi-iqn-8301e620-b6af-42c8-be41-b168f7dbad34-lun-1-part2' is_block_device /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:299

This is coming from [1]; you can see in that method that after we exhaust the number of attempts to find that iSCSI block device, it raises an InstanceDeployFailure, so the deploy errors out.

Now, a question: when you try to redeploy, what is the state of the nodes in Ironic? Please attach the output of the following command:

$ ironic node-list

I will assume that some nodes are in the "deploy failed" state (or some other error state). If that's the case, you will need to tear down the errored nodes until all nodes are in the "available" state again. You can do that by issuing the following command:

$ ironic node-set-provision-state <UUID or name> deleted

Let me know if that works for you.

...

A suggestion here would be to have a sanity check prior to trying to deploy the cloud (be it a custom script that the operator runs, or Heat): this check would fetch the list of nodes from Ironic and make sure that at least as many nodes are in the "available" state as are requested for the deployment (e.g. for 2 controllers and 2 computes, we need at least 4 nodes to be "available" in Ironic).

...
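The sanity check suggested above can be sketched as a small helper that only needs the provision state of each node. The function name and the dict shape are assumptions for illustration; a real script would get the states from `ironic node-list` or python-ironicclient.

```python
def enough_available_nodes(nodes, requested):
    """Check that a deployment request can be satisfied.

    `nodes` is a list of dicts, one per Ironic node, from which only the
    'provision_state' key is used; `requested` is the total node count the
    overcloud deploy asks for, e.g. 2 controllers + 2 computes = 4.
    """
    available = sum(1 for n in nodes
                    if n.get("provision_state") == "available")
    return available >= requested
```

Running this before `openstack overcloud deploy` (or wiring the equivalent check into Heat) would fail fast with a clear message instead of letting the stack deployment hit nodes stuck in "deploy failed".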
There are some attempts to implement a way to abort the deployment of nodes in Ironic at any stage [2] but, given the architecture of the deployment in Ironic, that may not be an immediate abort; most likely we will implement it so that the abort is deferred until later, because that's the safest thing to do [3].

[0] We do allow the deployment to be interrupted while the node is in the "wait call-back" provision state. Once it moves to "deploying" (e.g. when it is copying the image via iSCSI), the work happens on a background thread, making it hard to interrupt.
[1] https://github.com/openstack/ironic/blob/stable/liberty/ironic/drivers/modules/deploy_utils.py#L330-L347
[2] https://review.openstack.org/#/c/203660/
[3] We can't simply interrupt some actions on bare metal; e.g. if we are flashing firmware, an interruption may result in bricking the machine.

Hope that helps,
Lucas

Hi Lucas,

Thanks for your reply. I don't remember the state of the nodes (it was six months ago). We were on 7.0 GA, so maybe the behavior has changed since.

Thanks for all the links!
Gaëtan

Hi,

Ok, since we can't reproduce this problem (it was six months ago) I will close this bug. Feel free to re-open it if the problem appears again.

Thanks
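For reference, the retry behaviour described in [1] (stat the iSCSI device path, retry a fixed number of attempts, then raise InstanceDeployFailure) can be sketched as follows. This is a simplified illustration of the pattern, not the actual deploy_utils code; the function name and defaults are assumptions.

```python
import os
import stat
import time


class InstanceDeployFailure(Exception):
    """Raised when the iSCSI block device never shows up."""


def wait_for_block_device(dev, attempts=3, delay=1.0):
    """Poll for `dev` to appear as a block device.

    Each failed stat produces an "Unable to stat device ... Attempt N out
    of 3"-style situation; once the attempts are exhausted an
    InstanceDeployFailure is raised and the deploy errors out.
    """
    for attempt in range(1, attempts + 1):
        try:
            info = os.stat(dev)
            if stat.S_ISBLK(info.st_mode):
                return True
        except OSError:
            pass  # device path not there yet, retry
        if attempt < attempts:
            time.sleep(delay)
    raise InstanceDeployFailure(
        "%s is not a block device after %d attempts" % (dev, attempts))
```

In the bug above, the "invalid error code 65280" lines from iscsiadm suggest the iSCSI login itself failed after the interrupted deploy, so the device path never appeared and this retry loop exhausted its attempts.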