Bug 1713790
| Summary: | Re-deployment of overcloud fails with "Failure prepping block device" | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Bob Fournier <bfournie> |
| Component: | openstack-ironic | Assignee: | RHOS Maint <rhos-maint> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | mlammon |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 13.0 (Queens) | CC: | bfournie, chshort, dhill, hjensas, jkreger, lbragsta, mburns |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-12-11 01:29:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Bob Fournier 2019-05-24 20:29:04 UTC
When you delete, how long are you waiting? Wondering if cleanup/undeployment is actually completing... Seems like Nova thinks it can use the node. Any indication of what is actually occurring inside the ironic conductor, or of the present state of the node?

I discussed this issue with an associate, and essentially:
1) The deploy failed without performing a deployment.
2) This resulted in an orphaned vif.
3) Nova gave up after _2_ tries to remove the vif when it gave up on the deployment; the removal never happened because the physical machine was locked due to power sync.

So to move past this:
1) Set [ironic]api_max_retries=30 (or 60, which should be the default, but the deployment for some reason was only trying twice).
2) Restart the openstack-nova-compute service.
3) Ensure that there are no orphaned vifs on unused baremetal nodes showing in the available state, using "openstack baremetal node vif list" and "openstack baremetal node vif detach" as necessary (a rough sketch of scripting this step appears at the end of this report).
4) Retry the deployment.

Could it be possible that this line is wrong:
virt/ironic/client_wrapper.py: max_retries = CONF.ironic.api_max_retries if retry_on_conflict else 1
?
Or maybe the above is right and what's below is wrong, and if retry_on_conflict is False it would only try once (or twice if there's a logic bug)?
    try:
        # FIXME(lucasagomes): The "retry_on_conflict" parameter was added
        # to basically causes the deployment to fail faster in case the
        # node picked by the scheduler is already associated with another
        # instance due bug #1341420.
        self.ironicclient.call('node.update', node.uuid, patch,
                               retry_on_conflict=False)
    ...
    for attempt in range(0, last_attempt + 1):
        try:
            self.ironicclient.call("node.vif_attach", node.uuid,
                                   port_id, retry_on_conflict=False)
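As an aside, here is a minimal, illustrative sketch of the retry arithmetic in question. This is not the real nova client_wrapper; the names call_with_retries and always_conflict are invented for the example. It only shows that when retry_on_conflict is False the effective max_retries drops to 1, and a loop of the form range(0, last_attempt + 1) then makes exactly two attempts, which would line up with Nova giving up after only two tries:

    # Illustrative sketch only -- not the real nova client_wrapper.
    class Conflict(Exception):
        """Stand-in for the ironic client's HTTP 409 Conflict exception."""


    def call_with_retries(func, api_max_retries=60, retry_on_conflict=True):
        # Mirrors the pattern quoted above: api_max_retries is only honoured
        # when retry_on_conflict is True; otherwise the cap drops to 1.
        max_retries = api_max_retries if retry_on_conflict else 1
        last_attempt = max_retries
        for attempt in range(0, last_attempt + 1):  # 0..last_attempt inclusive
            try:
                return func()
            except Conflict:
                if attempt == last_attempt:
                    raise
                # (the real wrapper waits between retries here)


    def always_conflict():
        always_conflict.calls += 1
        raise Conflict()


    always_conflict.calls = 0
    try:
        call_with_retries(always_conflict, retry_on_conflict=False)
    except Conflict:
        pass
    print(always_conflict.calls)  # prints 2: one call plus a single retry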
That retry handling does seem to be part of the conundrum. The issue they encountered was in two places. The first was the initial information setting, which failed because the node was locked, and a lock returns a conflict as the reply. The second was upon detaching the vif: the client_wrapper gave up after only two retries when it _should_ keep trying for a while, because if that record is not removed it becomes orphaned. That detach is one of the last interactions with ironic, if not the very last, and no state change ever occurred that would let ironic know it should go clean up the record. :(

Dave - is there any chance of retrying this at the customer site with a change to api_max_retries as suggested? Set [ironic]api_max_retries=30 (or 60, which should be the default, but the deployment for some reason was only trying twice).

I'll test this in my lab as soon as I have the chance. I have some other issues these days with other bugs. I'll keep this BZ as needinfo to me for the time being.

Dave - closing this for now as we don't really have anything to go on except for the recommendation for the api_max_retries config change. Please reopen if we can duplicate this.
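For anyone hitting this again, step 3 of the suggested workaround could be scripted along the following lines. This is only a rough sketch: it shells out to the same "openstack baremetal node vif list" / "vif detach" commands mentioned above, and it assumes admin credentials are already loaded in the environment and that the default JSON column names ("UUID", "Provisioning State", "ID") apply to your client version.

    # Rough sketch of step 3 above: detach any VIFs still attached to
    # baremetal nodes that are sitting in the "available" state. Assumes the
    # openstack CLI is installed and admin credentials are sourced; the JSON
    # column names used below may differ between client releases.
    import json
    import subprocess


    def openstack_json(*args):
        """Run an openstack CLI command and return its parsed JSON output."""
        out = subprocess.check_output(("openstack",) + args + ("-f", "json"))
        return json.loads(out)


    for node in openstack_json("baremetal", "node", "list"):
        if node.get("Provisioning State") != "available":
            continue
        for vif in openstack_json("baremetal", "node", "vif", "list", node["UUID"]):
            print("Detaching orphaned VIF %s from node %s" % (vif["ID"], node["UUID"]))
            subprocess.check_call(["openstack", "baremetal", "node", "vif",
                                   "detach", node["UUID"], vif["ID"]])

Steps 1 and 2 of the workaround (setting api_max_retries=30 under the [ironic] section of nova.conf and restarting the openstack-nova-compute service) remain manual configuration changes.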