Bug 1668028
Summary: | Failure to introspect with OSP13 | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | David Hill <dhill> |
Component: | openstack-tripleo-common | Assignee: | Steve Baker <sbaker> |
Status: | CLOSED ERRATA | QA Contact: | Alistair Tonner <atonner> |
Severity: | high | Docs Contact: | |
Priority: | low | | |
Version: | 13.0 (Queens) | CC: | bfournie, dhill, jjoyce, jkreger, jschluet, kobi.ginon, mburns, mgarciac, mzheng, pkesavar, pweeks, sbaker, slinaber, tvignaud, ukalifon |
Target Milestone: | z14 | Keywords: | Reopened, Triaged, ZStream |
Target Release: | 13.0 (Queens) | Flags: | pweeks: needinfo+ |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | openstack-tripleo-common-8.7.1-25.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-12-16 13:55:05 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description (David Hill, 2019-01-21 17:55:13 UTC)
So the node is locked. Is this happening on all the nodes? Is there an issue with the BMCs responding? Can we get Ironic logs?

Dave, do you have more info on this? It's normal for the nodes to be locked for a short period of time. We don't really have anything to go on.

The case seems closed; please reopen if you have the requested logs.

Hi David, I wonder why this issue was closed. It happens in about 1 of 15 deployments. Is there a workaround?

Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=4e2f2d80-76a4-479e-ad62-088164efc367, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes=' {u'client_method_name': u'node.set_provision_state'} ', params='{u'state': u'provide', u'node_uuid': u'30925bac-bc22-4e74-bd40-de7e77b82db3'}']\n IronicAction.node.set_provision_state failed: Node 30925bac-bc22-4e74-bd40-de7e77b82db3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)"}, {}, {}, {}, {}]
No JSON object could be decoded
2020-04-01 09:36:47,227 - CbisDeployment - ERROR - e is: error occurred during command: openstack overcloud node provide --all-manageable
error: Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=4e2f2d80-76a4-479e-ad62-088164efc367, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes=' {u'client_method_name': u'node.set_provision_state'} ', params='{u'state': u'provide', u'node_uuid': u'30925bac-bc22-4e74-bd40-de7e77b82db3'}']\n IronicAction.node.set_provision_state failed: Node 30925bac-bc22-4e74-bd40-de7e77b82db3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)"}, {}, {}, {}, {}]
No JSON object could be decoded
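The "locked by host undercloud.localdomain" part of that 409 is Ironic's per-node reservation field, which names the conductor currently holding the lock. Purely as an illustration, a small python-ironicclient helper along these lines can watch the lock from the undercloud; the helper name, timeout and poll interval are made up here, and the client object is assumed to be already authenticated:

```python
import time

# Node UUID taken from the 409 message above.
NODE_UUID = '30925bac-bc22-4e74-bd40-de7e77b82db3'


def wait_for_unlock(ironic, node_uuid, timeout=600, interval=10):
    """Poll a node until its Ironic reservation (conductor lock) clears."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        node = ironic.node.get(node_uuid)
        if not node.reservation:
            # Lock released; a retried provide should no longer hit HTTP 409.
            return node.provision_state
        # node.reservation holds the conductor hostname, e.g.
        # "undercloud.localdomain" as in the error above.
        time.sleep(interval)
    raise RuntimeError('Node %s is still locked after %s seconds'
                       % (node_uuid, timeout))
```

Once the reservation comes back empty, retrying openstack overcloud node provide for that node should go through.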
It was closed because the customer reporting this issue closed the case. Are you using the latest bits? Did you open a support case?

Yes, the latest. I'm not opening a ticket, just wondering whether a workaround exists.

I'd suggest you read comment #2 and open a support case if you hit this issue.

The supplied logs show the actions to set BMC state took 80-300+ seconds, with some requests even backing up, because it seems requests are being sent to make changes and manipulate the node faster than its configuration can actually be applied. This appears to be due to configuration jobs having to be executed to change the running state of the BMC configuration, i.e. "change the boot device when power cycling" or "boot in UEFI mode or BIOS mode". Unfortunately it is hard for me to classify this as a bug in Ironic; Ironic seems to be doing the needful as fast as possible. Which realistically leaves two options (see the sketch after this comment):

1) tripleo-common needs to instantiate Ironic's client with more retries. Currently this is 12 retries every 5 seconds, or 60 seconds in total.
2) Or tripleo-common's code could poll and see whether the node is locked before issuing actions.

One additional item: often these delays are due to power-on self-test memory checks while the configuration job is being applied, as the machine boots into the BIOS. Ensure that memory testing is disabled in the BIOS firmware settings, which should speed up the configuration times and may actually make this issue go away.
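A rough sketch of what option 1 could look like, for illustration only: max_retries and retry_interval are the python-ironicclient keyword arguments that control its retry-on-HTTP-409 behaviour, but the values, the helper names and the way authentication is passed in are assumptions, not the actual tripleo-common code:

```python
from ironicclient import client as ironic_client


def get_patient_baremetal_client(**auth_kwargs):
    """Build an Ironic client with a longer 409-retry window.

    auth_kwargs stands in for whatever endpoint/token/session arguments
    the caller already uses; only the retry settings are the point here.
    """
    return ironic_client.get_client(
        1,
        max_retries=60,     # reported current value: 12
        retry_interval=10,  # reported current value: 5 seconds
        **auth_kwargs)


def provide(ironic, node_uuid):
    # With the longer retry window the client keeps retrying internally
    # while the node is still locked, instead of surfacing HTTP 409 to
    # the Mistral action after roughly 60 seconds.
    ironic.node.set_provision_state(node_uuid, 'provide')
```

Option 2 would amount to calling something like the wait_for_unlock() helper sketched earlier before each set_provision_state call.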
The introspection timeout is 20 minutes, which I assume is enough even for slow servers. The problem seems to be that after a successful introspection the provide fails because the node is still locked. It is not clear to me whether the node stays locked indefinitely. If the introspection finishes (failure or success) and the node remains locked, then that is something we need to investigate.

If the lock is *eventually* released, I would suggest the following as a workaround. Stop specifying --provide as an argument and run the provide command separately, so

openstack overcloud node introspect --all-manageable
openstack overcloud node provide --all-manageable

or

openstack overcloud node introspect <node1> <node2> ...
openstack overcloud node provide <node1> <node2> ...

If the provide command fails with the locked-node error, wait, then try again.

We need to know how long this lock lingers after introspection completes. Is it seconds, triggering a race when doing introspect --provide? Is it minutes? Does it remain on that node forever? (Setting NEEDINFO for this information.)

Hi, regarding running the command separately (openstack overcloud node provide <node1> <node2> ...): we do see that usually 1 or 2 blades are stuck with this lock, and the command openstack overcloud node provide --all-manageable will wait forever for those blades. If, while it is hanging, we identify those blades and run a separate command (while openstack overcloud node provide --all-manageable is still executing), for example openstack overcloud node provide <node1> (assuming node1 was the locked one), then surprisingly the node changes to available, but the still-hanging command openstack overcloud node provide --all-manageable does not go through until it hits the total timeout of the deployment.

Important note: running each node separately is something we want to avoid, since for a large number of blades it is not practical. I also suspect that once the command openstack overcloud node provide --all-manageable identifies a lock, it does not release something internally and waits forever. Regards.

Interesting, I'll look into why provide --all-manageable hits lock issues when provide <node> does not.

Just updating this: I've proposed a potential fix upstream [1], which can be backported if it is shown to help.

[1] https://review.opendev.org/#/c/745991/

I haven't seen this issue in a while now. If I see it again, I'll update this BZ. The attached case is closed, so we can't ask the customer to test it for us either.

Attached is an upstream stable/queens fix which should help with this issue. OSP-17 will have a similar fix, but in Ansible instead of Mistral. This bug is to track the OSP-13 backport of bug #1848560.

The fix for this is ready for rhpkg push when this bug gets its +1 flags.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5575