Description of problem:
I registered bare metal nodes and got no error. I then tried to configure boot for them and got an error about 2 of the nodes saying they are "locked" by the server. It turns out that these 2 nodes are really unresponsive and you can't even log into their management interface, so for a while there was nothing I could do except delete the nodes and proceed with the rest... However, you can't delete them either when they are unresponsive:

ERROR: openstack Node aee1f80b-5c52-418e-a092-e34572fa88ba is locked by host ****.redhat.com, please retry after the current operation is completed.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 142, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 1282, in destroy_node
    purpose='node deletion') as task:
  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 147, in acquire
    driver_name=driver_name, purpose=purpose)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 229, in __init__
    self.release_resources()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 212, in __init__
    reserve_node()
  File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
    raise attempt.get()
  File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 199, in reserve_node
    self.node = objects.Node.reserve(context, CONF.host, node_id)
  File "/usr/lib/python2.7/site-packages/ironic/objects/base.py", line 109, in wrapper
    result = fn(cls, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/objects/node.py", line 193, in reserve
    db_node = cls.dbapi.reserve_node(tag, node_id)
  File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 226, in reserve_node
    host=node['reservation'])
NodeLocked: Node aee1f80b-5c52-418e-a092-e34572fa88ba is locked by host ****.redhat.com, please retry after the current operation is completed. (HTTP 409)

In this situation, I am completely stuck and can't install on any of the nodes. I have to reprovision the setup and start from scratch.

Version-Release number of selected component (if applicable):
python-rdomanager-oscplugin-0.0.8-18.el7ost.noarch

How reproducible:
When the nodes are really stuck.

Steps to Reproduce:
1. Register nodes. I don't have a reproducible way to make some of them stuck.
2. Try to delete the failed nodes: openstack baremetal delete aee1f80b-...
3. Also try: ironic node-set-maintenance aee1f80b-... on

Actual results:
The delete operation is retried 61 times and fails over and over. The ironic node-set-maintenance operation hangs for a long time and then also generates a similar exception.

Expected results:
Allow deletion of nodes regardless of the hardware's health. In real life, nodes sometimes break down, and we can't allow the rest of the nodes (there could be thousands) to be stuck because of a single failure.

Additional info:
I see no errors or hints when running:
sudo journalctl -u openstack-ironic-api -u openstack-ironic-conductor
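For completeness, here are the exact commands attempted against one of the stuck nodes, using the UUID from the traceback above (just a recap of the steps, not a new reproduction recipe):

# Attempt to delete the unresponsive node -- retried 61 times, then fails with NodeLocked (HTTP 409)
openstack baremetal delete aee1f80b-5c52-418e-a092-e34572fa88ba

# Attempt to put it into maintenance instead -- hangs for a long time, then raises a similar exception
ironic node-set-maintenance aee1f80b-5c52-418e-a092-e34572fa88ba on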
I will need some more information about the states of the nodes to proceed. Can you please tell me the node-show output of both nodes?

$ ironic node-show <uuid>

And attach the ironic-conductor log as well, if possible.
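If it helps, something like the following should gather everything on the undercloud (the journalctl unit name is the one already mentioned in this report; adjust as needed):

# Current state of each stuck node
ironic node-show aee1f80b-5c52-418e-a092-e34572fa88ba

# Dump the conductor log to a file for attaching to this bug
sudo journalctl -u openstack-ironic-conductor > ironic-conductor.log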
Udi, see comment 3
Lucas, please provide the necessary workaround (updating the record manually in MySQL), and make sure the doc text exists for this. Removing this from the blocker list.
Hi @chris,

Right, yeah, this is a workaround that we really want to avoid, and I've been looking at states where we can get stuck and trying to fix them to avoid this. So please use it only as a last resort.

In case you're really stuck, please put the node back into the "available" state by modifying the database:

UPDATE nodes SET provision_state="available", target_provision_state=NULL WHERE uuid=<uuid>;

For example:

[stack@localhost devstack]$ sudo mysql -u root -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 91
Server version: 10.0.20-MariaDB MariaDB Server

Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use ironic;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [ironic]> UPDATE nodes SET provision_state="available", target_provision_state=NULL WHERE uuid="b76e1671-7a4c-4066-be7a-dc4e97c8dddd";
Query OK, 0 rows affected (0.06 sec)
Rows matched: 1  Changed: 0  Warnings: 0

MariaDB [ironic]> exit
Bye
(In reply to Lucas Alvares Gomes from comment #6)
> Hi @chris,
>
> Right, yeah, this is a workaround that we really want to avoid, and I've been
> looking at states where we can get stuck and trying to fix them to avoid
> this. So please use it only as a last resort.
>
> In case you're really stuck, please put the node back into the "available"
> state by modifying the database:
>
> UPDATE nodes SET provision_state="available", target_provision_state=NULL
> WHERE uuid=<uuid>;

Actually, it would be good to clean up the "reservation" field as well, in case the node is also locked by a specific conductor:

UPDATE nodes SET provision_state="available", target_provision_state=NULL, reservation=NULL WHERE uuid=<uuid>;
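A quick way to confirm the manual fix took effect is to read the same columns back before leaving the MariaDB session; a minimal sketch, reusing the example UUID from comment 6:

MariaDB [ironic]> UPDATE nodes SET provision_state="available", target_provision_state=NULL, reservation=NULL WHERE uuid="b76e1671-7a4c-4066-be7a-dc4e97c8dddd";
MariaDB [ironic]> SELECT uuid, provision_state, target_provision_state, reservation FROM nodes WHERE uuid="b76e1671-7a4c-4066-be7a-dc4e97c8dddd";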
@Lucas,

We start with the nodes unregistered and turned off. We register them with the command "openstack baremetal import --json instackenv.json", and then we see them in power state "off", provision state "available", and maintenance mode "off".

I'm assuming that the states we see are just the default ones you always get when you register new nodes, and that Ironic then works in the background to connect to the IPMI interfaces and update the real states of the machines. If the interfaces on some of the nodes are really down, those nodes will be "locked" when you try to configure boot for them... But of course, that's just how I see it, and I might be completely wrong because I don't really know how the code works.
Hi @Udi,

Well, kinda. The "available" provision state and maintenance "False" are defaults. But the power state when you register a node is actually None, so in the background (as a periodic task) Ironic will check the power state of the node every X seconds (defaults to 60 seconds) to see whether what it has in the database matches the actual state of the node [1].

It could be that this operation somehow got stuck for a long time, since in the version we use in ospd we acquire an exclusive lock at the beginning of this operation. Upstream, @Dmitry worked on minimizing the usage of exclusive locks for this problem [2], but this hasn't been backported yet.

A workaround for this lock problem right now would be to restart the ironic-conductor service; that will free up all the locks that specific conductor was holding, so you don't have to change the database or anything.

[1] https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L2139-L2165
[2] https://review.openstack.org/#/c/202562/
Sorry, forgot to say that.

(In reply to Lucas Alvares Gomes from comment #14)
> Hi @Udi,
>
> Well, kinda. The "available" provision state and maintenance "False" are
> defaults. But the power state when you register a node is actually None, so
> in the background (as a periodic task) Ironic will check the power state of
> the node every X seconds (defaults to 60 seconds) to see whether what it has
> in the database matches the actual state of the node [1].

Forgot to mention that if the state can't be synced, Ironic will put the node in maintenance mode to alert the operator that it can't manage it [1].

[1] https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L2139-L2165
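So once the management interface is reachable again, a node that Ironic parked in maintenance this way could be checked and released with something like the following (same commands already used in this report, just with "off" instead of "on"; a sketch, not a verified procedure):

# See whether Ironic flagged the node because it couldn't sync its power state
ironic node-show aee1f80b-5c52-418e-a092-e34572fa88ba | grep -i maintenance

# Clear the maintenance flag once the management interface responds again
ironic node-set-maintenance aee1f80b-5c52-418e-a092-e34572fa88ba off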
The patch has been merged upstream, and this should be fixed in y2.
Based on Udi's comment, the bug status is not up to date. Can anybody please fix that?
Patch is posted and merged upstream, but not yet backported. It was not part of the 2015.1.2 release, so it needs a manual backport.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1234