Bug 1241424

Summary:	Can't delete bare metal nodes that are stuck and unresponsive, or put them in maintenance
Product:	Red Hat OpenStack	Reporter:	Udi Kalifon <ukalifon>
Component:	openstack-ironic	Assignee:	Lucas Alvares Gomes <lmartins>
Status:	CLOSED ERRATA	QA Contact:	Toure Dunnon <tdunnon>
Severity:	urgent	Docs Contact:
Priority:	high
Version:	Director	CC:	david.costakos, dmacpher, dyocum, hbrock, jcoufal, jslagle, lmartins, mburns, nbarcet, ohochman, rhel-osp-director-maint, sclewis, srevivo, tcarlin, ukalifon
Target Milestone:	z5	Keywords:	Triaged, ZStream
Target Release:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-ironic-2015.1.2-3.el7ost	Doc Type:	Known Issue
Doc Text:	Sometimes bare metal nodes can lock into a certain state if ironic-conductor stops abruptly. This means users cannot delete these nodes or change their state. As a workaround, log into the director's database and use the following query to set the node back to "available" state and remove the lock: UPDATE nodes SET provision_state="available", target_provision_state=NULL, reservation=NULL WHERE uuid=<node uuid>;	Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-06-15 18:04:34 UTC	Type:	Bug

Description Udi Kalifon 2015-07-09 08:22:37 UTC

Description of problem:
I registered bare metal nodes and got no error. When I tried to configure boot for them, I got an error saying that 2 of the nodes are "locked" by the server. It turns out that these 2 nodes are really unresponsive and you can't even log into their management interface, so for the time being there was nothing I could do except delete the nodes and proceed with the rest. However, you can't delete them either when they are unresponsive:

ERROR: openstack Node aee1f80b-5c52-418e-a092-e34572fa88ba is locked by host ****.redhat.com, please retry after the current operation is completed.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 142, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 1282, in destroy_node
    purpose='node deletion') as task:

  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 147, in acquire
    driver_name=driver_name, purpose=purpose)

  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 229, in __init__
    self.release_resources()

  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
    six.reraise(self.type_, self.value, self.tb)

  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 212, in __init__
    reserve_node()

  File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)

  File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
    raise attempt.get()

  File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
    six.reraise(self.value[0], self.value[1], self.value[2])

  File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)

  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 199, in reserve_node
    self.node = objects.Node.reserve(context, CONF.host, node_id)

  File "/usr/lib/python2.7/site-packages/ironic/objects/base.py", line 109, in wrapper
    result = fn(cls, context, *args, **kwargs)

  File "/usr/lib/python2.7/site-packages/ironic/objects/node.py", line 193, in reserve
    db_node = cls.dbapi.reserve_node(tag, node_id)

  File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 226, in reserve_node
    host=node['reservation'])

NodeLocked: Node aee1f80b-5c52-418e-a092-e34572fa88ba is locked by host ****.redhat.com, please retry after the current operation is completed.
 (HTTP 409)


In this situation, I am completely stuck and can't install on any of the nodes. I have to reprovision the setup and start from scratch.


Version-Release number of selected component (if applicable):
python-rdomanager-oscplugin-0.0.8-18.el7ost.noarch


How reproducible:
When the nodes are really stuck


Steps to Reproduce:
1. Register nodes. I don't have a reproducible way to make some of them stuck.
2. Try to delete the failed nodes: openstack baremetal delete aee1f80b-...
3. Also try: ironic node-set-maintenance aee1f80b-... on


Actual results:
The delete operation is retried 61 times and fails over and over. The operation ironic node-set-maintenance hangs for a long time and then also generates a similar exception.
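
To illustrate the behaviour seen in the traceback (a reservation attempt that is retried and finally fails with NodeLocked), here is a minimal sketch. It is not Ironic's code: the dict-based reservation store and the function names are hypothetical, and the default of 61 attempts is simply the number observed in this report. A real implementation would also wait between attempts.

```python
class NodeLocked(Exception):
    """Raised when a node is already reserved by some host."""


def reserve_node(reservations, node_id, host):
    """Reserve node_id for host, or raise NodeLocked if it is already held."""
    holder = reservations.get(node_id)
    if holder is not None:
        raise NodeLocked("Node %s is locked by host %s" % (node_id, holder))
    reservations[node_id] = host


def reserve_with_retries(reservations, node_id, host, max_attempts=61):
    """Try to reserve the node up to max_attempts times.

    Returns the number of attempts used on success. If the lock is never
    released (for example, its holder died), the last NodeLocked error
    is re-raised to the caller.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            reserve_node(reservations, node_id, host)
            return attempts
        except NodeLocked:
            if attempts >= max_attempts:
                raise
```

With a stale reservation that nobody clears, every attempt fails the same way, which matches the delete command failing over and over.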


Expected results:
Allow deletion of nodes regardless of the hardware's health. In real life, sometimes nodes break down, and we can't allow the rest of the nodes (there could be thousands) to be stuck because of a single failure.


Additional info:
I see no errors or hints when running: sudo journalctl -u openstack-ironic-api -u openstack-ironic-conductor

Comment 3 Lucas Alvares Gomes 2015-07-09 12:51:02 UTC

I will need some more information about the states of the node to proceed. Can you please tell me the node-show output of both nodes?

$ ironic node-show <uuid>

And attach the ironic-conductor log as well if possible.

Comment 4 Mike Burns 2015-07-09 14:06:08 UTC

Udi, see comment 3

Comment 5 chris alfonso 2015-07-13 17:34:27 UTC

Lucas,

Please provide the workaround necessary via updating the record manually in mysql, and make sure the doc text exists for this. Removing from the blocker list.

Comment 6 Lucas Alvares Gomes 2015-07-14 11:41:41 UTC

Hi @chris,

Right, yeah this is a workaround that we really want to avoid, and I've been looking at states where we can get stuck and trying to fix them to avoid this. So please use it only as a last resort.

In case you're really stuck please put the node back to "available" state, by modifying the database as:

UPDATE nodes SET provision_state="available", target_provision_state=NULL WHERE uuid=<uuid>;

For example:

[stack@localhost devstack]$ sudo mysql -u root -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 91
Server version: 10.0.20-MariaDB MariaDB Server

Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use ironic;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [ironic]> UPDATE nodes SET provision_state="available", target_provision_state=NULL WHERE uuid="b76e1671-7a4c-4066-be7a-dc4e97c8dddd";
Query OK, 0 rows affected (0.06 sec)
Rows matched: 1  Changed: 0  Warnings: 0

MariaDB [ironic]> exit
Bye

Comment 7 Lucas Alvares Gomes 2015-07-14 11:50:18 UTC

(In reply to Lucas Alvares Gomes from comment #6)
> Hi @chris,
> 
> Right, yeah this is a workaround that we really want to avoid and I've been
> looking at states where we can get stuck and trying to fix then to avoid
> this. So please use it only as a last resort.
> 
> In case you're really stuck please put the node back to "available" state,
> by modifying the database as:
> 
> UPDATE nodes SET provision_state="available", target_provision_state=NULL
> WHERE uuid=<uuid>;
> 

Actually, it would be good to clean up the "reservation" field as well in case the node is also locked by a specific conductor:

UPDATE nodes SET provision_state="available", target_provision_state=NULL, reservation=NULL WHERE uuid=<uuid>;
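
If the same reset has to be scripted, the statement can be issued as a parameterized query so the UUID is never pasted into the SQL by hand. The sketch below is only an illustration: it uses Python's built-in sqlite3 with an in-memory table as a stand-in for the director's MariaDB "ironic" database, the table has only the columns the workaround touches, and the starting values ("deploying", "active", "conductor-a") are made up. A MySQL driver would typically use %s placeholders instead of ?.

```python
import sqlite3

# Same UPDATE as the workaround, with the state and UUID passed as parameters.
RESET_SQL = (
    "UPDATE nodes "
    "SET provision_state = ?, target_provision_state = NULL, reservation = NULL "
    "WHERE uuid = ?"
)


def reset_stuck_node(conn, node_uuid):
    """Reset one node to 'available' and clear its lock; return rows changed."""
    cur = conn.execute(RESET_SQL, ("available", node_uuid))
    conn.commit()
    return cur.rowcount


# Stand-in schema and a single "stuck" node (illustrative values only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (uuid TEXT PRIMARY KEY, provision_state TEXT, "
    "target_provision_state TEXT, reservation TEXT)"
)
conn.execute(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    ("b76e1671-7a4c-4066-be7a-dc4e97c8dddd", "deploying", "active", "conductor-a"),
)

changed = reset_stuck_node(conn, "b76e1671-7a4c-4066-be7a-dc4e97c8dddd")
row = conn.execute(
    "SELECT provision_state, target_provision_state, reservation FROM nodes"
).fetchone()
```

Checking the returned row count before moving on guards against a mistyped UUID silently matching nothing.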

Comment 13 Udi Kalifon 2015-08-30 07:49:37 UTC

@Lucas,

We start with the nodes unregistered and turned off. We register them with the command "openstack baremetal import --json instackenv.json", and then we see them in power off state, provision state "available" and maintenance mode "off".

I'm assuming that the states we see are just the default ones you always get when you register new nodes, and then ironic works in the background to connect to the IPMI interfaces and update the real states of the machines. If the interfaces on some of the nodes are really down, the nodes will be "locked" when you try to configure boot for them...

But of course, that's just how I see it, and I might be completely wrong because I don't really know how the code works.

Comment 14 Lucas Alvares Gomes 2015-08-31 15:41:18 UTC

Hi @Udi,

Well kinda, the "available" provision state and maintenance "False" are default. But the power state when you register a node is actually None, so in the background (as a periodic task) Ironic will check the power state of the node every X seconds (defaults to 60 seconds) to see if what it has in the database is the actual state of the node [1].

It could be that the operation somehow got stuck for a long time, since in the version we use in ospd we acquire an exclusive lock from the beginning of this operation. Upstream, @Dmitry worked to minimize the usage of exclusive locks for this problem [2], but this hasn't been backported yet.

A workaround for this lock problem right now would be to restart the ironic-conductor, which will free up all the locks that specific conductor was holding. So you don't have to change the database or anything.

[1] https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L2139-L2165

[2] https://review.openstack.org/#/c/202562/
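
The effect of the restart described above (every lock held by that one conductor is released, locks held by other conductors are left alone) can be pictured with a small sketch. This is not Ironic's startup code; the in-memory dict of nodes, the function name and the host names are all hypothetical.

```python
def release_reservations(nodes, host):
    """Clear the reservation on every node held by host.

    nodes maps a node UUID to a dict with a 'reservation' key.
    Returns the UUIDs whose reservation was cleared.
    """
    released = []
    for uuid, node in nodes.items():
        if node.get("reservation") == host:
            node["reservation"] = None
            released.append(uuid)
    return released
```

Nodes reserved by a different host keep their reservation, which is why restarting one conductor only helps for the nodes that conductor had locked.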

Comment 15 Lucas Alvares Gomes 2015-08-31 15:42:39 UTC

Sorry forgot to say that. (In reply to Lucas Alvares Gomes from comment #14)
> Hi @Udi,
> 
> Well kinda, the "available" provison state and maintenance "False" are
> default. But the power state when you register a node is actually None, so
> in the background (as a periodic task) Ironic will check the power state of
> the node every X seconds (defaults to 60 seconds) to see if what it has in
> the database is the actual state of the node [1].
> 

Sorry forgot to mention that if the state can't be sync'ed Ironic will put the node in maintenance mode to alert the operator that it can't manage it [1].

[1] https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L2139-L2165
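
A rough sketch of one pass of such a periodic check for a single node follows. It is an illustration only, not the code behind [1]: the function name, the dict layout, the failure counter and the max_retries value of 3 are all assumptions.

```python
def sync_power_state(node, read_power_state, failures, max_retries=3):
    """Run one power state check for a single node.

    node: dict with 'uuid', 'power_state' and 'maintenance' keys.
    read_power_state: callable that returns the hardware's power state
        or raises if the management interface does not answer.
    failures: dict counting consecutive failed reads per node UUID.
    """
    uuid = node["uuid"]
    try:
        actual = read_power_state(node)
    except Exception:
        # Could not reach the hardware; count the failure and, once the
        # limit is hit, flag the node so the operator knows about it.
        failures[uuid] = failures.get(uuid, 0) + 1
        if failures[uuid] >= max_retries:
            node["maintenance"] = True
        return node
    # Read succeeded: reset the counter and record the real state.
    failures.pop(uuid, None)
    node["power_state"] = actual
    return node
```

A node whose interface never answers ends up in maintenance after max_retries passes, while a reachable node simply has its stored power state refreshed.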

Comment 16 Udi Kalifon 2015-09-17 06:07:00 UTC

Patch merged upstream and should be fixed for y2.

Comment 17 Jaromir Coufal 2015-10-20 11:43:56 UTC

Based on Udi's comment, the bug status is not up to date. Can anybody please fix that?

Comment 18 Mike Burns 2015-11-06 13:59:44 UTC

Patch is posted and merged upstream, but not yet backported.  It was not part of the 2015.1.2 release, so it needs a manual backport.

Comment 24 errata-xmlrpc 2016-06-15 18:04:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1234