Bug 1212134

Summary: 'instack-ironic-deployment --discover-nodes' is failing with 'node locked by host' error
Product: Red Hat OpenStack
Component: python-ironicclient
Version: 7.0 (Kilo)
Target Release: 7.0 (Kilo)
Target Milestone: ga
Reporter: Ronelle Landy <rlandy>
Assignee: Dmitry Tantsur <dtantsur>
QA Contact: Toure Dunnon <tdunnon>
CC: apevec, dsneddon, dtantsur, jliberma, kobi.ginon, lhh, mlopes, mtanino, tsekiyam, ukalifon, whayutin
Status: CLOSED ERRATA
Severity: unspecified
Priority: medium
Keywords: Automation, Triaged
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2015-08-05 13:22:45 UTC
Doc Type: Bug Fix
Doc Text:
Previously, certain operations in OpenStack Bare Metal Provisioning (Ironic) would fail to run while the node was in a `locked` state. This update implements a `retry` function in the Ironic client. As a result, certain operations take longer to run, but do not fail due to `node locked` errors.

Description Ronelle Landy 2015-04-15 15:42:11 UTC
Description of problem:

On bare metal systems, when running 'instack-ironic-deployment --discover-nodes', the 'Polling discoverd for discovery results' step fails at a random point (sometimes after the first node, sometimes after the third or fifth) with the error:

'Node xxx is locked by host y, please retry after the current operation is completed. (HTTP 409)'

    [stack@host15 ~]$ instack-ironic-deployment --discover-nodes
    /usr/lib/python2.7/site-packages/keystoneclient/shell.py:65: DeprecationWarning: The keystone CLI is deprecated in favor of python-openstackclient. For a Python library, continue using python-keystoneclient.
      'python-keystoneclient.', DeprecationWarning)
    Preparing for deployment...
      Discovering nodes.
        Sending node ID eca6c28d-c853-4083-8bfd-771d3225d1da to discoverd for discovery ... DONE.
        Sending node ID 0d28569b-2c81-4418-bcfb-08ee5fabf550 to discoverd for discovery ... DONE.
        Sending node ID 827d87dc-5d05-4301-9cc6-38a2e1025204 to discoverd for discovery ... DONE.
        Sending node ID 5b409224-5a27-4f51-bc67-dbd4a13a4b67 to discoverd for discovery ... DONE.
        Sending node ID 9084b41c-3a18-4371-b37b-465d7315839f to discoverd for discovery ... DONE.
        Sending node ID 93e3a5cf-aaf5-460c-bf85-40cac5810129 to discoverd for discovery ... DONE.
       Polling discoverd for discovery results ...
           Result for node eca6c28d-c853-4083-8bfd-771d3225d1da is ... DISCOVERED.
           Result for node 0d28569b-2c81-4418-bcfb-08ee5fabf550 is ... DISCOVERED.
           Result for node 827d87dc-5d05-4301-9cc6-38a2e1025204 is ... DISCOVERED.
    Node 827d87dc-5d05-4301-9cc6-38a2e1025204 is locked by host host15.x.x, please retry after the current operation is completed. (HTTP 409)

Increasing the timeout on https://github.com/rdo-management/instack-undercloud/blob/master/scripts/instack-ironic-deployment#L160 from 15 to 150 allows all 6 nodes to be discovered without hitting the locked error.
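
For illustration, the polling step boils down to a wait-until-done loop with a per-node timeout. Below is a minimal Python sketch of that pattern (the actual script is bash, and check_node_discovered() is a hypothetical stand-in for the script's status query, not a real discoverd API):

    import time

    POLL_TIMEOUT = 150   # raised from 15; gives the conductor time to release the lock
    POLL_INTERVAL = 5    # seconds between status queries

    def wait_for_discovery(node_uuids, check_node_discovered,
                           timeout=POLL_TIMEOUT, interval=POLL_INTERVAL):
        """Poll until every node reports DISCOVERED or its timeout expires."""
        for uuid in node_uuids:
            deadline = time.time() + timeout
            while True:
                if check_node_discovered(uuid):
                    print('Result for node %s is ... DISCOVERED.' % uuid)
                    break
                if time.time() >= deadline:
                    raise RuntimeError('Timed out waiting for node %s' % uuid)
                time.sleep(interval)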

Version-Release number of selected component (if applicable):

[stack@host15 ~]$ rpm -qa | grep ironic
openstack-ironic-conductor-2015.1-dev548.g60ceade.el7.centos.noarch
openstack-ironic-common-2015.1-dev548.g60ceade.el7.centos.noarch
openstack-ironic-api-2015.1-dev548.g60ceade.el7.centos.noarch
python-ironic-discoverd-1.1.0-0.99.20150413.1600git.el7.centos.noarch
openstack-ironic-discoverd-1.1.0-0.99.20150413.1600git.el7.centos.noarch
python-ironicclient-0.4.1.25-g3b171c5.el7.centos.noarch

How reproducible:

Always, although the node on which the failure occurs varies.

Steps to Reproduce:
1. Follow the documentation to install rdo-manager on bare metal: https://repos.fedorapeople.org/repos/openstack-m/instack-undercloud/html/deploy-overcloud.html
2. At the 'instack-ironic-deployment --discover-nodes' step, notice that the command fails

Actual results:
The 'Polling discoverd for discovery results' step fails.

Expected results:
All nodes should be discovered and reported on without error.

Additional info:

Comment 4 Dmitry Tantsur 2015-04-15 16:06:59 UTC
I believe the proper fix is for ironicclient to retry on 409, so I'll work upstream on it. Then this whole class of problems will be gone.
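
In other words, the client would transparently repeat a request that fails with 409 Conflict instead of surfacing the error immediately. A minimal sketch of the idea (generic; the names below are illustrative, not the actual ironicclient internals):

    import time

    MAX_RETRIES = 5       # total attempts = MAX_RETRIES + 1
    RETRY_INTERVAL = 2    # seconds to sleep between attempts

    def request_with_retries(do_request):
        """Repeat a request while the server answers 409 Conflict (node locked)."""
        for attempt in range(MAX_RETRIES + 1):
            status, body = do_request()
            if status != 409:
                return status, body
            if attempt < MAX_RETRIES:
                time.sleep(RETRY_INTERVAL)
        raise RuntimeError('Node is still locked after %d attempts (HTTP 409)'
                           % (MAX_RETRIES + 1))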

Comment 5 Dmitry Tantsur 2015-04-16 14:40:23 UTC
Upstream patch for retrying: https://review.openstack.org/#/c/174359

Comment 6 Dmitry Tantsur 2015-04-21 10:58:50 UTC
Upstream patch landed in master, pending stable/kilo: https://review.openstack.org/#/c/175301/

The ironicclient package in Delorean has been rebased to include the latter patch, so this should be fixed now.

Comment 11 Ronelle Landy 2015-06-22 20:56:55 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1233452 shows similar errors during overcloud deploy.

Comment 12 Dmitry Tantsur 2015-06-23 07:19:28 UTC
So I wonder if we need a longer retry time... Anyway, let's continue in that report; the two issues are slightly different.
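
A back-of-the-envelope check of how much lock time the client-side retries actually cover (a sketch, assuming the upstream defaults of max_retries=5 and retry_interval=2):

    # rough arithmetic, assuming the upstream defaults
    max_retries = 5       # DEFAULT_MAX_RETRIES
    retry_interval = 2    # DEFAULT_RETRY_INTERVAL, in seconds
    attempts = max_retries + 1
    covered = max_retries * retry_interval
    # 6 attempts spanning roughly 10 seconds of lock contention; a slow
    # power-state transition can hold the conductor lock for longer than that
    print('%d attempts, ~%d s covered' % (attempts, covered))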

Comment 13 Udi Kalifon 2015-07-09 08:25:14 UTC
(In reply to Dmitry Tantsur from comment #4)
> I believe the proper fix is for ironicclient to retry on 409, so I'll work
> upstream on it. Then this whole class of problems will be gone.

The retry is not always the solution. See bug https://bugzilla.redhat.com/show_bug.cgi?id=1241424

Comment 14 Dmitry Tantsur 2015-07-09 08:29:40 UTC
Yeah, but that's another bug.

Comment 17 errata-xmlrpc 2015-08-05 13:22:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548

Comment 18 kobi ginon 2017-01-01 19:45:14 UTC
So, Dmitry Tantsur, what is the solution for this issue at the end of the day? We are suffering from it: in most cases the introspection passes, but since we upgraded to RHEL 7.3 I can see many more cases of failure.

What I have identified is that the failure occurs when one bare metal node finishes introspection while the second one is at more or less the same stage.

At that point I start to see the warning message: the second bare metal node starts a shutdown, but the shutdown seems to be delayed, and introspection finishes without waiting for the second node to shut down completely.

The user then starts the overcloud deployment, since the prompt has returned and these look like the familiar 'harmless warnings', but the deployment will fail.

So is there a solution for the reported issue? Does my description shed more light on it, or is it a completely different case?

regards

Comment 19 Dmitry Tantsur 2017-01-02 09:34:32 UTC
Hello!

It's hard to judge at first glance, but I suspect your issue might be a slightly different one. Which version of OSP are you using? Could you please paste the output of 'ironic node-list' after the (failed) introspection?

Comment 20 kobi ginon 2017-01-02 13:39:07 UTC
(In reply to Dmitry Tantsur from comment #19)
> Hello!
> 
> It's hard to judge at first glance, but I suspect your issue might be a
> slightly different one. Which version of OSP are you using? Could you please
> paste the output of 'ironic node-list' after the (failed) introspection?

Hi Dmitry, thanks for the prompt reply.
I'm using OSPd 8:
python-ironic-inspector-client-1.2.0-6.el7ost.noarch
openstack-ironic-inspector-2.2.6-1.el7ost.noarch
openstack-ironic-conductor-4.2.5-3.el7ost.noarch
openstack-ironic-common-4.2.5-3.el7ost.noarch
python-ironicclient-0.8.1-1.el7ost.noarch
openstack-ironic-api-4.2.5-3.el7ost.noarch


Below is the on-screen output; I have 1 controller and 1 compute node in this test.
You will notice that only one of them changed state to available,
and then a warning appears when you try to start the deployment.
As I mentioned, I could see on screen that the second node started to shut down
but did not manage to finish before introspection completed; introspection did not wait for it.

19:53:37 Node 49ebe10b-e8c4-4cf2-835a-d8147181d6fd power state is in transition. Waiting up to 120 seconds for it to complete.
19:53:47 performing introspection.
20:00:34 Request returned failure status.
20:00:34 Error contacting Ironic server: Node 49ebe10b-e8c4-4cf2-835a-d8147181d6fd is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 1 of 6
20:00:38 Request returned failure status.
20:00:38 Error contacting Ironic server: Node 49ebe10b-e8c4-4cf2-835a-d8147181d6fd is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 2 of 6
20:00:42 Request returned failure status.
20:00:42 Error contacting Ironic server: Node 49ebe10b-e8c4-4cf2-835a-d8147181d6fd is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 3 of 6
20:00:46 Request returned failure status.
20:00:46 Error contacting Ironic server: Node 49ebe10b-e8c4-4cf2-835a-d8147181d6fd is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 4 of 6
20:00:50 Request returned failure status.
20:00:50 Error contacting Ironic server: Node 49ebe10b-e8c4-4cf2-835a-d8147181d6fd is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 5 of 6
20:00:54 Request returned failure status.
20:00:54 Error contacting Ironic server: Node 49ebe10b-e8c4-4cf2-835a-d8147181d6fd is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 6 of 6
20:00:54 Node 49ebe10b-e8c4-4cf2-835a-d8147181d6fd is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
20:00:54 Setting nodes for introspection to manageable...
20:00:54 Starting introspection of node: 301e92a5-002d-4f3e-a614-af67f4a5dc4c
20:00:54 Starting introspection of node: 49ebe10b-e8c4-4cf2-835a-d8147181d6fd
20:00:54 Waiting for introspection to finish...
20:00:54 Introspection for UUID 301e92a5-002d-4f3e-a614-af67f4a5dc4c finished successfully.
20:00:54 Introspection for UUID 49ebe10b-e8c4-4cf2-835a-d8147181d6fd finished successfully.
20:00:54 Setting manageable nodes to available...
20:00:54 Node 301e92a5-002d-4f3e-a614-af67f4a5dc4c has been set to available.
20:00:57 Ironic Node introspection succeeded
20:00:57 performing overcloud deployment.
20:01:10 Error: only 0 of 1 requested ironic nodes are tagged to profile compute (for flavor compute)
20:01:10 Recommendation: tag more nodes using ironic node-update <NODE ID> replace properties/capabilities=profile:compute,boot_option:local
20:01:10 Configuration has 1 errors, fix them before proceeding. Ignoring these errors is likely to lead to a failed deploy.

Comment 21 kobi ginon 2017-01-03 08:50:51 UTC
Hi Dmitry,
Please read comment 20 first. I would like to add that the code you suggested
(https://review.openstack.org/#/c/175301/)
already exists in my OSPd 8 version.

I am just not sure how to enlarge the number of retries
(DEFAULT_MAX_RETRIES).
How do I do it from the ironic.conf file, and with which field?
Or is there another way to do it?


regards
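
For what it's worth, the retry knobs from the upstream patch are client-side settings rather than ironic.conf options. A minimal sketch of overriding them when building the client in Python (assuming get_client() forwards max_retries/retry_interval to the HTTP layer as the patch intended; the credentials and values are placeholders):

    from ironicclient import client as ironic_client

    # Placeholders throughout; max_retries/retry_interval are the knobs the
    # upstream patch added, assumed here to be accepted as client kwargs.
    ironic = ironic_client.get_client(
        1,                                   # API major version
        os_username='admin',
        os_password='secret',
        os_tenant_name='admin',
        os_auth_url='http://203.0.113.1:5000/v2.0',
        max_retries=12,                      # default is 5
        retry_interval=5,                    # default is 2 seconds
    )
    print(ironic.node.list())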