Bug 1369220 - openstack baremetal import takes a long time to return
Summary: openstack baremetal import takes a long time to return
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Derek Higgins
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-22 17:09 UTC by Sai Sindhur Malleni
Modified: 2017-01-09 14:57 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-09 14:57:38 UTC
Target Upstream Version:


Attachments
Logs on undercloud (9.10 MB, application/x-gzip), 2016-11-07 16:15 UTC, Sai Sindhur Malleni


Links
OpenStack gerrit 379482 (last updated 2016-10-05 09:57:52 UTC)

Description Sai Sindhur Malleni 2016-08-22 17:09:26 UTC
Description of problem:
openstack baremetal import takes a very long time to return (sometimes I had to kill it after more than 20 minutes for a small node count of 9). Although the nodes go from enroll to manageable, as can be seen from an ironic node-list, the baremetal import command doesn't return. Sometimes it returns, but only after a very long time. In previous releases the baremetal import command hardly took any time.

Version-Release number of selected component (if applicable):
RHOP 10
Puddle from 2016-08-18.1

How reproducible:
50% of the time
Steps to Reproduce:
1. Build undercloud
2. Make instackenv.json
3. Use openstack baremetal import to import node data
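
For reference, step 2's instackenv.json can be as small as the sketch below (a hypothetical example; every IP, MAC, credential and size is a placeholder, not data from this report):

    # Hypothetical sketch of step 2: build a minimal instackenv.json for one node.
    # Every value below is a placeholder.
    import json

    instackenv = {
        "nodes": [
            {
                "pm_type": "pxe_ipmitool",
                "pm_addr": "192.0.2.10",
                "pm_user": "admin",
                "pm_password": "changeme",
                "mac": ["52:54:00:aa:bb:01"],
                "cpu": "4",
                "memory": "8192",
                "disk": "40",
                "arch": "x86_64",
            }
        ]
    }

    with open("instackenv.json", "w") as fh:
        json.dump(instackenv, fh, indent=2)

    # Step 3 then imports the file on the undercloud, e.g.:
    #   source ~/stackrc
    #   openstack baremetal import --json instackenv.json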

Actual results:
Command takes very long time to return even for small node count

Expected results:
Command should return without much delay

Additional info:

Comment 2 Lucas Alvares Gomes 2016-08-29 11:24:38 UTC
Re-assigning to python-tripleoclient since the import command from the "baremetal" namespace lives there.

Comment 3 Derek Higgins 2016-09-19 11:43:09 UTC
(In reply to Sindhur from comment #0)
> Actual results:
> Command takes very long time to return even for small node count
Have you had any errors in the ironic logs while the import was being done?
I've tried importing 9 nodes several times and it consistently takes about 45 seconds.

Using 
python-ironicclient-1.7.0-0.20160902094012.464044f.el7ost.noarch
python-tripleoclient-5.0.0-0.20160907170033.b0d7ce7.el7ost.noarch

Comment 4 Dave Wilson 2016-09-26 18:52:16 UTC
Just to add to the import timings: it takes ~200 seconds consistently to import 38 nodes in my OSPd 10 env. I've not seen errors in the logs, and --debug doesn't reveal anything of importance.

Comment 5 Dave Wilson 2016-09-27 15:44:17 UTC
One other data point I mistakenly omitted from comment 4: the same node count (38) on my OSPd 9 env takes 6 seconds.

Comment 6 Derek Higgins 2016-09-27 16:09:53 UTC
@dwilson, the original bug was about registration sometimes taking 20+ minutes (and sometimes not returning at all). Your comment appears to be about a general slowdown when comparing OSPd 10 to 9; can you create a separate bug about this and I'll investigate it.

Comment 7 Dave Wilson 2016-09-27 16:30:41 UTC
I'm also seeing registration hang and not return on some iterations with 10. The original bz by Sindhur refers to a "slowdown" (compared to his 9). He had inconsistent results, as sometimes his ingestion would take 10 seconds. It's the same issue as stated above; I'm just adding supporting data points.

Comment 8 Derek Higgins 2016-09-29 13:30:47 UTC
The changes in registration time are at least explainable.

In OSPd 9 the baremetal import command goes through the following sequence:
o Get token from keystone
o Get a list of currently registered nodes from ironic
o For each node (assuming they hadn't already been registered)
  - POST node details to ironic
  - POST port details to ironic
  - GET node validation details to verify they contain power management credentials
  - POST command for ironic to power off the node
o Get a list of currently registered nodes from ironic

This all happens fairly quickly, as no additional services are involved and there aren't any waits for asynchronous tasks.
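
For illustration, that per-node sequence corresponds roughly to the python-ironicclient sketch below (a hedged approximation, not the actual tripleoclient/os-cloud-config code; credentials, the auth URL and field names are illustrative):

    # Rough sketch of the OSPd 9 style per-node registration sequence; this is
    # not the actual tripleoclient/os-cloud-config code. Credentials and the
    # auth URL are placeholders.
    import json

    from ironicclient import client as ironic_client

    ironic = ironic_client.get_client(
        1,                              # API major version
        os_username="admin",
        os_password="changeme",
        os_tenant_name="admin",
        os_auth_url="http://192.0.2.1:5000/v2.0",
    )

    with open("instackenv.json") as fh:
        node_defs = json.load(fh)["nodes"]

    # Get a list of currently registered nodes so already-known nodes are skipped
    already_registered = {n.name for n in ironic.node.list()}

    for node_def in node_defs:
        if node_def.get("name") and node_def["name"] in already_registered:
            continue
        # POST node details to ironic
        node = ironic.node.create(
            driver=node_def.get("pm_type", "pxe_ipmitool"),
            driver_info={
                "ipmi_address": node_def["pm_addr"],
                "ipmi_username": node_def["pm_user"],
                "ipmi_password": node_def["pm_password"],
            },
        )
        # POST port details to ironic (one port per MAC address)
        for mac in node_def["mac"]:
            ironic.port.create(node_uuid=node.uuid, address=mac)
        # GET node validation to check the power management credentials
        ironic.node.validate(node.uuid)
        # POST a command for ironic to power off the node
        ironic.node.set_power_state(node.uuid, "off")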


In OSPd 10 things have changed:
o Get token from keystone
o POST a "register" workflow to mistral (tripleo.baremetal.v1.register_or_update)
o open a websocket to wait for the mistral workflow to finish
 - In mistral the following happens
  - for each node register it with ironic (essentially the same as the entire OSPd 9 process)
  - for each node
   - set the node provision state to manageable
   - wait for ironic to actually set the node state (1 sec sleep between checks)
   - set the node provision state to available
   - wait for ironic to actually set the node state (1 sec sleep between checks)
   - set the node provision state to manageable
o POST a "provide" workflow to mistral (tripleo.baremetal.v1.provide)
o open a websocket to wait for the mistral workflow to finish
 - In mistral the following happens
  - for each node
   - set the node provision state to available

It looks like we are duplicating the logic to set the nodes to manageable and then to available. I think there is a bug here; I'll take a look into it.
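
For illustration, the per-node "set the state, then poll with a 1-second sleep" step is roughly the sketch below (assumes a python-ironicclient handle; this is not the actual tripleo-common/mistral action code):

    # Rough sketch of the per-node provision-state handling described above;
    # not the actual tripleo-common/mistral action code.
    import time


    def wait_for_provision_state(ironic, node_uuid, target_state, timeout=120):
        """Poll ironic until the node reaches target_state, sleeping 1s between checks."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if ironic.node.get(node_uuid).provision_state == target_state:
                return
            time.sleep(1)
        raise RuntimeError("node %s did not reach %s within %s seconds"
                           % (node_uuid, target_state, timeout))


    def register_workflow_state_changes(ironic, node_uuids):
        # The sequence from the register workflow, per node:
        for node_uuid in node_uuids:
            ironic.node.set_provision_state(node_uuid, "manage")
            wait_for_provision_state(ironic, node_uuid, "manageable")
            ironic.node.set_provision_state(node_uuid, "provide")
            wait_for_provision_state(ironic, node_uuid, "available")
            ironic.node.set_provision_state(node_uuid, "manage")
            wait_for_provision_state(ironic, node_uuid, "manageable")
        # The separate "provide" workflow then moves every node back to
        # available, which is where the apparent duplication comes in.

With each transition polled once per second, the redundant round trip per node adds up quickly for larger node counts.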

Once the duplication is removed, I expect the overall registration time to go down, but it won't be on a par with OSPd 9, as the state-changing logic didn't occur in OSPd 9 and the mistral overhead didn't exist.

Comment 9 Derek Higgins 2016-10-05 09:59:49 UTC
Fix for the double registration is attached; this should speed up registration somewhat. But it won't be down to OSPd 9 levels because, as I said, more now happens during registration and there is mistral overhead.

Comment 10 Dan Sneddon 2016-10-14 16:54:13 UTC
I think this should be fixed with the upstream patch at https://review.openstack.org/#/c/379482/. Changed status to MODIFIED. Derek, is that the right status?

Comment 12 Derek Higgins 2016-10-19 08:14:06 UTC
(In reply to Dan Sneddon from comment #10)
> I think this should be fixed with the upstream patch at
> https://review.openstack.org/#/c/379482/. Changed status to MODIFIED. Derek,
> is that the right status?

The upstream patch should make the situation better, but this bug started out being about a registration process that never completed, which I've never been able to reproduce. Maybe we can go ahead and change the status, since the double registration is fixed, and create a new bug if the original reporter is still seeing the problem.

Comment 13 Dan Sneddon 2016-10-19 17:41:26 UTC
Based on comment #12, I believe this situation should be improved. If the original reporter is still seeing the problem, we will open a new bug targeted at the current development cycle.

Comment 14 Sai Sindhur Malleni 2016-11-07 15:39:27 UTC
So, I'm seeing this again. On firing the baremetal import command, it never returns even after waiting for a long time; however, ironic node-list shows the nodes as imported and set to manageable. What logs/other info are needed? Please advise.

Comment 15 Derek Higgins 2016-11-07 15:53:17 UTC
(In reply to Sai Sindhur Malleni from comment #14)
> So, seeing this again. On firing the baremetal import command, it never
> returns even after waiting for a long time, however ironic node-list shows
> them as imported and set to manageable. What are the logs/other info needed.
> Please advise.

If you could attach your ironic and mistral logs, that would be great.

Comment 16 Sai Sindhur Malleni 2016-11-07 16:15:09 UTC
Created attachment 1218107 [details]
Logs on undercloud

Comment 17 Dmitry Tantsur 2016-11-08 11:03:57 UTC
May be related to https://bugzilla.redhat.com/show_bug.cgi?id=1383627

Comment 18 Dmitry Tantsur 2016-11-18 11:56:15 UTC
There were several improvements in the recent puddles. Could you please check how long it takes now?

Comment 19 Derek Higgins 2016-11-18 13:04:02 UTC
Also, I'm seeing lots of IPMI errors in your ironic-conductor logs:
    Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n'
Is it possible there is a hardware, network, or configuration problem?
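
One way to sanity-check that is the hedged sketch below (assumes ipmitool is installed on the undercloud and the nodes are still described by instackenv.json); nodes failing here would likely hit the same RMCP+ session errors in ironic-conductor:

    # Hedged sketch: check IPMI reachability and credentials for every node in
    # instackenv.json using ipmitool (assumed to be installed on the undercloud).
    import json
    import subprocess

    with open("instackenv.json") as fh:
        nodes = json.load(fh)["nodes"]

    for node in nodes:
        cmd = [
            "ipmitool", "-I", "lanplus",
            "-H", node["pm_addr"],
            "-U", node["pm_user"],
            "-P", node["pm_password"],
            "chassis", "power", "status",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        outcome = "OK" if result.returncode == 0 else "FAILED"
        print("%s  %s: %s" % (outcome, node["pm_addr"],
                              (result.stdout or result.stderr).strip()))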

Comment 20 Scott Lewis 2016-11-21 17:16:37 UTC
(In reply to Dmitry Tantsur from comment #18)
> There were several improvements in the recent puddles. Could you please
> check how long it takes now?

Dmitry,
are you suggesting removing this from the OSP-10 GA and releasing afterwards? I see you've changed the release flag to 10.0.z.

Comment 21 Dmitry Tantsur 2016-11-21 17:19:11 UTC
Scott, sorry I don't quite get the question. We're not sure about the root cause of the bug, and we're pretty sure we won't be able to sort it out for the GA. However, I suspect we might have fixed it as part of several other bugs, hence my question to Sai. Hope it helps.

Comment 22 Scott Lewis 2016-11-21 17:27:29 UTC
(In reply to Dmitry Tantsur from comment #21)
> Scott, sorry I don't quite get the question. We're not sure about the root
> cause of the bug, and we're pretty sure we won't be able to sort it out for
> the GA. However, I suspect we might have fixed it as part of several other
> bugs, hence my question to Sai. Hope it helps.

Hi Dmitry,
I'm wondering if we should drop this bug from the OSP-10 GA advisory (i.e. not ship it), since it'll either be fixed with other bugs, or after the GA.

Comment 23 Dmitry Tantsur 2016-11-21 17:28:26 UTC
Yes, I think we should.

Comment 28 Derek Higgins 2017-01-09 14:57:38 UTC
Closing this again; unable to reproduce, and as per comment 19 I suspect IPMI issues.

