Bug 1369220 - openstack baremetal import takes a long time to return
Summary: openstack baremetal import takes a long time to return
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Derek Higgins
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-22 17:09 UTC by Sai Sindhur Malleni
Modified: 2017-01-09 14:57 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-09 14:57:38 UTC
Target Upstream Version:


Attachments
Logs on undercloud (9.10 MB, application/x-gzip), 2016-11-07 16:15 UTC, Sai Sindhur Malleni


Links
OpenStack gerrit 379482 (last updated 2016-10-05 09:57:52 UTC)

Description Sai Sindhur Malleni 2016-08-22 17:09:26 UTC
Description of problem:
openstack baremetal import takes a very long time to return (sometimes I had to kill it after more than 20 minutes for a small node count of 9). Although the nodes go from enroll to manageable, as can be seen from an ironic node-list, the baremetal import command doesn't return. Sometimes it returns, but only after a very long time. In previous releases the baremetal import command hardly took any time.

Version-Release number of selected component (if applicable):
RHOP 10
Puddle from 2016-08-18.1

How reproducible:
50% of the time
Steps to Reproduce:
1. Build undercloud
2. Make instackenv.json
3. Use openstack baremetal import to import node data
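
For reference, step 2's instackenv.json can be as small as the sketch below (a hypothetical example; every IP, MAC, credential and size is a placeholder, not data from this report):

    # Hypothetical sketch of step 2: build a minimal instackenv.json for one node.
    # Every value below is a placeholder.
    import json

    instackenv = {
        "nodes": [
            {
                "pm_type": "pxe_ipmitool",
                "pm_addr": "192.0.2.10",
                "pm_user": "admin",
                "pm_password": "changeme",
                "mac": ["52:54:00:aa:bb:01"],
                "cpu": "4",
                "memory": "8192",
                "disk": "40",
                "arch": "x86_64",
            }
        ]
    }

    with open("instackenv.json", "w") as fh:
        json.dump(instackenv, fh, indent=2)

    # Step 3 then imports the file on the undercloud, e.g.:
    #   source ~/stackrc
    #   openstack baremetal import --json instackenv.json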

Actual results:
Command takes very long time to return even for small node count

Expected results:
Command should return without much delay

Additional info:

Comment 2 Lucas Alvares Gomes 2016-08-29 11:24:38 UTC
Re-assigning to python-tripleoclient since the import command from the "baremetal" namespace lives there.

Comment 3 Derek Higgins 2016-09-19 11:43:09 UTC
(In reply to Sindhur from comment #0)
> Actual results:
> Command takes very long time to return even for small node count
Have you had any errors in the ironic logs while the import was being done?
I've tried importing 9 nodes several times and it consistently takes about 45 seconds.

Using 
python-ironicclient-1.7.0-0.20160902094012.464044f.el7ost.noarch
python-tripleoclient-5.0.0-0.20160907170033.b0d7ce7.el7ost.noarch

Comment 4 Dave Wilson 2016-09-26 18:52:16 UTC
Just to add to the import timings: it takes ~200 seconds consistently to import 38 nodes in my OSPd 10 env. I've not seen errors in the logs, and --debug doesn't reveal anything of importance.

Comment 5 Dave Wilson 2016-09-27 15:44:17 UTC
One other data point I mistakenly omitted from comment 4: the same node count (38) on my OSPd 9 env takes 6 seconds.

Comment 6 Derek Higgins 2016-09-27 16:09:53 UTC
@dwilson, the original bug was about registration sometimes taking 20+ minutes (and sometimes not returning at all). Your comment appears to be about a general slowdown when comparing OSPd 10 to 9; can you create a separate bug about this and I'll investigate it.

Comment 7 Dave Wilson 2016-09-27 16:30:41 UTC
I'm also seeing registration hang and not return on some iterations with 10. The original bz by Sindhur refers to a "slowdown" (compared to his 9). He had inconsistent results, as sometimes his ingestion would take 10 seconds. It's the same issue as stated above; I'm just adding supporting data points.

Comment 8 Derek Higgins 2016-09-29 13:30:47 UTC
The changes in registration time are at least explainable.

In OSPd 9 the baremetal import command goes through the following sequence:
o Get token from keystone
o Get a list of currently registered nodes from ironic
o For each node (assuming they hadn't already been registered)
  - POST node details to ironic
  - POST port details to ironic
  - GET node validation details to verify they contain power management credentials
  - POST command for ironic to power off the node
o Get a list of currently registered nodes from ironic

This all happens fairly quickly, as no additional services are involved and there aren't any waits for asynchronous tasks.
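
For illustration, that per-node sequence corresponds roughly to the python-ironicclient sketch below (a hedged approximation, not the actual tripleoclient/os-cloud-config code; credentials, the auth URL and field names are illustrative):

    # Rough sketch of the OSPd 9 style per-node registration sequence; this is
    # not the actual tripleoclient/os-cloud-config code. Credentials and the
    # auth URL are placeholders.
    import json

    from ironicclient import client as ironic_client

    ironic = ironic_client.get_client(
        1,                              # API major version
        os_username="admin",
        os_password="changeme",
        os_tenant_name="admin",
        os_auth_url="http://192.0.2.1:5000/v2.0",
    )

    with open("instackenv.json") as fh:
        node_defs = json.load(fh)["nodes"]

    # Get a list of currently registered nodes so already-known nodes are skipped
    already_registered = {n.name for n in ironic.node.list()}

    for node_def in node_defs:
        if node_def.get("name") and node_def["name"] in already_registered:
            continue
        # POST node details to ironic
        node = ironic.node.create(
            driver=node_def.get("pm_type", "pxe_ipmitool"),
            driver_info={
                "ipmi_address": node_def["pm_addr"],
                "ipmi_username": node_def["pm_user"],
                "ipmi_password": node_def["pm_password"],
            },
        )
        # POST port details to ironic (one port per MAC address)
        for mac in node_def["mac"]:
            ironic.port.create(node_uuid=node.uuid, address=mac)
        # GET node validation to check the power management credentials
        ironic.node.validate(node.uuid)
        # POST a command for ironic to power off the node
        ironic.node.set_power_state(node.uuid, "off")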


In OSPd 10 things have changed:
o Get token from keystone
o POST a "register" workflow to mistral (tripleo.baremetal.v1.register_or_update)
o open a websocket to wait for the mistral workflow to finish
 - In mistral the following happens
  - for each node register it with ironic (essentially the same as the entire OSPd 9 process)
  - for each node
   - set the node provision state to manageable
   - wait for ironic to actually set the node state (1 sec sleep between checks)
   - set the node provision state to available
   - wait for ironic to actually set the node state (1 sec sleep between checks)
   - set the node provision state to manageable
o POST a "provide" workflow to mistral (tripleo.baremetal.v1.provide)
o open a websocket to wait for the mistral workflow to finish
 - In mistral the following happens
  - for each node
   - set the node provision state to available

It looks like we are duplicating the logic to set the nodes to manageable and then to available. I think there is a bug here; I'll take a look into it.
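
For illustration, the per-node "set the state, then poll with a 1-second sleep" step is roughly the sketch below (assumes a python-ironicclient handle; this is not the actual tripleo-common/mistral action code):

    # Rough sketch of the per-node provision-state handling described above;
    # not the actual tripleo-common/mistral action code.
    import time


    def wait_for_provision_state(ironic, node_uuid, target_state, timeout=120):
        """Poll ironic until the node reaches target_state, sleeping 1s between checks."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if ironic.node.get(node_uuid).provision_state == target_state:
                return
            time.sleep(1)
        raise RuntimeError("node %s did not reach %s within %s seconds"
                           % (node_uuid, target_state, timeout))


    def register_workflow_state_changes(ironic, node_uuids):
        # The sequence from the register workflow, per node:
        for node_uuid in node_uuids:
            ironic.node.set_provision_state(node_uuid, "manage")
            wait_for_provision_state(ironic, node_uuid, "manageable")
            ironic.node.set_provision_state(node_uuid, "provide")
            wait_for_provision_state(ironic, node_uuid, "available")
            ironic.node.set_provision_state(node_uuid, "manage")
            wait_for_provision_state(ironic, node_uuid, "manageable")
        # The separate "provide" workflow then moves every node back to
        # available, which is where the apparent duplication comes in.

With each transition polled once per second, the redundant round trip per node adds up quickly for larger node counts.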

Once the duplication is removed, I expect the overall registration time to go down, but it won't be on a par with OSPd 9, as the state-changing logic didn't occur in OSPd 9 and the mistral overhead didn't exist.

Comment 9 Derek Higgins 2016-10-05 09:59:49 UTC
Fix for the double registration is attached; this should speed up registration somewhat. But it won't be down to OSPd 9 levels because, as I said, more now happens during registration and there is mistral overhead.

Comment 10 Dan Sneddon 2016-10-14 16:54:13 UTC
I think this should be fixed with the upstream patch at https://review.openstack.org/#/c/379482/. Changed status to MODIFIED. Derek, is that the right status?

Comment 12 Derek Higgins 2016-10-19 08:14:06 UTC
(In reply to Dan Sneddon from comment #10)
> I think this should be fixed with the upstream patch at
> https://review.openstack.org/#/c/379482/. Changed status to MODIFIED. Derek,
> is that the right status?

The upstream patch should make the situation better, but this bug started out being about a registration process that never completed, which I've never been able to reproduce. Maybe we can go ahead and change the status, since the double registration is fixed, and create a new bug if the original reporter is still seeing the problem.

Comment 13 Dan Sneddon 2016-10-19 17:41:26 UTC
Based on comment #12, I believe this situation should be improved. If the original reporter is still seeing the problem, we will open a new bug targeted at the current development cycle.

Comment 14 Sai Sindhur Malleni 2016-11-07 15:39:27 UTC
So, I'm seeing this again. On firing the baremetal import command, it never returns even after waiting for a long time; however, ironic node-list shows the nodes as imported and set to manageable. What logs/other info are needed? Please advise.

Comment 15 Derek Higgins 2016-11-07 15:53:17 UTC
(In reply to Sai Sindhur Malleni from comment #14)
> So, seeing this again. On firing the baremetal import command, it never
> returns even after waiting for a long time, however ironic node-list shows
> them as imported and set to manageable. What are the logs/other info needed.
> Please advise.

If you could attach your ironic and mistral logs, that would be great.

Comment 16 Sai Sindhur Malleni 2016-11-07 16:15:09 UTC
Created attachment 1218107 [details]
Logs on undercloud

Comment 17 Dmitry Tantsur 2016-11-08 11:03:57 UTC
May be related to https://bugzilla.redhat.com/show_bug.cgi?id=1383627

Comment 18 Dmitry Tantsur 2016-11-18 11:56:15 UTC
There were several improvements in the recent puddles. Could you please check how long it takes now?

Comment 19 Derek Higgins 2016-11-18 13:04:02 UTC
Also, I'm seeing lots of IPMI errors in your ironic-conductor logs:
    Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n'
Is it possible there is a hardware, network, or configuration problem?
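
One way to sanity-check that is the hedged sketch below (assumes ipmitool is installed on the undercloud and the nodes are still described by instackenv.json); nodes failing here would likely hit the same RMCP+ session errors in ironic-conductor:

    # Hedged sketch: check IPMI reachability and credentials for every node in
    # instackenv.json using ipmitool (assumed to be installed on the undercloud).
    import json
    import subprocess

    with open("instackenv.json") as fh:
        nodes = json.load(fh)["nodes"]

    for node in nodes:
        cmd = [
            "ipmitool", "-I", "lanplus",
            "-H", node["pm_addr"],
            "-U", node["pm_user"],
            "-P", node["pm_password"],
            "chassis", "power", "status",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        outcome = "OK" if result.returncode == 0 else "FAILED"
        print("%s  %s: %s" % (outcome, node["pm_addr"],
                              (result.stdout or result.stderr).strip()))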

Comment 20 Scott Lewis 2016-11-21 17:16:37 UTC
(In reply to Dmitry Tantsur from comment #18)
> There were several improvements in the recent puddles. Could you please
> check how long it takes now?

Dmitry,
are you suggesting removing this from the OSP-10 GA and releasing afterwards? I see you've changed the release flag to 10.0.z.

Comment 21 Dmitry Tantsur 2016-11-21 17:19:11 UTC
Scott, sorry I don't quite get the question. We're not sure about the root cause of the bug, and we're pretty sure we won't be able to sort it out for the GA. However, I suspect we might have fixed it as part of several other bugs, hence my question to Sai. Hope it helps.

Comment 22 Scott Lewis 2016-11-21 17:27:29 UTC
(In reply to Dmitry Tantsur from comment #21)
> Scott, sorry I don't quite get the question. We're not sure about the root
> cause of the bug, and we're pretty sure we won't be able to sort it out for
> the GA. However, I suspect we might have fixed it as part of several other
> bugs, hence my question to Sai. Hope it helps.

Hi Dmitry,
I'm wondering if we should drop this bug from the OSP-10 GA advisory (i.e. not ship it), since it'll either be fixed with other bugs, or after the GA.

Comment 23 Dmitry Tantsur 2016-11-21 17:28:26 UTC
Yes, I think we should.

Comment 28 Derek Higgins 2017-01-09 14:57:38 UTC
Closing this again; unable to reproduce, and as per comment 19 I suspect IPMI issues.

