Description of problem:
openstack baremetal import takes a very long time to return (sometimes I had to kill it after more than 20 minutes for a small node count of 9). Although the nodes go from enroll to manageable, as can be seen from an ironic node-list, the baremetal import command doesn't return. Sometimes it returns, but only after a very long time. In previous releases the baremetal import command hardly took any time.

Version-Release number of selected component (if applicable):
RHOP 10 Puddle from 2016-08-18.1

How reproducible:
50% of the time

Steps to Reproduce:
1. Build the undercloud
2. Create instackenv.json
3. Use openstack baremetal import to import the node data

Actual results:
The command takes a very long time to return, even for a small node count

Expected results:
The command should return without much delay

Additional info:
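For anyone reproducing this, a minimal sketch of building the instackenv.json mentioned in the steps above. All node details (pm_addr, credentials, MAC, resource sizes) are placeholder values for illustration, not taken from the reporter's environment.

import json

# Hypothetical minimal instackenv.json with a single IPMI-managed node.
# Every value below is a placeholder and must match the real hardware.
nodes = {
    "nodes": [
        {
            "name": "node-0",
            "pm_type": "pxe_ipmitool",
            "pm_addr": "192.168.24.100",
            "pm_user": "admin",
            "pm_password": "password",
            "mac": ["52:54:00:aa:bb:cc"],
            "cpu": "4",
            "memory": "8192",
            "disk": "40",
            "arch": "x86_64",
        }
    ]
}

with open("instackenv.json", "w") as f:
    json.dump(nodes, f, indent=2)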
Re-assigning to python-tripleoclient since the import command from the "baremetal" namespace lives there.
(In reply to Sindhur from comment #0)
> Actual results:
> Command takes very long time to return even for small node count

Have you had any errors in the ironic logs while the import was being done? I've tried importing 9 nodes several times and it consistently takes about 45 seconds.

Using
python-ironicclient-1.7.0-0.20160902094012.464044f.el7ost.noarch
python-tripleoclient-5.0.0-0.20160907170033.b0d7ce7.el7ost.noarch
Just to add to the import timings: it takes ~200 seconds consistently to import 38 nodes in my OSPd 10 env. I've not seen errors in the logs and --debug doesn't reveal anything of importance.
One other data point I mistakenly omitted from comment 4: the same node count (38) on my OSPd 9 env takes 6 seconds.
@dwilson, the original bug was about registration sometimes taking 20+ minutes (and sometimes not returning at all). Your comment appears to be about a general slowdown when comparing OSPd 10 to OSPd 9; can you create a separate bug about that and I'll investigate it.
I'm also seeing registration hang and not return on some iterations with 10. The original bz by Sindhur refers to a "slowdown" (compared to his 9). He had inconsistent results, as sometimes his ingestion would take 10 seconds. It's the same issue as stated above; I'm just adding supporting data points.
The changes in registration time are at least explainable. In OSPd 9 the baremetal import command goes through the following sequence:

o Get a token from keystone
o Get a list of currently registered nodes from ironic
o For each node (assuming it hadn't already been registered)
  - POST node details to ironic
  - POST port details to ironic
  - GET node validate details to verify they contain power management credentials
  - POST a command for ironic to power off the node
o Get a list of currently registered nodes from ironic

This all happens fairly quickly, as no additional services are involved and there aren't any waits for asynchronous tasks to happen.

In OSPd 10 things have changed:

o Get a token from keystone
o POST a "register" workflow to mistral (tripleo.baremetal.v1.register_or_update)
o Open a websocket to wait for the mistral workflow to finish
  - In mistral the following happens
    - for each node, register it with ironic (essentially the same as the entire OSPd 9 process)
    - for each node
      - set the node provision state to manageable
      - wait for ironic to actually set the node state (1 sec sleep between checks)
      - set the node provision state to available
      - wait for ironic to actually set the node state (1 sec sleep between checks)
      - set the node provision state to manageable
o POST a "provide" workflow to mistral (tripleo.baremetal.v1.provide)
o Open a websocket to wait for the mistral workflow to finish
  - In mistral the following happens
    - for each node
      - set the node provision state to available

It looks like we are duplicating the logic to set the nodes to manageable and then to available. I think there is a bug here, I'll take a look into it. Once the duplication is removed, I expect the overall registration time to reduce, but it won't be on a par with OSPd 9, as the state-changing logic didn't occur in OSPd 9 and the mistral overhead didn't exist.
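To illustrate the per-node wait loops described above: the following is a rough sketch, not the actual tripleo-common/mistral action code, of what each provision-state step amounts to when expressed with python-ironicclient. The set_state_and_wait helper name is made up for this example.

import time

# Hypothetical sketch of one state transition: request a provision state
# change, then poll ironic until the node reports the target state,
# sleeping 1 second between checks (as the workflow does).
def set_state_and_wait(client, node_uuid, verb, target_state, timeout=300):
    client.node.set_provision_state(node_uuid, verb)
    deadline = time.time() + timeout
    while time.time() < deadline:
        node = client.node.get(node_uuid)
        if node.provision_state == target_state:
            return node
        time.sleep(1)  # 1 sec sleep between checks
    raise RuntimeError("timed out waiting for %s to reach %s"
                       % (node_uuid, target_state))

# e.g. set_state_and_wait(client, uuid, 'manage', 'manageable')
#      set_state_and_wait(client, uuid, 'provide', 'available')

With two such transitions per node in the register workflow, an extra one in the provide workflow, and the duplication noted above, the added wall-clock time over the OSPd 9 fire-and-forget sequence adds up quickly.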
Fix for the double registration attached; this should speed up registration somewhat. But it won't be down to OSPd 9 levels because, as I said, more now happens during registration and there is mistral overhead.
I think this should be fixed with the upstream patch at https://review.openstack.org/#/c/379482/. Changed status to MODIFIED. Derek, is that the right status?
(In reply to Dan Sneddon from comment #10)
> I think this should be fixed with the upstream patch at
> https://review.openstack.org/#/c/379482/. Changed status to MODIFIED. Derek,
> is that the right status?

The upstream patch should make the situation better, but this bug started out being about a registration process that never completed, and I've never been able to reproduce that. Maybe we can go ahead and change the status, as the double registration is fixed, and create a new bug if the original reporter is still seeing the problem.
Based on comment #12, I believe this situation should be improved. If the original reporter is still seeing the problem, we will open a new bug targeted at the current development cycle.
So, I'm seeing this again. On firing the baremetal import command, it never returns even after waiting for a long time; however, ironic node-list shows the nodes as imported and set to manageable. What logs/other info are needed? Please advise.
(In reply to Sai Sindhur Malleni from comment #14)
> So, seeing this again. On firing the baremetal import command, it never
> returns even after waiting for a long time, however ironic node-list shows
> them as imported and set to manageable. What are the logs/other info needed.
> Please advise.

If you could attach your ironic and mistral logs it would be great.
Created attachment 1218107 [details] Logs on undercloud
May be related to https://bugzilla.redhat.com/show_bug.cgi?id=1383627
There were several improvements in the recent puddles. Could you please check how long it takes now?
Also, I'm seeing lots of IPMI errors in your ironic-conductor logs:

Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n'

Is it possible there is a hardware / network or configuration problem?
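As a quick way to rule a connectivity or credentials problem in or out, here is a rough sketch (not part of any OSP tooling) that walks the nodes in instackenv.json and tries an ipmitool power status against each BMC. The instackenv.json path and field names follow the usual TripleO layout and are assumptions here; it also assumes ipmitool is installed on the undercloud.

import json
import subprocess

# Hypothetical check: ask each BMC listed in instackenv.json for its power
# status, to see whether the "Unable to establish IPMI v2 / RMCP+ session"
# errors happen outside ironic as well.
with open("instackenv.json") as f:
    nodes = json.load(f)["nodes"]

for node in nodes:
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", node["pm_addr"],
        "-U", node["pm_user"],
        "-P", node["pm_password"],
        "power", "status",
    ]
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        print("%s: %s" % (node["pm_addr"], out.decode().strip()))
    except subprocess.CalledProcessError as exc:
        print("%s: FAILED: %s" % (node["pm_addr"], exc.output.decode().strip()))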
(In reply to Dmitry Tantsur from comment #18)
> There were several improvements in the recent puddles. Could you please
> check how long it takes now?

Dmitry, are you suggesting removing this from OSP-10 GA, and releasing afterwards? I see you've changed the release flag to 10.0.z.
Scott, sorry I don't quite get the question. We're not sure about the root cause of the bug, and we're pretty sure we won't be able to sort it out for the GA. However, I suspect we might have fixed it as part of several other bugs, hence my question to Sai. Hope it helps.
(In reply to Dmitry Tantsur from comment #21)
> Scott, sorry I don't quite get the question. We're not sure about the root
> cause of the bug, and we're pretty sure we won't be able to sort it out for
> the GA. However, I suspect we might have fixed it as part of several other
> bugs, hence my question to Sai. Hope it helps.

Hi Dmitry, I'm wondering if we should drop this bug from the OSP-10 GA advisory (i.e. not ship it), since it'll either be fixed with other bugs, or after the GA.
Yes, I think we should.
Closing this again: unable to reproduce, and as per comment 19 I suspect IPMI issues.