Description of problem:
Listing events on a stack with 350 nodes is too slow (50 seconds). The structure of the stack is not too complex (one level of nested stacks). The template definitions are here:
https://github.com/redhat-openstack/openshift-on-openstack

Version-Release number of selected component (if applicable):
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cloudwatch-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch

How reproducible:
I understand this is hard to reproduce because of the hardware resources required; in this case the openshift-on-openstack templates and 350 OpenShift compute nodes were deployed. I don't think this is specific to the openshift-on-openstack templates, so any big stack that includes nested stacks should work as a reproducer. My guess is that events of a nested stack carry no reference to the top-level stack, so Heat has to do many selects (one per nested stack) to get all events.

Actual results:
$ time openstack stack event list --nested-depth=2 test
...
real 0m52.846s
user 0m3.366s
sys 0m2.363s

Expected results:
The list of events is returned in a few seconds.

Related/similar to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1396391#c0
According to https://review.openstack.org/#/c/326229/, this should have been optimized already. It seems we have a regression; I'll try to have a look.
I tested with master on tripleo (rather big stack), and it seems fine:

$ time openstack stack event list --nested-depth 4 overcloud | wc -l
1120

real 0m3.671s
user 0m1.316s
sys 0m0.108s

Not blazing fast, but good enough it seems. How many events do you have?
Some back-of-the-envelope calculations suggest it's probably a little over 9000 events just to create a stack like this (about 7700 resources, but the stack is scaled out in stages). And those are spread over 350+ stacks. However, as you pointed out, the code appears to be doing exactly 3 DB queries regardless of the number of stacks involved. That suggests that if there's a problem it's likely to be in DB optimisation rather than application optimisation. I wonder if we're missing an index we should have?
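To make the missing-index question concrete, here is a minimal sketch using SQLite from the Python standard library. It is not Heat's actual schema or database backend: the table name `event`, the column `stack_id`, and the index name `ix_event_stack_id` are assumptions chosen for illustration. It only shows how one could check whether a "fetch events for a set of stacks" query scans the whole table or uses an index.

```python
import sqlite3

# Hypothetical, simplified stand-in for an events table (NOT Heat's real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id INTEGER PRIMARY KEY, stack_id TEXT, resource_name TEXT)"
)

# Roughly 9000 events spread over 350 stacks, as estimated in the comment above.
rows = [(f"stack-{i % 350}", f"res-{i}") for i in range(9000)]
conn.executemany("INSERT INTO event (stack_id, resource_name) VALUES (?, ?)", rows)

QUERY = "SELECT id FROM event WHERE stack_id IN ('stack-1', 'stack-2', 'stack-3')"


def query_plan(connection, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    cursor = connection.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[3] for row in cursor)


# Without an index on stack_id the planner has to scan every event row.
plan_before = query_plan(conn, QUERY)

# After adding the index, the planner can look up only the matching stacks.
conn.execute("CREATE INDEX ix_event_stack_id ON event (stack_id)")
plan_after = query_plan(conn, QUERY)

print("before:", plan_before)
print("after: ", plan_after)
```

On a real deployment the equivalent check would be an EXPLAIN of the actual event query on the deployed database, which may plan queries differently from SQLite.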
Created attachment 1227353: stack events (nested depth 2)
So the given list contains 30k events. I don't think there is much we can do to improve it. We probably should have had a default limit, but that is hard to change now without breaking backward compatibility. In the meantime, you should use limits, markers and filters (the --limit, --marker and --filter options of openstack stack event list) if you need quick responses. You may also want to use openstack stack failure list.
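As an illustration of the limit/marker approach, here is a minimal Python sketch of marker-based paging. It does not call Heat: `fetch_page` is a hypothetical callable standing in for whatever returns one page of events (for example, a wrapper around the event list call with a limit and a marker), and `fake_fetch_page` is an in-memory stand-in used only to make the example runnable. The assumption is that the marker is the ID of the last event on the previous page and that the next page starts right after it.

```python
from typing import Callable, Dict, Iterator, List, Optional

Event = Dict[str, str]


def iter_events(
    fetch_page: Callable[[Optional[str], int], List[Event]],
    page_size: int = 100,
) -> Iterator[Event]:
    """Yield events page by page instead of requesting the full list at once."""
    marker: Optional[str] = None
    while True:
        page = fetch_page(marker, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch.
            return
        # Assumed semantics: the next page starts after this event ID.
        marker = page[-1]["id"]


# In-memory stand-in for the event API (hypothetical data, not real Heat output).
FAKE_EVENTS: List[Event] = [
    {"id": f"ev-{i:05d}", "resource_status": "CREATE_COMPLETE"} for i in range(250)
]


def fake_fetch_page(marker: Optional[str], limit: int) -> List[Event]:
    """Return up to `limit` events that come after the event with ID `marker`."""
    start = 0
    if marker is not None:
        ids = [event["id"] for event in FAKE_EVENTS]
        start = ids.index(marker) + 1
    return FAKE_EVENTS[start:start + limit]


events = list(iter_events(fake_fetch_page, page_size=100))
print(len(events))
```

With paging, a caller that only needs the first few events can stop iterating early, so each individual request stays small regardless of the total number of events.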