Bug 1419667

Summary:	host list API call takes a long time
Product:	Red Hat Satellite	Reporter:	Chris Duryee <cduryee>
Component:	Hosts	Assignee:	Shimon Shtein <sshtein>
Status:	CLOSED ERRATA	QA Contact:	Roman Plevka <rplevka>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	6.2.7	CC:	akofink, bbuckingham, bkearney, cdonnell, cduryee, chorn, dan, dlobatog, dzhukous, egolov, fgarciad, hmore, inecas, jcallaha, jwildman, kelly.brown1, mtenheuv, peter.vreman, pgervase, rplevka, sigbjorn, sshtein, stkachen, zhunting
Target Milestone:	Unspecified	Keywords:	Performance, PrioBumpField, Triaged
Target Release:	Unused
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	tfm-rubygem-katello-3.0.0.117-1,foreman-1.11.0.73-1,rubygem-katello-3.0.0.117-1	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-05-01 13:59:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1296845, 1426378

Description Chris Duryee 2017-02-06 17:13:00 UTC

Description of problem:

The following call can take a very long time to complete:

curl -k -u admin:password -H "Accept:application/json,version=2" 'https://localhost/api/hosts?per_page=4500'

log result:

2017-02-06 11:46:58 [app] [I]   Rendered api/v2/hosts/index.json.rabl within api/v2/layouts/index_layout (267610.0ms)
2017-02-06 11:46:58 [app] [I] Completed 200 OK in 268842ms (Views: 133061.4ms | ActiveRecord: 134686.3ms)

It looks like a lot of the time is related to looking up errata info. It's not any one particular call, but running sql statements for each host can really add up.

If there was an additional flag to only provide some basic details without the full errata info, that works too. Specifically, the following is helpful:

 * Hostname
 * Subscription status
 * Subnet name

Version-Release number of selected component (if applicable): 6.2.7

Comment 6 Daniel Lobato Garcia 2017-02-13 09:32:50 UTC

Chris - using per_page=4500 is definitely not a good idea. The whole reason why the API is paginated is to avoid these kind of waits. 

What kind of script uses this? Why can it not use the paginated responses?

Comment 9 Chris Duryee 2017-02-13 13:57:30 UTC

re #6, the tricky part is that for scripting purposes, some users may want to just grab everything at once instead of paginating. This is a common issue for users who want to integrate their Satellite's host data with other systems, and have a script that runs via cron job to sync things over.

Usually for these purposes, it's OK to have a much smaller subset of the data. For example, even just a few fields (hostname and one or two others) may be needed. Also, it's OK if it takes a bit of time for the API to return (maybe 10-30 seconds).

Ideally, users would be able to hit the API for this data and not have to run sql. This bug may be more of an RFE to use the API in this way, not sure.

Comment 11 Dan Heath 2017-02-13 16:53:47 UTC

Hi - I should point out that even using pagination and a fairly modest per_page size results in longer than expected response times:

$ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=20" -o /dev/null

real    0m5.258s
user    0m0.085s
sys     0m0.105s

$ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=50" -o /dev/null

real    0m9.011s
user    0m0.099s
sys     0m0.099s

$ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=100" -o /dev/null

real    0m22.034s
user    0m0.081s
sys     0m0.102s

Are those sort of numbers to be expected?  The net result is the same - a full host list can take 15-20 minutes to fetch.

Thanks - Dan.

Comment 13 Shimon Shtein 2017-02-19 19:00:22 UTC

The feature is tracked upstream: http://projects.theforeman.org/issues/3019. Unfortunately it didn't get enough traction till now. I will see if we can improve this sort of queries.

Take into consideration memory usage too - ~4500K records will result in pretty big memory footprint on server side.

I have to admit that on our lab setup, the timings were not that bad - around 1 min for 1000 hosts, or 12 sec for 100 hosts. It's an environment with ~5K hosts total.

From the attached log I can see that each 10 hosts request took around 3 seconds to complete.
I can see that it was running in parallel with GenerateApplicability tasks that are creating load on both memory and database on the server.
Worth trying the same script in a more "quiet" environment. Also I would recommend fetching 100 hosts at a time, since it would do less "handshakes" in order to get the results.

Comment 14 Satellite Program 2017-02-21 15:40:04 UTC

Connecting redmine issue http://projects.theforeman.org/issues/3019 from this bug

Comment 15 Shimon Shtein 2017-02-21 19:53:40 UTC

After investigation, I can see the reason that the query is so slow: It's a huge N+1 problem (around 100 sql queries for 10 hosts page).

The good news are that it's already solved (almost) in upstream:
https://github.com/theforeman/foreman/pull/3887 in Foreman core
https://github.com/Katello/katello/pull/6236 in Katello to fix the UI
and I am working on a similar patch to fix the API part.
The bad news are that we will have to backport it in order to get it in 6.2.z

Comment 20 Peter Vreman 2017-02-24 15:49:28 UTC

I confirm that sometimes it is better for performance to use 1 big query to retrieve all results at once than using multiple queries. This is because each new query has an additional latency.

Example from a reporting tool across multiple satellite instances i have created:
I modified the code to use a single query to grab all facts of all hosts using a query with per_page=999999. This takes 30 seconds to finish.
In the old sutation i queried for each of my 300 hosts. Each individual host query took in avarage 0.75 second, total than in average 250 seconds. the single big  takes only 30 seconds.
That is about 4 minutes faster, makes the reporting feel much more snappier and makes it usefull to run more frequent to see progress during large updates to the landscape.

Comment 21 Shimon Shtein 2017-02-26 07:52:17 UTC

Last part was merged in upstream: https://github.com/Katello/katello/pull/6619

Comment 24 Shimon Shtein 2017-02-28 18:54:44 UTC

For completeness sake, we should also add index to reports table as described in https://github.com/theforeman/foreman/pull/4204
This one can be cherrypicked automatically.

After applying all four patches, I see a significant improvement in response times.

Comment 25 Shimon Shtein 2017-03-05 09:25:09 UTC

*** Bug 1297683 has been marked as a duplicate of this bug. ***

Comment 29 Shimon Shtein 2017-03-14 18:20:50 UTC

After a test round on big database, found two issues that should be addressed too in order to get normal performance:
1. In core need to change eager_load to preload (already done as part of MR 218)
2. Add OpenSCAP's policies table to fix its N+1 query too (attached redmine #18841, hopefully the patch can be applied downstream automatically). Pull request https://github.com/theforeman/foreman_openscap/pull/247.

Comment 32 Shimon Shtein 2017-03-21 13:18:44 UTC

@chris, yes please, send me the backups. I will try to spin up a clean machine and I will test the changes there.

Comment 33 Shimon Shtein 2017-03-29 17:18:19 UTC

@chris, after applying everything from this BZ, we are left with only one N+1 query: foreman_tasks status. There is already an open upstream issue to track it.
I can't say that it's lightning fast now, but it looks pretty much OK to me.

Comment 39 Satellite Program 2017-04-24 18:14:17 UTC

Upstream bug assigned to sshtein

Comment 41 Shimon Shtein 2017-04-26 08:03:23 UTC

Summary:

The cause for slowness was identified as a N+1 query caused by output of fields from relations added by plugins, like Katello and OpenSCAP. To address this problem a change was made to foreman core [1] to allow plugins declaring needed relations before the query is executed. Once the framework was in place relevant plugins could declare the needed fields:
Katello [2]
OpenSCAP [3]


[1] http://projects.theforeman.org/issues/16646 and http://projects.theforeman.org/issues/18843
[2] http://projects.theforeman.org/issues/16010 and http://projects.theforeman.org/issues/18586
[3] http://projects.theforeman.org/issues/18841

Another issue that affected query performance was lack of index in reports table (http://projects.theforeman.org/issues/18136).

There is still one N+1 query in the code, it comes from remote execution status field. Since the fix is not trivial, it will be tracked in its own BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1445623

Comment 44 errata-xmlrpc 2017-05-01 13:59:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1191

Comment 45 Shimon Shtein 2017-11-26 11:03:19 UTC

*** Bug 1271274 has been marked as a duplicate of this bug. ***