Bug 1419667 - host list API call takes a long time
Summary: host list API call takes a long time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Hosts
Version: 6.2.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: Unspecified
Assignee: Shimon Shtein
QA Contact: Roman Plevka
URL:
Whiteboard:
Duplicates: 1271274 1297683 (view as bug list)
Depends On:
Blocks: 1296845 1426378
 
Reported: 2017-02-06 17:13 UTC by Chris Duryee
Modified: 2021-03-11 14:56 UTC (History)
CC List: 24 users

Fixed In Version: tfm-rubygem-katello-3.0.0.117-1,foreman-1.11.0.73-1,rubygem-katello-3.0.0.117-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-01 13:59:10 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Foreman Issue Tracker 18136 0 Normal Closed Reports type column should have an index 2020-10-05 20:42:44 UTC
Foreman Issue Tracker 18841 0 Normal Closed Avoid N+1 query in foreman hosts index by declaring proper scope 2020-10-05 20:42:44 UTC
Foreman Issue Tracker 18843 0 Normal Closed Change eager_load to includes statement in hosts api controller's index action 2020-10-05 20:42:44 UTC
Red Hat Bugzilla 1447958 0 urgent CLOSED [regression] host search by organization never finishes but causes mem.leak in foreman process 2021-03-11 15:11:49 UTC
Red Hat Knowledge Base (Solution) 1274723 0 None None None 2018-12-06 20:49:59 UTC
Red Hat Product Errata RHBA-2017:1191 0 normal SHIPPED_LIVE Satellite 6.2.9 Async Bug Release 2017-05-01 17:49:42 UTC

Internal Links: 1447958

Description Chris Duryee 2017-02-06 17:13:00 UTC
Description of problem:

The following call can take a very long time to complete:

curl -k -u admin:password -H "Accept:application/json,version=2" 'https://localhost/api/hosts?per_page=4500'

log result:

2017-02-06 11:46:58 [app] [I]   Rendered api/v2/hosts/index.json.rabl within api/v2/layouts/index_layout (267610.0ms)
2017-02-06 11:46:58 [app] [I] Completed 200 OK in 268842ms (Views: 133061.4ms | ActiveRecord: 134686.3ms)

It looks like a lot of the time is related to looking up errata info. It's not any one particular call; rather, the SQL statements run for each host really add up.

If there were an additional flag to provide only basic details without the full errata info, that would work too. Specifically, the following fields would be sufficient:

 * Hostname
 * Subscription status
 * Subnet name

Version-Release number of selected component (if applicable): 6.2.7

Comment 6 Daniel Lobato Garcia 2017-02-13 09:32:50 UTC
Chris - using per_page=4500 is definitely not a good idea. The whole reason the API is paginated is to avoid exactly these kinds of waits.

What kind of script uses this? Why can it not use the paginated responses?

Comment 9 Chris Duryee 2017-02-13 13:57:30 UTC
re #6, the tricky part is that for scripting purposes, some users may want to grab everything at once instead of paginating. This is a common issue for users who want to integrate their Satellite's host data with other systems and have a script that runs via a cron job to sync things over.

Usually for these purposes, it's OK to have a much smaller subset of the data. For example, even just a few fields (hostname and one or two others) may be needed. Also, it's OK if it takes a bit of time for the API to return (maybe 10-30 seconds).

Ideally, users would be able to hit the API for this data rather than running SQL directly against the database. This bug may be more of an RFE to use the API in this way, not sure.
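For the cron-style sync scripts described above, the paging loop itself is simple to get right. Below is a minimal sketch: the `paginate` helper and the `fetch_page` callable are illustrative names, not part of the Foreman API; in a real script `fetch_page` would wrap an HTTP GET of `/api/hosts?page=N&per_page=M` and return the page's `results` array.

```python
# Sketch of paging through a hosts API until the server runs out of results.
# paginate() and fetch_page() are hypothetical names for illustration.

def paginate(fetch_page, per_page=100):
    """Yield records one page at a time until an empty page comes back."""
    page = 1
    while True:
        results = fetch_page(page, per_page)
        if not results:
            break
        yield from results
        page += 1

# Stand-in for the real API call: 7 hosts served in pages.
def fake_fetch(page, per_page):
    hosts = [{"name": f"host{i}.example.com"} for i in range(7)]
    start = (page - 1) * per_page
    return hosts[start:start + per_page]

# Keep only the small subset of fields the report actually needs.
names = [h["name"] for h in paginate(fake_fetch, per_page=3)]
```

Separating the paging logic from the HTTP call also makes it easy to cap the per-page size and to retry a single failed page instead of the whole export.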

Comment 11 Dan Heath 2017-02-13 16:53:47 UTC
Hi - I should point out that even using pagination and a fairly modest per_page size results in longer than expected response times:

$ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=20" -o /dev/null

real    0m5.258s
user    0m0.085s
sys     0m0.105s

$ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=50" -o /dev/null

real    0m9.011s
user    0m0.099s
sys     0m0.099s

$ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=100" -o /dev/null

real    0m22.034s
user    0m0.081s
sys     0m0.102s

Are those the sort of numbers to be expected? The net result is the same: a full host list can take 15-20 minutes to fetch.

Thanks - Dan.

Comment 13 Shimon Shtein 2017-02-19 19:00:22 UTC
The feature is tracked upstream: http://projects.theforeman.org/issues/3019. Unfortunately it hasn't gained enough traction so far. I will see if we can improve this sort of query.

Take memory usage into consideration too - ~4500 records will result in a pretty big memory footprint on the server side.

I have to admit that on our lab setup, the timings were not that bad - around 1 min for 1000 hosts, or 12 sec for 100 hosts. It's an environment with ~5K hosts total.

From the attached log I can see that each 10-host request took around 3 seconds to complete.
I can also see that it was running in parallel with GenerateApplicability tasks, which load both memory and the database on the server.
It is worth trying the same script in a quieter environment. I would also recommend fetching 100 hosts at a time, since that requires fewer "handshakes" to get the results.
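The round-trip argument above is easy to put numbers on: with a fixed per-request overhead, the total handshake cost falls as per_page grows. A back-of-the-envelope sketch, where the 200 ms per-request figure is an assumed illustrative cost, not a measurement:

```python
import math

def round_trips(total_hosts, per_page, per_request_ms=200):
    """Number of API requests for a full export, and their fixed overhead.

    per_request_ms is an assumed per-request cost (TLS handshake, auth,
    routing), independent of how many rows the response carries.
    """
    requests = math.ceil(total_hosts / per_page)
    return requests, requests * per_request_ms

small_pages = round_trips(5000, 10)   # many tiny pages
large_pages = round_trips(5000, 100)  # fewer, larger pages
```

Going from 10-host to 100-host pages cuts the request count tenfold, which is exactly the "fewer handshakes" effect; the trade-off is the larger per-response memory footprint noted above.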

Comment 14 Satellite Program 2017-02-21 15:40:04 UTC
Connecting redmine issue http://projects.theforeman.org/issues/3019 from this bug

Comment 15 Shimon Shtein 2017-02-21 19:53:40 UTC
After investigation, I can see the reason the query is so slow: it's a huge N+1 problem (around 100 SQL queries for a 10-host page).

The good news is that it's already (almost) solved upstream:
https://github.com/theforeman/foreman/pull/3887 in Foreman core
https://github.com/Katello/katello/pull/6236 in Katello to fix the UI
and I am working on a similar patch to fix the API part.
The bad news is that we will have to backport it in order to get it into 6.2.z.
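The N+1 shape described in this comment can be reproduced in miniature. The sketch below uses an invented two-table SQLite schema (the real Foreman/Katello schema has many more per-host relations, which is how a 10-host page reaches ~100 queries): the naive loop issues one query for the page plus one per host, while the fixed version fetches all related rows for the page in a single IN query.

```python
import sqlite3

# Illustrative schema only - not the real Foreman/Katello tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hosts  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE errata (id INTEGER PRIMARY KEY, host_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO hosts VALUES (?, ?)",
                 [(i, f"host{i}") for i in range(10)])
conn.executemany("INSERT INTO errata (host_id, title) VALUES (?, ?)",
                 [(i, f"RHSA-{i}") for i in range(10)])

# N+1 pattern: one query for the page, then one more per host.
queries = 0
hosts = conn.execute("SELECT id, name FROM hosts").fetchall()
queries += 1
for host_id, _name in hosts:
    conn.execute("SELECT title FROM errata WHERE host_id = ?",
                 (host_id,)).fetchall()
    queries += 1
n_plus_one = queries  # 1 + 10 queries for a 10-host page

# Fixed pattern: fetch the whole page's errata in one IN query.
ids = [h[0] for h in hosts]
placeholders = ",".join("?" * len(ids))
conn.execute(
    f"SELECT host_id, title FROM errata WHERE host_id IN ({placeholders})",
    ids).fetchall()
batched = 2  # one query for hosts, one for all their errata
```

This batching is what Rails' `includes`/`preload` does under the hood, which is why the upstream pull requests above eliminate the per-host queries.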

Comment 20 Peter Vreman 2017-02-24 15:49:28 UTC
I confirm that it is sometimes better for performance to use one big query to retrieve all results at once than to use multiple queries, because each new query adds latency.

An example from a reporting tool I created that works across multiple Satellite instances:
I modified the code to grab all facts of all hosts in a single query with per_page=999999. This takes 30 seconds to finish.
In the old situation I queried each of my 300 hosts individually. Each host query took 0.75 seconds on average, around 250 seconds in total; the single big query takes only 30 seconds.
That is about 4 minutes faster, makes the reporting feel much snappier, and makes it useful to run more frequently to see progress during large updates to the landscape.

Comment 21 Shimon Shtein 2017-02-26 07:52:17 UTC
Last part was merged in upstream: https://github.com/Katello/katello/pull/6619

Comment 24 Shimon Shtein 2017-02-28 18:54:44 UTC
For completeness' sake, we should also add an index to the reports table as described in https://github.com/theforeman/foreman/pull/4204
This one can be cherry-picked automatically.

After applying all four patches, I see a significant improvement in response times.
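The effect of the reports-table index can be seen directly in a query planner. The sketch below uses SQLite for convenience (the Foreman database is PostgreSQL, where `EXPLAIN` plays the same role); the table and index names are modeled on the upstream change but simplified here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (id INTEGER PRIMARY KEY, type TEXT, host_id INTEGER)")
conn.executemany("INSERT INTO reports (type, host_id) VALUES (?, ?)",
                 [("ConfigReport", i) for i in range(1000)])

# Without an index, filtering on "type" scans the whole table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM reports WHERE type = 'ConfigReport'"
).fetchall()

conn.execute("CREATE INDEX index_reports_on_type ON reports (type)")

# With the index, the planner seeks directly to matching rows.
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM reports WHERE type = 'ConfigReport'"
).fetchall()

scan_before = any("SCAN" in row[-1] for row in plan_before)
uses_index = any("index_reports_on_type" in row[-1] for row in plan_after)
```

On a table with millions of report rows, turning that full scan into an index seek is what makes the per-host report lookups cheap.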

Comment 25 Shimon Shtein 2017-03-05 09:25:09 UTC
*** Bug 1297683 has been marked as a duplicate of this bug. ***

Comment 29 Shimon Shtein 2017-03-14 18:20:50 UTC
After a test round on a big database, I found two issues that should also be addressed in order to get normal performance:
1. In core, change eager_load to preload (already done as part of MR 218).
2. Add OpenSCAP's policies table to fix its N+1 query too (tracked in redmine #18841; hopefully the patch can be applied downstream automatically). Pull request: https://github.com/theforeman/foreman_openscap/pull/247

Comment 32 Shimon Shtein 2017-03-21 13:18:44 UTC
@chris, yes please, send me the backups. I will try to spin up a clean machine and test the changes there.

Comment 33 Shimon Shtein 2017-03-29 17:18:19 UTC
@chris, after applying everything from this BZ, we are left with only one N+1 query: foreman_tasks status. There is already an open upstream issue tracking it.
I can't say it's lightning fast now, but it looks pretty much OK to me.

Comment 39 Satellite Program 2017-04-24 18:14:17 UTC
Upstream bug assigned to sshtein@redhat.com

Comment 41 Shimon Shtein 2017-04-26 08:03:23 UTC
Summary:

The cause of the slowness was identified as an N+1 query produced by outputting fields from relations added by plugins such as Katello and OpenSCAP. To address this, a change was made to Foreman core [1] to allow plugins to declare the relations they need before the query is executed. Once the framework was in place, the relevant plugins could declare the needed fields:
Katello [2]
OpenSCAP [3]


[1] http://projects.theforeman.org/issues/16646 and http://projects.theforeman.org/issues/18843
[2] http://projects.theforeman.org/issues/16010 and http://projects.theforeman.org/issues/18586
[3] http://projects.theforeman.org/issues/18841

Another issue that affected query performance was the lack of an index on the reports table (http://projects.theforeman.org/issues/18136).

There is still one N+1 query in the code; it comes from the remote execution status field. Since the fix is not trivial, it will be tracked in its own BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1445623

Comment 44 errata-xmlrpc 2017-05-01 13:59:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1191

Comment 45 Shimon Shtein 2017-11-26 11:03:19 UTC
*** Bug 1271274 has been marked as a duplicate of this bug. ***

