Bug 1419667
| Summary: | host list API call takes a long time | ||
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Chris Duryee <cduryee> |
| Component: | Hosts | Assignee: | Shimon Shtein <sshtein> |
| Status: | CLOSED ERRATA | QA Contact: | Roman Plevka <rplevka> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 6.2.7 | CC: | akofink, bbuckingham, bkearney, cdonnell, cduryee, chorn, dan, dlobatog, dzhukous, egolov, fgarciad, hmore, inecas, jcallaha, jwildman, kelly.brown1, mtenheuv, peter.vreman, pgervase, rplevka, sigbjorn, sshtein, stkachen, zhunting |
| Target Milestone: | Unspecified | Keywords: | Performance, PrioBumpField, Triaged |
| Target Release: | Unused | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | tfm-rubygem-katello-3.0.0.117-1,foreman-1.11.0.73-1,rubygem-katello-3.0.0.117-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-05-01 13:59:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1296845, 1426378 | ||
|
Description
Chris Duryee
2017-02-06 17:13:00 UTC
Chris - using per_page=4500 is definitely not a good idea. The whole reason why the API is paginated is to avoid these kind of waits. What kind of script uses this? Why can it not use the paginated responses? re #6, the tricky part is that for scripting purposes, some users may want to just grab everything at once instead of paginating. This is a common issue for users who want to integrate their Satellite's host data with other systems, and have a script that runs via cron job to sync things over. Usually for these purposes, it's OK to have a much smaller subset of the data. For example, even just a few fields (hostname and one or two others) may be needed. Also, it's OK if it takes a bit of time for the API to return (maybe 10-30 seconds). Ideally, users would be able to hit the API for this data and not have to run sql. This bug may be more of an RFE to use the API in this way, not sure. Hi - I should point out that even using pagination and a fairly modest per_page size results in longer than expected response times: $ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=20" -o /dev/null real 0m5.258s user 0m0.085s sys 0m0.105s $ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=50" -o /dev/null real 0m9.011s user 0m0.099s sys 0m0.099s $ time curl -s "https://satellite/api/v2/hosts?page=1&per_page=100" -o /dev/null real 0m22.034s user 0m0.081s sys 0m0.102s Are those sort of numbers to be expected? The net result is the same - a full host list can take 15-20 minutes to fetch. Thanks - Dan. The feature is tracked upstream: http://projects.theforeman.org/issues/3019. Unfortunately it didn't get enough traction till now. I will see if we can improve this sort of queries. Take into consideration memory usage too - ~4500K records will result in pretty big memory footprint on server side. I have to admit that on our lab setup, the timings were not that bad - around 1 min for 1000 hosts, or 12 sec for 100 hosts. It's an environment with ~5K hosts total. From the attached log I can see that each 10 hosts request took around 3 seconds to complete. I can see that it was running in parallel with GenerateApplicability tasks that are creating load on both memory and database on the server. Worth trying the same script in a more "quiet" environment. Also I would recommend fetching 100 hosts at a time, since it would do less "handshakes" in order to get the results. Connecting redmine issue http://projects.theforeman.org/issues/3019 from this bug After investigation, I can see the reason that the query is so slow: It's a huge N+1 problem (around 100 sql queries for 10 hosts page). The good news are that it's already solved (almost) in upstream: https://github.com/theforeman/foreman/pull/3887 in Foreman core https://github.com/Katello/katello/pull/6236 in Katello to fix the UI and I am working on a similar patch to fix the API part. The bad news are that we will have to backport it in order to get it in 6.2.z I confirm that sometimes it is better for performance to use 1 big query to retrieve all results at once than using multiple queries. This is because each new query has an additional latency. Example from a reporting tool across multiple satellite instances i have created: I modified the code to use a single query to grab all facts of all hosts using a query with per_page=999999. This takes 30 seconds to finish. In the old sutation i queried for each of my 300 hosts. Each individual host query took in avarage 0.75 second, total than in average 250 seconds. the single big takes only 30 seconds. That is about 4 minutes faster, makes the reporting feel much more snappier and makes it usefull to run more frequent to see progress during large updates to the landscape. Last part was merged in upstream: https://github.com/Katello/katello/pull/6619 For completeness sake, we should also add index to reports table as described in https://github.com/theforeman/foreman/pull/4204 This one can be cherrypicked automatically. After applying all four patches, I see a significant improvement in response times. *** Bug 1297683 has been marked as a duplicate of this bug. *** After a test round on big database, found two issues that should be addressed too in order to get normal performance: 1. In core need to change eager_load to preload (already done as part of MR 218) 2. Add OpenSCAP's policies table to fix its N+1 query too (attached redmine #18841, hopefully the patch can be applied downstream automatically). Pull request https://github.com/theforeman/foreman_openscap/pull/247. @chris, yes please, send me the backups. I will try to spin up a clean machine and I will test the changes there. @chris, after applying everything from this BZ, we are left with only one N+1 query: foreman_tasks status. There is already an open upstream issue to track it. I can't say that it's lightning fast now, but it looks pretty much OK to me. Upstream bug assigned to sshtein Summary: The cause for slowness was identified as a N+1 query caused by output of fields from relations added by plugins, like Katello and OpenSCAP. To address this problem a change was made to foreman core [1] to allow plugins declaring needed relations before the query is executed. Once the framework was in place relevant plugins could declare the needed fields: Katello [2] OpenSCAP [3] [1] http://projects.theforeman.org/issues/16646 and http://projects.theforeman.org/issues/18843 [2] http://projects.theforeman.org/issues/16010 and http://projects.theforeman.org/issues/18586 [3] http://projects.theforeman.org/issues/18841 Another issue that affected query performance was lack of index in reports table (http://projects.theforeman.org/issues/18136). There is still one N+1 query in the code, it comes from remote execution status field. Since the fix is not trivial, it will be tracked in its own BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1445623 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1191 *** Bug 1271274 has been marked as a duplicate of this bug. *** |