Bug 1669186

Summary: Manifest upload task takes too much time
Product: Red Hat Satellite
Component: Subscription Management
Version: 6.5.0
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: Roman Plevka <rplevka>
Assignee: Partha Aji <paji>
QA Contact: Roman Plevka <rplevka>
CC: ehelms, jomitsch, jsherril, ktordeur, mjia, mmccune, nkathole, pmoravec, rplevka
Keywords: Performance, PrioBumpQA, Regression, Triaged
Target Milestone: 6.5.0
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Fixed In Version: tfm-rubygem-katello-3.10.0.24-1
Doc Type: If docs needed, set a value
Clones: 1669216 1684703 1698513
Bug Blocks: 1669216
Type: Bug
Last Closed: 2019-05-14 12:39:53 UTC
Attachments: traceback

Comment 4 Roman Plevka 2019-01-24 15:32:46 UTC
Here are some timings for 10 consecutive manifest imports:

209.78s manifest_upload.py::test_manifest_upload[1/10]
213.80s manifest_upload.py::test_manifest_upload[2/10]
224.00s manifest_upload.py::test_manifest_upload[3/10]
228.95s manifest_upload.py::test_manifest_upload[4/10]
229.30s manifest_upload.py::test_manifest_upload[5/10]
219.14s manifest_upload.py::test_manifest_upload[6/10]
226.37s manifest_upload.py::test_manifest_upload[7/10]
229.08s manifest_upload.py::test_manifest_upload[8/10]
228.91s manifest_upload.py::test_manifest_upload[9/10]
234.40s manifest_upload.py::test_manifest_upload[10/10]

Comment 6 Brad Buckingham 2019-01-25 16:13:30 UTC
*** Bug 1669216 has been marked as a duplicate of this bug. ***

Comment 11 Mike McCune 2019-01-31 17:53:25 UTC
All the below are on identical hardware, importing the manifest attached to this bug:

* 6.3.5 Satellite: 1m36sec

* 6.4.1 Satellite with external database: 2m49s

* 6.5 SNAP 6: 0m33s

* 6.5 SNAP 13: 3m53s

There is a definite gradual progression getting worse. The 0m33s almost looks like an anomaly, and I wonder if we were skipping something critical, but we should at least get it back to the same level as 6.3, in the 1.5-minute range, if we can.

Comment 12 John Mitsch 2019-01-31 18:05:19 UTC
We have added steps to the manifest import, such as storing the product content in the database: https://github.com/Katello/katello/commit/946b2990c9a054babc9ad563e1b14c4161207b3e#diff-17c58a600ab07af9fd58f9c4ccb66be4. This provides better searching and avoids calls to Candlepin later; my understanding is that this is a trade-off of manifest import time for better searching and fewer backend calls later.
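
For illustration only, the "storing the product content in the database" step might look roughly like the sketch below; the class, model, and attribute names are hypothetical and do not reflect Katello's actual schema:

# Hypothetical sketch: persist the product content returned by Candlepin so
# that later searches can query the local database instead of calling
# Candlepin again. Class, model, and attribute names are illustrative only.
class ManifestImporter
  def store_product_content(candlepin_products)
    candlepin_products.each do |cp_product|
      product = Product.find_or_create_by(cp_id: cp_product['id']) do |p|
        p.name = cp_product['name']
      end

      cp_product.fetch('productContent', []).each do |pc|
        ProductContent.find_or_create_by(product: product,
                                         cp_content_id: pc['content']['id']) do |content|
          content.name    = pc['content']['name']
          content.enabled = pc['enabled']
        end
      end
    end
  end
end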

This and other changes have been around since Katello 3.6, so they may help to explain the upward trend in manifest import time, but they don't explain the recent change in the 6.5 snaps. I'm also wondering if something had been missing or skipped in the previous 6.5 snaps that is now functioning in the latest snaps.

Roman, do we know for sure the time regressed in the 6.4 snap? I see per the following comment that the times were low in 6.5 snaps: https://bugzilla.redhat.com/show_bug.cgi?id=1655981#c5

But having data on the 6.4 snaps and releases would help tell us more. From Mike's testing it seems the manifest took ~3 mins in 6.4, so there doesn't seem to be a change.

Comment 16 Justin Sherrill 2019-02-05 03:03:09 UTC
Created redmine issue https://projects.theforeman.org/issues/25981 from this bug

Comment 18 Bryan Kearney 2019-02-08 05:03:49 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue https://projects.theforeman.org/issues/25981 has been resolved.

Comment 20 Mike McCune 2019-03-01 22:16:34 UTC
This bug was cloned and is still going to be included in the 6.4.3 release. It no longer has the sat-6.4.z+ flag and the 6.4.3 Target Milestone set, which are now on the 6.4.z cloned bug. Please see the Clones field to track the progress of this bug in the 6.4.3 release.

Comment 22 Justin Sherrill 2019-03-06 15:12:40 UTC
Hey Roman, 

There are several steps to the manifest import process, and I didn't compare the overall time (including the Candlepin import), but if you look at the Actions::Candlepin::Owner::ImportProducts step within the Dynflow console of the manifest import Foreman task, it was taking a large portion of the overall time. That particular step should see a speedup of 70-90%.

Jonathan did some overall timings and you can find them here: https://github.com/Katello/katello/pull/7963#issuecomment-461497814

manifest import prior (clean environment): 350 seconds
manifest refresh prior: 90 seconds

manifest refresh with change: 43 seconds
manifest import with change (existing content): 36 seconds
manifest import with change (no content): 50 seconds
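
A change of this kind typically means replacing many per-record INSERTs with one bulk INSERT. A minimal sketch, assuming the activerecord-import gem and hypothetical model and variable names (not the actual Katello patch):

# Illustrative only: replacing N individual INSERTs with a single bulk INSERT
# via the activerecord-import gem.

# Before: one INSERT (and one database round trip) per record.
product_content_rows.each do |row|
  ProductContent.create!(row)
end

# After: build the records in memory and insert them in one statement.
records = product_content_rows.map { |row| ProductContent.new(row) }
ProductContent.import(records, validate: false)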

Comment 23 Roman Plevka 2019-03-19 13:56:43 UTC
Hello Justin,
so I tried to verify this with the same test that produced the earlier timings and the new timings look really great!

21.97s call     manifest_upload.py::test_manifest_upload[1/10]
15.51s call     manifest_upload.py::test_manifest_upload[2/10]
10.38s call     manifest_upload.py::test_manifest_upload[3/10]
10.57s call     manifest_upload.py::test_manifest_upload[4/10]
15.53s call     manifest_upload.py::test_manifest_upload[5/10]
12.35s call     manifest_upload.py::test_manifest_upload[6/10]
10.55s call     manifest_upload.py::test_manifest_upload[7/10]
10.51s call     manifest_upload.py::test_manifest_upload[8/10]
10.48s call     manifest_upload.py::test_manifest_upload[9/10]
10.41s call     manifest_upload.py::test_manifest_upload[10/10]

However, the last iteration errored out with the following message:

PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_katello_pools_on_cp_id"
DETAIL:  Key (cp_id)=(4028f9536995fe480169960bde580a6c) already exists.
: INSERT INTO "katello_pools" ("cp_id", "created_at", "updated_at", "organization_id") VALUES ($1, $2, $3, $4) RETURNING "id"

Is it possible that this might be related to the changes you made?

- I tried to run the test (in batches of 10) several times, and this error occurs in almost every batch. The operations are carried out synchronously.

Comment 24 Justin Sherrill 2019-03-19 14:05:30 UTC
Hey Roman, 

That error shouldn't be related to this, as this change converted products and product content to use bulk importing, but the error is related to 'pools'. I'm happy to take a closer look though; do you have a traceback?

It's possible that my change caused the error to occur because a step that was taking a long time now isn't, which exposed an existing race condition. The more I think about it, I would put my money on that. I think the fix should be fairly simple, if you want to handle it as part of this BZ.

Justin

Comment 25 Nikhil Kathole 2019-03-19 14:10:52 UTC
I see bz [1] for the same error.

Thanks.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1686604

Comment 26 Roman Plevka 2019-03-19 16:00:16 UTC
Created attachment 1545740 [details]
traceback

attaching the requested traceback

Comment 27 Justin Sherrill 2019-03-19 18:02:34 UTC
If anyone is open to testing the upstream PR: https://github.com/Katello/katello/pull/8034

it would be very helpful; it's hard to reproduce since it's a race condition.

Comment 28 Roman Plevka 2019-03-20 10:38:50 UTC
(In reply to Justin Sherrill from comment #27)
> If anyone is open to testing the upstream pr:
> https://github.com/Katello/katello/pull/8034
> 
> it would be very helpful,  its hard to reproduce as a race condition

Thanks for the quick fix; however, 3 retries might not be enough:
https://github.com/Katello/katello/pull/8034#issuecomment-474775055
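
For illustration, a bounded-retry guard of the kind discussed here might look roughly like the sketch below; the model name, method name, and retry limit are assumptions, not the actual upstream fix:

# Hypothetical sketch: retry pool creation when two concurrent imports race
# on the unique index (index_katello_pools_on_cp_id). If another process
# inserts the same cp_id between our SELECT and INSERT, the database raises
# a unique-constraint violation, which Rails surfaces as RecordNotUnique;
# retrying lets find_or_create_by! return the row the other process created.
MAX_POOL_RETRIES = 3  # per the comment above, this limit may need to be higher

def find_or_create_pool(cp_id, organization)
  attempts = 0
  begin
    Pool.find_or_create_by!(cp_id: cp_id, organization_id: organization.id)
  rescue ActiveRecord::RecordNotUnique
    attempts += 1
    retry if attempts < MAX_POOL_RETRIES
    raise
  end
end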

Comment 29 Roman Plevka 2019-04-05 06:39:56 UTC
Justin,
will you include the race condition fix mentioned in Comment #27, so I can properly test and Verify this BZ?

Comment 31 Brad Buckingham 2019-04-05 18:00:48 UTC
Hi Roman,

I am clearing the needinfo for Justin and placing this back to ON_QA.

The fix that was referenced in comment 27 was associated with bug 1686604. That bug was delivered in snap 22 and has since been verified.

Comment 32 Roman Plevka 2019-04-08 12:07:37 UTC
Thank you for confirmation.

With the commit in place, we are now good to put this to VERIFIED.

Comment 33 Mike McCune 2019-04-23 15:46:03 UTC
*** Bug 1698513 has been marked as a duplicate of this bug. ***

Comment 34 Partha Aji 2019-05-02 13:58:12 UTC
Connecting redmine issue https://projects.theforeman.org/issues/26733 from this bug

Comment 37 errata-xmlrpc 2019-05-14 12:39:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:1222