Bug 971468
Summary: Request to search units times out while memory consumption is high

Product: [Retired] Pulp
Component: user-experience
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Version: 2.1.1
Target Release: 2.1.2
Hardware: Unspecified
OS: Unspecified
Reporter: Mike McCune <mmccune>
Assignee: Michael Hrivnak <mhrivnak>
QA Contact: Preethi Thomas <pthomas>
CC: jsherril, mmccune, skarmark
Keywords: Triaged
Type: Bug
Doc Type: Bug Fix
Clone Of: 960278
Bug Depends On: 960278
Bug Blocks: 950743
Last Closed: 2013-06-11 14:31:16 UTC
Description (Mike McCune, 2013-06-06 15:17:32 UTC)
We need some more context. What is the query you're running? The answer may be to limit the fields that are returned. For instance, I doubt you care to download the repomd snippets.

We actually do trim it to certain fields. For packages, for example:

    ['name', 'version', 'release', 'arch', 'suffix', 'epoch', 'download_url',
     'checksum', 'checksumtype', 'license', 'group', 'children', 'vendor',
     'filename', 'relativepath', 'requires', 'provides', 'description', 'size',
     'buildhost', '_id', '_content_type_id', '_href', '_storage_path', '_type']

I'm fairly certain the issue is with the large memory footprint, as the query works fine after a bounce of httpd. I can give you an exact query in a moment.

The memory footprint you mention, is that what's left hanging around after the syncs? If you do just steps b-d, is it still an issue?

We first do a POST to /pulp/api/v2/repositories/ACME_Corporation-DEV-View_62-TestProduct1-TestRepo/search/units/:

    {"criteria": {"type_ids": ["rpm"],
                  "fields": {"unit": [], "association": ["unit_id"]}}}

We then take the unit_ids from that list and fetch 200 at a time:

POST pulp/api/v2/content/units/rpm/search/

    {
      "criteria": {
        "fields": [
          "name", "version", "release", "arch", "suffix", "epoch",
          "download_url", "checksum", "checksumtype", "license", "group",
          "children", "vendor", "filename", "relativepath", "requires",
          "provides", "description", "size", "buildhost", "_id",
          "_content_type_id", "_href", "_storage_path", "_type"
        ],
        "filters": {
          "_id": {
            "$in": [
              "a9a56b83-e046-4f38-8523-c2e5a0719d8f",
              ".......TRIM.......",
              "13913de5-c74b-4228-822a-5df06501df0a"
            ]
          }
        }
      },
      "include_repos": true
    }

I am fairly certain the majority of the memory 'gain' is from steps B and C. Step A by itself does not seem to cause all that much.

Ok, let's take a look at the copy calls then. You're applying the field limiting that Mike Hrivnak told you about to the copy calls as well, correct?

These are the 4 unit copy calls we make for the 'content view publish':

    POST https://abed.usersys.redhat.com/pulp/api/v2/repositories/ACME_Corporation-Library-AnotherTestView-TestProduct1-TestRepo/actions/associate/
    {"source_repo_id": "ACME_Corporation-TestProduct1-TestRepo",
     "criteria": {"type_ids": ["rpm"], "filters": {},
                  "fields": {"unit": ["name", "epoch", "version", "release",
                                      "arch", "checksumtype", "checksum"]}}}

    POST https://abed.usersys.redhat.com/pulp/api/v2/repositories/ACME_Corporation-Library-AnotherTestView-TestProduct1-TestRepo/actions/associate/
    {"source_repo_id": "ACME_Corporation-TestProduct1-TestRepo",
     "criteria": {"type_ids": ["distribution"], "filters": {}}}

    POST https://abed.usersys.redhat.com/pulp/api/v2/repositories/ACME_Corporation-Library-AnotherTestView-TestProduct1-TestRepo/actions/associate/
    {"source_repo_id": "ACME_Corporation-TestProduct1-TestRepo",
     "criteria": {"type_ids": ["erratum"], "filters": {}},
     "override_config": {"copy_children": false}}

    POST https://abed.usersys.redhat.com/pulp/api/v2/repositories/ACME_Corporation-Library-AnotherTestView-TestProduct1-TestRepo/actions/associate/
    {"source_repo_id": "ACME_Corporation-TestProduct1-TestRepo",
     "criteria": {"type_ids": ["package_group"], "filters": {}},
     "override_config": {"copy_children": false}}

As you can see, we only specify the fields for RPMs.

Created attachment 757932 [details]
Publish log
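For readers following along, the two-step flow described above (an association search for unit IDs, then batched fetches of full units) can be sketched roughly as follows. This is a minimal illustration, not Katello's actual client code; the host, credentials, and trimmed field list are placeholders, and it assumes the Python requests library:

    # Rough sketch of the two-step flow described above (not Katello's
    # actual client). Host, repo id, and credentials are placeholders.
    import requests

    BASE = "https://pulp.example.com/pulp/api/v2"
    REPO = "ACME_Corporation-DEV-View_62-TestProduct1-TestRepo"
    AUTH = ("admin", "admin")   # placeholder credentials
    PAGE_SIZE = 200             # units fetched per request, per the comment above

    # Step 1: ask the association search for unit_ids only.
    resp = requests.post(
        BASE + "/repositories/" + REPO + "/search/units/",
        json={"criteria": {"type_ids": ["rpm"],
                           "fields": {"unit": [], "association": ["unit_id"]}}},
        auth=AUTH, verify=False)
    unit_ids = [assoc["unit_id"] for assoc in resp.json()]

    # Step 2: fetch the full unit documents in fixed-size batches,
    # limiting the returned fields (a trimmed placeholder list here).
    fields = ["name", "version", "release", "arch", "epoch",
              "checksum", "checksumtype", "filename", "_id"]
    units = []
    for start in range(0, len(unit_ids), PAGE_SIZE):
        batch = unit_ids[start:start + PAGE_SIZE]
        resp = requests.post(
            BASE + "/content/units/rpm/search/",
            json={"criteria": {"fields": fields,
                               "filters": {"_id": {"$in": batch}}},
              "include_repos": True},
            auth=AUTH, verify=False)
        units.extend(resp.json())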
Created attachment 757933 [details]
pulp refresh log with single-type unassociate.
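The fix discussed in the next comment amounts to passing the same field-limited criteria to the unassociate call that the first associate call above already uses. A minimal sketch, assuming the unassociate endpoint mirrors the associate path shown earlier; host and repo id are placeholders:

    # Sketch of a field-limited RPM unassociate request (endpoint assumed
    # to mirror the associate one shown earlier; host/repo are placeholders).
    import requests

    repo = "ACME_Corporation-Library-AnotherTestView-TestProduct1-TestRepo"
    requests.post(
        "https://pulp.example.com/pulp/api/v2/repositories/" + repo
        + "/actions/unassociate/",
        json={"criteria": {
            "type_ids": ["rpm"],
            "filters": {},
            # Limiting unit fields keeps the server from materializing full
            # RPM documents (with large requires/provides lists) in memory.
            "fields": {"unit": ["name", "epoch", "version", "release",
                                "arch", "checksumtype", "checksum"]}}},
        auth=("admin", "admin"), verify=False)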
Confirmed that unassociating RPMs takes a tremendous amount of RAM. You can pass the same criteria (with fields specified) as with the associate operation to keep the memory footprint down. Can you try adding the "fields" to RPM unassociate and see if this problem still exists?

So the memory issue seems to have gone away with the

a) sync RHEL 6.1, 6.2, 6.3, 6.4, 6server
b) copy them all to new repos
c) copy them all to new repos again (potentially repeat this a few times)
d) within each repo, request all the unit_ids of all rpms (see comment #4)
e) from that list of rpms, request the full units from unit_search (see comment #4)

These steps result in: at e), one of the requests for units will time out (take more than 30 seconds to complete). This isn't 100% reproducible, but seems to happen fairly often.

errr, missed the end of the last sentence: So the memory issue seems to have gone away with the unit unassociate change. Thanks!

Timeouts seem to still occur, at least for me.

Created attachment 759227 [details]
apache log for unit fetch requests
This is from /var/log/httpd/ssl_access_log, showing the activity for a series of requests identical to what katello does. The first request gets the unit IDs for all RPMs in a rhel6 repo, and then details about those are fetched 200 at a time. I am requesting the same fields as the katello requests.
Notice that one of the requests has a much larger response size than the others. On my system, both apache and my python client work very hard to generate, serialize, send, and deserialize that response, taking just over 30 seconds consistently.
You can see from the time stamps that this particular request took about 31 seconds, whereas most of the requests took less than 1 second.
I was able to eliminate the time delay of this particular request by not asking for the "provides" or "requires" fields. In that case, I was able to execute all of the requests in a total of 19s, compared to 64s previously. The particularly large response, previously around 46MB, is now 166KB.
I'm not sure if this is related to your timeouts, but it seems like a reasonable suspect. If you can manage without the requires and provides fields, that would help a lot. Otherwise, requesting smaller chunks may help.
Please advise if I should look into this further, or if you are satisfied to proceed with the advice given so far (specifying fields for unassociate, and the above notes about "requires" and "provides"). I am not able to reproduce the timeouts, so I would need more detail or access to a machine where it is reproducible in order to dig in further.

Michael,

So we currently index provides and requires information to make searching easier. We could reduce the page size that we request and see if that helps. I will try cutting it in half, down to ~100 units at a time. Thanks.
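Taken together, the two mitigations discussed in this thread would amount to small adjustments to the batched fetch sketched earlier; a hypothetical illustration:

    # Hypothetical adjustments to the earlier fetch sketch, combining the
    # two mitigations discussed above.
    PAGE_SIZE = 100   # halved from 200, as proposed in the last comment

    # Omitting "requires" and "provides" shrank the one oversized response
    # from ~46 MB to ~166 KB in the measurements above.
    fields = ["name", "version", "release", "arch", "suffix", "epoch",
              "download_url", "checksum", "checksumtype", "license", "group",
              "children", "vendor", "filename", "relativepath", "description",
              "size", "buildhost", "_id", "_content_type_id", "_href",
              "_storage_path", "_type"]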