Bug 1315006

Summary: Repository syncs are taking longer than they should
Product: Red Hat Satellite
Reporter: Pradeep Kumar Surisetty <psuriset>
Component: Pulp
Assignee: Chris Duryee <cduryee>
Status: CLOSED ERRATA
QA Contact: jcallaha
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.1.7
CC: alden.1, bbuckingham, bmbouter, cwelton, daviddavis, dkliban, ehelms, fcami, ggainey, ipanova, jcallaha, jentrena, jmatthew, jmontleo, jomitsch, lzap, mhrivnak, mmccune, oshtaier, pcreech, pdwyer, peter.vreman, rchan, riehecky, ttereshc, will_darton, zhunting
Target Milestone: Unspecified
Keywords: Reopened, Triaged
Target Release: Unused
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version: pulp-rpm-2.8.3.3-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-27 11:11:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1115190, 1122832, 1212602    
Attachments:
  60 seconds of strace output from a celery process (flags: none)

Description Pradeep Kumar Surisetty 2016-03-05 15:28:45 UTC
Description of problem:

1) Installed Satellite 6.1.7
2) Synced 5 repos concurrently (total content size: 104 GB) from a local repo server (a local repo created with CDN content):
 
       Red Hat Enterprise Linux 5 Server RPMs i386 5Server 
       Red Hat Enterprise Linux 6 Server RPMs x86_64 5Server
       Red Hat Enterprise Linux 5 Server RPMs i386 5Server
       Red Hat Enterprise Linux 6 Server RPMs x86_64 5Server
       Red Hat Enterprise Linux 7 Server RPMs x86_64 5Server

Celery workers are mostly at 100% CPU.

Version-Release number of selected component (if applicable):


How reproducible:

Always. 

Steps to Reproduce:
1. Concurrently sync the above repos to the Satellite server from a local repo server (with CDN content).

Actual results:

Celery workers remain at 100% CPU the entire time.

Expected results:

Celery workers should not be pegged at 100% CPU.
Additional info:

Comment 1 Pradeep Kumar Surisetty 2016-03-05 17:47:33 UTC
Satellite server hardware:

48 GB memory, 16 CPUs

Comment 2 Pradeep Kumar Surisetty 2016-03-06 08:16:16 UTC
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                           
25507 apache    20   0  904280  60848   2764 R 100.0  0.1   0:27.57 celery                                                            
25532 apache    20   0  912728  65332   4064 R 100.0  0.1   0:29.96 celery                                                            
25543 apache    20   0  879920  58036   2760 R 100.0  0.1   0:27.43 celery                                                            
25554 apache    20   0  891588  64620   3812 R 100.0  0.1   0:27.21 celery                                                            
25501 apache    20   0  905368  66956   2764 R  99.7  0.1   0:27.12 celery

Comment 3 Pradeep Kumar Surisetty 2016-03-06 12:45:26 UTC
strace excerpt:

futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sendto(4, "\315\0\0\0004d\f\0\0\0\0\0\324\7\0\0\0\0\0\0pulp_databas"..., 205, 0, NULL, 0) = 205
recvfrom(4, "$\0\0\0mO\3\0004d\f\0\1\0\0\0", 16, 0, NULL, NULL) = 16
recvfrom(4, "\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20, 0, NULL, NULL) = 20
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)

Comment 4 Brian Bouterse 2016-03-10 15:25:59 UTC
Pradeep and I looked at his machine, and although it shows very high CPU usage, the system is under heavy load overall. For instance, mongodb is also at around 70%. Also, when the work dispatched into Pulp from Katello is halted, the CPU usage of the processes returns to 0. This supports the view that the system is legitimately under high load.

Pradeep is going to look into the issue a bit more to see whether he can identify anything abnormal, using some debugging techniques I shared with him.

Comment 5 Julio Entrena Perez 2016-05-24 11:12:15 UTC
Created attachment 1160983 [details]
60 seconds of strace output from a celery process

Collected while a content sync was running.

Lots of futex() calls and also lots of open+read+close on /dev/urandom.

All 4 CPUs were at 95% - 100% usage by celery processes:

Tasks: 283 total,   5 running, 278 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.0 us,  1.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16266844 total,  5101944 free,  3781704 used,  7383196 buff/cache
KiB Swap:  4194300 total,  4194300 free,        0 used. 11964672 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1795 apache    20   0 1209708 241740   4160 R 100.0  1.5   1026:02 celery
 1801 apache    20   0 1214232 244360   4176 R  99.3  1.5   1026:43 celery
 1796 apache    20   0 1248636 282964   4176 R  97.0  1.7   1034:28 celery
 1793 apache    20   0 1206392 238628   4160 R  95.7  1.5   1029:09 celery
 1357 mongodb   20   0 18.107g 2.312g 1.614g S   4.0 14.9 115:12.10 mongod
 1799 apache    20   0  666360  57984   3584 S   1.0  0.4  11:27.28 celery
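
For reference, a trace like the attached one can be collected with something along these lines (illustrative only; the PID and output path here are examples, not the ones actually used):

    # Capture ~60 seconds of strace output from one busy celery worker.
    # Requires root and the strace package.
    import subprocess

    pid = 1795  # example: one of the celery PIDs shown in the top output above
    subprocess.call(['timeout', '60', 'strace', '-f', '-tt', '-p', str(pid),
                     '-o', '/tmp/celery-strace.txt'])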

Comment 6 Julio Entrena Perez 2016-05-25 11:03:49 UTC
strace summary on one of the offending celery processes:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 19.25    0.020265           0    235819           gettimeofday
 18.25    0.019208           0     74436           read
 17.91    0.018851           0     74436           open
 11.34    0.011936           0     74436           close
 10.47    0.011025           0    148872           fcntl
  5.67    0.005967           0     74436           gettid
  5.40    0.005687           0     74436           getppid
  5.11    0.005377           0     74436           getuid
  4.64    0.004888         233        21           sendto
  1.50    0.001576          34        46           recvfrom
  0.45    0.000471           0      3584         9 futex
  0.00    0.000000           0         5           poll
------ ----------- ----------- --------- --------- ----------------
100.00    0.105251                834963         9 total

So 74.15% (read+open+close+fcntl+gettid+getppid+getuid) of the time is spent reading /dev/urandom.

This seems to come from uuid4() in /usr/lib64/python2.7/uuid.py:

    # Otherwise, get randomness from urandom or the 'random' module.
    try:
        import os
        return UUID(bytes=os.urandom(16), version=4)
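
On Python 2.7, os.urandom() opens /dev/urandom on every call, so each uuid4() costs an open+read+close, which lines up with the counts above (74436 each of open, read and close). A minimal sketch to reproduce the same syscall pattern in isolation, e.g. under "strace -c -e trace=open,read,close" (illustrative only; the loop count is arbitrary):

    # Generate UUIDs in a tight loop; on Python 2.7 each uuid4() call reopens
    # /dev/urandom, while newer Python 3 releases cache the descriptor or use
    # getrandom() and avoid the per-call open.
    import uuid

    for _ in range(100000):
        uuid.uuid4()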

Comment 8 John Mitsch 2016-05-25 18:58:12 UTC
I'm renaming this bug to a more general title to avoid confusion.

Comment 10 Lukas Zapletal 2016-06-01 09:48:14 UTC
I also hit this when testing the 6.2 GA drops, trying to sync many "on-demand" (aka "lazy") repositories. I see many celery processes consuming CPU, but progress is very slow; after two days of work, the machine had not managed to sync much data.

I can also help by providing a reproducer server.

Comment 11 pulp-infra@redhat.com 2016-06-01 13:30:21 UTC
The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.

Comment 12 pulp-infra@redhat.com 2016-06-01 13:30:26 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 13 Brad Buckingham 2016-06-13 19:21:10 UTC
Closing as a duplicate based on https://bugzilla.redhat.com/show_bug.cgi?id=1337226#c6 .

*** This bug has been marked as a duplicate of bug 1337226 ***

Comment 15 Brad Buckingham 2016-06-13 20:44:42 UTC
*** Bug 1337226 has been marked as a duplicate of this bug. ***

Comment 16 pulp-infra@redhat.com 2016-06-20 14:30:31 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 17 pulp-infra@redhat.com 2016-06-20 14:30:37 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 18 pulp-infra@redhat.com 2016-06-24 17:17:29 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 19 pulp-infra@redhat.com 2016-06-24 20:30:32 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 20 Julio Entrena Perez 2016-06-27 09:00:57 UTC
https://pulp.plan.io/issues/1949#note-14 says: "Assuming the performance problem is in querying the database ..."

Comments 5 and 6 in this bug suggest that's not where the problem is.

Comment 21 Brian Bouterse 2016-06-27 14:09:37 UTC
There was a regression which caused a specific, database-related performance problem during errata publishing. That was causing the symptom of repository sync and publish taking too long. Beyond that we want even more performance, but that is a separate issue from fixing the specific regression.

Comment 22 Julio Entrena Perez 2016-06-27 14:14:59 UTC
(In reply to Brian Bouterse from comment #21)
> but that is a separate issue from fixing the specific
> regression.

I fail to see how comments 0, 1, 2, 3, 5 and 6 of this BZ (the problem as originally reported) are unrelated to it.

Comment 24 Brian Bouterse 2016-06-27 15:00:34 UTC
This BZ is tracking the inclusion of a specific upstream fix (see the related Pulp bug). The original claim on this bug is so broad that it can't be attributed to just one root cause. Given that, I can see how disassociating this downstream bug from the upstream bug could be right. If we do that, we'll need a new bug to track the inclusion of the upstream fix.

If this bug does get disconnected from the upstream bug, then I suggest it be rewritten to be more specific, or closed. For instance, the title says "taking longer than they should"; I'm not sure what "should" means there.

Comment 25 Julio Entrena Perez 2016-06-27 15:29:16 UTC
(In reply to Brian Bouterse from comment #24)
> I'm not sure what "should" means there.

As per above comment: "So 74.15% (read+open+close+fcntl+gettid+getppid+getuid) of the time is spent reading /dev/urandom."

I reported the problem and provided the output and summary of strace collected while the problem occurs.
Could you please clarify what additional actions are required from me in order for this to progress?

Comment 26 Brian Bouterse 2016-06-27 16:27:27 UTC
I think the next step is a priorities discussion. I recommend you raise this issue with Sat6 management so that it can be prioritized correctly. I believe this request is more a feature request than a fix for an isolated bug.

Comment 28 Michael Hrivnak 2016-06-28 15:28:55 UTC
Many people observed a performance regression in 6.2 during the "sync" workflow. What katello and Satellite call a "sync", pulp calls a sync plus one or more publish tasks. We identified a major regression in publish that was causing the Satellite "sync" to take much longer than it did in the past. Here is a similar report that was considered a duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1337226

If you were specifically concerned about the high CPU usage during pulp's sync tasks only, it's possible this BZ was incorrectly associated with that regression others observed.

We identified that a particular part of the publish workflow had regressed dramatically in performance. It increased publish times in many cases by a factor of 10+.

Brian fixed that issue, and now pulp's publish performance is back on-par with prior expectations. 6.2 will likely perform better than 6.1 because the "nodes" feature is no longer in use, so there is no need to publish all of the node data.

If there are other known performance regressions, it would be helpful to file new bug reports, and we will dig into them.

Looking specifically at the uuid generation that was observed, how do we know it's a problem? It does appear to be a bottleneck (likely competing with network bandwidth to be the limiting factor during a sync), but that does not necessarily mean it is a problem. During sync, pulp is often saving dozens of documents to the database per second, and each of those needs at least one unique ID. That could easily explain all of the uuid generation. This may be an area where pulp could be improved, so thank you for helping us identify that bottleneck. But is it a regression? Do we have any other reason to believe it needs to be fixed? Or have we just identified the nature of the work that sync needs to perform?
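
One way to help answer the "do we know it's a problem?" question is to measure what the UUID generation alone costs and compare it against the observed sync wall time. A rough, illustrative sketch (the document count is a made-up stand-in, not a measured number):

    # Estimate the wall time spent in uuid4() for a sync that saves n documents.
    import timeit

    n = 100000  # hypothetical number of documents saved during one sync
    total = timeit.timeit('uuid.uuid4()', setup='import uuid', number=n)
    print('uuid4: %.1f us/call, ~%.1f s for %d documents'
          % (total / n * 1e6, total, n))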

Comment 29 Julio Entrena Perez 2016-06-29 09:25:01 UTC
I don't seem to be able to reproduce the problem described in the first updates when using beta 2.

Comment 30 Brian Bouterse 2016-06-29 09:41:55 UTC
Julio, thank you for the testing and information. We want to make sure this works well.

I'm not 100% sure, but it's likely that beta 2 doesn't have the upstream fix cherry-picked yet. If not, a future beta could perform even better!

Does anyone know if beta 2 contains the cherry-pick from upstream for issue #1949? Also, how can I answer that question on my own?

Comment 31 Julio Entrena Perez 2016-06-29 09:53:57 UTC
Which package(s) and version(s) do I need to check?

Happy to re-test with newer packages to ensure that it still works fine with those.

Comment 32 pulp-infra@redhat.com 2016-07-04 02:30:31 UTC
The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug.

Comment 33 jcallaha 2016-07-07 13:50:45 UTC
Verified in Satellite 6.2 Beta Snap 19.

Sync times are well within reasonable margins.

Details                        Duration
New packages: 344 (434 MB)     2 minutes
New packages: 336 (478 MB)     2 minutes
New packages: 89 (1.19 GB)     3 minutes
New packages: 735 (2.37 GB)    18 minutes
New packages: 129 (3.18 GB)    2 minutes
New packages: 7983 (8.44 GB)   36 minutes

Comment 34 pulp-infra@redhat.com 2016-07-18 19:30:59 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 35 Bryan Kearney 2016-07-27 11:11:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1501

Comment 36 Brad Buckingham 2016-08-02 23:36:10 UTC
*** Bug 1335512 has been marked as a duplicate of this bug. ***