Description of problem:
1) Installed Satellite 6.1.7
2) Synced 5 repos concurrently (total content size: 104 GB) from a local repo server (created a local repo with CDN content):
   Red Hat Enterprise Linux 5 Server RPMs i386 5Server
   Red Hat Enterprise Linux 6 Server RPMs x86_64 6Server
   Red Hat Enterprise Linux 5 Server RPMs i386 5Server
   Red Hat Enterprise Linux 6 Server RPMs x86_64 6Server
   Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server
Celery workers are mostly at 100% CPU.

Version-Release number of selected component (if applicable):

How reproducible:
Always.

Steps to Reproduce:
1. Concurrently sync the above repos to the Satellite server from the local repo server (with CDN content)

Actual results:
Celery workers reach 100% CPU all the time.

Expected results:
Celery should not reach 100% CPU.

Additional info:
Satellite server hardware: 48 GB memory, 16 CPUs.
  PID USER     PR NI   VIRT   RES  SHR S  %CPU %MEM   TIME+ COMMAND
25507 apache   20  0 904280 60848 2764 R 100.0  0.1 0:27.57 celery
25532 apache   20  0 912728 65332 4064 R 100.0  0.1 0:29.96 celery
25543 apache   20  0 879920 58036 2760 R 100.0  0.1 0:27.43 celery
25554 apache   20  0 891588 64620 3812 R 100.0  0.1 0:27.21 celery
25501 apache   20  0 905368 66956 2764 R  99.7  0.1 0:27.12 celery
strace:

futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2d1fa90, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sendto(4, "\315\0\0\0004d\f\0\0\0\0\0\324\7\0\0\0\0\0\0pulp_databas"..., 205, 0, NULL, 0) = 205
recvfrom(4, "$\0\0\0mO\3\0004d\f\0\1\0\0\0", 16, 0, NULL, NULL) = 16
recvfrom(4, "\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20, 0, NULL, NULL) = 20
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
Pradeep and I looked at his machine, and although it is showing very high CPU usage, the system is under heavy load overall. For instance, mongodb is also running at 70% CPU. Also, when the work dispatched into Pulp from Katello is halted, the CPU usage of the processes returns to 0. This supports the conclusion that the system is legitimately under high load. Pradeep is going to look into the issue a little more to see if he can identify anything abnormal, using some debugging techniques I shared with him.
Created attachment 1160983 [details]
60 seconds of strace output from a celery process

Collected while a content sync was running. Lots of futex() calls and also lots of open+read+close on /dev/urandom.

All 4 CPUs were at 95% - 100% usage by celery processes:

Tasks: 283 total,   5 running, 278 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.0 us,  1.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16266844 total,  5101944 free,  3781704 used,  7383196 buff/cache
KiB Swap:  4194300 total,  4194300 free,        0 used. 11964672 avail Mem

  PID USER     PR NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1795 apache   20  0 1209708 241740   4160 R 100.0  1.5   1026:02 celery
 1801 apache   20  0 1214232 244360   4176 R  99.3  1.5   1026:43 celery
 1796 apache   20  0 1248636 282964   4176 R  97.0  1.7   1034:28 celery
 1793 apache   20  0 1206392 238628   4160 R  95.7  1.5   1029:09 celery
 1357 mongodb  20  0 18.107g 2.312g 1.614g S   4.0 14.9 115:12.10 mongod
 1799 apache   20  0  666360  57984   3584 S   1.0  0.4  11:27.28 celery
strace summary on one of the offending celery processes:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 19.25    0.020265           0    235819           gettimeofday
 18.25    0.019208           0     74436           read
 17.91    0.018851           0     74436           open
 11.34    0.011936           0     74436           close
 10.47    0.011025           0    148872           fcntl
  5.67    0.005967           0     74436           gettid
  5.40    0.005687           0     74436           getppid
  5.11    0.005377           0     74436           getuid
  4.64    0.004888         233        21           sendto
  1.50    0.001576          34        46           recvfrom
  0.45    0.000471           0      3584         9 futex
  0.00    0.000000           0         5           poll
------ ----------- ----------- --------- --------- ----------------
100.00    0.105251                834963         9 total

So 74.15% (read+open+close+fcntl+gettid+getppid+getuid) of the time is spent reading /dev/urandom.

This seems to come from uuid4() in /usr/lib64/python2.7/uuid.py:

    # Otherwise, get randomness from urandom or the 'random' module.
    try:
        import os
        return UUID(bytes=os.urandom(16), version=4)
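For anyone who wants to confirm that overhead independently, here is a minimal benchmark sketch (the script name and call count are illustrative assumptions, not taken from the affected system). On a Python 2.7 build where uuid4() falls back to os.urandom(16), as the quoted code path suggests, running it under "strace -c python uuid4_bench.py" should reproduce the open/read/close pattern on /dev/urandom seen above:

    # uuid4_bench.py - rough benchmark of uuid4() cost; hypothetical helper,
    # not part of pulp. When uuid4() falls back to os.urandom(16), each call
    # is an open/read/close of /dev/urandom, which matches the strace summary.
    import time
    import uuid

    N = 100000  # illustrative call count, not measured on the affected system

    start = time.time()
    for _ in range(N):
        uuid.uuid4()
    elapsed = time.time() - start

    print("%d uuid4() calls in %.2f s (%.1f us/call)"
          % (N, elapsed, 1e6 * elapsed / N))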
I'm renaming this bug to a more general title to avoid confusion.
I also hit this when testing the 6.2 GA drops while trying to sync many "on-demand" (aka "lazy") repositories: many celery processes consume CPU, but progress is very slow. After two days of work, the machine was not able to sync much data. I can also help by providing a reproducer server.
The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
Closing as a duplicate based on https://bugzilla.redhat.com/show_bug.cgi?id=1337226#c6 .

*** This bug has been marked as a duplicate of bug 1337226 ***
*** Bug 1337226 has been marked as a duplicate of this bug. ***
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
https://pulp.plan.io/issues/1949#note-14 says: "Assuming the performance problem is in querying the database ..." Comments 5 and 6 in this bug suggest that's not where the problem is.
There was a regression that caused a specific, database-related performance problem during errata publishing. That was causing the symptom that repository sync and publish were taking too long. Beyond that, we want even more performance, but that is a separate issue from fixing the specific regression.
(In reply to Brian Bouterse from comment #21)
> but that is a separate issue from fixing the specific
> regression.

I fail to see how comments 0, 1, 2, 3, 5 and 6 of this bz (the problem originally reported) are unrelated to this bz.
This BZ is tracking the inclusion of a specific upstream fix (see the related Pulp bug). The original claim on this bug is so broad that it can't be attributed to just one root cause. Given that, I can see how unassociating this downstream bug from the upstream bug could be right. If we do that, we'll need a new bug to track the inclusion of the upstream fix.

If this bug does get disconnected from the upstream bug, then I'll suggest this bug be rewritten to be more specific or be closed. For instance, the title says "taking longer than they should"; I'm not sure what "should" means there.
(In reply to Brian Bouterse from comment #24)
> I'm not sure what "should" means there.

As per the comment above:
"So 74.15% (read+open+close+fcntl+gettid+getppid+getuid) of the time is spent reading /dev/urandom."

I reported the problem and provided the strace output and summary collected while the problem occurs. Could you please clarify what additional actions are required from me in order for this to progress?
I think the next step is a priorities discussion. I recommend you raise this issue with sat6 mgmt so that it can be prioritized correctly. I believe this is more a feature request than a fix for an isolated bug.
Many people observed a performance regression in 6.2 during the "sync" workflow. What katello and Satellite call a "sync", pulp calls a sync plus one or more publish tasks. We identified a major regression in publish that was causing the Satellite "sync" to take much longer than it did in the past. Here is a similar report that was considered a duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1337226

If you were specifically concerned about the high CPU usage during pulp's sync tasks only, it's possible this BZ was incorrectly associated with the regression others observed.

We identified that a particular part of the publish workflow had regressed dramatically in performance; it increased publish times in many cases by a factor of 10+. Brian fixed that issue, and pulp's publish performance is now back on par with prior expectations. 6.2 will likely perform better than 6.1 because the "nodes" feature is no longer in use, so there is no need to publish all of the node data. If there are other known performance regressions, it would be helpful to file new bug reports, and we will dig into them.

Looking specifically at the uuid generation that was observed: how do we know it's a problem? It does appear to be a bottleneck (likely competing with bandwidth to be the limiting factor during a sync), but that does not necessarily mean it is a problem. During sync, pulp is often saving dozens of documents to the database per second, and each of those needs at least one unique ID. That could easily explain all of the uuid generation. This may be an area where pulp could be improved, so thank you for helping us identify that bottleneck. But is it a regression? Do we have any other reason to believe it needs to be fixed? Or have we just identified the nature of the work that sync needs to perform?
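To make the "one unique ID per saved document" point concrete, here is a small illustrative sketch (the document shape, field names, helper name, and package count are assumptions for illustration, not pulp's actual models or code):

    # Illustrative only: shows how per-document unique IDs translate into one
    # uuid4() call (and, on the fallback path, one 16-byte read of system
    # randomness) per saved unit. make_unit_document() is a stand-in, not
    # pulp's real model code.
    import uuid

    def make_unit_document(nevra):
        # Each new content unit gets its own random _id, as a MongoDB-backed
        # model typically would.
        return {"_id": str(uuid.uuid4()), "nevra": nevra}

    # Syncing a repo with ~8000 packages therefore means ~8000 uuid4() calls
    # just for the unit documents, before any association records.
    docs = [make_unit_document("pkg-%d-1.0-1.noarch" % i) for i in range(8000)]
    print("generated %d documents, %d unique ids"
          % (len(docs), len(set(d["_id"] for d in docs))))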
I don't seem to be able to reproduce the problem described in the initial comments when using beta 2.
Julio, thank you for the testing and the information. We want to make sure this works well. I'm not 100% sure, but it's likely that beta 2 doesn't have the upstream fix cherry-picked yet. If not, a future beta could perform even better! Does anyone know if beta 2 contains the cherry-pick from upstream for issue #1949? Also, how can I answer that question on my own?
Which package(s) and version(s) do I need to check? Happy to re-test with newer packages to ensure that it still works fine with those.
The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug.
Verified in Satellite 6.2 Beta Snap 19. Sync times are well within reasonable margins.

Details                        Duration
New packages: 344 (434 MB)     2 minutes
New packages: 336 (478 MB)     2 minutes
New packages: 89 (1.19 GB)     3 minutes
New packages: 735 (2.37 GB)    18 minutes
New packages: 129 (3.18 GB)    2 minutes
New packages: 7983 (8.44 GB)   36 minutes
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1501
*** Bug 1335512 has been marked as a duplicate of this bug. ***