Bug 1320276

Summary: 'fedpkg build' high CPU consumption
Reporter: Jaroslav Škarvada <jskarvad>
Product: Fedora
Component: rpkg
Version: 23
Assignee: Dennis Gilmore <dennis>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bochecha, cqi, dennis, lsedlar, pbabinca, s
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Last Closed: 2016-08-22 08:33:48 UTC
Attachments: Proposed fix

Description Jaroslav Škarvada 2016-03-22 17:35:51 UTC
Description of problem:
'fedpkg build' is among the top five most CPU-hungry processes on my machine (sometimes the top one). Why does it use aggressive polling? I think the design should be improved; there is no need to utilize the CPU too much.

Version-Release number of selected component (if applicable):
fedpkg-1.21-2.fc23.noarch

How reproducible:
Always

Steps to Reproduce:
1. fedpkg build
2. Check top

Actual results:
fedpkg process on top places

Expected results:
fedpkg process on bottom places

Additional info:

Comment 1 Jaroslav Škarvada 2016-03-22 17:37:52 UTC
(In reply to Jaroslav Škarvada from comment #0)
> Why does it use aggressive polling?

Typo, the right question:

Why? Does it use aggressive polling?

Comment 2 Lubomír Sedlář 2016-03-23 07:30:22 UTC
Can you try running the build with koji directly to see if it has the same symptoms?

koji build f24 --scratch $(fedpkg giturl)

When watching tasks, it polls Koji for the status once per second. As far as I know, there is no other way to get the same information. I guess all we could do here is change the interval.
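The watch loop being discussed can be sketched roughly like this. This is a hypothetical simplification: `getTaskInfo` mirrors the real Koji XML-RPC call name, but the surrounding structure, the `poll_interval` parameter, and the terminal-state constants are illustrative, not the actual rpkg code.

```python
import time

def watch_tasks(session, task_ids, poll_interval=5):
    """Poll Koji until every watched task reaches a terminal state.

    Illustrative sketch: `session` stands in for a Koji client
    session, and states 2/3/5 (CLOSED/CANCELED/FAILED) are assumed
    to be the terminal task states.
    """
    pending = set(task_ids)
    while pending:
        for task_id in list(pending):
            info = session.getTaskInfo(task_id)
            if info and info.get("state") in (2, 3, 5):
                pending.discard(task_id)
        if pending:
            # One sleep per round; the interval is the only knob
            # if the server offers no push/subscribe mechanism.
            time.sleep(poll_interval)
```

With a 1-second interval this wakes up 5x more often than the koji CLI's 5-second default, which is the consistency change the later comments converge on.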

Comment 3 Jaroslav Škarvada 2016-03-23 16:55:11 UTC
I guess the code comes from the rpkg package, which, as far as the watch method goes, seems to be a copy&paste of the code from the koji client.

The koji client itself seems to consume a bit less CPU, but it still seems too much to me. Koji uses a 5-second polling interval, while rpkg uses 1 second. That alone shouldn't matter much (1 second is already fairly long), but I think the following is the real problem:

http://fpaste.org/344262/

Notice the number of select calls within less than 100 milliseconds; this is what is utilizing the CPU.

Comment 4 Jaroslav Škarvada 2016-03-23 17:32:32 UTC
Increasing the sleep delay to 5 seconds didn't help much:

18:28:53.190394 select(4, [3], [], [], {43200, 0}) = 1 (in [3], left {43199, 585421})
18:28:53.605273 read(3, "\27\3\3\1,", 5) = 5
18:28:53.605495 read(3, "e\210\v8\3101+yQR\234p\343[\366\311\213\276b\227\217\226V1z\356\371\34/\16r\7"..., 300) = 300
18:28:53.605877 select(4, [3], [], [], {43200, 0}) = 1 (in [3], left {43199, 999994})
18:28:53.606229 select(4, [3], [], [], {43200, 0}) = 1 (in [3], left {43199, 999995})
.....
18:28:53.630567 select(4, [3], [], [], {43200, 0}) = 1 (in [3], left {43199, 598114})
18:28:54.032746 read(3, "\27\3\3\1+", 5) = 5
18:28:54.032944 read(3, "e\210\v8\3101+{\274\177\251%Ze8\347l\255\201\231\34]\267\242\33-\360\276].4\352"..., 299) = 299
18:28:54.033262 select(4, [3], [], [], {43200, 0}) = 1 (in [3], left {43199, 999994})
18:28:54.033516 select(4, [3], [], [], {43200, 0}) = 1 (in [3], left {43199, 999996})
...

You can see it never sleeps longer than 0.4 seconds.

Comment 5 Jaroslav Škarvada 2016-03-23 17:46:37 UTC
The following hack of _watch_koji_tasks seems to improve the situation:

-            time.sleep(1)
+            clk = time.time()
+            while time.time() < clk + 5:
+                time.sleep(5)

Comment 6 Jaroslav Škarvada 2016-03-23 17:52:51 UTC
Or maybe better hack:
                clk = time.time()
                clkn = clk
                while clkn < clk + 5:
                    time.sleep(clk + 5 - clkn)
                    clkn = time.time()
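The intent of the hack above (if the sleep returns early, sleep again until the full interval has actually elapsed) can be written more robustly with the monotonic clock, which is immune to wall-clock jumps. This is a hedged sketch of the same idea, not the actual patch attached to this bug:

```python
import time

def sleep_at_least(interval):
    """Sleep for at least `interval` seconds, resuming after any
    early wakeup. Cleaned-up variant of the comment 6 hack, using
    time.monotonic() instead of time.time()."""
    deadline = time.monotonic() + interval
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(remaining)
```

Note that this only guarantees the minimum gap between polls; it does nothing about the select calls issued by the transport layer while a response is being read.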

Comment 7 Jaroslav Škarvada 2016-03-23 18:00:12 UTC
I retried on another machine and it seems that just:
  time.sleep(5)
should be enough.

Comment 8 Jaroslav Škarvada 2016-03-23 18:00:39 UTC
(In reply to Jaroslav Škarvada from comment #7)
> I retried on another machine and it seems that just:
>   time.sleep(5)
> should be enough.

and it is consistent with the koji CLI

Comment 9 Jaroslav Škarvada 2016-03-23 18:10:44 UTC
Created attachment 1139677 [details]
Proposed fix

Lower CPU utilization by increasing the sleep interval, making it consistent with the koji-cli defaults.

Comment 10 Jaroslav Škarvada 2016-03-23 18:12:02 UTC
There may be more to optimize (e.g. the number of select calls), but this should be a good start.

Comment 11 Mathieu Bridon 2016-03-23 18:29:05 UTC
I just tried running a scratch-build (didn't have anything to build), and fedpkg barely uses any CPU here.

Those select calls you mention, do they happen during the "time.sleep(1)" ? Or do they happen in between sleeps, when fedpkg is actually doing something?

Comment 12 Jaroslav Škarvada 2016-03-23 18:52:37 UTC
(In reply to Mathieu Bridon from comment #11)
> I just tried running a scratch-build (didn't have anything to build), and
> fedpkg barely uses any CPU here.
> 
> Those select calls you mention, do they happen during the "time.sleep(1)" ?
> Or do they happen in between sleeps, when fedpkg is actually doing something?

It seems they happen between sleeps. Looking at the strace, it's full of:

17:08:55.278088 select(4, [3], [], [], {43200, 0}) = 1 (in [3], left {43199, 999994})
17:08:55.278412 select(4, [3], [], [], {43200, 0}) = 1 (in [3], left {43199, 999994})

i.e. waiting for input on FD 3 with a 43200-second timeout; the "left {43199, 999994}" means select returned almost immediately, with nearly the whole timeout remaining. Is the timing so critical? Can't it wait longer and read the FD in batches?
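Batching along the lines suggested here, i.e. blocking in select once and then draining whatever is currently readable, might look like the sketch below. This is illustrative code on a plain socket; the real rpkg/koji traffic goes through an SSL layer, where partial TLS records can legitimately force extra select calls before a full record can be decrypted.

```python
import select
import socket

def read_available(sock, timeout=5.0, chunk=4096):
    """Block once in select, then drain everything currently
    readable in one batch instead of re-entering select per read.
    Illustrative sketch, not the actual rpkg/koji transport code."""
    ready, _, _ = select.select([sock], [], [], timeout)
    if not ready:
        return b""
    data = bytearray()
    sock.setblocking(False)
    try:
        while True:
            try:
                buf = sock.recv(chunk)
            except BlockingIOError:
                # Nothing more buffered right now; stop draining.
                break
            if not buf:
                # Peer closed the connection.
                break
            data += buf
    finally:
        sock.setblocking(True)
    return bytes(data)
```

Whether this is applicable depends on where the tight select loop lives; if it is inside the SSL or XML-RPC library rather than rpkg itself, the fix belongs there.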

Comment 13 cqi 2016-08-22 08:33:48 UTC
As of fedpkg-1.24-3.fc24.noarch, this no longer happens. Thanks for your effort digging into this issue. Closing this bug for now; feel free to reopen if the performance issue occurs again.