Bug 1505207

Summary: job page auto-fetch keeps sending requests even when they are taking a long time, DOS'ing Beaker
Product: [Retired] Beaker Reporter: Dan Callaghan <dcallagh>
Component: generalAssignee: Jacob McKenzie <jmckenzi>
Status: CLOSED CURRENTRELEASE QA Contact: Roman Joost <rjoost>
Severity: unspecified Docs Contact:
Priority: urgent    
Version: 24CC: dcallagh, jmckenzi, mjia, rjoost
Target Milestone: 25.5Keywords: Patch
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-23 06:40:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Callaghan 2017-10-23 00:44:41 UTC
Version-Release number of selected component (if applicable):
24.4

How reproducible:
not very

Steps to Reproduce:
1. Open the job page for a running job containing a lot of recipes (for example, a Beaker self-test job)
Might only be applicable on beaker-devel, which is very memory constrained and often performs poorly.
2. Watch the Firefox network requests tool

Actual results:
The page sends an AJAX request to GET /jobs/<id> every 20 seconds.
The requests take longer than 20 seconds to come back
(in today's case they were taking 300 seconds because beaker-devel was struggling).
But the page keeps sending a new one every 20 seconds, so you end up with 20 in-flight requests all waiting.
Effectively this DOS'es Beaker.

Expected results:
Should not send any more requests while the first one is still waiting to come back.
In fact the 20 seconds between requests should really be 20 seconds *after* the request completes, not every 20 seconds, to be kind to the server.

Additional info:
Maybe need to check if we have a timeout on the requests. At some point we would want the request to fail (so we can try a new one) rather than the page get stuck waiting forever for the AJAX request which isn't going to come back.

Comment 2 Roman Joost 2018-07-04 00:51:51 UTC
I've tested this on beaker-devel by firing off a new self test: https://beaker-devel.app.eng.bos.redhat.com/jobs/14659. The old access logs were firing requests more rapidly:

next request          last request         delta (seconds)
2018-06-20 06:47:50 - 2018-06-20 06:47:32 = 18
2018-06-20 06:48:34 - 2018-06-20 06:47:50 = 44
2018-06-20 06:49:05 - 2018-06-20 06:48:34 = 31
2018-06-20 06:49:16 - 2018-06-20 06:49:05 = 11
2018-06-20 06:50:07 - 2018-06-20 06:49:16 = 51
2018-06-20 06:50:38 - 2018-06-20 06:50:07 = 31
2018-06-20 06:50:52 - 2018-06-20 06:50:38 = 14
2018-06-20 06:51:40 - 2018-06-20 06:50:52 = 48
2018-06-20 06:52:11 - 2018-06-20 06:51:40 = 31
2018-06-20 06:52:22 - 2018-06-20 06:52:11 = 11
2018-06-20 06:53:13 - 2018-06-20 06:52:22 = 51
2018-06-20 06:53:44 - 2018-06-20 06:53:13 = 31
2018-06-20 06:54:05 - 2018-06-20 06:53:44 = 21
2018-06-20 06:54:46 - 2018-06-20 06:54:05 = 41
2018-06-20 06:55:17 - 2018-06-20 06:54:46 = 31
2018-06-20 06:55:30 - 2018-06-20 06:55:17 = 13
2018-06-20 06:56:19 - 2018-06-20 06:55:30 = 49
2018-06-20 06:57:01 - 2018-06-20 06:56:19 = 42

The new implementation has a more even distribution:

2018-07-03 06:23:32 - 2018-07-03 06:22:24 = 68
2018-07-03 06:24:32 - 2018-07-03 06:23:32 = 60
2018-07-03 06:25:57 - 2018-07-03 06:24:32 = 85
2018-07-03 06:27:04 - 2018-07-03 06:25:57 = 67
2018-07-03 06:28:08 - 2018-07-03 06:27:04 = 64
2018-07-03 06:29:16 - 2018-07-03 06:28:08 = 68
2018-07-03 06:30:31 - 2018-07-03 06:29:16 = 75
2018-07-03 06:31:45 - 2018-07-03 06:30:31 = 74
2018-07-03 06:32:54 - 2018-07-03 06:31:45 = 69
2018-07-03 06:34:19 - 2018-07-03 06:32:54 = 85
2018-07-03 06:35:52 - 2018-07-03 06:34:19 = 93
2018-07-03 06:37:04 - 2018-07-03 06:35:52 = 72
2018-07-03 06:38:17 - 2018-07-03 06:37:04 = 73
2018-07-03 06:39:40 - 2018-07-03 06:38:17 = 83
2018-07-03 06:41:12 - 2018-07-03 06:39:40 = 92
2018-07-03 06:42:45 - 2018-07-03 06:41:12 = 93
2018-07-03 06:44:16 - 2018-07-03 06:42:45 = 91
2018-07-03 06:45:33 - 2018-07-03 06:44:16 = 77
2018-07-03 06:47:20 - 2018-07-03 06:45:33 = 107

That should reduce the load. Unfortunately, we're still pulling 1M with each request on JSON data, which doesn't solve the issue that there is still a potential risk of putting too much load on the server, but at least one contributor is fixed.

Comment 6 Dan Callaghan 2018-07-06 03:34:35 UTC
Just noticed the recipe page has exactly the same problem. Should we file a separate bug for that?

The impact of the recipe page auto-fetching should be much lower than the job page, since an individual recipe is usually pretty small, unless it has an unusually large number of tasks or results.

Comment 7 Roman Joost 2018-07-23 06:40:51 UTC
Released with Beaker 25.5: https://beaker-project.org/docs/whats-new/release-25.html#beaker-25-5