Bug 1293007

Summary:	[RFE] enforce a server-side limit on number of results in a recipe
Product:	[Retired] Beaker	Reporter:	Dan Callaghan <dcallagh>
Component:	general	Assignee:	Dan Callaghan <dcallagh>
Status:	CLOSED CURRENTRELEASE	QA Contact:	tools-bugs <tools-bugs>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	21	CC:	dcallagh, dowang, mjia, rjoost
Target Milestone:	23.0	Keywords:	FutureFeature, Patch
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:
Clones:	1358557 (view as bug list)		Environment:
Last Closed:	2016-07-07 23:12:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Dan Callaghan 2015-12-19 02:12:37 UTC

If a Beaker task has a bug causing it to go into an infinite loop, it can report a very large number of results before it eventually dies, particularly if the task is also extending its watchdog in the infinite loop.

Very large recipes cause problems in many parts of Beaker. The web UI generally cannot render results (nor serve up results XML) for the recipe because loading the entire object graph into memory exceeds the 1.5GB heap limit.

Another problem we encountered today is that beaker-transfer calls recipe.files() for the large recipe, and to answer the request the server has to make N db queries (N = number of tasks and results) which takes longer than the 2 minute request timeout in beaker-transfer. So beaker-transfer never progresses and the server bogs down answering the same large request continually.

Mistakes do happen so we can't rely on task authors not triggering infinite loops. Beaker should prevent reporting an excessive number of results in a single recipe -- perhaps with some additional heuristics to detect infinite loops (in this case, all the results had the same name), or maybe some kind of exponential rate limiting.

Not sure what a reasonable limit would be, but a recipe with 31457 total results (63543 total result logs) was enough to bring our production Beaker environment to its knees.

Comment 1 Dan Callaghan 2015-12-19 02:31:15 UTC

We hit this problem once before already, and it seems I already filed a bug about the issues with beaker-transfer specifically... however this seems like a better general solution which we should implement too.

Comment 2 Dan Callaghan 2016-01-06 06:43:34 UTC

It turned out in this particular case, the large number of results were *not* due to an accidental infinite loop in the task, but were in fact intentional. So I'm no longer totally sure that this is a worthwhile idea, and if we do enforce a limit what number it should be. Probably more important that we fix up the issues preventing Beaker from properly handling such a large recipe in the first place (bug 1293011, bug 1240809).

Comment 3 Dan Callaghan 2016-01-06 07:52:13 UTC

I used the following query (in Teiid against our production database) to get an idea of the upper end of the distribution of results-count-per-recipe. (The constraint on recipe.id > 1600000 is just to get the query to complete in a reasonable amount of time without being killed.)

SELECT recipe.id, COUNT(recipe_task_result.id) result_count FROM Beaker.recipe INNER JOIN recipe_task ON recipe_task.recipe_id = recipe.id INNER JOIN recipe_task_result ON recipe_task_result.recipe_task_id = recipe_task.id WHERE recipe.id > 1600000 GROUP BY recipe.id ORDER BY result_count DESC LIMIT 20;
   id    | result_count 
---------+--------------
 2373965 |        31779
 2373967 |        31457
 2044046 |        27067
 2030795 |        21781
 1962366 |        16639
 1970679 |        14669
 1622138 |        14002
 2197682 |        11539
 2066693 |        10655
 2057292 |         9763
 2373966 |         9111
 2104047 |         7995
 2104046 |         7938
 2060648 |         7429
 2070229 |         6659
 1966377 |         6408
 1617373 |         6283
 2011777 |         6081
 2060453 |         5954
 1822492 |         5425
(20 rows)

And it turns out there are only 23 recent recipes with more than 5000 results. Of those, 10 are deleted and 13 are not. Two of them were deleted by us because of the problems they caused with beaker-transfer in December and July.

Looking at the recipes which aren't deleted:

SELECT recipe.id, COUNT(recipe_task_result.id) result_count FROM Beaker.recipe INNER JOIN recipe_task ON recipe_task.recipe_id = recipe.id INNER JOIN recipe_task_result ON recipe_task_result.recipe_task_id = recipe_task.id INNER JOIN recipe_set ON recipe.recipe_set_id = recipe_set.id INNER JOIN job ON recipe_set.job_id = job.id WHERE recipe.id > 1600000 AND job.deleted IS NULL GROUP BY recipe.id HAVING COUNT(recipe_task_result.id) > 5000 ORDER BY result_count DESC;
   id    | result_count 
---------+--------------
 2044046 |        27067
 2030795 |        21781
 1970679 |        14669
 2197682 |        11539
 2066693 |        10655
 2057292 |         9763
 2104047 |         7995
 2104046 |         7938
 2070229 |         6659
 2011777 |         6081
 2060453 |         5954
 2015907 |         5250
 2065544 |         5079
(13 rows)

it seems the only reason they have survived without crippling beaker-transfer is because many of their results have no logs. I'm not sure how that's possible since rhts-report-result always sends a test log, but I guess it must be some bug in the system under test (e.g. segfaulting Python) which also caused the infinite loop of results reporting.

The first one in the above list which doesn't fit that pattern is recipe 2104047 with 7995 results. It appears to be a performance test run which intentionally loops many times. Probably the only reason it didn't kill beaker-transfer is because there is only one log for most of the results, not two, since the avc_log is absent.

Comment 4 Dan Callaghan 2016-01-06 07:55:38 UTC

So it seems like a generous limit on number of results per recipe would be 10,000. This would allow all existing use cases we have seen. But that limit is still high enough to kill beaker-transfer and beaker-log-delete as they stand now (assuming 2 logs per result) so enforcing it seems a bit pointless.

A more conservative limit would be 5000 which is enough for almost every recipe we have seen recently on our production Beaker, *and* is low enough that beaker-transfer and beaker-log-delete in their current form can actually handle it.

Another option might be to enforce a maximum number of logs per recipe, rather than results per recipe. Logs are generally costlier for Beaker to store and handle so that might be a nicer approach.

Comment 5 Dan Callaghan 2016-01-06 07:56:53 UTC

Either way the limit would need to be configurable so that we can easily tweak it if we get the number wrong initially.

Comment 6 Dan Callaghan 2016-01-07 01:29:08 UTC

http://gerrit.beaker-project.org/4572

I went with (configurable) limits of 7500 results per recipe and 7500 result logs per recipe. That should permit all legitimate testing (based on the maximum numbers in our production db so far) while still avoiding recipes that are large enough to start crashing things as we have seen in bug 1240809 and elsewhere.

Comment 8 Dan Callaghan 2016-01-14 06:30:02 UTC

We have a couple of bugs related to handling very large recipes and it probably makes sense to verify them all in one go, by submitting a job on beaker-devel which produces a similarly large recipe.

I will think up a good way to craft one and post it here.

Comment 11 Dan Callaghan 2016-01-19 00:30:04 UTC

Based on those test jobs, we can see that the results limiting is working as expected. However the log limiting is effectively defeated by beah (I guess in an error handling path somewhere) which uploads the logs against the task instead of the result when the result upload fails.

So for completeness we probably need to apply the limit to the total number of task logs *and* result logs.

Comment 12 Dan Callaghan 2016-01-19 00:45:04 UTC

The other thing is that Beaker should probably record a warning against the task when it hits the limits, otherwise it could just show up as fully Passing with no hint that some results are missing.

Comment 13 Dan Callaghan 2016-01-19 01:42:07 UTC

http://gerrit.beaker-project.org/4598 enforce limit on task logs as well as result logs
http://gerrit.beaker-project.org/4599 record a warning result if limits are reached

Comment 16 Dan Callaghan 2016-07-07 23:12:25 UTC

Beaker 23.0 has been released.