1086321 – RFE: Add more aggressive removal option for retrace-server tasks with 'status == STATUS_FAIL' after X days


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora EPEL
Classification:	Fedora
Component:	retrace-server
Sub Component:
Version:	el6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Assignee:	Dave Wysochanski
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-04-10 15:27 UTC by Dave Wysochanski
Modified:	2014-09-19 10:50 UTC
CC List:	4 users
Fixed In Version:	retrace-server-1.12-2.el6
Clone Of:
Environment:
Last Closed:	2014-08-15 18:57:59 UTC
Type:	Bug
Embargoed:

Attachments
Patch to add a new config variable and section to allow for more aggressive deletes of failed tasks (2.43 KB, text/plain) 2014-04-17 14:14 UTC, Dave Wysochanski

Description Dave Wysochanski 2014-04-10 15:27:02 UTC

Description of problem:
Today retrace-server has a cleanup task which removes 'old' tasks based on the mtime of the task directory. This is good, but it does not cover the situation where tasks fail.

Tasks that fail are often submitted multiple times, so their space usage can be significant. These failed tasks are also not useful at all; they are just dead space. Our system is configured to delete tasks after 180 days (4320 hours):
$ grep ^Delete /etc/retrace-server.conf
DeleteTaskAfter = 4320
If we removed failed tasks (tasks whose 'status' file contains 7) more aggressively, after 30 days, we would save around 2 TB out of 14 TB. This is a significant saving and worth considering a separate cleanup setting.

The recommendation is to add a new 'DeleteNonSuccessTaskAfter' setting that removes tasks whose status is not STATUS_SUCCESS (6) once they are older than the timeframe given by the setting. One note of warning: there probably needs to be a minimum value here in case someone sets this too low. For example, we probably would not want someone to be able to set this value to 1, since that may delete long-running tasks that were just submitted (tasks taking an hour to process). Then again, if a task takes an hour to process, is there another bug that needs to be filed? Maybe 24 - 48 is a good 'minimum', just in case something goes wrong with the status value and the tasks are really usable.
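The minimum-value idea above could be sketched as follows. This is only an illustration: the function name, the constant, and the choice of 24 hours as the floor are assumptions made for the example, and treating a value of 0 or less as "disabled" is likewise an assumption.

```python
# Hypothetical floor, in hours. The report only suggests "maybe 24 - 48".
MIN_DELETE_NON_SUCCESS_AFTER = 24


def effective_delete_after(configured_hours, minimum=MIN_DELETE_NON_SUCCESS_AFTER):
    """Return the age threshold (in hours) the cleanup job should use.

    A value of 0 or less is treated as "feature disabled" (an assumption).
    Any other value is raised to the floor so that a task which is still
    being processed is not deleted by an overly aggressive setting.
    """
    if configured_hours <= 0:
        return 0
    return max(configured_hours, minimum)


print(effective_delete_after(1))    # 24: too aggressive, raised to the floor
print(effective_delete_after(720))  # 720: 30 days, used as configured
print(effective_delete_after(0))    # 0: disabled
```

Clamping silently is only one option; refusing to start with an error message would make the misconfiguration more visible.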

In short, I don't think there is any use in keeping tasks with status != STATUS_SUCCESS once we're sure we've had enough time to process them. Could there be exceptions to this rule? I'm not sure, but they may not be worth worrying about since they imply other bugs. One such exception is if the 'status' value is unreliable: we could end up deleting tasks which are really usable.

Version-Release number of selected component (if applicable):
retrace-server-1.11-3.el6.noarch

How reproducible:
Fairly noticeable on our production system.

Steps to Reproduce:
Search for tasks with 'status != 6'. As far as I know, these tasks are not usable vmcores.
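A minimal sketch of such a search, assuming each task is a directory under the configured SaveDir and holds a 'status' file containing a single integer (as described above). The helper names are made up for this example and this is not the script mentioned under Additional info. It is written for Python 3.

```python
import os

STATUS_SUCCESS = 6  # value given in this report


def read_status(task_dir):
    """Return the integer stored in <task_dir>/status, or None if unreadable."""
    try:
        with open(os.path.join(task_dir, "status")) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def dir_size(path):
    """Sum the sizes of all files below path, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished while scanning
    return total


def find_non_success_tasks(save_dir):
    """Yield (task_dir, status, size_in_bytes) for tasks with status != 6.

    Tasks whose status cannot be read are skipped rather than reported,
    since an unreadable status says nothing about success or failure.
    """
    for name in sorted(os.listdir(save_dir)):
        task_dir = os.path.join(save_dir, name)
        if not os.path.isdir(task_dir):
            continue
        status = read_status(task_dir)
        if status is not None and status != STATUS_SUCCESS:
            yield task_dir, status, dir_size(task_dir)
```

Summing the third element of each result gives an estimate of the space a more aggressive cleanup would reclaim.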

Actual results:
Failed tasks only get deleted with the normal cleanup processing.

Expected results:
Failed tasks should have more aggressive cleanup.

Additional info:
I think Brian R. has some scripts which look for failed tasks. We will probably implement some sort of script cleanup in the interim until this gets fixed, unless it is easy and can be addressed shortly.

Comment 3 Dave Wysochanski 2014-04-17 13:26:44 UTC

Michal, assuming you're OK with the design of a second config variable, I think we just need a section in retrace-server-cleanup similar to the existing one for DeleteTaskAfter, something like the below.

To be conservative we could just go with STATUS_FAILED, but I'm not sure. If we go with the 'non-success' idea, we may need to check for a minimum value because of longer-running tasks which may still be in progress. We would not want this addition to the cleanup job to remove tasks which may still be in progress.

diff -Nurp /usr/bin/retrace-server-cleanup /usr/bin/retrace-server-cleanup.bz1086321
--- /usr/bin/retrace-server-cleanup     2014-03-25 09:04:07.000000000 -0400
+++ /usr/bin/retrace-server-cleanup.bz1086321   2014-04-17 09:16:56.019129321 -0400
@@ -123,3 +123,21 @@ if __name__ == "__main__":
                 if task.get_age() >= CONFIG["DeleteTaskAfter"]:
                     log.write("Deleting old task %s\n" % filename)
                     task.remove()
+
+        if CONFIG["DeleteFailedTaskAfter"] > 0:
+            # clean up old failed tasks
+            try:
+                files = os.listdir(CONFIG["SaveDir"])
+            except OSError, ex:
+                files = []
+                log.write("Error listing task directory: %s\n" % ex)
+
+            for filename in files:
+                try:
+                    task = RetraceTask(filename)
+                except:
+                    continue
+
+                if task.get_age() >= CONFIG["DeleteNonSuccessTaskAfter"] and task.get_status == STATUS_FAILED:
+                    log.write("Deleting old non-success task %s\n" % filename)
+                    task.remove()
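Two things stand out in the draft diff above: it reads two different config keys (DeleteFailedTaskAfter and DeleteNonSuccessTaskAfter), and it compares task.get_status, without call parentheses, to the status constant. The sketch below shows the intended selection logic with those two points addressed. It is an illustration only and not the attached patch: FakeTask is a hypothetical stand-in for RetraceTask, the constant name STATUS_FAIL is taken from the bug title, its value 7 from the description, and get_status is assumed to be a method like get_age.

```python
STATUS_FAIL = 7  # per the description, a failed task's 'status' file contains 7


def select_failed_tasks(tasks, delete_failed_after):
    """Return the tasks that a failed-task cleanup pass would remove.

    tasks is any iterable of objects providing get_age() (in hours) and
    get_status(). A delete_failed_after of 0 or less disables the pass.
    """
    if delete_failed_after <= 0:
        return []
    return [
        task for task in tasks
        if task.get_status() == STATUS_FAIL  # method call, not a bare attribute
        and task.get_age() >= delete_failed_after
    ]


class FakeTask(object):
    """Hypothetical stand-in for RetraceTask, used only in this example."""

    def __init__(self, taskid, age, status):
        self.taskid = taskid
        self._age = age
        self._status = status

    def get_age(self):
        return self._age

    def get_status(self):
        return self._status


tasks = [FakeTask(1, 800, 7), FakeTask(2, 800, 6), FakeTask(3, 10, 7)]
print([t.taskid for t in select_failed_tasks(tasks, 720)])  # [1]
```

Only task 1 is selected: task 2 succeeded and task 3 is younger than the threshold. Keeping the selection separate from task.remove() also makes the logic easy to test without touching real task directories.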

Comment 4 Dave Wysochanski 2014-04-17 14:06:50 UTC

(In reply to Dave Wysochanski from comment #3)
The patch in comment #3 does not even work, but I will attach a cleaned-up, tested patch soon.

Comment 5 Dave Wysochanski 2014-04-17 14:14:45 UTC

Created attachment 887198
Patch to add a new config variable and section to allow for more aggressive deletes of failed tasks

Comment 6 Dave Wysochanski 2014-04-17 14:23:24 UTC

Test case / outline:
1. On the retrace-server system, run a script to find failed tasks (see attachment). The system should have at least one failed task whose age is below the value of DeleteTaskAfter, so that the normal cleanup would not remove it.
2. Add DeleteFailedTaskAfter to /etc/retrace-server.conf with a value just below the value of DeleteTaskAfter, chosen so that the failed task is old enough to be removed.
3. Run the cleanup script, look at the cleanup log, and make sure you see something like the below. On my system the log is /var/log/retrace-server/cleanup.log
# tail /var/log/retrace-server/cleanup.log
...
[2014-04-17 10:01:13] Running cleanup
Deleting old failed task 217408639

Comment 8 Dave Wysochanski 2014-05-06 17:10:18 UTC

Looks fixed in retrace-server-1.11-4.el6.noarch

Comment 10 Fedora Update System 2014-07-31 11:52:43 UTC

retrace-server-1.12-2.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/retrace-server-1.12-2.el6

Comment 11 Fedora Update System 2014-08-15 18:57:59 UTC

retrace-server-1.12-2.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.
