Description of problem:
After working on the cleanup job to remove tasks with failed status in https://bugzilla.redhat.com/show_bug.cgi?id=1086321 I started thinking about whether we should add some minimum checks, or clarify the language in the retrace-server.conf file, for the cleanup/archive options. We also had an incident where we lost vmcores on our retrace-server system due to an error, but it was not related to the cleanup job. However, the incident got me thinking about what might lead to a 'disaster' scenario and whether there is anything we can improve to lessen such risks.

Here are a couple of scenarios I came up with, though I'm not sure how likely they are:
1. Someone misreads the units of 'DeleteTaskAfter' and sets it to '30', thinking it means '30 days'. The value is in hours, so vmcores are removed after little more than a day, and a lot of vmcores are lost.
2. Someone confuses 'DeleteTaskAfter' with 'DeleteFailedTaskAfter' and swaps the values, leading to a lot of vmcores being removed. Normally, DeleteFailedTaskAfter < DeleteTaskAfter, but I didn't put in any check to make sure this is the case.

One thing Brian R suggested is a line such as this in the retrace-server.conf file, as a comment above 'DeleteTaskAfter':
# WARNING: BE CAREFUL WHEN CHANGING THIS PARAMETER AS IT MAY LEAD TO DELETING MORE THAN DESIRED

We could probably add some checks, such as a minimum value for 'DeleteTaskAfter' and whether DeleteFailedTaskAfter < DeleteTaskAfter, but then we need to decide what to do when a check fails: do we refuse to start retrace-server, override with a 'minimum' value, print a warning somewhere, etc.?

Version-Release number of selected component (if applicable):
retrace-server-1.11-4.el6.noarch

How reproducible:
Depends on the likelihood of misconfiguration.

Steps to Reproduce:
Set DeleteTaskAfter to a low value.

Actual results:
Premature vmcore losses

Expected results:

Additional info:
Brian R or other admins / users of retrace-server may be best placed to comment on the likelihood of misconfiguration. I don't want to overcomplicate things, but if we can easily lessen the risk of data loss we should do it.
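To make the proposed checks concrete, here is a rough sketch in Python of what they might look like. It is only an illustration, not retrace-server code: the function name, the 48-hour floor, and the choice to return warnings rather than abort are all assumptions. It also assumes values are in hours and that a value <= 0 disables the option.

```python
# Hypothetical floor for DeleteTaskAfter, in hours. The number is an
# assumption chosen for illustration, not a value from retrace-server.
MIN_DELETE_TASK_AFTER_HOURS = 48


def check_cleanup_config(delete_task_after, delete_failed_task_after,
                         min_hours=MIN_DELETE_TASK_AFTER_HOURS):
    """Return a list of warning strings for suspicious cleanup settings.

    Both values are in hours; a value <= 0 is treated as "disabled".
    """
    problems = []

    # Scenario 1: a small value that was probably meant as days.
    if 0 < delete_task_after < min_hours:
        problems.append(
            "DeleteTaskAfter = %d hours is below the minimum of %d hours"
            % (delete_task_after, min_hours))

    # Scenario 2: the two options look swapped. Only meaningful when
    # both options are enabled.
    if (delete_task_after > 0 and delete_failed_task_after > 0
            and delete_failed_task_after >= delete_task_after):
        problems.append(
            "DeleteFailedTaskAfter (%d) is not lower than DeleteTaskAfter (%d)"
            % (delete_failed_task_after, delete_task_after))

    return problems


if __name__ == "__main__":
    # The "30 days" mistake from scenario 1.
    for message in check_cleanup_config(30, 0):
        print("WARNING: " + message)
```

Returning the problems as a list leaves the open question (refuse to start, override the value, or just log a warning) to the caller.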
So far I've extended the comments. Do you think this is sufficient or do you want to implement some checking logic anyway?

# Delete old tasks after (hours); <= 0 means never
# This is mutually exclusive with ArchiveTaskAfter (see below)
# The one that occurs first removes the task from the system
# In case DeleteTaskAfter = ArchiveTaskAfter, archiving executes first
DeleteTaskAfter = 0

# Delete old failed tasks after (hours); <= 0 means never
# This is useful for cleanup of failed tasks before the standard
# mechanisms do (DeleteTaskAfter or ArchiveTaskAfter)
# In case DeleteFailedTaskAfter > DeleteTaskAfter
# or DeleteFailedTaskAfter > ArchiveTaskAfter, this option does nothing
DeleteFailedTaskAfter = 0

# Archive old task after (hours); <= 0 means never
# This is mutually exclusive with DeleteTaskAfter (see above)
# The one that occurs first removes the task from the system
# In case DeleteTaskAfter = ArchiveTaskAfter, archiving executes first
ArchiveTaskAfter = 0
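For anyone trying to reason about how these three options interact, the rules in the comments above can be written down as a small Python sketch. This mirrors only what the comments say; the function names are made up for illustration and this is not the actual rs-cleanup logic. One assumption is flagged in the code: the comments do not say what happens when DeleteFailedTaskAfter equals one of the other limits, so the sketch treats that case as still effective.

```python
def first_cleanup_action(delete_after, archive_after):
    """Return 'archive', 'delete', or None for a regular task.

    Values are in hours; <= 0 means never. Whichever limit is reached
    first removes the task, and archiving wins a tie.
    """
    delete_on = delete_after > 0
    archive_on = archive_after > 0
    if not delete_on and not archive_on:
        return None
    if not delete_on:
        return "archive"
    if not archive_on:
        return "delete"
    return "archive" if archive_after <= delete_after else "delete"


def failed_cleanup_is_effective(delete_failed_after, delete_after,
                                archive_after):
    """Return True if DeleteFailedTaskAfter can ever take effect.

    It does nothing when it is greater than an enabled DeleteTaskAfter
    or ArchiveTaskAfter. Assumption: an equal value still counts as
    effective, since the comments do not cover that case.
    """
    if delete_failed_after <= 0:
        return False
    enabled_limits = [v for v in (delete_after, archive_after) if v > 0]
    return all(delete_failed_after <= v for v in enabled_limits)


if __name__ == "__main__":
    print(first_cleanup_action(720, 720))
    print(failed_cleanup_is_effective(1000, 720, 0))
```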
Note the clarification of cleanup options is upstream, but I don't think any extra checks have been implemented yet.

commit 42db0647a509f31c035dad596ddd373ee8b49923
Author: Michal Toman <mtoman>
Date:   Tue Jun 10 11:31:47 2014 +0200

    retrace-server.conf: clarify cleanup options

    Signed-off-by: Michal Toman <mtoman>
Checks added

commit f4cffef491130363a19bc372fba1d8003db105a1
Author: Michal Toman <mtoman>
Date:   Wed Jul 30 12:40:59 2014 +0200

    rs-cleanup: add config sanity checks

    Signed-off-by: Michal Toman <mtoman>
retrace-server-1.12-2.el6 has been submitted as an update for Fedora EPEL 6. https://admin.fedoraproject.org/updates/retrace-server-1.12-2.el6
Package retrace-server-1.12-2.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing retrace-server-1.12-2.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2014-2089/retrace-server-1.12-2.el6
then log in and leave karma (feedback).
retrace-server-1.12-2.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.