Bug 1095349

Summary: Lessen risk of premature vmcore removal by clarifying comments in retrace-server.conf and possibly adding checks to avoid misconfiguration of cleanup job
Product: [Fedora] Fedora EPEL
Component: retrace-server
Version: el6
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Dave Wysochanski <dwysocha>
Assignee: Michal Toman <mtoman>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: brhatiga, dwysocha, mtoman, rvokal
Hardware: Unspecified
OS: Unspecified
Fixed In Version: retrace-server-1.12-2.el6
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-08-15 18:57:53 UTC

Description Dave Wysochanski 2014-05-07 14:00:59 UTC
Description of problem:
After working on the cleanup job to remove tasks with failed status in https://bugzilla.redhat.com/show_bug.cgi?id=1086321, I started thinking about whether we should add some minimum checks or clarify the language in the retrace-server.conf file about the cleanup/archive options.  We also had an incident where we lost vmcores on our retrace-server system due to an error, though it was not related to the cleanup job.  However, the incident got me thinking about what might lead to a 'disaster' scenario and whether there is anything we can improve to lessen such risks.

Here are a couple of scenarios I came up with, though I'm not sure how likely they are:
1. Someone misreads the units in 'DeleteTaskAfter' and sets it to '30' thinking it is '30 days'.  Instead, tasks are deleted after 30 hours (just over a day), leading to a lot of vmcores removed.
2. Someone confuses 'DeleteTaskAfter' with 'DeleteFailedTaskAfter' and reverses the values, leading to a lot of vmcores removed.  Normally, DeleteFailedTaskAfter < DeleteTaskAfter, but I didn't put in any check to make sure this was the case.

One thing Brian R suggested is a comment line such as this in the retrace-server.conf file, placed above 'DeleteTaskAfter':
# WARNING:  BE CAREFUL WHEN CHANGING THIS PARAMETER AS IT MAY LEAD TO DELETING MORE THAN DESIRED

We could probably add some checks, such as a minimum value for 'DeleteTaskAfter', and whether DeleteFailedTaskAfter < DeleteTaskAfter, but then we need to decide what to do about it - do we not start retrace-server, override with a 'minimum' value, print a warning somewhere, etc?
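
To make the two proposed checks concrete, here is a minimal Python sketch. It is not the retrace-server implementation: the function name check_cleanup_config and the 72-hour minimum are hypothetical, chosen only for illustration, and the sketch assumes both options are in hours with a value <= 0 meaning "never". It only collects warnings and leaves open the question of whether to refuse to start, override the value, or just log.

```python
# Hypothetical threshold in hours; not a value taken from retrace-server.
MIN_DELETE_HOURS = 72


def check_cleanup_config(delete_after, delete_failed_after,
                         min_delete_hours=MIN_DELETE_HOURS):
    """Return a list of warnings for suspicious cleanup settings.

    Both values are assumed to be in hours; <= 0 is treated as "never".
    """
    problems = []

    # Check 1: DeleteTaskAfter is enabled but very small,
    # e.g. "30" entered with days in mind.
    if 0 < delete_after < min_delete_hours:
        problems.append(
            "DeleteTaskAfter = %d hours is below the minimum of %d hours"
            % (delete_after, min_delete_hours))

    # Check 2: the two options look swapped. Normally
    # DeleteFailedTaskAfter < DeleteTaskAfter when both are enabled.
    if delete_after > 0 and delete_failed_after >= delete_after:
        problems.append(
            "DeleteFailedTaskAfter = %d is not smaller than DeleteTaskAfter = %d"
            % (delete_failed_after, delete_after))

    return problems


if __name__ == "__main__":
    for message in check_cleanup_config(30, 48):
        print("WARNING: " + message)
```

With these assumed defaults, the '30' from scenario 1 triggers the first check, and swapped values such as DeleteTaskAfter = 168 with DeleteFailedTaskAfter = 720 trigger the second.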


Version-Release number of selected component (if applicable):
retrace-server-1.11-4.el6.noarch


How reproducible:
Depends on the likelihood of misconfiguration.


Steps to Reproduce:
Set DeleteTaskAfter to a low value.


Actual results:
Premature vmcore losses


Expected results:


Additional info:
Brian R or other admins / users of retrace-server may be best placed to comment on the likelihood of misconfiguration.  I don't want to overcomplicate things, but if we can easily lessen the risk of data loss, we should do it.

Comment 2 Michal Toman 2014-06-10 09:35:58 UTC
So far I've extended the comments. Do you think this is sufficient or do you want to implement some checking logic anyway?


# Delete old tasks after (hours); <= 0 means never
# This is mutually exclusive with ArchiveTaskAfter (see below)
# The one that occurs first removes the task from the system
# In case DeleteTaskAfter = ArchiveTaskAfter, archiving executes first
DeleteTaskAfter = 0

# Delete old failed tasks after (hours); <= 0 means never
# This is useful for cleanup of failed tasks before the standard
# mechanisms do (DeleteTaskAfter or ArchiveTaskAfter)
# In case DeleteFailedTaskAfter > DeleteTaskAfter
# or DeleteFailedTaskAfter > ArchiveTaskAfter, this option does nothing
DeleteFailedTaskAfter = 0

# Archive old task after (hours); <= 0 means never
# This is mutually exclusive with DeleteTaskAfter (see above)
# The one that occurs first removes the task from the system
# In case DeleteTaskAfter = ArchiveTaskAfter, archiving executes first
ArchiveTaskAfter = 0
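
The interaction between the three options described in these comments can be sketched in Python as follows. This is only an illustration of the documented semantics, not the rs-cleanup code: the function name first_cleanup_action and the action labels are hypothetical, and the handling of a tie between DeleteFailedTaskAfter and the other two options is an assumption, since the comments only cover the "greater than" case.

```python
def first_cleanup_action(delete_after, archive_after,
                         delete_failed_after=0, failed=False):
    """Return (action, hours) for the option that removes a task first.

    All values are in hours; <= 0 means the option is disabled.
    Returns None when no option is enabled, i.e. the task is kept.
    """
    # Each candidate is (hours, tie_rank, action). The tie rank makes
    # archiving win when DeleteTaskAfter == ArchiveTaskAfter.
    candidates = []
    if archive_after > 0:
        candidates.append((archive_after, 0, "archive"))
    if delete_after > 0:
        candidates.append((delete_after, 1, "delete"))
    # DeleteFailedTaskAfter only applies to failed tasks. Ranking it last
    # on ties is an assumption; the comments do not specify this case.
    if failed and delete_failed_after > 0:
        candidates.append((delete_failed_after, 2, "delete-failed"))

    if not candidates:
        return None

    hours, _rank, action = min(candidates)
    return action, hours


if __name__ == "__main__":
    # Equal values: archiving executes first.
    print(first_cleanup_action(delete_after=720, archive_after=720))
    # A failed task is cleaned up before the standard mechanism.
    print(first_cleanup_action(720, 0, delete_failed_after=168, failed=True))
```

Because the smallest enabled value always wins, a DeleteFailedTaskAfter larger than DeleteTaskAfter or ArchiveTaskAfter never gets selected, which matches the "this option does nothing" note above.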

Comment 4 Dave Wysochanski 2014-07-29 13:31:52 UTC
Note the clarification of cleanup options is upstream, but I don't think any extra checks have been implemented yet.

commit 42db0647a509f31c035dad596ddd373ee8b49923
Author: Michal Toman <mtoman>
Date:   Tue Jun 10 11:31:47 2014 +0200

    retrace-server.conf: clarify cleanup options
    
    Signed-off-by: Michal Toman <mtoman>

Comment 5 Michal Toman 2014-07-30 12:21:05 UTC
Checks added

commit f4cffef491130363a19bc372fba1d8003db105a1
Author: Michal Toman <mtoman>
Date:   Wed Jul 30 12:40:59 2014 +0200

    rs-cleanup: add config sanity checks
    
    Signed-off-by: Michal Toman <mtoman>

Comment 6 Fedora Update System 2014-07-31 11:52:38 UTC
retrace-server-1.12-2.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/retrace-server-1.12-2.el6

Comment 7 Fedora Update System 2014-07-31 16:58:53 UTC
Package retrace-server-1.12-2.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing retrace-server-1.12-2.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2014-2089/retrace-server-1.12-2.el6
then log in and leave karma (feedback).

Comment 8 Fedora Update System 2014-08-15 18:57:53 UTC
retrace-server-1.12-2.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.