Bug 1095349

Summary:	Lessen risk of premature vmcore removal by clarifying comments in retrace-server.conf and possibly adding checks to avoid misconfiguration of cleanup job
Product:	[Fedora] Fedora EPEL	Reporter:	Dave Wysochanski <dwysocha>
Component:	retrace-server	Assignee:	Michal Toman <mtoman>
Status:	CLOSED ERRATA	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	low	Docs Contact:
Priority:	low
Version:	el6	CC:	brhatiga, dwysocha, mtoman, rvokal
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	retrace-server-1.12-2.el6	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-08-15 18:57:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Dave Wysochanski 2014-05-07 14:00:59 UTC

Description of problem:
After working on the cleanup job to remove tasks with failed status in https://bugzilla.redhat.com/show_bug.cgi?id=1086321, I started thinking about whether we should add some minimum checks or clarify the language in the retrace-server.conf file about the options for cleanup/archive. We also had an incident where we lost vmcores on our retrace-server system due to an error, but it was not related to the cleanup job. However, the incident got me thinking about what might lead up to a 'disaster' scenario and whether there is anything we can improve to lessen such risks.

Here are a couple of scenarios I came up with, though I'm not sure how likely they are:
1. Someone misreads the units in 'DeleteTaskAfter' and sets it to '30' thinking it is '30 days'. Since the value is in hours, vmcores are instead removed after 30 hours (just over a day), leading to a lot of vmcores removed.
2. Someone misreads / confuses 'DeleteTaskAfter' with 'DeleteFailedTaskAfter' and reverses the values, leading to a lot of vmcores removed. Normally, DeleteFailedTaskAfter < DeleteTaskAfter but I didn't put any check in to make sure this was the case.
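
To make the units mix-up in scenario 1 concrete, here is a minimal sketch of the arithmetic, assuming the option is interpreted in hours. The helper names are hypothetical and are not part of retrace-server.

```python
HOURS_PER_DAY = 24


def days_to_hours(days):
    """Convert a retention period in days to the hours value the option expects."""
    return days * HOURS_PER_DAY


def describe_hours(hours):
    """Spell out what an hours value really means in days and hours."""
    days, remainder = divmod(hours, HOURS_PER_DAY)
    return "%d hours = %d day(s) and %d hour(s)" % (hours, days, remainder)


# Intended "30 days" must be written as 720, not 30.
print(days_to_hours(30))
# What DeleteTaskAfter = 30 actually means.
print(describe_hours(30))
```

An admin who wants 30 days of retention would set DeleteTaskAfter = 720.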

One thing Brian R suggested is adding a comment line such as this above 'DeleteTaskAfter' in the retrace-server.conf file:
# WARNING: BE CAREFUL WHEN CHANGING THIS PARAMETER AS IT MAY LEAD TO DELETING MORE THAN DESIRED

We could probably add some checks, such as a minimum value for 'DeleteTaskAfter' and a check that DeleteFailedTaskAfter < DeleteTaskAfter, but then we need to decide what to do when a check fails: refuse to start retrace-server, override with a 'minimum' value, print a warning somewhere, etc.?
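
As a rough illustration of what such checks might look like, here is a minimal sketch that only collects warnings and leaves the decision about what to do with them to the caller. It assumes the values are in hours and that a value <= 0 disables the option. The function name, the parameter names, and the 24-hour minimum are all hypothetical, and this is not the actual retrace-server code.

```python
def check_cleanup_config(delete_after, delete_failed_after, min_delete_hours=24):
    """Return a list of warning strings; an empty list means nothing suspicious.

    All values are in hours; <= 0 is taken to mean "never".
    min_delete_hours is an arbitrary illustrative threshold.
    """
    warnings = []

    # Scenario 1: a small positive value suggests hours were mistaken for days.
    if 0 < delete_after < min_delete_hours:
        warnings.append(
            "DeleteTaskAfter = %d hours is below the minimum of %d hours"
            % (delete_after, min_delete_hours))

    # Scenario 2: the two options look reversed.
    if delete_after > 0 and delete_failed_after > delete_after:
        warnings.append(
            "DeleteFailedTaskAfter (%d) is greater than DeleteTaskAfter (%d); "
            "the values may be swapped" % (delete_failed_after, delete_after))

    return warnings


for message in check_cleanup_config(delete_after=30, delete_failed_after=720):
    print("WARNING: " + message)
```

Returning warnings rather than raising keeps the policy question open: the caller could log them, refuse to run the cleanup job, or fall back to a safer value.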

Version-Release number of selected component (if applicable):
retrace-server-1.11-4.el6.noarch

How reproducible:
Depends on the likelihood of misconfiguration.

Steps to Reproduce:
Set DeleteTaskAfter to a low value.

Actual results:
Premature vmcore losses

Expected results:

Additional info:
Brian R, or other admins or users of retrace-server, may be best placed to comment on the likelihood of misconfiguration. I don't want to overly complicate things, but if we can easily lessen the risk of data loss we should do it.

Comment 2 Michal Toman 2014-06-10 09:35:58 UTC

So far I've extended the comments. Do you think this is sufficient or do you want to implement some checking logic anyway?


# Delete old tasks after (hours); <= 0 means never
# This is mutually exclusive with ArchiveTaskAfter (see below)
# The one that occurs first removes the task from the system
# In case DeleteTaskAfter = ArchiveTaskAfter, archiving executes first
DeleteTaskAfter = 0

# Delete old failed tasks after (hours); <= 0 means never
# This is useful for cleanup of failed tasks before the standard
# mechanisms do (DeleteTaskAfter or ArchiveTaskAfter)
# In case DeleteFailedTaskAfter > DeleteTaskAfter
# or DeleteFailedTaskAfter > ArchiveTaskAfter, this option does nothing
DeleteFailedTaskAfter = 0

# Archive old task after (hours); <= 0 means never
# This is mutually exclusive with DeleteTaskAfter (see above)
# The one that occurs first removes the task from the system
# In case DeleteTaskAfter = ArchiveTaskAfter, archiving executes first
ArchiveTaskAfter = 0
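
The interaction between DeleteTaskAfter and ArchiveTaskAfter described in these comments can be sketched as follows. This is only a model of the documented rules (<= 0 means never, the earlier one wins, archiving wins a tie), not the rs-cleanup implementation, and the function name is hypothetical.

```python
def first_cleanup_action(delete_after, archive_after):
    """Return (action, hours) for whichever option removes a task first.

    Returns None when both options are disabled (<= 0).
    """
    candidates = []
    # The middle tuple element is a tie-breaker: archive (0) sorts before delete (1).
    if archive_after > 0:
        candidates.append((archive_after, 0, "archive"))
    if delete_after > 0:
        candidates.append((delete_after, 1, "delete"))

    if not candidates:
        return None

    hours, _, action = min(candidates)
    return action, hours


print(first_cleanup_action(delete_after=720, archive_after=168))
print(first_cleanup_action(delete_after=30, archive_after=30))
print(first_cleanup_action(delete_after=0, archive_after=0))
```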

Comment 4 Dave Wysochanski 2014-07-29 13:31:52 UTC

Note the clarification of cleanup options is upstream, but I don't think any extra checks have been implemented yet.

commit 42db0647a509f31c035dad596ddd373ee8b49923
Author: Michal Toman <mtoman>
Date:   Tue Jun 10 11:31:47 2014 +0200

    retrace-server.conf: clarify cleanup options
    
    Signed-off-by: Michal Toman <mtoman>

Comment 5 Michal Toman 2014-07-30 12:21:05 UTC

Checks added

commit f4cffef491130363a19bc372fba1d8003db105a1
Author: Michal Toman <mtoman>
Date:   Wed Jul 30 12:40:59 2014 +0200

    rs-cleanup: add config sanity checks
    
    Signed-off-by: Michal Toman <mtoman>

Comment 6 Fedora Update System 2014-07-31 11:52:38 UTC

retrace-server-1.12-2.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/retrace-server-1.12-2.el6

Comment 7 Fedora Update System 2014-07-31 16:58:53 UTC

Package retrace-server-1.12-2.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing retrace-server-1.12-2.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2014-2089/retrace-server-1.12-2.el6
then log in and leave karma (feedback).

Comment 8 Fedora Update System 2014-08-15 18:57:53 UTC

retrace-server-1.12-2.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.