1095349 – Lessen risk of premature vmcore removal by clarifying comments in retrace-server.conf and possibly adding checks to avoid misconfiguration of cleanup job


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora EPEL
Classification:	Fedora
Component:	retrace-server
Sub Component:
Version:	el6
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Michal Toman
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-05-07 14:00 UTC by Dave Wysochanski
Modified:	2015-03-23 00:42 UTC
CC List:	4 users
Fixed In Version:	retrace-server-1.12-2.el6
Clone Of:
Environment:
Last Closed:	2014-08-15 18:57:53 UTC
Type:	Bug
Embargoed:


Description Dave Wysochanski 2014-05-07 14:00:59 UTC

Description of problem:
After working on the cleanup job to remove tasks with failed status in https://bugzilla.redhat.com/show_bug.cgi?id=1086321, I started thinking about whether we should add some minimum checks or clarify the language in the retrace-server.conf file about the cleanup/archive options. We also had an incident where we lost vmcores on our retrace-server system due to an error, though it was not related to the cleanup job. Still, the incident got me thinking about what might lead to a 'disaster' scenario and whether there is anything we can improve to lessen such risks.

Here are a couple of scenarios I came up with, though I'm not sure how likely they are:
1. Someone misreads the units of 'DeleteTaskAfter' and sets it to '30', thinking it means '30 days'. Because the value is in hours, vmcores get removed after just over a day (30 hours), leading to a lot of vmcores removed.
2. Someone confuses 'DeleteTaskAfter' with 'DeleteFailedTaskAfter' and swaps the values, leading to a lot of vmcores removed. Normally DeleteFailedTaskAfter < DeleteTaskAfter, but I didn't put in any check to make sure this is the case.

One thing Brian R suggested is a comment line such as this in the retrace-server.conf file, placed above 'DeleteTaskAfter':
# WARNING: BE CAREFUL WHEN CHANGING THIS PARAMETER AS IT MAY LEAD TO DELETING MORE THAN DESIRED

We could probably add some checks, such as a minimum value for 'DeleteTaskAfter', and whether DeleteFailedTaskAfter < DeleteTaskAfter, but then we need to decide what to do about it - do we not start retrace-server, override with a 'minimum' value, print a warning somewhere, etc?
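To make the idea concrete, here is a minimal sketch of what such checks might look like, taking the "print a warning" route. It is an illustration only and not the code in rs-cleanup: the function name `check_cleanup_config`, the `min_hours` threshold, and the message wording are all hypothetical, and the values are assumed to be integers in hours as in retrace-server.conf.

```python
def check_cleanup_config(delete_after, delete_failed_after, archive_after,
                         min_hours=24):
    """Return a list of warnings for suspicious cleanup settings.

    All values are in hours; <= 0 means the option is disabled.
    min_hours is a hypothetical lower bound, not a retrace-server setting.
    """
    problems = []

    # Scenario 1: a small positive value may mean hours were read as days.
    if 0 < delete_after < min_hours:
        problems.append(
            "DeleteTaskAfter = %d hours is below the minimum of %d hours"
            % (delete_after, min_hours))

    # Scenario 2: the failed-task limit is expected to be the shorter one.
    if delete_failed_after > 0:
        for name, value in (("DeleteTaskAfter", delete_after),
                            ("ArchiveTaskAfter", archive_after)):
            if value > 0 and delete_failed_after > value:
                problems.append(
                    "DeleteFailedTaskAfter (%d) is greater than %s (%d)"
                    % (delete_failed_after, name, value))

    return problems


if __name__ == "__main__":
    # 30 was meant as days; 30 days would be 30 * 24 = 720 hours.
    for warning in check_cleanup_config(30, 0, 0, min_hours=48):
        print("WARNING: " + warning)
```

Returning the warnings rather than exiting leaves the open question above (refuse to start, override, or only warn) to the caller.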

Version-Release number of selected component (if applicable):
retrace-server-1.11-4.el6.noarch

How reproducible:
Depends on the likelihood of misconfiguration.

Steps to Reproduce:
Set DeleteTaskAfter to a low value.

Actual results:
Premature vmcore losses

Expected results:

Additional info:
Brian R or other admins and users of retrace-server may be best placed to comment on the likelihood of misconfiguration. I don't want to overcomplicate things, but if we can easily lessen the risk of data loss, we should do it.

Comment 2 Michal Toman 2014-06-10 09:35:58 UTC

So far I've extended the comments. Do you think this is sufficient or do you want to implement some checking logic anyway?


# Delete old tasks after (hours); <= 0 means never
# This is mutually exclusive with ArchiveTaskAfter (see below)
# The one that occurs first removes the task from the system
# In case DeleteTaskAfter = ArchiveTaskAfter, archiving executes first
DeleteTaskAfter = 0

# Delete old failed tasks after (hours); <= 0 means never
# This is useful for cleanup of failed tasks before the standard
# mechanisms do (DeleteTaskAfter or ArchiveTaskAfter)
# In case DeleteFailedTaskAfter > DeleteTaskAfter
# or DeleteFailedTaskAfter > ArchiveTaskAfter, this option does nothing
DeleteFailedTaskAfter = 0

# Archive old task after (hours); <= 0 means never
# This is mutually exclusive with DeleteTaskAfter (see above)
# The one that occurs first removes the task from the system
# In case DeleteTaskAfter = ArchiveTaskAfter, archiving executes first
ArchiveTaskAfter = 0
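
The interaction described in these comments can be sketched as a small decision function. This is only a reading of the comments above, not the actual cleanup code: the name `cleanup_action` is hypothetical, a limit is assumed to be reached when the task age is greater than or equal to it, and the tie between DeleteFailedTaskAfter and ArchiveTaskAfter (which the comments do not cover) is resolved in favor of archiving as an assumption.

```python
def cleanup_action(age_hours, failed, delete_after=0,
                   delete_failed_after=0, archive_after=0):
    """Return "keep", "archive" or "delete" for a task of the given age.

    All limits are in hours; <= 0 means never.
    """
    # (limit, tie-break priority, action); a lower priority wins on a tie,
    # so archiving executes first when limits are equal.
    candidates = []
    if archive_after > 0:
        candidates.append((archive_after, 0, "archive"))
    if delete_after > 0:
        candidates.append((delete_after, 1, "delete"))
    if failed and delete_failed_after > 0:
        candidates.append((delete_failed_after, 2, "delete"))

    # Only limits the task has already reached can act on it.
    due = [c for c in candidates if age_hours >= c[0]]
    if not due:
        return "keep"

    # The limit that occurs first removes the task from the system.
    return min(due)[2]
```

For example, with DeleteTaskAfter = 720 and DeleteFailedTaskAfter = 24, a failed task that is 30 hours old is deleted, while a successful task of the same age is kept.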

Comment 4 Dave Wysochanski 2014-07-29 13:31:52 UTC

Note the clarification of cleanup options is upstream, but I don't think any extra checks have been implemented yet.

commit 42db0647a509f31c035dad596ddd373ee8b49923
Author: Michal Toman <mtoman>
Date:   Tue Jun 10 11:31:47 2014 +0200

    retrace-server.conf: clarify cleanup options
    
    Signed-off-by: Michal Toman <mtoman>

Comment 5 Michal Toman 2014-07-30 12:21:05 UTC

Checks added

commit f4cffef491130363a19bc372fba1d8003db105a1
Author: Michal Toman <mtoman>
Date:   Wed Jul 30 12:40:59 2014 +0200

    rs-cleanup: add config sanity checks
    
    Signed-off-by: Michal Toman <mtoman>

Comment 6 Fedora Update System 2014-07-31 11:52:38 UTC

retrace-server-1.12-2.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/retrace-server-1.12-2.el6

Comment 7 Fedora Update System 2014-07-31 16:58:53 UTC

Package retrace-server-1.12-2.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing retrace-server-1.12-2.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2014-2089/retrace-server-1.12-2.el6
then log in and leave karma (feedback).

Comment 8 Fedora Update System 2014-08-15 18:57:53 UTC

retrace-server-1.12-2.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.
