Bug 1566115 - RetraceServer.task_age implementation vulnerable to rogue script or random user touching retrace task directories
Summary: RetraceServer.task_age implementation vulnerable to rogue script or random us...
Keywords:
Status: NEW
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: retrace-server
Version: epel7
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: ---
Assignee: abrt
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-11 14:34 UTC by Dave Wysochanski
Modified: 2020-03-31 21:34 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug


Attachments (Terms of Use)

Description Dave Wysochanski 2018-04-11 14:34:12 UTC
Description of problem:
Our cleanup logic is based on 'task_age' which in turn is based only on the change time of the task directory.  This is too fragile as any random user or script doing 'touch' on the task directories can reset the task_age and thus defeat the cleanup logic.

We should make the task_age only reset when running a retrace-server-{task|interact|worker} command, not allow any user or script to reset it with a file based command.

Version-Release number of selected component (if applicable):
upstream as of ef1fb69928e9dbd28e28b1fd73fb6c2481cef14f

How reproducible:
everytime

Steps to Reproduce:
1. any user can 'touch' a retrace-server task directory and defeat the cleanup logic

Actual results:
the cleanup logic may be impacted by a rogue script that inadvertantly touches all task directories.

Expected results:
only specific retrace-server commands should reset the task_age

Additional info:
It looks like this recently happened with our production retrace instance.  As a result, we will have to do manual cleanup to avoid running out of space or severely change our Delete*TaskAfter settings which will likely lose data that people care about.

Alternatively, we could implement an atime based age of the vmcore itself.  That way we only remove tasks where the vmcore was accessed.  This was proposed in another bug though (see https://bugzilla.redhat.com/show_bug.cgi?id=1228756) but there are also downsides to this (other scripts may actually scan vmcores looking for issues which would then reset any atime based task_age).

Comment 1 Dave Wysochanski 2018-04-11 18:18:23 UTC
Actually now that I investigated in further details, based on the timestamps on various files and directories, I think what happened is someone ran a retrace-server-interact command on all tasks (not sure if it's intentional or not).  So I may end up closing this as WONTFIX.

This does bring up the matter of whether our existing "DeleteTaskAfter" and "DeleteFailedTaskAfter" management of vmcores is really adequate.  It's ok but it's not great.  It would be a lot better if we could shift towards higher value data about whether a vmcore is useful or not to make removal or archival decisions.  For example, https://bugzilla.redhat.com/show_bug.cgi?id=1548183 may be better approach as well as some very aggressive cleanup based on times or other data (case state - outside of retrace-server).

Comment 2 Matej Marušák 2018-04-13 08:52:00 UTC
It seems to me, that it is not very well thought out to use last modification time for cleaning up tasks. It would be much much better to save some value (last modification time or delete-after-date) into specific file in task or into database. This time could be changed only by specific scripts and cleanup task would look into this tasks. It also would be easier to implement closed-bugzilla-cleanup (when assigned bugzzila gets closed, clean the task after a short while).

Comment 3 Dave Wysochanski 2018-04-13 09:49:00 UTC
(In reply to Matej Marušák from comment #2)
> It seems to me, that it is not very well thought out to use last
> modification time for cleaning up tasks. It would be much much better to
> save some value (last modification time or delete-after-date) into specific
> file in task or into database. This time could be changed only by specific
> scripts and cleanup task would look into this tasks. It also would be easier
> to implement closed-bugzilla-cleanup (when assigned bugzzila gets closed,
> clean the task after a short while).

I am not sure I follow.  How would this relate to the closed-bugzilla-cleanup?

Even if we put the task_age into a file that is not accessible, anyone can do something like what was done in this instance and reset the task age by running a retrace command.  In essence this bug should not have been filed because it is not what happened.  That said, someone could have a rogue script to touch all of the directories, and retrace cleanup is vulnerable to that.  We've not seen this happen though and so I'm not sure it's worth too much to fix that, though in principle it is probably better to fix it.  The other angle is really the atime on the vmcore - that's really a better measure of usefulness of a vmcore - did someone read it recently?  That is vulnerable to scripts analyzing a group of vmcores for patterns (though doesn't that mean the set of vmcores are useful?), so we could get into another similar problem like this bug describes.

In the end determining when to clean / remove a vmcore is not easy unless you have some better data like is the vmcore associated with a bug or a case and what is the status?

I tend to think the simplest design that gets us the best result would be atime of the vmcore.  However I do not know how well that would work in practice.

Comment 4 Dave Wysochanski 2018-04-15 12:35:51 UTC
I guess the question we should answer about this bug is if we want to make the task_age something that can only be changed by a retrace-server-* command, or is it ok to have something that is just filesystem based (like mtime or atime)?

If we make it hidden then we probably want to expose the task_age via retrace-server-task so users can query it.

Comment 5 Matej Marušák 2018-04-17 07:46:04 UTC
> How would this relate to the closed-bugzilla-cleanup

We can set specific time, how long after closed bugzilla we want to keep the task. It may be shorter than default 30 days.

> anyone can do something like what was done in this instance and reset the task age 

We cannot fight with that - if someone wants to reset times, there is nothing we can do about it.

> In the end determining when to clean / remove a vmcore is not easy unless you have some better data like is the vmcore associated with a bug or a case and what is the status?

Agree. atime would not bring that much of improvement, but if you believe it would be better (and not break the cleanup logic), it can be implemented.

> I guess the question we should answer about this bug is if we want to make the task_age something that can only be changed by a retrace-server-* command, or is it ok to have something that is just filesystem based (like mtime or atime)?

I still think that having own time is better. It is more foolproof against users.


Note You need to log in before you can comment on or make changes to this bug.