Bug 1228756 - retrace-server Removal policy needs to be based on atime of the vmcore file
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: retrace-server
Version: el6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: abrt
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-06-05 15:38 UTC by Dave Wysochanski
Modified: 2020-10-16 10:58 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-16 10:58:29 UTC
Type: Bug
Embargoed:



Description Dave Wysochanski 2015-06-05 15:38:59 UTC
Description of problem:
Today retrace-server removes files based on the mtime of the task directory.  This isn't a good removal policy: I'm getting complaints from people who run crash directly on cores and, while they are still analyzing them, retrace-server removes the cores out from underneath them.

We need a more sane removal policy.

Version-Release number of selected component (if applicable):
retrace-server-1.12-3.el6.noarch

How reproducible:
Any time someone has a retrace task and uses crash directly on a vmcore, retrace-server doesn't notice and may remove the vmcore.  There are multiple valid reasons someone may want to run crash directly rather than go through retrace-server-interact.

Steps to Reproduce:
1.  Create a retrace task based on a vmcore
2. Use crash directly on the vmcore file rather than retrace-server-interact <taskid> crash

Actual results:
vmcore gets removed while being used

Expected results:
vmcore only gets removed if it's not been used for a certain period of time (atime based removal).  Basically I'm suggesting to forget about the mtime of the task directory, and just use the atime of the crash/vmcore file to trigger the cleanup operations.

Comment 1 Sterling Alexander 2015-06-05 15:47:00 UTC
As an example of this, retrace task 286582644 was actively being analyzed using crash directly and not via the "retrace-server-interact <task_id> crash" command.  This task was apparently cleaned up between the evening of 6-4-2015 and the morning of 6-5-2015.  There was a large amount of analysis done on this core on 6-4-2015.

Comment 2 Harshula Jayasuriya 2015-06-05 16:43:46 UTC
Is there a reason why crash is being used directly instead of via retrace-server-interact <taskid> crash?

Comment 3 Harshula Jayasuriya 2015-06-05 16:53:25 UTC
Using atime resulted in vmcores not getting deleted because associates periodically use the vmcore repository as a large dataset and access all the vmcores.

Comment 4 Dave Wysochanski 2015-06-10 22:30:54 UTC
(In reply to Harshula Jayasuriya from comment #2)
> Is there a reason why crash is being used directly instead of via
> retrace-server-interact <taskid> crash?

Yes there are some use cases.  Here are some that I know of:
1. Engineering and others know how to invoke 'crash' on a vmcore file.  Any new command such as 'retrace-server-interact' may not be known / used.

2. For 32-bit vmcores, it's often preferred to run crash directly.  This is due to mock usage and its various problems (see many bugs filed on it).  Especially since crash is so old in mock environment I think people use crash directly on 32-bit vmcores.

3. Some 'failed' tasks are actually quite useful, but fail due to non-errata kernel usage.  If a task status is failed, it will get removed more aggressively due to the different policy for failed tasks (see other bug).  However, people often extract the debuginfo and run crash directly.

The above are all valid use cases to me and I try not to force people into an unnatural / new workflow if there's no benefit.

Comment 5 Dave Wysochanski 2015-06-10 22:32:25 UTC
(In reply to Harshula Jayasuriya from comment #3)
> Using atime resulted in vmcores not getting deleted because associates
> periodically use the vmcore repository as a large dataset and access all the
> vmcores.

I wasn't aware this was an issue.  When did this happen and what was the extent of the usage problem?

To me, using atime is the most sensible approach since it means people are still using the files.  IMO we should not ever remove a vmcore if someone is using it.  Is there some reason you don't like atime other than the reason you mention?

Comment 6 Dave Wysochanski 2015-06-11 17:32:36 UTC
Another reason:
4. Use an alternative 'crashrc' file

Keep in mind "retrace-server-interact <taskid> crash" does give the command to run crash directly when you first start it, so this is a valid way to look at the vmcore:

$ retrace-server-interact 590663811 crash
If you want to execute the command manually, you can run
$ crash -i /cores/retrace/tasks/590663811/crashrc /cores/retrace/tasks/590663811/crash/vmcore /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-431.17.1.el6.x86_64/vmlinux

Comment 7 Harshula Jayasuriya 2015-06-11 18:21:38 UTC
(In reply to Dave Wysochanski from comment #5)

> I wasn't aware this was an issue.  When did this happen and what was the
> extent of the usage problem?

During testing, Tasks that we were expecting to be deleted were not even though DeleteTaskAfter hours had passed. IIRC, this was because the atime of Task dir was getting updated even though the vmcore was not being used. Using mtime w/ r-s-i solved the problem.

> To me, using atime is the most sensible approach since it means people are
> still using the files.

MAIN ISSUE: A sensible solution is one where the Task gets deleted after exactly DeleteTaskAfter hours have passed since the vmcore was used for vmcore analysis. If you can ensure that happens using atime, go for it. Otherwise, you'll just see unwanted vmcores chewing up the limited storage space.


(In reply to Dave Wysochanski from comment #4)

> 1. Engineering and others know how to invoke 'crash' on a vmcore file.  Any
> new command such as 'retrace-server-interact' may not be known / used.

The explicit command to use is given on the webpage we pass on to ENG: https://optimus[...]/manager/<taskid> . How many vmcores have been deleted while ENG have been analysing them?

The Retrace Server User Guide states the command to run. This should be part of the coaching of GSS associates in kernel related SBRs.

> 2. For 32-bit vmcores, it's often preferred to run crash directly.  This is
> due to mock usage and its various problems (see many bugs filed on it). 
> Especially since crash is so old in mock environment I think people use
> crash directly on 32-bit vmcores.

We need to solve the many issues surrounding 32-bit vmcores.
 
> 3. Some 'failed' tasks are actually quite useful, but fail due to non-errata
> kernel usage.  If a task status is failed, it will get removed more
> aggressively due to the different policy for failed tasks (see other bug). 
> However, people often extract the debuginfo and run crash directly.

Users should be extracting debuginfo in the exceptions directory and definitely not in the Task directories. Again covered in the Retrace Server User Guide. Another coaching opportunity.


(In reply to Dave Wysochanski from comment #6)

> 4. Use an alternative 'crashrc' file

That file is editable, right?
 
> Keep in mind "retrace-server-interact <taskid> crash" does give the command
> to run crash directly when you first start it, so this is a valid way to
> look at the vmcore:
> 
> $ retrace-server-interact 590663811 crash
> If you want to execute the command manually, you can run
> $ crash -i /cores/retrace/tasks/590663811/crashrc
> /cores/retrace/tasks/590663811/crash/vmcore
> /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-431.17.1.
> el6.x86_64/vmlinux

You "can" run it, but "retrace-server-interact <taskid> crash" is a lot easier to run.

Focus on the paragraph annotated "MAIN ISSUE" and solve that problem instead of enumerating a list of tenuous reasons for not using the r-s-i command as it is documented on the Task webpage and the User Guide.

cya,
#

Comment 8 Dave Wysochanski 2015-06-12 10:24:17 UTC
(In reply to Harshula Jayasuriya from comment #7)
> (In reply to Dave Wysochanski from comment #5)
> 
> > I wasn't aware this was an issue.  When did this happen and what was the
> > extent of the usage problem?
> 
> During testing, Tasks that we were expecting to be deleted were not even
> though DeleteTaskAfter hours had passed. IIRC, this was because the atime of
> Task dir was getting updated even though the vmcore was not being used.
> Using mtime w/ r-s-i solved the problem.
> 
Yes I could see that.  I'm not proposing using any directory based time.  Notice in the description:
"vmcore only gets removed if it's not been used for a certain period of time (atime based removal).  Basically I'm suggesting to forget about the mtime of the task directory, and just use the atime of the crash/vmcore file to trigger the cleanup operations."

So we look at the atime of the vmcore and make a decision about the task based on that.


> > To me, using atime is the most sensible approach since it means people are
> > still using the files.
> 
> MAIN ISSUE: A sensible solution is one where the Task gets deleted after
> exactly DeleteTaskAfter hours have passed since the vmcore was used for
> vmcore analysis. If you can ensure that happens using atime, go for it.
> Otherwise, you'll just see unwanted vmcores chewing up the limited storage
> space.
> 
Yes I think atime of the vmcore file accomplishes the above.


> 
> (In reply to Dave Wysochanski from comment #4)
> 
> > 1. Engineering and others know how to invoke 'crash' on a vmcore file.  Any
> > new command such as 'retrace-server-interact' may not be known / used.
> 
> The explicit command to use is given on the webpage we pass on to ENG:
> https://optimus[...]/manager/<taskid> . How many vmcores have been deleted
> while ENG have been analysing them?
> 
> The Retrace Server User Guide states the command to run. This should be part
> of the coaching of GSS associates in kernel related SBRs.
> 
I don't know but my point is that it would be unexpected for a vmcore to be removed if crash is run directly on it.


> > 2. For 32-bit vmcores, it's often preferred to run crash directly.  This is
> > due to mock usage and its various problems (see many bugs filed on it). 
> > Especially since crash is so old in mock environment I think people use
> > crash directly on 32-bit vmcores.
> 
> We need to solve the many issues surrounding 32-bit vmcores.
>  
> > 3. Some 'failed' tasks are actually quite useful, but fail due to non-errata
> > kernel usage.  If a task status is failed, it will get removed more
> > aggressively due to the different policy for failed tasks (see other bug). 
> > However, people often extract the debuginfo and run crash directly.
> 
> Users should be extracting debuginfo in the exceptions directory and
> definitely not in the Task directories. Again covered in the Retrace Server
> User Guide. Another coaching opportunity.
> 
Actually the Users Guide tells them to extract debuginfo into exceptions.  What is the alternative you suggest for non-errata vmcores?

Until recently, when someone discovered you could copy non-errata kernel-debuginfo into a certain directory and retrace-server would pick them up (https://bugzilla.redhat.com/show_bug.cgi?id=888874#c5), we had to manually set up these vmcores.  Where are people supposed to do this?


> 
> (In reply to Dave Wysochanski from comment #6)
> 
> > 4. Use an alternative 'crashrc' file
> 
> That file is editable, right?
>  
Yes but what if someone wants their own?


> > Keep in mind "retrace-server-interact <taskid> crash" does give the command
> > to run crash directly when you first start it, so this is a valid way to
> > look at the vmcore:
> > 
> > $ retrace-server-interact 590663811 crash
> > If you want to execute the command manually, you can run
> > $ crash -i /cores/retrace/tasks/590663811/crashrc
> > /cores/retrace/tasks/590663811/crash/vmcore
> > /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-431.17.1.
> > el6.x86_64/vmlinux
> 
> You "can" run it, but "retrace-server-interact <taskid> crash" is a lot easier
> to run.
> 
That's your opinion.

> Focus on the paragraph annotated "MAIN ISSUE" and solve that problem instead
> of enumerating a list of tenuous reasons for not using the r-s-i command as
> it is documented on the Task webpage and the User Guide.
> 
Your "MAIN ISSUE" is solved by the suggested change given in the description of this bug.

Also, I guess you and I somewhat disagree on the model of 'coaching' people to do something.  I'd much rather work with people's natural tendencies, if they are reasonable, than 'coach' them into something unnatural.  What I've found over time is that people often have good reasons for doing something, and as I've enumerated above, there are good reasons here.

Comment 9 Dave Wysochanski 2015-06-12 10:52:49 UTC
Note that some failed tasks do not have a vmcore file.  If a vmcore doesn't exist we can use the mtime of the directory and the DeleteFailedTaskAfter logic should remove these.

Comment 10 Dave Wysochanski 2015-06-12 11:07:51 UTC
Looking at the code, all we'd need is a new method in class RetraceTask in src/lib/retrace.py.

Right now we have:
    def get_age(self):
        """Returns the age of the task in hours."""
        return int(time.time() - os.path.getmtime(self._savedir)) / 3600

I would suggest a 'get_vmcore_atime' method and then modifying the logic in src/retrace-server-cleanup to try to use the vmcore atime first, then fall back to the task age.  The 'get_vmcore_atime' method should take into account the fact that the vmcore file may not exist.
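
A rough sketch of that idea, written as standalone functions rather than as the actual RetraceTask method.  The function names are made up for illustration, and the crash/vmcore location under the task directory is taken from the paths shown earlier in this bug; none of this is the real retrace-server code.

```python
import os
import time


def vmcore_atime_age_hours(task_dir):
    """Hours since crash/vmcore under task_dir was last accessed.

    Returns None when the vmcore does not exist (e.g. some failed tasks).
    """
    vmcore = os.path.join(task_dir, "crash", "vmcore")
    try:
        atime = os.path.getatime(vmcore)
    except OSError:
        # Missing or unreadable vmcore: let the caller fall back.
        return None
    return int(time.time() - atime) // 3600


def dir_age_hours(task_dir):
    """Hours since the task directory was last modified (current policy)."""
    return int(time.time() - os.path.getmtime(task_dir)) // 3600


def effective_age_hours(task_dir):
    """Prefer the vmcore atime; fall back to the directory mtime."""
    age = vmcore_atime_age_hours(task_dir)
    if age is None:
        age = dir_age_hours(task_dir)
    return age
```

The cleanup script would then compare effective_age_hours() against DeleteTaskAfter instead of the plain directory age.  This only helps on file systems that actually maintain atime; with a noatime mount the vmcore atime never changes.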

Comment 11 Dave Wysochanski 2015-06-16 13:41:08 UTC
Recently it was pointed out that if atime is used and someone runs a backup job over the retrace tasks, the backup will update atime and throw off cleanup.  While this isn't a problem for our implementation (we don't have the space for backups), it's possible someone may not want to use atime due to this.  So in addition to Harshula's comments about using all tasks as a large data set, as much as I hate to say it, we may want to make this configurable.  I would like to try atime but if it becomes a problem we can turn it off.
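
To illustrate what "configurable" could look like, here is a minimal sketch of the decision logic with an on/off switch.  The use_vmcore_atime flag and both function names are hypothetical (no such option exists in retrace-server today); delete_after_hours stands in for the existing DeleteTaskAfter setting.

```python
def pick_age(dir_age_hours, vmcore_atime_age_hours, use_vmcore_atime):
    """Choose which age (in hours) the cleanup should act on.

    vmcore_atime_age_hours may be None when the task has no vmcore.
    use_vmcore_atime is a hypothetical config switch.
    """
    if use_vmcore_atime and vmcore_atime_age_hours is not None:
        return vmcore_atime_age_hours
    return dir_age_hours


def should_delete(dir_age_hours, vmcore_atime_age_hours,
                  delete_after_hours, use_vmcore_atime=True):
    """True when the chosen age has reached the deletion threshold."""
    age = pick_age(dir_age_hours, vmcore_atime_age_hours, use_vmcore_atime)
    return age >= delete_after_hours
```

With the switch off, behaviour is the same as the current mtime-based policy, so sites with backup jobs or bulk readers of the vmcore repository would be unaffected.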

Comment 13 Dave Wysochanski 2015-06-22 17:38:20 UTC
Given the concerns raised, the lack of definitive proof that this affects a large number of vmcores, and some of the other open bugs, I am lowering the severity / priority.  If we can show a certain percentage of users run crash directly instead of retrace-server-interact, we may want to revisit this.  We can probably get some numbers by counting how many tasks have atime of the vmcore != mtime of retrace task directory.
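
Something along these lines could produce those numbers.  It is only a sketch: the function name and the tolerance are my own choices, and it treats "vmcore atime noticeably later than the task directory mtime" as a sign of direct crash usage, which is an assumption rather than a verified property of retrace-server-interact.

```python
import os


def count_direct_access(tasks_root, tolerance_seconds=60):
    """Count tasks whose vmcore was read after the task dir last changed.

    Returns (direct, total), where total only counts tasks that have a
    crash/vmcore file.  tolerance_seconds is an arbitrary slack value.
    """
    direct = 0
    total = 0
    for name in sorted(os.listdir(tasks_root)):
        task_dir = os.path.join(tasks_root, name)
        vmcore = os.path.join(task_dir, "crash", "vmcore")
        if not os.path.isfile(vmcore):
            continue  # no vmcore (e.g. some failed tasks)
        total += 1
        gap = os.path.getatime(vmcore) - os.path.getmtime(task_dir)
        if gap > tolerance_seconds:
            direct += 1
    return direct, total
```

For example, count_direct_access("/cores/retrace/tasks") would return a pair such as (direct, total) from which the percentage can be worked out.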

Comment 14 Dave Wysochanski 2018-04-11 15:19:53 UTC
I am taking this and will probably make a final call on this.  I filed / took bug 1566115 to improve the task_age because of another problem we recently encountered so this bug is somewhat similar.

