Bug 1082327 - RFE: Add a subdirectory to index vmcores by 'caseno' if 'caseno' file exists, need human-readable name for a retrace task
Summary: RFE: Add a subdirectory to index vmcores by 'caseno' if 'caseno' file exists,...
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: retrace-server
Version: el6
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
Assignee: abrt
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 1082376
Blocks:
 
Reported: 2014-03-30 12:14 UTC by Dave Wysochanski
Modified: 2020-11-30 15:13 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-11-30 15:13:31 UTC
Type: Bug
Embargoed:


Attachments
Simple bash script to create an index of retrace tasks by caseno and finished_time (938 bytes, application/x-shellscript)
2014-03-30 12:14 UTC, Dave Wysochanski
Simple bash script to create an index of retrace tasks by caseno and finished_time (929 bytes, application/x-shellscript)
2014-03-30 19:21 UTC, Dave Wysochanski

Description Dave Wysochanski 2014-03-30 12:14:27 UTC
Created attachment 880306
Simple bash script to create an index of retrace tasks by caseno and finished_time

Description of problem:
Often we get cases where customers submit multiple vmcores.  After about 1 or 2 vmcores it gets difficult to keep track of the task IDs of the vmcores on the case.  It's not a ton of time, but you have to search the case comments, etc.  People usually start with the case they are working on, so they know the case number.

This RFE is to add a 'core-by-caseno' index subdirectory, and create a symlink to the retrace task with a meaningful name (I picked 'finished_time' in my index, but we could use something else, such as the 'DATE' and 'NODENAME' info inside the crash 'sys' output).

$ ls -l 001050838/
total 0
lrwxrwxrwx. 1 dwysocha dwysocha 30 Mar 30 07:28 2014-Mar-06-1255-443952481 -> /cores/retrace/tasks/443952481
lrwxrwxrwx. 1 dwysocha dwysocha 30 Mar 30 07:28 2014-Mar-06-1259-864010766 -> /cores/retrace/tasks/864010766
lrwxrwxrwx. 1 dwysocha dwysocha 30 Mar 30 07:28 2014-Mar-12-1504-177582185 -> /cores/retrace/tasks/177582185
...
lrwxrwxrwx. 1 dwysocha dwysocha 30 Mar 30 07:28 2014-Mar-25-0853-229467397 -> /cores/retrace/tasks/229467397
lrwxrwxrwx. 1 dwysocha dwysocha 30 Mar 30 07:28 2014-Mar-26-1333-660655087 -> /cores/retrace/tasks/660655087
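For reference, link names in the format of the listing above can be built from a task's finish time and task id. The Python sketch below assumes finished_time is available as a Unix timestamp and formats it in UTC with the default (C locale) month abbreviation; it also shows how NODENAME could be pulled out of crash 'sys' output if that naming route is chosen instead. All function names are hypothetical and not part of retrace-server.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def finished_time_name(task_id: str, finished_ts: float) -> str:
    """Name like 2014-Mar-06-1255-443952481: finish time (UTC) + task id."""
    stamp = datetime.fromtimestamp(finished_ts, tz=timezone.utc)
    # %b gives "Mar" under the default C locale
    return f"{stamp:%Y-%b-%d-%H%M}-{task_id}"


def parse_sys_output(text: str) -> Dict[str, str]:
    """Collect 'KEY: value' lines from crash 'sys' output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Field names in 'sys' output are upper case (NODENAME, DATE, ...)
        if sep and key.isupper():
            fields[key] = value.strip()
    return fields


def nodename_name(task_id: str, sys_text: str) -> Optional[str]:
    """Alternative name <NODENAME>-<task id>, or None if NODENAME is missing."""
    node = parse_sys_output(sys_text).get("NODENAME")
    if not node:
        return None
    # A slash would turn the link name into a path, so replace it
    return f"{node.replace('/', '_')}-{task_id}"


sample = """      KERNEL: /usr/lib/debug/lib/modules/vmlinux
    NODENAME: host1.example.com
        DATE: Thu Mar  6 12:55:01 2014
"""
print(finished_time_name("443952481", 1394110500))  # 2014-Mar-06-1255-443952481
print(nodename_name("443952481", sample))           # host1.example.com-443952481
```

Using UTC keeps names stable regardless of the server's timezone; local time would match the wall clock better but is a judgment call.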


Version-Release number of selected component (if applicable):
retrace-server-1.11-1.el6.noarch


Details of the RFE:
1. Create a subdirectory underneath /cores/retrace/ where the index will go (perhaps /cores/retrace/tasks-by-caseno)
2. When a task completes, check for the presence of 'caseno' file in the task directory.
3. If the caseno file exists, cat the file, and based on the contents, check whether /cores/retrace/tasks-by-caseno/$caseno directory already exists in the index, if not mkdir /cores/retrace/tasks-by-caseno/$caseno.
4. Once the /cores/retrace/tasks-by-caseno/$caseno exists, create the symlink to the retrace task, with a human-readable / identifying name, such as the following:
2014-Mar-25-0853-229467397 -> /cores/retrace/tasks/229467397
5. Details of the symlink name are TBD
* The format of the symlink needs some thought.  For example, we may want to use an ISO-style format for the date (e.g. 201403061255), but personally I liked something more readable, such as the above.  We could also use the info inside the 'sys' output of crash, because it's often more meaningful to think about the task in terms of the date/time the crash occurred (rather than the time the retrace task completed) and perhaps the name of the machine that crashed.
6. It may be better to have a small Python tool to do this rather than adding a couple of lines to the existing retrace-server.
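Steps 1-4 could look roughly like this as a small Python tool. This is a sketch, not retrace-server code: the index location, the assumption that a caseno file holds only digits, and the helper names are all hypothetical. Rejecting anything other than digits keeps a malformed caseno file from pointing the index outside its directory, and tolerating an existing link makes re-runs harmless.

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Hypothetical location; the RFE suggests "perhaps /cores/retrace/tasks-by-caseno".
INDEX_ROOT = Path("/cores/retrace/tasks-by-caseno")


def link_name(task_id: str, finished_ts: float) -> str:
    """Human-readable link name, e.g. 2014-Mar-06-1255-443952481 (UTC)."""
    stamp = datetime.fromtimestamp(finished_ts, tz=timezone.utc)
    return f"{stamp:%Y-%b-%d-%H%M}-{task_id}"


def index_task(task_dir: Path, finished_ts: float,
               index_root: Path = INDEX_ROOT) -> Optional[Path]:
    """Index one finished task by caseno; return the symlink, or None."""
    # Step 2: look for the optional caseno file in the task directory.
    try:
        caseno = (task_dir / "caseno").read_text().strip()
    except (FileNotFoundError, PermissionError):
        return None
    # Assumption: case numbers are plain digits. Anything else (empty,
    # "../x", embedded slashes) is skipped instead of being used as a path.
    if not re.fullmatch(r"[0-9]+", caseno):
        return None
    # Steps 1 and 3: create the index root and the per-case directory.
    case_dir = index_root / caseno
    case_dir.mkdir(parents=True, exist_ok=True)
    # Step 4: symlink with a readable name pointing at the task directory.
    link = case_dir / link_name(task_dir.name, finished_ts)
    try:
        link.symlink_to(task_dir.resolve())
    except FileExistsError:
        pass  # already indexed, e.g. the tool ran twice
    return link
```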


Additional info:
A number of months back I had put this on the list of retrace RFEs and forgot about it until I got a case with 20+ vmcores on it.  So I created a simple script to index it and asked for feedback on sbr-kernel list (http://post-office.corp.redhat.com/archives/sbr-kernel-list/2014-March/msg00025.html).  Feedback was all positive that it would be a useful thing to have.

This would probably be better implemented inside retrace, since retrace could also clean up the symlink when the task is deleted.  It is probably very easy to do inside retrace too.  Note there are a few pitfalls to watch out for, and I think I've covered most of them in the attached script.
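Cleanup can also be approached from the other side: instead of hooking task deletion, a periodic pass can remove links whose target task directory no longer exists. A minimal sketch, assuming the index layout from the RFE (index root, then caseno directory, then link); prune_index is a hypothetical name. Links whose target cannot be checked (for example, permission denied) are kept rather than deleted.

```python
import os
from pathlib import Path


def _target_missing(link: Path) -> bool:
    """True only if the symlink's target definitely no longer exists."""
    try:
        os.stat(link)  # follows the symlink
    except FileNotFoundError:
        return True
    except OSError:
        return False   # e.g. permission denied: cannot tell, so keep it
    return False


def prune_index(index_root: Path) -> int:
    """Delete dangling task links and empty case dirs; return links removed."""
    removed = 0
    if not index_root.is_dir():
        return removed
    for case_dir in list(index_root.iterdir()):
        if case_dir.is_symlink() or not case_dir.is_dir():
            continue
        for link in list(case_dir.iterdir()):
            if link.is_symlink() and _target_missing(link):
                link.unlink()
                removed += 1
        try:
            case_dir.rmdir()  # only succeeds once the directory is empty
        except OSError:
            pass
    return removed
```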

Comment 1 Mateusz Guzik 2014-03-30 12:57:30 UTC
I think a case-based scheme should replace the current one; if that is not feasible, we are better off with post-analysis and pre-cleanup hooks that would execute scripts.

In case hooks get implemented, retrace should provide config options roughly as follows:
HookUser = blah
HookGroup = blah
HookPostAnalysis = /some/dir
HookPreCleanup = /some/other/dir

Then it would execute each file in respective directories as HookUser:HookGroup, in alphabetical order.

Comment 2 Mateusz Guzik 2014-03-30 12:59:59 UTC
Scripts would receive 2 arguments: task id and task directory

(yes, I know the task id can currently be obtained from the path, but that does not look like a good thing to depend on)
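A rough Python sketch of the hook runner proposed in comments 1 and 2: each executable file in a hook directory is run in name order with the task id and task directory as its two arguments, optionally as a given user and group. HookUser, HookGroup and the hook directories are only proposals at this point, and run_hooks and its timeout are hypothetical. "Alphabetical" is taken as plain sorted() order on file names, so 10-foo runs before 9-bar. Hook failures are recorded rather than raised, so a misbehaving hook cannot abort the caller. Switching user/group relies on the user= and group= arguments of subprocess (Python 3.9+) and requires root.

```python
import os
import subprocess
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def run_hooks(hook_dir: str, task_id: str, task_dir: str,
              user: Optional[str] = None, group: Optional[str] = None,
              timeout: float = 300.0) -> List[Tuple[str, int]]:
    """Run each executable in hook_dir, sorted by name.

    Returns (hook name, exit status) pairs; -1 marks a hook that timed
    out or could not be started.
    """
    results: List[Tuple[str, int]] = []
    base = Path(hook_dir)
    if not base.is_dir():
        return results
    # Only ask subprocess to switch identity when HookUser/HookGroup are set.
    extra: Dict[str, str] = {}
    if user is not None:
        extra["user"] = user
    if group is not None:
        extra["group"] = group
    for hook in sorted(base.iterdir(), key=lambda p: p.name):
        if not hook.is_file() or not os.access(hook, os.X_OK):
            continue  # skip READMEs, disabled (non-executable) hooks, etc.
        try:
            proc = subprocess.run([str(hook), task_id, task_dir],
                                  timeout=timeout, **extra)
            results.append((hook.name, proc.returncode))
        except (OSError, subprocess.TimeoutExpired):
            results.append((hook.name, -1))
    return results
```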

Comment 3 Dave Wysochanski 2014-03-30 19:21:08 UTC
Created attachment 880444
Simple bash script to create an index of retrace tasks by caseno and finished_time

Fixed 'find' in script - otherwise we get a lot of these, I think from the 32-bit vmcores.
find: `/cores/retrace/tasks/610785335-kernel': Permission denied
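The attached script isn't reproduced here, so its exact 'find' fix is not shown. One way a Python rewrite could avoid the same errors is to look only at directories named by a plain numeric task id and never descend into siblings such as 610785335-kernel. This is a sketch under that naming assumption (taken from the task names in the listing above); iter_cased_tasks is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Iterator, Tuple


def iter_cased_tasks(tasks_root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (task_dir, caseno) for each task directory with a caseno file."""
    with os.scandir(tasks_root) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        # Assumption: real task dirs are named by a numeric task id only.
        # Siblings such as 610785335-kernel are skipped, never entered.
        if not entry.name.isdecimal():
            continue
        if not entry.is_dir(follow_symlinks=False):
            continue
        try:
            caseno = (Path(entry.path) / "caseno").read_text().strip()
        except (FileNotFoundError, PermissionError):
            continue  # no caseno, or task dir not readable by this user
        if caseno:
            yield Path(entry.path), caseno
```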

Comment 4 Dave Wysochanski 2014-03-30 19:27:05 UTC
(In reply to Mateusz Guzik from comment #1)
> I think case-based scheme should replace current one, if that is not
> feasible we are better off with post-analysis and pre-cleanup hooks that
> would execute scripts.
> 
Not sure what you mean. Are you saying we should replace the taskid with something like '/cores/retrace/tasks/<caseno>/...'?

If so I'm not sure about it since it's a bigger change.  Right now 'caseno' is optional too.

> In case hooks get implemented, retrace should provide config options +/-
> follows:
> HookUser = blah
> HookGroup = blah
> HookPostAnalysis = /some/dir
> HookPreCleanup = /some/other/dir
> 
> Then it would execute each file in respective directories as
> HookUser:HookGroup, in alphabetical order.

I think the hooks should be a different bz - feel free to file it or I will do it soon.  I agree we need these if there is not already something in retrace.  Need to think about it to make sure existing retrace functionality is not destabilized by some post-analysis hook.

Comment 5 Mateusz Guzik 2014-03-30 21:31:52 UTC
(In reply to Dave Wysochanski from comment #4)
> (In reply to Mateusz Guzik from comment #1)
> > I think case-based scheme should replace current one, if that is not
> > feasible we are better off with post-analysis and pre-cleanup hooks that
> > would execute scripts.
> > 
> Not sure what you mean - are you saying replace the taskid with something
> like '/cores/retrace/tasks/<caseno>/..."?
> 
> If so I'm not sure about it since it's a bigger change.  Right now 'caseno'
> is optional too.
> 

Well, not requiring a case number when submitting a vmcore reduces the usefulness of this feature in the first place. So the case number should become mandatory (and maybe we can work something out to make it easier to submit it).

With case-related IDs, the current ID scheme becomes redundant, so it makes sense to get rid of it.

However, if that cannot be done for now, it will be easier for everyone to have custom scripts doing the job.

On the other hand we would need hooks even if current id scheme goes away, so I agree separate bz is needed - https://bugzilla.redhat.com/show_bug.cgi?id=1082376

That said, this bz should be: replace current scheme with case number based one, if that is not going to work we should close it.

Comment 6 Dave Wysochanski 2014-03-30 23:42:40 UTC
(In reply to Mateusz Guzik from comment #5)
> (In reply to Dave Wysochanski from comment #4)
> > (In reply to Mateusz Guzik from comment #1)
> > > I think case-based scheme should replace current one, if that is not
> > > feasible we are better off with post-analysis and pre-cleanup hooks that
> > > would execute scripts.
> > > 
> > Not sure what you mean - are you saying replace the taskid with something
> > like '/cores/retrace/tasks/<caseno>/..."?
> > 
> > If so I'm not sure about it since it's a bigger change.  Right now 'caseno'
> > is optional too.
> > 
> 
> Well, not requiring case number when submitting a vmcore reduces usefulness
> of this feature in the first place. So case number should become mandatory
> (and maybe we can work something out to make it easier to submit it).
> 
True, but assuming people working on cases are submitting the cores, there should be incentives to submit the case # with the core.  That said, if we can make it easier we should do it.  I already filed one bz in this regard and it's been addressed (https://bugzilla.redhat.com/show_bug.cgi?id=999643)

Maybe someone needs to work out a portal app to take a vmcore name and submit from the case; then the case # would be there.  After all, the customer just gives us the tarball name of the vmcore they uploaded, and that is how it is identified through case updates.  It then makes sense to have some way to submit to retrace-server directly from the case, and then most vmcores should have a caseno file.



> With case-related numbers ids current id scheme become redundant, thus it
> makes sense to get rid of it.
> 
Not really.  Why would you think everyone using retrace-server for vmcores would have a case#?  Remember people can use it outside of case work.  That is just our primary use case.  However, I think we're agreeing on the general idea that task#'s are just numbers, and they don't add value to a user of retrace - they kinda get in the way a bit but I understand why we have them.

On a slightly related topic, it may be nice to be able to load crash just based on the vmcore tarball, or perhaps we need tools to look it up (kinda what this bz is).  I do agree the 'taskid' is just something people need to remember which is really unrelated to the important information which identifies a vmcore (case#, vmcore tarball name, machine name, timestamp of the crash, customer name, bugzilla if associated, etc).  So it may be we need to think about a better interface into retrace rather than using the taskid, which is really the overarching reason for this bz I think.


> However, if that cannot be done for now, it will be easier for everyone to
> have custom scripts doing the job.
> 
Ok so you'd have this index outside of the retrace-server package then?
The more I thought about it, the less I liked the idea of custom scripts, but yeah, it can be done.  We'd need to have a cleanup script too, which is not too hard.


> On the other hand we would need hooks even if current id scheme goes away,
> so I agree separate bz is needed -
> https://bugzilla.redhat.com/show_bug.cgi?id=1082376
> 
> That said, this bz should be: replace current scheme with case number based
> one, if that is not going to work we should close it.

I don't think I agree with the above but I'll have to take a crack at a patch and/or see what Michal thinks.  Maybe you're right and we should just close the bz and add scripts in some other package, which would go on top of existing retrace-server taskid based design to make the user interface a bit more "workflow friendly"

Comment 7 Dave Wysochanski 2014-03-30 23:57:39 UTC
(In reply to Mateusz Guzik from comment #5)
> That said, this bz should be: replace current scheme with case number based
> one, if that is not going to work we should close it.

One thing we always need is some disambiguation for cores with the same characteristics (same case#, submitted at same time, tarball same name, etc).  If we ditch taskid, we'd have to come up with some scheme for uniqueness problem (such as time of submission to retrace-server if it can be guaranteed unique).  So I think the existing 'taskid' type approach is fine for uniqueness problem and the base of retrace-server.

Comment 8 Lachlan McIlroy 2014-03-31 00:46:46 UTC
(In reply to Dave Wysochanski from comment #6)
> (In reply to Mateusz Guzik from comment #5)
> > (In reply to Dave Wysochanski from comment #4)
> > > (In reply to Mateusz Guzik from comment #1)
> > > > I think case-based scheme should replace current one, if that is not
> > > > feasible we are better off with post-analysis and pre-cleanup hooks that
> > > > would execute scripts.
> > > > 
> > > Not sure what you mean - are you saying replace the taskid with something
> > > like '/cores/retrace/tasks/<caseno>/..."?
> > > 
> > > If so I'm not sure about it since it's a bigger change.  Right now 'caseno'
> > > is optional too.
> > > 
> > 
> > Well, not requiring case number when submitting a vmcore reduces usefulness
> > of this feature in the first place. So case number should become mandatory
> > (and maybe we can work something out to make it easier to submit it).
> > 
> True but assuming people working on cases are submitting the cores there
> should be incentives to submit the case # with the core.  That said, if we
> can make it easier we should do it.  I already filed one bz in this regard
> and it's been addressed (https://bugzilla.redhat.com/show_bug.cgi?id=999643)
> 
> Maybe someone needs to work out a portal app to take a vmcore name and
> submit from the case, then the case # would be there.  After all, the
> customer just gives us the tarball name of the vmcore they uploaded and that
> is how it is identified through case updates.  It then makes sense to have
> some way directly from the case to submit to retrace-server and we should
> have most vmcores with caseno file.

This sounds like a good idea.  The case can then maintain its own list of vmcores submitted (and possibly include details from the 'sys' command so we can distinguish different cores).

Comment 9 Mateusz Guzik 2014-03-31 01:16:09 UTC
> > Well, not requiring case number when submitting a vmcore reduces usefulness
> > of this feature in the first place. So case number should become mandatory
> > (and maybe we can work something out to make it easier to submit it).
> > 
> True but assuming people working on cases are submitting the cores there
> should be incentives to submit the case # with the core.  That said, if we
> can make it easier we should do it.  I already filed one bz in this regard
> and it's been addressed (https://bugzilla.redhat.com/show_bug.cgi?id=999643)
> 
> Maybe someone needs to work out a portal app to take a vmcore name and
> submit from the case, then the case # would be there.  After all, the
> customer just gives us the tarball name of the vmcore they uploaded and that
> is how it is identified through case updates.  It then makes sense to have
> some way directly from the case to submit to retrace-server and we should
> have most vmcores with caseno file.
> 

That sounds reasonable for the short term.

I think the real problem is that customers upload files to an anonymous server. Instead, an account should be created along with the case in sfdc and you would get credentials in return. Files uploaded with such credentials end up in case-specific directory thus we get stuff automagically classified and can have it acted upon in similar manner.
 
> > With case-related numbers ids current id scheme become redundant, thus it
> > makes sense to get rid of it.
> > 
> Not really.  Why would you think everyone using retrace-server for vmcores
> would have a case#?  Remember people can use it outside of case work.  That
> is just our primary use case.  However, I think we're agreeing on the
> general idea that task#'s are just numbers, and they don't add value to a
> user of retrace - they kinda get in the way a bit but I understand why we
> have them.
> 

I don't really see why you would use retrace for something other than case work, but such usage only strengthens my argument. :)

As was noted, the problem is that we just have a directory with semi-random numbers known as task ids. Nobody knows what a core is from the directory contents unless it happens to have a caseno file.

That said, some outside-of-sfdc usage with no additional information would only contribute to the mess. Instead all cores should be "resolvable" to a case number, and if there is none we have to know who owns the core - so it needs some user-specific id, username or something else. I don't see why this could not be included in task id.

> On a slightly related topic, it may be nice to be able to load crash just
> based on the vmcore tarball, or perhaps we need tools to look it up (kinda
> what this bz is).  I do agree the 'taskid' is just something people need to
> remember which is really unrelated to the important information which
> identifies a vmcore (case#, vmcore tarball name, machine name timestamp of
> the crash, customer name, bugzilla if associated, etc).  So it may be we
> need to think about a better interface into retrace rather than using the
> taskid, which is really what is the overarching reason for this bz I think.
>

Using just the core name will lead to ambiguity (read: problems) in the long run; the case number is needed.

For instance, if per-case dirs are created we can suddenly get files with the same name from different cases (and the one you want has not been submitted yet, or maybe just got removed, while the other one is ready); you don't want to open a different core by accident, do you? :->

... and if you allow people to lookup the core based only on the name this will happen in the future.

But if we will have links/dirs with cases (and we will) this loses most of its value.

> 
> > However, if that cannot be done for now, it will be easier for everyone to
> > have custom scripts doing the job.
> > 
> Ok so you'd have this index outside of retrace-server package then?
> I didn't like the idea of custom scripts the more I thought about it but
> yeah it can be done.  We'd need to have cleanup script too which is not too
> hard.
> 

Not sure what you don't like. With custom scripts we have more flexibility to do various stuff.

> 
> > On the other hand we would need hooks even if current id scheme goes away,
> > so I agree separate bz is needed -
> > https://bugzilla.redhat.com/show_bug.cgi?id=1082376
> > 
> > That said, this bz should be: replace current scheme with case number based
> > one, if that is not going to work we should close it.
> 
> I don't think I agree with the above but I'll have to take a crack at a
> patch and/or see what Michal thinks.  Maybe you're right and we should just
> close the bz and add scripts in some other package, which would go on top of
> existing retrace-server taskid based design to make the user interface a bit
> more "workflow friendly"

Again, custom scripts = more flexibility. On the other hand replacing current scheme with case-based is cleaner and would have to happen in retrace.

Leaving the scheme as it is and implementing this feature in retrace gives us the worst of two worlds. :)

Comment 10 Mateusz Guzik 2014-03-31 01:22:34 UTC
(In reply to Dave Wysochanski from comment #7)
> (In reply to Mateusz Guzik from comment #5)
> > That said, this bz should be: replace current scheme with case number based
> > one, if that is not going to work we should close it.
> 
> One thing we always need is some disambiguation for cores with the same
> characteristics (same case#, submitted at same time, tarball same name,
> etc).  If we ditch taskid, we'd have to come up with some scheme for
> uniqueness problem (such as time of submission to retrace-server if it can
> be guaranteed unique).  So I think the existing 'taskid' type approach is
> fine for uniqueness problem and the base of retrace-server.

Uniqueness is not a problem. We can have: caseno/coreno, caseno-coreno or some other variant where coreno is a counter.
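The caseno/coreno variant could be allocated roughly as below. This is a sketch of the counter idea, not existing retrace-server behavior; allocate_task_dir is hypothetical, and caseno is assumed to have been validated by the caller. mkdir either creates the directory or fails because it already exists, so two simultaneous submissions for the same case cannot end up with the same coreno.

```python
from pathlib import Path


def allocate_task_dir(tasks_root: Path, caseno: str) -> Path:
    """Create and return tasks_root/<caseno>/<coreno>, coreno = 1, 2, 3, ..."""
    case_dir = tasks_root / caseno
    case_dir.mkdir(parents=True, exist_ok=True)
    # Start after the highest coreno already present for this case.
    used = [int(p.name) for p in case_dir.iterdir() if p.name.isdecimal()]
    coreno = max(used, default=0) + 1
    while True:
        candidate = case_dir / str(coreno)
        try:
            # mkdir is atomic: it fails if a concurrent submission took this number.
            candidate.mkdir()
            return candidate
        except FileExistsError:
            coreno += 1
```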

Comment 11 Dave Wysochanski 2014-04-12 11:43:05 UTC
(In reply to Mateusz Guzik from comment #10)
> (In reply to Dave Wysochanski from comment #7)
> > (In reply to Mateusz Guzik from comment #5)
> > > That said, this bz should be: replace current scheme with case number based
> > > one, if that is not going to work we should close it.
> > 
> > One thing we always need is some disambiguation for cores with the same
> > characteristics (same case#, submitted at same time, tarball same name,
> > etc).  If we ditch taskid, we'd have to come up with some scheme for
> > uniqueness problem (such as time of submission to retrace-server if it can
> > be guaranteed unique).  So I think the existing 'taskid' type approach is
> > fine for uniqueness problem and the base of retrace-server.
> 
> Uniqueness is not a problem. We can have: caseno/coreno, caseno-coreno or
> some other variant where coreno is a counter.

Yeah that's a good idea.  But again you're assuming we mandate the case#.  I'm not sure about this since not everyone that uses retrace has 'case numbers'.  For us it would be great though.

I am also concerned about instability should we go the route of moving away from the taskid as the basis.  Recently we've addressed a lot of problems but we've still got multiple issues in production.  It's one thing to add an index, but another thing to change the design.  We need Michal to weigh in on it, someone to create a patch, and probably some help with testing.  It doesn't seem like it would be too bad but we won't know until someone does a patch.

Comment 12 Mateusz Guzik 2014-04-12 12:39:44 UTC
I already noted that in #9:

"That said, some outside-of-sfdc usage with no additional information would only contribute to the mess. Instead all cores should be "resolvable" to a case number, and if there is none we have to know who owns the core - so it needs some user-specific id, username or something else. I don't see why this could not be included in task id."

Replace caseno with ident, where ident is either a case number or user login.
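A sketch of the ident rule: use the case number when one is given, otherwise the submitter's login. The login pattern below is an assumption (lower-case letters, digits, '_' and '-', not starting with a digit); requiring logins to start with a non-digit keeps them from ever colliding with a numeric case number. task_ident is a hypothetical name.

```python
import re
from typing import Optional

# Assumed formats: case numbers are all digits; logins start with a letter
# or underscore, so the two kinds of ident can never collide.
CASENO_RE = re.compile(r"[0-9]+")
LOGIN_RE = re.compile(r"[a-z_][a-z0-9_-]*")


def task_ident(caseno: Optional[str], login: str) -> str:
    """Return the case number if one was given, else the submitter's login."""
    caseno = (caseno or "").strip()
    if caseno:
        if not CASENO_RE.fullmatch(caseno):
            raise ValueError(f"malformed case number: {caseno!r}")
        return caseno
    if not LOGIN_RE.fullmatch(login):
        raise ValueError(f"unusable login: {login!r}")
    return login
```

Combined with the caseno/coreno counter from comment 10, a task path would then be <ident>/<coreno>.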

Comment 14 Ben Cotton 2020-11-05 16:50:12 UTC
This message is a reminder that EPEL 6 is nearing its end of life. Fedora will stop maintaining and issuing updates for EPEL 6 on 2020-11-30. It is our policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of 'el6'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later EPEL version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before EPEL 6 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Comment 15 Ben Cotton 2020-11-30 15:13:31 UTC
EPEL el6 changed to end-of-life (EOL) status on 2020-11-30. EPEL el6 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
EPEL please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

