Bug 1428040

Summary: RFE: De-duplication of vmcores - add "AllowDuplicates" option in /etc/retrace-server.conf or de-dedup cleanup job
Product: Fedora EPEL
Reporter: Dave Wysochanski <dwysocha>
Component: retrace-server
Assignee: Dave Wysochanski <dwysocha>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: low
Version: epel7
CC: jakub, michal.toman, mmarusak
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-04-10 14:23:03 UTC
Bug Depends On: 1128972

Description Dave Wysochanski 2017-03-01 18:38:19 UTC
Description of problem:
In our production system, a vmcore can, for various reasons, get submitted multiple times.  If the task is a non-local file (e.g. submitted via FTP), this takes up unnecessary space and leads to confusion.  This happens in particular with damaged or very large vmcores; in the latter case, duplicates can quickly consume an enormous amount of space.  In my experience, many of the out-of-space conditions we have seen in production are the result of very large, and often duplicate, vmcores.

This RFE requests some sort of function to avoid duplicates.  Detecting a duplicate requires knowing two submissions contain the same file, so this depends on the implementation of the md5sum bug https://bugzilla.redhat.com/show_bug.cgi?id=1128972 or something similar.

There are at least two approaches to solve the problem of duplicates:
Option 1: Implement logic to check for duplicates at submission time, and automatically reject any task that would end up being a duplicate.  This option has the advantage that no cleanup is needed later, and the user submitting the vmcore can be redirected to the existing vmcore rather than waiting for the new task.

Option 2: Implement a cleanup job that periodically scans for duplicates and removes them.  The downside of this approach is that multiple engineers may not be aware a duplicate exists, so one of them may have their task removed.  If any files exist in the 'misc' directory, this could mean lost work for an engineer.  Even so, this seems feasible, and the cleanup job could email the submitter of the task, or anyone who owns a file in the 'misc' directory.
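The cleanup job in Option 2 could be sketched roughly as follows (Python, since retrace-server is written in Python).  The task directory layout and paths here are assumptions for illustration only, not the server's actual layout:

```python
import hashlib
import os
from collections import defaultdict

# Hypothetical task spool root -- the real location depends on the
# retrace-server configuration.
TASKS_DIR = "/var/spool/retrace-server"

def md5_of(path, chunk=1 << 20):
    """Checksum the file in chunks so large vmcores do not exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicate_tasks(root=TASKS_DIR):
    """Group task vmcores by checksum; every task after the first in a
    group is a candidate for removal (after notifying its owner)."""
    by_sum = defaultdict(list)
    for task_id in sorted(os.listdir(root)):
        vmcore = os.path.join(root, task_id, "crash", "vmcore")
        if os.path.isfile(vmcore):
            by_sum[md5_of(vmcore)].append(task_id)
    return {s: ids for s, ids in by_sum.items() if len(ids) > 1}
```

A real job would additionally check the 'misc' directory of each duplicate and email the affected engineers before deleting anything, as described above.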

I would lean toward Option 1, since it is better to avoid creating duplicates in the first place.

Whichever implementation is chosen, I think we need:

1. An "AllowDuplicates" option in /etc/retrace-server.conf.  This should default to 'N' so de-duplication is active by default.

2. Some database to track tasks by md5sum.
a) When a task is submitted, its md5sum is computed and looked up in the database.  On a hit, the task id stored in the database can be given to the user in the notification for the deleted task: "Task XYZ contained file foo.tar, which is a duplicate of task ABC.  Task XYZ has been cancelled and deleted.  Please use task ABC for your analysis."  On a miss, a record is added to the database.
b) When a task is deleted, its record in the database will need to be deleted.
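The database in point 2 could be a small SQLite table.  This is only a sketch; the table name, schema, and function names are invented here for illustration:

```python
import sqlite3

SCHEMA = ("CREATE TABLE IF NOT EXISTS vmcore_md5 "
          "(md5 TEXT PRIMARY KEY, task_id TEXT)")

def open_db(path):
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db

def register_or_reject(db, task_id, md5sum):
    """Point 2a: return None if the checksum is new (and record it),
    or the existing task id if this submission is a duplicate."""
    row = db.execute("SELECT task_id FROM vmcore_md5 WHERE md5 = ?",
                     (md5sum,)).fetchone()
    if row:
        return row[0]  # duplicate: point the user at this task
    db.execute("INSERT INTO vmcore_md5 (md5, task_id) VALUES (?, ?)",
               (md5sum, task_id))
    db.commit()
    return None

def forget_task(db, task_id):
    """Point 2b: drop the record when its task is deleted."""
    db.execute("DELETE FROM vmcore_md5 WHERE task_id = ?", (task_id,))
    db.commit()
```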

Version-Release number of selected component (if applicable):
retrace-server-1.16-1.el6.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Submit a vmcore to retrace-server
2. Submit the same vmcore a second time

Actual results:
Duplicate task which unnecessarily takes up disk space

Expected results:
No duplicate task

Additional info:
There may be instances when duplicates are desired.  The one case I can think of off the top of my head is testing: we submit a series of files / vmcores to ensure proper function of a new retrace-server build.  In that use case, though, we can easily set AllowDuplicates=Yes.
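Reading the proposed AllowDuplicates option might look like the following sketch, assuming an ini-style /etc/retrace-server.conf parsed with Python's configparser; the [retrace] section name is an assumption, not the file's actual layout:

```python
import configparser

def allow_duplicates(path="/etc/retrace-server.conf"):
    """Read the proposed AllowDuplicates flag; default to off so
    de-duplication is active unless explicitly disabled."""
    cfg = configparser.ConfigParser()
    cfg.read(path)  # silently skips a missing file
    # The [retrace] section name is assumed for illustration.
    return cfg.getboolean("retrace", "AllowDuplicates", fallback=False)
```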

This is a lower-priority item, but it comes up often enough to warrant a look, and the implementation is probably not very hard.

Comment 1 Dave Wysochanski 2018-04-10 14:23:03 UTC

*** This bug has been marked as a duplicate of bug 1558903 ***