Bug 1558903 - RFE: Either prevent or cleanup tasks that have duplicate md5sums
Summary: RFE: Either prevent or cleanup tasks that have duplicate md5sums
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: retrace-server
Version: epel7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Dave Wysochanski
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1428040
Depends On:
Blocks:
 
Reported: 2018-03-21 09:34 UTC by Dave Wysochanski
Modified: 2018-12-21 15:45 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-12-21 15:45:28 UTC
Type: Bug
Embargoed:


Attachments
standalone dedup based on md5 and hardlinking (2.56 KB, text/x-python)
2018-03-22 21:36 UTC, Dave Wysochanski
no flags Details
v2 - standalone dedup based on md5 and hardlinking (2.84 KB, text/x-python)
2018-03-23 02:34 UTC, Dave Wysochanski
no flags Details
v3 - standalone dedup based on md5 and hardlinking (2.80 KB, text/x-python)
2018-03-23 02:51 UTC, Dave Wysochanski
no flags Details
v4 - standalone dedup based on md5 and hardlinking (2.19 KB, text/x-python)
2018-03-24 01:04 UTC, Dave Wysochanski
no flags Details
v2: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup; patch is currently under test (6.23 KB, text/plain)
2018-03-24 18:12 UTC, Dave Wysochanski
no flags Details
v2: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup; patch is currently under test (6.23 KB, patch)
2018-03-24 18:15 UTC, Dave Wysochanski
no flags Details | Diff
v3: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup (6.46 KB, patch)
2018-03-26 01:25 UTC, Dave Wysochanski
no flags Details | Diff
v5: standalone md5dedup based on cec0f16ef992650f4a459dc83e1bf24dfe1ab940 (2.67 KB, text/x-python)
2018-03-28 03:55 UTC, Dave Wysochanski
no flags Details
v5: add dedup_vmcore to RetraceWorker and call from retrace-server-cleanup (7.16 KB, patch)
2018-03-28 03:56 UTC, Dave Wysochanski
no flags Details | Diff

Description Dave Wysochanski 2018-03-21 09:34:05 UTC
Description of problem:
Users often inadvertently submit the same vmcore twice, which wastes a lot of space.  On our production system, duplicated tasks (same md5sum) accounted for over 10% of the total cores space, even after checking inode numbers to exclude copies that are really the same (hardlinked) file.

There are a couple of ways we could approach this:

1. Add a section to retrace-server-cleanup to remove duplicate tasks.  One downside is that users are likely to be caught off guard, since they would not have known about the previous duplicate, so this is not ideal.

2. Build up a small database of existing tasks and store the taskid,md5sum pairs as records.  Then, when a new task is submitted, we finish processing, but at the very end we check whether it is a duplicate.  If it is, we mark the task as "failed" with a special message in the log and in the email notification (if an email was given) indicating that the task completed but was failed because it is a duplicate, pointing at the duplicate task.  Why complete processing and only fail the task at the very end?  There could be some reason why a duplicate was desired, so this gives the user the chance to run "retrace-server-interact <taskid> set-success" to keep the task.  By failing the task, duplicates are cleaned up faster via DeleteFailedTaskAfter.  A rough sketch of this end-of-processing check follows below.
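
A minimal sketch of what that end-of-processing check could look like.  The known_md5sums store and the mark_failed/notify_user helpers are hypothetical, not existing retrace-server APIs; only get_md5sum()/get_taskid() are real RetraceTask methods (they are used in the script below).

def check_duplicate_at_completion(task, known_md5sums, mark_failed, notify_user):
    # known_md5sums maps md5 -> taskid of the first task seen with that md5sum
    md5 = task.get_md5sum().split()[0]

    original = known_md5sums.get(md5)
    if original is not None and original != task.get_taskid():
        msg = ("Task %d completed but was marked failed because it is a "
               "duplicate of task %d (same md5sum %s); use "
               "'retrace-server-interact %d set-success' to keep it."
               % (task.get_taskid(), original, md5, task.get_taskid()))
        mark_failed(task, msg)    # failed tasks age out via DeleteFailedTaskAfter
        notify_user(task, msg)    # only if an email address was given
        return True

    known_md5sums[md5] = task.get_taskid()
    return False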



Version-Release number of selected component (if applicable):
retrace-server-1.18

How reproducible:
every time

Steps to Reproduce:
1. submit a vmcore that is FTP-based
2. submit the same vmcore again

Actual results:
multiple tasks are created, without any indication that the second one is a duplicate

Expected results:
some handling of duplicates, either through notification or automatic deletion

Additional info:
Here is a simple Python program that implements most of the logic for #1: it flags tasks that have duplicate md5sums but different inode numbers.  Running it on our production system (a 44TB filesystem) gives this last line of output:
Total space savings if duplicates (md5sums equal) removed: 4099 GB


$ cat retrace-server-dedup-md5 
#!/usr/bin/python2
import os
import sys
from retrace import *

CONFIG = config.Config()

def get_md5_tasks():
    tasks = []

    for filename in os.listdir(CONFIG["SaveDir"]):
        if len(filename) != CONFIG["TaskIdLength"]:
            continue

        try:
            task = RetraceTask(int(filename))
        except Exception:
            # not a valid task directory, skip it
            continue

        if task.has_md5sum():
            tasks.append(task)

    return tasks

if __name__ == "__main__":

    md5_tasks = {}
    total_savings = 0
    for task in get_md5_tasks():
        md5 = task.get_md5sum().split()[0]
        #print("Processing task %d with md5sum %s" % (task.get_taskid(), md5))
        if md5 in md5_tasks:
            print("Task was %d is a md5 duplicate of %d - md5sum of %s" % (task.get_taskid(), md5_tasks[md5], md5))
            try:
                s1 = os.stat(CONFIG["SaveDir"] + "/" + str(task.get_taskid()) + "/crash/vmcore")
                s2 = os.stat(CONFIG["SaveDir"] + "/" + str(md5_tasks[md5]) + "/crash/vmcore")
            except OSError:
                print("One or both tasks did not have a vmcore")
                continue
            if s1.st_ino != s2.st_ino:
                print("Task %d has a different inode than task %d - likely space savings of %d MB if duplicate removed" % (task.get_taskid(), md5_tasks[md5], s1.st_size / 1024 / 1024))
                total_savings += s1.st_size
        else:
            md5_tasks[md5] = task.get_taskid()
    
    print("Total space savings if duplicates (md5sums equal) removed: %d GB" % (total_savings / 1024 / 1024 / 1024))

Comment 1 Dave Wysochanski 2018-03-21 09:39:18 UTC
Example of using the above program to identify duplicates that take up a lot of space, for manual removal:

$ ./retrace-server-dedup-md5 > md5-dups.txt
$ grep savings md5-dups.txt | sort -nk15 | tail
Task 861293200 has a different inode than task 839030564 - likely space savings of 65426 MB if duplicate removed
Task 463009519 has a different inode than task 580782110 - likely space savings of 65545 MB if duplicate removed
Task 652684601 has a different inode than task 349788403 - likely space savings of 70041 MB if duplicate removed
Task 142728031 has a different inode than task 410198281 - likely space savings of 81638 MB if duplicate removed
Task 935635865 has a different inode than task 102742728 - likely space savings of 132521 MB if duplicate removed
Task 293055325 has a different inode than task 846757068 - likely space savings of 133341 MB if duplicate removed
Task 438428309 has a different inode than task 171131314 - likely space savings of 134949 MB if duplicate removed
Task 364044050 has a different inode than task 590971774 - likely space savings of 135085 MB if duplicate removed
Task 573310972 has a different inode than task 590971774 - likely space savings of 135085 MB if duplicate removed
Task 158800038 has a different inode than task 315736364 - likely space savings of 136482 MB if duplicate removed

Comment 2 Miroslav Suchý 2018-03-21 11:05:45 UTC
Isn't it better to handle it using hardlink or some deduplicating FS (e.g., vdo)?

Comment 3 Dave Wysochanski 2018-03-21 11:20:06 UTC
(In reply to Miroslav Suchý from comment #2)
> Isn't it better to handle it using hardlink or some deduplicating FS (e.g.,
> vdo)?

We already hardlink when possible.  What is happening is that multiple users submit either an FTP task, or a vmcore that lives on, say, NFS, so a genuine copy gets made.

Regarding VDO / a generic de-dup FS, that is not free.  IMO this is not a complicated RFE, and it makes no sense to allow copies like this.

Comment 4 Dave Wysochanski 2018-03-21 11:25:19 UTC
(In reply to Dave Wysochanski from comment #0)
> 
> 2. Build up a small database of existing tasks and store the taskid,md5sum
> pairs as records.  Then when a new task is submitted, we finish processing
> but at the very end we check to see if it is a duplicate.  If it is a
> duplicate we mark the task as "failed" with a special message in the log and
> in email notification (if a email was given) indicating the task has
> completed but was failed due to it being a duplicate and pointing at the
> duplicate task.  Why complete processing, and only at the very end fail the
> task?  There could be some reason why a duplicate was desired, so it will
> give the user the possibility to use "retrace-server-interact <taskid>
> set-success" to keep the task.  By failing the task duplicates are cleaned
> up faster using DeleteFailedTaskAfter.
> 

There is already a 'stats.db' file which stores statistics for each task.  We could easily store the md5sum in there, though it is a running database, so searches for a new task would get slower over time.  Still, this is probably the simplest way to implement this.

Once you insert the md5sum, search stats.db for any matches, and if there is a match, mark the task failed.
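
A rough sketch of that insert-and-search step against stats.db, assuming a sqlite table with taskid and md5sum columns (the table and column names are guesses; the real stats.db schema may differ):

import sqlite3

def record_and_find_duplicate(statsdb_path, taskid, md5):
    # Insert the new task's md5sum, then look for an older task with the same sum.
    con = sqlite3.connect(statsdb_path)
    try:
        with con:
            con.execute("INSERT INTO tasks (taskid, md5sum) VALUES (?, ?)",
                        (taskid, md5))
        cur = con.execute("SELECT taskid FROM tasks WHERE md5sum = ? AND taskid != ?",
                          (md5, taskid))
        row = cur.fetchone()
        return row[0] if row else None    # taskid of a matching task, or None
    finally:
        con.close()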

Another reason this RFE makes sense is because we already have the md5 for another reason (to validate what the customer uploaded to us is the correct file).  In many other applications there is no md5sum so that is where generic dedup is better IMO.

Comment 5 Dave Wysochanski 2018-03-21 11:28:49 UTC
(In reply to Dave Wysochanski from comment #4)
> 
> Once you insert the md5sum, search the stats.db for any matches and if there
> is a match mark the task failed.
> 
If you find a match, also do a 'stat' on the matched task to ensure it still exists.  Since the stats.db file stores all tasks, it is possible the matched task no longer exists, in which case you would not want to fail the new task.
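
A small sketch of that extra check, following the CONFIG["SaveDir"]/<taskid>/crash/vmcore layout used in the script above (the config parameter stands in for however the caller gets at the configuration):

import os

def matched_task_still_exists(config, matched_taskid):
    # stats.db remembers every task ever seen, so the matched task may be gone;
    # only fail the new task if the matched vmcore is still on disk.
    vmcore = os.path.join(config["SaveDir"], str(matched_taskid), "crash", "vmcore")
    try:
        os.stat(vmcore)
    except OSError:
        return False
    return True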

Comment 6 Dave Wysochanski 2018-03-22 20:42:55 UTC
(In reply to Miroslav Suchý from comment #2)
> Isn't it better to handle it using hardlink or some deduplicating FS (e.g.,
> vdo)?

Hardlinks - actually Sterling had a similar idea: remove the duplicate file and hardlink it to the other file.  Maybe this is what you meant - hardlink but don't delete the task.  That way both tasks can still exist but we save the storage.  I may try this approach in the cleanup job, since deploying it would give us a fairly instant 4TB of savings.  See the sketch below.
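
A sketch of that hardlink idea (not the actual attached patch): replace the duplicate vmcore with a hardlink to the original so both tasks keep working but the data is stored once.  It assumes both vmcores live on the same filesystem, which os.link requires.

import os

def hardlink_duplicate(config, dup_taskid, orig_taskid):
    dup = os.path.join(config["SaveDir"], str(dup_taskid), "crash", "vmcore")
    orig = os.path.join(config["SaveDir"], str(orig_taskid), "crash", "vmcore")

    if os.stat(dup).st_ino == os.stat(orig).st_ino:
        return                       # already hardlinked, nothing to save

    tmp = dup + ".dedup-tmp"
    os.link(orig, tmp)               # create the new link first
    os.rename(tmp, dup)              # then atomically replace the duplicate copy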

Comment 7 Dave Wysochanski 2018-03-22 21:36:31 UTC
Created attachment 1411855 [details]
standalone dedup based on md5 and hardlinking

Comment 8 Dave Wysochanski 2018-03-23 02:34:40 UTC
Created attachment 1411914 [details]
v2 - standalone dedup based on md5 and hardlinking

Comment 9 Dave Wysochanski 2018-03-23 02:39:20 UTC
From what I can tell, v2 works but needs a bit more testing.  The next step is converting it into a patch for retrace-server-cleanup.

Comment 10 Dave Wysochanski 2018-03-23 02:51:28 UTC
Created attachment 1411930 [details]
v3 - standalone dedup based on md5 and hardlinking

Comment 11 Dave Wysochanski 2018-03-24 01:04:27 UTC
Created attachment 1412356 [details]
v4 - standalone dedup based on md5 and hardlinking

Comment 12 Dave Wysochanski 2018-03-24 18:12:17 UTC
Created attachment 1412577 [details]
v2: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup; patch is currently under test

Comment 13 Dave Wysochanski 2018-03-24 18:15:41 UTC
Created attachment 1412578 [details]
v2: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup; patch is currently under test

Comment 14 Dave Wysochanski 2018-03-26 01:24:10 UTC
https://github.com/abrt/retrace-server/pull/181

Comment 15 Dave Wysochanski 2018-03-26 01:25:14 UTC
Created attachment 1412896 [details]
v3: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup

Comment 16 Dave Wysochanski 2018-03-26 10:48:58 UTC
Testing of the latest patch found a couple of issues:
- get_md5_tasks: should not add tasks with md5sum file containing "Enabled"
- use of task.set_log can truncate an in-progress task's log
- use of task.set_log does not include the date/timestamp

The last two are probably related, since we should not be trying to de-duplicate an in-progress task.  get_md5_tasks should probably only add a task if (see the sketch after this list):
- md5sum file really contains an md5sum
- a task is completed
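
A sketch of that tightened filter; the "Enabled" placeholder string comes from the first issue above, and the finished-time check is a hypothetical way to detect a completed task (the real patch may test completion differently):

def task_is_dedup_candidate(task):
    if not task.has_md5sum():
        return False
    md5 = task.get_md5sum().split()[0]
    if md5 == "Enabled" or len(md5) != 32:     # md5sum file must hold a real md5
        return False
    return task.has_finished_time()            # hypothetical: skip in-progress tasks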

In testing I saw how these duplicates come about, and it looks unavoidable, so we need some deduplication.  Customers can submit the exact same vmcore but with different filenames.  We won't know it is the same until it is actually submitted, and by then it is too late - we have a duplicate.  I just observed this first hand with a couple of submissions that were a couple of weeks apart.

Comment 17 Dave Wysochanski 2018-03-26 18:26:30 UTC
https://github.com/abrt/retrace-server/pull/182

Comment 18 Dave Wysochanski 2018-03-28 03:55:01 UTC
Created attachment 1413961 [details]
v5: standalone md5dedup based on cec0f16ef992650f4a459dc83e1bf24dfe1ab940

Comment 19 Dave Wysochanski 2018-03-28 03:56:16 UTC
Created attachment 1413962 [details]
v5: add dedup_vmcore to RetraceWorker and call from retrace-server-cleanup

Comment 20 Dave Wysochanski 2018-03-28 04:10:09 UTC
I think for now doing the logic in the cleanup job is fine.  However, at some point we may want to shift the calling of dedup_vmcore from the cleanup job to task creation / processing time.  I think there are limited use cases for duplicate tasks (e.g. testing) but not for duplicate vmcores.  So if someone submits a duplicate, probably:
- make an email notification of the duplicate and the location
- call the dedup_vmcore function 
- have the status be the same as the duplicate (i.e. don't fail because it's a duplicate)

Implementing the dedup call at submit time is another level of work that I'm not sure I want to take on right now; it can be tracked in another bug.

Comment 21 Dave Wysochanski 2018-04-10 14:23:03 UTC
*** Bug 1428040 has been marked as a duplicate of this bug. ***

Comment 22 Dave Wysochanski 2018-06-11 14:26:47 UTC
The patch series to implement dedup on existing tasks (which is probably the best we can do currently, since we do not know the md5sum ahead of time) has been merged:
https://github.com/abrt/retrace-server/pull/182

I am going to open a separate bug to reject a task submission once we have an md5sum of the tarball fetchable from the remote system.  Due to how support works, it is often the case that a large vmcore is in progress and someone else re-submits the same one.  It is possible the files are different, but we usually cannot know just from the filename - we need a remote checksum before we reject it.

Comment 23 Dave Wysochanski 2018-12-21 15:45:28 UTC
$ git tag --contains ef1fb69
1.19.0

