Bug 1558903
Description
Dave Wysochanski
2018-03-21 09:34:05 UTC
Example of how to use the above program for simple identification of duplicates that take up a lot of space, for manual removal:

$ ./retrace-server-dedup-md5 > md5-dups.txt
$ grep savings md5-dups.txt | sort -nk15 | tail
Task 861293200 has a different inode than task 839030564 - likely space savings of 65426 MB if duplicate removed
Task 463009519 has a different inode than task 580782110 - likely space savings of 65545 MB if duplicate removed
Task 652684601 has a different inode than task 349788403 - likely space savings of 70041 MB if duplicate removed
Task 142728031 has a different inode than task 410198281 - likely space savings of 81638 MB if duplicate removed
Task 935635865 has a different inode than task 102742728 - likely space savings of 132521 MB if duplicate removed
Task 293055325 has a different inode than task 846757068 - likely space savings of 133341 MB if duplicate removed
Task 438428309 has a different inode than task 171131314 - likely space savings of 134949 MB if duplicate removed
Task 364044050 has a different inode than task 590971774 - likely space savings of 135085 MB if duplicate removed
Task 573310972 has a different inode than task 590971774 - likely space savings of 135085 MB if duplicate removed
Task 158800038 has a different inode than task 315736364 - likely space savings of 136482 MB if duplicate removed

Isn't it better to handle it using hardlink or some deduplicating FS (e.g., vdo)?

(In reply to Miroslav Suchý from comment #2)
> Isn't it better to handle it using hardlink or some deduplicating FS (e.g.,
> vdo)?

We already hardlink when possible. What is happening is that multiple users submit either an FTP task, or the vmcore is, say, on NFS, so there is a genuine copy. Regarding VDO or a generic de-dup FS: that is not free. IMO this is not a complicated RFE, and it makes no sense to allow copies like this.

(In reply to Dave Wysochanski from comment #0)
> 2. Build up a small database of existing tasks and store the taskid,md5sum
> pairs as records. Then when a new task is submitted, we finish processing
> but at the very end we check to see if it is a duplicate. If it is a
> duplicate we mark the task as "failed" with a special message in the log and
> in the email notification (if an email was given) indicating the task has
> completed but was failed due to it being a duplicate and pointing at the
> duplicate task. Why complete processing, and only at the very end fail the
> task? There could be some reason why a duplicate was desired, so it will
> give the user the possibility to use "retrace-server-interact <taskid>
> set-success" to keep the task. By failing the task, duplicates are cleaned
> up faster using DeleteFailedTaskAfter.

There is already a 'stats.db' file which stores statistics for each task. We could easily store the md5sum in there, though it is a running database, so the search for a new task would take longer. This is probably the simplest way to implement this, though. Once you insert the md5sum, search the stats.db for any matches, and if there is a match, mark the task failed.

Another reason this RFE makes sense is that we already have the md5 for another purpose (to validate that what the customer uploaded to us is the correct file). In many other applications there is no md5sum, so that is where generic dedup is better IMO.
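To illustrate the stats.db idea, a minimal sketch (the md5sums table and the find_md5_duplicates helper below are made up for the example; the real stats.db schema may differ):

import sqlite3

def find_md5_duplicates(db_path, new_task_id, new_md5):
    """Record the md5sum of a newly finished task and return the ids of
    any earlier tasks that recorded the same checksum."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    # Hypothetical table layout, not the actual stats.db schema.
    cur.execute("CREATE TABLE IF NOT EXISTS md5sums "
                "(taskid INTEGER PRIMARY KEY, md5sum TEXT)")
    cur.execute("SELECT taskid FROM md5sums "
                "WHERE md5sum = ? AND taskid != ?",
                (new_md5, new_task_id))
    matches = [row[0] for row in cur.fetchall()]
    cur.execute("INSERT OR REPLACE INTO md5sums VALUES (?, ?)",
                (new_task_id, new_md5))
    con.commit()
    con.close()
    return matches

If this returns any task ids, the new task would be marked failed with a log message (and email notification, if one was given) pointing at the matching task.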
(In reply to Dave Wysochanski from comment #4)
> Once you insert the md5sum, search the stats.db for any matches, and if
> there is a match, mark the task failed.

If you find a match, also do a 'stat' on the matched task to ensure it still exists. Since the stats.db file stores all tasks, it is possible that the matched task no longer exists, in which case you would not want to fail the new task.

(In reply to Miroslav Suchý from comment #2)
> Isn't it better to handle it using hardlink or some deduplicating FS (e.g.,
> vdo)?

hardlinks - actually Sterling had a similar idea: remove the duplicate file and hardlink to the other file. Maybe this is what you meant - hardlink but don't delete the task. That way both tasks can still exist but we save the storage. I may try this approach in the cleanup job, since if we could deploy it, it would give us a fairly instant 4 TB of savings.

Created attachment 1411855 [details]
standalone dedup based on md5 and hardlinking
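Roughly what the standalone script does, per the description above: group vmcores by md5sum, and when two tasks share a checksum but not an inode, swap the duplicate for a hardlink so both tasks keep working while the space is only used once. The sketch below is illustrative only (dedup_by_md5 and its input layout are not the actual script):

import os
from collections import defaultdict

def dedup_by_md5(vmcores):
    """vmcores maps taskid -> (md5sum, path to the task's vmcore).
    For tasks that share an md5sum but not an inode, replace the extra
    copies with hardlinks to the first one and return the bytes saved."""
    by_md5 = defaultdict(list)
    for taskid, (md5, path) in vmcores.items():
        by_md5[md5].append((taskid, path))

    saved = 0
    for entries in by_md5.values():
        if len(entries) < 2:
            continue
        _, keep_path = entries[0]
        keep_st = os.stat(keep_path)
        for taskid, path in entries[1:]:
            st = os.stat(path)
            if st.st_ino == keep_st.st_ino:
                continue                  # already hardlinked
            if st.st_dev != keep_st.st_dev:
                continue                  # hardlinks require one filesystem
            tmp = path + ".dedup"
            os.link(keep_path, tmp)       # create the hardlink next to the dup
            os.rename(tmp, path)          # atomically replace the duplicate
            saved += st.st_size
    return saved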
Created attachment 1411914 [details]
v2 - standalone dedup based on md5 and hardlinking
From what I can tell, v2 works but needs a bit more testing. The next step is converting it into a patch for retrace-server-cleanup.

Created attachment 1411930 [details]
v3 - standalone dedup based on md5 and hardlinking
Created attachment 1412356 [details]
v4 - standalone dedup based on md5 and hardlinking
Created attachment 1412577 [details]
v2: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup; patch is currently under test
Created attachment 1412578 [details]
v2: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup; patch is currently under test
Created attachment 1412896 [details]
v3: Add dedup_vmcore method to RetraceTask and call from retrace-server-cleanup
Testing on the latest patch found a couple of issues:
- get_md5_tasks: should not add tasks whose md5sum file contains "Enabled"
- use of task.set_log can truncate an in-progress task's log
- use of task.set_log does not include the date/timestamp

The last two are probably related, since we should not be trying to de-duplicate an in-progress task in the first place. Probably modify get_md5_tasks to only add a task if:
- the md5sum file really contains an md5sum
- the task is completed
(a sketch of that stricter filter is below, after the v5 attachments)

In testing I saw what seems to be happening with these duplicates, and it looks unavoidable, so we need some deduplication: customers can submit the exact same vmcore but with different filenames. We won't know it is the same file until we actually submit it, and by then it is too late - we have a duplicate. I just observed this first-hand with a couple of submissions that were a couple of weeks apart.

Created attachment 1413961 [details]
v5: standalone md5dedup based on cec0f16ef992650f4a459dc83e1bf24dfe1ab940
Created attachment 1413962 [details]
v5: add dedup_vmcore to RetraceWorker and call from retrace-server-cleanup
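The stricter get_md5_tasks filter would amount to something like this sketch (the task dictionaries and field names are placeholders, not the actual retrace-server API):

import re

# A real md5sum file starts with 32 hex characters, e.g.
# "d41d8cd98f00b204e9800998ecf8427e  crash/vmcore".
MD5_LINE = re.compile(r"^[0-9a-f]{32}(\s|$)", re.IGNORECASE)

def get_md5_tasks(tasks):
    """Return only tasks that are safe to de-duplicate: completed tasks
    whose md5sum file holds a real checksum (not e.g. "Enabled")."""
    result = []
    for task in tasks:
        md5 = task.get("md5sum", "").strip()
        if not MD5_LINE.match(md5):
            continue        # skips "Enabled" and other non-checksum contents
        if not task.get("finished"):
            continue        # never touch an in-progress task's files or log
        result.append(task)
    return result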
I think for now doing the logic in the cleanup job is fine. However, at some point we may want to shift the call to dedup_vmcore from the cleanup job to task creation / processing time. I think there are limited use cases for duplicate tasks (i.e. testing), but not for duplicate vmcores. So if someone submits a duplicate, probably:
- send an email notification with the duplicate and its location
- call the dedup_vmcore function
- give the task the same status as the duplicate (i.e. don't fail it just because it's a duplicate)

Implementing the dedup call at submit time is another level of work I'm not sure I want to take on right now; it can be another bug.

*** Bug 1428040 has been marked as a duplicate of this bug. ***

The patch series to implement dedup on existing tasks (which is probably the best we can do currently, since we do not know the md5sum ahead of time) has been merged:
https://github.com/abrt/retrace-server/pull/182

I am going to open a separate bug to reject a task submission once we have an md5sum of the tarball fetchable via the remote system. Due to how support works, it is often the case that a large vmcore is in progress and someone else re-submits the same one. It is possible the files are different, but we usually cannot know that just from the filename - we need a remote checksum before we reject it.

$ git tag --contains ef1fb69
1.19.0
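For that follow-up bug, the submit-time check could look roughly like this: the submitter (or the remote host holding the tarball) computes the checksum up front, and the server refuses to create the task if it already knows that checksum. A minimal sketch, reusing the hypothetical md5sums table from the earlier example:

import hashlib
import sqlite3

def md5_of_file(path, chunk_size=1 << 20):
    """Checksum the tarball on the submitting side, before any upload."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_existing_task(db_path, md5sum):
    """On the server, return the id of an existing or in-progress task
    that already recorded this checksum, or None."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("SELECT taskid FROM md5sums WHERE md5sum = ?", (md5sum,))
    row = cur.fetchone()
    con.close()
    return row[0] if row else None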