Bug 963492
| Summary: | duplicate identical logs for a result | | |
|---|---|---|---|
| Product: | [Retired] Beaker | Reporter: | Dan Callaghan <dcallagh> |
| Component: | scheduler | Assignee: | Dan Callaghan <dcallagh> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 0.12 | CC: | dcallagh, dowang, ebaak, huiwang, mgrigull, mjia, qwan, rjoost, tools-bugs |
| Target Milestone: | 24.0 | Keywords: | Patch, Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-02-21 18:48:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dan Callaghan
2013-05-15 22:19:35 UTC
This happened again.

```
mysql> select id, path, filename, start_time from log_recipe_task_result where recipe_task_result_id = 93340454;
+-----------+------+--------------------------------------------+---------------------+
| id        | path | filename                                   | start_time          |
+-----------+------+--------------------------------------------+---------------------+
| 227840943 | /    | test_log-gpl_mod.ko.signed-EMS_ON.log      | 2014-01-22 10:40:10 |
| 227840961 | /    | test_log-gpl_mod.ko.signed-EMS_ON.log      | 2014-01-22 10:45:16 |
| 227840966 | /    | console_dmesg-gpl_mod.ko.signed-EMS_ON.log | 2014-01-22 10:45:23 |
| 227840978 | /    | avc_log-gpl_mod.ko.signed-EMS_ON.log       | 2014-01-22 10:45:30 |
+-----------+------+--------------------------------------------+---------------------+
4 rows in set (0.00 sec)
```

cleared on prod instance.

*** Bug 1368868 has been marked as a duplicate of this bug. ***

At least two more in our production Beaker instance: R:483436 and R:839972. We should get those corrected.

I would like to have another go at figuring out how we get into this state, since lazy_create is supposed to prevent it. But we should probably also mitigate the problem in recipe.files() by detecting duplicate named logs and only returning one copy, to avoid confusing beaker-transfer.
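The mitigation proposed for recipe.files() amounts to a first-occurrence-wins filter. A minimal sketch follows; `dedupe_logs` and the dict-shaped log entries are hypothetical stand-ins for illustration, not Beaker's actual code or ORM objects:

```python
import logging

log = logging.getLogger(__name__)

def dedupe_logs(entries):
    """Drop duplicate (path, filename) pairs, keeping the first
    occurrence and warning about each duplicate skipped.

    `entries` stands in for the per-result log rows that a method like
    recipe.files() would iterate over; each is assumed to be a dict
    with 'path' and 'filename' keys (a simplification).
    """
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["path"], entry["filename"])
        if key in seen:
            # Warn so the duplicate rows remain visible to operators,
            # but return only one copy so beaker-transfer is not confused.
            log.warning("Duplicate log %s/%s for the same result, "
                        "returning only one copy",
                        entry["path"], entry["filename"])
            continue
        seen.add(key)
        unique.append(entry)
    return unique

logs = [
    {"path": "/", "filename": "test_log-Setup.log"},
    {"path": "/", "filename": "avc_log-Setup.log"},
    {"path": "/", "filename": "avc_log-Setup.log"},  # duplicate row
]
print([e["filename"] for e in dedupe_logs(logs)])
# ['test_log-Setup.log', 'avc_log-Setup.log']
```

Keeping the first occurrence matches the observed data, where the duplicate rows share the same path and filename and differ only in id and start_time.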
```
public=> select log_recipe_task.id, log_recipe_task.path, log_recipe_task.filename, log_recipe_task.start_time from Beaker.log_recipe_task where recipe_task_id = 5612659;
    id    | path  |       filename       |      start_time
----------+-------+----------------------+-----------------------
 55075167 | debug | .task_beah_raw       | 2012-05-03 17:55:04.0
 55075168 | debug | .task_beah_raw       | 2012-05-03 17:55:04.0
 55075262 | /     | journal.xml          | 2012-05-03 17:55:50.0
 55075269 | /     | TESTOUT.log          | 2012-05-03 17:55:53.0
 55075551 | debug | task_beah_unexpected | 2012-05-03 17:58:08.0
 55081674 | /     | messages             | 2012-05-03 18:55:43.0
 55081676 | /     | current.log          | 2012-05-03 18:55:46.0
 55081687 | debug | task_output_stdout   | 2012-05-03 18:55:49.0
 55081692 | debug | task_log             | 2012-05-03 18:55:52.0
(9 rows)

public=> select id, path, filename, start_time from log_recipe_task_result where recipe_task_result_id = 61068584;
    id     | path |      filename      |      start_time
-----------+------+--------------------+-----------------------
 172814572 | /    | test_log-Setup.log | 2013-04-04 14:42:00.0
 172814573 | /    | avc_log-Setup.log  | 2013-04-04 14:45:03.0
 172814816 | /    | avc_log-Setup.log  | 2013-04-04 14:47:01.0
(3 rows)
```

Even though the implementation of lazy_create is supposed to prevent this situation, and we can't figure out how it could possibly happen, we should probably mitigate it in recipe.files() by removing duplicate log files (with an accompanying warning) so at least it doesn't cause beaker-transfer to get stuck if it hits this again in the future.

We've only seen this issue on rows from 2014 or earlier... so I could also write a migration to find and clean up any existing duplicates, under the assumption that the recent versions of lazy_create do indeed work properly and this can't happen on any new rows. Not sure if that's a valid assumption though.

It looks like writing a migration to delete any existing duplicates will be out of the question... I have a straightforward GROUP BY ..
HAVING COUNT > 1 query running now against log_recipe (a prod dump on our devel db server). It's still running after 2.5 hours, has produced a 5 GB tmp table, and I have no idea how much progress it's made. The other log_* tables are far bigger. And actually deleting the duplicates would need a DELETE with the LEFT JOIN matching-on-NULL trick, which is certain to be even *slower* still. So mitigating in recipe.files() seems like the best choice right now. If I leave the GROUP BY ... HAVING COUNT > 1 queries running long enough, we should at least be able to figure out whether there are any other possible duplicates lurking as well.

http://gerrit.beaker-project.org/5159 filter out duplicate logs in recipe.files()

The GROUP BY ... HAVING COUNT queries came back overnight.

```
select recipe_id, path, filename, count(*) c
from log_recipe
group by recipe_id, path, filename
having c > 1;
244 rows in set (1 min 56.59 sec)
```

They are all console.log; the most recent is R:358427, created 2011-12-14 21:09:13.

```
select recipe_task_id, path, filename, count(*) c
from log_recipe_task
group by recipe_task_id, path, filename
having c > 1;
20 rows in set (2 hours 40 min 21.58 sec)
```

The most recent is debug/.task_beah_raw on T:5612659, created 2012-05-03 17:55:04 (that's the one in comment 5).

```
select recipe_task_result_id, path, filename, count(*) c
from log_recipe_task_result
group by recipe_task_result_id, path, filename
having c > 1;
14 rows in set (12 hours 49 min 9.05 sec)
```

The most recent is avc_log-Setup.log on TR:61068759, created 2013-04-04 14:51:05.

We switched to the most recent (current) implementation of lazy_create, using conditional INSERTs, in d1a31ded9b2b23d35151dc78abda0fb909bba4c5, which was in Beaker 0.15, released to production in December 2013 (and then rolled back and re-released with 0.15.3 on 5 February 2014). Since we haven't seen any duplicates created after that date, it might be reasonable to assume that it's properly fixed and only old rows remain.
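Both techniques discussed above -- finding duplicates with GROUP BY ... HAVING COUNT(*) > 1, and the conditional-INSERT shape that lazy_create uses to avoid creating them -- can be demonstrated on a toy table. This is an illustrative sqlite3 sketch, not Beaker's real schema or its MySQL production database:

```python
import sqlite3

# Toy stand-in for Beaker's log_recipe_task_result table
# (illustrative columns only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE log_recipe_task_result (
        id INTEGER PRIMARY KEY,
        recipe_task_result_id INTEGER,
        path TEXT,
        filename TEXT
    )
""")
conn.executemany(
    "INSERT INTO log_recipe_task_result VALUES (?, ?, ?, ?)",
    [
        (1, 61068584, "/", "test_log-Setup.log"),
        (2, 61068584, "/", "avc_log-Setup.log"),
        (3, 61068584, "/", "avc_log-Setup.log"),  # the duplicate row
        (4, 61068585, "/", "journal.xml"),
    ],
)

# Same shape as the duplicate-finding queries in the bug: group on the
# (parent id, path, filename) triple and keep groups with more than one row.
dupes = conn.execute("""
    SELECT recipe_task_result_id, path, filename, COUNT(*) AS c
    FROM log_recipe_task_result
    GROUP BY recipe_task_result_id, path, filename
    HAVING c > 1
""").fetchall()
print(dupes)  # [(61068584, '/', 'avc_log-Setup.log', 2)]

# A conditional INSERT in the spirit of the lazy_create fix: the row is
# inserted only if no matching row exists, so re-running it cannot
# create a duplicate.
conn.execute("""
    INSERT INTO log_recipe_task_result (recipe_task_result_id, path, filename)
    SELECT 61068585, '/', 'journal.xml'
    WHERE NOT EXISTS (
        SELECT 1 FROM log_recipe_task_result
        WHERE recipe_task_result_id = 61068585
          AND path = '/' AND filename = 'journal.xml'
    )
""")
total = conn.execute(
    "SELECT COUNT(*) FROM log_recipe_task_result").fetchone()[0]
print(total)  # still 4 -- the conditional INSERT was a no-op
```

On a real MySQL deployment the conditional INSERT would still need to run inside appropriate transaction/locking semantics to be race-free; this sketch only shows the query shape.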
I will have a go at crafting some DELETE statements we can use in a migration, although the run time of the log_recipe_task_result query above (12 hours 49 minutes) doesn't give me much hope that I can write a migration which runs in any reasonable time.

I crafted a DELETE using the LEFT JOIN / NOT NULL trick:

```
MySQL [beaker_dcallagh]> delete log_recipe from log_recipe
    left join log_recipe x
        on log_recipe.recipe_id = x.recipe_id
        and log_recipe.path = x.path
        and log_recipe.filename = x.filename
        and log_recipe.id > x.id
    where x.id is not null;
Query OK, 244 rows affected (3 min 24.01 sec)
```

That makes it almost twice as slow as the GROUP BY ... HAVING COUNT versions shown above. Assuming it scales linearly, the DELETE for log_recipe_task_result would take about 22.5 hours. And prod is slower than this machine too. :-(

There's also no way to clean up these duplicate rows in an online migration script in batches, because:

* you can't apply LIMIT to DELETE ... FROM ... JOIN queries;
* it can't be written as a DELETE with a subquery, since the subquery would reference the table being DELETEd, which MySQL disallows; and
* the SELECT ... GROUP BY ... HAVING COUNT > 1 query given above to find the duplicates for deletion takes just as long to run with LIMIT 1 applied as it does with no limit.

So I guess we just go with the mitigation in recipe.files() and then consider this fixed.

There is no known way to reproduce the original problem here, since I believe it was fixed by the lazy_create() changes in 2014. However, it is possible to manually insert a duplicate log into the database. The expected result in that case is that recipe.files() returns each filename only once, rather than multiple times.

Beaker 24.0 has been released.
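For reference, the intended semantics of the cleanup DELETE (keep the lowest id in each (recipe_id, path, filename) group, remove the rest) can be sketched on a toy table. Note the hedge: SQLite happily accepts the self-referencing subquery form shown here, which is exactly what MySQL disallowed in the discussion above, so this demonstrates the semantics only, not a query that would have worked on the production MySQL database:

```python
import sqlite3

# Toy stand-in for the log_recipe table (illustrative columns only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE log_recipe (
        id INTEGER PRIMARY KEY,
        recipe_id INTEGER,
        path TEXT,
        filename TEXT
    )
""")
conn.executemany(
    "INSERT INTO log_recipe VALUES (?, ?, ?, ?)",
    [
        (1, 358427, "/", "console.log"),
        (2, 358427, "/", "console.log"),  # duplicate with a higher id
        (3, 358428, "/", "console.log"),  # different recipe, not a duplicate
    ],
)

# Delete every row whose id is not the minimum of its
# (recipe_id, path, filename) group.
conn.execute("""
    DELETE FROM log_recipe
    WHERE id NOT IN (
        SELECT MIN(id) FROM log_recipe
        GROUP BY recipe_id, path, filename
    )
""")
remaining = [row[0] for row in
             conn.execute("SELECT id FROM log_recipe ORDER BY id")]
print(remaining)  # [1, 3]
```

This keeps the earliest row per group, matching the `log_recipe.id > x.id` condition in the MySQL multi-table DELETE quoted above.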