Bug 963492

Summary: duplicate identical logs for a result
Product: [Retired] Beaker Reporter: Dan Callaghan <dcallagh>
Component: schedulerAssignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 0.12CC: dcallagh, dowang, ebaak, huiwang, mgrigull, mjia, qwan, rjoost, tools-bugs
Target Milestone: 24.0Keywords: Patch, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-02-21 18:48:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Callaghan 2013-05-15 22:19:35 UTC
I found two recipe-task-results with duplicate logs (having exactly the same path and filename):

mysql> select id, path, filename, start_time from log_recipe_task_result where recipe_task_result_id = 63139789;
+-----------+------+--------------------+---------------------+
| id        | path | filename           | start_time          |
+-----------+------+--------------------+---------------------+
| 176434288 | /    | test_log-Setup.log | 2013-04-29 18:51:21 |
| 176434293 | /    | avc_log-Setup.log  | 2013-04-29 18:56:08 |
| 176434304 | /    | avc_log-Setup.log  | 2013-04-29 18:57:06 |
+-----------+------+--------------------+---------------------+
3 rows in set (0.00 sec)

mysql> select id, path, filename, start_time from log_recipe_task_result where recipe_task_result_id = 63139788;
+-----------+------+----------------------------------------+---------------------+
| id        | path | filename                               | start_time          |
+-----------+------+----------------------------------------+---------------------+
| 176434287 | /    | test_log-Testing-ppc-binary-inputs.log | 2013-04-29 18:51:21 |
| 176434289 | /    | avc_log-Testing-ppc-binary-inputs.log  | 2013-04-29 18:51:23 |
| 176434291 | /    | avc_log-Testing-ppc-binary-inputs.log  | 2013-04-29 18:56:05 |
+-----------+------+----------------------------------------+---------------------+
3 rows in set (0.00 sec)

This causes problems for beaker-transfer when it tries to hard-link the same file twice:

2013-05-14 04:18:59 [ERROR   ] { 1642} bkr.labcontroller.proxy:317 unable to hardlink /var/www/beaker/logs/results/63139+/63139788////avc_log-Testing-ppc-binary-inputs.log to /var/www/beaker/logs/tmpQ4jyWj/2013/04/4104/410470/863758/12198118/63139788///avc_log-Testing-ppc-binary-inputs.log, [Errno 17] File exists
2013-05-14 04:19:00 [ERROR   ] { 1642} bkr.labcontroller.proxy:317 unable to hardlink /var/www/beaker/logs/results/63139+/63139789////avc_log-Setup.log to /var/www/beaker/logs/tmpexc6z3/2013/04/4105/410576/863964/12205260/63139789///avc_log-Setup.log, [Errno 17] File exists

Given that beaker-transfer has been hit by a similar issue in the past (bug 865265) it might be worth fixing beaker-transfer to deal with this more gracefully.

The actual cause of the duplicate rows is less clear. We don't have a UNIQUE constraint on those columns because they are TEXT and MySQL doesn't support that, as mentioned in the docstring for Log.lazy_create:

    def lazy_create(cls, path=None, **kwargs):
        """
        Unlike the "real" lazy_create above, we can't rely on unique
        constraints here because 'path' and 'filename' are TEXT columns and
        MySQL can't index those. :-(

        So we just use a query-then-insert approach. There is a race window
        between querying and inserting, but it's about the best we can do.
        """

The strange thing is that the timestamps on the duplicate logs are quite far apart in both cases, so it seems unlikely to be the race window mentioned in that docstring. The problem may lie elsewhere.

Comment 1 Dan Callaghan 2014-06-10 07:02:03 UTC
This happened again.

mysql> select id, path, filename, start_time from log_recipe_task_result where recipe_task_result_id = 93340454;
+-----------+------+--------------------------------------------+---------------------+
| id        | path | filename                                   | start_time          |
+-----------+------+--------------------------------------------+---------------------+
| 227840943 | /    | test_log-gpl_mod.ko.signed-EMS_ON.log      | 2014-01-22 10:40:10 |
| 227840961 | /    | test_log-gpl_mod.ko.signed-EMS_ON.log      | 2014-01-22 10:45:16 |
| 227840966 | /    | console_dmesg-gpl_mod.ko.signed-EMS_ON.log | 2014-01-22 10:45:23 |
| 227840978 | /    | avc_log-gpl_mod.ko.signed-EMS_ON.log       | 2014-01-22 10:45:30 |
+-----------+------+--------------------------------------------+---------------------+
4 rows in set (0.00 sec)

Comment 2 Sage Grigull 2014-06-11 04:51:51 UTC
cleared on prod instance.

Comment 3 Dan Callaghan 2016-08-23 02:24:50 UTC
*** Bug 1368868 has been marked as a duplicate of this bug. ***

Comment 4 Dan Callaghan 2016-08-23 03:33:46 UTC
At least two more in our production Beaker instance: R:483436 and R:839972. We should get those ones corrected. I would like to have another go at figuring out how we get into this state, since lazy_create is supposed to prevent it.

But we should probably also mitigate the problem in recipe.files() by detecting the duplicate named logs and only returning one copy, to avoid confusing beaker-transfer.

Comment 5 Dan Callaghan 2016-08-25 00:57:29 UTC
public=> select log_recipe_task.id, log_recipe_task.path, log_recipe_task.filename, log_recipe_task.start_time from Beaker.log_recipe_task where recipe_task_id = 5612659;
    id    | path  |       filename       |      start_time       
----------+-------+----------------------+-----------------------
 55075167 | debug | .task_beah_raw       | 2012-05-03 17:55:04.0
 55075168 | debug | .task_beah_raw       | 2012-05-03 17:55:04.0
 55075262 | /     | journal.xml          | 2012-05-03 17:55:50.0
 55075269 | /     | TESTOUT.log          | 2012-05-03 17:55:53.0
 55075551 | debug | task_beah_unexpected | 2012-05-03 17:58:08.0
 55081674 | /     | messages             | 2012-05-03 18:55:43.0
 55081676 | /     | current.log          | 2012-05-03 18:55:46.0
 55081687 | debug | task_output_stdout   | 2012-05-03 18:55:49.0
 55081692 | debug | task_log             | 2012-05-03 18:55:52.0
(9 rows)

public=> select id, path, filename, start_time from log_recipe_task_result where recipe_task_result_id = 61068584;
    id     | path |      filename      |      start_time       
-----------+------+--------------------+-----------------------
 172814572 | /    | test_log-Setup.log | 2013-04-04 14:42:00.0
 172814573 | /    | avc_log-Setup.log  | 2013-04-04 14:45:03.0
 172814816 | /    | avc_log-Setup.log  | 2013-04-04 14:47:01.0
(3 rows)

Comment 7 Dan Callaghan 2016-08-25 03:14:12 UTC
Even though the implementation of lazy_create is supposed to prevent this situation, and we can't figure out how it could possibly happen, we should probably mitigate it in recipe.files() by removing duplicate log files (with an accompanying warning) so at least it doesn't cause beaker-transfer to get stuck if it hits it again in future.

Comment 8 Dan Callaghan 2016-08-25 03:23:12 UTC
We've only seen this issue on rows from 2014 or earlier... so I could also write a migration to find and clean up any existing duplicates, under the assumption that the recent versions of lazy_create do indeed work properly and this can't happen on any new rows. Not sure if that's a valid assumption though.

Comment 9 Dan Callaghan 2016-08-25 07:06:31 UTC
It looks like writing a migration to delete any existing duplicates will be out of the question...

I have a straightforward GROUP BY.. HAVING COUNT > 1 query running now against log_recipe (prod dump on our devel db server). It's still running after 2.5 hours and produced a 5GB tmp table and I have no idea how much progress it's made. The other log_* tables are far bigger.

And to actually delete the duplicates would need a DELETE with the LEFT JOIN matching on NULL trick, which is certain to be even *slower* still.

So mitigating in recipe.files() seems like the best choice right now. If I leave the GROUP BY HAVING COUNT > 1 queries running long enough we should at least be able to figure out if there any other possible duplicates lurking as well.

Comment 10 Dan Callaghan 2016-08-25 07:13:20 UTC
http://gerrit.beaker-project.org/5159 filter out duplicate logs in recipe.files()

Comment 11 Dan Callaghan 2016-08-26 01:18:24 UTC
The GROUP BY HAVING COUNT queries came back overnight.

    select recipe_id, path, filename, count(*) c
    from log_recipe
    group by recipe_id, path, filename
    having c > 1;

    244 rows in set (1 min 56.59 sec)

They are all console.log, the most recent is R:358427 created 2011-12-14 21:09:13.

    select recipe_task_id, path, filename, count(*) c
    from log_recipe_task
    group by recipe_task_id, path, filename
    having c > 1;

    20 rows in set (2 hours 40 min 21.58 sec)

The most recent is debug/.task_beah_raw on T:5612659 created 2012-05-03 17:55:04 (that's the one in comment 5).

    select recipe_task_result_id, path, filename, count(*) c
    from log_recipe_task_result
    group by recipe_task_result_id, path, filename
    having c > 1;

    14 rows in set (12 hours 49 min 9.05 sec)

The most recent is avc_log-Setup.log on TR:61068759 created 2013-04-04 14:51:05.

We switched to the most recent (current) implementation of lazy_create, using conditional ISNERTs, in d1a31ded9b2b23d35151dc78abda0fb909bba4c5 which was in Beaker 0.15, released to production in December 2013 (and then rolled back and re-released with 0.15.3 on 5 February 2014).

Since we haven't seen any duplicates created after that date, it might be reasonable to assume that it's properly fixed and only old rows remain.

I will have a go at crafting some DELETE statements we can use in a migration, although the run time of the log_recipe_task_result query above (12 hours 49 minutes) doesn't give me much hope that I can write a migration which runs in any reasonable time.

Comment 12 Dan Callaghan 2016-08-26 01:42:21 UTC
I craft a DELETE using LEFT JOIN NOT NULL trick:

MySQL [beaker_dcallagh]> delete log_recipe from log_recipe left join log_recipe x on log_recipe.recipe_id = x.recipe_id and log_recipe.path = x.path and log_recipe.filename = x.filename and log_recipe.id > x.id where x.id is not null;                                      
Query OK, 244 rows affected (3 min 24.01 sec)

Which makes it almost twice as slow as the GROUP BY HAVING COUNT versions shown above. Assuming it's linear that means the DELETE for log_recipe_task_result would take 22.5 hours. And prod is slower than this machine too. :-(

Comment 13 Dan Callaghan 2016-08-26 02:41:31 UTC
There's also no way to clean up these duplicate rows in an online migration script in batches, because:

* you can't apply LIMIT to DELETE...FROM...JOIN queries;
* it can't be written as a DELETE with subquery, since the subquery would reference the table being DELETEd which MySQL disallows; and
* the SELECT...GROUP BY...HAVING COUNT > 1 query given above to find the duplicates for deletion takes just as long to run with LIMIT 1 applied, as it does with no limit.

Comment 14 Dan Callaghan 2016-08-26 04:16:53 UTC
So I guess we just go with the mitigation in recipe.files() and then consider this fixed.

Comment 16 Dan Callaghan 2016-09-06 00:59:29 UTC
There is no known way to reproduce the original problem here, since I believe it was fixed by the lazy_create() changes in 2014.

However it is possible to manually insert a duplicate log into the database. The expected result in that case is that recipe.files() does not return the same filename multiple times, only once.

Comment 21 Dan Callaghan 2017-02-21 18:48:25 UTC
Beaker 24.0 has been released.