Bug 903902

Summary: race condition in marking a system broken
Product: [Retired] Beaker
Reporter: Dan Callaghan <dcallagh>
Component: scheduler
Assignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: 0.11
CC: asaha, dcallagh, llim, qwan, rglasz, rmancy, xjia
Target Milestone: 0.13
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: Misc
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-06-25 06:27:26 UTC

Description Dan Callaghan 2013-01-25 02:57:10 UTC
I found this in the activity records for a system:

+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+
| id      | user_id | created             | type            | field_name | service   | action  | old_value | new_value |
+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+
| 6888810 |    1657 | 2012-12-20 12:21:56 | system_activity | Status     | Scheduler | Changed | Automated | Broken    |
| 6888812 |    1657 | 2012-12-20 12:21:56 | system_activity | Status     | Scheduler | Changed | Automated | Broken    |
+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+

I'm still not sure how this can happen, since beaker-provision is supposed to serialize the handling of power commands for a given system, and the failure reporting (which triggers mark_broken) is part of that. So this may indicate some kind of bug in beaker-provision.

In any case, the server should also guard against this by doing the update atomically.
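
Roughly the kind of guard I mean, as illustrative SQL only (the table and column names here are guesses, not the actual schema; the real code goes through the model layer):

-- Only flip the status if it is still Automated; a concurrent request
-- that has already marked the system Broken will match zero rows.
UPDATE system
SET status = 'Broken'
WHERE id = 265
  AND status = 'Automated';
-- The caller checks the affected row count: 1 means we won the race and
-- should record history and notify the owner, 0 means another request
-- already did it and we should do nothing.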

Aside from the obvious side-effects, which aren't too bad (duplicate entries in system history, duplicate breakage notifications sent to the owner), this also breaks the invariant of the system_status_durations table that there is always exactly one "open" row (with NULL finish_time) per system. That is a more insidious problem which can cause failures much later down the track, for example in reporting code.

+-------+-----------+-----------+---------------------+---------------------+
| id    | system_id | status    | start_time          | finish_time         |
+-------+-----------+-----------+---------------------+---------------------+
| 23592 |       265 | Broken    | 2012-12-20 12:21:56 | NULL                |
| 23593 |       265 | Broken    | 2012-12-20 12:21:56 | 2012-12-20 15:47:16 |
| 23596 |       265 | Automated | 2012-12-20 15:47:16 | NULL                |
+-------+-----------+-----------+---------------------+---------------------+
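
A query along these lines (I'm guessing at the exact table name from the dump above) will show any systems where the invariant has already been violated:

-- Systems with more than one "open" status duration row.
SELECT system_id, COUNT(*) AS open_rows
FROM system_status_duration
WHERE finish_time IS NULL
GROUP BY system_id
HAVING COUNT(*) > 1;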

Comment 5 Nick Coghlan 2013-04-23 01:53:33 UTC
Can this be closed in light of 951309?

Comment 6 Dan Callaghan 2013-04-23 01:59:19 UTC
(In reply to comment #5)

Bug 951309 was the cause, but I still think the server needs to protect the invariant that only one system_status_duration row has a NULL finish time. We can do it fairly easily, using the same pattern as we have been using for the other similar tables (conditional UPDATE with a row count check). It should have been done that way from the start.
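
To sketch what I mean by that pattern, in illustrative SQL (names are approximate; the real fix goes through the model layer in one transaction):

-- Close the single open row for this system; the finish_time IS NULL
-- condition makes the update safe under concurrency.
UPDATE system_status_duration
SET finish_time = NOW()
WHERE system_id = 265
  AND finish_time IS NULL;
-- If the affected row count is not exactly 1, either another transaction
-- got there first or the invariant was already broken, so roll back
-- instead of continuing.
INSERT INTO system_status_duration (system_id, status, start_time, finish_time)
VALUES (265, 'Broken', NOW(), NULL);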

Comment 8 Dan Callaghan 2013-05-01 06:07:26 UTC
On Gerrit: http://gerrit.beaker-project.org/1911

Comment 12 Amit Saha 2013-06-25 06:27:26 UTC
Beaker 0.13.1 has been released.