Bug 903902 - race condition in marking a system broken
Summary: race condition in marking a system broken
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 0.11
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: 0.13
Assignee: Dan Callaghan
QA Contact:
URL:
Whiteboard: Misc
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-01-25 02:57 UTC by Dan Callaghan
Modified: 2018-02-06 00:41 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-06-25 06:27:26 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 951309 0 unspecified CLOSED beaker-provision sometimes runs commands twice 2021-02-22 00:41:40 UTC

Internal Links: 951309

Description Dan Callaghan 2013-01-25 02:57:10 UTC
I found this in the activity records for a system:

+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+
| id      | user_id | created             | type            | field_name | service   | action  | old_value | new_value |
+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+
| 6888810 |    1657 | 2012-12-20 12:21:56 | system_activity | Status     | Scheduler | Changed | Automated | Broken    |
| 6888812 |    1657 | 2012-12-20 12:21:56 | system_activity | Status     | Scheduler | Changed | Automated | Broken    |
+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+

I'm still not sure how this can happen, since beaker-provision is supposed to serialize handling power commands for a given system, and the failure reporting (which triggers mark_broken) is part of that. So this may indicate some kind of bug in beaker-provision.

In any case, the server should also guard against this by doing the update atomically.

Aside from the obvious side-effects which aren't too bad (duplicate entries in system history, duplicate breakage notifications sent to the owner), this also breaks the invariant of the system_status_durations table there is always exactly one "open" row (with NULL finish_time) per system. That is a more insidious problem which can cause failures much later down the track, for example in reporting code.

+-------+-----------+-----------+---------------------+---------------------+
| id    | system_id | status    | start_time          | finish_time         |
+-------+-----------+-----------+---------------------+---------------------+
| 23592 |       265 | Broken    | 2012-12-20 12:21:56 | NULL                |
| 23593 |       265 | Broken    | 2012-12-20 12:21:56 | 2012-12-20 15:47:16 |
| 23596 |       265 | Automated | 2012-12-20 15:47:16 | NULL                |
+-------+-----------+-----------+---------------------+---------------------+

Comment 5 Nick Coghlan 2013-04-23 01:53:33 UTC
Can this be closed in light of 951309?

Comment 6 Dan Callaghan 2013-04-23 01:59:19 UTC
(In reply to comment #5)

Bug 951309 was the cause, but I still think the server needs to protect the invariant that only one system_status_duration row has a NULL finish time. We can do it fairly easily, using the same pattern as we have been using for the other similar tables (conditional UPDATE with a row count check). It should have been done that way from the start.

Comment 8 Dan Callaghan 2013-05-01 06:07:26 UTC
On Gerrit: http://gerrit.beaker-project.org/1911

Comment 12 Amit Saha 2013-06-25 06:27:26 UTC
Beaker 0.13.1 has been released.


Note You need to log in before you can comment on or make changes to this bug.