903902 – race condition in marking a system broken

Bug 903902 - race condition in marking a system broken

Summary: race condition in marking a system broken

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	scheduler
Sub Component:
Version:	0.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	0.13
Assignee:	Dan Callaghan
QA Contact:
Docs Contact:
URL:
Whiteboard:	Misc
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-01-25 02:57 UTC by Dan Callaghan
Modified:	2018-02-06 00:41 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-06-25 06:27:26 UTC
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	951309	0	unspecified	CLOSED	beaker-provision sometimes runs commands twice	2021-02-22 00:41:40 UTC

Internal Links: 951309

Description Dan Callaghan 2013-01-25 02:57:10 UTC

I found this in the activity records for a system:

+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+
| id      | user_id | created             | type            | field_name | service   | action  | old_value | new_value |
+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+
| 6888810 |    1657 | 2012-12-20 12:21:56 | system_activity | Status     | Scheduler | Changed | Automated | Broken    |
| 6888812 |    1657 | 2012-12-20 12:21:56 | system_activity | Status     | Scheduler | Changed | Automated | Broken    |
+---------+---------+---------------------+-----------------+------------+-----------+---------+-----------+-----------+

I'm still not sure how this can happen, since beaker-provision is supposed to serialize handling power commands for a given system, and the failure reporting (which triggers mark_broken) is part of that. So this may indicate some kind of bug in beaker-provision.

In any case, the server should also guard against this by doing the update atomically.

Aside from the obvious side-effects which aren't too bad (duplicate entries in system history, duplicate breakage notifications sent to the owner), this also breaks the invariant of the system_status_durations table there is always exactly one "open" row (with NULL finish_time) per system. That is a more insidious problem which can cause failures much later down the track, for example in reporting code.

+-------+-----------+-----------+---------------------+---------------------+
| id    | system_id | status    | start_time          | finish_time         |
+-------+-----------+-----------+---------------------+---------------------+
| 23592 |       265 | Broken    | 2012-12-20 12:21:56 | NULL                |
| 23593 |       265 | Broken    | 2012-12-20 12:21:56 | 2012-12-20 15:47:16 |
| 23596 |       265 | Automated | 2012-12-20 15:47:16 | NULL                |
+-------+-----------+-----------+---------------------+---------------------+

Comment 5 Nick Coghlan 2013-04-23 01:53:33 UTC

Can this be closed in light of 951309?

Comment 6 Dan Callaghan 2013-04-23 01:59:19 UTC

(In reply to comment #5)

Bug 951309 was the cause, but I still think the server needs to protect the invariant that only one system_status_duration row has a NULL finish time. We can do it fairly easily, using the same pattern as we have been using for the other similar tables (conditional UPDATE with a row count check). It should have been done that way from the start.

Comment 8 Dan Callaghan 2013-05-01 06:07:26 UTC

On Gerrit: http://gerrit.beaker-project.org/1911

Comment 12 Amit Saha 2013-06-25 06:27:26 UTC

Beaker 0.13.1 has been released.

Note You need to log in before you can comment on or make changes to this bug.