953543 – external watchdog never fires if system is stuck in panic loop

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	lab controller
Sub Component:
Version:	0.12
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	0.14.2
Assignee:	Raymond Mancy
QA Contact:	Amit Saha
Docs Contact:
URL:
Whiteboard:	Misc
Depends On:
Blocks:

Reported:	2013-04-18 12:19 UTC by Jeff Burke
Modified:	2018-02-06 00:41 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	954219
Environment:
Last Closed:	2013-11-07 01:46:36 UTC
Embargoed:

Comment 2 Nick Coghlan 2013-04-19 08:03:13 UTC

Those don't look like the same symptoms (one is a kernel panic, the other an issue with anaconda trying to ask a question in non-interactive mode).

In the latter case, we've seen it before (in https://bugzilla.redhat.com/show_bug.cgi?id=952661) and were of the opinion that it was picking up a bug in the software under test (from the storage log):

00:33:03,800 DEBUG storage: rhel_ibm-z10-27 size is 28160MB
00:33:03,801 DEBUG storage: vg rhel_ibm-z10-27 has 0MB free
00:33:03,802 DEBUG storage: rhel_ibm-z10-27 size is 28160MB
00:33:03,819 DEBUG storage: vg rhel_ibm-z10-27 has 0MB free

Or is the problem that the External Watchdog didn't fire properly in these cases, making it necessary to cancel them manually? I thought I had seen a previous issue about something similar, but can't seem to find anything like that now - I'll bring it up on beaker-dev-list.

Comment 3 Dan Callaghan 2013-04-22 00:35:29 UTC

I'm guessing the panic loop happens because we now extend the watchdog for 10 minutes after a panic, to allow for kdump or other post-panic activities. Once that has happened, we should (a) disable any further panic detection, and maybe (b) prevent any further watchdog extensions (although it's possible that some people's post-panic activities actually extend the watchdog themselves).

The ./start loop is a separate but similar issue. When Anaconda checks in during %pre, we extend the watchdog and record the ./start result. After that, Beaker should not accept any more %pre check-ins, since a repeat check-in just indicates that Anaconda bailed out and rebooted.

The latter problem has been known for a long time, we have discussed it before but I can't find an open bug for it. I will clone this one.
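
The ./start behaviour proposed above can be illustrated with a minimal sketch. It is not Beaker's actual code: the `Recipe` class, its `install_start` method, and the `watchdog_extensions` counter are hypothetical names used only to show the idea of accepting the first %pre check-in and rejecting every later one.

```python
class Recipe:
    """Hypothetical stand-in for a recipe's installation state (not Beaker's model)."""

    def __init__(self):
        self.install_started = False
        # Counts how often the watchdog would have been extended.
        self.watchdog_extensions = 0

    def install_start(self):
        """Handle a %pre check-in; accept only the first one."""
        if self.install_started:
            # A repeat check-in suggests Anaconda bailed out and rebooted,
            # so the watchdog is left alone and can eventually expire.
            return False
        self.install_started = True
        self.watchdog_extensions += 1  # stands in for extending the watchdog
        return True


recipe = Recipe()
first = recipe.install_start()   # accepted, watchdog extended once
second = recipe.install_start()  # rejected, watchdog untouched
```

With a guard like this, a machine that keeps rebooting into %pre can no longer push the watchdog back indefinitely.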

Comment 4 Dan Callaghan 2013-04-22 00:41:20 UTC

(In reply to comment #3)
> The latter problem has been known for a long time, we have discussed it
> before but I can't find an open bug for it. I will clone this one.

Cloned as bug 954219.

Comment 6 Raymond Mancy 2013-08-23 03:20:32 UTC

I think what makes the most sense here is simply to stop testing for further panic strings once a panic has already been detected.
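
A rough sketch of that approach follows. It is only an illustration under stated assumptions, not the patch itself: `PANIC_PATTERN`, `PanicDetector`, and `watch_console` are hypothetical names, the regex is a placeholder, and the simulation assumes one console line arrives every 60 seconds. The 600-second grace period is the 10-minute extension mentioned in comment 3.

```python
import re

# Placeholder pattern (assumption); the real panic regex may differ.
PANIC_PATTERN = re.compile(r"Kernel panic|Oops")
PANIC_GRACE_SECONDS = 600  # 10-minute extension described in comment 3


class PanicDetector:
    """Scans console lines and reports only the first panic it sees."""

    def __init__(self, pattern=PANIC_PATTERN):
        self.pattern = pattern
        self.fired = False

    def feed(self, line):
        """Return True only for the first line matching the panic pattern."""
        if self.fired:
            return False  # panic already detected: stop testing further lines
        if self.pattern.search(line):
            self.fired = True
            return True
        return False


def watch_console(lines, detector, tick=60):
    """Simulate one console line per `tick` seconds; return the watchdog expiry."""
    now = 0
    expiry = None
    for line in lines:
        if detector.feed(line):
            expiry = now + PANIC_GRACE_SECONDS
        now += tick
    return expiry


console = ["booting", "Kernel panic - not syncing", "rebooting", "Kernel panic - not syncing"]
expiry = watch_console(console, PanicDetector())
```

In this simulation the second panic line does not move the expiry time, so a system stuck in a panic loop still hits the watchdog after the single grace period.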

Comment 7 Raymond Mancy 2013-08-23 05:57:20 UTC

http://gerrit.beaker-project.org/#/c/2181/

Comment 10 Nick Coghlan 2013-10-03 02:27:11 UTC

Beaker 0.15 has been released.

Comment 11 Raymond Mancy 2013-10-23 01:56:58 UTC

This change has been nominated for backporting to the 0.14 branch, to be released as part of the next maintenance release, 0.14.2.

Comment 12 Nick Coghlan 2013-10-25 06:35:40 UTC

Adjusting target milestone to make the changes backported to 0.14.2 easier to identify. 0.15.0 has enough significant regressions that it shouldn't be used, so the change means that 0.15.1 can be effectively reidentified as the union of that tag and the 0.14.2 target milestone.

Comment 15 Nick Coghlan 2013-11-07 01:46:36 UTC

Closing as addressed in Beaker 0.14.2.
