Bug 572836 - [RFE] Collect crash dump and other information useful for analysis when test panics/stalls
Status: CLOSED CURRENTRELEASE
Product: Beaker
Classification: Community
Component: lab controller
Version: 0.5
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: 0.8.2
Target Release: ---
Assigned To: Bill Peck
Depends On:
Blocks: 604328
Reported: 2010-03-12 02:04 EST by Jun'ichi NOMURA
Modified: 2012-04-26 03:16 EDT
CC List: 6 users

Doc Type: Bug Fix
Last Closed: 2012-04-26 03:16:35 EDT


Attachments
Add script-callout feature on external watchdog timeout (2.11 KB, patch)
2011-08-01 23:04 EDT, Jun'ichi NOMURA
Example of watchdog_script (5.21 KB, text/plain)
2011-08-01 23:13 EDT, Jun'ichi NOMURA

Description Jun'ichi NOMURA 2010-03-12 02:04:11 EST
[RFE] Collect crash dump and other information useful for analysis
      when test panics/stalls

If the test machine stalls or panics, Beaker should be able to
collect information about the system for post-mortem analysis,
such as:
  - crash dump
  - crash dump summary
  - SysRq logs
  - syslogs

With legacy RHTS, I had a local patch to the watchdog script that
issues SysRq commands to collect some information and then triggers
a crash dump. There are also separate tests for setting up kdump and
checking the vmcore.

So this feature might be broken down into the following sub-features:
  - metadata showing how to access the remote dump server
  - utility test program to set up crash dump
  - utility test program to check the collected dump
  - utility test program to collect logs
  - watchdog feature to run host/distro-specific program on lab controller
  - interface for the watchdog script to obtain console/BMC information
Comment 1 Raymond Mancy 2010-11-05 16:41:23 EDT
Hi Jun'ichi,

The latest version of the beah harness has the following implemented https://bugzilla.redhat.com/show_bug.cgi?id=633258, although perhaps it's not quite what you are after.

We don't have it on our roadmap to implement this feature in the immediate future. Are you able to apply/implement your old patch onto the new watchdog?

Thanks
Comment 2 Jun'ichi NOMURA 2010-11-08 00:39:33 EST
(In reply to comment #1)
> The latest version of the beah harness has the following implemented
> https://bugzilla.redhat.com/show_bug.cgi?id=633258, although perhaps it's not
> quite what you are after.

I can't tell from the comments in BZ#633258.
But if the feature is limited to harness errors, as the phrase
"in case of harness errors" there suggests, it's not what I want.

I found "bkr workflow-simple" has an option "--dump".

$ bkr workflow-simple --help
...
  --dump                           Turn on ndnc/kdump. (which one depends on the family)

Isn't this something intended for the feature I described?


> We don't have it on our roadmap to implement this feature in the immediate
> future. Are you able to apply/implement your old patch onto the new watchdog?

The patch needs to be applied at the point where the watchdog calls
the lab controller to finish testing.
Where in the new code should I apply it?
Comment 3 Raymond Mancy 2010-11-08 09:14:11 EST
I'm embarrassed to say I didn't realise that option existed.
That should do something like what you want. However, you'll need to have specific tasks in your Beaker library for it to work. I'll have to take a look at them, because I don't think they will work as they are in an environment external to Red Hat.
Comment 4 Bill Peck 2011-02-22 14:01:18 EST
will review patch provided by Jun'ichi.
Comment 5 Jun'ichi NOMURA 2011-08-01 23:04:08 EDT
Created attachment 516239 [details]
Add script-callout feature on external watchdog timeout

When the external watchdog expires, the system may have stalled,
and collecting additional information is often useful.

This patch adds a feature to run a script in that case.
Typically, the script would trigger a crash dump on the system.
Since a crash dump can take a long time, the script can request
a watchdog extension by returning the value 2.
The patch is made against beaker 0.6.14-7.el5.

I left the script path and the extension length ('1800') hardcoded,
but configurable parameters might be better.
An example of the watchdog script is attached.
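
The callout described above can be sketched roughly as follows. This is not the attached patch, only a minimal illustration of the idea. The script path, the function name on_watchdog_expired, the choice to pass the hostname as the only argument, and the assumption that '1800' is a number of seconds are all hypothetical.

```python
import subprocess

# Hypothetical values for illustration only. The patch hardcodes a script
# path and an extension length of 1800 (assumed here to be seconds).
WATCHDOG_SCRIPT = "/usr/local/bin/watchdog_script"  # assumed path
EXTEND_SECONDS = 1800
EXTEND_RETURN_CODE = 2  # script returns 2 to say 'extend watchdog'


def on_watchdog_expired(hostname, script=WATCHDOG_SCRIPT, run=subprocess.call):
    """Run the callout script for a host whose external watchdog expired.

    Returns the number of seconds to extend the watchdog by; 0 means
    "do not extend, finish the recipe as usual".
    """
    try:
        # Assumption: the script receives the hostname as its only argument.
        rc = run([script, hostname])
    except OSError:
        # Script missing or not executable: behave as if there were no script.
        return 0
    return EXTEND_SECONDS if rc == EXTEND_RETURN_CODE else 0
```

The run parameter exists only so that the sketch can be exercised without a real script; a caller would normally leave it at its default.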
Comment 6 Jun'ichi NOMURA 2011-08-01 23:13:15 EDT
Created attachment 516241 [details]
Example of watchdog_script

This script does the following:
  - Try to find a serial console connection to the host and
    send SysRq commands to dump information to console.log
  - Try to trigger a crash dump by one of the following methods
      * send NMI to the host
          o use cobbler feature BZ#727394 if available
      * or send SysRq-c
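
The order of operations in the list above could be sketched as below. This is not the attached script, only an outline of its decision logic. The callables send_sysrq and send_nmi are hypothetical stand-ins for the console and NMI mechanisms, and the particular informational SysRq keys (m, t, w) are an assumption, since the list does not name them.

```python
# Assumption: which SysRq keys to send for information.
# m = memory info, t = task list, w = blocked tasks (standard Linux SysRq keys).
INFO_SYSRQ_KEYS = ("m", "t", "w")


def collect_and_dump(send_sysrq=None, send_nmi=None):
    """Dump information via SysRq, then trigger a crash dump.

    send_sysrq(key) and send_nmi() are hypothetical callables that return
    True on success. Pass None when the mechanism is not available (no
    serial console found, or no NMI support on the host).
    Returns the list of actions that succeeded, in order.
    """
    actions = []
    if send_sysrq is not None:
        for key in INFO_SYSRQ_KEYS:
            if send_sysrq(key):
                actions.append("sysrq-" + key)
    # Prefer NMI; fall back to SysRq-c when NMI is unavailable or fails.
    if send_nmi is not None and send_nmi():
        actions.append("nmi")
    elif send_sysrq is not None and send_sysrq("c"):
        actions.append("sysrq-c")
    return actions
```

For example, on a host with a serial console but no NMI support, collect_and_dump(send_sysrq=my_console_sender) would send the informational keys and then SysRq-c, where my_console_sender is whatever site-specific function writes to the console.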
Comment 7 Jun'ichi NOMURA 2011-08-01 23:21:36 EDT
Jobs 953 and 958 in our lab are sample results.
(953 is on a machine without IPMI support, where dump is triggered by SysRq-c.
 958 is on a machine with IPMI support, where dump is triggered by NMI via cobbler.)

"system-crash" is the test that emulates a system stall.
(It reports 'PASS' if the system is successfully rebooted after the stall.)
Comment 8 Bill Peck 2011-11-30 07:39:40 EST
This looks pretty good.  I'll work on getting this into 0.8.1 and I'll see about making a back port for 0.6.14 as well.

Couple of things I plan to change:

1 - watchdog script will be optional and full path to script will be specified in config file.
2 - watchdog script will return the number of seconds the watchdog should be extended by.
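
A rough sketch of how these two changes might look, not the actual implementation: the config key name WATCHDOG_SCRIPT, the function name seconds_to_extend, passing the hostname as an argument, and reading the number of seconds from the script's standard output are all assumptions made for illustration.

```python
import subprocess


def seconds_to_extend(config, hostname, run=subprocess.check_output):
    """Return how many seconds the watchdog should be extended by.

    config is a dict-like object. The key name WATCHDOG_SCRIPT is
    hypothetical. When it is unset, the script is simply skipped,
    which makes the callout optional.
    """
    script = config.get("WATCHDOG_SCRIPT")
    if not script:
        return 0
    try:
        # Assumption: the script prints the number of seconds on stdout.
        output = run([script, hostname])
        return max(0, int(output.strip()))
    except (OSError, subprocess.CalledProcessError, ValueError):
        # Missing script, non-zero exit, or unparsable output: no extension.
        return 0
```

Letting the script choose the extension removes the hardcoded length, so a script that expects a long crash dump can ask for more time than one that only collects logs.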

0.8.1 is scheduled to be released during the week of Dec 19th.  I can make the updated version of 0.6.14 at that time as well.
Comment 9 Bill Peck 2012-01-17 05:56:12 EST
keeping 0.8.1 for stability changes
Comment 10 Bill Peck 2012-03-26 14:40:56 EDT
pushed to gerrit for review.
