Bug 1328227

Summary: Improve handling of damaged / partial vmcores: add --zero_excluded and --minimal to 'crash_cmd' if certain crash failures occur
Product: [Fedora] Fedora EPEL
Reporter: Dave Wysochanski <dwysocha>
Component: retrace-server
Assignee: Dave Wysochanski <dwysocha>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: el6
CC: abrt-devel-list, jberan, michal.toman
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2018-12-21 15:43:30 UTC
Type: Bug

Description Dave Wysochanski 2016-04-18 19:00:09 UTC
Description of problem:
Some vmcores are damaged but can still be opened if crash is given additional options.  For example, we recently received a vmcore for which crash printed the following on load:

WARNING: /cores/retrace/tasks/776013158/crash/vmcore:
         This dumpfile is incomplete.  This may cause the crash session
         to fail entirely, may cause commands to fail, or may result in
         unpredictable runtime behavior.
   NOTE: This dumpfile may be analyzed with the --zero_excluded command
         line option, in which case any read requests from missing pages
         will return zero-filled memory.


I manually added the option via the following:
$ echo -n "crash --zero_excluded" > /cores/retrace/tasks/776013158/crash_cmd 

With that option in place, other commands such as 'bt' can be run against the vmcore.
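
The manual step above could be automated by checking the crash output for the NOTE it prints.  What follows is only a minimal sketch, not retrace-server code: the function name suggest_crash_cmd and the idea of feeding it the captured crash output are assumptions made for illustration.  The marker string is taken from the warning quoted above.

```python
# Hypothetical helper, not part of retrace-server.
# Marker text copied from the crash NOTE quoted in this report.
ZERO_EXCLUDED_HINT = "may be analyzed with the --zero_excluded"


def suggest_crash_cmd(crash_output: str, base_cmd: str = "crash") -> str:
    """Return the command line that should be stored in crash_cmd."""
    # Collapse line breaks and indentation so the marker matches even if
    # crash wraps the NOTE differently.
    flat = " ".join(crash_output.split())
    already_set = "--zero_excluded" in base_cmd.split()
    if ZERO_EXCLUDED_HINT in flat and not already_set:
        return base_cmd + " --zero_excluded"
    return base_cmd
```

The returned string could then be written to the task's crash_cmd file, which is what the echo command above does by hand.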


Version-Release number of selected component (if applicable):
retrace-server-1.15-1.el6.noarch

How reproducible:
Once so far, but it depends on how many damaged / partial vmcores we receive.


Steps to Reproduce:
TBD

Actual results:
"retrace-server-interact <taskid> crash" fails because crash exits with an error, yet the retrace-server task status is still 'success'.

Expected results:
"retrace-server-interact <taskid> crash" does not fail if some other option such as '--zero_excluded' would work.


Additional info:

This is not a high priority, but it would help in some instances.  Approximately 20% of the vmcores we receive still fail in some way.

There are other options which may be useful as well, such as --no_kmem_cache and, if all else fails, --minimal.  The code that currently handles 32-bit vmcores and the VMware --phys_base parameter should probably be refactored so that these other options can be added.
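
One way to picture the refactoring is an ordered list of option sets that are tried until crash starts successfully.  This is a hedged sketch only: the names FALLBACK_OPTIONS, first_working_options, and build_crash_cmd are hypothetical, the ordering is an assumption, and try_crash stands in for whatever retrace-server would use to launch crash and report success.  Only the option names themselves come from this report.

```python
from typing import Callable, List, Optional, Sequence

# Assumed ordering: no extra options first, --minimal as the last resort.
FALLBACK_OPTIONS: List[List[str]] = [
    [],
    ["--zero_excluded"],
    ["--no_kmem_cache"],
    ["--minimal"],
]


def first_working_options(
    try_crash: Callable[[Sequence[str]], bool],
    candidates: Sequence[Sequence[str]] = FALLBACK_OPTIONS,
) -> Optional[List[str]]:
    """Return the first option set for which try_crash reports success."""
    for opts in candidates:
        if try_crash(opts):
            return list(opts)
    return None  # every candidate failed


def build_crash_cmd(options: Sequence[str], base_cmd: str = "crash") -> str:
    """Join the base command and options into a crash_cmd string."""
    return " ".join([base_cmd, *options])
```

Keeping the candidates in a plain list means options such as --phys_base could be added as further entries instead of as special cases in the code.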

We probably also need some patches so that the result of running crash affects the task 'status' in some way, or possibly a secondary status.  I will need to think about it and work on some patches to see what can be done.

The one example we had occurred on a vmcore with kernel 3.10.0-327.13.1.el7.x86_64.debug

Comment 2 Dave Wysochanski 2018-02-05 18:59:58 UTC
This probably should be tackled alongside https://bugzilla.redhat.com/show_bug.cgi?id=1232019

Comment 3 Dave Wysochanski 2018-03-02 21:52:01 UTC
I am not sure how important this is, but I am keeping it open for now.

Comment 4 Dave Wysochanski 2018-04-18 11:01:28 UTC
For now, just set --minimal if we recognize the kernel and have a decent-sized kernel log.  https://github.com/abrt/retrace-server/pull/187
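
The condition described in this comment might look roughly like the sketch below.  It is not the code from the pull request: the function name, the regular expression used to "recognize" a kernel release, and the 1024-byte threshold for a "decent-sized" log are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Assumed shape of a kernel release string, e.g. the one in this report.
KERNEL_RELEASE_RE = re.compile(r"^\d+\.\d+\.\d+-[\w.]+$")

# Assumed threshold; the real value is not stated in this report.
MIN_KERNEL_LOG_BYTES = 1024


def should_use_minimal(
    kernel_release: Optional[str],
    kernel_log: bytes,
    min_log_bytes: int = MIN_KERNEL_LOG_BYTES,
) -> bool:
    """Decide whether falling back to 'crash --minimal' is worthwhile."""
    # No recognizable kernel release: nothing to base the fallback on.
    if not kernel_release or not KERNEL_RELEASE_RE.match(kernel_release):
        return False
    # Require a kernel log of at least the assumed minimum size.
    return len(kernel_log) >= min_log_bytes
```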

Comment 5 Dave Wysochanski 2018-12-21 15:43:30 UTC
$ git tag --contains e27be24
1.19.0