Bug 1232019

Summary:

retrace-server should fail tasks where crash exits with an error and retrace_backtrace file is tiny and contains "log: seek error"

Product:

[Fedora] Fedora EPEL

Reporter:

Dave Wysochanski <dwysocha>

Component:

retrace-server

Assignee:

Dave Wysochanski <dwysocha>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

low

Docs Contact:

Priority:

medium

Version:

el6

CC:

abrt-devel-list, bubrown, jberan

Target Milestone:

---

Keywords:

Patch

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-12-21 15:42:29 UTC

Type:

Bug

Attachments:

Description	Flags
v3: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes	none
v4: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes	none
v5: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes	none

Description Dave Wysochanski 2015-06-15 20:37:40 UTC

Description of problem:
Today, retrace-server will flag a task as 'success' if it can identify the kernel version and set up the symbols.  But it doesn't take into account whether the vmcore is usable.

We get a fair number of damaged / incomplete vmcores.  One "class" of such vmcores is such that the kernel log is not even readable.  In this case the vmcore is totally useless AFAIK.  These should really be marked as 'failed' IMO so that the more aggressive cleanup for failed tasks applies to them.

Here is a good example of what we see in a log.
$ cat /cores/retrace/tasks/830339522/retrace_backtrace
NOTE: minimal mode commands: log, dis, rd, sym, eval, set, extend and exit

log: seek error: kernel virtual address: ffffffff81a9abe8  type: "log_buf_len"

$ cat /cores/retrace/tasks/830339522/retrace_log 
INFO:root:    2015-05-20 12:14:48 Downloading remote resources
INFO:root:    2015-05-20 12:14:48 Retrieving local file '/cores/exceptions/01445357/vmcore-incomplete'
DEBUG:root:   2015-05-20 12:14:48 File type: data
DEBUG:root:   2015-05-20 12:14:48 unknown file type, unpacking finished
DEBUG:root:   2015-05-20 12:14:48 Trying hardlink
DEBUG:root:   2015-05-20 12:14:48 Succeeded
INFO:root:    2015-05-20 12:14:48 Post-processing downloaded file
DEBUG:root:   2015-05-20 12:14:48 File type: data
DEBUG:root:   2015-05-20 12:14:48 unknown file type, unpacking finished
DEBUG:root:   2015-05-20 12:14:48 File type: data
DEBUG:root:   2015-05-20 12:14:48 unknown file type, unpacking finished
INFO:root:    2015-05-20 12:14:48 Vmcore size: 5.74 GB
DEBUG:root:   2015-05-20 12:14:49 Vmcore dump level is 31
INFO:root:    2015-05-20 12:14:49 Stripping to 1 would have no effect
INFO:root:    2015-05-20 12:14:49 Analyzing crash data
DEBUG:root:   2015-05-20 12:14:49 Parsing kernel version '2.6.32-358.6.2.el6.x86_64'
DEBUG:root:   2015-05-20 12:14:49 Version: '2.6.32'; Release: '358.6.2.el6'; Arch: 'x86_64'; Flavour: 'None'; Realtime: False
DEBUG:root:   2015-05-20 12:14:49 Determined kernel version: 2.6.32-358.6.2.el6.x86_64
INFO:root:    2015-05-20 12:14:49 Preparing environment for backtrace generation
DEBUG:root:   2015-05-20 12:14:50 Parsing kernel version '2.6.32-358.6.2.el6.x86_64'
DEBUG:root:   2015-05-20 12:14:50 Version: '2.6.32'; Release: '358.6.2.el6'; Arch: 'x86_64'; Flavour: 'None'; Realtime: False
DEBUG:root:   2015-05-20 12:14:50 Parsing kernel version '2.6.32-358.6.2.el6.x86_64'
DEBUG:root:   2015-05-20 12:14:50 Version: '2.6.32'; Release: '358.6.2.el6'; Arch: 'x86_64'; Flavour: 'None'; Realtime: False
DEBUG:root:   2015-05-20 12:14:50 Trying debuginfo file: /cores/retrace/repos/kernel/Packages/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
DEBUG:root:   2015-05-20 12:14:50 Trying debuginfo file: /cores/retrace/repos/kernel/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
DEBUG:root:   2015-05-20 12:14:50 Trying debuginfo file: /cores/retrace/repos/download/Packages/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
DEBUG:root:   2015-05-20 12:14:50 Trying debuginfo file: /cores/retrace/repos/download/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
DEBUG:root:   2015-05-20 12:14:50 Trying debuginfo file: /mnt/brewroot/packages/kernel/2.6.32/358.6.2.el6/x86_64/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
WARNING:root: 2015-05-20 12:14:53 Unable to list modules: crash exited with 1:
crash: seek error: kernel virtual address: ffffffff81614820  type: "cpu_possible_mask"

INFO:root:    2015-05-20 12:14:53 Generating backtrace
WARNING:root: 2015-05-20 12:14:57 crash 'bt -a' exited with 1
WARNING:root: 2015-05-20 12:14:58 crash 'sys' exited with 1
WARNING:root: 2015-05-20 12:15:00 crash 'sys -c' exited with 1
WARNING:root: 2015-05-20 12:15:01 crash 'foreach bt' exited with 1
INFO:root:    2015-05-20 12:15:01 Saving crash statistics
INFO:root:    2015-05-20 12:15:01 Cleaning environment after backtrace generation
INFO:root:    2015-05-20 12:15:01 Retrace took 13 seconds
INFO:root:    2015-05-20 12:15:01 Retrace job finished successfully


Version-Release number of selected component (if applicable):
retrace-server-1.12-3.el6.noarch

How reproducible:
Any vmcore where crash exits with an error and no kernel log is created.  I've found the 'retrace_backtrace' file is very tiny in these cases, usually around 155 bytes.
$ ls -l /cores/retrace/tasks/830339522/retrace_backtrace
-rw-r--r--. 1 retrace gss-eng-collab 155 May 20 12:15 /cores/retrace/tasks/830339522/retrace_backtrace

Steps to Reproduce:
1. Submit a partial / damaged vmcore where we cannot even extract a kernel log, but we can identify the kernel version.

Actual results:
retrace-server marks the task as a success.

Expected results:
Tasks where there's no kernel log created (maybe check for size above a certain amount) should be marked as 'failed'.


Additional info:
FWIW, I looked just at the tasks where the retrace_backtrace file contained "log: seek error: kernel virtual address".  All of these were very tiny files (155 bytes mostly).  These tasks took up over 800 GB of space on our filesystem which stores the tasks.

There are probably corner cases where crash may fail for some commands but the vmcore is still usable.  However, if crash fails and we cannot get a kernel log into retrace_backtrace, then I think the vmcore is useless and we should probably mark it failed so more aggressive cleanup may be used.
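
For reference, the survey described above (tiny retrace_backtrace files containing "log: seek error", and the space their tasks occupy) could be reproduced with something like the sketch below.  This is not part of retrace-server.  The function names and the 1024-byte cutoff are assumptions made for illustration; the only details taken from this report are the 'retrace_backtrace' file name and the marker string.

```python
import os

MARKER = b"log: seek error"
# Assumed cutoff for "tiny"; the files seen in this report were ~155 bytes.
MAX_BACKTRACE_BYTES = 1024


def dir_size(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def find_unusable_tasks(tasks_dir):
    """Yield (task_dir, size) for tasks with a tiny backtrace containing MARKER."""
    for entry in sorted(os.listdir(tasks_dir)):
        task_dir = os.path.join(tasks_dir, entry)
        backtrace = os.path.join(task_dir, "retrace_backtrace")
        if not os.path.isfile(backtrace):
            continue
        if os.path.getsize(backtrace) >= MAX_BACKTRACE_BYTES:
            continue
        with open(backtrace, "rb") as f:
            if MARKER in f.read():
                yield task_dir, dir_size(task_dir)
```

Calling find_unusable_tasks("/cores/retrace/tasks") and summing the sizes would give a rough total.  Note the vmcore may be a hardlink (see "Trying hardlink" in the log above), so the summed sizes can overstate the space that deleting the tasks would actually free.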

Comment 1 Dave Wysochanski 2015-06-22 17:41:44 UTC

Marking medium priority / low severity.  Assuming it's not too invasive, it would be good to remove tasks like this which we know are useless.  I have not tried a patch yet.

Comment 2 Dave Wysochanski 2018-03-02 14:35:02 UTC

This has become a problem again as we have received larger 'vmem' files that take up our excess space margin.  They 'succeed' because our kernelver detection can find a kernelver, but they are useless because crash fails to load them.  These files need to be converted with the third-party VMware 'vmss2core' tool to be useful.  The end result today is that we keep these vmem files around longer than we should (we should remove them based on DeleteFailedTaskAfter, not DeleteTaskAfter).
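
To illustrate the retention point above, here is a minimal sketch of how the cleanup limit depends on task status.  It is not the retrace-server cleanup code.  The function names are hypothetical, the two limits stand in for the DeleteTaskAfter and DeleteFailedTaskAfter settings, and treating 0 as "never delete" is an assumption.  Age and limits just need to be in the same unit.

```python
def retention_limit(task_failed, delete_task_after, delete_failed_task_after):
    """Pick the retention limit that applies to a task, based on its status."""
    return delete_failed_task_after if task_failed else delete_task_after


def is_expired(age, task_failed, delete_task_after, delete_failed_task_after):
    """Return True if a task of the given age should be cleaned up.

    Assumption: a limit of 0 disables cleanup for that class of task.
    """
    limit = retention_limit(task_failed, delete_task_after, delete_failed_task_after)
    return limit > 0 and age >= limit
```

With a shorter limit for failed tasks than for successful ones, a useless vmem task that is wrongly marked 'success' stays on disk until the longer limit is reached, which is the problem described in this comment.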

Comment 3 Dave Wysochanski 2018-03-02 15:20:30 UTC

This is probably simple: just look for a non-zero 'crash' exit code.  But it will require a decent amount of testing to make sure no corner cases exist (marking something as not useful that we should retain, whether the crash exit code is correct, etc.).  Also, we should think about whether we need another status code or can just reuse 'failed'.

FWIW, example of the failure from comment #2 - crash output has "not a supported file format", and this is fairly common:

crash: /cores/retrace/tasks/784434840/crash/vmcore: not a supported file format

Usage:

  crash [OPTION]... NAMELIST MEMORY-IMAGE[@ADDRESS]	(dumpfile form)
  crash [OPTION]... [NAMELIST]             		(live system form)

Enter "crash -h" for details.

Comment 7 Dave Wysochanski 2018-03-02 22:47:42 UTC

Created attachment 1403242
v3: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes

Comment 8 Dave Wysochanski 2018-03-05 14:29:01 UTC

Created attachment 1404350
v4: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes

Comment 9 Dave Wysochanski 2018-03-06 13:56:55 UTC

https://github.com/abrt/retrace-server/pull/178

Comment 10 Dave Wysochanski 2018-03-06 13:58:03 UTC

Created attachment 1404809
v5: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes
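
The rule named in the attachment title can be sketched as follows.  This is an illustration only, not the patch itself.  The function name, the kernellog_path parameter, and treating a missing log file as empty are assumptions; the non-zero 'sys' exit status and the 1024-byte threshold come from the title.

```python
import os

# Threshold from the attachment title: kernel logs smaller than this are suspect.
KERNELLOG_MIN_BYTES = 1024


def should_fail_task(sys_exit_status, kernellog_path):
    """Return True if the task should be marked failed.

    Both conditions must hold: crash 'sys' exited with non-zero status,
    and the kernel log is smaller than KERNELLOG_MIN_BYTES.
    """
    if sys_exit_status == 0:
        return False
    try:
        size = os.path.getsize(kernellog_path)
    except OSError:
        # Assumption: a missing or unreadable kernel log counts as empty.
        size = 0
    return size < KERNELLOG_MIN_BYTES
```

Requiring both conditions keeps the corner case from comment #3 in mind: a vmcore where 'sys' fails but a full kernel log was still extracted is not failed by this check.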

Comment 11 Dave Wysochanski 2018-03-24 11:59:49 UTC

After much testing, the pull request from comment #9 has been merged and deployed in production.  So far it looks good.

Comment 13 Dave Wysochanski 2018-12-21 15:42:29 UTC

$ git tag --contains daea8e8
1.19.0