Description of problem:
Today, retrace-server will flag a task as 'success' if it can identify the kernel version and set up the symbols. But it doesn't take into account whether the vmcore is usable. We get a fair number of damaged / incomplete vmcores. In one "class" of such vmcores, even the kernel log is not readable. In this case the vmcore is totally useless AFAIK. These should really be marked as 'failed' IMO so that the more aggressive cleanup can be applied.

Here is a good example of what we see in a log.

$ cat /cores/retrace/tasks/830339522/retrace_backtrace
NOTE: minimal mode commands: log, dis, rd, sym, eval, set, extend and exit
log: seek error: kernel virtual address: ffffffff81a9abe8 type: "log_buf_len"

$ cat /cores/retrace/tasks/830339522/retrace_log
INFO:root: 2015-05-20 12:14:48 Downloading remote resources
INFO:root: 2015-05-20 12:14:48 Retrieving local file '/cores/exceptions/01445357/vmcore-incomplete'
DEBUG:root: 2015-05-20 12:14:48 File type: data
DEBUG:root: 2015-05-20 12:14:48 unknown file type, unpacking finished
DEBUG:root: 2015-05-20 12:14:48 Trying hardlink
DEBUG:root: 2015-05-20 12:14:48 Succeeded
INFO:root: 2015-05-20 12:14:48 Post-processing downloaded file
DEBUG:root: 2015-05-20 12:14:48 File type: data
DEBUG:root: 2015-05-20 12:14:48 unknown file type, unpacking finished
DEBUG:root: 2015-05-20 12:14:48 File type: data
DEBUG:root: 2015-05-20 12:14:48 unknown file type, unpacking finished
INFO:root: 2015-05-20 12:14:48 Vmcore size: 5.74 GB
DEBUG:root: 2015-05-20 12:14:49 Vmcore dump level is 31
INFO:root: 2015-05-20 12:14:49 Stripping to 1 would have no effect
INFO:root: 2015-05-20 12:14:49 Analyzing crash data
DEBUG:root: 2015-05-20 12:14:49 Parsing kernel version '2.6.32-358.6.2.el6.x86_64'
DEBUG:root: 2015-05-20 12:14:49 Version: '2.6.32'; Release: '358.6.2.el6'; Arch: 'x86_64'; Flavour: 'None'; Realtime: False
DEBUG:root: 2015-05-20 12:14:49 Determined kernel version: 2.6.32-358.6.2.el6.x86_64
INFO:root: 2015-05-20 12:14:49 Preparing environment for backtrace generation
DEBUG:root: 2015-05-20 12:14:50 Parsing kernel version '2.6.32-358.6.2.el6.x86_64'
DEBUG:root: 2015-05-20 12:14:50 Version: '2.6.32'; Release: '358.6.2.el6'; Arch: 'x86_64'; Flavour: 'None'; Realtime: False
DEBUG:root: 2015-05-20 12:14:50 Parsing kernel version '2.6.32-358.6.2.el6.x86_64'
DEBUG:root: 2015-05-20 12:14:50 Version: '2.6.32'; Release: '358.6.2.el6'; Arch: 'x86_64'; Flavour: 'None'; Realtime: False
DEBUG:root: 2015-05-20 12:14:50 Trying debuginfo file: /cores/retrace/repos/kernel/Packages/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
DEBUG:root: 2015-05-20 12:14:50 Trying debuginfo file: /cores/retrace/repos/kernel/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
DEBUG:root: 2015-05-20 12:14:50 Trying debuginfo file: /cores/retrace/repos/download/Packages/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
DEBUG:root: 2015-05-20 12:14:50 Trying debuginfo file: /cores/retrace/repos/download/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
DEBUG:root: 2015-05-20 12:14:50 Trying debuginfo file: /mnt/brewroot/packages/kernel/2.6.32/358.6.2.el6/x86_64/kernel-debuginfo-2.6.32-358.6.2.el6.x86_64.rpm
WARNING:root: 2015-05-20 12:14:53 Unable to list modules: crash exited with 1: crash: seek error: kernel virtual address: ffffffff81614820 type: "cpu_possible_mask"
INFO:root: 2015-05-20 12:14:53 Generating backtrace
WARNING:root: 2015-05-20 12:14:57 crash 'bt -a' exited with 1
WARNING:root: 2015-05-20 12:14:58 crash 'sys' exited with 1
WARNING:root: 2015-05-20 12:15:00 crash 'sys -c' exited with 1
WARNING:root: 2015-05-20 12:15:01 crash 'foreach bt' exited with 1
INFO:root: 2015-05-20 12:15:01 Saving crash statistics
INFO:root: 2015-05-20 12:15:01 Cleaning environment after backtrace generation
INFO:root: 2015-05-20 12:15:01 Retrace took 13 seconds
INFO:root: 2015-05-20 12:15:01 Retrace job finished successfully

Version-Release number of selected component (if applicable):
retrace-server-1.12-3.el6.noarch

How reproducible:
Any vmcore where crash exits and no kernel log is created. I've found that the 'retrace_backtrace' file is very tiny, usually around 155 bytes.

$ ls -l /cores/retrace/tasks/830339522/retrace_backtrace
-rw-r--r--. 1 retrace gss-eng-collab 155 May 20 12:15 /cores/retrace/tasks/830339522/retrace_backtrace

Steps to Reproduce:
1. Submit a partial / damaged vmcore where we cannot even extract a kernel log, but we can identify the kernel version.

Actual results:
retrace-server marks the task as a success.

Expected results:
Tasks where no kernel log is created (maybe check for size above a certain amount) should be marked as 'failed'.

Additional info:
FWIW, I looked just at the tasks where the retrace_backtrace file contained "log: seek error: kernel virtual address". All of these were very tiny files (155 bytes mostly). These tasks took up over 800 GB of space on the filesystem which stores the tasks.

There are probably corner cases where crash may fail for some commands but the vmcore is still usable. However, if crash fails and we cannot get a kernel log into retrace_backtrace, then I think the vmcore is useless and we should probably mark it failed so that more aggressive cleanup may be used.
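For anyone who wants to repeat the survey from "Additional info", here is a rough Python sketch. It is not part of retrace-server; the function names and the 1024-byte cutoff are assumptions made for illustration, and only the 'retrace_backtrace' file name and the error string come from the report above.

```python
import os

# Error string taken from the retrace_backtrace example in this report.
SEEK_ERROR = "log: seek error: kernel virtual address"
# Assumed cutoff; the affected files seen so far were about 155 bytes.
SMALL_BACKTRACE_BYTES = 1024


def find_unusable_tasks(tasks_dir):
    """Return task dirs whose retrace_backtrace is tiny and holds the seek error."""
    unusable = []
    for name in sorted(os.listdir(tasks_dir)):
        task_dir = os.path.join(tasks_dir, name)
        backtrace = os.path.join(task_dir, "retrace_backtrace")
        if not os.path.isfile(backtrace):
            continue
        if os.path.getsize(backtrace) >= SMALL_BACKTRACE_BYTES:
            continue
        with open(backtrace, errors="replace") as f:
            if SEEK_ERROR in f.read():
                unusable.append(task_dir)
    return unusable


def dir_size(path):
    """Sum the apparent size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for fname in files:
            fpath = os.path.join(root, fname)
            if not os.path.islink(fpath):
                total += os.path.getsize(fpath)
    return total
```

Calling find_unusable_tasks("/cores/retrace/tasks") and summing dir_size() over the result gives an estimate of the space involved. Since the vmcore may be hardlinked into the task directory (see "Trying hardlink" in the log), the total is an upper bound on what deleting the tasks would actually free.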
Marking medium priority / low severity. Assuming it's not too invasive, it would be good to remove tasks like this which we know are useless. I have not tried a patch yet.
This has become a problem again as we have received larger 'vmem' files that eat into our spare space margin. They 'succeed' because our kernelver detection can find a kernelver, but they are useless because crash fails to load them. These files need to be converted with the third-party VMware 'vmss2core' tool to be useful. The end result today is that we keep these vmem files around longer than we should (we should remove these based on DeleteFailedTaskAfter, not DeleteTaskAfter).
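To spell out why the status matters for cleanup, here is a minimal Python sketch of status-dependent retention. It is not the retrace-server cleanup code: DeleteTaskAfter and DeleteFailedTaskAfter are the setting names mentioned above, but the values, the unit (hours), and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical values standing in for the DeleteTaskAfter and
# DeleteFailedTaskAfter settings; the unit (hours) is an assumption.
DELETE_TASK_AFTER_HOURS = 24 * 30
DELETE_FAILED_TASK_AFTER_HOURS = 24


def retention_hours(status):
    """Failed tasks get the shorter retention period."""
    if status == "failed":
        return DELETE_FAILED_TASK_AFTER_HOURS
    return DELETE_TASK_AFTER_HOURS


def should_delete(status, finished_at, now):
    """True once the task has outlived the retention period for its status."""
    return now - finished_at > timedelta(hours=retention_hours(status))
```

With these example values, a useless vmem task that is wrongly marked 'success' stays on disk for a month instead of a day.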
This is probably simple: just look for a non-zero 'crash' exit code. But it will require a decent amount of testing to make sure no corner cases exist (marking something as not useful that we should retain, whether the crash exit code is correct, etc.). We should also think about whether we need another status code or can just re-use 'failed'.

FWIW, here is an example of the failure from comment #2. The crash output has "not a supported file format", and this is fairly common:

crash: /cores/retrace/tasks/784434840/crash/vmcore: not a supported file format

Usage:
  crash [OPTION]... NAMELIST MEMORY-IMAGE[@ADDRESS] (dumpfile form)
  crash [OPTION]... [NAMELIST] (live system form)

Enter "crash -h" for details.
Created attachment 1403242 [details] v3: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes
Created attachment 1404350 [details] v4: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes
https://github.com/abrt/retrace-server/pull/178
Created attachment 1404809 [details] v5: fail a task if crash 'sys' command exits with non-zero status and size of kernellog is less than 1024 bytes
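For readers who do not want to open the attachment, the rule in the patch title can be sketched in a few lines of Python. This is an illustration of the stated condition only, not the code from the patch; the function name and arguments are hypothetical.

```python
# Threshold from the patch title: a kernel log under 1024 bytes counts as missing.
KERNELLOG_MIN_BYTES = 1024


def task_should_fail(sys_exit_code, kernellog):
    """Fail only when crash 'sys' exited non-zero AND the kernel log is tiny.

    sys_exit_code: exit status of the crash 'sys' command (hypothetical argument).
    kernellog: raw kernel log bytes extracted from the vmcore (may be empty).
    """
    return sys_exit_code != 0 and len(kernellog) < KERNELLOG_MIN_BYTES
```

Requiring both conditions is what protects the corner cases discussed earlier: a vmcore where some crash commands fail but a real kernel log was still extracted stays 'success'.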
After much testing, pull request from comment #9 has been merged and has been deployed in production. So far it looks good.
$ git tag --contains daea8e8
1.19.0