Bug 1053186

Summary: "retrace-server-interact <taskid> crash" should not try to detect the kernel version but should obtain it from a saved location
Product: [Fedora] Fedora EPEL Reporter: Dave Wysochanski <dwysocha>
Component: retrace-server Assignee: Michal Toman <mtoman>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: medium    
Version: el6 CC: kwalker, mtoman, pknirsch, rvokal, smayhew, stalexan
Target Milestone: --- Keywords: Regression, TestCaseProvided
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: retrace-server-1.12-2.el6 Doc Type: Bug Fix
Last Closed: 2014-08-15 18:58:11 UTC Type: Bug

Description Dave Wysochanski 2014-01-14 19:54:45 UTC
Description of problem:
I recently noticed that "retrace-server-interact <taskid> crash" can take a very long time to start, which is problematic.
The vmcore I was working on was 35GB, and the command appeared to hang without ever reaching the crash prompt.  In another window, I ran this simple loop:
$ while true; do ps -efl | grep 476900695; sleep 1; done

and noticed the loading of crash was blocked behind the following:
crash --osrelease /cores/retrace/tasks/476900695/crash/vmcore

So it seems that every time you invoke 'retrace-server-interact <taskid> crash', it attempts to determine the kernel version of the vmcore and load the correct symbols.  But I wonder why we need to do this.  Hasn't this already been done, and can't the kernel version be saved in a file in the <taskid> directory?


Version-Release number of selected component (if applicable):
retrace-server-1.10-1.el6.noarch


How reproducible:
Every time, I believe, though it is more noticeable for large vmcores (mine was 35GB).


Steps to Reproduce:
1. Load any vmcore via 'retrace-server-interact <taskid> crash'.  Optionally, time it with 'time'.
2. In another window, observe via the following script: while true; do ps -efl | grep <taskid>; sleep 1; done
3. Note that getting to a crash prompt may take minutes because the command is stuck in the vmcore kernel version detection logic, such as the 'crash --osrelease' command.


Actual results:
Getting to the crash prompt may take an excessive amount of time when using the recommended command, 'retrace-server-interact <taskid> crash'.


Expected results:
Command should not need to detect the kernel version every time it is invoked.
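
The expected behaviour can be sketched as a small shell function: read the kernel version from a file in the task directory if it is there, and only fall back to detection when it is not. This is a minimal illustration and not the retrace-server code. The cache file name "kernelver" and the DETECT_CMD override are hypothetical; the default detector is the 'crash --osrelease' call seen in the process listing above.

```shell
# Sketch: reuse a cached kernel version instead of re-running detection.
# "kernelver" is a hypothetical cache file name, and DETECT_CMD is a
# hypothetical override that defaults to "crash --osrelease".

cached_kernelver() {
    local taskdir="$1"
    local cache="$taskdir/kernelver"        # hypothetical cache location
    local vmcore="$taskdir/crash/vmcore"
    local kver

    # Fast path: an earlier run already stored the version.
    if [ -s "$cache" ]; then
        cat "$cache"
        return 0
    fi

    # Slow path: detect once, then store the result for later runs.
    kver="$(${DETECT_CMD:-crash --osrelease} "$vmcore")" || return 1
    [ -n "$kver" ] || return 1
    printf '%s\n' "$kver" > "$cache"
    printf '%s\n' "$kver"
}
```

Called as 'cached_kernelver /cores/retrace/tasks/<taskid>', only the first invocation would pay the detection cost; later invocations would just read the file.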


Additional info:
I'm not sure why the command works this way.  It may have been assumed detection of the kernel version would not be so expensive.  Unfortunately for large vmcores this is not the case.

I'm surprised no one has noticed / complained about this.  It may be people have their own workarounds, or the frequency of largish vmcores is not too high.

A possible workaround is to take the kernel version from the 'retrace_backtrace' file and construct a manual crash command line to load the vmcore, avoiding the detection logic of the 'retrace-server-interact <taskid> crash' command.
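
The workaround could be wrapped in a helper along these lines. It is only a sketch: the path layout is copied from the manual crash invocation shown later in this report, the function name is made up, and deriving the architecture from the suffix of the kernel version string is an assumption. The kernel version is passed in by hand, since the format of 'retrace_backtrace' is not shown here.

```shell
# Sketch: print the manual crash command line for a task.
# Path layout follows the manual invocation in this report; taking the
# architecture from the kernel version suffix is an assumption.

manual_crash_cmd() {
    local taskid="$1"
    local kver="$2"                    # e.g. 2.6.32-358.23.2.el6.x86_64
    local arch="${kver##*.}"           # assumed: text after the last dot
    local taskdir="/cores/retrace/tasks/$taskid"
    local vmlinux="/cores/retrace/repos/kernel/$arch/usr/lib/debug/lib/modules/$kver/vmlinux"

    printf 'crash -i %s %s %s\n' "$taskdir/crashrc" "$taskdir/crash/vmcore" "$vmlinux"
}

# Print the command for the task discussed in this report.
manual_crash_cmd 476900695 2.6.32-358.23.2.el6.x86_64
```

Printing the command rather than running it makes it easy to check the paths before starting a multi-minute load.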

I timed 'retrace-server-interact <taskid> crash' and found that it took over 19 minutes just to load and quit crash.  While this isn't a rigorous measurement, it does give an indication of how much overhead we may be encountering.

$ time retrace-server-interact 476900695 crash
...
crash> cd /cores/retrace/tasks/476900695/misc
Working directory /cores/retrace/tasks/476900695/misc.
crash> quit

real    19m19.671s
user    18m59.388s
sys     0m6.018s


Manually loading via the crash command and full path took less than 5 minutes.

[dwysocha@optimus expect]$ time crash -i /cores/retrace/tasks/476900695/crashrc /cores/retrace/tasks/476900695/crash/vmcore /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.23.2.el6.x86_64/vmlinux

crash 7.0.1
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
quit
...
crash> cd /cores/retrace/tasks/476900695/misc
Working directory /cores/retrace/tasks/476900695/misc.
crash> quit

real    4m40.887s
user    4m36.856s
sys     0m3.521s
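
To put the two "real" timings above side by side, they can be converted to seconds and compared. The helper name to_seconds is made up for this sketch, and it only handles the NmS.SSSs form that 'time' printed here.

```shell
# Sketch: convert "real" values printed by time to seconds and compare them.

to_seconds() {
    # Accepts the minutes/seconds form shown above, e.g. 19m19.671s.
    awk -v t="$1" 'BEGIN {
        split(t, part, "m")            # part[1] = minutes, part[2] = seconds + "s"
        sub(/s$/, "", part[2])         # drop the trailing "s"
        printf "%.3f\n", part[1] * 60 + part[2]
    }'
}

interact="$(to_seconds 19m19.671s)"    # retrace-server-interact run
manual="$(to_seconds 4m40.887s)"       # manual crash run

# Report both timings, the extra time spent, and the slowdown factor.
awk -v a="$interact" -v b="$manual" 'BEGIN {
    printf "interact: %.3f s, manual: %.3f s, extra: %.3f s, ratio: %.2fx\n", a, b, a - b, a / b
}'
```

For this single pair of runs, the retrace-server-interact path took a bit over four times as long as the manual invocation.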

Comment 1 Michal Toman 2014-02-26 12:16:20 UTC
Fixed in upstream

commit 328f12e24d6d1f324b4a4e43d55ba068cff6f3e4
Author: Michal Toman <mtoman>
Date:   Wed Feb 26 13:14:58 2014 +0100

    vmcore: cache kernel version into task directory
    
    Signed-off-by: Michal Toman <mtoman>

Comment 2 Fedora Update System 2014-02-27 13:46:59 UTC
retrace-server-1.11-1.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/retrace-server-1.11-1.el6

Comment 3 Fedora Update System 2014-03-01 07:11:59 UTC
Package retrace-server-1.11-1.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing retrace-server-1.11-1.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2014-0687/retrace-server-1.11-1.el6
then log in and leave karma (feedback).

Comment 6 Dave Wysochanski 2014-03-07 18:05:50 UTC
I think this has been fixed.  I verified with one vmcore on an unpatched retrace-server system, and submitted the same vmcore to a patched retrace-server system.  The unpatched system took 5 minutes to load.  The patched one took just a couple of seconds.

Comment 7 Fedora Update System 2014-07-31 11:52:51 UTC
retrace-server-1.12-2.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/retrace-server-1.12-2.el6

Comment 8 Fedora Update System 2014-08-15 18:58:11 UTC
retrace-server-1.12-2.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.