Bug 1368807

Summary: Avoid looking for kernel-debuginfo and possibly failing the task, if extracted vmlinux and kernel modules exist in local cache
Product: [Fedora] Fedora EPEL Reporter: Dave Wysochanski <dwysocha>
Component: retrace-server Assignee: Dave Wysochanski <dwysocha>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: high    
Version: el6 CC: hmadhava, jberan, michal.toman, mmarusak, phelia
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: retrace-server-1.17.0-1.fc26 retrace-server-1.17.0-1.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-03 16:09:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
0001-Avoid-circular-dependency-on-kernel-debuginfo-for-vm.patch none

Description Dave Wysochanski 2016-08-21 14:14:53 UTC
Description of problem:
Recently we had a few kernel-debuginfo files get moved from one NFS share to a different location.  These were older kernel-debuginfos from released errata kernels, and at least some of them existed in our local extracted kernel store.  As a result we saw some recent vmcores get failed by retrace-server because it could no longer find kernel-debuginfo.  Fortunately this problem only affected a few kernel-debuginfo files.  If that had not been the case, we would have had a much larger outage.

As it turns out, retrace-server is unfortunately not smart enough to look for vmlinux and kernel module debug files in the local extracted kernel-debuginfo location (on x86_64, for example, this is inside CONFIG["RepoDir"]/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64).  Instead, for every vmcore that comes in, it must first find a matching kernel-debuginfo file or the task will fail.  This means that if we ever lose our NFS share, or kernel-debuginfos get moved in a more widespread fashion, retrace-server will start failing every task that comes in, which would be really bad.

I consider this a bug, but one could argue it is an optimization or even an RFE.  Implementation should not be difficult in theory, but it should be tested since it could lead to a regression when setting up the vmcore for use with retrace-server.

Version-Release number of selected component (if applicable):
retrace-server-1.16-1.el6.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Configure retrace-server so it cannot find kernel-debuginfo packages
2. Find a vmcore with kernel version matching some already extracted kernel-debuginfo

Actual results:
retrace-server will fail the task with "Unable to find debuginfo package" because it cannot find kernel-debuginfo.

Expected results:
retrace-server does not fail the task but the vmlinux and kernel module symbols from the cache are used


Additional info:

If a task fails due to this, there does exist a workaround:
1. Download kernel-debuginfo locally to retrace-server inside CONFIG["RepoDir"]/downloaded

2. Restart the retrace-server task: 
$ retrace-server-worker --restart 


I mark it medium priority due to the following factors:
- we've only seen this recently and it's existed for a long time (as far as I know)
- there is a workaround
- it is probably unlikely to have many kernel-debuginfos for recent kernels go missing
- if something happened to our nfs share with kernel-debuginfo it would be a severe problem.  NOTE: This is not an unlikely scenario, since one group maintains our vmcore / retrace-server system, and a different group is in charge of the NFS shares for kernel-debuginfos.  While the two groups should be coordinating, it is a clear place where things can break down.


I took a brief look at the code and I think the following code needs to be patched.  It's possible it's too hard to do something prior to calling prepare_debuginfo, and instead we'll need some additional code inside prepare_debuginfo.  Or we might need to refactor prepare_debuginfo and add a new method to call before it.

src/retrace/retrace_worker.py

   def start_vmcore(self, custom_kernelver=None):

...

            # no locks required, mock locks itself
            try:
                self.hook_pre_prepare_debuginfo()

                # Look in the cache (CONFIG["RepoDir"]/kernel/) first: if vmlinux
                # exists, we don't need to call prepare_debuginfo.
                # See prepare_debuginfo() and cache_files_from_debuginfo().

                vmlinux = task.prepare_debuginfo(vmcore, cfgdir, kernelver=kernelver, crash_cmd=task.get_crash_cmd().split())
                self.hook_post_prepare_debuginfo()

                self.hook_pre_retrace()
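
A minimal sketch of the "look in the cache first" idea from the comment in the snippet above. It is not the retrace-server implementation: find_cached_vmlinux, get_vmlinux, and the prepare_from_package callback are hypothetical names, repo_dir stands in for CONFIG["RepoDir"], and the directory layout is assumed to match the x86_64 example given earlier in this description. It only covers vmlinux, not kernel module debug files.

```python
import os


def find_cached_vmlinux(repo_dir, arch, release_dir):
    """Return the cached vmlinux path if it exists, else None.

    repo_dir stands in for CONFIG["RepoDir"] (assumption); release_dir is the
    directory name under usr/lib/debug/lib/modules, for example
    "2.6.32-358.el6.x86_64".
    """
    candidate = os.path.join(repo_dir, "kernel", arch,
                             "usr", "lib", "debug", "lib", "modules",
                             release_dir, "vmlinux")
    return candidate if os.path.isfile(candidate) else None


def get_vmlinux(repo_dir, arch, release_dir, prepare_from_package):
    """Prefer the local cache; fall back to the package only on a miss.

    prepare_from_package is a hypothetical zero-argument callable standing in
    for the existing kernel-debuginfo based code path.
    """
    cached = find_cached_vmlinux(repo_dir, arch, release_dir)
    if cached is not None:
        return cached
    # Cache miss: only now does kernel-debuginfo need to be reachable.
    return prepare_from_package()
```

With this shape, a missing NFS share would only affect vmcores whose kernel has never been extracted before.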

Comment 1 Dave Wysochanski 2016-08-21 14:31:18 UTC
Looking a bit more I think this bug will definitely require refactoring of prepare_debuginfo()

Comment 2 Dave Wysochanski 2016-08-21 14:42:29 UTC
Not surprisingly, this might not be very easily fixable now that I look at it.

Unfortunately retrace-server uses the path inside the kernel-debuginfo file to store the vmlinux file locally.  This means even validation of the existence of the vmlinux file depends on the kernel-debuginfo file existing.

    def prepare_debuginfo(self, vmcore, chroot=None, kernelver=None, crash_cmd=["crash"]):
        log_info("Calling prepare_debuginfo with crash_cmd = " + str(crash_cmd))
        if kernelver is None:
            kernelver = get_kernel_release(vmcore, crash_cmd)

        if kernelver is None:
            raise Exception, "Unable to determine kernel version"

        debuginfo = find_kernel_debuginfo(kernelver)  # <--- ideally we want to avoid this.  We should be able to since we have the kernel version.
        if not debuginfo:
            raise Exception, "Unable to find debuginfo package"

        if "EL" in kernelver.release:
            if kernelver.flavour is None:
                pattern = "EL/vmlinux"
            else:
                pattern = "EL%s/vmlinux" % kernelver.flavour
        else:
            pattern = "/vmlinux"

        vmlinux_path = None
        debugfiles = {}
        child = Popen(["rpm", "-qpl", debuginfo], stdout=PIPE)
        lines = child.communicate()[0].splitlines()
        for line in lines:
            if line.endswith(pattern):
                vmlinux_path = line  # <--- vmlinux_path depends on output from 'rpm -qpl' above
                continue

            match = KO_DEBUG_PARSER.match(line)
            if not match:
                continue

            # only pick the correct flavour for el4
            if "EL" in kernelver.release:
                if kernelver.flavour is None:
                    pattern2 = "EL/"
                else:
                    pattern2 = "EL%s/" % kernelver.flavour

                if not pattern2 in os.path.dirname(line):
                    continue

            # '-' in file name is transformed to '_' in module name
            debugfiles[match.group(1).replace("-", "_")] = line

        debugdir_base = os.path.join(CONFIG["RepoDir"], "kernel", kernelver.arch)
        if not os.path.isdir(debugdir_base):
            os.makedirs(debugdir_base)

        vmlinux = os.path.join(debugdir_base, vmlinux_path.lstrip("/"))  # <--- vmlinux_path depends on kernel-debuginfo
        if not os.path.isfile(vmlinux):  # <--- if vmlinux did not depend on vmlinux_path, we could move this check higher
            cache_files_from_debuginfo(debuginfo, debugdir_base, [vmlinux_path])
            if not os.path.isfile(vmlinux):
                raise Exception, "Caching vmlinux failed"

Comment 3 Dave Wysochanski 2016-08-22 18:21:56 UTC
The more I thought about this, the more it seems possible that earlier versions of crash required the path to the vmlinux or modules to match what was in the kernel-debuginfo file.  Other than this, I'm not sure why retrace-server's local cache of the extracted kernel-debuginfo depends on the path format for the files in kernel-debuginfo.

Comment 4 Dave Wysochanski 2016-12-08 21:57:16 UTC
FWIW, this has potential to impact us now since eng-ops (IT) is moving the share where all the kernel-debuginfos are stored.  As a result I think our production retrace-server may be offline during their outage even though most of the debuginfos used for incoming vmcores are already extracted.

Comment 5 Dave Wysochanski 2016-12-12 22:35:25 UTC
Created attachment 1230977 [details]
0001-Avoid-circular-dependency-on-kernel-debuginfo-for-vm.patch

Comment 6 Dave Wysochanski 2016-12-12 22:36:20 UTC
https://github.com/abrt/retrace-server/pull/138

Comment 7 Dave Wysochanski 2017-02-21 21:53:34 UTC
Patch has been merged upstream which fixes the most common cases.
Unfortunately older kernels (RHEL5, etc) are not found in the cache due to differences in kernel-debuginfo.  I am looking into fixing this but right now I don't have another patch that fixes all cases.

Comment 8 Dave Wysochanski 2017-02-22 09:58:39 UTC
Ok, so here are the remaining problems I see with the existing patch / code.  I should have another patch soon to address all known cases.  This code is a good example of why the detection code needs to be pulled out and tested separately.

1. With RHEL5, the 'Arch' is not present in the kernel-debuginfo path to vmlinux.  Example:
Where we look:
    2017-02-22 04:22:11 Version: '2.6.18'; Release: '412.el5'; Arch: 'x86_64'; Flavour: 'None'; Realtime: False
    2017-02-22 04:22:11 Unable to find cached vmlinux at path: /retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-412.el5.x86_64/vmlinux - searching for kernel-debuginfo package

The correct path should be:
/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-412.el5/vmlinux


2. On RHEL5, we need to add the 'Flavour' for some unusual vmcores

    2017-02-22 04:22:15 Version: '2.6.18'; Release: '194.el5'; Arch: 'i386'; Flavour: 'PAE'; Realtime: False
    2017-02-22 04:22:15 Unable to find cached vmlinux at path: /retrace/repos/kernel/i386/usr/lib/debug/lib/modules/2.6.18-194.el5.i386.PAE/vmlinux - searching for kernel-debuginfo package

The correct path should be:
/retrace/repos/kernel/i386/usr/lib/debug/lib/modules/2.6.18-194.el5PAE/vmlinux


3. The code does some 'fixup' with the 'pattern' variable and the 'Flavour'.  I'm not sure this is correct, since on RHEL4 we end up with:

    2017-02-22 04:22:11 Version: '2.6.9'; Release: '89.0.7.EL'; Arch: 'i386'; Flavour: 'hugemem'; Realtime: False
    2017-02-22 04:22:11 Unable to find cached vmlinux at path: /retrace/repos/kernel/i386/usr/lib/debug/lib/modules/2.6.9-89.0.7.EL.i386.hugememELhugemem/vmlinux - searching for kernel-debuginfo package

The correct path should be:
/retrace/repos/kernel/i386/usr/lib/debug/lib/modules/2.6.9-89.0.7.ELhugemem/vmlinux

Again, we omit the 'Arch' on RHEL4, but we need the 'Flavour'

Here's the odd piece of code inside prepare_debuginfo

        if "EL" in kernelver.release:
            if kernelver.flavour is None:
                pattern = "EL/vmlinux"  # <--- this looks wrong; won't we have "ELEL/vmlinux" for some vmcores?
            else:
                pattern = "EL%s/vmlinux" % kernelver.flavour
        else:
            pattern = "/vmlinux"
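
A rough sketch of what pulling the path detection out into a separately testable function could look like, based only on the three cases listed in this comment plus the el6 example from the description. This is not the code from the upstream patch: vmlinux_cache_path and is_legacy_layout are hypothetical names, repo_dir stands in for CONFIG["RepoDir"], and the test used to recognize RHEL4/RHEL5 releases ("EL" in the release, or a release ending in ".el5") is an assumption. Flavoured kernels on newer releases are not covered by the examples above, so the sketch refuses them rather than guessing.

```python
import os

MODULES_SUBDIR = os.path.join("usr", "lib", "debug", "lib", "modules")


def is_legacy_layout(release):
    """Assumption: RHEL4 releases contain "EL", RHEL5 releases end in ".el5"."""
    return "EL" in release or release.endswith(".el5")


def vmlinux_cache_path(repo_dir, version, release, arch, flavour=None):
    """Build the expected cached vmlinux path from the kernel version alone."""
    if is_legacy_layout(release):
        # RHEL4/RHEL5: no arch in the directory name, and the flavour
        # (for example "PAE" or "hugemem") is appended directly.
        dirname = "%s-%s%s" % (version, release, flavour or "")
    else:
        if flavour:
            # Not covered by the cases in this bug; do not guess a layout.
            raise NotImplementedError("flavoured non-legacy kernels")
        dirname = "%s-%s.%s" % (version, release, arch)
    return os.path.join(repo_dir, "kernel", arch, MODULES_SUBDIR,
                        dirname, "vmlinux")


# The three problem cases from this comment:
print(vmlinux_cache_path("/retrace/repos", "2.6.18", "412.el5", "x86_64"))
print(vmlinux_cache_path("/retrace/repos", "2.6.18", "194.el5", "i386", "PAE"))
print(vmlinux_cache_path("/retrace/repos", "2.6.9", "89.0.7.EL", "i386", "hugemem"))
```

Keeping this logic in one small function means each kernel-debuginfo variant can get its own unit test without needing a vmcore or a package on disk.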

Comment 9 Dave Wysochanski 2017-02-22 12:26:11 UTC
Ok posted latest patch which handles all kernel-debuginfo variants
https://github.com/abrt/retrace-server/pull/145

Comment 10 Fedora Update System 2017-03-30 14:11:01 UTC
retrace-server-1.17.0-1.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-3d55370e77

Comment 11 Fedora Update System 2017-03-30 14:11:43 UTC
retrace-server-1.17.0-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-ffb8a84c9c

Comment 12 Fedora Update System 2017-03-30 14:12:00 UTC
retrace-server-1.17.0-1.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-9390d60e0d

Comment 13 Fedora Update System 2017-03-30 18:54:15 UTC
retrace-server-1.17.0-1.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-ffb8a84c9c

Comment 14 Dave Wysochanski 2017-03-30 20:11:49 UTC
This is not fixed, or at least it can cause regressions where kernel modules fail to get extracted.  I'm not sure why I missed this earlier; it is fairly obvious.

If you have two vmcores from the same kernel, but they have a different series of modules loaded:
vmcoreA: module1, module2
vmcoreB: module1, module2, module3, module4

If vmcoreA gets submitted first, both the vmlinux and the kernel modules it needs will get extracted.  When vmcoreB gets submitted, retrace-server will find the vmlinux file but incorrectly assume all modules exist, and it returns early from prepare_debuginfo.

The fix will be to avoid returning early, and look inside the cache area for any kernel modules, similar to the vmlinux file.  I'll have to refactor the code for this.
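
A minimal sketch of the module check described above, using the vmcoreA / vmcoreB scenario. It is not the refactored code: cached_module_debugfiles and missing_modules are hypothetical names, and it assumes cached module debug files are named <module>.ko.debug somewhere under the kernel's cached modules directory.

```python
import os

KO_DEBUG_SUFFIX = ".ko.debug"  # assumed file name suffix for module debug files


def cached_module_debugfiles(modules_dir):
    """Map module name -> path of each cached .ko.debug file under modules_dir."""
    found = {}
    for root, _dirs, files in os.walk(modules_dir):
        for name in files:
            if name.endswith(KO_DEBUG_SUFFIX):
                # '-' in file name is transformed to '_' in module name
                module = name[:-len(KO_DEBUG_SUFFIX)].replace("-", "_")
                found[module] = os.path.join(root, name)
    return found


def missing_modules(modules_dir, required_modules):
    """Return the required modules that have no cached debug file yet."""
    cached = cached_module_debugfiles(modules_dir)
    return [m for m in required_modules if m.replace("-", "_") not in cached]
```

Under this sketch, prepare_debuginfo would only be able to skip the kernel-debuginfo lookup when the cached vmlinux exists and missing_modules() returns an empty list; for vmcoreB it would report module3 and module4 as still needing extraction.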

There is also a second problem I saw with 32-bit vmcores on 64-bit machine but it may be a subset of this problem.  For more info, see https://bugzilla.redhat.com/show_bug.cgi?id=1437637

Comment 15 Dave Wysochanski 2017-03-31 02:41:18 UTC
https://github.com/abrt/retrace-server/pull/149

Comment 16 Dave Wysochanski 2017-03-31 02:48:17 UTC
*** Bug 1437637 has been marked as a duplicate of this bug. ***

Comment 17 Fedora Update System 2017-03-31 03:47:30 UTC
retrace-server-1.17.0-1.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-9390d60e0d

Comment 18 Fedora Update System 2017-03-31 03:48:53 UTC
retrace-server-1.17.0-1.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-3d55370e77

Comment 19 Fedora Update System 2017-04-03 16:09:48 UTC
retrace-server-1.17.0-1.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Comment 20 Fedora Update System 2017-04-18 21:18:27 UTC
retrace-server-1.17.0-1.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.