Description of problem:
Recently we had a few kernel-debuginfo files get moved from one NFS share to a different location. These were older kernel-debuginfos from released errata kernels, and at least some of them existed in our local extracted kernel store. As a result we saw some recent vmcores get failed by retrace-server because it could no longer find the kernel-debuginfo. Fortunately this problem only affected a few kernel-debuginfo files; if that were not the case, we would have had a much larger outage.

As it turns out, retrace-server is unfortunately not smart enough to look for vmlinux and kernel module debug files in the local extracted kernel-debuginfo location (on x86_64, for example, this is inside CONFIG["RepoDir"]/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64). Instead, for every vmcore that comes in, it must first find a matching kernel-debuginfo file or the task will fail. This means that if we ever lose our NFS share, or kernel-debuginfos get moved in a more widespread fashion, retrace-server will start failing every task that comes in, which would be really bad.

I consider this a bug, but one could argue it is an optimization or even an RFE. The implementation in theory should not be difficult, but it should be tested, since it could possibly lead to a regression when setting up the vmcore for use with retrace-server.

Version-Release number of selected component (if applicable):
retrace-server-1.16-1.el6.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Configure retrace-server so it cannot find kernel-debuginfo packages
2. Find a vmcore with a kernel version matching some already extracted kernel-debuginfo

Actual results:
retrace-server fails the task with "Unable to find debuginfo package" because it cannot find kernel-debuginfo.

Expected results:
retrace-server does not fail the task; the vmlinux and kernel module symbols from the cache are used.

Additional info:
If a task fails due to this, there is a workaround:
1. Download kernel-debuginfo locally to retrace-server inside CONFIG["RepoDir"]/downloaded
2. Restart the retrace-server task:
   $ retrace-server-worker --restart

I mark this medium priority due to the following factors:
- we've only seen this recently, and it has existed for a long time (as far as I know)
- there is a workaround
- it is probably unlikely that many kernel-debuginfos for recent kernels will go missing
- if something happened to our NFS share with kernel-debuginfo, it would be a severe problem

NOTE: This is not an unlikely scenario, since one group maintains our vmcore / retrace-server system, and a different group is in charge of the NFS shares for kernel-debuginfos. While the two groups should be coordinating, it is a clear place where things can break down.

I took a brief look at the code and I think the following code needs to be patched. It's possible it's too hard to do something prior to calling prepare_debuginfo, and instead we'll need some additional code inside prepare_debuginfo. Or we might need to refactor prepare_debuginfo and add a new method prior to calling it.

src/retrace/retrace_worker.py

    def start_vmcore(self, custom_kernelver=None):
        ...
        # no locks required, mock locks itself
        try:
            self.hook_pre_prepare_debuginfo()

            # Look in the cache (CONFIG["RepoDir"]/kernel/) first - if vmlinux
            # exists, we don't need to call prepare_debuginfo.
            # See prepare_debuginfo() and cache_files_from_debuginfo()
            vmlinux = task.prepare_debuginfo(vmcore, cfgdir, kernelver=kernelver,
                                             crash_cmd=task.get_crash_cmd().split())
            self.hook_post_prepare_debuginfo()

            self.hook_pre_retrace()
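The cache check proposed in the comment above could be sketched roughly as follows. This is a minimal illustration only: the CONFIG["RepoDir"] layout is taken from this report, but cached_vmlinux_path() and have_cached_vmlinux() are hypothetical helper names, not retrace-server's actual API.

```python
import os

# Assumption: repo layout as described in this bug report.
CONFIG = {"RepoDir": "/retrace/repos"}

def cached_vmlinux_path(kernelver_str, arch):
    """Expected location of an already-extracted vmlinux, e.g. for
    2.6.32-358.el6.x86_64 on x86_64:
    /retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux
    """
    return os.path.join(CONFIG["RepoDir"], "kernel", arch,
                        "usr/lib/debug/lib/modules", kernelver_str, "vmlinux")

def have_cached_vmlinux(kernelver_str, arch):
    # If this returns True, the expensive find_kernel_debuginfo() lookup
    # (and the NFS share) could in principle be skipped entirely.
    return os.path.isfile(cached_vmlinux_path(kernelver_str, arch))
```

The point of the sketch is only that the candidate path is computable from the kernel version and arch alone, without consulting the kernel-debuginfo RPM.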
Looking at this a bit more, I think this bug will definitely require refactoring prepare_debuginfo().
Not surprisingly, this might not be very easily fixable now that I look at it. Unfortunately retrace-server uses the path inside the kernel-debuginfo file to store the vmlinux file locally. This means even validating the existence of the vmlinux file depends on the kernel-debuginfo file existing.

    def prepare_debuginfo(self, vmcore, chroot=None, kernelver=None, crash_cmd=["crash"]):
        log_info("Calling prepare_debuginfo with crash_cmd = " + str(crash_cmd))
        if kernelver is None:
            kernelver = get_kernel_release(vmcore, crash_cmd)

        if kernelver is None:
            raise Exception, "Unable to determine kernel version"

        # Ideally we want to avoid this call. We should be able to,
        # since we have the kernel version.
        debuginfo = find_kernel_debuginfo(kernelver)
        if not debuginfo:
            raise Exception, "Unable to find debuginfo package"

        if "EL" in kernelver.release:
            if kernelver.flavour is None:
                pattern = "EL/vmlinux"
            else:
                pattern = "EL%s/vmlinux" % kernelver.flavour
        else:
            pattern = "/vmlinux"

        vmlinux_path = None
        debugfiles = {}
        child = Popen(["rpm", "-qpl", debuginfo], stdout=PIPE)
        lines = child.communicate()[0].splitlines()
        for line in lines:
            if line.endswith(pattern):
                # vmlinux_path depends on the output of 'rpm -qpl' above
                vmlinux_path = line
                continue

            match = KO_DEBUG_PARSER.match(line)
            if not match:
                continue

            # only pick the correct flavour for el4
            if "EL" in kernelver.release:
                if kernelver.flavour is None:
                    pattern2 = "EL/"
                else:
                    pattern2 = "EL%s/" % kernelver.flavour

                if not pattern2 in os.path.dirname(line):
                    continue

            # '-' in file name is transformed to '_' in module name
            debugfiles[match.group(1).replace("-", "_")] = line

        debugdir_base = os.path.join(CONFIG["RepoDir"], "kernel", kernelver.arch)
        if not os.path.isdir(debugdir_base):
            os.makedirs(debugdir_base)

        # vmlinux_path depends on kernel-debuginfo; if vmlinux did not depend
        # on vmlinux_path, we could move the isfile() check higher
        vmlinux = os.path.join(debugdir_base, vmlinux_path.lstrip("/"))
        if not os.path.isfile(vmlinux):
            cache_files_from_debuginfo(debuginfo, debugdir_base, [vmlinux_path])
            if not os.path.isfile(vmlinux):
                raise Exception, "Caching vmlinux failed"
The more I thought about this, it's possible earlier versions of crash required the path to the vmlinux or modules to match what was in the kernel-debuginfo file. Other than that, I'm not sure why retrace-server's local cache of the extracted kernel-debuginfo depends on the path format of the files in kernel-debuginfo.
FWIW, this has the potential to impact us now, since eng-ops (IT) is moving the share where all the kernel-debuginfos are stored. As a result, I think our production retrace-server may be offline during their outage, even though most of the debuginfos used for incoming vmcores are already extracted.
Created attachment 1230977 [details] 0001-Avoid-circular-dependency-on-kernel-debuginfo-for-vm.patch
https://github.com/abrt/retrace-server/pull/138
The patch has been merged upstream, which fixes the most common cases. Unfortunately, older kernels (RHEL5, etc.) are not found in the cache due to differences in kernel-debuginfo. I am looking into fixing this, but right now I don't have another patch that fixes all cases.
OK, so here are the remaining problems I see with the existing patch / code. I should have another patch soon to address all known cases. This code is a good example of why the detection code needs to be pulled out and tested separately.

1. With RHEL5, the 'Arch' is not present in the kernel-debuginfo path to vmlinux. Where we look:

    2017-02-22 04:22:11 Version: '2.6.18'; Release: '412.el5'; Arch: 'x86_64'; Flavour: 'None'; Realtime: False
    2017-02-22 04:22:11 Unable to find cached vmlinux at path: /retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-412.el5.x86_64/vmlinux - searching for kernel-debuginfo package

   The correct path should be:

    /retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-412.el5/vmlinux

2. On RHEL5, we need to add the 'Flavour' for some unusual vmcores:

    2017-02-22 04:22:15 Version: '2.6.18'; Release: '194.el5'; Arch: 'i386'; Flavour: 'PAE'; Realtime: False
    2017-02-22 04:22:15 Unable to find cached vmlinux at path: /retrace/repos/kernel/i386/usr/lib/debug/lib/modules/2.6.18-194.el5.i386.PAE/vmlinux - searching for kernel-debuginfo package

   The correct path should be:

    /retrace/repos/kernel/i386/usr/lib/debug/lib/modules/2.6.18-194.el5PAE/vmlinux

3. The code does some 'fixup' with the 'pattern' variable and the 'Flavour'. I'm not sure this is correct, since on RHEL4 we end up with:

    2017-02-22 04:22:11 Version: '2.6.9'; Release: '89.0.7.EL'; Arch: 'i386'; Flavour: 'hugemem'; Realtime: False
    2017-02-22 04:22:11 Unable to find cached vmlinux at path: /retrace/repos/kernel/i386/usr/lib/debug/lib/modules/2.6.9-89.0.7.EL.i386.hugememELhugemem/vmlinux - searching for kernel-debuginfo package

   The correct path should be:

    /retrace/repos/kernel/i386/usr/lib/debug/lib/modules/2.6.9-89.0.7.ELhugemem/vmlinux

   Again, we omit the 'Arch' on RHEL4, but we need the 'Flavour'.

Here's the odd piece of code inside prepare_debuginfo:

    if "EL" in kernelver.release:
        if kernelver.flavour is None:
            # this looks wrong; won't we have "ELEL/vmlinux" for some vmcores?
            pattern = "EL/vmlinux"
        else:
            pattern = "EL%s/vmlinux" % kernelver.flavour
    else:
        pattern = "/vmlinux"
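The three path variants above can be captured in one small helper. The following is a sketch inferred purely from the log excerpts in this comment (the function name and the release-detection heuristic are mine, not code from either pull request):

```python
def cached_modules_dirname(version, release, arch, flavour=None):
    """Directory name under .../usr/lib/debug/lib/modules/ where vmlinux
    lives, per the paths observed in this bug:
      - RHEL4/RHEL5 ("EL" / "el5" releases): no arch suffix; the flavour is
        appended directly to the release, e.g. 2.6.18-194.el5PAE and
        2.6.9-89.0.7.ELhugemem
      - RHEL6+: arch suffix, e.g. 2.6.32-358.el6.x86_64
    """
    if "EL" in release or release.endswith("el5"):
        name = "%s-%s" % (version, release)
        if flavour:
            name += flavour  # flavour glued onto the release, no separator
        return name
    return "%s-%s.%s" % (version, release, arch)
```

Having this logic in a standalone, testable function is exactly the kind of separation argued for above.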
OK, posted the latest patch, which handles all kernel-debuginfo variants: https://github.com/abrt/retrace-server/pull/145
retrace-server-1.17.0-1.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-3d55370e77
retrace-server-1.17.0-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-ffb8a84c9c
retrace-server-1.17.0-1.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-9390d60e0d
retrace-server-1.17.0-1.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-ffb8a84c9c
This is not fixed, or at least it can cause regressions where kernel modules fail to get extracted. I'm not sure why I missed this earlier; it is fairly obvious. If you have two vmcores from the same kernel, but they have a different set of modules loaded:

vmcoreA: module1, module2
vmcoreB: module1, module2, module3, module4

If vmcoreA gets submitted first, both the vmlinux and its kernel modules will get extracted. When vmcoreB gets submitted, it will find the vmlinux file but incorrectly assume all modules exist, and it returns early from prepare_debuginfo.

The fix will be to avoid returning early and instead look inside the cache area for the kernel modules, similar to the vmlinux file. I'll have to refactor the code for this.

There is also a second problem I saw with 32-bit vmcores on a 64-bit machine, but it may be a subset of this problem. For more info, see https://bugzilla.redhat.com/show_bug.cgi?id=1437637
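The per-file check described above could look something like this. A sketch only: the helper name, its signature, and the calling convention are assumptions for illustration, not the actual retrace-server code.

```python
import os

def missing_debug_files(debugdir_base, required_paths):
    """Given the cache root (e.g. CONFIG["RepoDir"]/kernel/<arch>) and the
    debug-file paths a vmcore needs (as stored inside kernel-debuginfo,
    e.g. /usr/lib/debug/.../module3.ko.debug), return the paths not yet
    extracted, so only those need to be fetched from kernel-debuginfo.
    An empty result means no early return is needed at all."""
    return [p for p in required_paths
            if not os.path.isfile(os.path.join(debugdir_base, p.lstrip("/")))]
```

With a check like this, the vmcoreB scenario above would report module3 and module4 as missing instead of silently assuming the whole set is cached.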
https://github.com/abrt/retrace-server/pull/149
*** Bug 1437637 has been marked as a duplicate of this bug. ***
retrace-server-1.17.0-1.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-9390d60e0d
retrace-server-1.17.0-1.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-3d55370e77
retrace-server-1.17.0-1.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.
retrace-server-1.17.0-1.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.