Description of problem:

I noticed this problem when a very large vmcore was being retraced. Someone ran 'retrace-server-interact' on it while the retrace was in a later stage, one where the kernel version had clearly already been detected, since crash commands were being run against the vmlinux file. Unfortunately, the user's command ended up firing off 'strings' on the very large vmcore again, and there was no kernelver file in the task directory.

I looked at the code and the bug is fairly obvious. When you queue a task, the call chain goes:

    RetraceWorker:start()
        download_remote()        <assuming you execute makedumpfile>
        strip_vmcore()
        prepare_debuginfo()
            get_kernel_release()
            find_kernel_debuginfo()
            ....
        start_vmcore()
            set_kernelver()

But at no point inside prepare_debuginfo() do we actually set the kernelver file. If the rest of prepare_debuginfo() takes a long time, the kernelver file is not written for a long time, even though we already know what the kernelver is. The only place we call set_kernelver() is inside retrace_worker.py: start_vmcore():

    def start_vmcore(self, custom_kernelver=None):
        ...
        if custom_kernelver is not None:
            kernelver = custom_kernelver
            kernelver_str = custom_kernelver.kernelver_str
        else:
            kernelver = get_kernel_release(vmcore, task.get_crash_cmd().split())
            if not kernelver:
                raise Exception("Unable to determine kernel version")
            log_debug("Determined kernel version: %s" % kernelver)
            kernelver_str = kernelver.kernelver_str

        task.set_kernelver(str(kernelver))

As a result, if someone runs retrace-server-interact while the original retrace is still going on, they hit this code:

    elif task.get_type() == TASK_VMCORE_INTERACTIVE:
        vmcore = os.path.join(task.get_savedir(), "crash", "vmcore")
        if task.has_kernelver():
            kernelver = KernelVer(task.get_kernelver())
        else:
            kernelver = get_kernel_release(vmcore, task.get_crash_cmd().split())

And off to the races we go with "crash --osrelease" and, more damaging, "strings -n" on the vmcore (which may take hours on a very large vmcore).

We can fix this, but we also need to fix up prepare_debuginfo() to make sure we write the vmlinux file as fast as possible. Otherwise, even if someone gets past the get_kernel_release() above, they may still end up calling prepare_debuginfo():

    if args.action == "crash":
        if not task.use_mock(kernelver):
            crash_cmd = task.get_crash_cmd().split()
            if task.has_vmlinux():
                vmlinux = task.get_vmlinux()
            else:
                vmlinux = task.prepare_debuginfo(vmcore, kernelver=kernelver,
                                                 crash_cmd=crash_cmd)
                task.set_crash_cmd(' '.join(crash_cmd))
            if task.has_crashrc():
                cmdline = task.get_crash_cmd().split() + ["-i", task.get_crashrc_path(),
                                                          vmcore, vmlinux]
            else:
                cmdline = task.get_crash_cmd().split() + [vmcore, vmlinux]

Version-Release number of selected component (if applicable):
retrace-server-1.17.0.111+gb07f-1.el6.noarch

How reproducible:
Every time.

Steps to Reproduce:
1. Submit a large vmcore through FTP.
2. Wait for it to download and get far enough into the retrace process toward running crash (get_kernel_release() completed, but still inside prepare_debuginfo()).
3. Run 'retrace-server-interact <task> crash'.

Actual results:
The retrace-server-interact command blocks while detecting the kernel version.

Expected results:
The retrace-server-interact command should not block detecting the kernel version; assuming the vmlinux file exists in the cache, it should use that vmlinux file and run crash even if the full prepare_debuginfo() has not completed.

Additional info:
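The gist of the proposed fix can be sketched as follows. This is a minimal, self-contained illustration with hypothetical helper names (the real code lives in the RetraceTask/RetraceWorker classes): persist the kernelver file the moment the version is known, before the long-running debuginfo work, rather than at the end in start_vmcore().

```python
import os

def detect_kernel_release(vmcore):
    # Stand-in for the expensive detection path ("crash --osrelease",
    # falling back to "strings -n" over the whole dump).
    raise NotImplementedError

def set_kernelver(savedir, kernelver):
    # Persist the version so a concurrent retrace-server-interact can
    # read it (task.has_kernelver()) instead of re-detecting it.
    with open(os.path.join(savedir, "kernelver"), "w") as f:
        f.write(str(kernelver))

def prepare_debuginfo(savedir, vmcore, kernelver=None):
    if kernelver is None:
        kernelver = detect_kernel_release(vmcore)
    # Write the kernelver file immediately, before the slow debuginfo
    # download/extraction that follows.
    set_kernelver(savedir, kernelver)
    # ... find_kernel_debuginfo(), vmlinux extraction, module caching ...
    return kernelver
```

With this ordering, an interactive command started mid-retrace finds the kernelver file already in place and never fires off 'strings' on the vmcore again.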
This seems like quite a corner case but also like a simple fix (if I understood it correctly). We only need to save the kernel version as soon as possible, right?
Created attachment 1358982 [details] patch under test to resolve this bug
(In reply to Matej Marušák from comment #2)
> This seems like quite a corner case but also like a simple fix (If I
> understood it correctly).
> We only need to save the kernelversion as soon as possible, right?

This actually comes up much more often than you'd think, due to a few factors:
- makedumpfile can take a long time
- crash can take a long time (not just due to file size, but certain memory configurations, etc.), and we run crash multiple times on each vmcore (once for the modules, once for each command output, etc.)

It's mostly a simple fix, yes, though there are multiple places to change, and we need to set both kernelver and vmlinux to be most useful. I'm testing the patch in comment #3 now and it seems to work really well.

For the abrt use case (non-interactive), do things like the modules even matter? I am wondering about other refactorings (beyond the scope of this bug) that would greatly simplify the "base retrace" into just kernel version detection and vmlinux file extraction. Everything else could then be put into an "interactive mode" type of secondary processing. A lot of people just care whether they can run crash; things like the modules and the commands are secondary.
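The interactive half of the patch idea can be sketched the same way. This is a hypothetical helper (the real code path uses task.has_vmlinux()/task.get_vmlinux()): prefer the vmlinux already cached by the original retrace, and only fall back to the expensive prepare_debuginfo() when it is missing.

```python
import os

def pick_vmlinux(savedir, prepare_debuginfo):
    # If the original retrace already wrote the vmlinux file, use it
    # directly; otherwise fall back to the slow extraction path.
    vmlinux = os.path.join(savedir, "crash", "vmlinux")
    if os.path.isfile(vmlinux):
        return vmlinux
    return prepare_debuginfo()
```

The earlier the base retrace writes vmlinux, the more often the interactive path takes the fast branch.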
The patch in comment #3 appears to contain a critical bug: it no longer extracts modules for existing kernels. I hit this problem in a prior patchset too, due to how prepare_debuginfo() is written and the subtleties of the related code. I'll see if I can track it down. Unfortunately, this is a fairly fragile area of code and has been for a long time.
OK, I figured it out. This does not actually look like a problem with the patch, but rather with some other experimental patch I was running on the devel / test machine. As it turned out, the permissions of certain extracted debuginfos were not set properly, so when a new module debuginfo needed to be extracted, retrace could not write to the existing directory and silently failed. There should probably be an error log inside cache_files_from_debuginfo(), since it apparently fails silently.
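As a sketch of the missing diagnostic, something like the following inside the extraction path would have surfaced the permission problem immediately (hypothetical function and message; the real routine is cache_files_from_debuginfo()):

```python
import logging
import os

log = logging.getLogger("retrace")

def cache_debuginfo_file(destdir, name, data):
    # If the target directory is not writable (e.g. owned by the wrong
    # user after the race described in this bug), log an error instead
    # of failing silently.
    if not os.access(destdir, os.W_OK):
        log.error("cannot extract %s: %s is not writable "
                  "(check owner/permissions)", name, destdir)
        return False
    with open(os.path.join(destdir, name), "wb") as f:
        f.write(data)
    return True
```

Note that os.access() checks against the real uid/gid, which is the usual caveat with this approach; logging the failure is the important part.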
Here's a more detailed explanation of what happened in comments #5 and #6. There is a bug / race condition that can occur when a vmcore is in the middle of being retraced for a new kernelver (which requires creating a new directory structure under 'repos') and, in parallel, someone runs retrace-server-interact to load the vmcore. If the vmlinux file is not there, rpm2cpio runs as the interactive user, so the directory can end up created with the user and group of whoever ran the retrace-server-interact command, not retrace-server itself. The patch in comment #3 closes that race window significantly, since it writes the vmlinux file as soon as possible, but I don't think it fixes it completely (it's still possible that the user's rpm2cpio will win the race). The race bug has existed for a long time, though, and is a separate problem. Checking our production retrace server, there are no instances of a non-retrace user owning a directory of an extracted kernelver, so it may only have a reasonable chance of occurring on much slower processing machines (such as my devel / test machine).
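For reference, the ownership check I ran against production can be sketched like this (hypothetical helper; assumes a Unix host and that the server runs as a dedicated user such as "retrace"): list extracted kernelver directories under 'repos' owned by anyone other than the retrace-server user, which would be a sign the interactive user's rpm2cpio won the race.

```python
import os
import pwd

def repos_owned_by_others(repos_dir, expected_user):
    # Return (dirname, owner) pairs for extracted kernelver directories
    # not owned by the retrace-server user.
    bad = []
    for entry in sorted(os.listdir(repos_dir)):
        path = os.path.join(repos_dir, entry)
        if not os.path.isdir(path):
            continue
        owner = pwd.getpwuid(os.stat(path).st_uid).pw_name
        if owner != expected_user:
            bad.append((entry, owner))
    return bad
```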
https://github.com/abrt/retrace-server/pull/164
One more point about the bug / race: the directories in question had older dates on them, so they were not created while testing this patch - they were artifacts of prior patches / testing in my devel / test machine's 'repos' directory. In short, this patch should fix the problem without adding any new ones.
Pull request merged.
retrace-server-1.18.0-1.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-e5741ca105
retrace-server-1.18.0-1.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-bc35ca9028
retrace-server-1.18.0-1.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-bc35ca9028
retrace-server-1.18.0-1.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-e5741ca105
retrace-server-1.18.0-1.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.
retrace-server-1.18.0-1.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.