Bug 1516329 - set_kernelver should be called at the bottom of get_kernel_release upon successful kernelversion detection
Summary: set_kernelver should be called at the bottom of get_kernel_release upon succe...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: retrace-server
Version: el6
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Dave Wysochanski
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-11-22 13:50 UTC by Dave Wysochanski
Modified: 2018-02-20 16:23 UTC (History)
6 users (show)

Fixed In Version: retrace-server-1.18.0-1.fc27 retrace-server-1.18.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-13 17:52:00 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
patch under test to resolve this bug (5.74 KB, patch)
2017-11-25 13:41 UTC, Dave Wysochanski
no flags Details | Diff

Description Dave Wysochanski 2017-11-22 13:50:21 UTC
Description of problem:
I noticed this problem when a very large vmcore was being retraced.  Someone ran 'retrace-server-interact' on it while it was in the later stages of retrace but it the later stage was such that it had clearly detected the kernel version since it was running crash commands on the vmlinux file.  Unfortunately I noticed the users command ended up firing off 'strings' on the very large vmcore again.  Also there was no kernelver file in the task directory.  I looked at the code and the bug is fairly obvious.

When you queue a task the call chain goes 
RetraceWorker:start()
  download_remote()
   <assuming you execute makedumpfile>
    strip_vmcore()
      prepare_debuginfo()
        get_kernel_release()
        find_kernel_debuginfo()
        ....
  start_vmcore()
    set_kernelver()

But at no point inside prepare_debuginfo() do we actually set the kernelver file.  If the rest of prepare_debuginfo takes a long time, the kernelver file is not written for a long time even though we know what the kernelver is.

The only place we call set_kernelver is inside retrace_worker.py: start_vmcore()

    def start_vmcore(self, custom_kernelver=None):
...
        if custom_kernelver is not None:
            kernelver = custom_kernelver
            kernelver_str = custom_kernelver.kernelver_str
        else:
            kernelver = get_kernel_release(vmcore, task.get_crash_cmd().split())
            if not kernelver:
                raise Exception("Unable to determine kernel version")

            log_debug("Determined kernel version: %s" % kernelver)

        kernelver_str = kernelver.kernelver_str

        task.set_kernelver(str(kernelver))


As a result, if someone runs retrace-server-interact while the original retrace is still going in , they hit this code:
    elif task.get_type() == TASK_VMCORE_INTERACTIVE:
        vmcore = os.path.join(task.get_savedir(), "crash", "vmcore")
        if task.has_kernelver():
            kernelver = KernelVer(task.get_kernelver())
        else:
            kernelver = get_kernel_release(vmcore, task.get_crash_cmd().split())

And off to the races we go with "crash --osrelease" and more damaging, "strings -n" on the vmcore (which may take hours if it is a very large vmcore).


We can fix this but we also need to fixup prepare_debuginfo to make sure we write the vmlinux file as fast as possible.  Otherwise even though someone can get past the get_kernel_release() above, they still may end up calling prepare_debuginfo

        if args.action == "crash":
            if not task.use_mock(kernelver):
                crash_cmd = task.get_crash_cmd().split()
                if task.has_vmlinux():
                    vmlinux = task.get_vmlinux()
                else:
                    vmlinux = task.prepare_debuginfo(vmcore, kernelver=kernelver, crash_cmd=crash_cmd)
                    task.set_crash_cmd(' '.join(crash_cmd))
                if task.has_crashrc():
                    cmdline = task.get_crash_cmd().split() + ["-i", task.get_crashrc_path(), vmcore, vmlinux]
                else:
                    cmdline = task.get_crash_cmd().split() + [vmcore, vmlinux]



Version-Release number of selected component (if applicable):
retrace-server-1.17.0.111+gb07f-1.el6.noarch


How reproducible:
everytime

Steps to Reproduce:
1. submit a large vmcore through ftp
2. wait for it to download and get further in the retrace process to running crash (completed get_kernel_release but still inside prepare_debuginfo)
3. run retrace-server-interace <task> crash

Actual results:
retrace-server-interact command will block detecting the kernel version

Expected results:
retrace-server-interact command should not block detecting kernel version, and assuming the vmlinux file exists in the cache, it should use the vmlinux file and run crash even if the full prepare_debuginfo is not completed 


Additional info:

Comment 2 Matej Marušák 2017-11-23 09:10:30 UTC
This seems like quite a corner case but also like a simple fix (If I understood it correctly).
We only need to save the kernelversion as soon as possible, right?

Comment 3 Dave Wysochanski 2017-11-25 13:41:46 UTC
Created attachment 1358982 [details]
patch under test to resolve this bug

Comment 4 Dave Wysochanski 2017-11-25 13:47:02 UTC
(In reply to Matej Marušák from comment #2)
> This seems like quite a corner case but also like a simple fix (If I
> understood it correctly).
> We only need to save the kernelversion as soon as possible, right?

This actually comes up much more than you'd think, due to a few factors:
- makedumpfile can take a long time
- crash can take a long time (not just due to large file, but certain memory configurations, etc) and we run crash multiple times on each vmcore (one for the modules, one for each command output, etc)

It's mostly a simple fix yes, though there's multiple places and we need to set both kernelver and vmlinux to be the most useful.  I'm testing the patch in comment #3 now and it seems to work really well.

For abrt use case (non-interactive), do things like the modules even matter?  I am wondering about other refactorings (beyond the scope of this bug) that would greatly simplify the "base retrace" into just kerneversion detection and vmlinux file extraction.  Then everything else can be put into a "interactive mode" type of secondary processing.  A lot of people just care if they can run crash, and things like the modules and the commands are secondary.

Comment 5 Dave Wysochanski 2017-11-27 15:42:33 UTC
The patch in comment #3 appears to contain a critical bug where it does not extract modules anymore for existing kernels.  I had this problem in a prior patchset due to how prepare_debuginfo is written and the subtleties of the related code.  I'll see if I can track it down.  Unfortunately this is a fairly fragile area of code and has been this way a long time.

Comment 6 Dave Wysochanski 2017-11-27 16:29:10 UTC
Ok I figured it out - actually this does not look like a problem with the patch but probably some other experimental patch that I was running on the devel / test machine.  As it turned out, the permissions of certain extracted debuginfos were not set properly and as a result, if a new module debuginfo needed extracted, retrace could not write to the existing directory and silently failed.  Probably there should be an error log inside cache_files_from_debuginfo() since apparently this silently fails.

Comment 7 Dave Wysochanski 2017-11-27 16:43:22 UTC
Here's a more detailed explanation of what happened in comment #5 and #6.  There is a bug / race condition that can occur if a vmcore is in the middle of being retraced for a new kernelver (which requires creation of a new directory structure under 'repos') and in parallel someone runs retrace-server-interact to load the vmcore.  If the vmlinux file is not there, it will run the rpm2cpio command as the user and the directory could be created with the user and group of the user that ran the retrace-server-interact command, not retrace-server itself.

The patch in comment #3 will close that race window significantly, since it writes the vmlinux file ASAP, but not fix it completely I don't think (it's still possible that the user's started rpm2cpio will win the race).   The race bug has existed a long time though and is a separate problem.  Checking our production retrace there's no instances of a non-retrace user owning a directory of an extracted kernelver so it may only have a reasonable chance of occurring on much slower processing machines (such as my devel / test machine).

Comment 8 Dave Wysochanski 2017-11-27 17:46:37 UTC
https://github.com/abrt/retrace-server/pull/164

Comment 9 Dave Wysochanski 2017-11-27 17:58:32 UTC
One more point about the bug / race.  The directories in question had older dates on them so they were not created in testing of this patch - they were artifacts of prior patches / testing on my devel / test machine's 'repos' directory.  In short, this patch should fix the problem and not add any new problems.

Comment 10 Dave Wysochanski 2018-01-17 15:34:02 UTC
Pull request merged.

Comment 11 Fedora Update System 2018-02-01 15:51:56 UTC
retrace-server-1.18.0-1.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-e5741ca105

Comment 12 Fedora Update System 2018-02-01 15:52:29 UTC
retrace-server-1.18.0-1.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-bc35ca9028

Comment 13 Fedora Update System 2018-02-02 18:24:33 UTC
retrace-server-1.18.0-1.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-bc35ca9028

Comment 14 Fedora Update System 2018-02-02 18:47:50 UTC
retrace-server-1.18.0-1.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-e5741ca105

Comment 15 Fedora Update System 2018-02-13 17:52:00 UTC
retrace-server-1.18.0-1.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.

Comment 16 Fedora Update System 2018-02-20 16:23:56 UTC
retrace-server-1.18.0-1.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.