|Summary:||crash-7.0.1-0.1.x86_64 built on RHEL6.4 system hung for 4 hours consuming CPU on ~2GB x86_64 vmcore|
|Product:||Red Hat Enterprise Linux 6||Reporter:||Dave Wysochanski <dwysocha>|
|Component:||crash||Assignee:||Dave Anderson <anderson>|
|Status:||CLOSED ERRATA||QA Contact:||Qiao Zhao <qzhao>|
|Version:||6.4||CC:||dwysocha, hamiller, harshula, jherrman, kwalker, rvokal, stalexan, streeter, tlavigne|
|Fixed In Version:||crash-7.1.0-1.el6||Doc Type:||Bug Fix|
On certain system configurations, the "kmem -f", "kmem -F", and "kmem [address]" command options previously took a very long time to complete. This update increases the internal hash queue size used to store the address of each free page, and streamlines the free page search to only check the NUMA node that contains a specified address. As a result, the mentioned "kmem" options no longer have a negative impact on performance.
|Last Closed:||2015-07-22 06:27:03 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
Description Dave Wysochanski 2014-03-07 15:36:22 UTC
Description of problem:
Yesterday we had a vmcore come in from a customer system which had 256GB of RAM, but the stripped core was only ~2GB. The Red Hat engineer submitted the vmcore to our retrace-server system to strip and detect the proper kernel-debuginfo for the core. About 4 hours later, I got a ping from the engineer stating he thought his retrace-server task was hung. I logged into our retrace-server machine and saw a crash process which had been initiated by retrace-server still running at 100% CPU. It had been running for 4 hours. I gave instructions to the admin of the machine to core the crash process so we can examine it, as well as the vmcore which seemed to trigger the problem. I will put in a separate comment with the location of all the data.

Unfortunately this may be a hard-to-reproduce problem. I took the same core and set it up manually (extracted debuginfo, etc.) and successfully ran crash on it without any hang.

Version-Release number of selected component (if applicable):
crash-7.0.1-0.1.x86_64
NOTE: The system running crash is a RHEL6.4 system, but we've built the latest crash-7.x from the srpm. I'm not sure if this matters, but I wanted to make it clear.

How reproducible:
Unsure.

Steps to Reproduce:
Unsure. May be able to trigger this with the specific vmcore and the retrace-server version on optimus.

Actual results:
crash session goes to 100% CPU and does not come back.

Additional info:
I may have gotten the release wrong for this bug. I was not sure since we're running a RHEL6.4 box with a crash-7.x version. Feel free to change it if you feel it is wrong.

I have seen one other instance on this same system where crash processes go to 100% CPU and don't want to come back, though I'm not sure it is the same problem. I use 'screen' often along with my crash sessions. While in my crash session, I may do a "CTRL-Z" to get to a shell prompt, do some command, and then 'fg' to get back. Occasionally I have done the "CTRL-Z" step and then left my screen session without getting back into crash. I have seen that this sometimes leads to a crash session going to 100% CPU, because someone asked me about a runaway crash session of mine. This may be a separate but similar bug. Unfortunately I do not have a core of this sort of hang right now. If it happens again, I'll get a core and either attach it to this bug or open a new bug.
Comment 4 Dave Anderson 2014-03-07 16:20:14 UTC
Yeah, I'm not sure how to debug this. It's simply doing a readline() library call for a command entry, which should just block in the kernel in a read system call:

Program terminated with signal 11, Segmentation fault.
#0  0x00007f7543cc3630 in __read_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6_4.5.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00007f7543cc3630 in __read_nocancel () from /lib64/libc.so.6
#1  0x0000000000762d57 in rl_getc (stream=0x7f7543f756a0) at input.c:471
#2  0x00000000007631f6 in rl_read_key () at input.c:448
#3  0x0000000000750725 in readline_internal_char () at readline.c:517
#4  0x0000000000750b46 in readline_internal_charloop (prompt=<value optimized out>) at readline.c:579
#5  readline_internal (prompt=<value optimized out>) at readline.c:593
#6  readline (prompt=<value optimized out>) at readline.c:342
#7  0x000000000051fb85 in process_command_line () at cmdline.c:116
#8  0x0000000000461295 in main_loop () at main.c:707
#9  0x000000000066ef23 in captured_command_loop (data=0x0) at main.c:258
#10 0x000000000066d7f4 in catch_errors (func=0x66ef10 <captured_command_loop>, func_args=0x0, errstring=0x8a2388 "", mask=6) at exceptions.c:557
#11 0x000000000066fb67 in captured_main (data=<value optimized out>) at main.c:1064
#12 0x000000000066d7f4 in catch_errors (func=0x66f000 <captured_main>, func_args=0x7fff769039d0, errstring=0x8a2388 "", mask=6) at exceptions.c:557
#13 0x000000000066ed24 in gdb_main (args=0x7fff76903647) at main.c:1079
#14 0x000000000066ed66 in gdb_main_entry (argc=<value optimized out>, argv=0x7fff76903647) at main.c:1099
#15 0x0000000000465045 in main (argc=5, argv=0x7fff76904cb8) at main.c:613
(gdb)

And if a CTRL-z is entered, the process should block (call schedule) in do_signal_stop(). But I have no clue about whatever retrace-server or screen might be doing when they are the standard input. Maybe they're causing the read() to return constantly with zero data or something like that?
Comment 5 Dave Anderson 2014-03-07 16:50:30 UTC
> The Red Hat engineer submitted the vmcore to our retrace-server system
> to strip and detect the proper kernel-debuginfo for the core.

What does the "strip and detect" procedure do?
Comment 6 Dave Wysochanski 2014-03-07 23:09:52 UTC
(In reply to Dave Anderson from comment #5)
> > The Red Hat engineer submitted the vmcore to our retrace-server system
> > to strip and detect the proper kernel-debuginfo for the core.
>
> What does the "strip and detect" procedure do?

The 'strip' procedure just runs makedumpfile on the vmcore. The 'detect' part runs 'crash --osrelease' to try to get the kernel version. If that fails, I think it tries 'strings' or something on the vmcore to try to pull a kernel version out. I don't think either one of these was hung.

And as it turns out, I think it is reproducible with this vmcore. I submitted it to our development retrace-server system, and it looks hung in the same manner. It is still running, in fact, on that machine.
Comment 9 Dave Anderson 2014-03-08 15:57:06 UTC
[anderson@optimus ~]$ crash -s /retrace/tasks/180670073/crash/vmcore /retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.6.2.el6.x86_64/vmlinux
crash: /retrace/tasks/180670073/crash/vmcore: No such file or directory

Usage:
  crash [OPTION]... NAMELIST MEMORY-IMAGE  (dumpfile form)
  crash [OPTION]... [NAMELIST]             (live system form)

Enter "crash -h" for details.
[anderson@optimus ~]$

FWIW, yesterday I did try the original vmlinux/vmcore pair reported in comment #1 in a standalone interactive manner, and as expected there was no problem. I didn't try the vmcore on flopbox, but I highly doubt that it would fail while being invoked interactively on the command line.

I don't know why a particular vmlinux/vmcore pair would cause this to happen, but I also don't know how crash is being invoked by the retrace-server script (or whatever it is). Is it piping input into the crash process? Or is it using the "-i <input-file>" command line option? Can you show the invocation?
Comment 10 Dave Anderson 2014-03-08 16:10:04 UTC
> The core should also still be on flopbox as this filename: 01050648.vmcore.gz

$ crash -s --osrelease 01050648.vmcore
2.6.32-358.6.2.el6.x86_64
$ crash -s /tmp/vmlinux 01050648.vmcore
crash> q
$ crash /tmp/vmlinux 01050648.vmcore

crash 7.0.5
Copyright (C) 2002-2014  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /tmp/vmlinux
    DUMPFILE: 01050648.vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Thu Mar  6 04:28:15 2014
      UPTIME: 18 days, 20:09:38
LOAD AVERAGE: 0.04, 0.05, 0.11
       TASKS: 848
    NODENAME: fl08lxccs01
     RELEASE: 2.6.32-358.6.2.el6.x86_64
     VERSION: #1 SMP Tue May 14 15:48:21 EDT 2013
     MACHINE: x86_64  (2660 Mhz)
      MEMORY: 256 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 31382
     COMMAND: "bash"
        TASK: ffff8810652f2040  [THREAD_INFO: ffff88050ab1a000]
         CPU: 3
       STATE: TASK_RUNNING (PANIC)

crash>
Comment 12 Michal Toman 2014-03-10 10:05:03 UTC
I've tried to run the commands that Retrace Server uses and all but 'kmem -f' return quickly. 'kmem -f' seems to trigger the problem (I've killed it after ~15 minutes, it was using 100% CPU during that time).
Comment 13 Dave Anderson 2014-03-10 15:07:47 UTC
(In reply to Michal Toman from comment #12)
> I've tried to run the commands that Retrace Server uses and all but 'kmem
> -f' return quickly. 'kmem -f' seems to trigger the problem (I've killed it
> after ~15 minutes, it was using 100% CPU during that time).

The "kmem -f" command is somewhat labor-intensive, especially on a system that has 256GB of memory like in 01050648.vmcore, most of which is sitting on the various free list chains. The page lists (often containing multiple millions of pages each) are walked through, verified for correctness, and hashed internally to check for duplicates, to catch wrap-arounds/corruptions and to prevent the command from entering an infinite loop.

I started a "kmem -f" on 01050648.vmcore at around 9:00 EST, and it's still dutifully churning through the free lists at 11:00 EST. I'm predicting that it will eventually finish successfully. You guys might want to re-think the need for running that particular command by default in your retrace-server. There's not much I can do about that.

That being said, the core file you gave me captured the crash utility in a readline library call, where it was gathering a crash command. In the case of an onerous "kmem -f" operation, if you had taken a core dump, you wouldn't see it operating there, i.e., because it would have read the "kmem -f" string from whatever the input source was, and then spent whatever time it takes to complete the command. So I don't even know if we're talking about the same issue here.
Comment 14 Dave Anderson 2014-03-10 18:27:23 UTC
> The "kmem -f" command is somewhat labor-intensive, especially on a system
> that has 256GB of memory like in 01050648.vmcore, most of which is sitting
> on the various free list chains. The page lists (often containing multiple
> millions of pages each) are walked through, verified for correctness, and
> hashed internally to check for duplicates to catch wrap-arounds/corruptions
> and to prevent the command from entering an infinite loop.

It's the internal hashing that's consuming most of the time that "kmem -f" is running. Every page found is hashed, and when walking through each subsequent free page, its associated hash chain is searched for duplicates. Currently there are only 128 hash chain heads being used. (I wrote the hashing code back when 256 *MB* was a normal system size, never mind 256 *GB*.)

I can look into seeing how much of an increase in hash chain heads would alleviate the problem on a system of this size, but it's still going to take a lot of time, given the sheer number of free pages that are going to be hashed. And I don't want to go crazy allocating huge amounts of static buffer space. I'll tinker around with this and see what might be useful.

In the meantime, do this in your script:

  ...
  set hash off
  kmem -f
  set hash on
  ...

Doing that, "kmem -f" takes about 5 minutes on 01050648.vmcore.
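The chained-hash duplicate check described above can be sketched in C. This is a minimal illustration under stated assumptions, not crash's actual code: the structure names, the hash function, and hash_and_check() are all invented for the sketch.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_HASH_QUEUES 128   /* the historical default discussed above */

/* Invented entry type: one node per hashed page address. */
struct hash_entry {
    unsigned long addr;
    struct hash_entry *next;
};

static struct hash_entry *hash_queue[NR_HASH_QUEUES];

/* Returns 0 if addr was already seen (a duplicate, i.e. a wrapped or
 * corrupt list), 1 if it was newly added to its chain. */
static int hash_and_check(unsigned long addr)
{
    unsigned int q = (addr >> 12) % NR_HASH_QUEUES;  /* page-granular hash */
    struct hash_entry *e;

    /* The costly part: every insertion scans its whole chain. */
    for (e = hash_queue[q]; e; e = e->next)
        if (e->addr == addr)
            return 0;

    e = malloc(sizeof(*e));
    if (!e)
        abort();
    e->addr = addr;
    e->next = hash_queue[q];
    hash_queue[q] = e;
    return 1;
}
```

With tens of millions of free-list entries spread across only 128 chains, each duplicate search scans hundreds of thousands of entries, so the overall walk is effectively quadratic; that is the cost the benchmarks in the following comment quantify.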
Comment 15 Dave Anderson 2014-03-11 18:51:30 UTC
> I can look into seeing how much of an increase in hash chain heads
> would alleviate the problem on a system of this size, but it's
> still going to take a lot of time, given the sheer number of free
> pages that are going to be hashed.

Well, I could not be more wrong... Increasing the hash chain head count reduces the time by several orders of magnitude.

This particular system (01050648.vmcore) has 50331705 free pages (~192GB out of 256GB). Now, as a "best-case" benchmark, with hashing turned off, "kmem -f" completes in:

  real 4m42.759s
  user 4m26.065s
  sys  0m16.109s

The crash utility's hash queue head count has always been 128, consuming 1K of memory for just the hash queue heads. And because of its small size, the "kmem -f" command on a huge system of this size takes forever, because the number of objects linked on each of the 128 queues is ridiculously large.

Increasing the hash queue head count from 128 to the following sizes, I get the following results. For each size, here is the time spent, the number of hash queues used, the range of the number of objects per queue, and the memory size consumed by just the hash queue heads:

[4096]
  real 34m04.579s
  user 33m37.106s
  sys  0m22.180s
  queues in use: 4096 of 4096
  queue length range: 3032 to 3478
  size of hash queue heads: 32K

[8192]
  real 19m58.683s
  user 19m32.972s
  sys  0m21.496s
  queues in use: 8192 of 8192
  queue length range: 1423 to 1810
  size of hash queue heads: 64K

[16384]
  real 12m33.526s
  user 12m10.278s
  sys  0m20.318s
  queues in use: 16384 of 16384
  queue length range: 690 to 950
  size of hash queue heads: 128K

[32768]
  real 8m44.229s
  user 8m23.037s
  sys  0m19.295s
  queues in use: 32768 of 32768
  queue length range: 292 to 506
  size of hash queue heads: 256K

[65536]
  real 6m41.822s
  user 6m22.935s
  sys  0m17.784s
  queues in use: 65536 of 65536
  queue length range: 131 to 280
  size of hash queue heads: 512K

[131072]
  real 5m42.404s
  user 5m24.724s
  sys  0m16.801s
  queues in use: 131072 of 131072
  queue length range: 46 to 158
  size of hash queue heads: 1MB

I'm thinking that a reasonable compromise between the benefit of a larger hash queue vs. the increase in the amount of memory required would be a count of 16384, increasing the memory required from 1KB up to 128KB. On the other hand, going to 131072 brings the time remarkably close to not hashing at all, at the expense of using 1MB of memory for just the hash queue heads. So maybe I'm being unnecessarily conservative?

And in certain circumstances, it may be helpful/useful to be able to override the default count, e.g., to handle abnormally large systems like this example. So it may also be beneficial to add a new command line option "--hash <size>" that could be used to override the default.

It should be noted that "kmem -f" is a fairly special case with respect to hash queue usage, at least for the built-in commands. Anyway, given that you guys are my prime customers, I'm interested in your thoughts on the matter -- any comments?
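The benchmark trend above can be checked with simple arithmetic: assuming a uniform hash, the average chain length is just entries divided by queues, and since each insertion scans its chain, halving the per-queue length roughly halves the runtime. The entry count used below (~13 million) is NOT stated in the comment; it is inferred from "4096 of 4096" queues in use holding 3032-3478 objects each, and is a rough assumption (it suggests free page blocks are hashed, not all 50331705 individual pages).

```c
#include <assert.h>

/* Average objects per hash chain, assuming a uniform hash function. */
static unsigned long avg_chain_len(unsigned long entries, unsigned long queues)
{
    return entries / queues;
}

/* Does an assumed total entry count put the computed average inside one
 * of the per-queue ranges measured above? */
static int consistent_with_range(unsigned long entries, unsigned long queues,
                                 unsigned long lo, unsigned long hi)
{
    unsigned long avg = avg_chain_len(entries, queues);
    return lo <= avg && avg <= hi;
}
```

For example, 13,300,000 entries over 4096 queues averages 3247 per chain, inside the measured 3032-3478 range; over the old default of 128 queues the same load averages more than 100,000 entries per chain, which is why the unpatched command ran for hours.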
Comment 16 Guy Streeter 2014-03-11 19:05:28 UTC
I'm not a heavy user of kmem -f. I have occasionally encountered a vmcore on which kmem <address> took a very long time. Is that a related problem?

I don't think using a MB of memory for kmem -f is bad, especially if it can be allocated when needed.

We've already put in a request to have our retrace server stop doing kmem -f by default, so this is not critical for us.
Comment 17 Dave Anderson 2014-03-12 14:30:34 UTC
(In reply to Guy Streeter from comment #16)
> I'm not a heavy user of kmem -f. I have encountered occasionally a vmcore on
> which kmem <address> took a very long time. Is that a related problem?

Yes. The "kmem <address>" option does a series of subsystem checks, and one of them is a check of the free list. I'll take a look to see if it can be stream-lined somewhat.

> I don't think using a MB of memory for kmem -f is bad. Especially if it can
> be allocated when needed.

Unfortunately it's not a when-needed scenario. The default size (or as specified by a new "--hash <size>" option) will be allocated and touched during initialization, because the pid-gathering process uses the hash queue.

> We've already put in a request to have our retrace server stop doing kmem -f
> by default, so this is not critical for us.

OK.
Comment 18 Dave Anderson 2014-03-13 17:15:50 UTC
For now I've decided on a default of 32768:

  Increase the internal hash queue head count from 128 to 32768. The hash
  queue is used for gathering and verifying lists, and the original count
  of 128 may be overwhelmed if a list is extremely large. For example, on
  a 256GB system with 192GB of free pages, the "kmem -f" command takes
  hours to complete; with this patch, the time is reduced to a few minutes.
  In addition, a new command line option "--hash <count>" has been added
  to allow a user to override the default hash queue head count of 32768.
  (email@example.com)

$ git clone git://github.com/crash-utility/crash.git
$ cd crash
$ make lzo
Comment 19 Dave Anderson 2014-03-18 15:15:28 UTC
(In reply to Guy Streeter from comment #16)
> > I'm not a heavy user of kmem -f. I have encountered occasionally a
> > vmcore on which kmem <address> took a very long time. Is that a
> > related problem?
>
> Yes. The "kmem <address>" does a series of subsystem checks, and one
> of them is a check of the free list. I'll take a look to see if it can
> be stream-lined somewhat.

And yes, "kmem [-f] <address>" can be stream-lined fairly significantly:

commit ed28aca0c0281d44713f2e9a1a39169b908b18fd
Author: Dave Anderson <firstname.lastname@example.org>
Date:   Fri Mar 14 10:30:35 2014 -0400

    Fix for the "kmem -F" display and the "kmem -f <address>" or
    "kmem <address>" options. Without the patch, the first page in a
    list of free page blocks is not displayed on its own line by
    "kmem -F", but rather at the end of the line showing the area
    number, block size, and free_area struct address that the page is
    linked to. Due to this error, both "kmem <address>" and
    "kmem -f <address>" would not find a page if it happened to be the
    first page, or any of the contiguous pages in a multi-page block,
    in the list of free pages or page blocks.
    (email@example.com)

commit d3e5664ec6924c2e4d93b47c41fbc46dbf5c816c
Author: Dave Anderson <firstname.lastname@example.org>
Date:   Mon Mar 17 16:42:13 2014 -0400

    Created a new feature for the internal do_list() function if it is
    necessary to immediately perform a function for each entry in a
    list while the list is being traversed. A callback function, and an
    optional callback data pointer, can be registered in the list_data
    structure. The address of each entry in the list along with the
    optional callback data pointer will be passed to the callback
    function. If desired, the callback function may also dictate that
    do_list() should stop the list traversal and return immediately to
    its caller.
    (email@example.com)

commit 32aa6bbbf7becb90ba84d258f7baf19898099e1b
Author: Dave Anderson <firstname.lastname@example.org>
Date:   Mon Mar 17 16:57:48 2014 -0400

    Made the "kmem -f <address>" and "kmem <address>" options more
    efficient by using the new do_list() callback function feature, as
    well as restricting the search to only the NUMA node that contains
    the address.
    (email@example.com)

(In reply to Dave Anderson from comment #18)
> $ git clone git://github.com/crash-utility/crash.git
> $ cd crash
> $ make lzo

A git-pull/rebuild is required for the 3 commits above.
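The do_list() callback idea described in commit d3e5664e can be illustrated with a small sketch. All names here (do_list_walk, list_cb, match_addr) are invented for the example; crash's real list_data structure and do_list() interface differ.

```c
#include <assert.h>
#include <stddef.h>

/* Invented singly-linked list entry for the sketch. */
struct node {
    unsigned long addr;
    struct node *next;
};

/* Callback contract: return nonzero to stop the walk early. */
typedef int (*list_cb)(struct node *entry, void *data);

/* Walks the list, calling cb on each entry as it is traversed; returns
 * the entry the callback stopped on, or NULL if the walk completed. */
static struct node *do_list_walk(struct node *head, list_cb cb, void *data)
{
    for (struct node *n = head; n; n = n->next)
        if (cb(n, data))
            return n;   /* early exit: no need to finish the list */
    return NULL;
}

/* Example callback: stop when an entry matches a searched-for address. */
static int match_addr(struct node *n, void *data)
{
    return n->addr == *(unsigned long *)data;
}
```

Combined with restricting the search to the NUMA node whose memory range contains the address, a "kmem <address>" lookup can stop at the first match instead of walking every node's free lists to completion.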
Comment 20 Dave Wysochanski 2014-04-09 14:41:46 UTC
Thanks Dave A! I know I have run into situations where kmem takes a while, as do others. This set of patches looks like it will improve more than one scenario, and I will see if we can get this tested soon.

Other than the vmcore mentioned in the original filing, where 'kmem -f' took a long time, I wonder if we have a few vmcores which we know are slow with kmem. I will see what I can find, but if others have vmcores they know are slow with 'kmem <addr>' checks, for instance, those might be interesting test cases.
Comment 21 Dave Anderson 2015-02-10 20:28:19 UTC
This issue is fixed in crash-7.1.0-1.el6, which was built today for the rebase approved for the rhel6.7 crash utility errata.
Comment 22 Qiao Zhao 2015-05-12 05:51:31 UTC
(In reply to Dave Anderson from comment #21)
> This issue is fixed in crash-7.1.0-1.el6, which was built today for the
> rebase approved for the rhel6.7 crash utility errata.

OK, set qa_ack+.
Comment 23 Qiao Zhao 2015-06-09 07:30:07 UTC
Hi developer,

RHEL-6.7 has been in snap#4; I think we can move this bug to rhel-6.8. If you want the fix in rhel-6.7, please add the exception and blocker flags.

--
Thanks,
Qiao
Comment 24 Dave Anderson 2015-06-09 12:53:27 UTC
(In reply to Qiao Zhao from comment #23)
> Hi developer,
>
> RHEL-6.7 has been in sanp#4, i think we can move this bug to rhel-6.8.
> If you want fix in rhel-6.7, please add exception and block flags.
>
> --
> Thanks,
> Qiao

See comment #21. It's already fixed in the rhel-6.7 errata version.
Comment 25 Qiao Zhao 2015-06-10 02:37:27 UTC
(In reply to Dave Anderson from comment #24)
> (In reply to Qiao Zhao from comment #23)
> > Hi developer,
> >
> > RHEL-6.7 has been in sanp#4, i think we can move this bug to rhel-6.8.
> > If you want fix in rhel-6.7, please add exception and block flags.
> >
> > --
> > Thanks,
> > Qiao
>
> See comment #21. It's already fixed in the rhel-6.7 errata version.

Sorry, I missed this comment, but this bug has not been added to the rhel-6.7 crash-utility errata: https://errata.devel.redhat.com/advisory/19827.
Comment 26 Dave Anderson 2015-06-10 13:02:54 UTC
(In reply to Qiao Zhao from comment #25)
> (In reply to Dave Anderson from comment #24)
> > See comment #21. It's already fixed in the rhel-6.7 errata version.
>
> Sorry, i miss this comment. but this bug not been added to rhel-6.7
> crash-utility errata https://errata.devel.redhat.com/advisory/19827.

Correct -- it is not possible to add a BZ to an errata unless it has rhel-6.7+.
Comment 27 Dave Anderson 2015-06-10 13:06:28 UTC
And BTW, if the blocker/exception flag is a prerequisite for rhel-6.7+, I don't have permissions to set blocker? or exception?. (I don't even see those two flags when I click "more flags")
Comment 34 errata-xmlrpc 2015-07-22 06:27:03 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1309.html