|Summary:||crash-7.0.1-0.1.x86_64 built on RHEL6.4 system hung for 4 hours consuming CPU on ~2GB x86_64 vmcore|
|Product:||Red Hat Enterprise Linux 6||Reporter:||Dave Wysochanski <dwysocha>|
|Component:||crash||Assignee:||Dave Anderson <anderson>|
|Status:||CLOSED ERRATA||QA Contact:||Qiao Zhao <qzhao>|
|Version:||6.4||CC:||dwysocha, hamiller, harshula, jherrman, kwalker, rvokal, stalexan, streeter, tlavigne|
|Fixed In Version:||crash-7.1.0-1.el6||Doc Type:||Bug Fix|
On certain system configurations, the "kmem -f", "kmem -F", and "kmem [address]" command options previously took a very long time to complete. This update increases the internal hash queue size used to store the address of each free page, and streamlines the free page search to only check the NUMA node that contains a specified address. As a result, the mentioned "kmem" options no longer have a negative impact on performance.
|Last Closed:||2015-07-22 06:27:03 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
Description Dave Wysochanski 2014-03-07 15:36:22 UTC
Description of problem:
Yesterday we had a vmcore come in from a customer system which had 256GB of RAM, but the stripped core was only ~2GB. The Red Hat engineer submitted the vmcore to our retrace-server system to strip and detect the proper kernel-debuginfo for the core. About 4 hours later, I got a ping from the engineer stating he thought his retrace-server task was hung. I logged into our retrace-server machine and saw a crash process which had been initiated by retrace-server still running at 100% CPU. It had been running for 4 hours. I gave instructions to the admin of the machine to core the crash process so we can examine it, as well as the vmcore which seemed to trigger the problem. I will put in a separate comment with the location of all the data.

Unfortunately this may be a hard-to-reproduce problem. I took the same core and set it up manually (extracted debuginfo, etc.) and successfully ran crash on it without any hang.

Version-Release number of selected component (if applicable):
crash-7.0.1-0.1.x86_64
NOTE: The system running crash is a RHEL6.4 system, but we've built the latest crash-7.x from the srpm. I'm not sure if this matters, but I wanted to make it clear.

How reproducible:
Unsure.

Steps to Reproduce:
Unsure. May be able to trigger this with the specific vmcore and the retrace-server version on optimus.

Actual results:
crash session goes to 100% CPU and does not come back.

Additional info:
I may have gotten the release wrong for this bug. I was not sure since we're running a RHEL6.4 box with a crash-7.x version. Feel free to change it if you feel it is wrong.

I have seen one other instance on this same system where crash processes go to 100% CPU and don't want to come back, though I'm not sure it is the same problem. I use 'screen' often along with my crash sessions. While in my crash session, I may do a "CTRL-Z" to get to a shell prompt, do some command, and then 'fg' to get back. Occasionally I have done the "CTRL-Z" step and then left my screen session without getting back into crash. I have seen that this sometimes leads to a crash session going to 100% CPU, because someone asked me about a runaway crash session of mine. This may be a separate but similar bug. Unfortunately I do not have a core of this sort of hang right now. If it happens again, I'll get a core and either attach it to this bug or open a new bug.
Comment 4 Dave Anderson 2014-03-07 16:20:14 UTC
Yeah, I'm not sure how to debug this. It's simply doing a readline() library call for a command entry, which should just block in the kernel in a read system call:

Program terminated with signal 11, Segmentation fault.
#0  0x00007f7543cc3630 in __read_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6_4.5.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00007f7543cc3630 in __read_nocancel () from /lib64/libc.so.6
#1  0x0000000000762d57 in rl_getc (stream=0x7f7543f756a0) at input.c:471
#2  0x00000000007631f6 in rl_read_key () at input.c:448
#3  0x0000000000750725 in readline_internal_char () at readline.c:517
#4  0x0000000000750b46 in readline_internal_charloop (prompt=<value optimized out>) at readline.c:579
#5  readline_internal (prompt=<value optimized out>) at readline.c:593
#6  readline (prompt=<value optimized out>) at readline.c:342
#7  0x000000000051fb85 in process_command_line () at cmdline.c:116
#8  0x0000000000461295 in main_loop () at main.c:707
#9  0x000000000066ef23 in captured_command_loop (data=0x0) at main.c:258
#10 0x000000000066d7f4 in catch_errors (func=0x66ef10 <captured_command_loop>, func_args=0x0, errstring=0x8a2388 "", mask=6) at exceptions.c:557
#11 0x000000000066fb67 in captured_main (data=<value optimized out>) at main.c:1064
#12 0x000000000066d7f4 in catch_errors (func=0x66f000 <captured_main>, func_args=0x7fff769039d0, errstring=0x8a2388 "", mask=6) at exceptions.c:557
#13 0x000000000066ed24 in gdb_main (args=0x7fff76903647) at main.c:1079
#14 0x000000000066ed66 in gdb_main_entry (argc=<value optimized out>, argv=0x7fff76903647) at main.c:1099
#15 0x0000000000465045 in main (argc=5, argv=0x7fff76904cb8) at main.c:613
(gdb)

And if a CTRL-z is entered, the process should block (call schedule) in do_signal_stop(). But I have no clue about whatever retrace-server or screen might be doing when they are the standard input. Maybe they're causing the read() to return constantly with zero data or something like that?
Comment 5 Dave Anderson 2014-03-07 16:50:30 UTC
> The Red Hat engineer submitted the vmcore to our retrace-server system
> to strip and detect the proper kernel-debuginfo for the core.

What does the "strip and detect" procedure do?
Comment 6 Dave Wysochanski 2014-03-07 23:09:52 UTC
(In reply to Dave Anderson from comment #5)
> > The Red Hat engineer submitted the vmcore to our retrace-server system
> > to strip and detect the proper kernel-debuginfo for the core.
>
> What does the "strip and detect" procedure do?

The 'strip' procedure just runs makedumpfile on the vmcore. The 'detect' part runs 'crash --osrelease' to try to get the kernel version. If that fails, I think it tries 'strings' or something on the vmcore to try to pull a kernel version out. I don't think either one of these was hung.

And as it turns out, I think it is reproducible with this vmcore. I submitted it to our development retrace-server system, and it looks hung in the same manner. It is still running, in fact, on that machine.
Comment 9 Dave Anderson 2014-03-08 15:57:06 UTC
[anderson@optimus ~]$ crash -s /retrace/tasks/180670073/crash/vmcore /retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.6.2.el6.x86_64/vmlinux
crash: /retrace/tasks/180670073/crash/vmcore: No such file or directory

Usage:
  crash [OPTION]... NAMELIST MEMORY-IMAGE  (dumpfile form)
  crash [OPTION]... [NAMELIST]             (live system form)

Enter "crash -h" for details.
[anderson@optimus ~]$

FWIW, yesterday I did try the original vmlinux/vmcore pair reported in comment #1 in a standalone interactive manner, and as expected there was no problem. I didn't try the vmcore on flopbox, but I highly doubt that it would fail while being invoked interactively on the command line.

I don't know why a particular vmlinux/vmcore pair would cause this to happen, but I also don't know how crash is being invoked by the retrace-server script (or whatever it is). Is it piping input into the crash process? Or is it using the "-i <input-file>" command line option? Can you show the invocation?
Comment 10 Dave Anderson 2014-03-08 16:10:04 UTC
> The core should also still be on flopbox as this filename: 01050648.vmcore.gz

$ crash -s --osrelease 01050648.vmcore
2.6.32-358.6.2.el6.x86_64
$ crash -s /tmp/vmlinux 01050648.vmcore
crash> q
$ crash /tmp/vmlinux 01050648.vmcore

crash 7.0.5
Copyright (C) 2002-2014  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /tmp/vmlinux
    DUMPFILE: 01050648.vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Thu Mar  6 04:28:15 2014
      UPTIME: 18 days, 20:09:38
LOAD AVERAGE: 0.04, 0.05, 0.11
       TASKS: 848
    NODENAME: fl08lxccs01
     RELEASE: 2.6.32-358.6.2.el6.x86_64
     VERSION: #1 SMP Tue May 14 15:48:21 EDT 2013
     MACHINE: x86_64  (2660 Mhz)
      MEMORY: 256 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 31382
     COMMAND: "bash"
        TASK: ffff8810652f2040  [THREAD_INFO: ffff88050ab1a000]
         CPU: 3
       STATE: TASK_RUNNING (PANIC)

crash>
Comment 12 Michal Toman 2014-03-10 10:05:03 UTC
I've tried to run the commands that Retrace Server uses and all but 'kmem -f' return quickly. 'kmem -f' seems to trigger the problem (I've killed it after ~15 minutes, it was using 100% CPU during that time).
Comment 13 Dave Anderson 2014-03-10 15:07:47 UTC
(In reply to Michal Toman from comment #12)
> I've tried to run the commands that Retrace Server uses and all but 'kmem
> -f' return quickly. 'kmem -f' seems to trigger the problem (I've killed it
> after ~15 minutes, it was using 100% CPU during that time).

The "kmem -f" command is somewhat labor-intensive, especially on a system that has 256GB of memory like in 01050648.vmcore, most of which is sitting on the various free list chains. The page lists (often containing multiple millions of pages each) are walked through, verified for correctness, and hashed internally to check for duplicates, to catch wrap-arounds/corruptions and to prevent the command from entering an infinite loop.

I started a "kmem -f" on 01050648.vmcore at around 9:00 EST, and it's still dutifully churning through the free lists at 11:00 EST. I'm predicting that it will eventually finish successfully. You guys might want to re-think the need for running that particular command by default in your retrace-server. There's not much I can do about that.

That being said, the core file you gave me captured the crash utility in a readline library call, where it was gathering a crash command. In the case of an onerous "kmem -f" operation, if you had taken a core dump, you wouldn't see it operating there, i.e., because it would have read the "kmem -f" string from whatever the input source was, and then spent whatever time it takes to complete the command. So I don't even know if we're talking about the same issue here.
Comment 14 Dave Anderson 2014-03-10 18:27:23 UTC
> The "kmem -f" command is somewhat labor-intensive, especially on a system
> that has 256GB of memory like in 01050648.vmcore, most of which is sitting
> on the various free list chains. The page lists (often containing multiple
> millions of pages each) are walked through, verified for correctness, and
> hashed internally to check for duplicates to catch wrap-arounds/corruptions
> and to prevent the command from entering an infinite loop.

It's the internal hashing that's consuming most of the time that "kmem -f" is running. Every page found is hashed, and when walking through each subsequent free page, its associated hash chain is searched for duplicates. Currently there are only 128 hash chain heads being used. (I wrote the hashing code back when 256 *MB* was a normal system size, never mind 256 *GB*.)

I can look into seeing how much of an increase in hash chain heads would alleviate the problem on a system of this size, but it's still going to take a lot of time, given the sheer number of free pages that are going to be hashed. And I don't want to go crazy allocating huge amounts of static buffer space. I'll tinker around with this and see what might be useful.

In the meantime, do this in your script:

  ...
  set hash off
  kmem -f
  set hash on
  ...

Doing that, "kmem -f" takes about 5 minutes on 01050648.vmcore.
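The chained-hash duplicate check described above can be sketched in C. This is a minimal illustration under stated assumptions, not crash's actual code: the structure names, the hash function, and hash_and_check() are all invented for the sketch.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_HASH_QUEUES 128   /* the historical default discussed above */

/* Invented entry type: one node per hashed page address. */
struct hash_entry {
    unsigned long addr;
    struct hash_entry *next;
};

static struct hash_entry *hash_queue[NR_HASH_QUEUES];

/* Returns 0 if addr was already seen (a duplicate, i.e. a wrapped or
 * corrupt list), 1 if it was newly added to its chain. */
static int hash_and_check(unsigned long addr)
{
    unsigned int q = (addr >> 12) % NR_HASH_QUEUES;  /* page-granular hash */
    struct hash_entry *e;

    /* The costly part: every insertion scans its whole chain. */
    for (e = hash_queue[q]; e; e = e->next)
        if (e->addr == addr)
            return 0;

    e = malloc(sizeof(*e));
    if (!e)
        abort();
    e->addr = addr;
    e->next = hash_queue[q];
    hash_queue[q] = e;
    return 1;
}
```

With tens of millions of free-list entries spread across only 128 chains, each duplicate search scans hundreds of thousands of entries, so the overall walk is effectively quadratic; that is the cost the benchmarks in the following comment quantify.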
Comment 15 Dave Anderson 2014-03-11 18:51:30 UTC
> I can look into seeing how much of an increase in hash chain heads
> would alleviate the problem on a system of this size, but it's
> still going to take a lot of time, given the sheer number of free
> pages that are going to be hashed.

Well, I could not be more wrong... Increasing the hash chain head count reduces the time by several orders of magnitude.

This particular system (01050648.vmcore) has 50331705 free pages (~192GB out of 256GB). Now, as a "best-case" benchmark, with hashing turned off, "kmem -f" completes in:

  real 4m42.759s
  user 4m26.065s
  sys  0m16.109s

The crash utility's hash queue head count has always been 128, consuming 1K of memory for just the hash queue heads. And because of its small size, the "kmem -f" command on a huge system of this size takes forever, because the number of objects linked on each of the 128 queues is ridiculously large.

Increasing the hash queue head count from 128 to the following sizes, I get the following results. For each size, here is the time spent, the number of hash queues used, the range of the number of objects per queue, and the memory size consumed by just the hash queue heads:

[4096]
  real 34m04.579s
  user 33m37.106s
  sys  0m22.180s
  queues in use: 4096 of 4096
  queue length range: 3032 to 3478
  size of hash queue heads: 32K

[8192]
  real 19m58.683s
  user 19m32.972s
  sys  0m21.496s
  queues in use: 8192 of 8192
  queue length range: 1423 to 1810
  size of hash queue heads: 64K

[16384]
  real 12m33.526s
  user 12m10.278s
  sys  0m20.318s
  queues in use: 16384 of 16384
  queue length range: 690 to 950
  size of hash queue heads: 128K

[32768]
  real 8m44.229s
  user 8m23.037s
  sys  0m19.295s
  queues in use: 32768 of 32768
  queue length range: 292 to 506
  size of hash queue heads: 256K

[65536]
  real 6m41.822s
  user 6m22.935s
  sys  0m17.784s
  queues in use: 65536 of 65536
  queue length range: 131 to 280
  size of hash queue heads: 512K

[131072]
  real 5m42.404s
  user 5m24.724s
  sys  0m16.801s
  queues in use: 131072 of 131072
  queue length range: 46 to 158
  size of hash queue heads: 1MB

I'm thinking that a reasonable compromise between the benefit of a larger hash queue vs. the increase in the amount of memory required would be a count of 16384, increasing the memory required from 1KB up to 128KB. On the other hand, going to 131072 brings the time remarkably close to not hashing at all, at the expense of using 1MB of memory for just the hash queue heads. So maybe I'm being unnecessarily conservative?

And in certain circumstances, it may be helpful/useful to be able to override the default count, e.g., to handle abnormally large systems like this example. So it may also be beneficial to add a new command line option "--hash <size>" that could be used to override the default.

It should be noted that "kmem -f" is a fairly special case with respect to hash queue usage, at least for the built-in commands. Anyway, given that you guys are my prime customers, I'm interested in your thoughts on the matter -- any comments?
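The benchmark trend above can be checked with simple arithmetic: assuming a uniform hash, the average chain length is just entries divided by queues, and since each insertion scans its chain, halving the per-queue length roughly halves the runtime. The entry count used below (~13 million) is NOT stated in the comment; it is inferred from "4096 of 4096" queues in use holding 3032-3478 objects each, and is a rough assumption (it suggests free page blocks are hashed, not all 50331705 individual pages).

```c
#include <assert.h>

/* Average objects per hash chain, assuming a uniform hash function. */
static unsigned long avg_chain_len(unsigned long entries, unsigned long queues)
{
    return entries / queues;
}

/* Does an assumed total entry count put the computed average inside one
 * of the per-queue ranges measured above? */
static int consistent_with_range(unsigned long entries, unsigned long queues,
                                 unsigned long lo, unsigned long hi)
{
    unsigned long avg = avg_chain_len(entries, queues);
    return lo <= avg && avg <= hi;
}
```

For example, 13,300,000 entries over 4096 queues averages 3247 per chain, inside the measured 3032-3478 range; over the old default of 128 queues the same load averages more than 100,000 entries per chain, which is why the unpatched command ran for hours.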
Comment 16 Guy Streeter 2014-03-11 19:05:28 UTC
I'm not a heavy user of kmem -f. I have occasionally encountered a vmcore on which kmem <address> took a very long time. Is that a related problem?

I don't think using a MB of memory for kmem -f is bad, especially if it can be allocated when needed.

We've already put in a request to have our retrace server stop doing kmem -f by default, so this is not critical for us.
Comment 17 Dave Anderson 2014-03-12 14:30:34 UTC
(In reply to Guy Streeter from comment #16)
> I'm not a heavy user of kmem -f. I have encountered occasionally a vmcore on
> which kmem <address> took a very long time. Is that a related problem?

Yes. The "kmem <address>" option does a series of subsystem checks, and one of them is a check of the free list. I'll take a look to see if it can be stream-lined somewhat.

> I don't think using a MB of memory for kmem -f is bad. Especially if it can
> be allocated when needed.

Unfortunately it's not a when-needed scenario. The default size (or as specified by a new "--hash <size>" option) will be allocated and touched during initialization, because the pid-gathering process uses the hash queue.

> We've already put in a request to have our retrace server stop doing kmem -f
> by default, so this is not critical for us.

OK.
Comment 18 Dave Anderson 2014-03-13 17:15:50 UTC
For now I've decided on a default of 32768:

  Increase the internal hash queue head count from 128 to 32768. The hash
  queue is used for gathering and verifying lists, and the original count
  of 128 may be overwhelmed if a list is extremely large. For example, on
  a 256GB system with 192GB of free pages, the "kmem -f" command takes
  hours to complete; with this patch, the time is reduced to a few minutes.
  In addition, a new command line option "--hash <count>" has been added
  to allow a user to override the default hash queue head count of 32768.
  (email@example.com)

$ git clone git://github.com/crash-utility/crash.git
$ cd crash
$ make lzo
Comment 19 Dave Anderson 2014-03-18 15:15:28 UTC
(In reply to Guy Streeter from comment #16)
> > I'm not a heavy user of kmem -f. I have encountered occasionally a
> > vmcore on which kmem <address> took a very long time. Is that a
> > related problem?
>
> Yes. The "kmem <address>" does a series of subsystem checks, and one
> of them is a check of the free list. I'll take a look to see if it can
> be stream-lined somewhat.

And yes, "kmem [-f] <address>" can be stream-lined fairly significantly:

commit ed28aca0c0281d44713f2e9a1a39169b908b18fd
Author: Dave Anderson <firstname.lastname@example.org>
Date:   Fri Mar 14 10:30:35 2014 -0400

    Fix for the "kmem -F" display and the "kmem -f <address>" or
    "kmem <address>" options. Without the patch, the first page in a
    list of free page blocks is not displayed on its own line by
    "kmem -F", but rather at the end of the line showing the area
    number, block size, and free_area struct address that the page is
    linked to. Due to this error, both "kmem <address>" and
    "kmem -f <address>" would not find a page if it happened to be the
    first page, or any of the contiguous pages in a multi-page block,
    in the list of free pages or page blocks.
    (email@example.com)

commit d3e5664ec6924c2e4d93b47c41fbc46dbf5c816c
Author: Dave Anderson <firstname.lastname@example.org>
Date:   Mon Mar 17 16:42:13 2014 -0400

    Created a new feature for the internal do_list() function if it is
    necessary to immediately perform a function for each entry in a
    list while the list is being traversed. A callback function, and an
    optional callback data pointer, can be registered in the list_data
    structure. The address of each entry in the list along with the
    optional callback data pointer will be passed to the callback
    function. If desired, the callback function may also dictate that
    do_list() should stop the list traversal and return immediately to
    its caller.
    (email@example.com)

commit 32aa6bbbf7becb90ba84d258f7baf19898099e1b
Author: Dave Anderson <firstname.lastname@example.org>
Date:   Mon Mar 17 16:57:48 2014 -0400

    Made the "kmem -f <address>" and "kmem <address>" options more
    efficient by using the new do_list() callback function feature, as
    well as restricting the search to only the NUMA node that contains
    the address.
    (email@example.com)

(In reply to Dave Anderson from comment #18)
> $ git clone git://github.com/crash-utility/crash.git
> $ cd crash
> $ make lzo

A git-pull/rebuild is required for the 3 commits above.
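The do_list() callback idea described in commit d3e5664e can be illustrated with a small sketch. All names here (do_list_walk, list_cb, match_addr) are invented for the example; crash's real list_data structure and do_list() interface differ.

```c
#include <assert.h>
#include <stddef.h>

/* Invented singly-linked list entry for the sketch. */
struct node {
    unsigned long addr;
    struct node *next;
};

/* Callback contract: return nonzero to stop the walk early. */
typedef int (*list_cb)(struct node *entry, void *data);

/* Walks the list, calling cb on each entry as it is traversed; returns
 * the entry the callback stopped on, or NULL if the walk completed. */
static struct node *do_list_walk(struct node *head, list_cb cb, void *data)
{
    for (struct node *n = head; n; n = n->next)
        if (cb(n, data))
            return n;   /* early exit: no need to finish the list */
    return NULL;
}

/* Example callback: stop when an entry matches a searched-for address. */
static int match_addr(struct node *n, void *data)
{
    return n->addr == *(unsigned long *)data;
}
```

Combined with restricting the search to the NUMA node whose memory range contains the address, a "kmem <address>" lookup can stop at the first match instead of walking every node's free lists to completion.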
Comment 20 Dave Wysochanski 2014-04-09 14:41:46 UTC
Thanks Dave A! I know I have run into situations where kmem takes a while, as do others. This set of patches looks like it will improve more than one scenario, and I will see if we can get this tested soon.

Other than the vmcore mentioned in the original filing, where 'kmem -f' took a long time, I wonder if we have a few vmcores which we know are slow with kmem. I will see what I can find, but if others have vmcores they know are slow with 'kmem <addr>' checks, for instance, those might be interesting test cases.
Comment 21 Dave Anderson 2015-02-10 20:28:19 UTC
This issue is fixed in crash-7.1.0-1.el6, which was built today for the rebase approved for the rhel6.7 crash utility errata.
Comment 22 Qiao Zhao 2015-05-12 05:51:31 UTC
(In reply to Dave Anderson from comment #21)
> This issue is fixed in crash-7.1.0-1.el6, which was built today for the
> rebase approved for the rhel6.7 crash utility errata.

OK, set qa_ack+.
Comment 23 Qiao Zhao 2015-06-09 07:30:07 UTC
Hi developer,

RHEL-6.7 has been in snap#4; I think we can move this bug to rhel-6.8. If you want the fix in rhel-6.7, please add the exception and blocker flags.

--
Thanks,
Qiao
Comment 24 Dave Anderson 2015-06-09 12:53:27 UTC
(In reply to Qiao Zhao from comment #23)
> Hi developer,
>
> RHEL-6.7 has been in sanp#4, i think we can move this bug to rhel-6.8.
> If you want fix in rhel-6.7, please add exception and block flags.
>
> --
> Thanks,
> Qiao

See comment #21. It's already fixed in the rhel-6.7 errata version.
Comment 25 Qiao Zhao 2015-06-10 02:37:27 UTC
(In reply to Dave Anderson from comment #24)
> (In reply to Qiao Zhao from comment #23)
> > Hi developer,
> >
> > RHEL-6.7 has been in sanp#4, i think we can move this bug to rhel-6.8.
> > If you want fix in rhel-6.7, please add exception and block flags.
> >
> > --
> > Thanks,
> > Qiao
>
> See comment #21. It's already fixed in the rhel-6.7 errata version.

Sorry, I missed this comment, but this bug has not been added to the rhel-6.7 crash-utility errata: https://errata.devel.redhat.com/advisory/19827.
Comment 26 Dave Anderson 2015-06-10 13:02:54 UTC
(In reply to Qiao Zhao from comment #25)
> (In reply to Dave Anderson from comment #24)
> > See comment #21. It's already fixed in the rhel-6.7 errata version.
>
> Sorry, i miss this comment. but this bug not been added to rhel-6.7
> crash-utility errata https://errata.devel.redhat.com/advisory/19827.

Correct -- it is not possible to add a BZ to an errata unless it has rhel-6.7+.
Comment 27 Dave Anderson 2015-06-10 13:06:28 UTC
And BTW, if the blocker/exception flag is a prerequisite for rhel-6.7+, I don't have permissions to set blocker? or exception?. (I don't even see those two flags when I click "more flags")
Comment 34 errata-xmlrpc 2015-07-22 06:27:03 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1309.html