Bug 1596305 - Profile and speed up 'search -k' where possible for common vmcore types
Summary: Profile and speed up 'search -k' where possible for common vmcore types
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: crash
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Dave Wysochanski
QA Contact: Emma Wu
URL:
Whiteboard:
Depends On:
Blocks: 1653509
 
Reported: 2018-06-28 14:46 UTC by Dave Wysochanski
Modified: 2019-06-26 02:44 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1723638
Environment:
Last Closed: 2019-06-25 04:01:22 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Dave Wysochanski 2018-06-28 14:46:38 UTC
Description of problem:
The crash 'search' command is expensive to run but heavily used by a number of our more senior vmcore engineers.  This is a placeholder to see what options there are for speeding up this command and how invasive they would be.

Version-Release number of selected component (if applicable):
crash-7.2.3

How reproducible:
easy

Steps to Reproduce:
1. 'search -k' on a fairly large (tens of GB at least) vmcore 


Actual results:
search command takes many hours or even days

Expected results:
TBD - improve search performance where reasonably possible

Additional info:
The first step is to take a sample of the most common vmcore types and search commands, which should be fairly simple.  There are a lot of dump types, etc., and I'm not suggesting this bug address them all.  I already have a couple of examples I can start with.

I realize that much of the expense often involves reading a lot from disk, but the commands I've seen run by various users seem to be CPU bound, not disk bound.  This makes me wonder whether we can add some algorithms to improve things.  I took a quick look at the code and I think there are possibilities for improvement.

Comment 4 Dave Wysochanski 2018-07-12 11:34:45 UTC
I made a first attempt at investigating a possible speedup for the 'search' command.  I thought maybe zlib improvements would help (we are on RHEL6, so I tried RHEL7's zlib), but no difference was seen after recompiling the RHEL7 srpm and installing it on RHEL6.  This was not a rigorous investigation, and I intend to set up and do a proper profile soon.

I ran the experiment below after running the command under gdb and ctrl-c'ing a few times, which showed we were executing inside the zlib decompression routines for a page.  I didn't get a real profile, but this hinted that the majority of the time is likely spent decompressing pages, so it was a simple investigation to try.

Unfortunately, there seemed to be no difference when using the RHEL7 version of zlib recompiled for RHEL6.  I also tried rebuilding the RHEL8 zlib srpm on RHEL6, but it failed.

Also, I realize decompression / compression algorithms may be encumbered by patents, so there may not be much we can do to speed them up.

Comment 6 Dave Wysochanski 2018-07-22 02:59:54 UTC
Simple perf shows libz consuming most of the CPU - indeed we're in decompression for this 'search' command.
Will need to think about what may be possible.

[dwysocha@optimus-dev search1-487084416]$ ls -lh ./487084416-vmcore
-rw-r--r--. 1 dwysocha dwysocha 5.9G Jun 27 13:17 ./487084416-vmcore
[dwysocha@optimus-dev search1-487084416]$ cat ./487084416-crashrc
search ffffea003ba996a8 | cat -n
exit


[dwysocha@optimus-dev search1-487084416]$ perf record ~/crash-utility/crash -i ./487084416-crashrc ./487084416-vmcore ./487084416-vmlinux 
...
      KERNEL: ./487084416-vmlinux
    DUMPFILE: ./487084416-vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Jun 27 03:55:18 2018
      UPTIME: 10 days, 23:19:41
LOAD AVERAGE: 7.18, 8.10, 9.61
       TASKS: 1656
...
     RELEASE: 2.6.32-642.11.1.el6.x86_64
     VERSION: #1 SMP Wed Oct 26 10:25:23 EDT 2016
     MACHINE: x86_64  (2599 Mhz)
      MEMORY: 1023.6 GB
       PANIC: "[948355.683417] BUG: unable to handle kernel paging request at 0000000d00000058"
         PID: 27515
     COMMAND: "q"
        TASK: ffff88d9c894f520  [THREAD_INFO: ffff88ee18ee8000]
         CPU: 9
       STATE: TASK_RUNNING (PANIC)
...
[dwysocha@optimus-dev search1-487084416]$ perf report --stdio
Warning:
Kernel address maps (/proc/{kallsyms,modules}) were restricted.

Check /proc/sys/kernel/kptr_restrict before running 'perf record'.

As no suitable kallsyms nor vmlinux was found, kernel samples
can't be resolved.

Samples in kernel modules can't be resolved as well.

# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 843283258660
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  ............................................
# 
    25.99%  crash    libz.so.1.2.3     [.] inflate_fast
    15.33%  crash    crash             [.] read_diskdump
    10.63%  crash    libz.so.1.2.3     [.] inflate_table
     7.25%  crash    libz.so.1.2.3     [.] inflate
     5.45%  crash    libz.so.1.2.3     [.] adler32
     5.15%  crash    crash             [.] readmem
     4.01%  crash    crash             [.] search_virtual
     3.15%  crash    crash             [.] x86_64_kvtop
     3.11%  crash    crash             [.] nr_to_section
     2.64%  crash    crash             [.] x86_64_is_kvaddr
     1.68%  crash    crash             [.] read_mem_section
     1.51%  crash    crash             [.] SIZE_verify
     1.10%  crash    libc-2.12.so      [.] memcpy
     1.05%  crash    crash             [.] OFFSET_verify
     0.96%  crash    crash             [.] pfn_to_map
     0.72%  crash    crash             [.] x86_64_IS_VMALLOC_ADDR
     0.56%  crash    [unknown]         [k] 0xffffffff81560206
     0.54%  crash    [unknown]         [k] 0xffffffff81560432
     0.53%  crash    crash             [.] section_has_mem_map
     0.53%  crash    crash             [.] kvtop
     0.45%  crash    crash             [.] phys_to_page
     0.34%  crash    crash             [.] sparse_decode_mem_map

[dwysocha@optimus-dev search1-487084416]$ rpm -qf /lib64/libz.so.1
zlib-1.2.3-29.el6.x86_64
[dwysocha@optimus-dev search1-487084416]$ rpm -qi zlib
Name        : zlib                         Relocations: (not relocatable)
Version     : 1.2.3                             Vendor: Red Hat, Inc.
Release     : 29.el6                        Build Date: Wed 03 Oct 2012 12:18:49 PM EDT
Install Date: Fri 23 Aug 2013 03:32:51 PM EDT      Build Host: x86-022.build.eng.bos.redhat.com
Group       : System Environment/Libraries   Source RPM: zlib-1.2.3-29.el6.src.rpm
Size        : 139089                           License: zlib and Boost
Signature   : RSA/8, Tue 27 Nov 2012 04:58:58 AM EST, Key ID 199e2f91fd431d51
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://www.gzip.org/zlib/
Summary     : The zlib compression and decompression library
Description :
Zlib is a general-purpose, patent-free, lossless data compression
library which is used by many different programs.
Name        : zlib                         Relocations: (not relocatable)
Version     : 1.2.3                             Vendor: Red Hat, Inc.
Release     : 29.el6                        Build Date: Wed 03 Oct 2012 12:18:37 PM EDT
Install Date: Thu 12 Jul 2018 07:14:37 AM EDT      Build Host: x86-022.build.eng.bos.redhat.com
Group       : System Environment/Libraries   Source RPM: zlib-1.2.3-29.el6.src.rpm
Size        : 152305                           License: zlib and Boost
Signature   : RSA/8, Tue 27 Nov 2012 04:58:54 AM EST, Key ID 199e2f91fd431d51
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://www.gzip.org/zlib/
Summary     : The zlib compression and decompression library
Description :
Zlib is a general-purpose, patent-free, lossless data compression
library which is used by many different programs.

Comment 7 Dave Wysochanski 2018-07-22 03:07:02 UTC
(In reply to Dave Wysochanski from comment #6)
> Simple perf shows libz consuming most of the CPU - indeed we're in
> decompression for this 'search' command.
> Will need to think about what may be possible.
> 

Pretty tiny vmcore file:
> -rw-r--r--. 1 dwysocha dwysocha 5.9G Jun 27 13:17 ./487084416-vmcore

Search commands are always expensive

> [dwysocha@optimus-dev search1-487084416]$ cat ./487084416-crashrc
> search ffffea003ba996a8 | cat -n
> exit
> 

But in comparison to the size of the vmcore, this machine has 1 TB of RAM:
>       MEMORY: 1023.6 GB

> # Samples: 1M of event 'cycles'
> # Event count (approx.): 843283258660
> #
> # Overhead  Command  Shared Object     Symbol
> # ........  .......  ................ 
> ............................................
> # 
>     25.99%  crash    libz.so.1.2.3     [.] inflate_fast
>     15.33%  crash    crash             [.] read_diskdump
>     10.63%  crash    libz.so.1.2.3     [.] inflate_table
>      7.25%  crash    libz.so.1.2.3     [.] inflate
>      5.45%  crash    libz.so.1.2.3     [.] adler32
>      5.15%  crash    crash             [.] readmem
>      4.01%  crash    crash             [.] search_virtual
>      3.15%  crash    crash             [.] x86_64_kvtop
>      3.11%  crash    crash             [.] nr_to_section
>      2.64%  crash    crash             [.] x86_64_is_kvaddr
>      1.68%  crash    crash             [.] read_mem_section

Not surprising decompression is taking CPU.

Will run longer with perf and see if anything else interesting shows up or if this is the whole story.

NOTE: This vmcore file is small enough that I can run it from page cache (RAM), so we should be completely avoiding any disk-reading code.  This is more about how this machine with large (but sparse) memory was captured, and how we handle it when we decompress.
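As a sanity check on the zlib angle (not from this bug; a hypothetical microbenchmark using only the Python standard library), one could bound how fast inflate alone can go on 4 KiB pages.  If raw inflate throughput already far exceeds the observed search rate, the remaining time is being spent elsewhere:

```python
# Hypothetical microbenchmark (not crash's code): measure how fast zlib can
# inflate 4 KiB pages, to bound what a faster zlib could buy 'search'.
import os
import time
import zlib

PAGE = 4096

def inflate_rate(npages=200, level=6):
    """Return approximate decompression throughput in bytes/second."""
    # Half random, doubled, so the pages are moderately compressible.
    pages = [zlib.compress(os.urandom(PAGE // 2) * 2, level)
             for _ in range(npages)]
    t0 = time.perf_counter()
    for p in pages:
        assert len(zlib.decompress(p)) == PAGE
    return npages * PAGE / (time.perf_counter() - t0)

if __name__ == "__main__":
    print(f"~{inflate_rate() / 1e6:.0f} MB/s inflate throughput")
```

Comparing that number against (vmcore size / observed search runtime) would show how close the runs above already are to the decompression ceiling.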

Comment 8 Dave Wysochanski 2018-07-22 03:20:17 UTC
OK, as it turns out, when run from RAM that command does not take very long at all, so it may not be a good test.  Or else, when it was run before, it took a long time due to disk overhead or some other reason (other crash processes competing for RAM, etc.).
 
In any case, a couple of quick thoughts on optimizing 'search' for profiles like this (assuming there's not much we can do in the decompression routine):
1. Use 'dd' to make sure the vmcore is loaded into page cache (RAM) before running crash
2. Break up the kernel address space into chunks based on the # of CPUs
3. Run multiple crash commands with '-s' and '-e' addresses
4. Ideally do the above in one "optimized" crash command (but might be too difficult so manual steps may be adequate)
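Steps 1-3 above could be sketched like this (a hypothetical helper, not an existing tool; the paths, search pattern, and address range are illustrative placeholders, and the '-s'/'-e' options are the search start/end addresses mentioned in step 3):

```python
# Sketch of steps 1-3 above: emit a cache-warming 'dd' plus one per-CPU
# crash 'search -s/-e' invocation over an illustrative address range.
# VMCORE, VMLINUX, PATTERN, and the range are placeholders, not from this bug.
import os

VMCORE, VMLINUX = "./vmcore", "./vmlinux"
PATTERN = "ffffea003ba996a8"

def chunk_ranges(start, end, nchunks):
    """Split [start, end) into nchunks contiguous [s, e) ranges."""
    step = (end - start) // nchunks
    edges = [start + i * step for i in range(nchunks)] + [end]
    return list(zip(edges[:-1], edges[1:]))

def commands(start, end, ncpus=None):
    ncpus = ncpus or os.cpu_count() or 1
    yield f"dd if={VMCORE} of=/dev/null bs=1M  # step 1: warm the page cache"
    for s, e in chunk_ranges(start, end, ncpus):
        yield (f'echo "search -s {s:#x} -e {e:#x} {PATTERN}" | '
               f"crash {VMCORE} {VMLINUX}")

if __name__ == "__main__":
    for cmd in commands(0xffff880000000000, 0xffff900000000000, ncpus=4):
        print(cmd)
```

Each emitted crash invocation could then be launched in the background and the per-chunk output merged, which is the manual form of step 4.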

I have some other ideas but I'll have to study the code more.

The perf shows the same thing when run to completion
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 1254104950924
#
# Overhead  Command  Shared Object     Symbol                                   
# ........  .......  ................  .........................................
#
    30.64%  crash    libz.so.1.2.3     [.] inflate_fast                         
    13.89%  crash    crash             [.] read_diskdump                        
    12.51%  crash    libz.so.1.2.3     [.] inflate_table                        
     8.49%  crash    libz.so.1.2.3     [.] inflate                              
     6.10%  crash    libz.so.1.2.3     [.] adler32                              
     4.08%  crash    crash             [.] search_virtual                       
     3.52%  crash    crash             [.] readmem                              
     2.27%  crash    crash             [.] nr_to_section                        
     2.14%  crash    crash             [.] x86_64_kvtop                         
     1.93%  crash    crash             [.] x86_64_is_kvaddr                     
     1.06%  crash    crash             [.] SIZE_verify                          
     0.98%  crash    crash             [.] read_mem_section                     
     0.88%  crash    libc-2.12.so      [.] memcpy                               
     0.71%  crash    crash             [.] OFFSET_verify                        
     0.63%  crash    crash             [.] pfn_to_map                           
     0.61%  crash    [unknown]         [k] 0xffffffff81560432                   
     0.61%  crash    [unknown]         [k] 0xffffffff81560206                   
     0.49%  crash    crash             [.] x86_64_IS_VMALLOC_ADDR               
     0.36%  crash    crash             [.] kvtop

Comment 9 Dave Wysochanski 2019-03-19 21:01:46 UTC
FWIW, RHEL8 RC2 code runs in half the time, though on different hardware as well; too many moving parts to draw any conclusions other than that our new vmcore system should be much faster for expensive crash commands.

$ rpm -q crash zlib
crash-7.2.3-18.el8.x86_64
zlib-1.2.11-10.el8.x86_64

$ time crash -i 487084416-crashrc ./487084416-vmcore ./487084416-vmlinux

crash 7.2.3-18.el8
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
...
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: ./487084416-vmlinux
    DUMPFILE: ./487084416-vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Jun 27 03:55:18 2018
      UPTIME: 10 days, 23:19:41
LOAD AVERAGE: 7.18, 8.10, 9.61
       TASKS: 1656
...
     RELEASE: 2.6.32-642.11.1.el6.x86_64
     VERSION: #1 SMP Wed Oct 26 10:25:23 EDT 2016
     MACHINE: x86_64  (2599 Mhz)
      MEMORY: 1023.6 GB
       PANIC: "[948355.683417] BUG: unable to handle kernel paging request at 0000000d00000058"
         PID: 27515
     COMMAND: "q"
        TASK: ffff88d9c894f520  [THREAD_INFO: ffff88ee18ee8000]
         CPU: 9
       STATE: TASK_RUNNING (PANIC)

     1  crash> search ffffea003ba996a8 | cat -n
crash> exit

real    4m36.532s
user    4m20.261s
sys     0m14.748s
$ 


Second run consistent time:
crash> exit

real    4m37.201s
user    4m21.070s
sys     0m14.922s


Perf run, looks similar to RHEL6

$ perf record crash -i ./487084416-crashrc ./487084416-vmcore ./487084416-vmlinux
...
     1  crash> search ffffea003ba996a8 | cat -n
crash> exit
[ perf record: Woken up 161 times to write data ]
[ perf record: Captured and wrote 40.346 MB perf.data (1057427 samples) ]

$ perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'cpu-clock:uhH'
# Event count (approx.): 264356750000
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .................................................
# 
    35.21%  crash    libz.so.1.2.11    [.] inflate_fast
    12.73%  crash    crash             [.] read_diskdump
    11.07%  crash    libz.so.1.2.11    [.] inflate_table
     8.99%  crash    libz.so.1.2.11    [.] inflate
     6.48%  crash    crash             [.] search_virtual
     5.56%  crash    libz.so.1.2.11    [.] adler32_z
     3.50%  crash    crash             [.] readmem
     3.00%  crash    crash             [.] nr_to_section
     1.86%  crash    crash             [.] x86_64_kvtop
     1.77%  crash    libc-2.28.so      [.] __memmove_avx_unaligned_erms
     1.33%  crash    crash             [.] x86_64_is_kvaddr
     1.31%  crash    crash             [.] SIZE_verify
     0.76%  crash    libc-2.28.so      [.] read
     0.75%  crash    libc-2.28.so      [.] llseek.5
     0.68%  crash    crash             [.] x86_64_IS_VMALLOC_ADDR
     0.68%  crash    crash             [.] OFFSET_verify
     0.62%  crash    crash             [.] read_mem_section

Comment 12 Dave Wysochanski 2019-06-24 14:08:23 UTC
I started investigating ways to improve this because I think it would be a large win for senior engineers who often end up using 'search' for more complicated vmcores / crashes such as memory corruptions.  I have some ideas but no concrete patches.  I'm not hopeful at this point that I can have something for RHEL7.8, so it may need to be a RHEL8-only thing.

Comment 13 Dave Wysochanski 2019-06-24 14:12:45 UTC
Rather than go after zlib, here are some of my other thoughts:
1) divide the search space into pieces and add a search option to run multi-threaded
2) develop heuristics for common search patterns in common vmcores.  There are a variety of search algorithms and techniques I think we might explore.

Any of the above may be quite involved and likely require extensive testing to ensure correctness, so I would not rush any patches.
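Idea 1 above might look roughly like the sketch below (a toy over an in-memory buffer, not crash's code; function names are invented).  One correctness subtlety worth testing for: a match can straddle a chunk boundary, so each worker is allowed to read slightly past its chunk end but claims only matches whose first byte falls inside its own range, so every match is reported exactly once:

```python
# Sketch of idea 1 (not crash's code): split the search space into chunks
# and scan them in parallel.  Each worker claims only matches whose first
# byte falls in its own [s, e) range, so a match straddling a chunk
# boundary is still found exactly once, by the chunk that owns its start.
from concurrent.futures import ThreadPoolExecutor

def find_in_range(buf, pattern, s, e):
    """Offsets of matches of pattern whose first byte lies in [s, e)."""
    limit = min(len(buf), e + len(pattern) - 1)  # let matches cross e
    hits, i = [], buf.find(pattern, s, limit)
    while i != -1:
        hits.append(i)
        i = buf.find(pattern, i + 1, limit)
    return hits

def parallel_search(buf, pattern, nworkers=4):
    step = max(1, -(-len(buf) // nworkers))  # ceil(len / nworkers)
    with ThreadPoolExecutor(nworkers) as ex:
        futs = [ex.submit(find_in_range, buf, pattern, s, s + step)
                for s in range(0, len(buf), step)]
        return sorted(h for f in futs for h in f.result())
```

In crash itself the analogous split would be over the kernel virtual address range, with each worker doing its own readmem / decompression, which is where the real complexity (and the testing burden) would be.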

Comment 14 Dave Young 2019-06-25 04:01:22 UTC
(In reply to Dave Wysochanski from comment #12)
> I started investigating ways to improve this because I think it would a be a
> large win for senior engineers that often end up using 'search' for more
> complicated vmcores / crashes such as memory corruptions.  I have some ideas
> but no concrete patches.  I'm not hopeful at this point I can have something
> for RHEL7.8 so it may need to be a RHEL8 only thing.

Thanks for the update.  Closing this one then, and cloning a new bz for RHEL8.

