Description of problem:
The crash command 'search' is expensive to run but heavily used by a number of our more senior vmcore engineers. This is a placeholder to see what options there are for speeding up this command and how invasive they would be.

Version-Release number of selected component (if applicable):
crash-7.2.3

How reproducible:
Easy

Steps to Reproduce:
1. Run 'search -k' on a fairly large (tens of GB at least) vmcore

Actual results:
The search command takes many hours or even days.

Expected results:
TBD - improve search performance where reasonably possible.

Additional info:
The first step is to take a sample of the most common vmcore types and search commands, which should be fairly simple. There are a lot of dump types, etc., and I'm not suggesting this bug address them all. I already have a couple of examples I can start with. I realize that a lot of the time the expense involves reading a lot from disk, but the commands I've seen run by various users seem to be CPU bound, not disk bound. This makes me wonder whether we can add some algorithms to improve things. I took a quick look at the code and I think there are possibilities for improvement.
I made a first attempt at investigating a possible speedup for the 'search' command. I thought maybe zlib improvements would help (we are on RHEL6, so I tried RHEL7's zlib) - but no difference was seen after recompiling the RHEL7 srpm and installing it on RHEL6. This is not a proper investigation, and I intend to set up and do a real profile soon. I tried the below experiment after running the command under gdb and ctrl-c'ing a few times, and seeing that we were executing inside the zlib decompression routines for a page each time. I didn't get a real profile, but this hinted that the majority of the time is likely spent in decompression of pages, so it was a simple experiment to try. Unfortunately there seemed to be no difference when using the RHEL7 version of zlib recompiled for RHEL6. I also tried rebuilding the RHEL8 zlib srpm on RHEL6, but it failed. In addition, I realize decompression / compression algorithms may be encumbered by patents, so there may not be much we can do to speed them up.
Simple perf shows libz consuming most of the CPU - indeed we're in decompression for this 'search' command. Will need to think about what may be possible.

[dwysocha@optimus-dev search1-487084416]$ ls -lh ./487084416-vmcore
-rw-r--r--. 1 dwysocha dwysocha 5.9G Jun 27 13:17 ./487084416-vmcore
[dwysocha@optimus-dev search1-487084416]$ cat ./487084416-crashrc
search ffffea003ba996a8 | cat -n
exit
[dwysocha@optimus-dev search1-487084416]$ perf record ~/crash-utility/crash -i ./487084416-crashrc ./487084416-vmcore ./487084416-vmlinux
...
      KERNEL: ./487084416-vmlinux
    DUMPFILE: ./487084416-vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Jun 27 03:55:18 2018
      UPTIME: 10 days, 23:19:41
LOAD AVERAGE: 7.18, 8.10, 9.61
       TASKS: 1656
...
     RELEASE: 2.6.32-642.11.1.el6.x86_64
     VERSION: #1 SMP Wed Oct 26 10:25:23 EDT 2016
     MACHINE: x86_64  (2599 Mhz)
      MEMORY: 1023.6 GB
       PANIC: "[948355.683417] BUG: unable to handle kernel paging request at 0000000d00000058"
         PID: 27515
     COMMAND: "q"
        TASK: ffff88d9c894f520  [THREAD_INFO: ffff88ee18ee8000]
         CPU: 9
       STATE: TASK_RUNNING (PANIC)
...
[dwysocha@optimus-dev search1-487084416]$ perf report --stdio
Warning:
Kernel address maps (/proc/{kallsyms,modules}) were restricted.
Check /proc/sys/kernel/kptr_restrict before running 'perf record'.
As no suitable kallsyms nor vmlinux was found, kernel samples can't be resolved.
Samples in kernel modules can't be resolved as well.
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 843283258660
#
# Overhead  Command  Shared Object   Symbol
# ........  .......  ..............  ............................................
#
    25.99%  crash  libz.so.1.2.3  [.] inflate_fast
    15.33%  crash  crash          [.] read_diskdump
    10.63%  crash  libz.so.1.2.3  [.] inflate_table
     7.25%  crash  libz.so.1.2.3  [.] inflate
     5.45%  crash  libz.so.1.2.3  [.] adler32
     5.15%  crash  crash          [.] readmem
     4.01%  crash  crash          [.] search_virtual
     3.15%  crash  crash          [.] x86_64_kvtop
     3.11%  crash  crash          [.] nr_to_section
     2.64%  crash  crash          [.] x86_64_is_kvaddr
     1.68%  crash  crash          [.] read_mem_section
     1.51%  crash  crash          [.] SIZE_verify
     1.10%  crash  libc-2.12.so   [.] memcpy
     1.05%  crash  crash          [.] OFFSET_verify
     0.96%  crash  crash          [.] pfn_to_map
     0.72%  crash  crash          [.] x86_64_IS_VMALLOC_ADDR
     0.56%  crash  [unknown]      [k] 0xffffffff81560206
     0.54%  crash  [unknown]      [k] 0xffffffff81560432
     0.53%  crash  crash          [.] section_has_mem_map
     0.53%  crash  crash          [.] kvtop
     0.45%  crash  crash          [.] phys_to_page
     0.34%  crash  crash          [.] sparse_decode_mem_map
[dwysocha@optimus-dev search1-487084416]$ rpm -qf /lib64/libz.so.1
zlib-1.2.3-29.el6.x86_64
[dwysocha@optimus-dev search1-487084416]$ rpm -qi zlib
Name        : zlib                  Relocations: (not relocatable)
Version     : 1.2.3                 Vendor: Red Hat, Inc.
Release     : 29.el6                Build Date: Wed 03 Oct 2012 12:18:49 PM EDT
Install Date: Fri 23 Aug 2013 03:32:51 PM EDT
Build Host  : x86-022.build.eng.bos.redhat.com
Group       : System Environment/Libraries
Source RPM  : zlib-1.2.3-29.el6.src.rpm
Size        : 139089                License: zlib and Boost
Signature   : RSA/8, Tue 27 Nov 2012 04:58:58 AM EST, Key ID 199e2f91fd431d51
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://www.gzip.org/zlib/
Summary     : The zlib compression and decompression library
Description :
Zlib is a general-purpose, patent-free, lossless data compression
library which is used by many different programs.
Name        : zlib                  Relocations: (not relocatable)
Version     : 1.2.3                 Vendor: Red Hat, Inc.
Release     : 29.el6                Build Date: Wed 03 Oct 2012 12:18:37 PM EDT
Install Date: Thu 12 Jul 2018 07:14:37 AM EDT
Build Host  : x86-022.build.eng.bos.redhat.com
Group       : System Environment/Libraries
Source RPM  : zlib-1.2.3-29.el6.src.rpm
Size        : 152305                License: zlib and Boost
Signature   : RSA/8, Tue 27 Nov 2012 04:58:54 AM EST, Key ID 199e2f91fd431d51
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://www.gzip.org/zlib/
Summary     : The zlib compression and decompression library
Description :
Zlib is a general-purpose, patent-free, lossless data compression
library which is used by many different programs.
(In reply to Dave Wysochanski from comment #6)
> Simple perf shows libz consuming most of the CPU - indeed we're in
> decompression for this 'search' command.
> Will need to think about what may be possible.
>
Pretty tiny vmcore file:
> -rw-r--r--. 1 dwysocha dwysocha 5.9G Jun 27 13:17 ./487084416-vmcore

Search commands are always expensive.

> [dwysocha@optimus-dev search1-487084416]$ cat ./487084416-crashrc
> search ffffea003ba996a8 | cat -n
> exit
>
But in comparison to the size of the vmcore, this machine has 1TB of RAM:
> MEMORY: 1023.6 GB

> # Samples: 1M of event 'cycles'
> # Event count (approx.): 843283258660
> #
> # Overhead  Command  Shared Object   Symbol
> # ........  .......  ..............
> ............................................
> #
>     25.99%  crash  libz.so.1.2.3  [.] inflate_fast
>     15.33%  crash  crash          [.] read_diskdump
>     10.63%  crash  libz.so.1.2.3  [.] inflate_table
>      7.25%  crash  libz.so.1.2.3  [.] inflate
>      5.45%  crash  libz.so.1.2.3  [.] adler32
>      5.15%  crash  crash          [.] readmem
>      4.01%  crash  crash          [.] search_virtual
>      3.15%  crash  crash          [.] x86_64_kvtop
>      3.11%  crash  crash          [.] nr_to_section
>      2.64%  crash  crash          [.] x86_64_is_kvaddr
>      1.68%  crash  crash          [.] read_mem_section

Not surprising that decompression is taking CPU. Will run longer with perf and see if anything else interesting shows up, or if this is the whole story.

NOTE: This is a small enough vmcore file that I can run it from page cache (RAM), so we should be totally avoiding any sort of disk-reading code. This is more about how this machine with large (but sparse) memory was captured, and how we handle it when we decompress.
Ok, as it turns out, when run from RAM that command does not take very long at all, so it may not be a good test. Or else when it was run before, it took a long time due to disk overhead or some other reason (other crash processes competing for RAM, etc).

In any case, a couple of quick thoughts on optimizing 'search' for profiles like this (assuming there's not much we can do in the decompression routine):
1. Use 'dd' to make sure the vmcore is loaded into page cache (RAM) before running crash
2. Break up the kernel address space into chunks based on the # of CPUs
3. Run multiple crash commands with '-s' and '-e' addresses
4. Ideally do the above in one "optimized" crash command (but that might be too difficult, so manual steps may be adequate)

I have some other ideas but I'll have to study the code more.

The perf shows the same thing when run to completion:

# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 1254104950924
#
# Overhead  Command  Shared Object   Symbol
# ........  .......  ..............  .........................................
#
    30.64%  crash  libz.so.1.2.3  [.] inflate_fast
    13.89%  crash  crash          [.] read_diskdump
    12.51%  crash  libz.so.1.2.3  [.] inflate_table
     8.49%  crash  libz.so.1.2.3  [.] inflate
     6.10%  crash  libz.so.1.2.3  [.] adler32
     4.08%  crash  crash          [.] search_virtual
     3.52%  crash  crash          [.] readmem
     2.27%  crash  crash          [.] nr_to_section
     2.14%  crash  crash          [.] x86_64_kvtop
     1.93%  crash  crash          [.] x86_64_is_kvaddr
     1.06%  crash  crash          [.] SIZE_verify
     0.98%  crash  crash          [.] read_mem_section
     0.88%  crash  libc-2.12.so   [.] memcpy
     0.71%  crash  crash          [.] OFFSET_verify
     0.63%  crash  crash          [.] pfn_to_map
     0.61%  crash  [unknown]      [k] 0xffffffff81560432
     0.61%  crash  [unknown]      [k] 0xffffffff81560206
     0.49%  crash  crash          [.] x86_64_IS_VMALLOC_ADDR
     0.36%  crash  crash          [.] kvtop
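To make steps 2 and 3 above concrete, here is a hypothetical helper (not part of crash) that splits an address range into per-CPU chunks and emits one 'search -s START -e END VALUE' command line per chunk, suitable for feeding to parallel crash sessions via '-i'. The address range, chunk count, and search value shown are illustrative only:

```python
# Sketch: divide a kernel virtual address range into N chunks and build one
# 'search -s START -e END VALUE' command per chunk, one per crash session.
# The range, chunk count, and value used below are illustrative examples.

def chunk_range(start, end, nchunks):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) contiguously."""
    step = (end - start) // nchunks
    for i in range(nchunks):
        cstart = start + i * step
        # Last chunk absorbs any remainder so the full range is covered.
        cend = end if i == nchunks - 1 else cstart + step
        yield (cstart, cend)

def search_commands(value, start, end, nchunks):
    """Build the per-chunk 'search' command strings to run in parallel."""
    return [
        "search -s %#x -e %#x %s" % (cs, ce, value)
        for cs, ce in chunk_range(start, end, nchunks)
    ]

if __name__ == "__main__":
    # Example: split an illustrative address range into 8 chunks.
    for cmd in search_commands("ffffea003ba996a8",
                               0xffff880000000000, 0xffffc80000000000, 8):
        print(cmd)
```

For step 1, something as simple as 'dd if=./487084416-vmcore of=/dev/null bs=1M' before starting the crash sessions should pull the vmcore into page cache, assuming the machine has enough free RAM to hold it.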
FWIW, RHEL8 RC2 code runs in half the time, though on different hardware as well - too many moving parts to draw any conclusions other than that our new vmcore system should be much faster for expensive crash commands.

$ rpm -q crash zlib
crash-7.2.3-18.el8.x86_64
zlib-1.2.11-10.el8.x86_64
$ time crash -i 487084416-crashrc ./487084416-vmcore ./487084416-vmlinux

crash 7.2.3-18.el8
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
...
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: ./487084416-vmlinux
    DUMPFILE: ./487084416-vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Jun 27 03:55:18 2018
      UPTIME: 10 days, 23:19:41
LOAD AVERAGE: 7.18, 8.10, 9.61
       TASKS: 1656
...
     RELEASE: 2.6.32-642.11.1.el6.x86_64
     VERSION: #1 SMP Wed Oct 26 10:25:23 EDT 2016
     MACHINE: x86_64  (2599 Mhz)
      MEMORY: 1023.6 GB
       PANIC: "[948355.683417] BUG: unable to handle kernel paging request at 0000000d00000058"
         PID: 27515
     COMMAND: "q"
        TASK: ffff88d9c894f520  [THREAD_INFO: ffff88ee18ee8000]
         CPU: 9
       STATE: TASK_RUNNING (PANIC)
1
crash> search ffffea003ba996a8 | cat -n
crash> exit

real	4m36.532s
user	4m20.261s
sys	0m14.748s
$

Second run, consistent time:

crash> exit

real	4m37.201s
user	4m21.070s
sys	0m14.922s

Perf run, looks similar to RHEL6:

$ perf record crash -i ./487084416-crashrc ./487084416-vmcore ./487084416-vmlinux
...
1
crash> search ffffea003ba996a8 | cat -n
crash> exit
[ perf record: Woken up 161 times to write data ]
[ perf record: Captured and wrote 40.346 MB perf.data (1057427 samples) ]
$ perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'cpu-clock:uhH'
# Event count (approx.): 264356750000
#
# Overhead  Command  Shared Object   Symbol
# ........  .......  ..............  .................................................
#
    35.21%  crash  libz.so.1.2.11  [.] inflate_fast
    12.73%  crash  crash           [.] read_diskdump
    11.07%  crash  libz.so.1.2.11  [.] inflate_table
     8.99%  crash  libz.so.1.2.11  [.] inflate
     6.48%  crash  crash           [.] search_virtual
     5.56%  crash  libz.so.1.2.11  [.] adler32_z
     3.50%  crash  crash           [.] readmem
     3.00%  crash  crash           [.] nr_to_section
     1.86%  crash  crash           [.] x86_64_kvtop
     1.77%  crash  libc-2.28.so    [.] __memmove_avx_unaligned_erms
     1.33%  crash  crash           [.] x86_64_is_kvaddr
     1.31%  crash  crash           [.] SIZE_verify
     0.76%  crash  libc-2.28.so    [.] read
     0.75%  crash  libc-2.28.so    [.] llseek.5
     0.68%  crash  crash           [.] x86_64_IS_VMALLOC_ADDR
     0.68%  crash  crash           [.] OFFSET_verify
     0.62%  crash  crash           [.] read_mem_section
I started investigating ways to improve this because I think it would be a large win for senior engineers, who often end up using 'search' for more complicated vmcores / crashes such as memory corruptions. I have some ideas but no concrete patches. I'm not hopeful at this point that I can have something for RHEL7.8, so it may need to be a RHEL8-only thing.
Rather than go after zlib, here are some of my other thoughts:
1) Divide the search space into pieces and add a search option to run multi-threaded
2) Develop heuristics for common search patterns in common vmcores

There are a variety of search algorithms and techniques I think we might explore. Any of the above may be quite involved and would likely require extensive testing to ensure correctness, so I would not rush any patches.
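Since the profiles show inflate dominating, one complementary angle is making sure each compressed page is inflated at most once per session. This is a toy sketch only, NOT crash's actual code (crash maintains its own internal page cache); it just illustrates the idea of trading memory for decompression CPU time with a small LRU cache keyed by page frame number:

```python
# Toy sketch: cache decompressed dump pages so repeated reads of the same
# page skip zlib's inflate path. This is NOT crash's real implementation;
# it only illustrates trading memory for decompression CPU time.

import zlib
from collections import OrderedDict

PAGE_SIZE = 4096

class PageCache:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.pages = OrderedDict()   # pfn -> decompressed page bytes
        self.misses = 0              # counts actual zlib.decompress() calls

    def read_page(self, pfn, compressed_blob):
        """Return the decompressed page for pfn, inflating only on a miss."""
        if pfn in self.pages:
            self.pages.move_to_end(pfn)      # LRU: mark most recently used
            return self.pages[pfn]
        self.misses += 1
        page = zlib.decompress(compressed_blob)
        self.pages[pfn] = page
        if len(self.pages) > self.capacity:  # evict least recently used
            self.pages.popitem(last=False)
        return page
```

Whether this helps in practice depends on how often a given search workload revisits the same pages; a random one-pass scan would see no benefit, which is why profiling the common search patterns first matters.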
(In reply to Dave Wysochanski from comment #12)
> I started investigating ways to improve this because I think it would a be a
> large win for senior engineers that often end up using 'search' for more
> complicated vmcores / crashes such as memory corruptions. I have some ideas
> but no concrete patches. I'm not hopeful at this point I can have something
> for RHEL7.8 so it may need to be a RHEL8 only thing.

Thanks for the update. Closing this one then, and cloning a new bz for RHEL8.