Bug 1114088 - crash --osrelease will spin chewing cpu on certain input files
Summary: crash --osrelease will spin chewing cpu on certain input files
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: crash
Version: 7.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Dave Anderson
QA Contact: Qiao Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-06-27 17:06 UTC by Dave Wysochanski
Modified: 2019-02-19 21:40 UTC (History)
1 user

Fixed In Version: crash-7.0.8-1.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-05 14:09:09 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
0001-Fix-infinite-loop-in-crash-osrelease-with-damaged-vm.patch (756 bytes, patch)
2014-06-27 17:50 UTC, Dave Wysochanski
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:0463 0 normal SHIPPED_LIVE crash bug fix and enhancement update 2015-03-05 17:36:08 UTC

Description Dave Wysochanski 2014-06-27 17:06:20 UTC
Created attachment 912866 [details]
sample 'vmcore' file which was submitted to our vmcore system and caused 'crash --osrelease' to spin at 100% CPU

Description of problem:
We found this on our rhel6 based vmcore analysis system, which has an 7.x based version of crash rebuilt (see below).  When a supposed vmcore file is submitted to our vmcore analysis system (retrace-server), part of the validation of the file is to run the following command to try to identify the kernel version:
$ crash --osrelease vmcore

With certain small files (too small to be valid vmcores), the above crash command spins at 100% CPU and never completes.

Version-Release number of selected component (if applicable):
Reproduced on crash-7.0.2-6.el7.x86_64 but other versions likely affected (perhaps upstream).

First observed on our rhel6 machine which was based on 7.0.6, and patched to include lzo and snappy compression - crash-7.0.6-1
$ rpm -q --changelog crash-7.0.6-1
* Thu May 22 2014 Dave Wysochanski <dwysocha> - 7.0.6-1
- Upstream 7.0.6 srpm plus lzo_snappy.patch

How reproducible:
Every time with certain files.

Steps to Reproduce:
$ crash --osrelease file

Actual results:
process spins with 100% CPU

Expected results:
command should complete

Additional info:
I'll attach sample input files.  We can't control what users upload, and this check is part of our validation of the file.  If for some reason this can't be fixed and there's a better way to identify the file before running this command, we can use that in our vmcore analysis system instead.  Otherwise, since the command just hangs, we'll have to add logic to detect the hang, kill the crash process, and fail the 'vmcore'.

Comment 3 Dave Wysochanski 2014-06-27 17:21:51 UTC
(gdb) run --osrelease vmcore-triggers-crash-loop-optimus-task-403254474
Starting program: /usr/bin/crash --osrelease vmcore-triggers-crash-loop-optimus-task-403254474
^C
Program received signal SIGINT, Interrupt.
0x00007ffff6b6ec10 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:81
81      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.2-16.el7.x86_64 libstdc++-4.8.2-16.el7.x86_64
(gdb) bt
#0  0x00007ffff6b6ec10 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000005451f2 in read (__nbytes=6406144, __buf=0x7ffff5f4d010, __fd=<optimized out>) at /usr/include/bits/unistd.h:44
#2  read_dump_header (file=0x7fffffffe0b0 "vmcore-triggers-crash-loop-optimus-task-403254474") at diskdump.c:622
#3  is_diskdump (file=file@entry=0x7fffffffe0b0 "vmcore-triggers-crash-loop-optimus-task-403254474") at diskdump.c:812
#4  0x00000000004673b3 in get_osrelease (dumpfile=0x7fffffffe0b0 "vmcore-triggers-crash-loop-optimus-task-403254474") at main.c:1757
#5  main (argc=3, argv=0x7fffffffdd08) at main.c:239

<install debuginfos>

(gdb) info reg
rax            0x0      0
rbx            0xec4fe0 15486944
rcx            0xffffffffffffffff       -1
rdx            0x61c000 6406144
rsi            0x7ffff5f4d010   140737319850000
rdi            0x7      7
rbp            0x7fffffffe0b0   0x7fffffffe0b0
rsp            0x7fffffffdb58   0x7fffffffdb58
r8             0xe58800 15042560
r9             0x0      0
r10            0x1      1
r11            0x246    582
r12            0xec5ff0 15491056
r13            0x0      0
r14            0x61c000 6406144
r15            0x7ffff5f4d010   140737319850000
rip            0x7ffff6b6ec10   0x7ffff6b6ec10 <__read_nocancel+7>
eflags         0x246    [ PF ZF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb) dis 0x7ffff6b6ec10
warning: bad breakpoint number at or near '0x7ffff6b6ec10'
(gdb) disassemble __read_nocancel
Dump of assembler code for function __read_nocancel:
   0x00007ffff6b6ec09 <+0>:     mov    $0x0,%eax
   0x00007ffff6b6ec0e <+5>:     syscall 
=> 0x00007ffff6b6ec10 <+7>:     cmp    $0xfffffffffffff001,%rax
   0x00007ffff6b6ec16 <+13>:    jae    0x7ffff6b6ec49 <read+73>
   0x00007ffff6b6ec18 <+15>:    retq   
End of assembler dump.

So we're actually spinning on 'syscall' instruction?

Comment 4 Dave Wysochanski 2014-06-27 17:38:10 UTC
It looks to me like the problem is in crash, perhaps due to a damaged file.
It looks like we're repeatedly calling 'read' syscall, ending up in a typical kernel path, and then calling it again.
I ran the loop below while crash was running in another window; it seemed to show we're doing read syscalls over and over (maybe to the same area, not sure).  A next step might be systemtap to see the params (offset, len) of the reads, or just moving up the stack to see what crash is doing with them.  Maybe a bug in one of these routines:
#2  read_dump_header (file=0x7fffffffe0b0 "vmcore-triggers-crash-loop-optimus-task-403254474") at diskdump.c:622
#3  is_diskdump (file=file@entry=0x7fffffffe0b0 "vmcore-triggers-crash-loop-optimus-task-403254474") at diskdump.c:812


# while true; do date; cat /proc/26477/stack; sleep 1; done
Fri Jun 27 13:32:47 EDT 2014
[<ffffffff811b04a8>] SyS_read+0x58/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:48 EDT 2014
[<ffffffffa028d259>] xfs_file_aio_read+0x189/0x2e0 [xfs]
[<ffffffff811af29d>] do_sync_read+0x8d/0xd0
[<ffffffff811af97c>] vfs_read+0x9c/0x170
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:49 EDT 2014
[<ffffffffa028d259>] xfs_file_aio_read+0x189/0x2e0 [xfs]
[<ffffffff811af29d>] do_sync_read+0x8d/0xd0
[<ffffffff811af97c>] vfs_read+0x9c/0x170
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:50 EDT 2014
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:51 EDT 2014
[<ffffffff811b04a8>] SyS_read+0x58/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:52 EDT 2014
[<ffffffff811b04a8>] SyS_read+0x58/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:53 EDT 2014
[<ffffffffa028d259>] xfs_file_aio_read+0x189/0x2e0 [xfs]
[<ffffffff811af29d>] do_sync_read+0x8d/0xd0
[<ffffffff811af97c>] vfs_read+0x9c/0x170
[<ffffffff811b0482>] SyS_read+0x32/0xb0
[<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:54 EDT 2014
[<ffffffff811b04a8>] SyS_read+0x58/0xb0
[<ffffffff815f2240>] sysret_audit+0x17/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:55 EDT 2014
[<ffffffffa028d224>] xfs_file_aio_read+0x154/0x2e0 [xfs]
[<ffffffff811af29d>] do_sync_read+0x8d/0xd0
[<ffffffff811af97c>] vfs_read+0x9c/0x170
[<ffffffff811b04a8>] SyS_read+0x58/0xb0
[<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Fri Jun 27 13:32:56 EDT 2014

Comment 5 Dave Wysochanski 2014-06-27 17:48:07 UTC
It looks like this may be a simple case of not checking for end of file (read returning 0) on a badly damaged vmcore.  We're stuck in the while(len) loop, repeatedly calling read() on the file with a length longer than the file size.  The read() syscall is returning 0, indicating 'end of file', but the code doesn't check for that.  We probably just need a one-byte patch.  Change this:
623                             if (bytes_read  < 0) {
To
623                             if (bytes_read  <= 0) {


(gdb) frame 1
#1  read_dump_header (file=0x7fffffffe0b0 "vmcore-triggers-crash-loop-optimus-task-403254474") at diskdump.c:622
622                             bytes_read = read(dd->dfd, bufptr, len);
(gdb) b 622
Breakpoint 1 at 0x5451f2: file diskdump.c, line 622.
(gdb) c
Continuing.

Breakpoint 1, read_dump_header (file=0x7fffffffe0b0 "vmcore-triggers-crash-loop-optimus-task-403254474") at diskdump.c:623
623                             if (bytes_read  < 0) {
(gdb) p bytes_read
$3 = 0
(gdb) c
Continuing.

Breakpoint 1, read_dump_header (file=0x7fffffffe0b0 "vmcore-triggers-crash-loop-optimus-task-403254474") at diskdump.c:623
623                             if (bytes_read  < 0) {
(gdb) p bytes_read
$4 = 0
619                     bufptr = dd->bitmap;
620                     len = bitmap_len;
621                     while (len) {
622                             bytes_read = read(dd->dfd, bufptr, len);
623                             if (bytes_read  < 0) {
624                                     error(INFO, "%s: cannot read memory bitmap\n",
625                                             DISKDUMP_VALID() ? "diskdump"
626                                             : "compressed kdump");
627                                     goto err;
(gdb) 
628                             }
629                             len -= bytes_read;
630                             bufptr += bytes_read;
631                     }
632             }
633
634             if (dump_is_partial(header))
635                     memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,
636                            bitmap_len/2);
637             else
(gdb) p len
$5 = 6406144

$ stat vmcore-triggers-crash-loop-optimus-task-403254474
  File: ‘vmcore-triggers-crash-loop-optimus-task-403254474’
  Size: 14216           Blocks: 32         IO Block: 4096   regular file
Device: fd02h/64770d    Inode: 268496715   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/dwysocha-local)   Gid: ( 1000/dwysocha-local)
Context: unconfined_u:object_r:user_home_t:s0
Access: 2014-06-27 13:11:30.069132675 -0400
Modify: 2014-06-27 13:11:30.125132587 -0400
Change: 2014-06-27 13:11:30.125132587 -0400

We're trying to read more bytes than the filesize.

Comment 6 Dave Wysochanski 2014-06-27 17:50:59 UTC
Created attachment 912894 [details]
0001-Fix-infinite-loop-in-crash-osrelease-with-damaged-vm.patch

Comment 7 Dave Anderson 2014-06-27 18:45:10 UTC
Hi Dave,

I appreciate your doing all the legwork here -- thanks!

Note that it would affect any other attempt to use a corrupted dumpfile
whose bitmap data is not wholly contained within the file.

Your fix is committed to the upstream git tree:

https://github.com/crash-utility/crash/commit/1767e9d4f791c351fb3c21daa1f72fb585e31823

  If a compressed kdump is damaged/truncated such that the bitmap data
  in the dumpfile header is not contained within the file, attempts
  to analyze it with a vmlinux file, or using the "crash --osrelease"
  or "crash --log" options with just the vmcore, will result in the
  crash utility spinning forever, endlessly performing reads of 0 bytes
  from the file without recognizing the EOF condition.
  (dwysocha)

Comment 8 Dave Wysochanski 2014-06-27 19:03:09 UTC
Thank you Dave A!

I built a RHEL6 based crash package and verified the fix.  We now exit properly as expected.  QE can use the attached files to verify, so adding TestCaseProvided.  Unpatched crash will just hang; patched, it exits as follows:

$ ./usr/bin/crash --osrelease vmcore-triggers-crash-osrelease-loop
crash: compressed kdump: cannot read memory bitmap
unknown
$ echo $?                                                                                                                  
1

Comment 15 errata-xmlrpc 2015-03-05 14:09:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0463.html

