Bug 98327

Summary: [x86_64] Reproducible kernel Oops in fs code (caused by a huge malloc?)
Product: [Retired] Red Hat Raw Hide
Component: kernel
Version: 1.0
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Reporter: Aleksey Nogin <aleksey>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
CC: crt, jyh
URL: http://people.redhat.com/arjanv/amd64/
Doc Type: Bug Fix
Last Closed: 2004-01-07 05:31:00 UTC

Description Aleksey Nogin 2003-07-01 04:19:57 UTC
While trying to compile OCaml under the current x86_64 Rawhide tree with
kernel-smp-2.4.20-18.9 from http://people.redhat.com/arjanv/amd64/, I got the
following Oops:

ocamlrun[4543]: segfault at 000000000000002f rip 0000002a95c9d455 rsp
0000007fbfffd7b0 error 4
Unable to handle kernel paging request at virtual address ffffffffffffffff
 printing rip:
ffffffff80169bcb
PML4 103027 PGD 2067 PMD 0
Oops: 0002
CPU 0
Pid: 4543, comm: ocamlrun Not tainted
RIP: 0010:[<ffffffff80169bcb>]{inode_init_once+11}
RSP: 0000:00000103fa233a90  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000040 RCX: 000000000000005f
RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffffffffffffffff
RBP: ffffffffffffffff R08: 0000000000000000 R09: 00000100164fff98
R10: 0000000000000001 R11: 0000000000000000 R12: 000001001719a840
R13: 00000000000001f0 R14: 0000000000000000 R15: 00000103dfffe000
FS:  0000000000525e80(0000) GS:ffffffff804b8fc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffff CR3: 0000000000101000 CR4: 00000000000006e0

Call Trace: [<ffffffff80141faa>]{kmem_cache_grow+634}
       [<ffffffffa00157c5>]{:ext3:ext3_getblk+181}
[<ffffffff801427e1>]{kmem_cache_alloc+641}
       [<ffffffff80169a6d>]{alloc_inode+45} [<ffffffff8016b5a9>]{new_inode+9}
       [<ffffffff801509a2>]{end_buffer_io_sync+34}
[<ffffffffa001393b>]{:ext3:ext3_new_inode+75}
       [<ffffffffa0008f8c>]{:jbd:.rodata.str1.1+60}
[<ffffffffa0000332>]{:jbd:start_this_handle+306}
       [<ffffffffa0019634>]{:ext3:ext3_create+164}
[<ffffffff8015defe>]{vfs_create+430}
       [<ffffffff8015dbd3>]{lookup_hash+243} [<ffffffff8015e0fe>]{open_namei+350}
       [<ffffffff8014e333>]{filp_open+51} [<ffffffff8015aee1>]{do_coredump+289}
       [<ffffffff8012dc7a>]{collect_signal+202} [<ffffffff8010ee70>]{do_signal+768}
       [<ffffffff8015c2d0>]{permission+224}
[<ffffffff8010f99a>]{error_signal_test+0}

Process ocamlrun (pid: 4543, stackpage=103fa233000)
Stack: 00000103fa233a90 0000000000000000 ffffffff80141faa 000001001719a87c
       0000000000000001 0000000000000000 0000010001000048 0000000000000202
       ffffffffa00157c5 000001001719a87c 0000000000000000 00000103fceea928
       00000000000001f0 000001001719a840 00000103fceea928 0000000000000000
       00000103e07a4258 0000000000000180 ffffffff801427e1 0000000000000010
       0000000000000246 0000000000000001 00000103fa233c48 0000000000000000
       00000103fed75800 00000103fceea928 00000103fc0c3a60 00000103e07a4258
       ffffffff80169a6d 4e4d4c4b4a494847 0000000000008180 00000103fed75800
       ffffffff8016b5a9 3736353433323130 ffffffff801509a2 00000103fcb5b988
       ffffffffa001393b 000001007a797877 0000000000000001 0000000000000000
Call Trace: [<ffffffff80141faa>]{kmem_cache_grow+634}
       [<ffffffffa00157c5>]{:ext3:ext3_getblk+181}
[<ffffffff801427e1>]{kmem_cache_alloc+641}
       [<ffffffff80169a6d>]{alloc_inode+45} [<ffffffff8016b5a9>]{new_inode+9}
       [<ffffffff801509a2>]{end_buffer_io_sync+34}
[<ffffffffa001393b>]{:ext3:ext3_new_inode+75}
       [<ffffffffa0008f8c>]{:jbd:.rodata.str1.1+60}
[<ffffffffa0000332>]{:jbd:start_this_handle+306}
       [<ffffffffa0019634>]{:ext3:ext3_create+164}
[<ffffffff8015defe>]{vfs_create+430}
       [<ffffffff8015dbd3>]{lookup_hash+243} [<ffffffff8015e0fe>]{open_namei+350}
       [<ffffffff8014e333>]{filp_open+51} [<ffffffff8015aee1>]{do_coredump+289}
       [<ffffffff8012dc7a>]{collect_signal+202} [<ffffffff8010ee70>]{do_signal+768}
       [<ffffffff8015c2d0>]{permission+224}
[<ffffffff8010f99a>]{error_signal_test+0}


Code: f3 48 ab 48 8d 96 18 01 00 00 48 b9 01 00 00 00 ad 4e ad de


"cat /proc/version":

Linux version 2.4.20-18.9smp (bhcompile.redhat.com) (gcc version
3.2.2 20030222 (Red Hat Linux 3.2.2-5)) #1 SMP Thu May 29 06:45:34 EDT 2003

Comment 1 Aleksey Nogin 2003-07-01 04:31:39 UTC
This seems to be perfectly reproducible: after a reboot I tried compiling again,
and at the exact same place (sh ./runocamldoc true -man -d stdlib_man ...) the
machine Oopsed and froze.

Comment 2 Aleksey Nogin 2003-07-02 03:29:10 UTC
The same problem exists in 2.5.69-ac1, but it exhibits itself in a somewhat
different way. At the same place in the OCaml compilation process, I get:

ocamlrun[4737] segfault at rip:2a95bda18d rsp:7fbfffe860 adr:fffffffffffffff7 err:4
Slab corruption: start=00000103dfffe000, expend=00000103dfffefff,
problemat=00000103dfffe000
Data: FF FF [... a huge number of FFs ...] FF
Next: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF
slab error in check_poison_obj(): cache `size-4096': object was modified after
freeing

Call Trace:<ffffffff801d7432>{journal_get_undo_access+130}
<ffffffff80164cc7>{kmalloc+231}
       <ffffffff801d7432>{journal_get_undo_access+130}
<ffffffff801c4775>{ext3_new_block+933}
       <ffffffff801c7101>{ext3_alloc_block+17}
<ffffffff801c74d5>{ext3_alloc_branch+85}
       <ffffffff801c7b7f>{ext3_get_block_handle+751}
<ffffffff80184d1f>{alloc_buffer_head+111}
       <ffffffff80181f86>{create_buffers+102}
<ffffffff8018310d>{__block_prepare_write+333}
       <ffffffff801c7c40>{ext3_get_block+0}
<ffffffff80183c0a>{block_prepare_write+26}
       <ffffffff801c842e>{ext3_prepare_write+302}
<ffffffff8015daa1>{generic_file_aio_write_nolock+1297}
       <ffffffff8016e47d>{do_anonymous_page+1661}
<ffffffff801c79ad>{ext3_get_block_handle+285}
       <ffffffff8015e01d>{generic_file_aio_write+109}
<ffffffff801c57f3>{ext3_file_write+35}
       <ffffffff8017f343>{do_sync_write+115}
<ffffffff801d7961>{journal_dirty_metadata+465}
       <ffffffff80164469>{cache_alloc_refill+1129}
<ffffffff80162b9c>{check_poison_obj+60}
       <ffffffff801b0956>{elf_core_dump+262} <ffffffff80164cc7>{kmalloc+231}
       <ffffffff801b02c2>{dump_write+18} <ffffffff801b0dce>{elf_core_dump+1406}
       <ffffffff801a57af>{__mark_inode_dirty+47}
<ffffffff8019f043>{notify_change+483}
       <ffffffff8017d385>{do_truncate+69} <ffffffff8018d2e4>{do_coredump+452}
       <ffffffff80147e18>{__dequeue_signal+392}
<ffffffff8014a85c>{get_signal_to_deliver+1548}
       <ffffffff801206bc>{do_page_fault+668} <ffffffff80111a8d>{do_signal+125}
       <ffffffff80171514>{do_brk+340} <ffffffff80112360>{retint_signal+62}

Assertion failure in ext3_new_block() at fs/ext3/balloc.c:562:
"!(__builtin_constant_p((ret_block)) ?
constant_test_bit(((ret_block)),((unsigned
long*)bh2jh(bitmap_bh)->b_committed_data)) :
variable_test_bit(((ret_block)),((unsigned
long*)bh2jh(bitmap_bh)->b_committed_data)))"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at balloc:562
invalid operand: 0000 [1]
CPU 0
Pid: 4737, comm: ocamlrun Not tainted
RIP: 0010:[<ffffffff801c4948>] <ffffffff801c4948>{ext3_new_block+1400}
RSP: 0018:00000103fc511688  EFLAGS: 00010212
RAX: 0000000000000119 RBX: 0000000000020875 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000103fe789af0
RBP: 00000103ffcc8400 R08: 0000000000000000 R09: 0000000000000720
R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000004 R14: 00000103ed92d080 R15: 00000103f05a7168
FS:  0000002a95571fe0(0000) GS:ffffffff8046f380(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: fffffffffffffff7 CR3: 0000000000101000 CR4: 00000000000006a0

Call Trace:<ffffffff801c4948>{ext3_new_block+1400}
<ffffffff801c7101>{ext3_alloc_block+17}
       <ffffffff801c74d5>{ext3_alloc_branch+85}
<ffffffff801c7b7f>{ext3_get_block_handle+751}
       <ffffffff80184d1f>{alloc_buffer_head+111}
<ffffffff80181f86>{create_buffers+102}
       <ffffffff8018310d>{__block_prepare_write+333}
<ffffffff801c7c40>{ext3_get_block+0}
       <ffffffff80183c0a>{block_prepare_write+26}
<ffffffff801c842e>{ext3_prepare_write+302}
       <ffffffff8015daa1>{generic_file_aio_write_nolock+1297}
       <ffffffff8016e47d>{do_anonymous_page+1661}
<ffffffff801c79ad>{ext3_get_block_handle+285}
       <ffffffff8015e01d>{generic_file_aio_write+109}
<ffffffff801c57f3>{ext3_file_write+35}
       <ffffffff8017f343>{do_sync_write+115}
<ffffffff801d7961>{journal_dirty_metadata+465}
       <ffffffff80164469>{cache_alloc_refill+1129}
<ffffffff80162b9c>{check_poison_obj+60}
       <ffffffff801b0956>{elf_core_dump+262} <ffffffff80164cc7>{kmalloc+231}
       <ffffffff801b02c2>{dump_write+18} <ffffffff801b0dce>{elf_core_dump+1406}
       <ffffffff801a57af>{__mark_inode_dirty+47}
<ffffffff8019f043>{notify_change+483}
       <ffffffff8017d385>{do_truncate+69} <ffffffff8018d2e4>{do_coredump+452}
       <ffffffff80147e18>{__dequeue_signal+392}
<ffffffff8014a85c>{get_signal_to_deliver+1548}
       <ffffffff801206bc>{do_page_fault+668} <ffffffff80111a8d>{do_signal+125}
       <ffffffff80171514>{do_brk+340} <ffffffff80112360>{retint_signal+62}

Process ocamlrun (pid: 4737, stackpage=103f99c0460)
Stack: 0000000000000005 00000103ffcc8498 00000103edc4f400 0000000000000001
       0000087500020875 00000103ee216320 00000103fc51174c 00000103ee009780
       00000103fa903c80 aaaaaaaaaaaaaaab
Call Trace:<ffffffff801c7101>{ext3_alloc_block+17}
<ffffffff801c74d5>{ext3_alloc_branch+85}
       <ffffffff801c7b7f>{ext3_get_block_handle+751}
<ffffffff80184d1f>{alloc_buffer_head+111}
       <ffffffff80181f86>{create_buffers+102}
<ffffffff8018310d>{__block_prepare_write+333}
       <ffffffff801c7c40>{ext3_get_block+0}
<ffffffff80183c0a>{block_prepare_write+26}
       <ffffffff801c842e>{ext3_prepare_write+302}
<ffffffff8015daa1>{generic_file_aio_write_nolock+1297}
       <ffffffff8016e47d>{do_anonymous_page+1661}
<ffffffff801c79ad>{ext3_get_block_handle+285}
       <ffffffff8015e01d>{generic_file_aio_write+109}
<ffffffff801c57f3>{ext3_file_write+35}
       <ffffffff8017f343>{do_sync_write+115}
<ffffffff801d7961>{journal_dirty_metadata+465}
       <ffffffff80164469>{cache_alloc_refill+1129}
<ffffffff80162b9c>{check_poison_obj+60}
       <ffffffff801b0956>{elf_core_dump+262} <ffffffff80164cc7>{kmalloc+231}
       <ffffffff801b02c2>{dump_write+18} <ffffffff801b0dce>{elf_core_dump+1406}
       <ffffffff801a57af>{__mark_inode_dirty+47}
<ffffffff8019f043>{notify_change+483}
       <ffffffff8017d385>{do_truncate+69} <ffffffff8018d2e4>{do_coredump+452}
       <ffffffff80147e18>{__dequeue_signal+392}
<ffffffff8014a85c>{get_signal_to_deliver+1548}
       <ffffffff801206bc>{do_page_fault+668} <ffffffff80111a8d>{do_signal+125}
       <ffffffff80171514>{do_brk+340} <ffffffff80112360>{retint_signal+62}


Code: 0f 0b b1 da 32 80 ff ff ff ff 32 02 48 8b 74 24 28 48 8b 7c


Comment 3 Aleksey Nogin 2003-07-02 03:41:59 UTC
I also filed http://bugme.osdl.org/show_bug.cgi?id=862 for the 2.5.69-ac1 crash.

Comment 4 Aleksey Nogin 2003-07-02 04:14:40 UTC
I tried (under 2.5.69-ac1) mounting the partition as ext2 and it still crashed.

Comment 5 Aleksey Nogin 2003-07-02 11:11:57 UTC
According to Xavier Leroy (OCaml author), this step of the OCaml compilation is
probably doing a huge malloc:

> This is a problem we've seen on other 64-bit Linux platforms, and it
> is due to the fact that malloc() can return *widely* spaced pointers.
> Since OCaml likes to maintain a table of memory pages it has
> allocated, this causes the page table to become *huge* and its
> allocation fails.  
>
> The workaround is ...
>
> However, a failed malloc() request shouldn't cause a kernel oops 
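
To make the quoted explanation concrete, here is a rough sketch of why widely
spaced pointers inflate a flat page table. It is only an illustration: the
4 KiB page size, the one-entry-per-page layout, the helper name
page_table_entries, and the chunk addresses are all assumptions made for this
example, not details taken from the OCaml runtime or from this report.

```python
# Illustrative sketch only: the 4 KiB page size and one-entry-per-page layout
# are assumptions, not taken from the OCaml runtime sources.
PAGE_SIZE = 4096


def page_table_entries(addresses, page_size=PAGE_SIZE):
    """Entries a flat table needs to span every page from the lowest
    to the highest address in `addresses` (hypothetical helper)."""
    if not addresses:
        return 0
    first_page = min(addresses) // page_size
    last_page = max(addresses) // page_size
    return last_page - first_page + 1


# Hypothetical chunk addresses: two close together, two far apart.
close_chunks = [0x0000000000600000, 0x0000000000700000]
far_chunks = [0x0000000000600000, 0x0000002A95800000]

for label, chunks in (("close", close_chunks), ("far", far_chunks)):
    print(label, page_table_entries(chunks))
```

The table size depends on the distance between the lowest and highest address,
not on how much memory is actually allocated, so two small chunks that land far
apart in a 64-bit address space can demand a table far larger than the chunks
themselves.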

Comment 6 Aleksey Nogin 2004-01-07 05:31:00 UTC
It turned out that the machines had a buggy version of the BIOS. Upgrading
the BIOS solved a lot of problems. I am not sure whether this particular one
was also solved, but it probably was.