Bug 154596

Summary: kernel panics when running memory test with 32gb ram
Product: Red Hat Enterprise Linux 3
Reporter: erik nguyen <erik.nguyen>
Component: kernel
Assignee: Jim Paradis <jparadis>
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Version: 3.0
CC: linux-bugs, peterm, petrides, tao, wtogami
Hardware: x86_64   
OS: Linux   
Last Closed: 2005-10-04 00:35:33 UTC
Attachments:
lspci -vvvx output (flags: none)

Description erik nguyen 2005-04-12 21:47:26 UTC
Description of problem:

- When running the memory test from rhr2-tests-0.9-14.2 on a dual-core SMP
Opteron system with RHEL 3 Update 5 beta and 32 GB of RAM, the kernel panics.
- I tried to install kernel-hugemem-2.4.21-31.EL.i686.rpm and
kernel-hugemem-unsupported-2.4.21-31.EL.i686.rpm, but kept getting depmod errors
like:

depmod: ELF file /lib/modules/2.4.21-31.ELhugemem/kernel/drivers/parport/parport.o not for this architecture
depmod: ELF file /lib/modules/2.4.21-31.ELhugemem/kernel/drivers/parport/parport_pc.o not for this architecture
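
The depmod messages above indicate that the module objects were built for a
different architecture than the running system. A minimal sketch of that kind
of check follows, assuming the mismatch can be seen in the e_machine field of
the ELF header (EM_386 = 3, EM_X86_64 = 62 per the ELF specification). The
function name elf_machine is hypothetical, and this is not a description of how
depmod itself is implemented.

```python
import struct

# e_machine values defined by the ELF specification.
EM_386 = 3
EM_X86_64 = 62


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file.

    Hypothetical helper for illustration; not taken from depmod.
    """
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_type occupies bytes 16-17, e_machine bytes 18-19.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return machine


def read_elf_machine(path: str) -> int:
    """Read e_machine from a file on disk, such as a kernel module object."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20))
```

On the system in this report, read_elf_machine() on one of the parport objects
from the i686 hugemem package would be expected to return EM_386, while the
installed x86_64 tools expect EM_X86_64.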

----------------------------


Version-Release number of selected component (if applicable):
rhr2-tests-0.9-14.2

How reproducible:

Reproducible every time

Steps to Reproduce:
1. Install RHEL 3 Update 5 (r3u5) on a dual-core SMP Opteron system with 32 GB
of RAM (not sure whether this happens only with dual-core CPUs or with
single-core CPUs as well).
2. Run the rhr2 memory test.

Actual results:


Starting MEMORY test... Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000078


printing rip:
ffffffff8011fd7b
PML4 71e394067 PGD 7c0f6e067 PMD 0
Oops: 0000
CPU 1
Pid: 30913, comm: diff Not tainted
RIP: 0010:[<ffffffff8011fd7b>]{schedule+571}
RSP: 0018:00000106948e1b28  EFLAGS: 00010046
RAX: 00000000000012c0 RBX: ffffffffffffffa0 RCX: ffffffff805ec5f0
RDX: ffffffff805ec5e8 RSI: 0000000000000000 RDI: ffffffff803f5cc0
RBP: 00000106948e1b58 R08: 00000000ffffffff R09: 000001066e1a2240
R10: 0000000000000017 R11: 0000000000000001 R12: 00000106948e0000
R13: ffffffff805ebcc0 R14: 0000000000000001 R15: 000000000000008c
FS:  0000002a95b010a0(0000) GS:ffffffff805e4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000078 CR3: 000000001e65e000 CR4: 00000000000006e0
Call Trace: [<ffffffff8013033f>]{schedule_timeout+175}
      [<ffffffff80130280>]{process_timeout+0}
[<ffffffff8012144a>]{io_schedule_timeout+42}
      [<ffffffff801f97a4>]{__get_request_wait+180}
[<ffffffff801fa17d>]{__make_request+893}
      [<ffffffff801fa45b>]{generic_make_request+331}
[<ffffffff801fa4d1>]{submit_bh_rsector+97}
      [<ffffffff8016339b>]{block_read_full_page+619}
[<ffffffffa004e010>]{:ext3:ext3_get_block+0}
      [<ffffffffa00508bd>]{:ext3:ext3_do_update_inode+701}
      [<ffffffff801431a7>]{page_cache_read+231}
[<ffffffff80143ea2>]{generic_file_readahead+274}
      [<ffffffff80144264>]{do_generic_file_read+740}
[<ffffffff80144a70>]{file_read_actor+0}
      [<ffffffff80144bc5>]{generic_file_new_read+165}
[<ffffffff8015fa52>]{sys_read+178}
      [<ffffffff801102a7>]{system_call+119}
Process diff (pid: 30913, stackpage=106948e1000)
Stack: 00000106948e1b28 0000000000000018 000001036419e000 0000000000000000
      000000000036633a 000000000036633a 0000000000000000 0000000000000000
      00000106948e1c18 00000106948e1bd8 ffffffff8013033f ffffffff803f6078
      0000010661eddb68 000000000036633a 00000106948e0000 ffffffff80130280
      0000000000000000 0000000000000000 ffffffff803f5cc0 00000107fff70830
      00000000000012c0 0000000000000000 ffffffff8012144a 0000000000000206
      0000000000000000 00000107fff70830 ffffffff801f97a4 0000000000000000
      00000106948e0000 0000000000000000 0000000000000000 0000000000000000
      00000101c95c1dc8 0000000000000000 00000106948e0000 000001036419fc28
      0000010661eddc28 0000000000000000 0000000800000002 0000000000000400
Call Trace: [<ffffffff8013033f>]{schedule_timeout+175}
      [<ffffffff80130280>]{process_timeout+0}
[<ffffffff8012144a>]{io_schedule_timeout+42}
      [<ffffffff801f97a4>]{__get_request_wait+180}
[<ffffffff801fa17d>]{__make_request+893}
      [<ffffffff801fa45b>]{generic_make_request+331}
[<ffffffff801fa4d1>]{submit_bh_rsector+97}
      [<ffffffff8016339b>]{block_read_full_page+619}
[<ffffffffa004e010>]{:ext3:ext3_get_block+0}
      [<ffffffffa00508bd>]{:ext3:ext3_do_update_inode+701}
      [<ffffffff801431a7>]{page_cache_read+231}
[<ffffffff80143ea2>]{generic_file_readahead+274}
      [<ffffffff80144264>]{do_generic_file_read+740}
[<ffffffff80144a70>]{file_read_actor+0}
      [<ffffffff80144bc5>]{generic_file_new_read+165}
[<ffffffff8015fa52>]{sys_read+178}
      [<ffffffff801102a7>]{system_call+119}
Code: 48 8b bb d8 00 00 00 44 89 73 3c 4d 8b 8c 24 e0 00 00 00 48
Kernel panic: Fatal exception
NMI Watchdog detected LOCKUP on CPU0, rip ffffffff80121de9, registers:
CPU 0
Pid: 30677, comm: diff Not tainted
RIP: 0010:[<ffffffff80121de9>]{.text.lock.sched+93}
RSP: 0018:000001036419fab8  EFLAGS: 00000082
RAX: 00000000000000ff RBX: ffffffff805eaa00 RCX: 0000000000000020
RDX: 0000000000000000 RSI: 0000000000000007 RDI: ffffffff805f2d40
RBP: 000001036419fb18 R08: 00000000ffffffff R09: 00000000000000ff
R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000
R13: 000001036419faec R14: 0000000000000000 R15: ffffffff805ebcc0
FS:  0000002a95b010a0(0000) GS:ffffffff805e3f80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a97a37000 CR3: 0000000000101000 CR4: 00000000000006e0
Call Trace: <EOE> [<ffffffff8011fca9>]{schedule+361}
      [<ffffffff8013033f>]{schedule_timeout+175}
[<ffffffff80130280>]{process_timeout+0}
      [<ffffffff8012144a>]{io_schedule_timeout+42}
[<ffffffff801f97a4>]{__get_request_wait+180}
      [<ffffffff801fa17d>]{__make_request+893}
[<ffffffff801fa45b>]{generic_make_request+331}
      [<ffffffff801fa4d1>]{submit_bh_rsector+97}
[<ffffffff8016339b>]{block_read_full_page+619}
      [<ffffffffa004e010>]{:ext3:ext3_get_block+0}
[<ffffffff80110b06>]{error_exit+0}
      [<ffffffff801431a7>]{page_cache_read+231}
[<ffffffff80143ea2>]{generic_file_readahead+274}
      [<ffffffff80144264>]{do_generic_file_read+740}
[<ffffffff80144a70>]{file_read_actor+0}
      [<ffffffff80144bc5>]{generic_file_new_read+165}
[<ffffffff8015fa52>]{sys_read+178}
      [<ffffffff801102a7>]{system_call+119}
Process diff (pid: 30677, stackpage=1036419f000)
Stack: 000001036419fab8 0000000000000018 0000000000000000 0000000000800000
      0000000000000000 0000000000000000 ffffffff803efb60 0000000000000002
      0000000000000000 0000000000000000 0000000000000000 000001001e664400
      0000000000000109 000001001e66d280 ffffff0000000000 000000fffffff000
      0000000000000000 000001001e665a80 0000000000000000 0000000000000000
      0000000100000001 0000000000000001 0000000000000000 0000000000000000
      0000000000000000 0000000000000000 0000000000000000 0000000000000000
      0000000000000000 00000101fffe3e58 0000000000000000 0000000000000000
      0000000000000000 0000000000000000 0000000000aaf5be 00000000000000ff
      0000000000000001 0000000000000000 0000000000000000 00000000000000fe
Call Trace:  <EOE> [<ffffffff8011fca9>]{schedule+361}
      [<ffffffff8013033f>]{schedule_timeout+175}
[<ffffffff80130280>]{process_timeout+0}
      [<ffffffff8012144a>]{io_schedule_timeout+42}
[<ffffffff801f97a4>]{__get_request_wait+180}
      [<ffffffff801fa17d>]{__make_request+893}
[<ffffffff801fa45b>]{generic_make_request+331}
      [<ffffffff801fa4d1>]{submit_bh_rsector+97}
[<ffffffff8016339b>]{block_read_full_page+619}
      [<ffffffffa004e010>]{:ext3:ext3_get_block+0}
[<ffffffff80110b06>]{error_exit+0}
      [<ffffffff801431a7>]{page_cache_read+231}
[<ffffffff80143ea2>]{generic_file_readahead+274}
      [<ffffffff80144264>]{do_generic_file_read+740}
[<ffffffff80144a70>]{file_read_actor+0}
      [<ffffffff80144bc5>]{generic_file_new_read+165}
[<ffffffff8015fa52>]{sys_read+178}
      [<ffffffff801102a7>]{system_call+119}
Code: f3 90 7e f8 e9 75 d4 ff ff 80 3f 00 f3 90 7e f9 e9 d7 d6 ff
console shuts up ...
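
The fault address in the first oops can be cross-checked against the register
dump. The sketch below assumes that the "Code:" bytes begin at the faulting
RIP, which is not stated in the report. Under that assumption, the first seven
bytes (48 8b bb d8 00 00 00) decode as a 64-bit load from 0xd8(%rbx) into %rdi,
and RBX plus 0xd8 wraps around to the address reported in CR2. The helper names
are hypothetical, and the decoder handles only this one instruction form.

```python
MASK64 = (1 << 64) - 1
# Register numbering for ModRM fields when no REX.R/REX.B extension is set.
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_disp32(code: bytes):
    """Decode 'REX.W 8B /r' with mod=10 (register base + 32-bit displacement).

    Returns (destination register, base register, displacement).
    Hypothetical helper; any other encoding raises ValueError.
    """
    rex, opcode, modrm = code[0], code[1], code[2]
    if rex != 0x48 or opcode != 0x8B or (modrm >> 6) != 0b10:
        raise ValueError("unsupported instruction form")
    rm = modrm & 7
    if rm == 4:
        raise ValueError("SIB addressing not handled")
    reg = (modrm >> 3) & 7
    disp = int.from_bytes(code[3:7], "little", signed=True)
    return REGS[reg], REGS[rm], disp


def effective_address(base: int, disp: int) -> int:
    """Add base and displacement with 64-bit wraparound."""
    return (base + disp) & MASK64


code = bytes.fromhex("488bbbd8000000")   # first bytes of the "Code:" line
rbx = 0xFFFFFFFFFFFFFFA0                 # RBX from the register dump
dest, base, disp = decode_mov_disp32(code)
fault = effective_address(rbx, disp)
print(f"mov {disp:#x}(%{base}),%{dest} -> fault address {fault:#018x}")
```

The computed address equals the CR2 value 0000000000000078 from the oops, so
the "NULL pointer dereference" here is a load through a small negative pointer
(RBX = -0x60) rather than through a literal zero.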

Expected results:


Additional info:

Comment 1 Warren Togami 2005-04-14 07:51:00 UTC
Why did you report this against memtest86+?  memtest86+ has nothing to do with
the kernel.

Also, you cannot run an i686 hugemem kernel with an x86_64 64-bit userspace.
i686 kernels can only run 32-bit programs.

Comment 2 Christopher P Johnson 2005-04-14 17:52:37 UTC
Oops, sorry, thanks for moving it to kernel, where it belongs.

Please ignore the comments about the hugemem i686 kernel - this is happening
with x86_64 (the tester recalled the multiple kernels provided on i686, and
thought he should also test with the hugemem kernel provided as part of x86_64
bits).

The problem is simply 32gb memory + rhel 3 update 5 beta + v40z = kernel panic
with memtest.

Comment 3 erik nguyen 2005-04-15 22:05:35 UTC
Created attachment 113259 [details]
lspci -vvvx output

Comment 4 Warren Togami 2005-04-18 12:14:26 UTC
http://www.memtest.org/
Please test with memtest86+ for at least 24 hours to be 100% sure that all 32 GB
of RAM is good and that there is no other hardware trouble.  Yes, this is
unlikely, but it is better to be sure than to waste engineers' time, both yours
and ours.


Comment 5 Christopher P Johnson 2005-04-19 02:32:28 UTC
The failure is proving elusive and difficult to reproduce: we've had 3 systems
rerunning the tests for multiple days without recreating it. A processor
erratum about stale TLB entries was recently announced; this may be the root
cause.

Comment 6 Ernie Petrides 2005-10-04 00:35:33 UTC
Closing due to expectation that this was not a kernel problem.