Bug 168605 - Fedora Core 4 Kernel Oops whenever any system activity on TYAN S2892 Thunder K8SE Motherboard
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Dave Jones
QA Contact: Brian Brock
URL: http://www.kangin.org/FC4-TYAN_K8SE_S...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-09-17 21:05 UTC by Vladimir Kangin
Modified: 2015-01-04 22:22 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-05-05 01:33:23 UTC
Type: ---
Embargoed:



Description Vladimir Kangin 2005-09-17 21:05:16 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.8) Gecko/20050524 Fedora/1.0.4-4 Firefox/1.0.4

Description of problem:
After the FC4 installation completes, any high system activity causes a kernel Oops with the following dump: http://www.kangin.org/FC4-TYAN_K8SE_S2892-Oopsing.txt
This happens on any attempt to wget an ISO file or to update the system with 'yum update'.
Updating the kernel did not solve the problem, and a second identical system shows the same behaviour.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.11-1.1369_FC4
kernel-smp-2.6.12-1.1447_FC4
kernel-2.6.12-1.1447_FC4
kernel-2.6.11-1.1369_FC4

How reproducible:
Always

Steps to Reproduce:
1. Install FC4 on a TYAN S2892 Thunder K8SE with 2x AMD Opteron 252. The remaining parts appear to be irrelevant: the system was tested with different sets of memory, disks, etc., and the result was always the same.
2. Trigger the Oops:
Option 1. Run "yum update" and wait until "Apply Transactions"; the system will then crash.
Option 2. Run wget http://www.gtlib.cc.gatech.edu/pub/fedora.redhat/linux/core/4/x86_64/iso/FC4-x86_64-DVD.iso
 and wait until 7-10% of the download has completed; the system will then crash.

Actual Results:  The system hangs completely with one of these two errors:
kernel: Unable to handle kernel paging request at 0000000000004b20 RIP:
(Causes a hard crash immediately!)
or
kernel: Unable to handle kernel paging request at 0000000000004b30 RIP:
(Causes a crash, but the system remains pingable for a period of time!)

Expected Results:  Do not crash :)

Additional info:

http://bugzilla.kernel.org/show_bug.cgi?id=5272

Comment 1 Vladimir Kangin 2005-09-17 22:15:26 UTC
Please note that the problem does not appear with kernels kernel-2.6.12-1.1447_FC4
and kernel-2.6.11-1.1369_FC4, but the SMP kernels kernel-smp-2.6.11-1.1369_FC4 and
kernel-smp-2.6.12-1.1447_FC4 do Oops.

Vladimir Kangin

Comment 2 Warren Togami 2005-09-17 23:17:21 UTC
http://people.redhat.com/wtogami/temp/1398/
Can you do the same tests on 1398 and report back your findings?

http://people.redhat.com/davej/kernels/Fedora/FC4/
Please also try 1455+ from here.

Comment 3 Vladimir Kangin 2005-09-18 12:48:40 UTC
Dear Warren,

It seems to be working well with 1398. Would you like feedback on
1455+ too?

Could you shed some light on the problem I ran into, please?

Thanks in advance,
Vladimir Kangin 

Comment 4 Vladimir Kangin 2005-09-18 13:07:51 UTC
Dear Warren,

I was too quick to confirm success :(

# wget
http://www.gtlib.cc.gatech.edu/pub/fedora.redhat/linux/core/4/x86_64/iso/FC4-x86_64-DVD.iso
--07:42:29-- 
http://www.gtlib.cc.gatech.edu/pub/fedora.redhat/linux/core/4/x86_64/iso/FC4-x86_64-DVD.iso
           => `FC4-x86_64-DVD.iso'
Resolving www.gtlib.cc.gatech.edu... 130.207.108.135, 130.207.108.136,
130.207.108.134
Connecting to www.gtlib.cc.gatech.edu[130.207.108.135]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,932,850,688 [text/plain]

11% [===========>                                                              
                                    ] 337,846,248  546.46K/s  ETA 1:17:06Killed

cat /var/log/messages

Sep 18 07:52:33 s1-ams kernel: Unable to handle kernel paging request at
0000000000004b30 RIP:
Sep 18 07:52:33 s1-ams kernel: <ffffffff80169c22>{free_pages+210}
Sep 18 07:52:33 s1-ams kernel: PGD 139231067 PUD 1396bb067 PMD 0
Sep 18 07:52:33 s1-ams kernel: Oops: 0000 [1] SMP
Sep 18 07:52:33 s1-ams kernel: CPU 1
Sep 18 07:52:33 s1-ams kernel: Modules linked in: md5 ipv6 parport_pc lp parport
autofs4 rfcomm l2cap bluetooth sunrpc pcmcia yenta_socket rsrc_nonstatic
Sep 18 07:52:33 s1-ams kernel: Pid: 2944, comm: wget Not tainted
2.6.12-1.1398_FC4smp
Sep 18 07:52:33 s1-ams kernel: RIP: 0010:[<ffffffff80169c22>]
<ffffffff80169c22>{free_pages+210}
Sep 18 07:52:33 s1-ams kernel: RSP: 0018:ffff81013f569db0  EFLAGS: 00010216
Sep 18 07:52:33 s1-ams kernel: RAX: 0000000000000003 RBX: 0000000000000000 RCX:
0000000000000003
Sep 18 07:52:33 s1-ams kernel: RDX: 0000000000120c00 RSI: 0000000000000000 RDI:
ffff810120c00000
Sep 18 07:52:33 s1-ams kernel: RBP: ffff810120c00010 R08: 0000000000000018 R09:
0000000000000000
Sep 18 07:52:33 s1-ams kernel: R10: 0000000000000000 R11: ffffffff80319c70 R12:
ffff810120c00010
Sep 18 07:52:33 s1-ams kernel: R13: ffff810120c00000 R14: 0000000000000041 R15:
0000000000000104
Sep 18 07:52:33 s1-ams kernel: FS:  00002aaaaaab6c40(0000)
GS:ffffffff8050d800(0000) knlGS:0000000000000000
Sep 18 07:52:33 s1-ams kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 18 07:52:33 s1-ams kernel: CR2: 0000000000004b30 CR3: 0000000137d4f000 CR4:
00000000000006e0
Sep 18 07:52:33 s1-ams kernel: Process wget (pid: 2944, threadinfo
ffff81013f568000, task ffff81013ef2e840)
Sep 18 07:52:33 s1-ams kernel: Stack: ffffffff8019d9d5 0000000000000008
ffff8101366fc7c0 0000000000000010
Sep 18 07:52:33 s1-ams kernel:        0000000000000004 000000000000003c
ffffffff8019df7b 412262b752f1a9fc
Sep 18 07:52:33 s1-ams kernel:        ffff81013f569f40 ffff81013f569f08
Sep 18 07:52:33 s1-ams kernel: Call Trace:<ffffffff8019d9d5>{poll_freewait+85}
<ffffffff8019df7b>{do_select+1179}
Sep 18 07:52:33 s1-ams kernel:        <ffffffff8019d9f0>{__pollwait+0}
<ffffffff8019e5cd>{sys_select+637}
Sep 18 07:52:33 s1-ams kernel:        <ffffffff8010ebf6>{tracesys+209}
Sep 18 07:52:33 s1-ams kernel:
Sep 18 07:52:33 s1-ams kernel: Code: 49 8b 89 30 4b 00 00 48 39 ca 72 49 48 b8
ff ff ff 7f ff ff
Sep 18 07:52:33 s1-ams kernel: RIP <ffffffff80169c22>{free_pages+210} RSP
<ffff81013f569db0>
Sep 18 07:52:33 s1-ams kernel: CR2: 0000000000004b30
Sep 18 07:52:34 s1-ams kernel:  <3>Debug: sleeping function called from invalid
context at include/linux/rwsem.h:43
Sep 18 07:52:34 s1-ams kernel: in_atomic():0, irqs_disabled():1
Sep 18 07:52:34 s1-ams kernel:
Sep 18 07:52:34 s1-ams kernel: Call
Trace:<ffffffff8013abd5>{profile_task_exit+21} <ffffffff8013bff2>{do_exit+34}
Sep 18 07:52:34 s1-ams kernel:        <ffffffff80265f18>{do_unblank_screen+40}
<ffffffff80124286>{do_page_fault+1926}
Sep 18 07:52:34 s1-ams kernel:        <ffffffff8035ac32>{thread_return+0}
<ffffffff8010f5b5>{error_exit+0}
Sep 18 07:52:34 s1-ams kernel:        <ffffffff80319c70>{tcp_poll+0}
<ffffffff80169c22>{free_pages+210}
Sep 18 07:52:34 s1-ams kernel:        <ffffffff8019d9d5>{poll_freewait+85}
<ffffffff8019df7b>{do_select+1179}
Sep 18 07:52:34 s1-ams kernel:        <ffffffff8019d9f0>{__pollwait+0}
<ffffffff8019e5cd>{sys_select+637}
Sep 18 07:52:34 s1-ams kernel:        <ffffffff8010ebf6>{tracesys+209}

I will try 1455+ in a minute.

Regards,
Vladimir Kangin
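
For comparing this trace with other reports, the key fields (fault address, faulting symbol and offset) can be pulled out of the log mechanically. The following is only a sketch: the function name `summarize_oops` and the regular expressions are illustrative assumptions, and the sample is a shortened copy of the log above, including Bugzilla's line wrapping.

```python
import re

# Shortened copy of the log above. Bugzilla wrapped the long lines, so the
# fault address sits on the line after "paging request at".
SAMPLE = """\
Sep 18 07:52:33 s1-ams kernel: Unable to handle kernel paging request at
0000000000004b30 RIP:
Sep 18 07:52:33 s1-ams kernel: <ffffffff80169c22>{free_pages+210}
Sep 18 07:52:33 s1-ams kernel: Oops: 0000 [1] SMP
Sep 18 07:52:33 s1-ams kernel: Pid: 2944, comm: wget Not tainted
2.6.12-1.1398_FC4smp
"""

FAULT_RE = re.compile(
    r"Unable to handle kernel paging request at\s+([0-9a-f]{16})"
)
SYMBOL_RE = re.compile(r"<([0-9a-f]{16})>\{(\w+)\+(\d+)\}")


def summarize_oops(text):
    """Return fault address and first symbol of an oops, or None."""
    # Collapse all whitespace so wrapped lines are matched as one string.
    flat = " ".join(text.split())
    fault = FAULT_RE.search(flat)
    symbol = SYMBOL_RE.search(flat)
    if fault is None or symbol is None:
        return None
    return {
        "fault_address": int(fault.group(1), 16),
        "rip": int(symbol.group(1), 16),
        "symbol": symbol.group(2),
        "offset": int(symbol.group(3)),
    }


if __name__ == "__main__":
    summary = summarize_oops(SAMPLE)
    print(hex(summary["fault_address"]), summary["symbol"], summary["offset"])
```

Two reports with the same `fault_address` and `symbol` are candidates for being the same bug, although matching fields alone do not prove it.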





Comment 5 Vladimir Kangin 2005-09-18 13:17:03 UTC
Dear Warren,

Sep 18 08:14:28 s1-ams kernel: Unable to handle kernel paging request at
0000000000004b30 RIP:
Sep 18 08:14:29 s1-ams kernel: <ffffffff8016239a>{free_pages+210}
Sep 18 08:14:29 s1-ams kernel: PGD 13fa2e067 PUD 1382b5067 PMD 0
Sep 18 08:14:29 s1-ams kernel: Oops: 0000 [1] SMP

The same problem, but the system lasted a bit longer :) It reached 13% of the DVD ISO file.

Please advise,
Vladimir Kangin

Comment 6 Dave Jones 2005-09-24 01:29:48 UTC
Please retry with the latest kernel errata released yesterday (2.6.12-1.1456_FC4)

There's also a more experimental newer test kernel at
http://people.redhat.com/davej/kernels/Fedora/FC4/

Comment 7 Vladimir Kangin 2005-09-24 12:05:38 UTC
Dear Dave,

The same problem with kernel-smp 2.6.12-1.1456_FC4 ;(

Vladimir Kangin

Comment 8 Dave Jones 2005-09-30 06:45:08 UTC
Mass update to all FC4 bugs:

An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream
kernel (2.6.13.2). As there were ~3500 changes upstream between this and the
previous kernel, it's possible your bug has been fixed already.

Please retest with this update, and update this bug if necessary.

Thanks.


Comment 9 Dan Carpenter 2005-10-12 05:15:10 UTC
I saw this bug on a couple of new dual Opteron systems today.

One system was running a stock FC4 kernel and the other was running a
2.6.11-1.35 FC3 kernel; it also happens on the FC2 kernel.

The unique thing was that both systems had 2 sticks of 2G RAM on the first CPU.

The systems do not crash in FC3 and FC4 when I use "numa=off" but they still
crash in FC2 when I use that.

The systems do not crash when one stick of RAM is on the first CPU and one is on
the second CPU.

The systems do not crash when I run 2,4 or 6 sticks of 1G RAM.

I did try to install the newest errata kernel but it failed for an unrelated
reason and I didn't have time to debug it.

I searched through bugzilla, and bug 165285 and bug 168907 also show a similar
error message, failing on a paging request at 0000000000004b30.

I'll try to figure out the issues with the errata kernel tomorrow, but I just wanted
to post this before I forgot.  Could the other people seeing this bug post what
RAM they were using?



Comment 10 Vladimir Kangin 2005-10-12 06:24:09 UTC
Hi Dan,

It is 2 sticks of 2G RAM on the first CPU. The memory sticks are from Kingston:
KVR400D4R3A/2G 2GB PC3200 REG CL3 ECC 184-pin DIMM

I will try "numa=off" on one of the systems as a test.

Comment 11 Dan Carpenter 2005-10-12 08:44:59 UTC
Rock on.

This bug reminds me of bug 160135.  In bug 160135 the error address was
00000000000018f0.

The comment in #13 says "the crash here is happening in 'pfn_to_page' which is
called by virt_to_page. It seems that NODE_DATA(nid) is NULL and thus
node_start_pfn is NULL."  A null nid would probably explain all the leading
zeros in this bug as well.

The comment #66 from 160135 is someone with our bug here.  ;)

Anyway...  It might already be fixed in the newest errata kernel if I can make
that work.
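
The "leading zeros" reading can be checked against the oops in comment 4. The Code: line there begins with the bytes 49 8b 89 30 4b 00 00, and the register dump shows R09 as 0. The sketch below decodes just that one instruction form by hand (REX prefix, opcode 8b, ModRM with a 32-bit displacement); it is not a general disassembler, and `decode_mov_load` and `fault_address` are illustrative names. That the zero base register held the missing per-node pointer is an inference consistent with the quoted comment, not something this bug confirms.

```python
import struct

# First instruction bytes from the "Code:" line in comment 4.
CODE = bytes.fromhex("49 8b 89 30 4b 00 00")

REG_NAMES = [
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
]


def decode_mov_load(code):
    """Decode 'REX 8b /r disp32' (mov r64, [base + disp32]) only."""
    rex, opcode, modrm = code[0], code[1], code[2]
    if (rex & 0xF0) != 0x40 or opcode != 0x8B:
        raise ValueError("not a REX-prefixed 'mov r64, r/m64'")
    mod = modrm >> 6
    reg = (modrm >> 3) & 7
    rm = modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("only the disp32 form without SIB is handled")
    dest = reg | (((rex >> 2) & 1) << 3)  # REX.R extends the reg field
    base = rm | ((rex & 1) << 3)          # REX.B extends the r/m field
    (disp,) = struct.unpack_from("<i", code, 3)
    return REG_NAMES[dest], REG_NAMES[base], disp


def fault_address(code, registers):
    """Effective address the load touches, given register values."""
    _dest, base, disp = decode_mov_load(code)
    return registers[base] + disp


if __name__ == "__main__":
    dest, base, disp = decode_mov_load(CODE)
    print(f"mov {dest}, [{base} + {disp:#x}]")
    # R09 is 0000000000000000 in the register dump of comment 4.
    print(hex(fault_address(CODE, {"r9": 0})))
```

With R09 equal to zero, the computed address is 0x4b30, which matches the CR2 value in the oops. The 0x4b20 variant in the original description would fit the same pattern with a neighbouring offset, but no dump for it is included here.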



Comment 12 Vladimir Kangin 2005-10-12 16:09:39 UTC
Hi Dan,

With numa=off the kernel is stable! What is the impact of this parameter on
kernel performance?

Vladimir
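
When testing this workaround on several machines, it helps to confirm that the running kernel was really booted with the option. On Linux the boot command line is readable from /proc/cmdline. The sketch below parses such a line; `parse_cmdline` is an illustrative helper, the example command line is hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as 'ro' map to None. Quoted values with embedded
    spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def numa_disabled(cmdline):
    """True if the command line carries numa=off."""
    return parse_cmdline(cmdline).get("numa") == "off"


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            current = handle.read()
    except OSError:
        # Hypothetical command line, used when /proc is not available.
        current = "ro root=LABEL=/ numa=off"
    print("numa=off active:", numa_disabled(current))
```

As for the question above: numa=off makes the kernel treat all memory as a single node, so it no longer prefers memory local to the CPU a task runs on. How much that costs depends on the workload and was not measured in this bug.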

Comment 13 Dave Jones 2005-11-10 19:49:20 UTC
2.6.14-1.1637_FC4 has been released as an update for FC4.
Please retest with this update, as a large amount of code has been changed in
this release, which may have fixed your problem.

Thank you.


Comment 14 Dave Jones 2006-02-03 05:44:11 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 15 John Thacker 2006-05-05 01:33:23 UTC
Closing per previous comment.

