Bug 155711

Summary:

Kernel (several versions) hangs and oops on AMD64

Product:

[Fedora] Fedora

Reporter:

Need Real Name <clido01>

Component:

kernel

Assignee:

Dave Jones <davej>

Status:

CLOSED CANTFIX

QA Contact:

Brian Brock <bbrock>

Severity:

high

Docs Contact:

Priority:

medium

Version:

CC:

pfrields, redhat

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2005-10-03 00:22:38 UTC

Attachments:

Description	Flags
Dmesg for 04/22 oops and diagnostic referenced 1st above	none
Original OOPS and RC2 message after new kernel tests	none
General Fault after attempted reboots and before poweroff new kernel	none

Description Need Real Name 2005-04-22 14:28:33 UTC

kernel-2.6.11-1.14_FC3
All versions released of the AMD64 kernels have been causing sporadic oops and
hanging entirely.  This includes kernels:
title Fedora Core (2.6.11-1.14_FC3)
title Fedora Core (2.6.10-1.770_FC3)
title Fedora Core (2.6.10-1.766_FC3)
title Fedora Core (2.6.10-1.760_FC3)
title Fedora Core (2.6.9-1.667)
When the system hangs, it has done so on boot up a few times... once while
invoking CUPS, other times I'm not sure where the hang occurred.  With the latest
kernel 2.6.11-1.14_FC3, a few times during boot up the sendmail process has
cored with a segv.  After rebooting a few times, it will boot without error and
will generate an occasional oops and kernel diagnostic.  It will almost always
continue to run after the oops/diagnostic, but will normally hang completely in
some random amount of time thereafter.  I believe it has also hung without ever
receiving the oops/diagnostic, but I can't be sure of that now.  I've had it
happen while attempting an up2date run (fun to correct), and sometimes when no
real activity has been occurring on the system.  Sometimes it will hang within a
short period of time (whether in use or on screen save) and sometimes when
actually trying to use the system.  There does not seem to be any
timing/activity cause that I can find consistently.

Reproducibility:  Hangs - unpredictable
                  oops/kernel diagnostics - most times after reboot...

1. Nothing special is needed to reproduce it.  Today's console output follows,
and the dmesg file contents are attached.  I've tried to intentionally cause it
by leaving it on for several days; sometimes it works fine and sometimes it will
hang very shortly after the screen awakens (and sometimes not).

Message from syslogd@argonaut at Fri Apr 22 07:06:14 2005 ...
argonaut kernel: Oops: 0000 [1]

Message from syslogd@argonaut at Fri Apr 22 07:06:14 2005 ...
argonaut kernel: CR2: 0000000000002000

Dmesg file attached

----------------------------------------------------------
Following is an earlier message from the kernel:

Unable to handle kernel NULL pointer dereference at 0000000000000078 RIP:
<ffffffff801a54d5>{clear_inode+241}
PML4 31dd4067 PGD 31a83067 PMD 31380067 PTE 0
Oops: 0000 [1]
CPU 0
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc pcmcia
yenta_socket pcmcia_core md5 ipv6 vfat fat dm_mod video button battery ac
ohci1394 ieee1394 ohci_hcd ehci_hcd snd_emu10k1 snd_rawmidi snd_seq_device
snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc
snd_util_mem snd_hwdep snd soundcore forcedeth floppy ext3 jbd sata_sil libata
sd_mod scsi_mod
Pid: 196, comm: kswapd0 Not tainted 2.6.10-1.770_FC3
RIP: 0010:[<ffffffff801a54d5>] <ffffffff801a54d5>{clear_inode+241}
RSP: 0018:0000010037c1fd98  EFLAGS: 00010206
RAX: 0000000000000000 RBX: 000001002fe69238 RCX: 000001002fee3c00
RDX: 0000000000000002 RSI: 000000000000004e RDI: 000001002fe69538
RBP: 000001003c8ac5d8 R08: 0000000000000000 R09: ffffffff804949e8
R10: 7fffffffffffffff R11: 0000000000000000 R12: 0000000000000079
R13: 0000000000000020 R14: 00000000000000d0 R15: 0000000000000838
FS:  0000002a9558a3e0(0000) GS:ffffffff804ff980(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000078 CR3: 0000000000101000 CR4: 00000000000006e0
Process kswapd0 (pid: 196, threadinfo 0000010037c1e000, task 000001003fda8030)
Stack: 000001002fe69238 ffffffff801a81eb 000001003c8ac630 ffffffff801a17de
       00000000000004b7 000001003ffe9440 00000000000000ca ffffffff801a293d
       0000000000000000 ffffffff8016ccb7
Call Trace:<ffffffff801a81eb>{generic_drop_inode+665}
<ffffffff801a17de>{prune_dcache+950}
       <ffffffff801a293d>{shrink_dcache_memory+21}
<ffffffff8016ccb7>{shrink_slab+188}
       <ffffffff8016ea0d>{balance_pgdat+510} <ffffffff8016ec1f>{kswapd+224}
       <ffffffff801511dc>{autoremove_wake_function+0}
<ffffffff801511dc>{autoremove_wake_function+0}
       <ffffffff8012fa5c>{schedule_tail+11} <ffffffff8010f303>{child_rip+8}
       <ffffffff8016eb3f>{kswapd+0} <ffffffff8010f2fb>{child_rip+0}


Code: 48 8b 40 78 48 85 c0 74 05 48 89 df ff d0 48 83 bb d8 02 00
RIP <ffffffff801a54d5>{clear_inode+241} RSP <0000010037c1fd98>
CR2: 0000000000000078
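
Note on reading the oops above: the first Code bytes, 48 8b 40 78, decode as a
64-bit load from 0x78(%rax). With RAX at zero, that load touches address 0x78,
which is exactly what CR2 reports. This assumes the Code dump starts at the
faulting instruction, which is consistent with the register values here. The
following is a minimal Python sketch of that check, not part of any kernel
tooling; the function names and the 0x1000 cutoff are made up for illustration.

```python
import re


def fault_summary(oops_text):
    """Return (cr2, rax, cr2 - rax) from the first CR2 and RAX fields found."""
    cr2 = int(re.search(r"CR2:\s*([0-9a-fA-F]{16})", oops_text).group(1), 16)
    rax = int(re.search(r"RAX:\s*([0-9a-fA-F]{16})", oops_text).group(1), 16)
    return cr2, rax, cr2 - rax


def looks_like_null_struct_pointer(oops_text, max_offset=0x1000):
    """True when RAX is zero and the fault address is a small positive offset.

    The cutoff is an arbitrary illustrative bound, not a kernel constant.
    """
    cr2, rax, offset = fault_summary(oops_text)
    return rax == 0 and 0 < offset < max_offset


# Lines copied from the 2.6.10-1.770_FC3 oops in this report.
sample = (
    "RAX: 0000000000000000 RBX: 000001002fe69238 RCX: 000001002fee3c00\n"
    "CR2: 0000000000000078 CR3: 0000000000101000 CR4: 00000000000006e0\n"
    "Code: 48 8b 40 78 48 85 c0 74 05 48 89 df ff d0 48 83 bb d8 02 00\n"
)

cr2, rax, offset = fault_summary(sample)
print(hex(cr2), hex(rax), hex(offset))
print(looks_like_null_struct_pointer(sample))
```

A result of True only suggests that a structure pointer held in RAX was NULL
when one of its fields was read; it says nothing about why the pointer was NULL.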

Comment 1 Need Real Name 2005-04-22 14:28:34 UTC

Created attachment 113558 [details]
Dmesg for 04/22 oops and diagnostic referenced 1st above

Comment 2 Matt Olson 2005-05-03 18:46:38 UTC

I have seen hangs as well on 2.6.9-1.681_FC3 (caps lock starts blinking).
I upgraded to 2.6.11-1.14_FC3 yesterday and received my first one of
these messages.  I run the binary nvidia driver as well
(NVIDIA-Linux-x86_64-1.0-7174-pkg2.run).  No lock ups on 2.6.11, yet.  Compaq
R3240US laptop.
 
Let me know if I can help with testing. 
 
------ 
console message: 
Message from syslogd@seti at Tue May  3 10:41:46 2005 ... 
seti kernel: Oops: 0010 [1] 
 
Message from syslogd@seti at Tue May  3 10:41:47 2005 ... 
seti kernel: CR2: 0000000000000000 
 
/var/log/messages: 
 
May  3 10:41:46 seti kernel: Unable to handle kernel NULL pointer dereference 
at 0000000000000000 RIP: 
May  3 10:41:46 seti kernel: [<0000000000000000>] 
May  3 10:41:46 seti kernel: PGD 4130d067 PUD 41301067 PMD 0 
May  3 10:41:46 seti kernel: Oops: 0010 [1] 
May  3 10:41:46 seti kernel: CPU 0 
May  3 10:41:46 seti kernel: Modules linked in: nvidia(U) md5 ipv6 parport_pc 
lp parport autofs4 pcmcia ipt_REJECT ipt_state ip_conntrack iptable_filter 
ip_tables dm_mod video button battery ac ohci1394 ieee1394 yenta_socket 
rsrc_nonstatic pcmcia_core ohci_hcd ehci_hcd i2c_nforce2 i2c_core 
snd_intel8x0m snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm 
snd_timer snd soundcore snd_page_alloc orinoco hermes 8139too mii ext3 jbd 
May  3 10:41:46 seti kernel: Pid: 4786, comm: pam-panel-icon Tainted: P      
2.6.11-1.14_FC3 
May  3 10:41:46 seti kernel: RIP: 0010:[<0000000000000000>] 
[<0000000000000000>] 
May  3 10:41:46 seti kernel: RSP: 0000:ffff8100412d1ef0  EFLAGS: 00010282 
May  3 10:41:46 seti kernel: RAX: ffffffff804aa1a0 RBX: 0000000000000145 RCX: 
00000000c0000100 
May  3 10:41:46 seti kernel: RDX: 0000000000000000 RSI: ffff81004121a440 RDI: 
ffff8100412f90c0 
May  3 10:41:46 seti kernel: RBP: ffff8100412f90c0 R08: ffff8100412d0000 R09: 
00000000001e709b 
May  3 10:41:46 seti kernel: R10: 000000004277b7da R11: 0000000000000000 R12: 
ffff81002c45da4c 
May  3 10:41:46 seti kernel: R13: 0000000000000000 R14: ffff81002c45da40 R15: 
0000000000000003 
May  3 10:41:46 seti kernel: FS:  00002aaaaaad4e80(0000) GS:ffffffff80550700
(0000) knlGS:000000005b50dbb0 
May  3 10:41:46 seti kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
May  3 10:41:46 seti kernel: CR2: 0000000000000000 CR3: 0000000041319000 CR4: 
00000000000006e0 
May  3 10:41:46 seti kernel: Process pam-panel-icon (pid: 4786, threadinfo 
ffff8100412d0000, task ffff8100477397e0) 
May  3 10:41:47 seti kernel: Stack: ffffffff801b2549 ffff81002c45da4c 
7fffffffffffffff ffff81002c45da40 
May  3 10:41:47 seti kernel:        0000000000544b70 00000000412f90c0 
0000000000000000 ffffffff801b16b0 
May  3 10:41:47 seti kernel:        ffff81002a8f3000 00007fff00000000 
May  3 10:41:47 seti kernel: Call Trace:<ffffffff801b2549>{sys_poll+489} 
<ffffffff801b16b0>{__pollwait+0} 
May  3 10:41:47 seti kernel:        <ffffffff801b0ffa>{sys_ioctl+106} 
<ffffffff8010ec0a>{system_call+126} 
May  3 10:41:47 seti kernel: 
May  3 10:41:47 seti kernel: 
May  3 10:41:47 seti kernel: Code:  Bad RIP value. 
May  3 10:41:47 seti kernel: RIP [<0000000000000000>] RSP <ffff8100412d1ef0> 
May  3 10:41:47 seti kernel: CR2: 0000000000000000

Comment 3 Matt Olson 2005-05-26 00:22:22 UTC

Another one while running unixbench (4.1.0) on kernel-2.6.11-1.27_FC3: 
 
May 25 17:06:01 seti kernel: Unable to handle kernel NULL pointer dereference 
at 0000000000000040 RIP: 
May 25 17:06:01 seti kernel: <ffffffff80320983>{sock_poll+19} 
May 25 17:06:01 seti kernel: PGD 422e2067 PUD 422dc067 PMD 0 
May 25 17:06:01 seti kernel: Oops: 0000 [1] 
May 25 17:06:01 seti kernel: CPU 0 
May 25 17:06:01 seti kernel: Modules linked in: nls_utf8 lp autofs4 vmnet(U) 
parport_pc parport vmmon(U) pcmcia ipt_REJECT ipt_state ip_conntrack 
iptable_filter ip_tables dm_mod video button battery ac nvidia(U) md5 ipv6 
cdc_acm ohci1394 ieee1394 yenta_socket rsrc_nonstatic pcmcia_core ohci_hcd 
ehci_hcd i2c_nforce2 i2c_core snd_intel8x0m snd_intel8x0 snd_ac97_codec 
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc 
orinoco hermes 8139too mii ext3 jbd 
May 25 17:06:01 seti kernel: Pid: 5032, comm: X Tainted: P      
2.6.11-1.27_FC3 
May 25 17:06:01 seti kernel: RIP: 0010:[<ffffffff80320983>] 
<ffffffff80320983>{sock_poll+19} 
May 25 17:06:01 seti kernel: RSP: 0000:ffff810042253de0  EFLAGS: 00010246 
May 25 17:06:01 seti kernel: RAX: 0000000000000000 RBX: ffff810042b4f680 RCX: 
0000000000000000 
May 25 17:06:01 seti kernel: RDX: 0000000000000000 RSI: ffff810041a70918 RDI: 
ffff810042b4f680

Comment 4 Sitsofe Wheeler 2005-05-26 07:41:54 UTC

Comment #2:
Oopses while using the nvidia binary drivers can't be investigated. Can you
reproduce the problem without the binary drivers?

(the original oops from Comment #0 was not tainted though)

Comment 5 Sitsofe Wheeler 2005-05-26 07:43:03 UTC

The same goes if you are using vmware binary modules...
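
For anyone sorting through the oopses in this report, the taint state can be
read straight from the dump: the Pid line says either "Not tainted" or
"Tainted: P" (P meaning a proprietary module was loaded), and the binary
modules in these logs (nvidia, vmnet, vmmon) carry a (U) suffix in the
"Modules linked in" list. Below is a rough Python sketch of that check. The
function name and the returned dictionary layout are illustrative assumptions,
and the meaning of the (U) flag is not interpreted here, only reported.

```python
import re


def taint_report(oops_text):
    """Summarise the taint state and the (U)-flagged modules in an oops dump."""
    if re.search(r"\bNot tainted\b", oops_text):
        tainted = False
    elif re.search(r"\bTainted:", oops_text):
        tainted = True
    else:
        tainted = None  # no Pid line found in the supplied text
    flagged = sorted(set(re.findall(r"\b([A-Za-z0-9_]+)\(U\)", oops_text)))
    return {"tainted": tainted, "flagged_modules": flagged}


# Excerpts copied from the 2.6.11-1.27_FC3 oopses in this report.
tainted_sample = (
    "Modules linked in: nls_utf8 lp autofs4 vmnet(U) parport_pc parport "
    "vmmon(U) pcmcia dm_mod video button battery ac nvidia(U) md5 ipv6\n"
    "Pid: 5032, comm: X Tainted: P      2.6.11-1.27_FC3\n"
)
clean_sample = (
    "Modules linked in: parport_pc lp parport autofs4 pcmcia dm_mod md5 ipv6\n"
    "Pid: 4385, comm: X Not tainted 2.6.11-1.27_FC3\n"
)

print(taint_report(tainted_sample))
print(taint_report(clean_sample))
```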

Comment 6 Need Real Name 2005-05-26 16:12:00 UTC

The original post is not using the nvidia binary video drivers if that's what
you mean.  It does have an nvidia based graphics card, but it is running the
xfree86 provided drivers.  It is also an nvidia nforce 3 based motherboard
(Gigabyte brand) if you mean that too. I'm just not sure what you are referring to.

Comment 7 Matt Olson 2005-05-26 16:36:41 UTC

I've pulled the binary-only nvidia and vmware kernel modules and will try to 
reproduce.  I don't get oopses every day, so it may take a few days.   
 
Is there anything else I can do to gather more info if I have another
occurrence?

Comment 8 Matt Olson 2005-05-26 17:14:49 UTC

Well, that didn't take long . . .  
 
May 26 10:01:58 seti kernel:  <1>Unable to handle kernel NULL pointer 
dereference at 0000000000000040 RIP: 
May 26 10:01:58 seti kernel: <ffffffff80320983>{sock_poll+19} 
May 26 10:01:58 seti kernel: PGD 452b6067 PUD 452b0067 PMD 44e76067 PTE 0 
May 26 10:01:58 seti kernel: Oops: 0000 [2] 
May 26 10:01:58 seti kernel: CPU 0 
May 26 10:01:58 seti kernel: Modules linked in: parport_pc lp parport autofs4 
pcmcia ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mod video 
button battery ac md5 ipv6 ohci1394 ieee1394 yenta_socket rsrc_nonstatic 
pcmcia_core ohci_hcd ehci_hcd i2c_nforce2 i2c_core snd_intel8x0m snd_intel8x0 
snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore 
snd_page_alloc orinoco hermes 8139too mii ext3 jbd 
May 26 10:01:58 seti kernel: Pid: 4385, comm: X Not tainted 2.6.11-1.27_FC3 
May 26 10:01:58 seti kernel: RIP: 0010:[<ffffffff80320983>] 
<ffffffff80320983>{sock_poll+19} 
May 26 10:01:58 seti kernel: RSP: 0018:ffff81004515bde0  EFLAGS: 00010246 
May 26 10:01:58 seti kernel: RAX: 0000000000000000 RBX: ffff8100453125c0 RCX: 
0000000000000000 
May 26 10:01:58 seti kernel: RDX: 0000000000000000 RSI: ffff810044b90918 RDI: 
ffff8100453125c0 
May 26 10:01:58 seti kernel: RBP: 0000000000000002 R08: ffff81004515a000 R09: 
0000000000000000 
May 26 10:01:58 seti kernel: R10: 0000000000000118 R11: 0000000000000002 R12: 
0000000000000001 
May 26 10:01:58 seti gconfd (molson-4514): Received signal 15, shutting down 
cleanly 
May 26 10:01:59 seti kernel: R13: 0000000000000001 R14: 0000000000000145 R15: 
0000001fffffff8a 
May 26 10:02:01 seti gconfd (molson-4514): Exiting 
May 26 10:02:03 seti kernel: FS:  00002aaaaaacb3e0(0000) GS:ffffffff80543380
(0000) knlGS:0000000000000000 
May 26 10:02:05 seti crond(pam_unix)[26982]: session opened for user molson by 
(uid=0) 
May 26 10:02:08 seti kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
May 26 10:02:11 seti kernel: CR2: 0000000000000040 CR3: 0000000044d17000 CR4: 
00000000000006e0 
May 26 10:02:15 seti crond(pam_unix)[26982]: session closed for user molson 
May 26 10:02:16 seti kernel: Process X (pid: 4385, threadinfo 
ffff81004515a000, task ffff810044e12030) 
May 26 10:02:19 seti kernel: Stack: ffffffff801b1b1b 0000000000000000 
ffff81004515be88 0000000000000000 
May 26 10:02:20 seti kernel:        0000000000040000 0000001fffffff8a 
0000000000000000 0000000000000000 
May 26 10:02:21 seti kernel:        0000000000000000 ffff8100481e4708 
May 26 10:02:21 seti kernel: Call Trace:<ffffffff801b1b1b>{do_select+1307} 
<ffffffff801b1510>{__pollwait+0} 
May 26 10:02:22 seti kernel:        <ffffffff801b201a>{sys_select+890} 
<ffffffff8010ec0a>{system_call+126} 
May 26 10:02:22 seti kernel: 
May 26 10:02:22 seti kernel: 
May 26 10:02:22 seti kernel: Code: 4c 8b 58 40 41 ff e3 66 66 90 66 66 90 48 
8b 47 10 48 89 f2 
May 26 10:02:22 seti kernel: RIP <ffffffff80320983>{sock_poll+19} RSP 
<ffff81004515bde0> 
May 26 10:02:22 seti kernel: CR2: 0000000000000040
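
The log above has gconfd and crond messages interleaved with the kernel output,
and long lines were wrapped when pasted. A small filter along the following
lines can pull out just the kernel messages for reading. This is a hedged
Python sketch: it assumes the "Mon DD HH:MM:SS host kernel:" prefix seen in
this report, and it rejoins wrapped continuation lines with a single space,
which is a guess about where the wrap fell rather than a faithful restoration.

```python
import re

# Timestamp and host prefix as it appears in this report's /var/log/messages.
PREFIX = r"^[A-Z][a-z]{2}\s+\d+ \d\d:\d\d:\d\d \S+ "
KERNEL_LINE = re.compile(PREFIX + r"kernel: ?(.*)$")
ANY_LINE = re.compile(PREFIX)


def kernel_lines(syslog_text):
    """Return kernel message bodies, with wrapped continuations rejoined."""
    out = []
    in_kernel = False
    for line in syslog_text.splitlines():
        match = KERNEL_LINE.match(line)
        if match:
            out.append(match.group(1).rstrip())
            in_kernel = True
        elif ANY_LINE.match(line):
            in_kernel = False  # message from another daemon: skip it
        elif in_kernel and line.strip():
            # No syslog prefix: treat as the wrapped tail of the last kernel line.
            out[-1] = (out[-1] + " " + line.strip()).strip()
    return [text for text in out if text]


# Excerpt copied from the May 26 log above, including an interleaved message.
sample = (
    "May 26 10:01:58 seti kernel: R10: 0000000000000118 R11: 0000000000000002 R12:\n"
    "0000000000000001\n"
    "May 26 10:01:58 seti gconfd (molson-4514): Received signal 15, shutting down\n"
    "cleanly\n"
    "May 26 10:01:59 seti kernel: R13: 0000000000000001 R14: 0000000000000145 R15:\n"
    "0000001fffffff8a\n"
)

for text in kernel_lines(sample):
    print(text)
```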

Comment 9 Matt Olson 2005-05-26 21:36:32 UTC

This may be a hardware problem on my end.  The occurrence of this roughly
correlates with a memory upgrade (512M to 1280M) put into this machine.  I
updated the kernel from 2.6.9-1.681 at that time as well, so I'm not sure
which may be the cause.  I'm going to put the old memory back in and obtain an
RMA on the new memory.  Until I do that, you may want to hold off on any
further investigation.  In the meantime I'll try to reproduce on 512M of
memory.

It's worth asking if this could be somehow related to having > 1024M of memory.
I doubt it though, as you would have seen more reports of problems.
 
I'll follow up in a week or two after I get the replacement memory installed.

Comment 10 Need Real Name 2005-06-03 17:27:25 UTC

Created attachment 115131 [details]
Original OOPS and RC2 message after new kernel tests

Comment 11 Need Real Name 2005-06-03 17:28:57 UTC

Created attachment 115132 [details]
General Fault after attempted reboots and before poweroff new kernel

Comment 12 Matt Olson 2005-06-16 23:34:18 UTC

I think in my case this was a hardware problem.  I've been re-testing with a 
new memory module for the past week and have been unable to reproduce the 
error.  Sorry for the false alarm.

Comment 13 Need Real Name 2005-06-25 19:38:31 UTC

There has as yet been no fix or temporary workaround for the original and
subsequent postings.  There do not appear to be any hardware problems on the
original system.  This version of Linux will be removed shortly due to the
continued problems and inability to use it for any current need.  If any
diagnostic info is needed before removal, please advise.

Comment 14 Richard Hill 2005-06-27 08:01:56 UTC

I have this (very similar) problem after installing FC4. I have two installations
on different partitions (32- and 64-bit for AMD).
The 32-bit kernel seems fine (although it has once hung on my machine at work).
The 64-bit kernel randomly hangs, or reboots without notification, in text mode or X. I have
an nVidia card. Booting rescue mode from the install disk behaves the same.
It always gets as far as login, but the hang occurs as a result of one or more commands.
No messages in syslog.  When hung, the system is completely dead. A hardware reset is needed.

Comment 15 Dave Jones 2005-07-15 19:41:13 UTC

An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 16 Dave Jones 2005-10-03 00:22:38 UTC

This bug has been automatically closed as part of a mass update.
It had been in NEEDINFO state since July 2005.
If this bug still exists in current errata kernels, please reopen this bug.

There are a large number of inactive bugs in the database, and this is the only
way to purge them.

Thank you.