Bug 187470

Summary:

kernel-smp-2.6.16-1.2069_FC4 crashes with invalid opcode

Product:

[Fedora] Fedora

Reporter:

Piotr Gackiewicz <gacek>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Brian Brock <bbrock>

Severity:

high

Docs Contact:

Priority:

medium

Version:

CC:

adrian, axel.thimm, bookreviewer, jonstanley, wtogami

Hardware:

x86_64

OS:

Linux

Whiteboard:

MassClosed

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2008-01-20 04:38:54 UTC

Attachments:

Description	Flags
photo of console after crash on boot	none

Description Piotr Gackiewicz 2006-03-31 07:32:24 UTC

Description of problem:
After booting kernel-smp-2.6.16-1.2069_FC4 kernel on Tyan-based dual Opteron,
server crashed after several minutes. kernel-smp-2.6.15-1.1833_FC4 works well.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.16-1.2069_FC4

How reproducible:
Two different servers crashed after several minutes, they are stable on
kernel-smp-2.6.15-1.1833_FC4. Thus, it is reproducible :-)


Additional info:

I have got only netconsole crash log of both servers:

Mar 31 08:05:24 mail  Kernel BUG at include/linux/list.h:168 
Mar 31 08:05:24 mail  invalid opcode: 0000 [1] SMP  
Mar 31 08:05:24 mail  last sysfs file:
/devices/system/cpu/cpu1/cpufreq/scaling_setspeed 
Mar 31 08:05:24 mail  CPU 0  
Mar 31 08:05:24 mail  Modules linked in: ipv6 netconsole eeprom lm85 w83781d
hwmon_vid hwmon i2c_isa i2c_amd756 pcmcia yenta_socket rsrc_nonstatic
pcmcia_core ip_conntrack_ftp xt_comment ipt_LOG xt_limit xt_tcpudp xt_state
ip_conntrack nfnetlink iptable_filter ip_tables x_tables video button battery ac
i2o_config ohci_hcd i2c_amd8111 i2c_core hw_random eepro100 e100 mii tg3 ext3
jbd dm_mod i2o_block i2o_core sata_sil libata sd_mod scsi_mod 
Mar 31 08:05:24 mail  Pid: 16078, comm: procmail_logger Not tainted
2.6.16-1.2069_FC4smp #1 
Mar 31 08:05:24 mail  RIP: 0010:[<ffffffff8017e27a>]
<ffffffff8017e27a>{free_block+134} 
Mar 31 08:05:24 mail  RSP: 0000:ffffffff8048efe8  EFLAGS: 00010012 
Mar 31 08:05:24 mail  RAX: ffff810000000000 RBX: ffff81007fb43078 RCX:
000000000000001e 
Mar 31 08:05:24 mail  RDX: ffff81007ee65000 RSI: ffff81007fb43000 RDI:
ffff81007ffeeb40 
Mar 31 08:05:24 mail  RBP: ffff81000202d440 R08: ffff81007ffc3428 R09:
000000000000000c 
Mar 31 08:05:24 mail  R10: 0000000000000001 R11: 0000ffff0000ffff R12:
ffff81004135c8e8 
Mar 31 08:05:24 mail  R13: 0000000000000000 R14: 000000000000000c R15:
0000000000000001 
Mar 31 08:05:24 mail  FS:  00002aaaab864e20(0000) GS:ffffffff8051e000(0000)
knlGS:0000000000000000 
Mar 31 08:05:24 mail  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
Mar 31 08:05:24 mail  CR2: 0000000000752018 CR3: 000000006a6de000 CR4:
00000000000006e0 
Mar 31 08:05:24 mail  Process procmail_logger (pid: 16078, threadinfo
ffff81006ce60000, task ffff8100282380c0) 
Mar 31 08:05:24 mail  Stack: 000000000000003c ffff81007ffeeb90 ffff81004135c8c0
0000000000000001  
Mar 31 08:05:24 mail         ffff81000202d440 0000000000000001 00000000000000a0
ffffffff8017e37a  
Mar 31 08:05:24 mail         ffff81000202d440 ffff81004135c8c0  
Mar 31 08:05:24 mail  Call Trace: <IRQ> <ffffffff8017e37a>{__drain_alien_cache+68} 
Mar 31 08:05:24 mail         <ffffffff8017e07f>{kmem_cache_free+197}
<ffffffff80147a2e>{__rcu_process_callbacks+303} 
Mar 31 08:05:24 mail         <ffffffff80147ade>{rcu_process_callbacks+35}
<ffffffff8013ac8d>{tasklet_action+102} 
Mar 31 08:05:24 mail         <ffffffff8013addd>{__do_softirq+96}
<ffffffff8010be6a>{call_softirq+30} 
Mar 31 08:05:24 mail         <ffffffff8010cdbc>{do_softirq+44}
<ffffffff8010b7c4>{apic_timer_interrupt+132} <EOI> 
Mar 31 08:05:25 mail   
Mar 31 08:05:25 mail  Code: 0f 0b 68 98 9e 37 80 c2 a8 00 48 89 50 08 48 89 02
48 c7 06  
Mar 31 08:05:25 mail   <3>Debug: sleeping function called from invalid context
at include/linux/rwsem.h:43 
Mar 31 08:05:25 mail  in_atomic():1, irqs_disabled():1 
Mar 31 08:05:25 mail   
Mar 31 08:05:25 mail  Call Trace: <IRQ> <ffffffff80136719>{profile_task_exit+21} 
Mar 31 08:05:25 mail         <ffffffff80137e1d>{do_exit+34}
<ffffffff803520f8>{_spin_lock_irqsave+9} 
Mar 31 08:05:25 mail         <ffffffff8010c51f>{kernel_math_error+0}
<ffffffff8010cb05>{do_invalid_op+163} 
Mar 31 08:05:25 mail         <ffffffff8017e27a>{free_block+134}
<ffffffff80326e91>{tcp_v4_rcv+1665} 
Mar 31 08:05:25 mail         <ffffffff8030b728>{ip_local_deliver_finish+0}
<ffffffff8010b961>{error_exit+0} 
Mar 31 08:05:25 mail         <ffffffff8017e27a>{free_block+134}
<ffffffff8017e252>{free_block+94} 
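
A note on reading the "Code:" line of the oops above: the leading bytes 0f 0b are the x86 ud2 instruction, which is why a BUG() is reported as "invalid opcode". The sketch below assumes the 2.6.16-era x86_64 BUG() encoding (ud2, followed by a pushq of the file-name pointer and a ret carrying the line number, both used purely as data). The struct and function names are illustrative and are not kernel API.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of a BUG() site on 2.6.16-era x86_64:
 *   0f 0b       ud2           (raises the "invalid opcode" trap)
 *   68 imm32    pushq $file   (low 32 bits of the file-name string address)
 *   c2 imm16    ret $line     (source line number)
 * The pushq and ret are never executed; they only carry data. */
struct bug_info {
    uint32_t file_ptr_low32;  /* immediate of the pushq */
    uint16_t line;            /* immediate of the ret */
};

/* Returns 1 and fills *out if `code` starts with such a frame, else 0. */
int decode_bug_frame(const uint8_t *code, size_t len, struct bug_info *out)
{
    if (len < 10)
        return 0;
    if (code[0] != 0x0f || code[1] != 0x0b)  /* not ud2 */
        return 0;
    if (code[2] != 0x68 || code[7] != 0xc2)  /* not pushq imm32 / ret imm16 */
        return 0;

    /* Immediates are stored little-endian. */
    out->file_ptr_low32 = (uint32_t)code[3]
                        | (uint32_t)code[4] << 8
                        | (uint32_t)code[5] << 16
                        | (uint32_t)code[6] << 24;
    out->line = (uint16_t)(code[8] | code[9] << 8);
    return 1;
}
```

Fed the first ten bytes from the log (0f 0b 68 98 9e 37 80 c2 a8 00), this yields line 0xa8 = 168, which agrees with the "Kernel BUG at include/linux/list.h:168" header of the same oops.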


And another one:
Mar 31 08:00:33 www kernel: Kernel BUG at include/linux/list.h:167 
Mar 31 08:00:33 www kernel: invalid opcode: 0000 [1] SMP  
Mar 31 08:00:33 www kernel: last sysfs file: /class/vc/vcsa6/dev 
Mar 31 08:00:33 www kernel: CPU 1  
Mar 31 08:00:33 www kernel: Modules linked in: nfsd exportfs lockd nfs_acl ipv6
w83627hf eeprom lm85 w83781d hwmon_vid hwmon i2c_isa i2c_amd756 sunrpc pcmcia
yenta_socket rsrc_nonstatic pcmcia_core ip_conntrack_ftp iptable_mangle
xt_comment ipt_owner ipt_LOG xt_limit xt_tcpudp 
xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink ip_tables
x_tables video button battery ac i2o_config ohci_hcd i2c_amd8111 i2c_core
hw_random eepro100 e100 mii tg3 ext3 jbd dm_mod i2o_block i2o_core sata_sil
libata sd_mod scsi_mod 
Mar 31 08:00:33 www kernel: Pid: 354, comm: kswapd1 Not tainted
2.6.16-1.2069_FC4smp #1 
Mar 31 08:00:33 www kernel: RIP: 0010:[<ffffffff8017e267>]
<ffffffff8017e267>{free_block+115} 
Mar 31 08:00:33 www kernel: RSP: 0018:ffff81003fc79a58  EFLAGS: 00010006 
Mar 31 08:00:33 www kernel: RAX: ffff8100407b9c40 RBX: ffff81005eab1da0 RCX:
000000000000001e 
Mar 31 08:00:33 www kernel: RDX: ffff81007565e000 RSI: ffff81005eab10c0 RDI:
ffff810002022740 
Mar 31 08:00:33 www kernel: RBP: ffff81000203b4c0 R08: 0000000000000000 R09:
000000000000000c 
Mar 31 08:00:33 www kernel: R10: 0000000000000000 R11: ffff81003fc79bc8 R12:
ffff81007ffd42e0 
Mar 31 08:00:33 www kernel: R13: 0000000000000007 R14: 000000000000000c R15:
0000000000000000 
Mar 31 08:00:33 www kernel: FS:  00002aaaac045700(0000)
GS:ffff81007ff9c540(0000) knlGS:00000000f7eaf6c0 
Mar 31 08:00:33 www kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b

Comment 1 Danny Yee 2006-04-03 01:53:57 UTC

I have a mail server which isn't coping with the 2.6.16 kernel upgrade.  First
it printed warnings about SELinux filesystem labels being missing and quota
files being missing, and hung rebuilding those.  We rebooted with SELinux
disabled and it ran ok for a bit over an hour before crashing with a kernel
panic.  So we've reverted to 2.6.15-1.1833 for the moment.

I can't get much debug info as it's a production server.

Comment 2 Danny Yee 2006-04-06 07:26:12 UTC

I've had a second FC4 server barf on 2.6.16-1.2069smp - it starts printing
errors immediately after the Nash message, along with a "resume in 119
seconds... resume in 118 seconds" countdown.  I've never seen that before.

The new kernel runs fine on my desktop - the only unusual thing I can think of
about our servers is that they both have everything on i2o raid arrays.

Comment 3 Adrian Reber 2006-04-20 15:36:08 UTC

I had the same error (invalid opcode: 0000 [1] SMP) today after upgrading to
kernel-smp-2.6.16-1.2096_FC5. The machine is a quad Xeon with 6GB RAM and it
also has an i2o RAID array.

Last entry of /proc/cpuinfo:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Xeon(TM) CPU 1.40GHz
stepping        : 1
cpu MHz         : 1400.222
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 2800.17

Comment 4 Dan Carpenter 2006-04-22 00:15:30 UTC

Danny Yee, your bug is probably something else.  Please take a digital photo of
the screen and open a new bugzilla entry.

Adrian, the key line from the first bug report was:  "Kernel BUG at
include/linux/list.h:168"  Does your crash have that line?

list.h line 168 is list_del().  Probably there is some kind of race condition.
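
To illustrate why a race would trip a BUG in list_del(): the two reports hit list.h lines 167 and 168, which is consistent with two consecutive sanity checks that each neighbour still points back at the entry being removed. The following is a minimal user-space sketch under that assumption, not the kernel source. checked_list_del() and the poison node are hypothetical names, and the function returns an error code where the kernel would call BUG().

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Stand-in for the kernel's poison pointers: a deleted entry points here,
 * so a second delete fails the checks instead of following stale links. */
static struct list_head poison = { &poison, &poison };

void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert `entry` right after `head`. */
void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink `entry`. Returns 0 on success, 1 or 2 if a neighbour no longer
 * points back at `entry` (the situation the kernel treats as a BUG). */
int checked_list_del(struct list_head *entry)
{
    if (entry->prev->next != entry)
        return 1;
    if (entry->next->prev != entry)
        return 2;

    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = &poison;
    entry->prev = &poison;
    return 0;
}
```

If two CPUs free the same object, or modify the list without holding the same lock, the second list_del() sees links that no longer match and fails one of these checks. In this sketch, deleting the same entry twice returns a nonzero code on the second call.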

Comment 5 Adrian Reber 2006-04-22 06:43:06 UTC

I have the "Kernel BUG at include/linux/list.h:168" line in my oops.

I should probably also mention that the error happens during bringup of eth0
(e1000). At least that is the last message of the initscripts before the crash.

Comment 6 Dan Carpenter 2006-04-22 08:46:58 UTC

The way to verify for sure whether it's the e1000 is to boot with the kernel
parameter "single".  In single user mode rename your e1000.ko file to e1000.ko.ORIG.

cd /lib/modules/2.6.16-1.2096_FC5smp/kernel/drivers/net/e1000/
mv e1000.ko e1000.ko.ORIG

Start normal startup by typing `init 3`.  If it still crashes you know that it's
not an e1000 bug.

Comment 7 Adrian Reber 2006-04-22 16:34:01 UTC

This is a production server and I am not very often at the location the server
is hosted. So I cannot do this test very soon if at all.

Comment 8 Danny Yee 2006-05-11 03:32:51 UTC

Created attachment 128870
photo of console after crash on boot

Comment 9 Danny Yee 2006-05-11 03:46:05 UTC

I still can't get any 2.6.16 kernel to work.  I've just tried it with
2.6.16-1.2107_FC4, 2.6.16-1.2107_FC4smp, 2.6.16-1.2108_FC4, and
2.6.16-1.2108_FC4smp.

Comment 10 Dan Carpenter 2006-05-11 04:56:25 UTC

Danny Yee, I still think your bug is not related to the list_del() race condition.

Please create a new bugzilla entry.  Post the dmesg from 2.6.15 and your lspci
output, say which motherboard you are using, and attach that photo again.

Comment 11 Piotr Gackiewicz 2006-08-04 07:39:16 UTC

kernel-smp-2.6.17-1.2142_FC4 seems to run OK.
I had a look at the ChangeLog, but did not spot any changes related to this bug.
Moreover, I upgraded the BIOS on my Tyan motherboard.

Can someone comment on that?

Was it a faulty Tyan BIOS or an SMP race, as someone suspected?

Comment 12 Dan Carpenter 2006-08-04 07:58:01 UTC

It was a bug in the i2o code.  It got fixed.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=189570

Oddly enough Danny Yee did have the same bug as you did.  As soon as I saw his
dmesg I realized it was the i2o issue...
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=191357

Comment 13 Dave Jones 2006-09-17 02:03:43 UTC

[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.

Comment 14 Dave Jones 2006-10-16 18:08:03 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 15 Jon Stanley 2008-01-20 04:38:54 UTC

(this is a mass-close to kernel bugs in NEEDINFO state)

As indicated previously, there has been no update on the progress of this bug,
so I am closing it as INSUFFICIENT_DATA. Please re-open if the issue
still occurs for you and I will try to assist in its resolution. Thank you for
taking the time to report the initial bug.

If you believe that this bug was closed in error, please feel free to reopen
this bug.