Bug 455691

Summary: [5.2] kernel-debug panic
Product: Red Hat Enterprise Linux 5 Reporter: Qian Cai <qcai>
Component: kernelAssignee: Jiri Pirko <jpirko>
Status: CLOSED NOTABUG QA Contact: Martin Jenner <mjenner>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.2.zCC: agospoda, bruce.w.allan, jesse.brandeburg, john.ronciak, rkhan, vmayatsk
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-08-12 14:01:12 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
full boot log none

Description Qian Cai 2008-07-17 05:56:44 UTC
Description of problem:
When booting kernel-debug on a x86_64 machine (ibm-palmetto.rhts.bos.redhat.com
), it failed at staring udev. Booting other variant of Kernels looks fine.

+ /sbin/start_udev
Starting udev: Unable to handle kernel paging request at ffffffff8405e2d9 RIP: 
 [<ffffffff8405e2d9>]
PGD 203067 PUD 205063 PMD 0 
Oops: 0010 [1] SMP 
last sysfs file: /class/tty/ttyS1/dev
CPU 1 
Modules linked in: dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata sd_mod
scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 543, comm: udevd Tainted: G   M  2.6.18-92.1.6.el5debug #1
RIP: 0010:[<ffffffff8405e2d9>]  [<ffffffff8405e2d9>]
RSP: 0000:ffff810037c13ef8  EFLAGS: 00010282
RAX: ffff81003fc7aa70 RBX: 0000000000000008 RCX: 0000000000000000
RDX: 000000000000000c RSI: ffff810001000038 RDI: ffff81003fc7aa70
RBP: ffff810002600d20 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000108 R12: ffff81003eb5d6d0
R13: 0000000000000008 R14: 00002b512bee5b90 R15: 00002b512bee5a80
FS:  00002b5123670710(0000) GS:ffff810037cdd300(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff8405e2d9 CR3: 000000003d952000 CR4: 00000000000006e0
Process udevd (pid: 543, threadinfo ffff81003d21e000, task ffff81003e984100)
Stack:  ffffffff800a0a96 ffff810037c00000 ffff810002600e20 0000000000000000
 000000000000000a 0000000000000001 ffffffff800a0b4e ffffffff80425fd0
 ffffffff800971d2 ffffffff80425f90 0000000000000001 ffffffff80425fd0
Call Trace:
 <IRQ>  [<ffffffff800a0a96>] __rcu_process_callbacks+0x140/0x1d5
 [<ffffffff800a0b4e>] rcu_process_callbacks+0x23/0x43
 [<ffffffff800971d2>] tasklet_action+0x6c/0xc3
 [<ffffffff80012a0e>] __do_softirq+0x67/0xf3
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff8007066a>] do_softirq+0x35/0xa0
 [<ffffffff80060d3b>] apic_timer_interrupt+0x6b/0x70
 <EOI> 

Code:  Bad RIP value.
RIP  [<ffffffff8405e2d9>]
 RSP <ffff810037c13ef8>
CR2: ffffffff8405e2d9
 <0>Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
RHEL5-Server-U2 GA tree
kernel-debug-2.6.18-92.1.6.el5

How reproducible:
Sometimes, it happens for 5 or 6 boots; sometimes, it works all the time.

Comment 1 Qian Cai 2008-07-17 05:56:44 UTC
Created attachment 312009 [details]
full boot log

Comment 2 Qian Cai 2008-07-17 11:00:41 UTC
Looks like I found a way to reproduce it. Boot a 32bit kernel-debug on a x86_64
machine first, it will hang at,

...
Write protecting the kernel read-only data: 412k
request_module: runaway loop modprobe binfmt-464c
request_module: runaway loop modprobe binfmt-464c
request_module: runaway loop modprobe binfmt-464c
request_module: runaway loop modprobe binfmt-464c
request_module: runaway loop modprobe binfmt-464c
...

Then, remove 32bit kernel-debug and install a 64bit one, and then it finally
failed in udev,

+ /sbin/udevsettle
Unable to handle kernel paging request at ffffffff8405e2d9 RIP: 
 [<ffffffff8405e2d9>]
PGD 203067 PUD 205063 PMD 0 
Oops: 0010 [1] SMP 
last sysfs file: /class/net/lo/type
CPU 1 
Modules linked in: floppy i6300esb pcspkr sg shpchp ide_cd cdrom i2c_i801
i2c_core e1000 serio_raw dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata
sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1106, comm: pam_console_app Tainted: G   M  2.6.18-92.1.6.el5debug #1
RIP: 0010:[<ffffffff8405e2d9>]  [<ffffffff8405e2d9>]
RSP: 0000:ffff810037c13ef8  EFLAGS: 00010282
RAX: ffff81003db41eb0 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 000000000000000e RSI: ffff810001000038 RDI: ffff81003db41eb0
RBP: ffff810002600d20 R08: ffff810037c8a6b0 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000158 R12: ffff81003cf9b6a0
R13: 0000000000000003 R14: 0000000000000014 R15: ffff81003cb7e540
FS:  00002b223bc063e0(0000) GS:ffff810037cdd300(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff8405e2d9 CR3: 000000003c88d000 CR4: 00000000000006e0
Process pam_console_app (pid: 1106, threadinfo ffff81003cc20000, task
ffff81003cb7e540)
Stack:  ffffffff800a0a96 0000000000000282 ffff810002600e20 0000000000000000
 0000000000000008 0000000000000001 ffffffff800a0b4e ffffffff80425fd0
 ffffffff800971d2 ffffffff80425f90 0000000000000001 ffffffff80425fd0
Call Trace:
 <IRQ>  [<ffffffff800a0a96>] __rcu_process_callbacks+0x140/0x1d5
 [<ffffffff800a0b4e>] rcu_process_callbacks+0x23/0x43
 [<ffffffff800971d2>] tasklet_action+0x6c/0xc3
 [<ffffffff80012a0e>] __do_softirq+0x67/0xf3
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff8007066a>] do_softirq+0x35/0xa0
 [<ffffffff80060d3b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8006a77f>] do_page_fault+0x424/0x835
 [<ffffffff8000c021>] find_vma+0x23/0x5d
 [<ffffffff8006a7b1>] do_page_fault+0x456/0x835
 [<ffffffff80028908>] do_filp_open+0x1c/0x38
 [<ffffffff80013274>] poison_obj+0x26/0x2f
 [<ffffffff80032ff3>] cache_free_debugcheck+0x208/0x217
 [<ffffffff80060e9d>] error_exit+0x0/0x96


Code:  Bad RIP value.
RIP  [<ffffffff8405e2d9>]
 RSP <ffff810037c13ef8>
CR2: ffffffff8405e2d9
 <0>Kernel panic - not syncing: Fatal exception
 <1>Unable to handle kernel NULL pointer dereference at 0000000000000005 RIP: 
 [<0000000000000005>]
PGD 3fd90067 PUD 3d798067 PMD 0 
Oops: 0010 [2] SMP 
last sysfs file: /class/net/lo/type
CPU 1 
Modules linked in: floppy i6300esb pcspkr sg shpchp ide_cd cdrom i2c_i801
i2c_core e1000 serio_raw dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata
sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1106, comm: pam_console_app Tainted: G   M  2.6.18-92.1.6.el5debug #1
RIP: 0010:[<0000000000000005>]  [<0000000000000005>]
RSP: 0000:ffff810037c13b90  EFLAGS: 00010002
RAX: ffff81003cb7e540 RBX: 00000000000000bf RCX: ffff810037c13e48
RDX: 0000000000000452 RSI: 00000000000000bf RDI: 00000000000000bf
RBP: ffff810037c13bc0 R08: 0000000000000001 R09: 000000000000003f
R10: ffff81003d5d4188 R11: ffffffff8016968b R12: 0000000000000005
R13: 00000000ffffff03 R14: 0000000000000010 R15: ffff81003cb7e540
FS:  00002b223bc063e0(0000) GS:ffff810037cdd300(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000005 CR3: 000000003c88d000 CR4: 00000000000006e0
Process pam_console_app (pid: 1106, threadinfo ffff81003cc20000, task
ffff81003cb7e540)
Stack:  ffffffff8002426f ffffffff80299c38 ffffffff8405e2d9 ffff810037c13e48
 ffffffff80060ccb ffff810037c13bc0 ffff81003d5d4188 ffffffff8016968b
 ffff81003d5d4188 000000000000003f 0000000000000001 00000000000000ff
Call Trace:
 <IRQ>  [<ffffffff8002426f>] smp_call_function_interrupt+0x66/0x84
 [<ffffffff80060ccb>] call_function_interrupt+0x6b/0x70
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff80079800>] smp_send_stop+0x62/0x64
 [<ffffffff800797e9>] smp_send_stop+0x4b/0x64
 [<ffffffff80093b07>] panic+0x94/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff8006bc38>] bad_gs+0x26/0x2089
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff800683ef>] _spin_unlock+0x17/0x20
 [<ffffffff80150451>] __next_cpu+0x19/0x28
 [<ffffffff8008de33>] find_busiest_group+0x20d/0x621
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff800a0a96>] __rcu_process_callbacks+0x140/0x1d5
 [<ffffffff800a0b4e>] rcu_process_callbacks+0x23/0x43
 [<ffffffff800971d2>] tasklet_action+0x6c/0xc3
 [<ffffffff80012a0e>] __do_softirq+0x67/0xf3
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff8007066a>] do_softirq+0x35/0xa0
 [<ffffffff80060d3b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8006a77f>] do_page_fault+0x424/0x835
 [<ffffffff8000c021>] find_vma+0x23/0x5d
 [<ffffffff8006a7b1>] do_page_fault+0x456/0x835
 [<ffffffff80028908>] do_filp_open+0x1c/0x38
 [<ffffffff80013274>] poison_obj+0x26/0x2f
 [<ffffffff80032ff3>] cache_free_debugcheck+0x208/0x217
 [<ffffffff80060e9d>] error_exit+0x0/0x96


Code:  Bad RIP value.
RIP  [<0000000000000005>]
 RSP <ffff810037c13b90>
CR2: 0000000000000005
 <0>Kernel panic - not syncing: Fatal exception
 BUG: warning at kernel/panic.c:137/panic() (Tainted: G   M )

Call Trace:
 <IRQ>  [<ffffffff80093c52>] panic+0x1df/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff801aa752>] vt_console_print+0x213/0x229
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff8002426f>] smp_call_function_interrupt+0x66/0x84
 [<ffffffff80060ccb>] call_function_interrupt+0x6b/0x70
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff80079800>] smp_send_stop+0x62/0x64
 [<ffffffff800797e9>] smp_send_stop+0x4b/0x64
 [<ffffffff80093b07>] panic+0x94/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff8006bc38>] bad_gs+0x26/0x2089
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff800683ef>] _spin_unlock+0x17/0x20
 [<ffffffff80150451>] __next_cpu+0x19/0x28
 [<ffffffff8008de33>] find_busiest_group+0x20d/0x621
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff800a0a96>] __rcu_process_callbacks+0x140/0x1d5
 [<ffffffff800a0b4e>] rcu_process_callbacks+0x23/0x43
 [<ffffffff800971d2>] tasklet_action+0x6c/0xc3
 [<ffffffff80012a0e>] __do_softirq+0x67/0xf3
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff8007066a>] do_softirq+0x35/0xa0
 [<ffffffff80060d3b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8006a77f>] do_page_fault+0x424/0x835
 [<ffffffff8000c021>] find_vma+0x23/0x5d
 [<ffffffff8006a7b1>] do_page_fault+0x456/0x835
 [<ffffffff80028908>] do_filp_open+0x1c/0x38
 [<ffffffff80013274>] poison_obj+0x26/0x2f
 [<ffffffff80032ff3>] cache_free_debugcheck+0x208/0x217
 [<ffffffff80060e9d>] error_exit+0x0/0x96

BUG: warning at drivers/input/serio/i8042.c:846/i8042_panic_blink() (Tainted: G
  M )

Call Trace:
 <IRQ>  [<ffffffff802015c8>] i8042_panic_blink+0x112/0x2a5
 [<ffffffff80093bf8>] panic+0x185/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff801aa752>] vt_console_print+0x213/0x229
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff8002426f>] smp_call_function_interrupt+0x66/0x84
 [<ffffffff80060ccb>] call_function_interrupt+0x6b/0x70
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff80079800>] smp_send_stop+0x62/0x64
 [<ffffffff800797e9>] smp_send_stop+0x4b/0x64
 [<ffffffff80093b07>] panic+0x94/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff8006bc38>] bad_gs+0x26/0x2089
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff800683ef>] _spin_unlock+0x17/0x20
 [<ffffffff80150451>] __next_cpu+0x19/0x28
 [<ffffffff8008de33>] find_busiest_group+0x20d/0x621
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff800a0a96>] __rcu_process_callbacks+0x140/0x1d5
 [<ffffffff800a0b4e>] rcu_process_callbacks+0x23/0x43
 [<ffffffff800971d2>] tasklet_action+0x6c/0xc3
 [<ffffffff80012a0e>] __do_softirq+0x67/0xf3
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff8007066a>] do_softirq+0x35/0xa0
 [<ffffffff80060d3b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8006a77f>] do_page_fault+0x424/0x835
 [<ffffffff8000c021>] find_vma+0x23/0x5d
 [<ffffffff8006a7b1>] do_page_fault+0x456/0x835
 [<ffffffff80028908>] do_filp_open+0x1c/0x38
 [<ffffffff80013274>] poison_obj+0x26/0x2f
 [<ffffffff80032ff3>] cache_free_debugcheck+0x208/0x217
 [<ffffffff80060e9d>] error_exit+0x0/0x96

BUG: warning at drivers/input/serio/i8042.c:849/i8042_panic_blink() (Tainted: G
  M )

Call Trace:
 <IRQ>  [<ffffffff802016b1>] i8042_panic_blink+0x1fb/0x2a5
 [<ffffffff80093bf8>] panic+0x185/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff801aa752>] vt_console_print+0x213/0x229
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff8002426f>] smp_call_function_interrupt+0x66/0x84
 [<ffffffff80060ccb>] call_function_interrupt+0x6b/0x70
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff80079800>] smp_send_stop+0x62/0x64
 [<ffffffff800797e9>] smp_send_stop+0x4b/0x64
 [<ffffffff80093b07>] panic+0x94/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff8006bc38>] bad_gs+0x26/0x2089
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff800683ef>] _spin_unlock+0x17/0x20
 [<ffffffff80150451>] __next_cpu+0x19/0x28
 [<ffffffff8008de33>] find_busiest_group+0x20d/0x621
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff800a0a96>] __rcu_process_callbacks+0x140/0x1d5
 [<ffffffff800a0b4e>] rcu_process_callbacks+0x23/0x43
 [<ffffffff800971d2>] tasklet_action+0x6c/0xc3
 [<ffffffff80012a0e>] __do_softirq+0x67/0xf3
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff8007066a>] do_softirq+0x35/0xa0
 [<ffffffff80060d3b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8006a77f>] do_page_fault+0x424/0x835
 [<ffffffff8000c021>] find_vma+0x23/0x5d
 [<ffffffff8006a7b1>] do_page_fault+0x456/0x835
 [<ffffffff80028908>] do_filp_open+0x1c/0x38
 [<ffffffff80013274>] poison_obj+0x26/0x2f
 [<ffffffff80032ff3>] cache_free_debugcheck+0x208/0x217
 [<ffffffff80060e9d>] error_exit+0x0/0x96

BUG: warning at drivers/input/serio/i8042.c:851/i8042_panic_blink() (Tainted: G
  M )

Call Trace:
 <IRQ>  [<ffffffff8020172e>] i8042_panic_blink+0x278/0x2a5
 [<ffffffff80093bf8>] panic+0x185/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff801aa752>] vt_console_print+0x213/0x229
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff8002426f>] smp_call_function_interrupt+0x66/0x84
 [<ffffffff80060ccb>] call_function_interrupt+0x6b/0x70
 [<ffffffff8016968b>] vgacon_cursor+0x0/0x1a5
 [<ffffffff80079800>] smp_send_stop+0x62/0x64
 [<ffffffff800797e9>] smp_send_stop+0x4b/0x64
 [<ffffffff80093b07>] panic+0x94/0x1f0
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff80017812>] release_console_sem+0x43/0x228
 [<ffffffff8006bc38>] bad_gs+0x26/0x2089
 [<ffffffff80068eed>] oops_end+0x64/0x66
 [<ffffffff8006aac6>] do_page_fault+0x76b/0x835
 [<ffffffff800683ef>] _spin_unlock+0x17/0x20
 [<ffffffff80150451>] __next_cpu+0x19/0x28
 [<ffffffff8008de33>] find_busiest_group+0x20d/0x621
 [<ffffffff80060e9d>] error_exit+0x0/0x96
 [<ffffffff800a0a96>] __rcu_process_callbacks+0x140/0x1d5
 [<ffffffff800a0b4e>] rcu_process_callbacks+0x23/0x43
 [<ffffffff800971d2>] tasklet_action+0x6c/0xc3
 [<ffffffff80012a0e>] __do_softirq+0x67/0xf3
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff8007066a>] do_softirq+0x35/0xa0
 [<ffffffff80060d3b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8006a77f>] do_page_fault+0x424/0x835
 [<ffffffff8000c021>] find_vma+0x23/0x5d
 [<ffffffff8006a7b1>] do_page_fault+0x456/0x835
 [<ffffffff80028908>] do_filp_open+0x1c/0x38
 [<ffffffff80013274>] poison_obj+0x26/0x2f
 [<ffffffff80032ff3>] cache_free_debugcheck+0x208/0x217
 [<ffffffff80060e9d>] error_exit+0x0/0x96


Comment 3 Jiri Pirko 2008-07-22 08:08:05 UTC
Machine is currently reserved and will be free hopefully next week.

Comment 4 Jiri Pirko 2008-07-24 07:09:08 UTC
I tried to reproduce this on ibm-palmetto.rhts.bos.redhat.com host. I used cron
to reboot machine every 5 minutes. I tried kernel-debug 2.6.18-92.1.10.el5 (for
about 3 hours) and 2.6.18-92.1.6.el5 (for about 16 hours) and I had no luck. I
suggest that this approach may lead to mentioned oops. Or am I wrong?

Comment 5 Qian Cai 2008-08-04 07:21:16 UTC
I got a vmcore. Although backtrace slightly different from original reported, it still failed in /etc/rc.sysinit -> start_udev -> /sbin/udevsettle

http://porkchop.devel.redhat.com/qa/qa/qcai/vmcores/vmcore-455691

+ ^M/sbin/udevsettle
<66<1>Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP: ^M
 [<0000000000000028>]^M
PGD 36a8b067 PUD 3618f067 PMD 0 ^M
Oops: 0010 [1] SMP ^M 
last sysfs file: /class/usb_device/usbdev3.1/dev^M
CPU 0 ^M
Modules linked in: sg i2c_i801 shpchp i2c_core pcspkr i6300esb e1000 ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd^M
Pid: 2078, comm: modprobe Tainted: G   M  2.6.18-92.el5debug #1^M
RIP: 0010:[<0000000000000028>]  [<0000000000000028>]^M
RSP: 0018:ffff81003458fa60  EFLAGS: 00010003^M
RAX: 0000000000004208 RBX: ffff81003458fad8 RCX: 0000000000020000^M
RDX: 0000000000004208 RSI: 0000000000020000 RDI: 0000000000000036^M
RBP: ffffffff8029bd71 R08: 00000000ffffffff R09: 0000000000000020^M
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000246^M
R13: ffffffff8051e5a2 R14: 0000000000000029 R15: 0000000000000016^M
FS:  00002b44650ab240(0000) GS:ffffffff8040c000(0000) knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
CR2: 0000000000000028 CR3: 000000003d780000 CR4: 00000000000006e0^M
Process modprobe (pid: 2078, threadinfo ffff81003458e000, task ffff810034fde1c0)^M
Stack:  0000004488180cf2 ffffffff800cff31 ffff810037ffb178 0000000000000286^M
 ffffffff8000aa62 0000000000000000 000000440000e000 ffff810000353834^M
 000200d000000000 0000000000000015 0000000000000015 00000000000000ef^M
Call Trace:^M
 [<ffffffff800cff31>] zone_statistics+0x3e/0x6d^M
 [<ffffffff8000aa62>] get_page_from_freelist+0x232/0x450^M
 [<ffffffff800943b5>] printk+0x52/0xbd^M
 [<ffffffff8007bc14>] assign_irq_vector+0xb9/0xbf^M
 [<ffffffff8007c0da>] gsi_irq_sharing+0xf6/0xfd^M
 [<ffffffff8007d1e2>] physflat_cpu_mask_to_apicid+0x0/0x27^M
 [<ffffffff8007c940>] io_apic_set_pci_routing+0xff/0x261^M
 [<ffffffff8007cd8c>] mp_register_gsi+0x154/0x15e^M
 [<ffffffff80078ef2>] acpi_register_gsi+0x36/0x51^M
 [<ffffffff8019169d>] acpi_pci_irq_enable+0x13b/0x1e5^M
 [<ffffffff8015a83e>] pci_enable_device_bars+0x27/0x34^M
 [<ffffffff8015a869>] pci_enable_device+0x1e/0x3e^M
 [<ffffffff88171f3b>] :e1000:e1000_probe+0x1d/0xca3^M
 [<ffffffff8015c953>] pci_device_probe+0x100/0x180^M
 [<ffffffff801bd08e>] driver_probe_device+0x52/0xaa^M
 [<ffffffff801bd1ef>] __driver_attach+0x86/0xd4^M
 [<ffffffff801bd169>] __driver_attach+0x0/0xd4^M
 [<ffffffff801bca93>] bus_for_each_dev+0x43/0x6e^M
 [<ffffffff801bc6d6>] bus_add_driver+0x81/0x133^M
 [<ffffffff8015cb43>] __pci_register_driver+0x5b/0x7c^M
 [<ffffffff88194045>] :e1000:e1000_init_module+0x45/0x7b^M
 [<ffffffff800ade86>] sys_init_module+0xb1/0x1ee^M
 [<ffffffff80060116>] system_call+0x7e/0x83^M
^M
^M
Code:  Bad RIP value.^M
RIP  [<0000000000000028>]^M
 RSP <ffff81003458fa60>^M

Comment 6 Qian Cai 2008-08-04 08:11:31 UTC
Looks like I am hitting another similar crash here,

http://porkchop.devel.redhat.com/qa/qa/qcai/vmcores/vmcore-455691-2

+ /sbin/udevsettle^M
<66<1>Unable to handle kernel NULL pointer dereference at 0000000000000080 RIP: ^M
 [<0000000000000080>]^M
PGD 3c95b067 PUD 3c947067 PMD 0 ^M
Oops: 0010 [1] SMP ^M
last sysfs file: /class/misc/mcelog/dev^M
CPU 0 ^M
Modules linked in: i2c_i801 ide_cd i6300esb shpchp e1000 cdrom i2c_core serio_raw dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd^M
Pid: 666, comm: modprobe Tainted: G   M  2.6.18-92.el5debug #1^M
RIP: 0010:[<0000000000000080>]  [<0000000000000080>]^M
RSP: 0000:ffff81003c953c30  EFLAGS: 00010007^M
RAX: 000000000000416c RBX: ffff81003c953ca8 RCX: 0000000000020000^M
RDX: 000000000000416c RSI: 0000000000020000 RDI: 0000000000000036^M
RBP: ffffffff8817ac80 R08: 00000000ffffffff R09: ffffffff80094105^M
R10: ffffffff80094105 R11: ffff81003ed58a58 R12: 0000000000000246^M
R13: ffffffff8051e5a2 R14: 0000000000000038 R15: ffff810037cfef60^M
FS:  00002aad1cffe240(0000) GS:ffffffff8040c000(0000) knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
CR2: 0000000000000080 CR3: 000000003c971000 CR4: 00000000000006e0^M
Process modprobe (pid: 666, threadinfo ffff81003c952000, task ffff81003c950240)^M
Stack:  ffffffff804e3f20 0000000000000001 0000000000000002 ffff81003c953fd8^M
 ffff81003c952000 ffff81003c950240 ffffffff804e3f98 000000000000000d^M
 ffffffff800606a8 ffff81003ed58a58 ffff81003ed58a58 ffff81003ed58000^M
Call Trace:^M
 [<ffffffff800606a8>] restore_args+0x0/0x30^M
 [<ffffffff800943b5>] printk+0x52/0xbd^M
 [<ffffffff8000cfc8>] __delay+0x8/0x10^M
 [<ffffffff88170819>] :e1000:e1000_release_eeprom+0x5f/0x80^M
 [<ffffffff88170d67>] :e1000:e1000_read_eeprom+0x2af/0x2c7^M
 [<ffffffff8816d9bb>] :e1000:e1000_probe+0xa9d/0xca3^M
 [<ffffffff8015c953>] pci_device_probe+0x100/0x180^M
 [<ffffffff801bd08e>] driver_probe_device+0x52/0xaa^M
 [<ffffffff801bd1ef>] __driver_attach+0x86/0xd4^M
 [<ffffffff801bd169>] __driver_attach+0x0/0xd4^M
 [<ffffffff801bca93>] bus_for_each_dev+0x43/0x6e^M
 [<ffffffff801bc6d6>] bus_add_driver+0x81/0x133^M
 [<ffffffff8015cb43>] __pci_register_driver+0x5b/0x7c^M
 [<ffffffff88153045>] :e1000:e1000_init_module+0x45/0x7b^M
 [<ffffffff800ade86>] sys_init_module+0xb1/0x1ee^M
 [<ffffffff80060116>] system_call+0x7e/0x83^M

Code:  Bad RIP value.^M
RIP  [<0000000000000080>]^M
 RSP <ffff81003c953c30>^M

Comment 7 Qian Cai 2008-08-04 08:21:14 UTC
Looks like random crash to me. Let me know if you need vmcore of it.

Scanning logical volumes^M
  Reading all physical volumes.  This may take a while...^M
  Found volume group "VolGroup00" using metadata type lvm2^M
Activating logical volumes^M
Unable to handle kernel paging request at ffffffffbb00863e RIP: ^M
 [<ffffffffbb00863e>]^M
PGD 1003067 PUD 1005063 PMD 0 ^M
Oops: 0010 [1] SMP ^M
last sysfs file: /block/ram0/dev^M
CPU 0 ^M
Modules linked in: dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd ata_piix libata sd_mod scsi_mod^M
Pid: 926, comm: lvm Tainted: G   M  2.6.18-92.1.6.el5debug #1^M
RIP: 0010:[<ffffffffbb00863e>]  [<ffffffffbb00863e>]^M
RSP: 0018:ffff81000816fd20  EFLAGS: 00010246^M
RAX: 0000000000000000 RBX: ffff8100081977a8 RCX: ffff81000816fddc^M
RDX: ffffffff800e88e2 RSI: ffff81000816fddc RDI: ffff8100081977a8^M
RBP: ffff8100081977a8 R08: 0000000000000002 R09: 0000000000000000^M
R10: ffffffff8005c61e R11: ffff8100081226c0 R12: ffff81000816fddc^M
R13: ffffffff800e88e2 R14: ffff81000217aa98 R15: ffff8100021c32f8^M
FS:  00000000088e4880(0063) GS:ffffffff8040e000(0000) knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
CR2: ffffffffbb00863e CR3: 000000000871d000 CR4: 00000000000006e0^M
Process lvm (pid: 926, threadinfo ffff81000816e000, task ffff8100081226c0)^M
Stack:  ffffffff45131ceb ffffffff8005c61e ffff81000816fddc 0000000000000001^M
 ffffffff800e88e2 ffff81000217aa98 ffff8100021c32f8 ffffffff800e88e2^M
 ffffffff8005c62f ffff8100081a16c0 0000000000000000 ffff81000217aa98^M
Call Trace:^M
 [<ffffffff8005c61e>] ifind+0x23/0xab^M
 [<ffffffff800e88e2>] bdev_test+0x0/0xf^M
 [<ffffffff800e88e2>] bdev_test+0x0/0xf^M
 [<ffffffff8005c62f>] ifind+0x34/0xab^M
 [<ffffffff800eed2a>] iget5_locked+0x6c/0x19b^M
 [<ffffffff800e784a>] bdev_set+0x0/0xb^M
 [<ffffffff800e80fd>] bdget+0x40/0x112^M
 [<ffffffff800e82de>] do_open+0x10f/0x313^M
 [<ffffffff800e873c>] blkdev_open+0x0/0x4f^M
 [<ffffffff800e873c>] blkdev_open+0x0/0x4f^M
 [<ffffffff800e875f>] blkdev_open+0x23/0x4f^M
 [<ffffffff8001f484>] __dentry_open+0xd9/0x1e2^M
 [<ffffffff80028916>] do_filp_open+0x2a/0x38^M
 [<ffffffff800683ef>] _spin_unlock+0x17/0x20^M
 [<ffffffff80016702>] get_unused_fd+0xf9/0x107^M
 [<ffffffff8001a65f>] do_sys_open+0x44/0xbe^M
 [<ffffffff80060116>] system_call+0x7e/0x83^M
^M
^M
Code:  Bad RIP value.^M
RIP  [<ffffffffbb00863e>]^M
 RSP <ffff81000816fd20>^M
CR2: ffffffffbb00863e^M
 <0>Kernel panic - not syncing: Fatal exception^M

Comment 8 Qian Cai 2008-08-04 08:24:33 UTC
Please disregard Comment #7. It is irrelevant here.

Comment 9 Qian Cai 2008-08-04 08:42:12 UTC
It is possible to reproduce the original crash stack traces on 92.1.6.el5 Kernel, but seems there is no way to save a vmcore, because Kdump Kernel crashed as well,

Unable to handle kernel paging request at ffffffff8405e2d9 RIP: 
 [<ffffffff8405e2d9>]
PGD 203067 PUD 205063 PMD 0 
Oops: 0010 [1] SMP 
last sysfs file: /devices/pnp0/00:09/id
CPU 1 
Modules linked in: dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 622, comm: modprobe Tainted: G   M  2.6.18-92.1.6.el5debug #1
RIP: 0010:[<ffffffff8405e2d9>]  [<ffffffff8405e2d9>]
RSP: 0000:ffff810037c13ef8  EFLAGS: 00010286
RAX: ffff81003e54ed50 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 000000000000000c RSI: ffff810009000038 RDI: ffff81003e54ed50
RBP: ffff81000a600d20 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000158 R12: ffff81003d36beb0
R13: 0000000000000003 R14: 0000000000000007 R15: 00007fff1f53b8b7
FS:  00002af78b57edc0(0000) GS:ffff81003ff60300(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff8405e2d9 CR3: 000000003cd3f000 CR4: 00000000000006e0
Process modprobe (pid: 622, threadinfo ffff81003fc18000, task ffff81003cd76580)
Stack:  ffffffff800a0a96 ffff810037cbc000 ffff81000a600e20 0000000000000000
 000000000000000a 0000000000000001 ffffffff800a0b4e ffffffff80425fd0
 ffffffff800971d2 ffffffff80425f90 0000000000000001 ffffffff80425fd0
Call Trace:
 <IRQ>  [<ffffffff800a0a96>] __rcu_process_callbacks+0x140/0x1d5
 [<ffffffff800a0b4e>] rcu_process_callbacks+0x23/0x43
 [<ffffffff800971d2>] tasklet_action+0x6c/0xc3
 [<ffffffff80012a0e>] __do_softirq+0x67/0xf3
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff8007066a>] do_softirq+0x35/0xa0
 [<ffffffff80060d3b>] apic_timer_interrupt+0x6b/0x70
 <EOI> 

Code:  Bad RIP value.
RIP  [<ffffffff8405e2d9>]
 RSP <ffff810037c13ef8>
Linux version 2.6.18-92.1.6.el5debug (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Fri Jun 20 03:07:33 EDT 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices memmap=exactmap memmap=640K@0K memmap=9476K@16384K memmap=120940K@26500K elfcorehdr=147440K memmap=24K#1048000K memmap=40K#1048024K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009b800 (usable)
 BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003ff70000 (usable)
 BIOS-e820: 000000003ff70000 - 000000003ff76000 (ACPI data)
 BIOS-e820: 000000003ff76000 - 000000003ff80000 (ACPI NVS)
 BIOS-e820: 000000003ff80000 - 0000000040000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 0000000001941000 (usable)
 user: 00000000019e1000 - 0000000008ffc000 (usable)
 user: 000000003ff70000 - 000000003ff76000 (ACPI data)
 user: 000000003ff76000 - 000000003ff80000 (ACPI data)
DMI present.
No NUMA configuration found
Faking a node at 0000000000000000-0000000008ffc000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0x1008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:4 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:4 APIC version 20
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfec10000] gsi_base[24])
IOAPIC[1]: apic_id 3, version 32, address 0xfec10000, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 0000000001941000 - 00000000019e1000
Allocating PCI resources starting at 40000000 (gap: 3ff80000:c0080000)
SMP: Allowing 2 CPUs, 0 hotplug CPUs
Built 1 zonelists.  Total pages: 31893
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices memmap=exactmap memmap=640K@0K memmap=9476K@16384K memmap=120940K@26500K elfcorehdr=147440K memmap=24K#1048000K memmap=40K#1048024K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Console: colour VGA+ 80x25
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:    8
... MAX_LOCK_DEPTH:          30
... MAX_LOCKDEP_KEYS:        2048
... CLASSHASH_SIZE:           1024
... MAX_LOCKDEP_ENTRIES:     8192
... MAX_LOCKDEP_CHAINS:      8192
... CHAINHASH_SIZE:          4096
 memory used by lock dependency info: 1392 kB
 per task-struct memory footprint: 2160 bytes
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
Memory: 110836k/147440k available (2525k kernel code, 20220k reserved, 1623k data, 196k init)
Calibrating delay using timer specific routine.. 2904.44 BogoMIPS (lpj=1452220)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
using mwait in idle threads.
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
lockdep: not fixing up alternatives.
ACPI: Core revision 20060707
Using local APIC timer interrupts.
result 6251239
Detected 6.251 MHz APIC timer.
Brought up 1 CPUs
testing NMI watchdog ... OK.
Disabling vsyscall due to use of PM timer
time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
time.c: Detected 1400.285 MHz processor.
checking if image is initramfs... it is
Freeing initrd memory: 3002k freed
NET: Registered protocol family 16
No dock devices found.
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI quirk: region 1000-107f claimed by ICH4 ACPI/GPIO/TCO
PCI quirk: region 1180-11bf claimed by ICH4 GPIO
PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.2
PCI: Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Link [LNKA] (IRQs *3 10 11 14 15), disabled.
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 10 11 14 15) *5, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 10 11 14 15) *5, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 10 *11 14 15), disabled.
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 10 *11 14 15), disabled.
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 10 11 14 15) *7, disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 12 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
PCI-GART: No AMD northbridge found.
PCI: Bridge: 0000:00:03.0
  IO window: 2000-2fff
  MEM window: d0100000-d01fffff
  PREFETCH window: 50000000-500fffff
PCI: Bridge: 0000:00:1c.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:1e.0
  IO window: disabled.
  MEM window: d0200000-d02fffff
  PREFETCH window: e0000000-efffffff
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 1, 8192 bytes)
TCP established hash table entries: 4096 (order: 6, 262144 bytes)
TCP bind hash table entries: 2048 (order: 5, 131072 bytes)
TCP: Hash tables configured (established 4096 bind 2048)
TCP reno registered
Simple Boot Flag at 0x38 set to 0x1
audit: initializing netlink socket (disabled)
audit(1217838876.420:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key 5C9DD0FD4CC9C295
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: CPU1 (power states: C1[C1] C3[C3])
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: CPU0 (power states: C1[C1] C3[C3])
ACPI: Processor [CPU1] (supports 8 throttling states)
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
agpgart: Detected an Intel i875 Chipset.
agpgart: AGP aperture is 32M @ 0xd2000000
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:09: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 16384K size 4096 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
hda: CD-224E, ATAPI CD/DVD-ROM drive
ide1: I/O resource 0x170-0x177 not free.
ide1: ports already in use, skipping probe
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:KBC0,PNP0f13:MSE0] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
ACPI: (supports S0 S1 S4 S5)
Freeing unused kernel memory: 196k freed
Write protecting the kernel read-only data: 496k
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading scsi_mod.ko module
SCSI subsystem initialized
Loading sd_mod.ko module
Loading libata.ko module
Loading ata_piix.ko module
ata_piix 0000:00:1f.2: MAP [ IDE IDE P0 P1 ]
GSI 16 sharing vector 0xA9 and IRQ 16
ACPI: PCI Interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 169
ata_piix 0000:00:1f.2: 0x1F0 IDE port busy
scsi0 : ata_piix
scsi1 : ata_piix
ata1: DUMMY
ata2: SATA max UDMA/133 bmdma 0x1478 irq 15
ata2.00: ATA-6: WDC WD400JD-75HKA1, 14.03G14, max UDMA/133
ata2.00: 78125000 sectors, multi 16: LBA 
ata2.00: configured for UDMA/133
  Vendor: ATA       Model: WDC WD400JD-75HK  Rev: 14.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 78125000 512-byte hdwr sectors (40000 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
SCSI device sda: 78125000 512-byte hdwr sectors (40000 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
 sda: sda1 sda2
sd 1:0:0:0: Attached scsi disk sda
Loading jbd.ko module
Loading ext3.ko module
Loading dm-mod.ko module
device-mapper: uevent: version 1.0.3
device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel
Loading dm-mirror.ko module
Loading dm-zero.ko module
Loading dm-snapshot.ko module
Creating Block Devices
Creating block device ram0
Creating block device ram1
Creating block device ram10
Creating block device ram11
Creating block device ram12
Creating block device ram13
Creating block device ram14
Creating block device ram15
Creating block device ram2
Creating block device ram3
Creating block device ram4
Creating block device ram5
Creating block device ram6
Creating block device ram7
Creating block device ram8
Creating block device ram9
Creating block device sda
Making device-mapper control node
Scanning logical volumes
  Reading all physical volumes.  This may take a while...
Unable to handle kernel paging request at ffffffffbaf0863f RIP: 
 [<ffffffffbaf0863f>]
PGD 1003067 PUD 1005063 PMD 0 
Oops: 0010 [1] SMP 
last sysfs file: /block/ram0/dev
CPU 0 
Modules linked in: dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd ata_piix libata sd_mod scsi_mod
Pid: 924, comm: lvm Tainted: G   M  2.6.18-92.1.6.el5debug #1
RIP: 0010:[<ffffffffbaf0863f>]  [<ffffffffbaf0863f>]
RSP: 0018:ffff810008123d80  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff810008198828 RCX: ffff810008123e0c
RDX: ffffffff800e88e2 RSI: ffff810008123e0c RDI: ffff810008198828
RBP: ffff810008198828 R08: 0000000000000002 R09: 0000000000000000
R10: ffffffff800eed56 R11: ffff810008dd6580 R12: ffff810008123e0c
R13: ffffffff800e88e2 R14: ffff81000217cf00 R15: ffff8100021c32f8
FS:  00000000163cc880(0063) GS:ffffffff8040e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffbaf0863f CR3: 0000000008728000 CR4: 00000000000006e0
Process lvm (pid: 924, threadinfo ffff810008122000, task ffff810008dd6580)
Stack:  ffffffff45231cea ffffffff800eed56 ffff8100081b3328 0000000000000000
 ffff81000217cf00 ffff8100021c32f8 ffff810008123e0c ffffffff800e88e2
 ffffffff800eed67 ffffffff800e784a 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff800eed56>] iget5_locked+0x98/0x19b
 [<ffffffff800e88e2>] bdev_test+0x0/0xf
 [<ffffffff800eed67>] iget5_locked+0xa9/0x19b
 [<ffffffff800e784a>] bdev_set+0x0/0xb
 [<ffffffff800e873c>] blkdev_open+0x0/0x4f
 [<ffffffff800e80fd>] bdget+0x40/0x112
 [<ffffffff800e864b>] bd_acquire+0x42/0xaa
 [<ffffffff800e874f>] blkdev_open+0x13/0x4f
 [<ffffffff8001f484>] __dentry_open+0xd9/0x1e2
 [<ffffffff80028916>] do_filp_open+0x2a/0x38
 [<ffffffff800683ef>] _spin_unlock+0x17/0x20
 [<ffffffff80016702>] get_unused_fd+0xf9/0x107
 [<ffffffff8001a65f>] do_sys_open+0x44/0xbe
 [<ffffffff80060116>] system_call+0x7e/0x83


Code:  Bad RIP value.
RIP  [<ffffffffbaf0863f>]
 RSP <ffff810008123d80>
CR2: ffffffffbaf0863f
 <0>Kernel panic - not syncing: Fatal exception

Comment 10 Vitaly Mayatskikh 2008-08-04 21:22:39 UTC
So, both times kernel was crashed with 'modprobe e1000'. Upstream commit 3c34ac36ac1084e571ef9b6fb1d6a5b10ccc1fd0 looks very promising for me with regard to comment 2.

Comment 11 Jiri Pirko 2008-08-06 10:33:29 UTC
With qcai we figured out that the best (most likely) way to reproduce this issue is using kernel-debug-2.6.18-92.1.6.el5.x86_64 on palmetto machine. To trigger the oops you must cold-reset the machine via RHTS web. Following boot has about 30% chance to oops.

Comment 12 Jiri Pirko 2008-08-06 12:57:03 UTC
So I used 2.6.18-92.1.6.el5 and patched it with upstream patch from comment #10. no change, still gotting oopses.

Comment 13 John Ronciak 2008-08-07 15:47:40 UTC
So can you bring up the kernel with out loading the e1000 driver and then load it manually after the system is up?  Once the system is up without the driver loaded can the driver be loaded and unloaded multiple times?  It seems very odd that only this one configuration is broke.

Comment 14 Andy Gospodarek 2008-08-07 18:14:05 UTC
So this only happens with the debug kernel and seems most common with kernel-debug-2.6.18-92.1.6.el5.x86_64?

It would be interesting to blacklist the e1000 module and see if that changes any behavior, but it seems unlikely.

You could also consider building the kernel with some of these options disabled if we think it is likely to be related to any options added for debug kernels:

CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_LOCK_STAT=y

Comment 15 Jiri Pirko 2008-08-08 08:38:00 UTC
(In reply to comment #14)
> So this only happens with the debug kernel and seems most common with
> kernel-debug-2.6.18-92.1.6.el5.x86_64?
Exactly.

> 
> It would be interesting to blacklist the e1000 module and see if that changes
> any behavior, but it seems unlikely.
Same issues still appear when e1000 is blacklisted. So it ruled out e1000. Thanks guys and feel free to unCC yourself.

> 
> You could also consider building the kernel with some of these options disabled
> if we think it is likely to be related to any options added for debug kernels:
> 
> CONFIG_DEBUG_SLAB=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_RWSEMS=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_PROVE_LOCKING=y
> CONFIG_DEBUG_VM=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_SPINLOCK_SLEEP=y
> CONFIG_LOCK_STAT=y
I'm going to try this.

Comment 16 Jiri Pirko 2008-08-12 09:43:35 UTC
When compiling kernel on palmetto, this (and similar) appeared:

  CC      init/initramfs.o
init/initramfs.c: In function ‘unpack_to_rootfs’:
init/initramfs.c:441: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugzilla.redhat.com/bugzilla> for instructions.
The bug is not reproducible, so it is likely a hardware or OS problem.
make[1]: *** [init/initramfs.o] Error 1
make: *** [init] Error 2

Now I'm unable to compile kernel on this machine at all. Using original 2.6.18-92.el5.

Comment 17 Jiri Pirko 2008-08-12 09:58:29 UTC
another try to compile kernel caused this:

Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Kernel 2.6.18-92.el5 on an x86_64

ibm-palmetto.rhts.bos.redhat.com login: 
HARDWARE ERROR
CPU 1: Machine Check Exception:                6 Bank 3: fe00000000010100
RIP 00:<00000000004b423b> 
TSC 7bd3ebdba2 ADDR 3a8 MISC 1d400c0802a80 
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

HARDWARE ERROR
CPU 0: Machine Check Exception:                6 Bank 3: fe00000000010100
RIP 00:<ffffffff80056c9a> 
TSC 7bd3ebde58 ADDR 3a8 MISC 1d400c0802a80 
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check

Comment 18 Jiri Pirko 2008-08-12 10:29:23 UTC
I also noticed that palmetto is according to bios info table running 3.4GHz P4 but sometimes this info says 1.4GHz or even 6GHz...