Bug 746312 - kernel panic during LTP filesystem test run on ext3
Summary: kernel panic during LTP filesystem test run on ext3
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.2
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-10-14 18:14 UTC by Mike Gahagan
Modified: 2011-11-02 17:07 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-11-02 17:07:59 UTC
Target Upstream Version:



Description Mike Gahagan 2011-10-14 18:14:19 UTC
Description of problem:
BUG: unable to handle kernel paging request at 000b77a6 
IP: [<c06041fb>] __list_add+0xb/0xb0 
*pdpt = 00000000338db001 *pde = 000000014de3c067  
Oops: 0000 [#1] SMP  
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.7/0000:03:00.0/0000:04:00.0/local_cpus 
Modules linked in: tun snd_seq_dummy bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
 
Pid: 28111, comm: awk Tainted: G    B   W  ----------------   2.6.32-207.el6.i686 #1 Intel Corporation SandyBridge Platform/LosLunas CRB 
EIP: 0060:[<c06041fb>] EFLAGS: 00010246 CPU: 4 
EIP is at __list_add+0xb/0xb0 
EAX: f4757cc0 EBX: 000b77a6 ECX: ebace5b0 EDX: 000b77a6 
ESI: f4757cb0 EDI: ebace5a4 EBP: f464e9b4 ESP: dea6de88 
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 
Process awk (pid: 28111, ti=dea6c000 task=d8e58570 task.ti=dea6c000) 
Stack: 
 00000000 00000000 00000000 00000246 00000000 f46ede3c c050be8f df6f5800 
<0> f464e9f0 ebace5a4 c050be8f df597d9c f464e9b4 c050c564 df597d9c f464ea18 
<0> df597e00 c0452b47 e3e87fb0 da3dbe40 df458740 00000001 df5973d8 df5973e4 
Call Trace: 
 [<c050be8f>] ? anon_vma_chain_link+0x2f/0x40 
 [<c050be8f>] ? anon_vma_chain_link+0x2f/0x40 
 [<c050c564>] ? anon_vma_fork+0x84/0xa0 
 [<c0452b47>] ? dup_mm+0x1c7/0x420 
 [<c045380a>] ? copy_process+0xa1a/0x1010 
 [<c05a219c>] ? security_file_alloc+0xc/0x10 
 [<c0453e7a>] ? do_fork+0x7a/0x3e0 
 [<c05336b9>] ? do_pipe_flags+0xb9/0x120 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c04082c3>] ? sys_clone+0x33/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 
Code: c7 44 24 04 33 00 00 00 c7 04 24 64 b9 98 c0 e8 fc 09 e5 ff 8b 44 24 14 8b 10 eb 89 8d 74 26 00 53 83 ec 24 8b 59 04 39 d3 75 15 <8b> 1a 39 d9 75 51 89 41 04 89 08 89 50 04 89 02 83 c4 24 5b c3  
EIP: [<c06041fb>] __list_add+0xb/0xb0 SS:ESP 0068:dea6de88 
CR2: 00000000000b77a6 


Version-Release number of selected component (if applicable):
Snapshot 2 (-207 kernel)

How reproducible:
First time seen; will attempt to reproduce.

Steps to Reproduce:
1. Run /kernel/distribution/ltp/generic with TESTARGS set to "RHEL6KT1LITE RHEL6FS RHEL6CGROUP RHELPTRACE" on a system with /mnt/testarea formatted as ext3. The panic appeared to occur during the RHEL6FS test phase. I'll try to narrow it down from here.
  
Actual results:
List corruption warnings of the form:

------------[ cut here ]------------ 
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0() (Tainted: G    B   W  ----------------  ) 
Hardware name: SandyBridge Platform 
list_add corruption. next->prev should be prev (df98a59c), but was (null). (next=df9885fc). 
Modules linked in: tun snd_seq_dummy bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
Pid: 11295, comm: ps Tainted: G    B   W  ----------------   2.6.32-207.el6.i686 #1 
Call Trace: 
 [<c0454b41>] ? warn_slowpath_common+0x81/0xc0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0454c13>] ? warn_slowpath_fmt+0x33/0x40 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c053eb0a>] ? __d_instantiate+0x2a/0xd0 
 [<c053ebd9>] ? d_instantiate+0x29/0x50 
 [<c057dfce>] ? proc_lookup_de+0x7e/0xd0 
 [<c05788b9>] ? proc_root_lookup+0x19/0x50 
 [<c05367a2>] ? do_lookup+0x122/0x180 
 [<c0536eb3>] ? __link_path_walk+0x5e3/0xd60 
 [<c051cd40>] ? kmem_cache_alloc_notrace+0xa0/0xb0 
 [<c05adb32>] ? selinux_file_alloc_security+0x42/0xc0 
 [<c0537841>] ? path_walk+0x51/0xc0 
 [<c05379c9>] ? do_path_lookup+0x59/0x90 
 [<c053871c>] ? do_filp_open+0xdc/0xb00 
 [<c0505231>] ? handle_mm_fault+0x131/0x1d0 
 [<c0527fb8>] ? do_sys_open+0x58/0x130 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c052810c>] ? sys_open+0x2c/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 

Then we panic shortly after.

Expected results:

Run without panic. This test set has completed without any oopses or panics since the nightly trees prior to beta.

Additional info:

Comment 2 Eric Sandeen 2011-10-14 20:46:31 UTC
Ok, so list corruption.

By the time of the oops, it was also tainted with:
  6: 'B' if a page-release function has found a bad page reference or
     some unexpected page flags.
...

 10: 'W' if a warning has previously been issued by the kernel.
     (Though some warnings may set more specific taint flags.)
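
As a small illustration of reading the taint field from the oops header, here is a minimal Python sketch. The `TAINT_LETTERS` table and the `decode_taint` helper are hypothetical names. The 'B' and 'W' descriptions are the ones quoted above, the 'G' description is an assumption taken from the kernel's taint documentation, and any other letter is deliberately left undecoded.

```python
# Hypothetical lookup table: only the letters that appear in this report.
TAINT_LETTERS = {
    # Assumption: wording from the kernel's taint documentation, not from this bug.
    "G": "all loaded modules have a GPL or compatible license",
    "B": "a page-release function found a bad page reference or unexpected page flags",
    "W": "a warning has previously been issued by the kernel",
}


def decode_taint(field):
    """Return (letter, meaning) pairs for each flag letter in a taint field.

    Spaces and '-' placeholders are skipped; unknown letters are reported
    as not covered rather than guessed.
    """
    pairs = []
    for ch in field:
        if ch.isspace() or ch == "-":
            continue
        pairs.append((ch, TAINT_LETTERS.get(ch, "not covered by this sketch")))
    return pairs


# Taint field as it appears in the oops header of this report.
decoded = decode_taint("G    B   W  ----------------")
for letter, meaning in decoded:
    print(f"{letter}: {meaning}")
```

Because 'W' is set by any earlier warning, the taint field alone does not say which warning came first; that still has to be read from the log order.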

This seems to be the first error encountered:

------------[ cut here ]------------ 
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0() (Not tainted) 
Hardware name: SandyBridge Platform 
list_add corruption. next->prev should be prev (df98a59c), but was df9885fc. (next=df9885fc). 
Modules linked in: sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
Pid: 28220, comm: ps Not tainted 2.6.32-207.el6.i686 #1 
Call Trace: 
 [<c0454b41>] ? warn_slowpath_common+0x81/0xc0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0454c13>] ? warn_slowpath_fmt+0x33/0x40 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c053eb0a>] ? __d_instantiate+0x2a/0xd0 
 [<c053ebd9>] ? d_instantiate+0x29/0x50 
 [<c057a84f>] ? proc_pident_instantiate+0x5f/0x90 
 [<c057a995>] ? proc_pident_lookup+0x75/0xb0 
 [<c057aa24>] ? proc_tgid_base_lookup+0x14/0x20 
 [<c05367a2>] ? do_lookup+0x122/0x180 
 [<c0536eb3>] ? __link_path_walk+0x5e3/0xd60 
 [<c051cd40>] ? kmem_cache_alloc_notrace+0xa0/0xb0 
 [<c05adb32>] ? selinux_file_alloc_security+0x42/0xc0 
 [<c0537841>] ? path_walk+0x51/0xc0 
 [<c05379c9>] ? do_path_lookup+0x59/0x90 
 [<c053871c>] ? do_filp_open+0xdc/0xb00 
 [<c052eb6e>] ? cp_new_stat64+0xee/0x100 
 [<c0527fb8>] ? do_sys_open+0x58/0x130 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c052810c>] ? sys_open+0x2c/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 
---[ end trace 6a7cb877a54a826f ]--- 

so "ps" encountered list corruption somewhere in the proc filesystem guts ...
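
The consistency check behind this warning can be modelled in a few lines. This is a simplified Python sketch, not the kernel code: `Node` and `list_add_problems` are hypothetical names, the nodes are labelled with the addresses from the log purely for readability, and `prev->next` is assumed to be intact because only the `next->prev` warning appears in the log.

```python
class Node:
    """Minimal stand-in for a doubly linked list node."""

    def __init__(self, name):
        self.name = name
        # A freshly created node points at itself in both directions.
        self.next = self
        self.prev = self


def list_add_problems(prev, next_):
    """Report link mismatches that an insertion between prev and next_ would hit."""
    problems = []
    if next_.prev is not prev:
        problems.append(
            f"list_add corruption. next->prev should be prev ({prev.name}), "
            f"but was {next_.prev.name}. (next={next_.name})."
        )
    if prev.next is not next_:
        problems.append(
            f"list_add corruption. prev->next should be next ({next_.name}), "
            f"but was {prev.next.name}. (prev={prev.name})."
        )
    return problems


# Recreate the state described by the first warning: next->prev points back
# at next itself instead of at prev.
prev = Node("df98a59c")
nxt = Node("df9885fc")
prev.next = nxt  # assumption: this link was still correct

messages = list_add_problems(prev, nxt)
for message in messages:
    print(message)
```

In this model the bad state is simply a node whose `prev` link still points at itself; the log does not show how the real list got into that state.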

Comment 6 Eric Sandeen 2011-10-17 19:07:17 UTC
list_add corruption. next->prev should be prev (df98a59c), but was df9885fc.
(next=df9885fc). 

should be df98a59c (11011111100110001010010110011100)
  but was df9885fc (11011111100110001000010111111100)

More than just a bit flip, I guess, but based on the other bug, please do test memory on this box.

Thanks,
-Eric
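
The comparison above can be reproduced by XOR-ing the two pointer values and listing the bits that differ. This is a minimal Python sketch; the variable names are illustrative and the two values are copied from the warning.

```python
expected = 0xDF98A59C  # value next->prev should have had (prev)
actual = 0xDF9885FC    # value actually found in next->prev

# XOR leaves a 1 in every bit position where the two values disagree.
diff = expected ^ actual
differing_bits = [bit for bit in range(32) if (diff >> bit) & 1]

print(f"expected {expected:032b}")
print(f"actual   {actual:032b}")
print(f"xor      {diff:#x}, differing bits: {differing_bits}")
# xor      0x2060, differing bits: [5, 6, 13]
```

Three bits differ, which matches the observation that this is more than a single bit flip.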

Comment 7 Mike Gahagan 2011-10-17 20:03:24 UTC
I installed memtest86+ on this box, but when I tried to boot into it I never got any output on the remote console, so I don't know whether memtest86+ failed to run or our remote console isn't passing anything back to the client.

I did notice the system was complaining a lot about single-bit errors, and it panicked on shutdown, so bad memory or some other hardware issue is highly likely here.

Comment 9 Mike Gahagan 2011-11-02 17:07:59 UTC
The system this happened on was having hardware issues and is now being repaired. I think we can safely close this bug, as I never saw this on any other system and saw no ext3-related issues with Snapshot 4.

