Bug 746312

Summary: kernel panic during LTP filesystem test run on ext3
Product: Red Hat Enterprise Linux 6 Reporter: Mike Gahagan <mgahagan>
Component: kernel Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 6.2 CC: dchinner, eguan, esandeen, jburke, jstancek, lczerner, pbunyan, rwheeler
Target Milestone: rc Keywords: Regression
Target Release: 6.2   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-11-02 17:07:59 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Mike Gahagan 2011-10-14 18:14:19 UTC
Description of problem:
BUG: unable to handle kernel paging request at 000b77a6 
IP: [<c06041fb>] __list_add+0xb/0xb0 
*pdpt = 00000000338db001 *pde = 000000014de3c067  
Oops: 0000 [#1] SMP  
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.7/0000:03:00.0/0000:04:00.0/local_cpus 
Modules linked in: tun snd_seq_dummy bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
 
Pid: 28111, comm: awk Tainted: G    B   W  ----------------   2.6.32-207.el6.i686 #1 Intel Corporation SandyBridge Platform/LosLunas CRB 
EIP: 0060:[<c06041fb>] EFLAGS: 00010246 CPU: 4 
EIP is at __list_add+0xb/0xb0 
EAX: f4757cc0 EBX: 000b77a6 ECX: ebace5b0 EDX: 000b77a6 
ESI: f4757cb0 EDI: ebace5a4 EBP: f464e9b4 ESP: dea6de88 
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 
Process awk (pid: 28111, ti=dea6c000 task=d8e58570 task.ti=dea6c000) 
Stack: 
 00000000 00000000 00000000 00000246 00000000 f46ede3c c050be8f df6f5800 
<0> f464e9f0 ebace5a4 c050be8f df597d9c f464e9b4 c050c564 df597d9c f464ea18 
<0> df597e00 c0452b47 e3e87fb0 da3dbe40 df458740 00000001 df5973d8 df5973e4 
Call Trace: 
 [<c050be8f>] ? anon_vma_chain_link+0x2f/0x40 
 [<c050be8f>] ? anon_vma_chain_link+0x2f/0x40 
 [<c050c564>] ? anon_vma_fork+0x84/0xa0 
 [<c0452b47>] ? dup_mm+0x1c7/0x420 
 [<c045380a>] ? copy_process+0xa1a/0x1010 
 [<c05a219c>] ? security_file_alloc+0xc/0x10 
 [<c0453e7a>] ? do_fork+0x7a/0x3e0 
 [<c05336b9>] ? do_pipe_flags+0xb9/0x120 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c04082c3>] ? sys_clone+0x33/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 
Code: c7 44 24 04 33 00 00 00 c7 04 24 64 b9 98 c0 e8 fc 09 e5 ff 8b 44 24 14 8b 10 eb 89 8d 74 26 00 53 83 ec 24 8b 59 04 39 d3 75 15 <8b> 1a 39 d9 75 51 89 41 04 89 08 89 50 04 89 02 83 c4 24 5b c3  
EIP: [<c06041fb>] __list_add+0xb/0xb0 SS:ESP 0068:dea6de88 
CR2: 00000000000b77a6 


Version-Release number of selected component (if applicable):
Snapshot 2 (-207 kernel)

How reproducible:
First time seen; will attempt to reproduce.

Steps to Reproduce:
1. Run /kernel/distribution/ltp/generic with TESTARGS set to "RHEL6KT1LITE RHEL6FS RHEL6CGROUP RHELPTRACE" on a system with /mnt/testarea formatted as ext3. The panic appeared to occur during the RHEL6FS test phase. I'll try to narrow it down from here.
  
Actual results:
List corruption warnings of the form:

------------[ cut here ]------------ 
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0() (Tainted: G    B   W  ----------------  ) 
Hardware name: SandyBridge Platform 
list_add corruption. next->prev should be prev (df98a59c), but was (null). (next=df9885fc). 
Modules linked in: tun snd_seq_dummy bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
Pid: 11295, comm: ps Tainted: G    B   W  ----------------   2.6.32-207.el6.i686 #1 
Call Trace: 
 [<c0454b41>] ? warn_slowpath_common+0x81/0xc0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0454c13>] ? warn_slowpath_fmt+0x33/0x40 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c053eb0a>] ? __d_instantiate+0x2a/0xd0 
 [<c053ebd9>] ? d_instantiate+0x29/0x50 
 [<c057dfce>] ? proc_lookup_de+0x7e/0xd0 
 [<c05788b9>] ? proc_root_lookup+0x19/0x50 
 [<c05367a2>] ? do_lookup+0x122/0x180 
 [<c0536eb3>] ? __link_path_walk+0x5e3/0xd60 
 [<c051cd40>] ? kmem_cache_alloc_notrace+0xa0/0xb0 
 [<c05adb32>] ? selinux_file_alloc_security+0x42/0xc0 
 [<c0537841>] ? path_walk+0x51/0xc0 
 [<c05379c9>] ? do_path_lookup+0x59/0x90 
 [<c053871c>] ? do_filp_open+0xdc/0xb00 
 [<c0505231>] ? handle_mm_fault+0x131/0x1d0 
 [<c0527fb8>] ? do_sys_open+0x58/0x130 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c052810c>] ? sys_open+0x2c/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 

Then we panic shortly after.

Expected results:

The run completes without a panic. This test set has completed without any oopses or panics since the nightly trees prior to beta.

Additional info:

Comment 2 Eric Sandeen 2011-10-14 20:46:31 UTC
Ok, so list corruption.

By the time of the oops, it was also tainted with:
  6: 'B' if a page-release function has found a bad page reference or
     some unexpected page flags.
...

 10: 'W' if a warning has previously been issued by the kernel.
     (Though some warnings may set more specific taint flags.)
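
As a rough illustration of reading the taint field, here is a minimal Python sketch that pulls the flag letters out of a "Tainted:" string such as the one in the oops. The helper name decode_taint and the table TAINT_MEANINGS are hypothetical, made up for this example. The 'B' and 'W' descriptions are the ones quoted above; the 'G' description is my paraphrase of the kernel documentation. The kernel defines more flags than are listed here.

```python
# Hypothetical lookup table: only the flags that appear in this report.
TAINT_MEANINGS = {
    "G": "all loaded modules have a GPL-compatible license",
    "B": "a page-release function found a bad page reference "
         "or some unexpected page flags",
    "W": "a warning has previously been issued by the kernel",
}


def decode_taint(field):
    """Return (letter, meaning) pairs for each flag letter in a taint field.

    Spaces and '-' padding are skipped; letters not in the table above are
    reported as unknown to this sketch rather than guessed at.
    """
    flags = []
    for ch in field:
        if ch.isalpha():
            meaning = TAINT_MEANINGS.get(ch, "not described in this sketch")
            flags.append((ch, meaning))
    return flags


if __name__ == "__main__":
    # Taint field copied from the "Pid: 28111, comm: awk" line of the oops.
    for letter, meaning in decode_taint("G    B   W  ----------------"):
        print(letter, "-", meaning)
```

For the oops above this yields the letters G, B and W, in that order.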

This seems to be the first error encountered:

------------[ cut here ]------------ 
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0() (Not tainted) 
Hardware name: SandyBridge Platform 
list_add corruption. next->prev should be prev (df98a59c), but was df9885fc. (next=df9885fc). 
Modules linked in: sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
Pid: 28220, comm: ps Not tainted 2.6.32-207.el6.i686 #1 
Call Trace: 
 [<c0454b41>] ? warn_slowpath_common+0x81/0xc0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0454c13>] ? warn_slowpath_fmt+0x33/0x40 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c053eb0a>] ? __d_instantiate+0x2a/0xd0 
 [<c053ebd9>] ? d_instantiate+0x29/0x50 
 [<c057a84f>] ? proc_pident_instantiate+0x5f/0x90 
 [<c057a995>] ? proc_pident_lookup+0x75/0xb0 
 [<c057aa24>] ? proc_tgid_base_lookup+0x14/0x20 
 [<c05367a2>] ? do_lookup+0x122/0x180 
 [<c0536eb3>] ? __link_path_walk+0x5e3/0xd60 
 [<c051cd40>] ? kmem_cache_alloc_notrace+0xa0/0xb0 
 [<c05adb32>] ? selinux_file_alloc_security+0x42/0xc0 
 [<c0537841>] ? path_walk+0x51/0xc0 
 [<c05379c9>] ? do_path_lookup+0x59/0x90 
 [<c053871c>] ? do_filp_open+0xdc/0xb00 
 [<c052eb6e>] ? cp_new_stat64+0xee/0x100 
 [<c0527fb8>] ? do_sys_open+0x58/0x130 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c052810c>] ? sys_open+0x2c/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 
---[ end trace 6a7cb877a54a826f ]--- 

so "ps" encountered list corruption somewhere in the proc filesystem guts ...
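
For readers unfamiliar with the check that produced this warning, the following is a simplified Python model of a list_add consistency check, not the kernel's C code. The class ListHead and the function checked_list_add are hypothetical names for this sketch. The idea is that before linking a new entry between prev and next, the neighbours must still point at each other. In the first warning above, the reported next->prev value is the same address as next itself (df9885fc), and the demo reproduces that shape.

```python
class ListHead:
    """Simplified stand-in for a doubly linked list node."""

    def __init__(self, name):
        self.name = name
        # A freshly initialised node points at itself in both directions.
        self.next = self
        self.prev = self


def checked_list_add(new, prev, nxt):
    """Link `new` between `prev` and `nxt`; return any consistency warnings."""
    warnings = []
    if nxt.prev is not prev:
        warnings.append(
            "list_add corruption. next->prev should be prev (%s), "
            "but was %s. (next=%s)." % (prev.name, nxt.prev.name, nxt.name)
        )
    if prev.next is not nxt:
        warnings.append(
            "list_add corruption. prev->next should be next (%s), "
            "but was %s. (prev=%s)." % (nxt.name, prev.next.name, prev.name)
        )
    # The insertion proceeds even after a warning in this model.
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new
    return warnings


if __name__ == "__main__":
    head, a, b = ListHead("head"), ListHead("a"), ListHead("b")
    checked_list_add(a, head, head.next)          # healthy: head <-> a
    a.prev = a                                    # simulate corruption
    for line in checked_list_add(b, head, head.next):
        print(line)
```

In the demo, overwriting a.prev so that it points back at a triggers the first check on the next insertion, which is the same kind of message that "ps" hit here.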

Comment 6 Eric Sandeen 2011-10-17 19:07:17 UTC
list_add corruption. next->prev should be prev (df98a59c), but was df9885fc.
(next=df9885fc). 

should be df98a59c (11011111100110001010010110011100)
  but was df9885fc (11011111100110001000010111111100)

More than just a bit flip, I guess, but based on the other bug, please do test memory on this box.
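
The comparison above can be checked by XOR-ing the two values. This is a small Python sketch; the function name differing_bits is hypothetical and the two constants are copied from the warning.

```python
EXPECTED = 0xdf98a59c  # "should be prev"
ACTUAL = 0xdf9885fc    # "but was"


def differing_bits(a, b):
    """Return the positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


if __name__ == "__main__":
    print("xor  = 0x%08x" % (EXPECTED ^ ACTUAL))          # xor  = 0x00002060
    print("bits =", differing_bits(EXPECTED, ACTUAL))     # bits = [5, 6, 13]
```

The two values differ in three bit positions (5, 6 and 13), which matches the observation that this is more than a single bit flip.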

Thanks,
-Eric

Comment 7 Mike Gahagan 2011-10-17 20:03:24 UTC
I installed memtest86+ on this box, but when I tried to boot into it I never got any output on the remote console, so I don't know whether memtest86+ failed to run or our remote console isn't passing anything back to the client.

I did notice the system was complaining a lot about single-bit errors, and it panicked on shutdown, so bad memory or some other hardware issue is highly likely here.

Comment 9 Mike Gahagan 2011-11-02 17:07:59 UTC
The system this happened on was having hardware issues and is now being repaired. I think we can safely close this bug, as I never saw this on any other system and saw no ext3-related issues with Snapshot 4.