Bug 746312

Summary:	kernel panic during LTP filesystem test run on ext3
Product:	Red Hat Enterprise Linux 6	Reporter:	Mike Gahagan <mgahagan>
Component:	kernel	Assignee:	Red Hat Kernel Manager <kernel-mgr>
Status:	CLOSED NOTABUG	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	high
Version:	6.2	CC:	dchinner, eguan, esandeen, jburke, jstancek, lczerner, pbunyan, rwheeler
Target Milestone:	rc	Keywords:	Regression
Target Release:	6.2
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-11-02 17:07:59 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Mike Gahagan 2011-10-14 18:14:19 UTC

Description of problem:
BUG: unable to handle kernel paging request at 000b77a6 
IP: [<c06041fb>] __list_add+0xb/0xb0 
*pdpt = 00000000338db001 *pde = 000000014de3c067  
Oops: 0000 [#1] SMP  
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.7/0000:03:00.0/0000:04:00.0/local_cpus 
Modules linked in: tun snd_seq_dummy bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
 
Pid: 28111, comm: awk Tainted: G    B   W  ----------------   2.6.32-207.el6.i686 #1 Intel Corporation SandyBridge Platform/LosLunas CRB 
EIP: 0060:[<c06041fb>] EFLAGS: 00010246 CPU: 4 
EIP is at __list_add+0xb/0xb0 
EAX: f4757cc0 EBX: 000b77a6 ECX: ebace5b0 EDX: 000b77a6 
ESI: f4757cb0 EDI: ebace5a4 EBP: f464e9b4 ESP: dea6de88 
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 
Process awk (pid: 28111, ti=dea6c000 task=d8e58570 task.ti=dea6c000) 
Stack: 
 00000000 00000000 00000000 00000246 00000000 f46ede3c c050be8f df6f5800 
<0> f464e9f0 ebace5a4 c050be8f df597d9c f464e9b4 c050c564 df597d9c f464ea18 
<0> df597e00 c0452b47 e3e87fb0 da3dbe40 df458740 00000001 df5973d8 df5973e4 
Call Trace: 
 [<c050be8f>] ? anon_vma_chain_link+0x2f/0x40 
 [<c050be8f>] ? anon_vma_chain_link+0x2f/0x40 
 [<c050c564>] ? anon_vma_fork+0x84/0xa0 
 [<c0452b47>] ? dup_mm+0x1c7/0x420 
 [<c045380a>] ? copy_process+0xa1a/0x1010 
 [<c05a219c>] ? security_file_alloc+0xc/0x10 
 [<c0453e7a>] ? do_fork+0x7a/0x3e0 
 [<c05336b9>] ? do_pipe_flags+0xb9/0x120 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c04082c3>] ? sys_clone+0x33/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 
Code: c7 44 24 04 33 00 00 00 c7 04 24 64 b9 98 c0 e8 fc 09 e5 ff 8b 44 24 14 8b 10 eb 89 8d 74 26 00 53 83 ec 24 8b 59 04 39 d3 75 15 <8b> 1a 39 d9 75 51 89 41 04 89 08 89 50 04 89 02 83 c4 24 5b c3  
EIP: [<c06041fb>] __list_add+0xb/0xb0 SS:ESP 0068:dea6de88 
CR2: 00000000000b77a6 


Version-Release number of selected component (if applicable):
Snapshot 2 (-207 kernel)

How reproducible:
First time seen; will attempt to reproduce.

Steps to Reproduce:
1. Run /kernel/distribution/ltp/generic with TESTARGS set to "RHEL6KT1LITE RHEL6FS RHEL6CGROUP RHELPTRACE" on a system with /mnt/testarea formatted as ext3. The panic appeared to occur during the RHEL6FS test phase. I'll try to narrow it down from here.
  
Actual results:
list corruption warnings in the form of:

------------[ cut here ]------------ 
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0() (Tainted: G    B   W  ----------------  ) 
Hardware name: SandyBridge Platform 
list_add corruption. next->prev should be prev (df98a59c), but was (null). (next=df9885fc). 
Modules linked in: tun snd_seq_dummy bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
Pid: 11295, comm: ps Tainted: G    B   W  ----------------   2.6.32-207.el6.i686 #1 
Call Trace: 
 [<c0454b41>] ? warn_slowpath_common+0x81/0xc0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0454c13>] ? warn_slowpath_fmt+0x33/0x40 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c053eb0a>] ? __d_instantiate+0x2a/0xd0 
 [<c053ebd9>] ? d_instantiate+0x29/0x50 
 [<c057dfce>] ? proc_lookup_de+0x7e/0xd0 
 [<c05788b9>] ? proc_root_lookup+0x19/0x50 
 [<c05367a2>] ? do_lookup+0x122/0x180 
 [<c0536eb3>] ? __link_path_walk+0x5e3/0xd60 
 [<c051cd40>] ? kmem_cache_alloc_notrace+0xa0/0xb0 
 [<c05adb32>] ? selinux_file_alloc_security+0x42/0xc0 
 [<c0537841>] ? path_walk+0x51/0xc0 
 [<c05379c9>] ? do_path_lookup+0x59/0x90 
 [<c053871c>] ? do_filp_open+0xdc/0xb00 
 [<c0505231>] ? handle_mm_fault+0x131/0x1d0 
 [<c0527fb8>] ? do_sys_open+0x58/0x130 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c052810c>] ? sys_open+0x2c/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 

Then we panic shortly after.

Expected results:

Run without a panic. This test set has completed without any oopses or panics since the nightly trees prior to beta.

Additional info:

Comment 2 Eric Sandeen 2011-10-14 20:46:31 UTC

Ok, so list corruption.

By the time of the oops, it was also tainted with:
  6: 'B' if a page-release function has found a bad page reference or
     some unexpected page flags.
...

 10: 'W' if a warning has previously been issued by the kernel.
     (Though some warnings may set more specific taint flags.)
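
As a rough illustration of reading the "Tainted:" field in the oops above, here is a minimal sketch. The names TAINT_MEANINGS and decode_taint are hypothetical, and the table is deliberately partial: it lists only the letters that appear in this report. The 'B' and 'W' meanings are the ones quoted above; the 'G' meaning comes from the kernel's taint documentation, not from this bug.

```python
# Partial map of taint letters; only the flags seen in this report are listed.
# (Hypothetical helper for illustration, not a kernel or Bugzilla tool.)
TAINT_MEANINGS = {
    "G": "all loaded modules have a GPL-compatible license",
    "B": "a page-release function found a bad page reference or unexpected page flags",
    "W": "a warning was previously issued by the kernel",
}


def decode_taint(field):
    """Return (letter, meaning) pairs for each known letter in a taint field."""
    return [(ch, TAINT_MEANINGS[ch]) for ch in field if ch in TAINT_MEANINGS]


# Taint field copied from the panic line: "Tainted: G    B   W  ----------------"
for letter, meaning in decode_taint("G    B   W  "):
    print(letter, "-", meaning)
```

Letters missing from the partial table are silently skipped, so this should not be treated as a complete decoder.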

This seems to be the first error encountered:

------------[ cut here ]------------ 
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0() (Not tainted) 
Hardware name: SandyBridge Platform 
list_add corruption. next->prev should be prev (df98a59c), but was df9885fc. (next=df9885fc). 
Modules linked in: sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
Pid: 28220, comm: ps Not tainted 2.6.32-207.el6.i686 #1 
Call Trace: 
 [<c0454b41>] ? warn_slowpath_common+0x81/0xc0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c0454c13>] ? warn_slowpath_fmt+0x33/0x40 
 [<c0604244>] ? __list_add+0x54/0xb0 
 [<c053eb0a>] ? __d_instantiate+0x2a/0xd0 
 [<c053ebd9>] ? d_instantiate+0x29/0x50 
 [<c057a84f>] ? proc_pident_instantiate+0x5f/0x90 
 [<c057a995>] ? proc_pident_lookup+0x75/0xb0 
 [<c057aa24>] ? proc_tgid_base_lookup+0x14/0x20 
 [<c05367a2>] ? do_lookup+0x122/0x180 
 [<c0536eb3>] ? __link_path_walk+0x5e3/0xd60 
 [<c051cd40>] ? kmem_cache_alloc_notrace+0xa0/0xb0 
 [<c05adb32>] ? selinux_file_alloc_security+0x42/0xc0 
 [<c0537841>] ? path_walk+0x51/0xc0 
 [<c05379c9>] ? do_path_lookup+0x59/0x90 
 [<c053871c>] ? do_filp_open+0xdc/0xb00 
 [<c052eb6e>] ? cp_new_stat64+0xee/0x100 
 [<c0527fb8>] ? do_sys_open+0x58/0x130 
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240 
 [<c052810c>] ? sys_open+0x2c/0x40 
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28 
---[ end trace 6a7cb877a54a826f ]--- 

so "ps" encountered list corruption somewhere in the proc filesystem guts ...
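
To make the warning text concrete, the following is a simplified Python model of the kind of consistency check that produces "next->prev should be prev". It is a sketch under assumptions, not the kernel's lib/list_debug.c code: ListHead and checked_list_add are hypothetical names, and the corruption is simulated by pointing a node's prev at itself, which mirrors the first warning where the reported next->prev value (df9885fc) equals next.

```python
class ListHead:
    """Minimal stand-in for a circular, doubly linked list node."""

    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def checked_list_add(new, prev, next_):
    """Insert new between prev and next_; return any consistency warnings."""
    warnings = []
    if next_.prev is not prev:
        warnings.append("next->prev should be prev (%s), but was %s"
                        % (prev.name, next_.prev.name))
    if prev.next is not next_:
        warnings.append("prev->next should be next (%s), but was %s"
                        % (next_.name, prev.next.name))
    # Like the debug check, warn but still perform the insertion.
    next_.prev = new
    new.next = next_
    new.prev = prev
    prev.next = new
    return warnings


head = ListHead("head")
a = ListHead("a")
healthy = checked_list_add(a, head, head.next)  # no warnings on a sane list

a.prev = a  # simulated corruption: the back-pointer now refers to the node itself
b = ListHead("b")
corrupted = checked_list_add(b, head, head.next)
print(healthy, corrupted)
```

The model only shows why a stray write to one pointer is enough to trip the check on the next insertion; it says nothing about what caused the stray write on this machine.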

Comment 6 Eric Sandeen 2011-10-17 19:07:17 UTC

list_add corruption. next->prev should be prev (df98a59c), but was df9885fc.
(next=df9885fc). 

should be df98a59c (11011111100110001010010110011100)
  but was df9885fc (11011111100110001000010111111100)

More than just a bit flip, I guess, but based on the other bug, please do test memory on this box.

Thanks,
-Eric
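
The comparison in comment 6 can be checked mechanically. This is a small sketch using the two values from the warning; the variable names are illustrative only.

```python
# Values taken from the list_add corruption warning.
expected = 0xdf98a59c  # what next->prev should have been (prev)
observed = 0xdf9885fc  # what next->prev actually was

# XOR leaves a 1 in every bit position where the two values disagree.
diff = expected ^ observed
differing_bits = [bit for bit in range(32) if (diff >> bit) & 1]

print(f"xor = {diff:#010x}")
print(f"bits that differ: {differing_bits}")
```

Three bit positions differ (5, 6 and 13, counting from the least significant bit), which agrees with the observation that this is more than a single bit flip.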

Comment 7 Mike Gahagan 2011-10-17 20:03:24 UTC

I installed memtest86+ on this box, but when I tried to boot into it I never got any output on the remote console, so I don't know whether memtest86+ failed to run or our remote console isn't passing anything back to the client for some reason.

I did notice the system was complaining a lot about single-bit errors, and it panicked on shutdown, so bad memory or some other hardware issue is highly likely here.

Comment 9 Mike Gahagan 2011-11-02 17:07:59 UTC

The system this happened on was having hardware issues and is now being repaired. I think we can safely close this bug, as I never saw this on anything else and saw no ext3-related issues with Snapshot 4.