Description of problem:

BUG: unable to handle kernel paging request at 000b77a6
IP: [<c06041fb>] __list_add+0xb/0xb0
*pdpt = 00000000338db001 *pde = 000000014de3c067
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.7/0000:03:00.0/0000:04:00.0/local_cpus
Modules linked in: tun snd_seq_dummy bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 28111, comm: awk Tainted: G B W ---------------- 2.6.32-207.el6.i686 #1 Intel Corporation SandyBridge Platform/LosLunas CRB
EIP: 0060:[<c06041fb>] EFLAGS: 00010246 CPU: 4
EIP is at __list_add+0xb/0xb0
EAX: f4757cc0 EBX: 000b77a6 ECX: ebace5b0 EDX: 000b77a6
ESI: f4757cb0 EDI: ebace5a4 EBP: f464e9b4 ESP: dea6de88
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process awk (pid: 28111, ti=dea6c000 task=d8e58570 task.ti=dea6c000)
Stack:
 00000000 00000000 00000000 00000246 00000000 f46ede3c c050be8f df6f5800
<0> f464e9f0 ebace5a4 c050be8f df597d9c f464e9b4 c050c564 df597d9c f464ea18
<0> df597e00 c0452b47 e3e87fb0 da3dbe40 df458740 00000001 df5973d8 df5973e4
Call Trace:
 [<c050be8f>] ? anon_vma_chain_link+0x2f/0x40
 [<c050be8f>] ? anon_vma_chain_link+0x2f/0x40
 [<c050c564>] ? anon_vma_fork+0x84/0xa0
 [<c0452b47>] ? dup_mm+0x1c7/0x420
 [<c045380a>] ? copy_process+0xa1a/0x1010
 [<c05a219c>] ? security_file_alloc+0xc/0x10
 [<c0453e7a>] ? do_fork+0x7a/0x3e0
 [<c05336b9>] ? do_pipe_flags+0xb9/0x120
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240
 [<c04082c3>] ? sys_clone+0x33/0x40
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28
Code: c7 44 24 04 33 00 00 00 c7 04 24 64 b9 98 c0 e8 fc 09 e5 ff 8b 44 24 14 8b 10 eb 89 8d 74 26 00 53 83 ec 24 8b 59 04 39 d3 75 15 <8b> 1a 39 d9 75 51 89 41 04 89 08 89 50 04 89 02 83 c4 24 5b c3
EIP: [<c06041fb>] __list_add+0xb/0xb0 SS:ESP 0068:dea6de88
CR2: 00000000000b77a6

Version-Release number of selected component (if applicable):
Snapshot 2 (-207 kernel)

How reproducible:
First time seen; will attempt to reproduce again.

Steps to Reproduce:
1. Run /kernel/distribution/ltp/generic with TESTARGS set to "RHEL6KT1LITE RHEL6FS RHEL6CGROUP RHELPTRACE" on a system with /mnt/testarea formatted as ext3. The panic appeared to occur during the RHEL6FS test phase. I'll try to narrow it down from here.

Actual results:
List corruption warnings in the form of:

------------[ cut here ]------------
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0() (Tainted: G B W ---------------- )
Hardware name: SandyBridge Platform
list_add corruption. next->prev should be prev (df98a59c), but was (null). (next=df9885fc).
Modules linked in: tun snd_seq_dummy bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 11295, comm: ps Tainted: G B W ---------------- 2.6.32-207.el6.i686 #1
Call Trace:
 [<c0454b41>] ? warn_slowpath_common+0x81/0xc0
 [<c0604244>] ? __list_add+0x54/0xb0
 [<c0604244>] ? __list_add+0x54/0xb0
 [<c0454c13>] ? warn_slowpath_fmt+0x33/0x40
 [<c0604244>] ? __list_add+0x54/0xb0
 [<c053eb0a>] ? __d_instantiate+0x2a/0xd0
 [<c053ebd9>] ? d_instantiate+0x29/0x50
 [<c057dfce>] ? proc_lookup_de+0x7e/0xd0
 [<c05788b9>] ? proc_root_lookup+0x19/0x50
 [<c05367a2>] ? do_lookup+0x122/0x180
 [<c0536eb3>] ? __link_path_walk+0x5e3/0xd60
 [<c051cd40>] ? kmem_cache_alloc_notrace+0xa0/0xb0
 [<c05adb32>] ? selinux_file_alloc_security+0x42/0xc0
 [<c0537841>] ? path_walk+0x51/0xc0
 [<c05379c9>] ? do_path_lookup+0x59/0x90
 [<c053871c>] ? do_filp_open+0xdc/0xb00
 [<c0505231>] ? handle_mm_fault+0x131/0x1d0
 [<c0527fb8>] ? do_sys_open+0x58/0x130
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240
 [<c052810c>] ? sys_open+0x2c/0x40
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28

Then we panic shortly after.

Expected results:
Run without panic. This test set has completed without any oopses or panics since the nightly trees prior to beta.

Additional info:
Ok, so list corruption. By the time of the oops, it was also tainted with:

  6: 'B' if a page-release function has found a bad page reference or some unexpected page flags.
  ...
 10: 'W' if a warning has previously been issued by the kernel. (Though some warnings may set more specific taint flags.)

This seems to be the first error encountered:

------------[ cut here ]------------
WARNING: at lib/list_debug.c:26 __list_add+0x54/0xb0() (Not tainted)
Hardware name: SandyBridge Platform
list_add corruption. next->prev should be prev (df98a59c), but was df9885fc. (next=df9885fc).
Modules linked in: sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000e ext3 jbd mbcache firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 28220, comm: ps Not tainted 2.6.32-207.el6.i686 #1
Call Trace:
 [<c0454b41>] ? warn_slowpath_common+0x81/0xc0
 [<c0604244>] ? __list_add+0x54/0xb0
 [<c0604244>] ? __list_add+0x54/0xb0
 [<c0454c13>] ? warn_slowpath_fmt+0x33/0x40
 [<c0604244>] ? __list_add+0x54/0xb0
 [<c053eb0a>] ? __d_instantiate+0x2a/0xd0
 [<c053ebd9>] ? d_instantiate+0x29/0x50
 [<c057a84f>] ? proc_pident_instantiate+0x5f/0x90
 [<c057a995>] ? proc_pident_lookup+0x75/0xb0
 [<c057aa24>] ? proc_tgid_base_lookup+0x14/0x20
 [<c05367a2>] ? do_lookup+0x122/0x180
 [<c0536eb3>] ? __link_path_walk+0x5e3/0xd60
 [<c051cd40>] ? kmem_cache_alloc_notrace+0xa0/0xb0
 [<c05adb32>] ? selinux_file_alloc_security+0x42/0xc0
 [<c0537841>] ? path_walk+0x51/0xc0
 [<c05379c9>] ? do_path_lookup+0x59/0x90
 [<c053871c>] ? do_filp_open+0xdc/0xb00
 [<c052eb6e>] ? cp_new_stat64+0xee/0x100
 [<c0527fb8>] ? do_sys_open+0x58/0x130
 [<c04afb0c>] ? audit_syscall_entry+0x21c/0x240
 [<c052810c>] ? sys_open+0x2c/0x40
 [<c0409a9f>] ? sysenter_do_call+0x12/0x28
---[ end trace 6a7cb877a54a826f ]---

So "ps" encountered list corruption somewhere in the proc filesystem guts ...
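For readers unfamiliar with the warning text, here is a rough userspace sketch of the kind of sanity check that produces it. It is written in Python rather than kernel C, and ListHead and checked_list_add are hypothetical names used only for illustration; the real check is in lib/list_debug.c and reports through WARN(). The idea is that, before linking a new entry between prev and next, the two neighbours must still point at each other.

```python
class ListHead:
    """Userspace stand-in for a doubly linked list node (hypothetical)."""

    def __init__(self, name):
        self.name = name
        # A freshly initialized node points at itself in both directions.
        self.next = self
        self.prev = self


def checked_list_add(new, prev, next_):
    """Insert `new` between `prev` and `next_`, returning any warnings.

    Mirrors the shape of the message in the log; the wording of the
    second message is an assumption modelled on the first.
    """
    warnings = []
    if next_.prev is not prev:
        warnings.append(
            f"list_add corruption. next->prev should be prev ({prev.name}), "
            f"but was {next_.prev.name}. (next={next_.name})."
        )
    if prev.next is not next_:
        warnings.append(
            f"list_add corruption. prev->next should be next ({next_.name}), "
            f"but was {prev.next.name}. (prev={prev.name})."
        )
    # The insertion proceeds even after a warning, which is why a
    # corrupted list can go on to cause a later oops.
    next_.prev = new
    new.next = next_
    new.prev = prev
    prev.next = new
    return warnings


head, a, b = ListHead("head"), ListHead("a"), ListHead("b")
print(checked_list_add(a, head, head))  # healthy list: no warnings
a.prev = a                              # simulate a stray write to a.prev
print(checked_list_add(b, head, a))     # one warning about next->prev
```

In the simulated corruption, a.prev is made to point back at a itself, which matches the shape of the first warning in the log, where the reported value equals next. That is only an observation about the message, not a diagnosis of the cause.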
list_add corruption. next->prev should be prev (df98a59c), but was df9885fc. (next=df9885fc).

should be df98a59c (11011111100110001010010110011100)
but was   df9885fc (11011111100110001000010111111100)

More than just a bit flip, I guess, but based on the other bug, please do test memory on this box.

Thanks,
-Eric
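The comparison of the two pointer values can be checked mechanically. The following is a small sketch (differing_bits is a hypothetical helper name) that XORs the expected and observed values from the warning and lists the bit positions where they differ, with bit 0 as the least significant bit.

```python
def differing_bits(expected, actual):
    """Return the bit positions (LSB = 0) at which two integers differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


expected = 0xdf98a59c  # value next->prev should have held, per the warning
actual = 0xdf9885fc    # value actually found, per the warning

bits = differing_bits(expected, actual)
print(f"xor = {expected ^ actual:#010x}, differing bits = {bits}")
# prints: xor = 0x00002060, differing bits = [5, 6, 13]
```

Three bits differ, which is consistent with the remark that this is more than a single bit flip.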
I installed memtest86+ on this box, but when I tried to boot into it, I never got any output on the remote console, so I don't know whether memtest86+ failed to run or our remote console isn't passing anything back to the client for some reason. I did notice the system was complaining a lot about single-bit errors, and it panicked on shutdown, so bad memory or some other hardware issue is highly likely here.
The system this happened on was having hardware issues and is now being repaired. I think we can safely close this bug, as I never saw this on anything else and saw no ext3-related issues with Snapshot 4.