Bug 620504

Summary: during longevity test run hit paging request BUG assertion
Product: Red Hat Enterprise Linux 6
Reporter: Mike Gahagan <mgahagan>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED CURRENTRELEASE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 6.0
CC: songhai.yu
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-04-04 14:07:32 UTC
Attachments: Complete console log

Description Mike Gahagan 2010-08-02 17:46:38 UTC
Description of problem:
BUG: unable to handle kernel paging request at ffffeba400000000
IP: [<ffffffff8116bd7e>] free_block+0x9e/0x230
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu63/cache/index2/shared_cpu_map
CPU 8 
Modules linked in: nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs nls_koi8_u cryptd aes_x86_64 aes_generic autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core igb dca ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif ahci dm_mod [last unloaded: rmd128]

Modules linked in: nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs nls_koi8_u cryptd aes_x86_64 aes_generic autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core igb dca ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif ahci dm_mod [last unloaded: rmd128]
Pid: 28, comm: ksoftirqd/8 Not tainted 2.6.32-54.el6.x86_64.debug #1 Sunrise Ridge
RIP: 0010:[<ffffffff8116bd7e>]  [<ffffffff8116bd7e>] free_block+0x9e/0x230
RSP: 0018:ffff88002fa03da8  EFLAGS: 00010086
RAX: ffffeba400000000 RBX: ffff88057b0f0100 RCX: 0000000000000008
RDX: ffffea0000000000 RSI: ffff8801453d20c0 RDI: 0000000000000000
RBP: ffff88002fa03df8 R08: ffff8802777a3740 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: ffff880236994000
R13: ffff880276a40760 R14: 0000000000000006 R15: 000000000000101a
FS:  0000000000000000(0000) GS:ffff88002fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffffeba400000000 CR3: 000000047343f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksoftirqd/8 (pid: 28, threadinfo ffff880276a84000, task ffff880276a80740)
Stack:
 ffff8802777a3798 0000001000000000 0000000000000000 ffff8802777a3740
<0> ffff88002fa03df8 0000000000000010 ffff880276a406e0 ffff8802777a3740
<0> ffff88057b0f0100 ffff880276a40730 ffff88002fa03e58 ffffffff8116c205
Call Trace:
 <IRQ> 
 [<ffffffff8116c205>] cache_flusharray+0x95/0x180
 [<ffffffff8116bb06>] kmem_cache_free+0x256/0x2b0
 [<ffffffff8118746d>] file_free_rcu+0x4d/0x70
 [<ffffffff810effcd>] __rcu_process_callbacks+0x12d/0x3e0
 [<ffffffff810f02ab>] rcu_process_callbacks+0x2b/0x50
 [<ffffffff81077135>] __do_softirq+0xd5/0x220
 [<ffffffff810143cc>] call_softirq+0x1c/0x30
 <EOI> 
 [<ffffffff810160cd>] ? do_softirq+0xad/0xe0
 [<ffffffff81076a70>] ksoftirqd+0x80/0x120
 [<ffffffff810769f0>] ? ksoftirqd+0x0/0x120
 [<ffffffff81096646>] kthread+0x96/0xa0
 [<ffffffff810142ca>] child_rip+0xa/0x20
 [<ffffffff81013c10>] ? restore_args+0x0/0x30
 [<ffffffff810965b0>] ? kthread+0x0/0xa0
 [<ffffffff810142c0>] ? child_rip+0x0/0x20
Code: 89 c7 48 89 45 c0 e8 a2 ee ed ff 48 c1 e8 0c 48 8d 14 c5 00 00 00 00 48 c1 e0 06 48 29 d0 48 ba 00 00 00 00 00 ea ff ff 48 01 d0 <48> 8b 10 66 85 d2 0f 88 23 01 00 00 84 d2 0f 89 70 01 00 00 4c 
RIP  [<ffffffff8116bd7e>] free_block+0x9e/0x230
 RSP <ffff88002fa03da8>
CR2: ffffeba400000000
---[ end trace 0b9a0d246f57ca69 ]---

see attached log for the rest of the kernel trace.

Version-Release number of selected component (if applicable):
0722.0 tree running the -54.x86_64.debug kernel.

How reproducible:
So far only once

Steps to Reproduce:
1. Build and install LTP, then run ltpstress.sh as follows:
./ltpstress.sh  -m 22000 -t 96  # use 22GB of RAM, run for 96hrs.
Actual results:

see above and attached file for complete log

Expected results:

test completes with no panic

Additional info:

Failure seems to have occurred sometime after the first 24 hours of operation.

Comment 1 Mike Gahagan 2010-08-02 17:48:00 UTC
Created attachment 436079 [details]
Complete console log

Comment 2 Mike Gahagan 2010-08-02 19:33:17 UTC
starting another run with the non-debug -54 kernel.

Comment 5 Larry Woodman 2010-08-05 20:09:10 UTC
If possible we need to get a crash dump when this happens.  Evidently there is corruption in the slab cache, because we are crashing in free_block() while dereferencing a kmem_list.


Larry

Comment 6 Mike Gahagan 2010-08-05 20:45:41 UTC
I'll try with the debug kernel again and hopefully get a crash dump this time.

By the way, the run I started on Monday with the non-debug kernel is nearly finished. It should finish tonight/early tomorrow morning. So far I haven't seen any issues.

Comment 7 Mike Gahagan 2010-08-06 19:09:52 UTC
I'm running into some issues getting a crash dump out of the debug kernel that look like bz 612244, so it may be hard to get a crash dump unless I can find a workaround. I've sent mail to Jason B to follow up.
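For reference, a minimal kdump setup on RHEL 6 looks roughly like the following. The crashkernel size and makedumpfile options are typical examples, not values verified against this machine.

```shell
# Reserve crash-kernel memory: append to the kernel line in
# /boot/grub/grub.conf (crashkernel=auto also works on RHEL 6):
#   crashkernel=128M

# /etc/kdump.conf: where to write the dump and how to compress it:
#   path /var/crash
#   core_collector makedumpfile -c --message-level 1 -d 31

# Enable and start the service, then reboot so the reservation takes effect
chkconfig kdump on
service kdump start
```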

Comment 9 RHEL Program Management 2011-01-07 04:08:26 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 10 Suzanne Logcher 2011-01-07 16:17:04 UTC
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 11 Larry Woodman 2011-01-13 15:47:41 UTC
Does this problem still happen with the latest 6.1 kernel?  We removed some buggy code from the slab debugging path that looks like it was in this area.

Either way, I cannot reproduce this problem, and we never got a dump, so I can't make any progress on this BZ until we can get more data.

Larry

Comment 12 Mike Gahagan 2011-01-13 16:05:40 UTC
I don't recall ever hitting this on any recent RHEL 6.0 kernel; I think it only occurred once. We have not yet done a longevity test run with any of the 6.1 kernels; usually we wait until close to the end of the testing phase, but in light of this bug we'll run it a bit earlier.

Comment 14 RHEL Program Management 2011-02-01 05:41:16 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 15 RHEL Program Management 2011-02-01 18:31:59 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 16 RHEL Program Management 2011-04-04 02:21:49 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 17 Mike Gahagan 2011-04-04 14:07:32 UTC
I've run the longevity test on x86_64 with the beta kernel and it finished without issues. I also ran it on s390x and observed one panic, related to NFS, that was filed as a separate bz, so I think this one can be closed.