521055 – Fedora12 Alpha: Badness at lib/debugobjects.c

Bug 521055 - Fedora12 Alpha: Badness at lib/debugobjects.c

Summary: Fedora12 Alpha: Badness at lib/debugobjects.c

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	12
Hardware:	ppc64
OS:	All
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-09-03 11:40 UTC by IBM Bug Proxy
Modified:	2010-12-05 06:29 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-12-05 06:29:47 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Test case KernelTtyTest.c (5.19 KB, text/plain) 2009-09-03 11:40 UTC, IBM Bug Proxy	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	55755	0	None	None	None	Never

Description IBM Bug Proxy 2009-09-03 11:40:45 UTC

=Comment: #0=================================================
PAVAN NAREGUNDI <pavan.naregundi.com> - 

Test case KernelTtyTest.c

Ran the attached testcase(KernelTtyTest.c) on Fedora12 Alpha running on power5. Test case reported a
Badness at lib/debugobjects.c:254 as below and system droped to xmon

=================
------------[ cut here ]------------
Badness at lib/debugobjects.c:254
NIP: c000000000368fe0 LR: c000000000368fdc CTR: 80000000000d5d74
REGS: c00000002f91adc0 TRAP: 0700   Tainted: G      D     (2.6.31-0.125.4.2.rc5.git2.fc12.ppc64)
MSR: 8000000000029032 <EE,ME,CE,IR,DR>  CR: 28008482  XER: 0000000d
TASK = c000000034e9cfe0[7850] 'KernelTtyTest' THREAD: c00000002f918000 CPU: 0
GPR00: c000000000368fdc c00000002f91b040 c00000000127e930 0000000000000032 
GPR04: c0000000006eeb94 c0000000000bd100 0000000000000000 0000000000000002 
GPR08: 0000000000000000 c000000034e9cfe0 000000000a000000 0000000000000001 
GPR12: 0000000000004000 c0000000012d8300 c000000034ad44b8 c000000034ad4478 
GPR16: c000000034ad5550 0000000000000000 c000000034ad4398 0000000000000000 
GPR20: c0000000334209a0 c000000001c910d8 c000000034ad5120 c000000001c910e0 
GPR24: c000000034ad4120 c00000002f91b130 0000000000000002 c000000034ad4000 
GPR28: 0000000000100100 0000000000200200 c00000000120d950 c00000002f91b040 
NIP [c000000000368fe0] .debug_print_object+0x70/0x9c
LR [c000000000368fdc] .debug_print_object+0x6c/0x9c
Call Trace:
[c00000002f91b040] [c000000000368fdc] .debug_print_object+0x6c/0x9c (unreliable)
[c00000002f91b0c0] [c0000000003696e8] .debug_check_no_obj_freed+0xe4/0x230
[c00000002f91b1c0] [c0000000001c0f44] .kfree+0x150/0x200
[c00000002f91b270] [c0000000004593a8] .free_tty_struct+0x44/0x60
[c00000002f91b300] [c000000000459498] .release_one_tty+0xd4/0xf8
[c00000002f91b3a0] [c00000000035b0e0] .kref_put+0x78/0x9c
[c00000002f91b420] [c000000000455b38] .tty_kref_put+0x40/0x60
[c00000002f91b4b0] [c000000000455df8] .release_tty+0x60/0x7c
[c00000002f91b540] [c000000000459290] .tty_release_dev+0x4d8/0x554
[c00000002f91b6a0] [c000000000459340] .tty_release+0x34/0x58
[c00000002f91b730] [c0000000001d1e80] .__fput+0x1ac/0x2d4
[c00000002f91b7f0] [c0000000001d1ff4] .fput+0x4c/0x60
[c00000002f91b880] [c0000000001cd2c8] .filp_close+0xc8/0xf4
[c00000002f91b920] [c0000000000bfa5c] .put_files_struct+0xcc/0x15c
[c00000002f91b9d0] [c0000000000bfb50] .exit_files+0x64/0x84
[c00000002f91ba70] [c0000000000c1d58] .do_exit+0x2a4/0x824
[c00000002f91bb50] [c0000000000c237c] .do_group_exit+0xa4/0xd8
[c00000002f91bbf0] [c0000000000d4bb4] .get_signal_to_deliver+0x42c/0x490
[c00000002f91bce0] [c000000000015cd4] .do_signal+0x7c/0x350
[c00000002f91be30] [c000000000008c58] do_work+0x24/0x28
Instruction dump:
7fab0000 41be0030 38090001 e93e8078 e87e8070 90080008 800a0010 e8cb0000 
78001f24 7ca9002a 4838b755 60000000 <0fe00000> e93e8000 8169000c 380b0001 
Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6b73
Faulting instruction address: 0xc0000000000da510
cpu 0x3: Vector: 300 (Data Access) at [c000000039443ae0]
    pc: c0000000000da510: .worker_thread+0x208/0x3dc
    lr: c0000000000da4e8: .worker_thread+0x1e0/0x3dc
    sp: c000000039443d60
   msr: 8000000000009032
   dar: 6b6b6b6b6b6b6b73
 dsisr: 42000000
  current = 0xc00000003943a7f0
  paca    = 0xc0000000012d8900
    pid   = 60, comm = events/3
enter ? for help
3:mon> t
[c000000039443e90] c0000000000e16ec .kthread+0xac/0xb8
[c000000039443f90] c0000000000313b8 .kernel_thread+0x54/0x70
3:mon> r
R00 = 0000000000000000   R16 = 000000000032e000
R01 = c000000039443d60   R17 = 000000000263ebf0
R02 = c00000000127e930   R18 = c00000003943a7f0
R03 = c000000039406668   R19 = c000000070004a40
R04 = c0000000006eeb94   R20 = c000000034ad4358
R05 = c00000000016069c   R21 = c0000000394066b8
R06 = 0000000000000000   R22 = c0000000394066a8
R07 = 0000000000000002   R23 = c000000039440000
R08 = 0000000000000001   R24 = c000000039443dd0
R09 = c000000034ad4358   R25 = 0000000000000001
R10 = c0000000394066a8   R26 = c000000039443df8
R11 = 6b6b6b6b6b6b6b6b   R27 = 6b6b6b6b6b6b6b6b
R12 = 0000000024000044   R28 = c000000034ad4350
R13 = c0000000012d8900   R29 = c000000039406668
R14 = 0000000003f00000   R30 = c0000000011fa708
R15 = 0000000001b5faa8   R31 = c000000039443d60
pc  = c0000000000da510 .worker_thread+0x208/0x3dc
lr  = c0000000000da4e8 .worker_thread+0x1e0/0x3dc
msr = 8000000000009032   cr  = 24000042
ctr = c00000000016057c   xer = 0000000020000001   trap =  300
dar = 6b6b6b6b6b6b6b73   dsisr = 42000000
3:mon> di %pc
c0000000000da510  f94b0008	std	r10,8(r11)
c0000000000da514  f9290000	std	r9,0(r9)
c0000000000da518  f9290008	std	r9,8(r9)
c0000000000da51c  48614595	bl	c0000000006eeab0	# ._spin_unlock_irq+0x0/0x78
c0000000000da520  60000000	nop
c0000000000da524  e814fff8	ld	r0,-8(r20)
c0000000000da528  78000764	rldicr	r0,r0,0,61
c0000000000da52c  7fa00278	xor	r0,r29,r0
c0000000000da530  3120ffff	addic	r9,r0,-1
c0000000000da534  7c090110	subfe	r0,r9,r0
c0000000000da538  0b000000	tdnei	r0,0
c0000000000da53c  7c00e0a8	ldarx	r0,0,r28
c0000000000da540  7c00c878	andc	r0,r0,r25
c0000000000da544  7c00e1ad	stdcx.	r0,0,r28
c0000000000da548  40a2fff4	bne	c0000000000da53c	# .worker_thread+0x234/0x3dc
c0000000000da54c  e87d00a8	ld	r3,168(r29)
3:mon> 
========================

Kernel: 2.6.31-0.125.4.2.rc5.git2.fc12.ppc64
CPU: power5
model : IBM,9113-550

Steps:
1. Compile the test case
            gcc -o KernelTtyTest KernelTtyTest.c -Wall -O2 -lutil
2. Run the test case in a while loop
            while :; do ./KernelTtyTest ; done

Test will generate failures as below. Ignore the failures and within a 5 min of the test run system
will drop to xmon
---------------------------
      Aborted
read failed: Input/output error
lastbytes: '\r\nprocess_cmd read failed: Interrupted system call\r\n'
------------------------------
=Comment: #2=================================================
SHARYATHI NAGESH <sharyath.com> - 
Pavan
  This is what I could debug:

   The problem is observed because the list_head of work_struct data structure contains garbage
value (6b6b..) which intern is the string (poison value) written by the slub/slab free routines,
when CONFIG_SLUB_DUBUG is enabled.
  
     I was not able to determine the exact race condition (kernel thread) which is freeing the
work_struct data structure before it is done by run_workqueue() function. 

This is what analysis I have done till now 

------------------------------------------------------------------
Disassembled instructions
c0000000000da4f8  e93d0040	ld	r9,64(r29)
c0000000000da4fc  fb9d00a0	std	r28,160(r29)
c0000000000da500  7fa3eb78	mr	r3,r29
c0000000000da504  e9490008	ld	r10,8(r9)
3:mon> 
c0000000000da508  e9690000	ld	r11,0(r9)
c0000000000da50c  f96a0000	std	r11,0(r10)
c0000000000da510  f94b0008	std	r10,8(r11)         <== This is where it seg faults
c0000000000da514  f9290000	std	r9,0(r9)
c0000000000da518  f9290008	std	r9,8(r9)
c0000000000da51c  48614595	bl	c0000000006eeab0	# ._spin_unlock_irq+0x0/0x78
------------------------------------------------------------------
register contents
3:mon> r
R00 = 0000000000000000   R16 = 000000000032e000
R01 = c0000000393dfd60   R17 = 000000000263ebf0
R02 = c00000000127e930   R18 = c0000000393d27f0
R03 = c0000000392db710   R19 = c000000070004a40
R04 = c0000000006eeb94   R20 = c0000000099f63e8
R05 = c00000000016069c   R21 = c0000000392db760
R06 = 0000000000000000   R22 = c0000000392db750
R07 = 0000000000000002   R23 = c0000000393dc000
R08 = 0000000000000001   R24 = c0000000393dfdd0
R09 = c0000000099f63e8   R25 = 0000000000000001
R10 = 6b6b6b6b6b6b6b6b   R26 = c0000000393dfdf8
R11 = 6b6b6b6b6b6b6b6b   R27 = 6b6b6b6b6b6b6b6b
R12 = 0000000000000000   R28 = c0000000099f63e0
R13 = c0000000012d8900   R29 = c0000000392db710
R14 = 0000000003f00000   R30 = c0000000011fa708
R15 = 0000000001b5faa8   R31 = c0000000393dfd60
pc  = c0000000000da50c .worker_thread+0x204/0x3dc
lr  = c0000000000da4e8 .worker_thread+0x1e0/0x3dc
msr = 8000000000009032   cr  = 24000042
ctr = c00000000016057c   xer = 0000000020000001   trap =  300
dar = 6b6b6b6b6b6b6b6b   dsisr = 42000000

You can observe R10 and R11 which holds previous and next pointers in the list head respectively are
poisoned by slub free() routine
------------------------------------------------------------------
objdump results mapped the assembly instructions to these lines of C code

               cwq->current_work = work;
    1840:       fb 9d 00 a0     std     r28,160(r29)
/root/rpmbuild/BUILD/kernel-2.6.30/linux-2.6.30.ppc64/kernel/workqueue.c:285    
                                                                                 <== From
run_workqueue() function
                list_del_init(cwq->worklist.next);
                spin_unlock_irq(&cwq->lock);
    1844:       7f a3 eb 78     mr      r3,r29
list_del_init():
/root/rpmbuild/BUILD/kernel-2.6.30/linux-2.6.30.ppc64/include/linux/list.h:142
 * list_del_init - deletes entry from list and reinitialize it.
 * @entry: the element to delete from the list.
 */
static inline void list_del_init(struct list_head *entry)
{
        __list_del(entry->prev, entry->next);
    1848:       e9 49 00 08     ld      r10,8(r9)
    184c:       e9 69 00 00     ld      r11,0(r9)
__list_del():
/root/rpmbuild/BUILD/kernel-2.6.30/linux-2.6.30.ppc64/include/linux/list.h:93
 * the prev/next entries already!
 */
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
        next->prev = prev;
        prev->next = next;
    1850:       f9 6a 00 00     std     r11,0(r10)
/root/rpmbuild/BUILD/kernel-2.6.30/linux-2.6.30.ppc64/include/linux/list.h:92
 * This is only for internal list manipulation where we know
 * the prev/next entries already!
 */
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
        next->prev = prev;
    1854:       f9 4b 00 08     std     r10,8(r11)
                                         <== This is where it breaks 
------------------------------------------------------------------
Still working on understanding the actual cause for the race / corruption of work_struct data structure
Thanks

=Comment: #5=================================================
SHARYATHI NAGESH <sharyath.com> - 
Pavan 
  Can you update the bug with your efforts to recreate the issue with mainline kernel
Thanks

Comment 1 IBM Bug Proxy 2009-09-03 11:40:57 UTC

Created attachment 359668 [details]
Test case KernelTtyTest.c

Comment 2 IBM Bug Proxy 2009-09-04 07:00:54 UTC

------- Comment From pavan.naregundi.com 2009-09-04 02:55 EDT-------
(In reply to comment #5)
> Pavan
>   Can you update the bug with your efforts to recreate the issue with mainline
> kernel
> Thanks
>

Could not reproduce the issue with kernel-2.6.31-rc8-git2 on a similar machine with the config file of F12alpha.

Comment 3 Chuck Ebbert 2009-09-08 02:18:57 UTC

(In reply to comment #2)
> Could not reproduce the issue with kernel-2.6.31-rc8-git2 on a similar machine
> with the config file of F12alpha.  

But you are comparing Fedora's 2.6.31-rc5-git2 with upstream -rc8-git2. Can you retest Fedora with the latest kernel?

Comment 4 IBM Bug Proxy 2009-09-08 04:21:50 UTC

------- Comment From sharyath.com 2009-09-08 00:18 EDT-------
If the machine is accessible I would like to trace queue_work() using systemtap to understand who are all adding work queues when the test is run.
It will help to narrow down the problem area, let me know when I can use the machine for testing

Comment 5 IBM Bug Proxy 2009-09-08 06:21:05 UTC

------- Comment From pavan.naregundi.com 2009-09-08 02:10 EDT-------
(In reply to comment #10)
> But you are comparing Fedora's 2.6.31-rc5-git2 with upstream -rc8-git2. Can you
> retest Fedora with the latest kernel?
>

Tested with latest fedora kernel-2.6.31-0.204.rc9.fc12.ppc64, I could not reproduce the issue now.

Comment 6 Bug Zapper 2009-11-16 11:57:11 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 7 Bug Zapper 2010-11-04 10:13:45 UTC

This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 8 Bug Zapper 2010-12-05 06:29:47 UTC

Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.