=Comment: #0================================================= PAVAN NAREGUNDI <pavan.naregundi.com> - Test case KernelTtyTest.c Ran the attached testcase(KernelTtyTest.c) on Fedora12 Alpha running on power5. Test case reported a Badness at lib/debugobjects.c:254 as below and system droped to xmon ================= ------------[ cut here ]------------ Badness at lib/debugobjects.c:254 NIP: c000000000368fe0 LR: c000000000368fdc CTR: 80000000000d5d74 REGS: c00000002f91adc0 TRAP: 0700 Tainted: G D (2.6.31-0.125.4.2.rc5.git2.fc12.ppc64) MSR: 8000000000029032 <EE,ME,CE,IR,DR> CR: 28008482 XER: 0000000d TASK = c000000034e9cfe0[7850] 'KernelTtyTest' THREAD: c00000002f918000 CPU: 0 GPR00: c000000000368fdc c00000002f91b040 c00000000127e930 0000000000000032 GPR04: c0000000006eeb94 c0000000000bd100 0000000000000000 0000000000000002 GPR08: 0000000000000000 c000000034e9cfe0 000000000a000000 0000000000000001 GPR12: 0000000000004000 c0000000012d8300 c000000034ad44b8 c000000034ad4478 GPR16: c000000034ad5550 0000000000000000 c000000034ad4398 0000000000000000 GPR20: c0000000334209a0 c000000001c910d8 c000000034ad5120 c000000001c910e0 GPR24: c000000034ad4120 c00000002f91b130 0000000000000002 c000000034ad4000 GPR28: 0000000000100100 0000000000200200 c00000000120d950 c00000002f91b040 NIP [c000000000368fe0] .debug_print_object+0x70/0x9c LR [c000000000368fdc] .debug_print_object+0x6c/0x9c Call Trace: [c00000002f91b040] [c000000000368fdc] .debug_print_object+0x6c/0x9c (unreliable) [c00000002f91b0c0] [c0000000003696e8] .debug_check_no_obj_freed+0xe4/0x230 [c00000002f91b1c0] [c0000000001c0f44] .kfree+0x150/0x200 [c00000002f91b270] [c0000000004593a8] .free_tty_struct+0x44/0x60 [c00000002f91b300] [c000000000459498] .release_one_tty+0xd4/0xf8 [c00000002f91b3a0] [c00000000035b0e0] .kref_put+0x78/0x9c [c00000002f91b420] [c000000000455b38] .tty_kref_put+0x40/0x60 [c00000002f91b4b0] [c000000000455df8] .release_tty+0x60/0x7c [c00000002f91b540] [c000000000459290] .tty_release_dev+0x4d8/0x554 [c00000002f91b6a0] [c000000000459340] .tty_release+0x34/0x58 [c00000002f91b730] [c0000000001d1e80] .__fput+0x1ac/0x2d4 [c00000002f91b7f0] [c0000000001d1ff4] .fput+0x4c/0x60 [c00000002f91b880] [c0000000001cd2c8] .filp_close+0xc8/0xf4 [c00000002f91b920] [c0000000000bfa5c] .put_files_struct+0xcc/0x15c [c00000002f91b9d0] [c0000000000bfb50] .exit_files+0x64/0x84 [c00000002f91ba70] [c0000000000c1d58] .do_exit+0x2a4/0x824 [c00000002f91bb50] [c0000000000c237c] .do_group_exit+0xa4/0xd8 [c00000002f91bbf0] [c0000000000d4bb4] .get_signal_to_deliver+0x42c/0x490 [c00000002f91bce0] [c000000000015cd4] .do_signal+0x7c/0x350 [c00000002f91be30] [c000000000008c58] do_work+0x24/0x28 Instruction dump: 7fab0000 41be0030 38090001 e93e8078 e87e8070 90080008 800a0010 e8cb0000 78001f24 7ca9002a 4838b755 60000000 <0fe00000> e93e8000 8169000c 380b0001 Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6b73 Faulting instruction address: 0xc0000000000da510 cpu 0x3: Vector: 300 (Data Access) at [c000000039443ae0] pc: c0000000000da510: .worker_thread+0x208/0x3dc lr: c0000000000da4e8: .worker_thread+0x1e0/0x3dc sp: c000000039443d60 msr: 8000000000009032 dar: 6b6b6b6b6b6b6b73 dsisr: 42000000 current = 0xc00000003943a7f0 paca = 0xc0000000012d8900 pid = 60, comm = events/3 enter ? for help 3:mon> t [c000000039443e90] c0000000000e16ec .kthread+0xac/0xb8 [c000000039443f90] c0000000000313b8 .kernel_thread+0x54/0x70 3:mon> r R00 = 0000000000000000 R16 = 000000000032e000 R01 = c000000039443d60 R17 = 000000000263ebf0 R02 = c00000000127e930 R18 = c00000003943a7f0 R03 = c000000039406668 R19 = c000000070004a40 R04 = c0000000006eeb94 R20 = c000000034ad4358 R05 = c00000000016069c R21 = c0000000394066b8 R06 = 0000000000000000 R22 = c0000000394066a8 R07 = 0000000000000002 R23 = c000000039440000 R08 = 0000000000000001 R24 = c000000039443dd0 R09 = c000000034ad4358 R25 = 0000000000000001 R10 = c0000000394066a8 R26 = c000000039443df8 R11 = 6b6b6b6b6b6b6b6b R27 = 6b6b6b6b6b6b6b6b R12 = 0000000024000044 R28 = c000000034ad4350 R13 = c0000000012d8900 R29 = c000000039406668 R14 = 0000000003f00000 R30 = c0000000011fa708 R15 = 0000000001b5faa8 R31 = c000000039443d60 pc = c0000000000da510 .worker_thread+0x208/0x3dc lr = c0000000000da4e8 .worker_thread+0x1e0/0x3dc msr = 8000000000009032 cr = 24000042 ctr = c00000000016057c xer = 0000000020000001 trap = 300 dar = 6b6b6b6b6b6b6b73 dsisr = 42000000 3:mon> di %pc c0000000000da510 f94b0008 std r10,8(r11) c0000000000da514 f9290000 std r9,0(r9) c0000000000da518 f9290008 std r9,8(r9) c0000000000da51c 48614595 bl c0000000006eeab0 # ._spin_unlock_irq+0x0/0x78 c0000000000da520 60000000 nop c0000000000da524 e814fff8 ld r0,-8(r20) c0000000000da528 78000764 rldicr r0,r0,0,61 c0000000000da52c 7fa00278 xor r0,r29,r0 c0000000000da530 3120ffff addic r9,r0,-1 c0000000000da534 7c090110 subfe r0,r9,r0 c0000000000da538 0b000000 tdnei r0,0 c0000000000da53c 7c00e0a8 ldarx r0,0,r28 c0000000000da540 7c00c878 andc r0,r0,r25 c0000000000da544 7c00e1ad stdcx. r0,0,r28 c0000000000da548 40a2fff4 bne c0000000000da53c # .worker_thread+0x234/0x3dc c0000000000da54c e87d00a8 ld r3,168(r29) 3:mon> ======================== Kernel: 2.6.31-0.125.4.2.rc5.git2.fc12.ppc64 CPU: power5 model : IBM,9113-550 Steps: 1. Compile the test case gcc -o KernelTtyTest KernelTtyTest.c -Wall -O2 -lutil 2. Run the test case in a while loop while :; do ./KernelTtyTest ; done Test will generate failures as below. Ignore the failures and within a 5 min of the test run system will drop to xmon --------------------------- Aborted read failed: Input/output error lastbytes: '\r\nprocess_cmd read failed: Interrupted system call\r\n' ------------------------------ =Comment: #2================================================= SHARYATHI NAGESH <sharyath.com> - Pavan This is what I could debug: The problem is observed because the list_head of work_struct data structure contains garbage value (6b6b..) which intern is the string (poison value) written by the slub/slab free routines, when CONFIG_SLUB_DUBUG is enabled. I was not able to determine the exact race condition (kernel thread) which is freeing the work_struct data structure before it is done by run_workqueue() function. This is what analysis I have done till now ------------------------------------------------------------------ Disassembled instructions c0000000000da4f8 e93d0040 ld r9,64(r29) c0000000000da4fc fb9d00a0 std r28,160(r29) c0000000000da500 7fa3eb78 mr r3,r29 c0000000000da504 e9490008 ld r10,8(r9) 3:mon> c0000000000da508 e9690000 ld r11,0(r9) c0000000000da50c f96a0000 std r11,0(r10) c0000000000da510 f94b0008 std r10,8(r11) <== This is where it seg faults c0000000000da514 f9290000 std r9,0(r9) c0000000000da518 f9290008 std r9,8(r9) c0000000000da51c 48614595 bl c0000000006eeab0 # ._spin_unlock_irq+0x0/0x78 ------------------------------------------------------------------ register contents 3:mon> r R00 = 0000000000000000 R16 = 000000000032e000 R01 = c0000000393dfd60 R17 = 000000000263ebf0 R02 = c00000000127e930 R18 = c0000000393d27f0 R03 = c0000000392db710 R19 = c000000070004a40 R04 = c0000000006eeb94 R20 = c0000000099f63e8 R05 = c00000000016069c R21 = c0000000392db760 R06 = 0000000000000000 R22 = c0000000392db750 R07 = 0000000000000002 R23 = c0000000393dc000 R08 = 0000000000000001 R24 = c0000000393dfdd0 R09 = c0000000099f63e8 R25 = 0000000000000001 R10 = 6b6b6b6b6b6b6b6b R26 = c0000000393dfdf8 R11 = 6b6b6b6b6b6b6b6b R27 = 6b6b6b6b6b6b6b6b R12 = 0000000000000000 R28 = c0000000099f63e0 R13 = c0000000012d8900 R29 = c0000000392db710 R14 = 0000000003f00000 R30 = c0000000011fa708 R15 = 0000000001b5faa8 R31 = c0000000393dfd60 pc = c0000000000da50c .worker_thread+0x204/0x3dc lr = c0000000000da4e8 .worker_thread+0x1e0/0x3dc msr = 8000000000009032 cr = 24000042 ctr = c00000000016057c xer = 0000000020000001 trap = 300 dar = 6b6b6b6b6b6b6b6b dsisr = 42000000 You can observe R10 and R11 which holds previous and next pointers in the list head respectively are poisoned by slub free() routine ------------------------------------------------------------------ objdump results mapped the assembly instructions to these lines of C code cwq->current_work = work; 1840: fb 9d 00 a0 std r28,160(r29) /root/rpmbuild/BUILD/kernel-2.6.30/linux-2.6.30.ppc64/kernel/workqueue.c:285 <== From run_workqueue() function list_del_init(cwq->worklist.next); spin_unlock_irq(&cwq->lock); 1844: 7f a3 eb 78 mr r3,r29 list_del_init(): /root/rpmbuild/BUILD/kernel-2.6.30/linux-2.6.30.ppc64/include/linux/list.h:142 * list_del_init - deletes entry from list and reinitialize it. * @entry: the element to delete from the list. */ static inline void list_del_init(struct list_head *entry) { __list_del(entry->prev, entry->next); 1848: e9 49 00 08 ld r10,8(r9) 184c: e9 69 00 00 ld r11,0(r9) __list_del(): /root/rpmbuild/BUILD/kernel-2.6.30/linux-2.6.30.ppc64/include/linux/list.h:93 * the prev/next entries already! */ static inline void __list_del(struct list_head * prev, struct list_head * next) { next->prev = prev; prev->next = next; 1850: f9 6a 00 00 std r11,0(r10) /root/rpmbuild/BUILD/kernel-2.6.30/linux-2.6.30.ppc64/include/linux/list.h:92 * This is only for internal list manipulation where we know * the prev/next entries already! */ static inline void __list_del(struct list_head * prev, struct list_head * next) { next->prev = prev; 1854: f9 4b 00 08 std r10,8(r11) <== This is where it breaks ------------------------------------------------------------------ Still working on understanding the actual cause for the race / corruption of work_struct data structure Thanks =Comment: #5================================================= SHARYATHI NAGESH <sharyath.com> - Pavan Can you update the bug with your efforts to recreate the issue with mainline kernel Thanks
Created attachment 359668 [details] Test case KernelTtyTest.c
------- Comment From pavan.naregundi.com 2009-09-04 02:55 EDT------- (In reply to comment #5) > Pavan > Can you update the bug with your efforts to recreate the issue with mainline > kernel > Thanks > Could not reproduce the issue with kernel-2.6.31-rc8-git2 on a similar machine with the config file of F12alpha.
(In reply to comment #2) > Could not reproduce the issue with kernel-2.6.31-rc8-git2 on a similar machine > with the config file of F12alpha. But you are comparing Fedora's 2.6.31-rc5-git2 with upstream -rc8-git2. Can you retest Fedora with the latest kernel?
------- Comment From sharyath.com 2009-09-08 00:18 EDT------- If the machine is accessible I would like to trace queue_work() using systemtap to understand who are all adding work queues when the test is run. It will help to narrow down the problem area, let me know when I can use the machine for testing
------- Comment From pavan.naregundi.com 2009-09-08 02:10 EDT------- (In reply to comment #10) > But you are comparing Fedora's 2.6.31-rc5-git2 with upstream -rc8-git2. Can you > retest Fedora with the latest kernel? > Tested with latest fedora kernel-2.6.31-0.204.rc9.fc12.ppc64, I could not reproduce the issue now.
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
This message is a reminder that Fedora 12 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 12. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '12'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 12's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 12 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.