LTC Owner is: skannery.com
LTC Originator is: marksmit.com

Problem description:
Fedora devel (aka rawhide), netbooting a ppc64 pSeries OpenPower 720, crashes into xmon.
9.3.117.7:/distros/latest-rawhide/SRPMS/kernel-2.6.19-1.2912.fc7.src.rpm (aka Jan17 rawhide) did not recreate. But today's Jan18 snapshot is identical source code.

Hardware Environment: Power5, HV4 OpenPower 720

Is this reproducible? Don't know; stopped upon first instance and held for debug. A similar SF4 is successfully netbooting and installing OK.

Additional information:

------------[ cut here ]------------
cpu 0x1: Vector: 700 (Program Check) at [c000000002f733a0]
    pc: c0000000001b1994: .__list_add+0x60/0x98
    lr: c0000000001b1990: .__list_add+0x5c/0x98
    sp: c000000002f73620
   msr: 8000000000029032
  current = 0xc000000047c75360
  paca    = 0xc0000000005a1500
    pid   = 527, comm = loader
kernel BUG at lib/list_debug.c:33!
------------[ cut here ]------------
enter ? for help
[c000000002f736a0] c0000000001aaa94 .kobject_add+0xb0/0x1e4
[c000000002f73740] c0000000002a4214 .device_add+0xec/0x5b8
[c000000002f73800] c0000000002a47c4 .device_create+0xb4/0x104
[c000000002f738b0] c00000000028232c .vcs_make_sysfs+0x48/0x94
[c000000002f73940] c00000000028a8e8 .con_open+0xa0/0xd4
[c000000002f739d0] c00000000027ab8c .tty_open+0x200/0x368
[c000000002f73a80] c0000000000fad14 .chrdev_open+0x1a0/0x208
[c000000002f73b30] c0000000000f4fd8 .__dentry_open+0x13c/0x278
[c000000002f73be0] c0000000000f5288 .do_filp_open+0x50/0x70
[c000000002f73d00] c0000000000f531c .do_sys_open+0x74/0x130
[c000000002f73db0] c00000000013397c .compat_sys_open+0x24/0x38
[c000000002f73e30] c0000000000086c8 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000000101a70c0
SP (ffff71e0) is in userspace
1:mon>
1:mon> r
R00 = c0000000001b1990   R16 = 0000000000000003
R01 = c000000002f73620   R17 = 0000000000000000
R02 = c0000000006b5dc0   R18 = 0000000010200000
R03 = 0000000000000079   R19 = 0000000010200000
R04 = 0000000000000000   R20 = 0000000010200000
R05 = 0000000000000000   R21 = 0000000000000001
R06 = 0000000000000000   R22 = 0000000010290000
R07 = 0000000000023a1a   R23 = 0000000010290000
R08 = 0000000023632fae   R24 = 0000000000000000
R09 = c0000000006e2200   R25 = c000000001e9e9b8
R10 = 000000001c00000f   R26 = fffffffffffffffe
R11 = 0000000000000000   R27 = c000000002b1ac80
R12 = 0000000000004000   R28 = c000000047cd6058
R13 = c0000000005a1500   R29 = fffffffffffffff4
R14 = 0000000000000000   R30 = c00000000062d508
R15 = 0000000000000000   R31 = c000000001e9e9b8
pc  = c0000000001b1994 .__list_add+0x60/0x98
lr  = c0000000001b1990 .__list_add+0x5c/0x98
msr = 8000000000029032   cr  = 24000482
ctr = 80000000001c5840   xer = 000000000000000e   trap = 700
1:mon>

-------------------------------------------------------------------

The crash is happening here:

    if (unlikely(prev->next != next)) {
            printk(KERN_ERR "list_add corruption. prev->next should be "
                    "next (%p), but was %p. (prev=%p).\n",
                    next, prev->next, prev);
            BUG();
    }

The kobject list got corrupted. The dmesg log shows the following:

==============================================================================
<3>list_add corruption. prev->next should be next (c000000000632780), but was c000000002e64418. (prev=c000000002e64418)..
<2>kernel BUG at lib/list_debug.c:33!.
==============================================================================

Before this crash, dmesg shows another error, -EEXIST, which occurred while trying to add a new kobject:

==============================================================================
<5>scsi 0:255:255:255: No Device         IBM      5709001          0150 PQ: 0 ANSI: 0.
<4>kobject_add failed for 0:255:255:255 with -EEXIST, don't try to register things with the same name in the same directory..
<4>Call Trace:.
<4>[C0000000027E3130] [C000000000010D1C] .show_stack+0x68/0x1b0 (unreliable).
<4>[C0000000027E31D0] [C0000000001AAB70] .kobject_add+0x18c/0x1e4.
<4>[C0000000027E3270] [C0000000002A4214] .device_add+0xec/0x5b8.
<4>[C0000000027E3330] [D00000000020CD38] .scsi_sysfs_add_sdev+0x50/0x280 [scsi_mod].
<4>[C0000000027E33E0] [D000000000209DFC] .scsi_probe_and_add_lun+0x904/0xaa4 [scsi_mod].
<4>[C0000000027E34F0] [D00000000020B41C] .__scsi_add_device+0x84/0xd0 [scsi_mod].
<4>[C0000000027E35A0] [D00000000020B69C] .scsi_add_device+0x14/0x44 [scsi_mod].
<4>[C0000000027E3620] [D00000000026CCE4] .ipr_probe+0x113c/0x1228 [ipr].
<4>[C0000000027E3730] [C0000000001BD1D0] .pci_device_probe+0x144/0x1e4.
<4>[C0000000027E37F0] [C0000000002A73E4] .really_probe+0xbc/0x180.
<4>[C0000000027E3890] [C0000000002A77D4] .__driver_attach+0xdc/0x164.
<4>[C0000000027E3920] [C0000000002A63F4] .bus_for_each_dev+0x7c/0xd4.
<4>[C0000000027E39E0] [C0000000002A71D0] .driver_attach+0x28/0x40.
==============================================================================

Mark,

The failure while adding kobject "0:255:255:255" is happening only for the Jan 18 rawhide through netboot. I checked the dmesg of athenalp1 (Jan 18 rawhide running fine) and found that the same kobject is added there successfully without any errors. Also, this error is happening only when the device object is added; the driver object is getting added to the kobject list successfully.

------------------------------------------------------------------------------------

Red Hat,
Mirroring this bug for your awareness.
-thanks.
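
For reference, a minimal userspace sketch of the check that fired, reduced to plain C. It is modelled on the snippet quoted above and is not the kernel source: `checked_list_add()`, `checked_list_add_tail()`, and `demo_reinitialised_tail()` are hypothetical names, the sketch returns -1 where the kernel would call BUG(), and the re-initialised entry is only one assumed way to end up with prev->next == prev, which is the pattern in the dmesg line (was c000000002e64418, prev=c000000002e64418). It does not establish that this is what happened on the failing system.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/*
 * Insert 'entry' between 'prev' and 'next'.
 * Returns 0 on success, or -1 where the debug check would have hit BUG().
 */
static int checked_list_add(struct list_head *entry,
                            struct list_head *prev,
                            struct list_head *next)
{
    if (prev->next != next) {
        fprintf(stderr, "list_add corruption. prev->next should be "
                "next (%p), but was %p. (prev=%p).\n",
                (void *)next, (void *)prev->next, (void *)prev);
        return -1;
    }
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
    return 0;
}

/* Append 'entry' at the tail of the list anchored at 'head'. */
static int checked_list_add_tail(struct list_head *entry,
                                 struct list_head *head)
{
    return checked_list_add(entry, head->prev, head);
}

/*
 * Hypothetical scenario: the last entry is re-initialised while the
 * list head still points at it, so the next tail insert sees an entry
 * whose 'next' points back at itself.
 */
static int demo_reinitialised_tail(void)
{
    struct list_head head, a, b;
    int rc;

    init_list_head(&head);
    rc = checked_list_add_tail(&a, &head);
    assert(rc == 0);            /* healthy list: the insert succeeds */

    init_list_head(&a);         /* a.next == &a, but head.prev is still &a */
    return checked_list_add_tail(&b, &head);
}
```

Calling demo_reinitialised_tail() prints a message of the same shape as the dmesg line, with the "but was" address equal to prev, and returns -1.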
----- Additional Comments From skannery.com 2007-02-21 04:34 EDT -------
Mark,

From the current stack trace, it looks like this time the crash occurred much earlier in the installation process than the previous ones. The reason for the crash is also different:

BUG: spinlock bad magic on CPU#2, loader/2325 (Not tainted)

I wanted to look at the dmesg log to check whether there were any error messages before this, but I am not able to access hmc6lte. Can you please check? After looking at dmesg we will be able to confirm whether the initially reported scenario (-EEXIST) has happened or not.

Thanks & Rgds,
Supriya
----- Additional Comments From skannery.com 2007-02-25 23:24 EDT -------
Mark,

The problem, ps3_system_bus_driver_register() triggering a WARN_ON() in kref_get() on platforms other than PS3, is discussed on the linux-usb-devel mailing list. Can you please try applying the suggested patch and see whether it addresses this problem?

http://www.mail-archive.com/linux-usb-devel@lists.sourceforge.net/msg50834.html

Thanks & Rgds,
Supriya
----- Additional Comments From marksmit.com 2007-03-07 14:10 EDT -------
Supriya,

I have not gotten a chance to try the patch, but not for lack of trying. The Feb15 rawhide build that triggered that crash would recreate every time. I could not work around it; I even removed USB resources and ran with a minimal-resources LPAR profile. I still have that rawhide snapshot available, but have moved up to current.

The newer FC7-test2 (aka 6.91) build and newer rawhides are not recreating at all. Since the previous crashes took quite a few attempts to recreate, I would like to get the system installed and then reboot in a loop, waiting for a crash. I am currently hitting install bugs further along in the process, so I need to deal with those first.
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |OPEN

------- Additional Comments From marksmit.com 2007-03-08 21:06 EDT -------
Supriya,

After installing FC6 and then "upgrade" installing to FC7-test2, I am now able to test again. In the past 20 hours, I have let it reboot 298 times, attempting recreates, but have hit nothing.

I am prepared to call it un-recreatable on all stack traces and close this bug, unless you can suggest something else for me to try.
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |ASSIGNED

------- Additional Comments From skannery.com 2007-03-09 12:01 EDT -------
Mark,

298 reboots is more than enough of a wait for any of these stack traces to recreate. I agree with you; we can close this as unreproducible. If you hit any of these stack traces during later testing, you can reopen this bug report.

Thanks! You have been very supportive.

Rgds,
Supriya