Bug 224082

Summary: kernel crash, netbooting to start rawhide install
Product: [Fedora] Fedora Reporter: IBM Bug Proxy <bugproxy>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: rawhide CC: wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: powerpc   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-03-15 19:23:13 UTC Type: ---

Description IBM Bug Proxy 2007-01-23 23:00:26 UTC
LTC Owner is: skannery.com
LTC Originator is: marksmit.com

Problem description:
Netbooting Fedora devel (aka rawhide) on a ppc64 pSeries OpenPower 720 crashes
into xmon.
The Jan 17 rawhide kernel
(9.3.117.7:/distros/latest-rawhide/SRPMS/kernel-2.6.19-1.2912.fc7.src.rpm)
did not recreate the crash, even though today's Jan 18 snapshot has identical
source code.

Hardware Environment: Power5, HV4 OpenPower 720

Is this reproducible?
 Unknown; testing stopped at the first instance and the system is held for debug.
 A similar SF4 system netboots and installs successfully.

 Additional information:
        ------------[ cut here ]------------
cpu 0x1: Vector: 700 (Program Check) at [c000000002f733a0]
    pc: c0000000001b1994: .__list_add+0x60/0x98
    lr: c0000000001b1990: .__list_add+0x5c/0x98 
    sp: c000000002f73620
   msr: 8000000000029032
  current = 0xc000000047c75360
  paca    = 0xc0000000005a1500
    pid   = 527, comm = loader
kernel BUG at lib/list_debug.c:33!
------------[ cut here ]------------
enter ? for help
[c000000002f736a0] c0000000001aaa94 .kobject_add+0xb0/0x1e4
[c000000002f73740] c0000000002a4214 .device_add+0xec/0x5b8
[c000000002f73800] c0000000002a47c4 .device_create+0xb4/0x104
[c000000002f738b0] c00000000028232c .vcs_make_sysfs+0x48/0x94
[c000000002f73940] c00000000028a8e8 .con_open+0xa0/0xd4
[c000000002f739d0] c00000000027ab8c .tty_open+0x200/0x368
[c000000002f73a80] c0000000000fad14 .chrdev_open+0x1a0/0x208
[c000000002f73b30] c0000000000f4fd8 .__dentry_open+0x13c/0x278
[c000000002f73be0] c0000000000f5288 .do_filp_open+0x50/0x70
[c000000002f73d00] c0000000000f531c .do_sys_open+0x74/0x130
[c000000002f73db0] c00000000013397c .compat_sys_open+0x24/0x38
[c000000002f73e30] c0000000000086c8 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000000101a70c0
SP (ffff71e0) is in userspace
1:mon>
1:mon> r
R00 = c0000000001b1990   R16 = 0000000000000003
R01 = c000000002f73620   R17 = 0000000000000000
R02 = c0000000006b5dc0   R18 = 0000000010200000
R03 = 0000000000000079   R19 = 0000000010200000
R04 = 0000000000000000   R20 = 0000000010200000
R05 = 0000000000000000   R21 = 0000000000000001
R06 = 0000000000000000   R22 = 0000000010290000
R07 = 0000000000023a1a   R23 = 0000000010290000
R08 = 0000000023632fae   R24 = 0000000000000000
R09 = c0000000006e2200   R25 = c000000001e9e9b8
R10 = 000000001c00000f   R26 = fffffffffffffffe
R11 = 0000000000000000   R27 = c000000002b1ac80
R12 = 0000000000004000   R28 = c000000047cd6058
R13 = c0000000005a1500   R29 = fffffffffffffff4
R14 = 0000000000000000   R30 = c00000000062d508
R15 = 0000000000000000   R31 = c000000001e9e9b8
pc  = c0000000001b1994 .__list_add+0x60/0x98
lr  = c0000000001b1990 .__list_add+0x5c/0x98
msr = 8000000000029032   cr  = 24000482
ctr = 80000000001c5840   xer = 000000000000000e   trap =  700
1:mon>
-------------------------------------------------------------------
The crash happens here:
        if (unlikely(prev->next != next)) {
                printk(KERN_ERR "list_add corruption. prev->next should be "
                        "next (%p), but was %p. (prev=%p).\n",
                        next, prev->next, prev);
                BUG();
        }
The kobject list got corrupted. The dmesg log shows the following:
==============================================================================
<3>list_add corruption. prev->next should be next (c000000000632780), but was
c000000002e64418. (prev=c000000002e64418)..
<2>kernel BUG at lib/list_debug.c:33!.
==============================================================================
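For illustration, here is a minimal userspace sketch of the check quoted above. It is not the kernel code: checked_list_add(), checked_list_add_tail() and demo_stale_node() are hypothetical names, and BUG() is replaced by a false return value. It shows one way the logged pattern (prev->next pointing back at prev itself) can arise, namely when a node is re-initialised while it is still linked. The logs in this report do not establish that this is what actually happened here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Minimal doubly linked list node, modeled on the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* A self-linked node represents an empty list (or an unlinked entry). */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/*
 * Hypothetical stand-in for the debug check quoted in this report.
 * Returns false instead of calling BUG() when prev and next disagree.
 */
static bool checked_list_add(struct list_head *entry,
                             struct list_head *prev,
                             struct list_head *next)
{
    if (prev->next != next) {
        fprintf(stderr, "list_add corruption. prev->next should be "
                "next (%p), but was %p. (prev=%p).\n",
                (void *)next, (void *)prev->next, (void *)prev);
        return false;
    }
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
    return true;
}

/* Tail insert: the new entry goes between head->prev and head. */
static bool checked_list_add_tail(struct list_head *entry,
                                  struct list_head *head)
{
    return checked_list_add(entry, head->prev, head);
}

/* Returns true when the stale-node scenario trips the check. */
static bool demo_stale_node(void)
{
    struct list_head head, a, b;

    init_list_head(&head);
    init_list_head(&a);
    init_list_head(&b);

    bool first = checked_list_add_tail(&a, &head);
    assert(first);          /* a clean insert must pass the check */

    init_list_head(&a);     /* assumed bug: 'a' is still linked into the list */

    /* Now head.prev == &a but a.next == &a, so prev->next != next. */
    return !checked_list_add_tail(&b, &head);
}
```

Calling demo_stale_node() from a small main() should print a message of the same form as the one in the dmesg log, with the "was" pointer equal to "prev".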

Before the error that triggered this crash, dmesg shows another error, -EEXIST,
which occurred while trying to add a new kobject:
==============================================================================
<5>scsi 0:255:255:255: No Device         IBM      5709001          0150 PQ: 0
ANSI: 0.
<4>kobject_add failed for 0:255:255:255 with -EEXIST, don't try to register
things with the same name in the same directory..
<4>Call Trace:.
<4>[C0000000027E3130] [C000000000010D1C] .show_stack+0x68/0x1b0 (unreliable).
<4>[C0000000027E31D0] [C0000000001AAB70] .kobject_add+0x18c/0x1e4.
<4>[C0000000027E3270] [C0000000002A4214] .device_add+0xec/0x5b8.
<4>[C0000000027E3330] [D00000000020CD38] .scsi_sysfs_add_sdev+0x50/0x280 [scsi_mod].
<4>[C0000000027E33E0] [D000000000209DFC] .scsi_probe_and_add_lun+0x904/0xaa4
[scsi_mod].
<4>[C0000000027E34F0] [D00000000020B41C] .__scsi_add_device+0x84/0xd0 [scsi_mod].
<4>[C0000000027E35A0] [D00000000020B69C] .scsi_add_device+0x14/0x44 [scsi_mod].
<4>[C0000000027E3620] [D00000000026CCE4] .ipr_probe+0x113c/0x1228 [ipr].
<4>[C0000000027E3730] [C0000000001BD1D0] .pci_device_probe+0x144/0x1e4.
<4>[C0000000027E37F0] [C0000000002A73E4] .really_probe+0xbc/0x180.
<4>[C0000000027E3890] [C0000000002A77D4] .__driver_attach+0xdc/0x164.
<4>[C0000000027E3920] [C0000000002A63F4] .bus_for_each_dev+0x7c/0xd4.
<4>[C0000000027E39E0] [C0000000002A71D0] .driver_attach+0x28/0x40.
==============================================================================
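As a rough illustration of what the -EEXIST message means, the sketch below models a directory that refuses a second entry with the same name. It is a userspace toy, not the kobject or sysfs implementation: struct toy_dir, toy_dir_add() and demo_duplicate_name() are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_ENTRIES 8
#define NAME_LEN    32

/* Hypothetical stand-in for a sysfs directory: a flat table of unique names. */
struct toy_dir {
    char names[MAX_ENTRIES][NAME_LEN];
    int count;
};

/*
 * Add a name to the directory.
 * Returns 0 on success, -EEXIST if the name is already present,
 * -ENAMETOOLONG if it does not fit, and -ENOSPC if the table is full.
 */
static int toy_dir_add(struct toy_dir *dir, const char *name)
{
    if (strlen(name) >= NAME_LEN)
        return -ENAMETOOLONG;

    for (int i = 0; i < dir->count; i++) {
        if (strcmp(dir->names[i], name) == 0)
            return -EEXIST;     /* same name in the same directory */
    }

    if (dir->count == MAX_ENTRIES)
        return -ENOSPC;

    strcpy(dir->names[dir->count++], name);
    return 0;
}

/* Registers the device name from the log twice; returns the second result. */
static int demo_duplicate_name(void)
{
    struct toy_dir dir = { .count = 0 };

    int rc = toy_dir_add(&dir, "0:255:255:255");
    assert(rc == 0);            /* first registration succeeds */

    return toy_dir_add(&dir, "0:255:255:255");
}
```

In this toy model the second registration of "0:255:255:255" is rejected with -EEXIST and the directory is left unchanged.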

Mark,
 The failure while adding kobject "0:255:255:255" happens only with the Jan 18
rawhide booted through netboot. I checked the dmesg of athenalp1 (Jan 18 rawhide
running fine) and found that the same kobject is added successfully, without any
errors.
 Also, this error happens only when the device object is added; the driver
object is added to the kobject list successfully.


------------------------------------------------------------------------------------

Red Hat,
Mirroring this bug for your awareness.
-thanks.

Comment 1 IBM Bug Proxy 2007-02-21 09:36:07 UTC
----- Additional Comments From skannery.com  2007-02-21 04:34 EDT -------
Mark,
  From the current stack trace, it looks like this time the crash came much
earlier in the installation process than the previous ones.
The reason for the crash is also different:

BUG: spinlock bad magic on CPU#2, loader/2325 (Not tainted)

  I wanted to look at the dmesg log to check whether there were any error
messages before this, but I am not able to access hmc6lte. Can you please check?

  After looking at dmesg we will be able to confirm whether the initially
reported scenario (EEXIST) has happened or not.
Thanks & Rgds, Supriya 

Comment 2 IBM Bug Proxy 2007-02-26 04:30:23 UTC
----- Additional Comments From skannery.com  2007-02-25 23:24 EDT -------
Mark,
  The problem of ps3_system_bus_driver_register() triggering a WARN_ON() in
kref_get() on platforms other than PS3 is discussed on the linux-usb-devel
mailing list. Can you please try applying the suggested patch and see whether it
addresses this problem?

http://www.mail-archive.com/linux-usb-devel@lists.sourceforge.net/msg50834.html
Thanks & Rgds, Supriya 

Comment 3 IBM Bug Proxy 2007-03-07 19:15:13 UTC
----- Additional Comments From marksmit.com  2007-03-07 14:10 EDT -------
Supriya,
I have not had a chance to try the patch, though not for lack of trying.
The Feb15 rawhide build that triggered that crash would recreate every time.
I could not work around it, even after removing USB resources and running with
a minimal-resources lpar profile.  I still have that rawhide snapshot
available, but have moved up to current.

The newer FC7-test2 (aka 6.91) build and newer rawhides are not recreating at 
all.   Since the previous crashes took quite a few attempts to recreate, I 
would like to get the system installed and then reboot in a loop, waiting for 
a crash.   
I am currently hitting install bugs further in the process, so I need to deal 
with those first. 

Comment 4 IBM Bug Proxy 2007-03-09 02:10:42 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |OPEN




------- Additional Comments From marksmit.com  2007-03-08 21:06 EDT -------
Supriya,
after installing FC6 and then "upgrade" installing to FC7-test2, I am now able 
to test again.
In the past 20 hours, I have let it reboot 298 times, attempting recreates, 
but have hit nothing.
I am prepared to call it un-recreatable on all stack traces and close this 
bug, unless you can suggest something else for me to try. 

Comment 5 IBM Bug Proxy 2007-03-09 17:05:15 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |ASSIGNED




------- Additional Comments From skannery.com  2007-03-09 12:01 EDT -------
Mark,
 298 reboots is a large number of attempts without any of these stack traces
recreating. I agree with you; we can close this as unreproducible. If
you hit any of these stack traces during later testing, you can reopen the
same bug report.
 Thanks! You have been very supportive.
Rgds, Supriya