Bug 641086 - mpt2sas driver update causes boot failure with Dell PERC H200 SAS HBA
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: All, OS: Linux
Priority: urgent, Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Tomas Henzl
QA Contact: Red Hat Kernel QE team
Keywords: Regression, TestBlocker
Depends On:
Blocks: 640580 568281
Reported: 2010-10-07 14:12 EDT by Nate Straz
Modified: 2011-01-13 16:56 EST (History)
Doc Type: Bug Fix
Last Closed: 2011-01-13 16:56:30 EST

Attachments
Full console log from buzz-02 boot. (36.38 KB, application/octet-stream)
2010-10-07 14:12 EDT, Nate Straz
Full console log from buzz-05 boot. (36.25 KB, application/octet-stream)
2010-10-07 14:18 EDT, Nate Straz
direct attached SEP device patch (1.37 KB, patch)
2010-10-28 11:51 EDT, kashyap
direct attached SEP device patch for RHEL5.6 (1.38 KB, patch)
2010-10-28 12:46 EDT, Nate Straz

Description Nate Straz 2010-10-07 14:12:55 EDT
Created attachment 452174
Full console log from buzz-02 boot.

Description of problem:

While booting a Dell PowerEdge R710 w/ 24G RAM I'm getting the following panic.

                Welcome to Red Hat Enterprise Linux Server
                Press 'I' to enter interactive startup.
Setting clock  (utc): Thu Oct  7 13:02:39 CDT 2010 [  OK  ]
Starting udev: [  OK  ]
Loading default keymap (us): [  OK  ]
Setting hostname buzz-02:  [  OK  ]
ACPI: Power Button (FF) [PWRF]
ACPI: Mapper loaded
dell-wmi: No known WMI GUID found
Unable to handle kernel NULL pointer dereference at 0000000000000002 RIP:
 [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x16/0x54
PGD 62c54e067 PUD 62fd99067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /kernel/uevent_seqnum
CPU 0
Modules linked in: power_meter hwmon i2c_ec i2c_core dell_wmi wmi button batterd
Pid: 3563, comm: modprobe Not tainted 2.6.18-225.el5 #1
RIP: 0010:[<ffffffff80184ec4>]  [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x164
RSP: 0000:ffff81062d39fca8  EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff81062eeecc00 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff81062eeecc00 RDI: 0000000000000001
RBP: ffff81062fa63d60 R08: ffffc2000001062e R09: 0000000000000000
R10: ffff81062c523320 R11: 0000000000000050 R12: ffff81062b4f0420
R13: 000000000000000e R14: ffffc2000001062d R15: ffff81062b4f04a0
FS:  00002b5b30b1c6e0(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000002 CR3: 000000062c51a000 CR4: 00000000000006e0
Process modprobe (pid: 3563, threadinfo ffff81062d39e000, task ffff81062fce0080)
Stack:  ffff81062b4f04a0 ffffffff801825e2 ffff81062fa638a0 ffff81062eeecc00
 ffff81062eeecc00 ffff81062b4f0420 0000000000000012 ffffffff80183215
 ffff81062eeecc00 0000000000000000 ffff81062eeecc00 0000000000000000
Call Trace:
 [<ffffffff801825e2>] acpi_ds_eval_data_object_operands+0x6b/0xef
 [<ffffffff80183215>] acpi_ds_exec_end_op+0x297/0x408
 [<ffffffff801924ed>] acpi_ps_parse_loop+0x602/0x94f
 [<ffffffff80191a74>] acpi_ps_parse_aml+0x80/0x254
 [<ffffffff80192cec>] acpi_ps_execute_pass+0x82/0x98
 [<ffffffff80192e0f>] acpi_ps_execute_method+0xd3/0x169
 [<ffffffff8018fe6d>] acpi_ns_evaluate+0xa8/0x10a
 [<ffffffff8018fa69>] acpi_evaluate_object+0x131/0x1e0
 [<ffffffff80064604>] __down_read+0x12/0x92
 [<ffffffff8849008e>] :power_meter:read_capabilities+0x5f/0x215
 [<ffffffff884910a8>] :power_meter:acpi_power_meter_add+0x10a/0x17a
 [<ffffffff801a41a1>] acpi_bus_driver_init+0x30/0x57
 [<ffffffff801a53bf>] acpi_bus_register_driver+0x95/0xd4
 [<ffffffff882ef032>] :power_meter:acpi_power_meter_init+0x25/0x34
 [<ffffffff800a8d1e>] sys_init_module+0xaf/0x1f2
 [<ffffffff8005d116>] system_call+0x7e/0x83


Code: 11 00 49 89 f0 45 31 c9 48 c7 c2 f5 7e 2c 80 be c9 01 00 00
RIP  [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x16/0x54
 RSP <ffff81062d39fca8>
CR2: 0000000000000002
 <0>Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
kernel-2.6.18-225.el5.x86_64

How reproducible:
2/5 nodes won't boot because of this.

Comment 1 Nate Straz 2010-10-07 14:18:33 EDT
Created attachment 452175
Full console log from buzz-05 boot.

This is the node that appeared to fail while loading acpi_memhotplug.  I didn't realize until after I submitted the bug that buzz-02 failed while loading power_meter.
Comment 2 Nate Straz 2010-10-07 14:22:46 EDT
Here's the console log from one of the systems that made it through the boot process.

                Welcome to Red Hat Enterprise Linux Server
                Press 'I' to enter interactive startup.
Setting clock  (utc): Thu Oct  7 12:49:50 CDT 2010 [  OK  ]
Starting udev: [  OK  ]
Loading default keymap (us): [  OK  ]
Setting hostname buzz-01:  [  OK  ]
ACPI: Power Button (FF) [PWRF]
ACPI: Mapper loaded
dell-wmi: No known WMI GUID found
ACPI Exception (evregion-0424): AE_SUPPORT, Returned by Handler for [DataTable] [20060707]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PMI0._GHL] (Node ffff81033005d8b0), AE_SUPPORT
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PMI0._PMC] (Node ffff81033005d8f0), AE_SUPPORT
ACPI Exception (power_meter-0759): AE_SUPPORT, Evaluating _PMC [20060707]
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
Comment 3 Nate Straz 2010-10-07 15:08:51 EDT
With kernel-2.6.18-194.el5 all five systems boot and I found this warning in the console logs.

Setting hostname buzz-02:  [  OK  ]
ACPI: Power Button (FF) [PWRF]
ACPI: Mapper loaded 
dell-wmi: No known WMI GUID found
ACPI Exception (evregion-0424): AE_SUPPORT, Returned by Handler for [DataTable] [20060707]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PMI0._GHL] (Node ffff81033005d8b0), AE_SUPPORT
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PMI0._PMC] (Node ffff81033005d8f0), AE_SUPPORT
ACPI Exception (power_meter-0759): AE_SUPPORT, Evaluating _PMC [20060707]
Device 'power_meter0' does not have a release() function, it is broken and must be fixed.
BUG: warning at drivers/base/core.c:101/device_release() (Not tainted)

Call Trace:
 [<ffffffff801519ef>] kobject_cleanup+0x53/0x7e
 [<ffffffff80151a1a>] kobject_release+0x0/0x9
 [<ffffffff80035748>] kref_put+0x6f/0x7a
 [<ffffffff884300ed>] :power_meter:acpi_power_meter_add+0x158/0x16f
 [<ffffffff801a115e>] acpi_bus_driver_init+0x30/0x57
 [<ffffffff801a237c>] acpi_bus_register_driver+0x95/0xd4
 [<ffffffff882cd032>] :power_meter:acpi_power_meter_init+0x25/0x34
 [<ffffffff800a7fe0>] sys_init_module+0xaf/0x1f2
 [<ffffffff8005e116>] system_call+0x7e/0x83

md: Autodetecting RAID arrays.
Comment 4 Nate Straz 2010-10-07 15:47:43 EDT
I installed RHEL5.5 and kernel-2.6.18-194.el5 boots on all nodes.

I'm searching through kernels now to find where this regressed.  -210 is good too.
Comment 5 Nate Straz 2010-10-07 16:13:55 EDT
-215.el5 PASS
-219.el5 FAIL
Comment 6 Nate Straz 2010-10-08 10:05:10 EDT
Ran a git bisect between the two brew builds in comment #5.

[nstraz@sts-a rhel5-kernel]$ git bisect log
git-bisect start
# good: [651320bcf36355b3ffc50783ea3547eed83e7b6c] tag: kernel-2.6.18-215.el5
git-bisect good 651320bcf36355b3ffc50783ea3547eed83e7b6c
# bad: [2c99776eeec3c4457edd1a379cf591046af228f9] tag: kernel-2.6.18-219.el5
git-bisect bad 2c99776eeec3c4457edd1a379cf591046af228f9
# bad: [ceebd5d441199afbdead0a49ae34bdb4d24b7719] [scsi] mpt2sas: update to 05.101.00.02
git-bisect bad ceebd5d441199afbdead0a49ae34bdb4d24b7719
# good: [7d1faaf64d8a7a44b3040268ac5d26da3210994b] [fs] nfs: fix file create failure with HPUX client
git-bisect good 7d1faaf64d8a7a44b3040268ac5d26da3210994b
# good: [bc6c8e57f8e0eff1425ccc14541cce85a483ed53] [scsi] ipr: set data list length in request control block
git-bisect good bc6c8e57f8e0eff1425ccc14541cce85a483ed53
# good: [a784d244a9ea0899c19c5a7715e894a1e32ba21b] [scsi] ipr: add MMIO write for BIST on 64-bit adapters
git-bisect good a784d244a9ea0899c19c5a7715e894a1e32ba21b
# good: [240133380ed4718fa74764f01cb67e084147b47a] [scsi] ipr: fix transition to operational on new adapters
git-bisect good 240133380ed4718fa74764f01cb67e084147b47a
# good: [5b55b930ee4adff785148eb03e49b6d1131bad60] [scsi] ipr: bump the version number and date
git-bisect good 5b55b930ee4adff785148eb03e49b6d1131bad60


Looks like the mpt2sas update is where this regressed.  These systems do use this driver for the root drive.
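
As a cross-check on the log above, the good/bad markers can be pulled out of a bisect log with a few lines of Python. This is only a sketch: the function names and the trimmed SAMPLE_LOG are made up for illustration, and the most recently marked bad commit is the culprit only if the bisection actually ran to completion.

```python
import re

# Comment lines written by "git bisect log", e.g.
# "# bad: [ceebd5d4...] [scsi] mpt2sas: update to 05.101.00.02"
MARK_RE = re.compile(r"^# (good|bad): \[([0-9a-f]+)\] (.*)$")


def parse_bisect_log(text):
    """Return a list of (verdict, sha, subject) tuples in log order."""
    marks = []
    for line in text.splitlines():
        match = MARK_RE.match(line.strip())
        if match:
            marks.append(match.groups())
    return marks


def last_bad(marks):
    """Return the most recently marked bad commit, or None if there is none."""
    bad = [mark for mark in marks if mark[0] == "bad"]
    return bad[-1] if bad else None


# Trimmed copy of the log from this comment (illustration only).
SAMPLE_LOG = """\
# good: [651320bcf36355b3ffc50783ea3547eed83e7b6c] tag: kernel-2.6.18-215.el5
# bad: [2c99776eeec3c4457edd1a379cf591046af228f9] tag: kernel-2.6.18-219.el5
# bad: [ceebd5d441199afbdead0a49ae34bdb4d24b7719] [scsi] mpt2sas: update to 05.101.00.02
# good: [5b55b930ee4adff785148eb03e49b6d1131bad60] [scsi] ipr: bump the version number and date
"""

if __name__ == "__main__":
    verdict, sha, subject = last_bad(parse_bisect_log(SAMPLE_LOG))
    print(sha[:12], subject)
```

Reading only the "# good:" / "# bad:" comment lines avoids depending on the spelling of the replay commands ("git-bisect" here, "git bisect" in newer git).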
Comment 9 Matthew Garrett 2010-10-19 10:08:13 EDT
I'm confused. The bug refers to an oops while loading acpi_memhotplug, while the original backtrace shows an oops while loading the power meter driver. Do both reproducibly occur? If this is in any way mptsas related then I can only see it being due to some sort of memory corruption...
Comment 10 Tom Coughlan 2010-10-19 10:18:47 EDT
You can run the -219.el5 kernel with the mpt2sas driver from -215.el5 (rebuild the initrd) to see if that has anything to do with it.
Comment 11 Nate Straz 2010-10-19 11:00:47 EDT
I'm having some trouble booting a -219.el5 kernel with the mpt2sas.ko from -215.el5.

Loading scsi_transport_sas.ko module
Loading mpt2sas.ko module
ksign: module signed with unknown public key
- signature keyid: 214227b7fd9649b7 ver=3
insmod: error inserting '/lib/mpt2sas.ko': -1 Opshpchp: Standard Hot Plug PCI C4
eration not permitted

What do I need to do to get around this issue?
Comment 12 Nate Straz 2010-10-19 11:08:22 EDT
(In reply to comment #9)
> I'm confused. The bug refers to an oops while loading acpi_memhotplug, while
> the original backtrace shows an oops while loading the power meter driver. Do
> both reproducibly occur? If this is in any way mptsas related then I can only
> see it being due to some sort of memory corruption...

It doesn't happen on every boot or on every node, but we hit it often enough to make the systems unusable for RHEL5.6 testing.  I ran the git bisect on the rhel5-kernel tree, running each step on five nodes, and the mpt2sas update was the reliable breaking point.  It's unfortunate that it is such a large patch.
Comment 13 Matthew Garrett 2010-10-19 11:28:33 EDT
Right, but what are you hitting? You have two oopses here - is it always one of these two, or does it occur in random parts of the ACPI code?
Comment 14 Nate Straz 2010-10-19 11:49:31 EDT
I've seen one of those two oopses or, more rarely, an oops in mpt2sas.

Unable to handle kernel NULL pointer dereference at 00000000000002a0 RIP: 
 [<ffffffff880d72e7>] :mpt2sas:mpt2sas_transport_update_links+0xf4/0x158
PGD 32fd98067 PUD 32ea7f067 PMD 0 
Oops: 0002 [1] SMP 
last sysfs file: /block/ram0/dev
CPU 0 
Modules linked in: mpt2sas scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 959, comm: fw_event0 Not tainted 2.6.18-219.el5 #1
RIP: 0010:[<ffffffff880d72e7>]  [<ffffffff880d72e7>] :mpt2sas:mpt2sas_transport_update_links+0xf4/0x158
RSP: 0000:ffff81032e4b1c20  EFLAGS: 00010293
RAX: 0000000000000005 RBX: 0000000000000246 RCX: 0000000000000000
RDX: 0000000000000010 RSI: ffff81062f25a038 RDI: ffff81032fd50818
RBP: ffff81032fd50a20 R08: ffff81032e4b0000 R09: 0000000000000037
R10: ffff81033a883710 R11: ffffffff880c805d R12: ffff81032ff2ef00
R13: ffff81032fd504f8 R14: 0000000000000009 R15: 0000000000000009
FS:  0000000000000000(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000002a0 CR3: 000000032e912000 CR4: 00000000000006e0
Process fw_event0 (pid: 959, threadinfo ffff81032e4b0000, task ffff81062f6ac080)
Stack:  ffff81033ab20ec0 0808ffff880cfb66 5a4badb01ae6f600 0000000000000008
 ffff81033ab20008 0000000000000009 ffff81032fd504f8 ffff81033ab20ec0
 ffff81062f75de40 ffffffff880d1cf2 0012000e0f000008 1ae6f60000010000
Call Trace:
 [<ffffffff880d1cf2>] :mpt2sas:_scsih_sas_topology_change_event+0x481/0x51c
 [<ffffffff8008cd0b>] find_busiest_group+0x20d/0x621
 [<ffffffff880d1fa7>] :mpt2sas:_firmware_event_work+0x21a/0x10b8
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff8002e474>] __wake_up+0x38/0x4f
 [<ffffffff880d1d8d>] :mpt2sas:_firmware_event_work+0x0/0x10b8
 [<ffffffff8004d9d4>] run_workqueue+0x99/0xf6
 [<ffffffff8004a204>] worker_thread+0x0/0x122
 [<ffffffff800a22b8>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a2f4>] worker_thread+0xf0/0x122
 [<ffffffff8008e1c8>] default_wake_function+0x0/0xe
 [<ffffffff800a22b8>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032b3a>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a22b8>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032a3c>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 89 82 90 02 00 00 41 f6 85 9a 00 00 00 04 74 45 49 8b 5c 24 
RIP  [<ffffffff880d72e7>] :mpt2sas:mpt2sas_transport_update_links+0xf4/0x158
 RSP <ffff81032e4b1c20>
CR2: 00000000000002a0
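
To sort a pile of console logs by where they died, the final "RIP" line of each oops can be parsed. The sketch below assumes the 2.6.18 oops layout seen in this report; the regex, the function name, and the two sample strings are illustrative, not part of any existing tool.

```python
import re

# Final line of an oops, e.g.
# "RIP  [<ffffffff880d72e7>] :mpt2sas:mpt2sas_transport_update_links+0xf4/0x158"
# The ":module:" prefix is absent for symbols in the core kernel image.
RIP_RE = re.compile(
    r"^RIP\s+\[<([0-9a-f]+)>\]\s+(?::(\w+):)?(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def faulting_symbol(oops_text):
    """Return module, function and offset of the faulting instruction, or None."""
    for line in oops_text.splitlines():
        match = RIP_RE.match(line.strip())
        if match:
            _addr, module, func, offset, size = match.groups()
            return {
                "module": module or "kernel",  # "kernel" means not in a module
                "function": func,
                "offset": int(offset, 16),
                "size": int(size, 16),
            }
    return None


# Lines copied from the oopses in this report (illustration only).
ACPI_OOPS = """\
Unable to handle kernel NULL pointer dereference at 0000000000000002 RIP:
RIP  [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x16/0x54
"""
MPT2SAS_OOPS = """\
RIP  [<ffffffff880d72e7>] :mpt2sas:mpt2sas_transport_update_links+0xf4/0x158
"""

if __name__ == "__main__":
    for text in (ACPI_OOPS, MPT2SAS_OOPS):
        print(faulting_symbol(text))
```

The pattern deliberately skips the "RIP: 0010:[<...>]" register line, which the console often truncates, and keys on the untruncated summary line at the end of the oops instead.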
Comment 15 Tomas Henzl 2010-10-19 12:12:30 EDT
There were some issues with mpt2sas version 4 firmware. Even though this doesn't look exactly like the problems we have had, please check that you are using the latest firmware to be sure; it should be the 'phase 5' firmware, I think.
Comment 16 Nate Straz 2010-10-19 12:27:36 EDT
Can you match "version 4 firmware" or "phase 5 firmware" against the output below?

Linux version 2.6.18-219.el5 (mockbuild@x86-009.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Thu Sep 9 17:10:23 EDT 2010
...
mpt2sas0: LSISAS2008: FWVersion(02.15.63.00), ChipRevision(0x02), BiosVersion(07.01.09.00)
mpt2sas0: Dell PERC H200 Integrated: Vendor(0x1000), Device(0x0072), SSVID(0x1028), SSDID(0x1F1E)
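
For comparing nodes, the Name(value) fields in the two mpt2sas banner lines above can be collected into a dictionary. This is a sketch with made-up helper names; it only extracts the strings the driver prints, and nothing in this report says how a FWVersion value maps to an LSI firmware "phase", so no such mapping is attempted.

```python
import re

# Matches "Name(value)" pairs such as "FWVersion(02.15.63.00)".
FIELD_RE = re.compile(r"(\w+)\(([^)]*)\)")


def parse_banner(line):
    """Return the Name(value) fields of one mpt2sas banner line as a dict."""
    return dict(FIELD_RE.findall(line))


def version_tuple(version):
    """Turn a dotted version like '02.15.63.00' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# The two banner lines from this comment.
CHIP_LINE = (
    "mpt2sas0: LSISAS2008: FWVersion(02.15.63.00), "
    "ChipRevision(0x02), BiosVersion(07.01.09.00)"
)
BOARD_LINE = (
    "mpt2sas0: Dell PERC H200 Integrated: Vendor(0x1000), "
    "Device(0x0072), SSVID(0x1028), SSDID(0x1F1E)"
)

if __name__ == "__main__":
    chip = parse_banner(CHIP_LINE)
    print(chip["FWVersion"], version_tuple(chip["FWVersion"]))
    print(parse_banner(BOARD_LINE))
```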
Comment 17 Tomas Henzl 2010-10-19 15:38:41 EDT
(In reply to comment #16)
> Can you match "version 4 firmware" or "phase 5 firmware" against the output
> below?

Hmm, don't know, you can try the firmware utility from the LSI site 
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/external/sas9200-8e/index.html
You will have to match up your controller on the website.
I'm not sure this will work with the Dell version ...

> Linux version 2.6.18-219.el5 (mockbuild@x86-009.build.bos.redhat.com) (gcc
> version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Thu Sep 9 17:10:23 EDT 2010
> ...
> mpt2sas0: LSISAS2008: FWVersion(02.15.63.00), ChipRevision(0x02),
> BiosVersion(07.01.09.00)
> mpt2sas0: Dell PERC H200 Integrated: Vendor(0x1000), Device(0x0072),
> SSVID(0x1028), SSDID(0x1F1E)
Comment 18 Nate Straz 2010-10-19 17:50:38 EDT
I contacted Dell and confirmed that I do have the latest firmware from them for the PERC H200.

Dell noticed that my driver was out of date and recommended installing their driver.

http://ftp.us.dell.com/SAS-RAID/R266980-mpt2sas-02.00.03.00-1.tar.gz

Which now totally confuses me because the mpt2sas version I see in the RHEL5.6 kernel is 05.101.00.02.
Comment 19 Tomas Henzl 2010-10-21 09:19:33 EDT
(In reply to comment #18)
> I contacted Dell and confirmed that I do have the latest firmware from them for
> the PERC H200.
> 
> Dell noticed that my driver was out of date and recommended installing their
> driver.
> 
> http://ftp.us.dell.com/SAS-RAID/R266980-mpt2sas-02.00.03.00-1.tar.gz
> 
> Which now totally confuses me because the mpt2sas version I see in the RHEL5.6
> kernel is 05.101.00.02.

The version string in mpt2sas changes at a fast pace: upstream is now at 06, and in RHEL5 we have something that corresponds to upstream 05.
From the upstream git log:
[SCSI] mpt2sas: Bump version 05.100.00.02 (6 months ago)
...
[SCSI] mpt2sas: Bump version 02.100.03.00 (1 year, 1 month ago)

From looking at mpt2sas-02.00.03.00-1.tar.gz, I think it is based on version 02, so it is about a year old. What this means for the firmware, I don't know.
Comment 20 Tomas Henzl 2010-10-21 09:27:09 EDT
Kashyap,
is the firmware version OK, or can we use a newer version from LSI?

Fact is that the symptoms we had with older firmware and a new driver were different, so we shouldn't concentrate too much on firmware.
Comment 22 Nate Straz 2010-10-22 15:56:00 EDT
I got a test -219.el5 kernel with the mpt2sas update reverted from Tomas.  I was able to boot it on all five nodes reliably.
Comment 23 kashyap 2010-10-25 05:27:11 EDT
(In reply to comment #22)
> I got a test -219.el5 kernel with the mpt2sas update reverted from Tomas.  I
> was able to boot it on all five nodes reliably.

(In reply to comment #20)
> Kashyap,
> is the firmware version OK, or can we use a newer version from LSI?
> 
> Fact is that the symptoms we had with older firmware and a new driver were
> different so we shouldn't concentrate too much on firmware.

The LSI website has the latest PHASE-7 firmware.
I see you recommended the link below to the customer:
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/external/sas9200-8e/index.html

The link above has the phase-7 package.

~ Kashyap
Comment 24 Tomas Henzl 2010-10-25 07:23:39 EDT
(In reply to comment #23)
> LSI website is having latest PHASE-7 firmware.
> I have seen you have recommended below link to customer
> http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/external/sas9200-8e/index.html
> 
> Above link is having phase-7 package. 

Kashyap,
Is it safe to use the firmware from the LSI's site for a Dell controller, and will the flash utility work?
Tomas
Comment 25 Tom Coughlan 2010-10-26 10:27:18 EDT
Nate, we will need to make a decision on this tomorrow, in order for the 5.6
beta to stay on schedule. Have you tried multiple tests of each of the
scenarios that caused a panic, and they have all run without a problem on the
test kernel? If so, we will need to remove that driver update from 5.6 Beta. It
is important to confirm this because you seem to have had several different
problems, some of which do not appear to be related to the driver.
Comment 26 kashyap 2010-10-26 10:53:35 EDT
(In reply to comment #24)
> (In reply to comment #23)
> > LSI website is having latest PHASE-7 firmware.
> > I have seen you have recommended below link to customer
> > http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/external/sas9200-8e/index.html
> > 
> > Above link is having phase-7 package. 
> 
> Kashyap,
> Is it safe to use the firmware from the LSI's site for a Dell controller, and
> will the flash utility work?
> Tomas

Tomas, my doubt was valid. Here is the input from our system engineers:
The LSI website contains concatenated FW (FW+NVDATA) for LSI production HBAs. The DELL HBAs are customized and will have customized NVDATA.

Hence it is not advisable to use it.

~ Kashyap
Comment 27 Nate Straz 2010-10-26 11:09:58 EDT
(In reply to comment #25)
> Nate, we will need to make a decision on this tomorrow, in order for the 5.6
> beta to stay on schedule. Have you tried multiple tests of each of the
> scenarios that caused a panic, and they have all run without a problem on the
> test kernel? If so, we will need to remove that driver update from 5.6 Beta. It
> is important to confirm this because you seem to have had several different
> problems, some of which do not appear to be related to the driver.

There is only one scenario and that's booting the system.  I've rebooted the system many times since installing the mptsas_revert_ kernel and I have yet to hit the problems seen with the updated mpt2sas driver.

The word from Dell is that the 1.x driver runs the controller in a degraded mode and a 2.x or later driver is recommended.
Comment 29 Tomas Henzl 2010-10-26 11:26:31 EDT
(In reply to comment #26)
> 
> Tomas, My doubt was valid. here is input from our system engineers.
> The LSI website contains concatenated FW (FW+NVDATA) for LSI production HBAs.
> The DELL HBAs are customized and will have customized NVDATA.
> 
> Hence it is not advisable to use it.
> 
So we can't upgrade the firmware - any other debug methods? The update to 05.101.00.02 is one big patch, so no bisection is possible.
It seems to happen during device init; in mpt2sas_base_map_resources the only new thing is a call to pci_enable_pcie_error_reporting.
Comment 30 Matthew Garrett 2010-10-26 11:31:27 EDT
Given that the revert fixes this, I'm pretty sure this isn't an ACPI bug. Mind if it gets reassigned to someone more appropriate?
Comment 32 Shyam Iyer 2010-10-27 09:47:23 EDT
I checked Dell's internal release portals: there is no firmware update, released or unreleased, newer than what is already on the systems.

Dell would be willing to test with test kernels. Pinging the Dell PERC teams on this.
Comment 34 Marizol Martinez 2010-10-27 10:08:55 EDT
Could someone please post a link to a test kernel the Dell team can access for testing (i.e., externally accessible)? Thanks!
Comment 35 Shyam Iyer 2010-10-27 10:38:46 EDT
(In reply to comment #29)
> (In reply to comment #26)
> > 
> > Tomas, My doubt was valid. here is input from our system engineers.
> > The LSI website contains concatenated FW (FW+NVDATA) for LSI production HBAs.
> > The DELL HBAs are customized and will have customized NVDATA.
> > 
> > Hence it is not advisable to use it.
> > 
> So we can't upgrade the firmware - any other debug methods? The update to 
> 05.101.00.02 as an big patch, so no bisection is possible. 

It would be useful to get the tty logs out of the controller.

> It seems to happen during the device init, in the mpt2sas_base_map_resources is
> in that new only a call to pci_enable_pcie_error_reporting..

Advanced error reporting seems like the place to start in the patch hunk.
Comment 36 Tomas Henzl 2010-10-27 11:59:28 EDT
I've reverted the PCI AER patch. I don't expect much from it, but let us start with this. It's being compiled here -> https://brewweb.devel.redhat.com/taskinfo?taskID=2855115
I have only blindly removed the upstream AER patch; I don't know if this will even compile.
Comment 37 Nate Straz 2010-10-27 15:29:00 EDT
Out of five systems, four panicked.  The panic messages are below.


Setting hostname buzz-01:  [  OK  ]
ACPI: Power Button (FF) [PWRF]
ACPI: Mapper loaded
dell-wmi: No known WMI GUID found
Unable to handle kernel NULL pointer dereference at 0000000000000002 RIP:
 [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x16/0x54
PGD 32afec067 PUD 32e613067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /kernel/uevent_seqnum
CPU 0
Modules linked in: power_meter hwmon i2c_ec i2c_core dell_wmi wmi button batterd
Pid: 3554, comm: modprobe Not tainted 2.6.18-229.el5_mptsas_aer_revert_ #1
RIP: 0010:[<ffffffff80184ec4>]  [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x164
RSP: 0000:ffff81062c025ca8  EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff81062d5e1400 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff81062d5e1400 RDI: 0000000000000001
RBP: ffff81062ff039a0 R08: ffffc2000001062e R09: 0000000000000000
R10: ffff81062af7a320 R11: 0000000000000050 R12: ffff81062d6c4420
R13: 000000000000000e R14: ffffc2000001062d R15: ffff81062d6c44a0
FS:  00002ad2eb19b6e0(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000002 CR3: 000000032bc01000 CR4: 00000000000006e0
Process modprobe (pid: 3554, threadinfo ffff81062c024000, task ffff81062d7d6040)
Stack:  ffff81062d6c44a0 ffffffff801825e2 ffff81062ff038e0 ffff81062d5e1400
 ffff81062d5e1400 ffff81062d6c4420 0000000000000012 ffffffff80183215
 ffff81062d5e1400 0000000000000000 ffff81062d5e1400 0000000000000000
Call Trace:
 [<ffffffff801825e2>] acpi_ds_eval_data_object_operands+0x6b/0xef
 [<ffffffff80183215>] acpi_ds_exec_end_op+0x297/0x408
 [<ffffffff801924ed>] acpi_ps_parse_loop+0x602/0x94f
 [<ffffffff80191a74>] acpi_ps_parse_aml+0x80/0x254
 [<ffffffff80192cec>] acpi_ps_execute_pass+0x82/0x98
 [<ffffffff80192e0f>] acpi_ps_execute_method+0xd3/0x169
 [<ffffffff8018fe6d>] acpi_ns_evaluate+0xa8/0x10a
 [<ffffffff8018fa69>] acpi_evaluate_object+0x131/0x1e0
 [<ffffffff80064604>] __down_read+0x12/0x92
 [<ffffffff8849008e>] :power_meter:read_capabilities+0x5f/0x215
 [<ffffffff884910a8>] :power_meter:acpi_power_meter_add+0x10a/0x17a
 [<ffffffff801a41a1>] acpi_bus_driver_init+0x30/0x57
 [<ffffffff801a53bf>] acpi_bus_register_driver+0x95/0xd4
 [<ffffffff8830a032>] :power_meter:acpi_power_meter_init+0x25/0x34
 [<ffffffff800a8d1e>] sys_init_module+0xaf/0x1f2
 [<ffffffff8005d116>] system_call+0x7e/0x83


Code: 11 00 49 89 f0 45 31 c9 48 c7 c2 15 7f 2c 80 be c9 01 00 00
RIP  [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x16/0x54
 RSP <ffff81062c025ca8>
CR2: 0000000000000002
 <0>Kernel panic - not syncing: Fatal exception
---------------------------------------------------------------------
Setting hostname buzz-03:  [  OK  ]
Unable to handle kernel paging request at 00000000802a0beb RIP:
 [<ffffffff80182f99>] acpi_ds_exec_end_op+0x1b/0x408
PGD 62fc56067 PUD 0
Oops: 0000 [1] SMP
last sysfs file: /kernel/uevent_seqnum
CPU 0
Modules linked in: acpi_memhotplug ac parport_pc lp parport sr_mod cdrom joydevd
Pid: 3404, comm: modprobe Not tainted 2.6.18-229.el5_mptsas_aer_revert_ #1
RIP: 0010:[<ffffffff80182f99>]  [<ffffffff80182f99>] acpi_ds_exec_end_op+0x1b/08
RSP: 0000:ffff81062e0efc90  EFLAGS: 00010206
RAX: 00000000802a0be0 RBX: ffff81062de32400 RCX: ffff81062decf3f8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff81062de32400
RBP: ffff81062de32400 R08: 0000000000000016 R09: 0000000000000000
R10: ffff81062fdcf460 R11: 0000000000000050 R12: ffff81062decf3f8
R13: ffff81062de32428 R14: ffffc20000011d71 R15: ffff81062eb05460
FS:  00002aba0f4d96e0(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000802a0beb CR3: 000000062de18000 CR4: 00000000000006e0
Process modprobe (pid: 3404, threadinfo ffff81062e0ee000, task ffff81062fbee7a0)
Stack:  0000000000000000 ffff81062de32400 0000000000000000 ffff81062de32428
 ffffffff801924ed 00000000000000d0 0000000000000000 ffff81062decf3f8
 ffff81033a9a7f40 ffff81062de32400 0000000000000000 ffff81062fdcf500
Call Trace:
 [<ffffffff801924ed>] acpi_ps_parse_loop+0x602/0x94f
 [<ffffffff80191a74>] acpi_ps_parse_aml+0x80/0x254
 [<ffffffff80192cec>] acpi_ps_execute_pass+0x82/0x98
 [<ffffffff80192e0f>] acpi_ps_execute_method+0xd3/0x169
 [<ffffffff8018fe6d>] acpi_ns_evaluate+0xa8/0x10a
 [<ffffffff801961a9>] acpi_ut_evaluate_object+0x63/0x196
 [<ffffffff88424570>] :acpi_memhotplug:acpi_memory_register_notify_handler+0x0/7
 [<ffffffff80196357>] acpi_ut_execute_STA+0x1f/0x4f
 [<ffffffff80190828>] acpi_get_object_info+0x13b/0x1d1
 [<ffffffff884242b9>] :acpi_memhotplug:is_memory_device+0x1f/0x6e
 [<ffffffff88424579>] :acpi_memhotplug:acpi_memory_register_notify_handler+0x9/7
 [<ffffffff8019117a>] acpi_ns_walk_namespace+0x111/0x134
 [<ffffffff88424570>] :acpi_memhotplug:acpi_memory_register_notify_handler+0x0/7
 [<ffffffff8018f7bd>] acpi_walk_namespace+0x5d/0x83
 [<ffffffff88330037>] :acpi_memhotplug:acpi_memory_device_init+0x37/0x7a
 [<ffffffff800a8d1e>] sys_init_module+0xaf/0x1f2
 [<ffffffff8005d116>] system_call+0x7e/0x83


Code: 8a 50 0b 8a 48 0c 80 fa 0a 75 25 41 0f b7 4c 24 0a 48 8b 3d
RIP  [<ffffffff80182f99>] acpi_ds_exec_end_op+0x1b/0x408
 RSP <ffff81062e0efc90>
CR2: 00000000802a0beb
 <0>Kernel panic - not syncing: Fatal exception
---------------------------------------------------------------------
Setting hostname buzz-04:  [  OK  ]
ACPI: Power Button (FF) [PWRF]
ACPI: Mapper loaded
dell-wmi: No known WMI GUID found
Unable to handle kernel NULL pointer dereference at 0000000000000002 RIP:
 [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x16/0x54
PGD 62ce16067 PUD 629d77067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /kernel/uevent_seqnum
CPU 0
Modules linked in: power_meter hwmon i2c_ec i2c_core dell_wmi wmi button batterd
Pid: 3420, comm: modprobe Not tainted 2.6.18-229.el5_mptsas_aer_revert_ #1
RIP: 0010:[<ffffffff80184ec4>]  [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x164
RSP: 0000:ffff81062da43ca8  EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff81062d988c00 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff81062d988c00 RDI: 0000000000000001
RBP: ffff81062feb7320 R08: ffffc2000001062e R09: 0000000000000000
R10: ffff81062d193320 R11: 0000000000000050 R12: ffff81062cd21420
R13: 000000000000000e R14: ffffc2000001062d R15: ffff81062cd214a0
FS:  00002b5eabc6c6e0(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000002 CR3: 000000062fe0c000 CR4: 00000000000006e0
Process modprobe (pid: 3420, threadinfo ffff81062da42000, task ffff81062eb80100)
Stack:  ffff81062cd214a0 ffffffff801825e2 ffff81062feb7f20 ffff81062d988c00
 ffff81062d988c00 ffff81062cd21420 0000000000000012 ffffffff80183215
 ffff81062d988c00 0000000000000000 ffff81062d988c00 0000000000000000
Call Trace:
 [<ffffffff801825e2>] acpi_ds_eval_data_object_operands+0x6b/0xef
 [<ffffffff80183215>] acpi_ds_exec_end_op+0x297/0x408
 [<ffffffff801924ed>] acpi_ps_parse_loop+0x602/0x94f
 [<ffffffff80191a74>] acpi_ps_parse_aml+0x80/0x254
 [<ffffffff80192cec>] acpi_ps_execute_pass+0x82/0x98
 [<ffffffff80192e0f>] acpi_ps_execute_method+0xd3/0x169
 [<ffffffff8018fe6d>] acpi_ns_evaluate+0xa8/0x10a
 [<ffffffff8018fa69>] acpi_evaluate_object+0x131/0x1e0
 [<ffffffff80064604>] __down_read+0x12/0x92
 [<ffffffff8842a08e>] :power_meter:read_capabilities+0x5f/0x215
 [<ffffffff8842b0a8>] :power_meter:acpi_power_meter_add+0x10a/0x17a
 [<ffffffff801a41a1>] acpi_bus_driver_init+0x30/0x57
 [<ffffffff801a53bf>] acpi_bus_register_driver+0x95/0xd4
 [<ffffffff883a0032>] :power_meter:acpi_power_meter_init+0x25/0x34
 [<ffffffff800a8d1e>] sys_init_module+0xaf/0x1f2
 [<ffffffff8005d116>] system_call+0x7e/0x83


Code: 11 00 49 89 f0 45 31 c9 48 c7 c2 15 7f 2c 80 be c9 01 00 00
RIP  [<ffffffff80184ec4>] acpi_ds_obj_stack_pop+0x16/0x54
 RSP <ffff81062da43ca8>
CR2: 0000000000000002
 <0>Kernel panic - not syncing: Fatal exception
---------------------------------------------------------------------
Setting hostname buzz-05:  [  OK  ]
Unable to handle kernel paging request at 00000000802a0beb RIP:
 [<ffffffff80182f99>] acpi_ds_exec_end_op+0x1b/0x408
PGD 62ed86067 PUD 0
Oops: 0000 [1] SMP
last sysfs file: /kernel/uevent_seqnum
CPU 10
Modules linked in: acpi_memhotplug ac parport_pc lp parport sr_mod cdrom ata_pid
Pid: 3418, comm: modprobe Not tainted 2.6.18-229.el5_mptsas_aer_revert_ #1
RIP: 0010:[<ffffffff80182f99>]  [<ffffffff80182f99>] acpi_ds_exec_end_op+0x1b/08
RSP: 0018:ffff81062e75bc90  EFLAGS: 00010206
RAX: 00000000802a0be0 RBX: ffff81033ab0b400 RCX: ffff81062b6673f8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff81033ab0b400
RBP: ffff81033ab0b400 R08: 0000000000000016 R09: 0000000000000000
R10: ffff81062d612460 R11: 0000000000000050 R12: ffff81062b6673f8
R13: ffff81033ab0b428 R14: ffffc20000011d71 R15: ffff81062e87a460
FS:  00002b2dfd8466e0(0000) GS:ffff81033a97e940(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000802a0beb CR3: 000000062b8e1000 CR4: 00000000000006e0
Process modprobe (pid: 3418, threadinfo ffff81062e75a000, task ffff81062fe1e7a0)
Stack:  0000000000000000 ffff81033ab0b400 0000000000000000 ffff81033ab0b428
 ffffffff801924ed 00000000000000d0 0000000000000000 ffff81062b6673f8
 ffff81062f74f1c0 ffff81033ab0b400 0000000000000000 ffff81062d612500
Call Trace:
 [<ffffffff801924ed>] acpi_ps_parse_loop+0x602/0x94f
 [<ffffffff80191a74>] acpi_ps_parse_aml+0x80/0x254
 [<ffffffff80192cec>] acpi_ps_execute_pass+0x82/0x98
 [<ffffffff80192e0f>] acpi_ps_execute_method+0xd3/0x169
 [<ffffffff8018fe6d>] acpi_ns_evaluate+0xa8/0x10a
 [<ffffffff801961a9>] acpi_ut_evaluate_object+0x63/0x196
 [<ffffffff88424570>] :acpi_memhotplug:acpi_memory_register_notify_handler+0x0/7
 [<ffffffff80196357>] acpi_ut_execute_STA+0x1f/0x4f
 [<ffffffff80190828>] acpi_get_object_info+0x13b/0x1d1
 [<ffffffff884242b9>] :acpi_memhotplug:is_memory_device+0x1f/0x6e
 [<ffffffff88424579>] :acpi_memhotplug:acpi_memory_register_notify_handler+0x9/7
 [<ffffffff8019117a>] acpi_ns_walk_namespace+0x111/0x134
 [<ffffffff88424570>] :acpi_memhotplug:acpi_memory_register_notify_handler+0x0/7
 [<ffffffff8018f7bd>] acpi_walk_namespace+0x5d/0x83
 [<ffffffff88347037>] :acpi_memhotplug:acpi_memory_device_init+0x37/0x7a
 [<ffffffff800a8d1e>] sys_init_module+0xaf/0x1f2
 [<ffffffff8005d116>] system_call+0x7e/0x83


Code: 8a 50 0b 8a 48 0c 80 fa 0a 75 25 41 0f b7 4c 24 0a 48 8b 3d
RIP  [<ffffffff80182f99>] acpi_ds_exec_end_op+0x1b/0x408
 RSP <ffff81062e75bc90>
CR2: 00000000802a0beb
 <0>Kernel panic - not syncing: Fatal exception
Comment 38 kashyap 2010-10-28 07:45:48 EDT
Tomas,

This is confusing to me. Which issue are we targeting here: the mpt2sas driver or the ACPI code? Out of all the oopses so far, I have seen only one that is mpt2sas related. Which one am I supposed to look at?
The mpt2sas-related oops is in comment #14. In that oops we are already past mapping resources; the driver is failing somewhere in event processing.

The oops happened somewhere in _scsih_sas_topology_change_event.

If we are targeting this issue, I need driver debug logs. Please make an initrd image with mpt2sas driver logging set to 0xfffffff: edit /etc/modprobe.conf
options mpt2sas          logging_level=0xFFFFFFFFF

and rebuild the initrd. This will print a good amount of debug logs on the console.

For Dell-specific firmware, the customer needs to check Dell's support website. Firmware from LSI's website is not recommended for Dell-provided controllers.

~ Kashyap
Comment 39 Nate Straz 2010-10-28 10:08:08 EDT
The issue here is that the update of the mpt2sas driver between the -215.el5 and the -219.el5 caused a regression which prevents systems with a Dell PERC H200 controller from booting reliably.  This was confirmed by reverting the driver update from the -219.el5 kernel.

I tried your logging_level option and the insmod failed because the value was invalid:

Loading mpt2sas.ko module
mpt2sas: `0xFFFFFFFFF' invalid for parameter `logging_level'
insmod: error inserting '/lib/mpt2sas.ko': -1 Invalid parameters
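
One plausible reading of this rejection, offered as an assumption rather than something confirmed in this report: if the parameter is parsed as a signed 32-bit integer, a value with nine hex digits needs 36 bits and cannot survive the conversion. The sketch below models that check in Python; the function name is hypothetical and is not part of the driver or of insmod.

```python
# Hypothetical model of a parser that rejects values which do not
# round-trip through a signed 32-bit integer. Not the kernel's code.
def fits_in_int32(text):
    value = int(text, 0)  # base 0 accepts the 0x prefix
    # Emulate a C cast to a signed 32-bit int.
    wrapped = value & 0xFFFFFFFF
    if wrapped >= 0x80000000:
        wrapped -= 0x100000000
    return wrapped == value

print(fits_in_int32("0xFFFFFFFFF"))  # nine F digits (36 bits)
print(fits_in_int32("0x7FFFFFFF"))   # largest positive 32-bit signed value
```

Under that assumption, any mask of up to seven hex digits would be accepted, while the nine-digit value above would not.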
Comment 40 Tom Coughlan 2010-10-28 10:33:39 EDT
Kashyap,

As Nate said, all we know at this point is that if we remove this patch:

mpt2sas: update to 05.101.00.02

all the problems go away. That does not make much sense, based on the stack traces, but that seems to have been verified (comment 27). 

Removal of just the "pci aer patch" (comment 36) did not help. Still had four of five systems panic (comment 37). 

> mpt2sas related oops are in comment#14. For that oops we are already done from
> map resources.. Driver is failing somewhere in Event processing...
> 
> Opps happended somewhere here _scsih_sas_topology_change_event..!!

Nate,

As I understand it, the stack trace in comment#14 does not happen very often. Is that right? It may be that that one should not be the primary focus here. Hopefully Kashyap can get the mpt2sas logging working. Then maybe try that with the stock 5.6 kernel, and see what we can get. 

Tom
Comment 41 kashyap 2010-10-28 11:01:20 EDT
(In reply to comment #40)
> Kashyap,
> 
> As Nate said, all we know at this point is that if we remove this patch:
> 
> mpt2sas: update to 05.101.00.02
> 
> all the problems go away. That does not make much sense, based on the stack
> traces, but that seems to have been verified (comment 27). 
> 
> Removal of just the "pci aer patch" (comment 36) did not help. Still had four
> of five systems panic (comment 37). 
> 
> > mpt2sas related oops are in comment#14. For that oops we are already done from
> > map resources.. Driver is failing somewhere in Event processing...
> > 
> > Opps happended somewhere here _scsih_sas_topology_change_event..!!
> 
> Nate,
> 
> As I understand it, the stack trace in comment#14 does not happen very often.
> Is that right? It may be that that one should not be the primary focus here.
> Hopefully Kashyap can get the mpt2sas logging working. Then maybe try that with
> the stock 5.6 kernel, and see what we can get. 
My mistake. There is one extra F in logging_level. Make it logging_level=0xFFFFFFF (seven Fs).

So can we concentrate on the mpt2sas issue? Otherwise we will end up in an infinite loop. Can you give some defined steps that end in the mpt2sas-related crash? That way we can at least target one issue.

> 
> Tom
Comment 42 Nate Straz 2010-10-28 11:35:15 EDT
I finally got the logging_level setting working with 2.6.18-229.el5_mptsas_aer_revert_.  I used 0x1fffff since those are all the bits defined in mpt2sas_debug.h.
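
For reference, a small sketch of what the 0x1fffff mask covers when treated as a plain bit field. It only does arithmetic on the value; it does not reproduce the flag names in mpt2sas_debug.h, which are not shown in this report.

```python
# The logging mask used above, viewed as a set of bit positions.
LOGGING_LEVEL = 0x1fffff

set_bits = [bit for bit in range(32) if LOGGING_LEVEL & (1 << bit)]

# Number of bits set, lowest bit position, highest bit position.
print(len(set_bits), set_bits[0], set_bits[-1])

# Equivalent construction: every bit from 0 up to and including 20.
all_low_21_bits = (1 << 21) - 1
print(hex(all_low_21_bits))
```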

All five systems panicked on boot with the same backtrace in mpt2sas.  Hopefully this will get you guys closer to what the real issue is.

Loading mpt2sas.ko module
setting logging_level(0x001fffff)
mpt2sas version 05.101.00.02 loaded
scsi0 : Fusion MPT SAS Host
GSI 22 sharing vector 0x5A and IRQ 22
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 33 (level, low) -> IRQ 90
mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (24678268 kB)
mpt2sas0: msix is supported, vector_count(15), table_offset(0x0000e000), table()
mpt2sas0: PCI-MSI-X enabled: IRQ 98
mpt2sas0: iomem(0x00000000df2b0000), mapped(0xffffc20000060000), size(65536)
mpt2sas0: ioport(0x000000000000fc00), size(256)
mpt2sas0: hba queue depth(3439), max chains per io(96)
mpt2sas0: request frame size(128), reply frame size(128)
mpt2sas0: sending diag reset !!
usb 5-2: new full speed USB device using uhci_hcd and address 2
usb 5-2: configuration #1 chosen from 1 choice
input: Avocent USB Composite Device-0 as /class/input/input0
input: USB HID v1.00 Keyboard [Avocent USB Composite Device-0] on usb-0000:00:12
input: Avocent USB Composite Device-0 as /class/input/input1
input: USB HID v1.00 Mouse [Avocent USB Composite Device-0] on usb-0000:00:1d.02
mpt2sas0: diag reset: SUCCESS
mpt2sas0: scatter gather: sge_in_main_msg(1), sge_per_chain(9), sge_per_io(128))
mpt2sas0: scsi host: can_queue depth (467)
mpt2sas0: request pool(0xffff81062f200000): depth(600), frame_size(128), pool_s)
mpt2sas0: chain pool(0xffff81062f212c80): depth(7035), frame_size(128), pool_si)
mpt2sas0: request pool: dma(0x62f200000)
mpt2sas0: scsiio(0xffff81062f200000): depth(469)
mpt2sas0: hi_priority(0xffff81062f20eb00): depth(63), start smid(470)
mpt2sas0: internal(0xffff81062f210a80): depth(68), start smid(533)
mpt2sas0: sense pool(0xffff81062f1f0000): depth(469), element_size(96), pool_si)
mpt2sas0: sense_dma(0x62f1f0000)
mpt2sas0: reply pool(0xffff81062f300000): depth(640), frame_size(128), pool_siz)
mpt2sas0: reply_dma(0x62f300000)
mpt2sas0: reply_free pool(0xffff81062f1c7000): depth(640), element_size(4), poo)
mpt2sas0: reply_free_dma(0x62f1c7000)
mpt2sas0: reply post free pool(0xffff81062f1e8000): depth(1248), element_size(8)
mpt2sas0: reply_post_free_dma = (0x62f1e8000)
mpt2sas0: config page(0xffff81062f1c8000): size(512)
mpt2sas0: config_page_dma(0x62f1c8000)
mpt2sas0: Allocated physical memory: size(1091 kB)
mpt2sas0: Current Controller Queue Depth(467), Max Controller Queue Depth(3439)
mpt2sas0: Scatter Gather Elements per IO(128)
mpt2sas0: LSISAS2008: FWVersion(02.15.63.00), ChipRevision(0x02), BiosVersion(0)
mpt2sas0: Dell PERC H200 Integrated: Vendor(0x1000), Device(0x0072), SSVID(0x10)
mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buf)
mpt2sas0: sending port enable !!
mpt2sas0: Discovery: (start)
mpt2sas0: SAS Enclosure Device Status Change
mpt2sas0: SAS Topology Change List
 phy-0:0: add: handle(0x0001), sas_addr(0x5a4badb01ae71800)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
mpt2sas0: SAS Topology Change List
 phy-0:1: add: handle(0x0001), sas_addr(0x5a4badb01ae71800)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:2: add: handle(0x0001), sas_addr(0x5a4badb01ae71800)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:3: add: handle(0x0001), sas_addr(0x5a4badb01ae71800)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:4: add: handle(0x0001), sas_addr(0x5a4badb01ae71800)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:5: add: handle(0x0001), sas_addr(0x5a4badb01ae71800)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:6: add: handle(0x0001), sas_addr(0x5a4badb01ae71800)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:7: add: handle(0x0001), sas_addr(0x5a4badb01ae71800)
        attached_handle(0x000a), sas_addr(0x500000e115365c42)
mpt2sas0: host_add: handle(0x0001), sas_addr(0x5a4badb01ae71800), phys(8)
        handle(0x0001), enclosure logical id(0x5a4badb01ae71800) number slots(0)
mpt2sas0: updating handles for sas_host(0x5a4badb01ae71800)
 phy-0:0: refresh: parent sas_addr(0x5a4badb01ae71800),
        link_rate(0x00), phy(0)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:1: refresh: parent sas_addr(0x5a4badb01ae71800),
        link_rate(0x00), phy(1)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:2: refresh: parent sas_addr(0x5a4badb01ae71800),
        link_rate(0x00), phy(2)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:3: refresh: parent sas_addr(0x5a4badb01ae71800),
        link_rate(0x00), phy(3)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:4: refresh: parent sas_addr(0x5a4badb01ae71800),
        link_rate(0x00), phy(4)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:5: refresh: parent sas_addr(0x5a4badb01ae71800),
        link_rate(0x00), phy(5)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:6: refresh: parent sas_addr(0x5a4badb01ae71800),
        link_rate(0x00), phy(6)
        attached_handle(0x0000), sas_addr(0x0000000000000000)
 phy-0:7: refresh: parent sas_addr(0x5a4badb01ae71800),
        link_rate(0x0a), phy(7)
        attached_handle(0x000a), sas_addr(0x500000e115365c42)
Unable to handle kernel NULL pointer dereference at 00000000000001c8 RIP:
 [<ffffffff801c961b>] dev_driver_string+0x0/0x27
PGD 32e918067 PUD 32e917067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: mpt2sas scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcdd
Pid: 959, comm: fw_event0 Not tainted 2.6.18-229.el5_mptsas_aer_revert_ #1
RIP: 0010:[<ffffffff801c961b>]  [<ffffffff801c961b>] dev_driver_string+0x0/0x27
RSP: 0000:ffff81032ea3dc18  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 5b4ba0b01ae71800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff81062f1c8038 RDI: 0000000000000000
RBP: ffff81032fcbba20 R08: ffff81032ea3c000 R09: 0000000000000037
R10: ffff81033a883710 R11: 0000000000000000 R12: 0000000000000000
R13: ffff81032fcbb4f8 R14: 0000000000000009 R15: 0000000000000009
FS:  0000000000000000(0000) GS:ffffffff80424000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000001c8 CR3: 000000032e915000 CR4: 00000000000006e0
Process fw_event0 (pid: 959, threadinfo ffff81032ea3c000, task ffff81062fb93860)
Stack:  ffffffff880d717d ffff81062f7a5080 0808ffff880cfb76 5a4badb01ae71800
 0000000000000008 ffff81062f7a0008 0000000000000009 ffff81032fcbb4f8
 ffff81062f7a5080 ffff81033ab69d20 ffffffff880d1b88 0000003000000030
Call Trace:
 [<ffffffff880d717d>] :mpt2sas:mpt2sas_transport_update_links+0x116/0x158
 [<ffffffff880d1b88>] :mpt2sas:_scsih_sas_topology_change_event+0x481/0x51c
 [<ffffffff880d1e3d>] :mpt2sas:_firmware_event_work+0x21a/0x10b8
 [<ffffffff80062ff0>] thread_return+0x62/0xfe
 [<ffffffff8002e281>] __wake_up+0x38/0x4f
 [<ffffffff880d1c23>] :mpt2sas:_firmware_event_work+0x0/0x10b8
 [<ffffffff8004d7aa>] run_workqueue+0x99/0xf6
 [<ffffffff80049ff2>] worker_thread+0x0/0x122
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a0e2>] worker_thread+0xf0/0x122
 [<ffffffff8008e414>] default_wake_function+0x0/0xe
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032968>] kthread+0xfe/0x132
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003286a>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 48 8b 87 c8 01 00 00 48 85 c0 74 04 48 8b 00 c3 48 8b 97 c0
RIP  [<ffffffff801c961b>] dev_driver_string+0x0/0x27
 RSP <ffff81032ea3dc18>
CR2: 00000000000001c8
 <6>mpt2sas0: Discovery: (stop)
Kernel panic - not syncing: Fatal exception
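
A short sketch of how the fault address in this oops can be cross-checked against the register dump. It assumes the Code: bytes start at the faulting RIP (dev_driver_string+0x0), which is consistent with the CR2 value. The first seven bytes, 48 8b 87 c8 01 00 00, encode a 64-bit load from a 32-bit displacement off RDI into RAX, and RDI is zero in the dump. The helper name is made up for illustration.

```python
# Bytes copied from the "Code:" line of the oops above.
code = bytes.fromhex(
    "48 8b 87 c8 01 00 00 48 85 c0 74 04 48 8b 00 c3 48 8b 97 c0"
)

def decode_mov_rdi_disp32(insn):
    # REX.W (0x48) + opcode 0x8B (MOV r64, r/m64) + ModRM 0x87
    # (mod=10: 32-bit displacement, reg=000: RAX, rm=111: RDI).
    if insn[0:3] != bytes([0x48, 0x8B, 0x87]):
        raise ValueError("not the expected mov disp32(%rdi),%rax encoding")
    return int.from_bytes(insn[3:7], "little", signed=True)

disp = decode_mov_rdi_disp32(code)
rdi = 0x0000000000000000  # RDI from the register dump
fault_address = rdi + disp

print(hex(disp), hex(fault_address))
```

The computed address matches CR2 (00000000000001c8), which is consistent with dev_driver_string being handed a NULL pointer as its first argument.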
Comment 43 kashyap 2010-10-28 11:48:50 EDT
(In reply to comment #42)
> I finally got the logging_level setting working with
> 2.6.18-229.el5_mptsas_aer_revert_.  I used 0x1fffff since those are all the
> bits defined in mpt2sas_debug.h.
> 
> All five systems panicked on boot with the same backtrace in mpt2sas. 
> Hopefully this will get you guys closer to what the real issue is.


A quick response after seeing these logs: I am preparing a new patch set for upstream, and one of the patches looks to be the solution to this issue.
Since you are able to reproduce this consistently, I will provide that particular patch on this bugzilla so that you can verify it.

The issue is: the Dell card has a direct attached SEP, which we normally don't have on generic LSI-provided controllers. Because of the SEP count, the driver tries to access an array beyond its size, which is what I am seeing here as well, and the kernel finally crashes.


Please find the attached patch and try it.
Comment 44 kashyap 2010-10-28 11:51:15 EDT
Created attachment 456279 [details]
direct attached SEP device patch

Fix oops loading driver when there is a direct attached
SEP device
 
The driver sets the max phys count to the value reported in sas iounit page
zero.  However, this page doesn't take into account additional virtual
phys.  When a sas topology event arrives, the phy count is larger than
expected, and the driver accesses the memory array beyond the end of the
allocated space, then oopses.  Manufacturing page 8 contains the info
on direct attached phys.
 
The fix makes sure that the sas topology event handler does not
process phys greater than the expected phy count.
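
The guard described above can be sketched as follows. This is a Python model of the idea only, not the attached patch: the function name, the dictionary layout, and the nine-entry event are all hypothetical, chosen to mirror a host that allocated eight phys and then receives an event covering one more.

```python
# Hypothetical model of a topology-event handler that skips phys
# outside the range the driver allocated for.
def process_topology_event(phys, start_phy_num, num_entries):
    """Return the phy numbers that were updated; skip out-of-range ones."""
    max_phys = len(phys)  # phy count the array was sized for
    updated = []
    for i in range(num_entries):
        phy_number = start_phy_num + i
        if phy_number >= max_phys:
            # The event describes a phy with no allocated slot
            # (for example an extra virtual phy); ignore it.
            continue
        phys[phy_number]["refreshed"] = True
        updated.append(phy_number)
    return updated

phys = [{"refreshed": False} for _ in range(8)]
updated = process_topology_event(phys, start_phy_num=0, num_entries=9)
print(updated)
```

Without the range check, the ninth entry would index past the end of the eight-element array, which is the out-of-bounds access the description refers to.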
Comment 45 Nate Straz 2010-10-28 12:46:21 EDT
Created attachment 456287 [details]
direct attached SEP device patch for RHEL5.6

I'm building a test kernel with this patch now.  I should have some results later this afternoon.
Comment 46 Nate Straz 2010-10-28 14:04:32 EDT
My test kernel with the patch in comment 45 booted on all five nodes on the first try.  I rebooted the nodes several times and they successfully booted every time.
Comment 47 Tom Coughlan 2010-10-28 14:24:16 EDT
The kernel for 5.6 beta is already built, so we need to decide what to do now. 

1) Nate, Kashyap, Shyam: Is there any workaround that would allow users to avoid the problem (like a BIOS parameter that turns off the direct attached SEP on these systems)? If so, we can ship the current kernel with instructions for the workaround, and then ship a fix for the bug in the first snapshot (just a few weeks later). 

2) Kashyap, what is the risk associated with the patch? If we put it in now, it will go with virtually no testing. Do you have any reservations about that? 

Tom
Comment 51 kashyap 2010-10-29 04:21:25 EDT
(In reply to comment #47)
> The kernel for 5.6 beta is already built, so we need to decide what to do now. 
> 
> 1) Nate, Kashyap, Shyam: Is there any workaround that would allow users to
> avoid the problem (like a BIOS parameter that turns off the direct attached SEP
> on these systems)? If so, we can ship the current kernel with instructions for
> the workaround, and then ship a fix for the bug in the first snapshot (just a
> few weeks later). 
I am not sure about this query, but as far as I know there is no workaround to avoid this in the driver. You have to apply the patch.

> 
> 2) Kashyap, what is the risk associated with the patch? If we put it in now, it
> will go with virtually no testing. Do you have any reservations about that? 
I would say you can trust this patch without much worry, since LSI has tested it internally as part of our GCA bundle. From a code review point of view, I would say you will not see any difficulties including it without testing.

So you can include this in the next RC release.

> 
> Tom

I will be on leave next week, so if you have any very urgent queries, please contact Joe Maloy.
Comment 52 Jarod Wilson 2010-11-01 17:00:25 EDT
in kernel-2.6.18-230.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Comment 54 kashyap 2010-11-10 04:51:28 EST
(In reply to comment #52)
> in kernel-2.6.18-230.el5
> You can download this test kernel (or newer) from
> http://people.redhat.com/jwilson/el5
> 
> Detailed testing feedback is always welcomed.

I have verified from the source RPM that the source at the above location contains the necessary patch to fix this issue.

~ Kashyap
Comment 55 Nate Straz 2010-11-10 09:18:20 EST
I have also verified the -230.el5 kernel and am using it to get through testing of RHEL5.6 Beta.
Comment 57 errata-xmlrpc 2011-01-13 16:56:30 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html
