Bug 512613 - Smartmontools cause hard drives behind a SAS controller to get dropped
Summary: Smartmontools cause hard drives behind a SAS controller to get dropped
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: smartmontools
Version: 11
Hardware: All
OS: Linux
low
urgent
Target Milestone: ---
Assignee: Michal Hlavinka
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-07-19 20:29 UTC by Philippe Troin
Modified: 2012-03-08 13:16 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-11-05 12:40:08 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
Bug fix for uninitialized variables in sat driver. (388 bytes, patch)
2009-07-19 20:29 UTC, Philippe Troin
no flags Details | Diff

Description Philippe Troin 2009-07-19 20:29:46 UTC
Created attachment 354287 [details]
Bug fix for uninitialized variables in sat driver.

Description of problem:

After upgrading a server to F11, whenever I run smartmontools, I get an error, and then the following get logged to dmesg:

mptscsih: ioc0: attempting task abort! (sc=ffff8800bd1b0d00)
sd 7:0:2:0: [sdd] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8800bd1b0d00)
mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
mptscsih: ioc0: attempting target reset! (sc=ffff8800bd1b0d00)
sd 7:0:2:0: [sdd] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
mptsas: ioc0: removing sata device, channel 0, id 18, phy 2
 port-7:2: mptsas: ioc0: delete port (2)

I was not happy when smartd started up and the kernel offlined all eight drives in a RAID-5 configuration :-(

After some investigations and looking at bug #452389 at https://bugzilla.redhat.com/show_bug.cgi?id=452389 
I found the root cause of the problem.

There are some uninitialized variables passed down to the kernel and the SCSI or libata layer gets confused and offline the drive.

Please consider fixing this ASAP as SAT is mostly used with big disks array and the consequences of an unpatched smartmontools can be dire.  I did not enjoy having to get my RAID array back on its feet after this problem.

Comment 1 Philippe Troin 2009-07-19 23:07:39 UTC
Hold on there, I'm still seeing the problem with the patch.

Comment 2 Philippe Troin 2009-07-20 00:02:06 UTC
So, this patch is useless.

Let me restate the symptoms:

# smartctl -d sat -a /dev/sde

where sde is a SATA hard drive attached to a SAS controller (LSI SAS1068)

gives the following output:

smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

[hangs there for 15-30 seconds]
=== START OF INFORMATION SECTION ===
Device Model:     [No Information Found]
Serial Number:    [No Information Found]
Firmware Version: [No Information Found]
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   1
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jul 19 16:30:26 2009 PDT
SMART is only available in ATA Version 3 Revision 3 or greater.
We will try to proceed in spite of this.
SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 82-83 don't show if SMART supported.
                  Checking for SMART support by trying SMART ENABLE command.
                  SMART ENABLE appeared to work!  Continuing.
SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 85-87 don't show if SMART is enabled.
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

At the same time, the kernel logs:

mptscsih: ioc0: attempting task abort! (sc=ffff8800b0151e00)
sd 7:0:3:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8800b0151e00)
mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
mptscsih: ioc0: attempting target reset! (sc=ffff8800b0151e00)
sd 7:0:3:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
mptsas: ioc0: removing sata device, channel 0, id 19, phy 3
 port-7:3: mptsas: ioc0: delete port (3)
sd 7:0:3:0: [sde] Synchronizing SCSI cache
mptscsih: ioc0: target reset: SUCCESS (sc=ffff8800b0151e00)
mptscsih: ioc0: attempting bus reset! (sc=ffff8800b0151e00)
sd 7:0:3:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8800b0151e00)
mptscsih: ioc0: attempting host reset! (sc=ffff8800b0151e00)
mptbase: ioc0: Initiating recovery
mptscsih: ioc0: host reset: SUCCESS (sc=ffff8800b0151e00)
sd 7:0:3:0: Device offlined - not ready after error recovery
program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
sd 7:0:3:0: [sde] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK

Comment 3 Philippe Troin 2009-07-20 02:46:57 UTC
Ok, as pointed out in bug #452389 disabling -Wl,-z,relro,-z,now makes a working executable.

I have no idea why it would though.

I would consider disabling it as the consequences for people with SAS controllers are very annoying, and the security benefits of relro are negligible: smartmontools and smartd are only run by root and do not otherwise interact with the network.

Phil.

Comment 4 Michal Hlavinka 2009-07-20 07:31:59 UTC
I've prepared svn snapshot for different bug report. Smartmontools has changed a lot since last official release. Could you try if this package works for you?

http://koji.fedoraproject.org/koji/taskinfo?taskID=1465244

Comment 5 Philippe Troin 2009-07-20 16:55:21 UTC
Scuse my thickness, but how do I download the packages from there?
The "Output" section is empty...

Comment 6 Michal Hlavinka 2009-07-21 06:07:41 UTC
Sorry, my fault, it seems scratch builds are delete quite soon, I've prepared new package

all architectures sub-pages now have links to packages

Comment 7 Michal Hlavinka 2009-07-23 11:33:28 UTC
jeez... and this time completely without link... sorry :(

here is that link:

http://koji.fedoraproject.org/koji/taskinfo?taskID=1494352

x86_64 package:

http://koji.fedoraproject.org/koji/getfile?taskID=1494354&name=smartmontools-5.39-0.fc11.svn20090721.x86_64.rpm

i686 package:

http://koji.fedoraproject.org/koji/getfile?taskID=1494356&name=smartmontools-5.39-0.fc11.svn20090721.i586.rpm

if you want packages for different architecture or need debuginfo packages, use first link

Comment 8 Pim Zandbergen 2009-08-05 10:06:10 UTC
This scratch build appears to have been deleted too.
I don't see any links to the .src.rpm either.

Can you prepare something new?

Thanks,
Pim

Comment 10 Pim Zandbergen 2009-08-07 11:20:56 UTC
I tried this scratch build on one of six systems I have seen this problem on.

It looked to be OK for a while; smartctl did not immediately bring the drives down, but smartd would, after a while, randomly.

Al of the systems I'm seeing this problem on have LSI 1068E SAS controllers, with host RAID disabled, using mdadm RAID arrays of 4 to 12 drives. Needless to say it hurts when several drives in an mdadm RAID array go down unexpectedly.

On Fedora 9 and older, this problem did not exist. It might be that not smartmontools but the mptsas driver is the problem, with smartmontools just triggering the bug.

Has anyone seen this problem on SAS controllers other than those of LSI?

Thanks,
Pim

Comment 11 Michal Hlavinka 2009-08-07 15:18:31 UTC
(In reply to comment #10)
> I tried this scratch build on one of six systems I have seen this problem on.
> 
> It looked to be OK for a while; smartctl did not immediately bring the drives
> down, but smartd would, after a while, randomly.

are there any messages in logs related?

> 
> Al of the systems I'm seeing this problem on have LSI 1068E SAS controllers,
> with host RAID disabled, using mdadm RAID arrays of 4 to 12 drives.

do you have any similar systems where it works? What are they?


> On Fedora 9 and older, this problem did not exist. It might be that not
> smartmontools but the mptsas driver is the problem, with smartmontools just
> triggering the bug.
> 
> Has anyone seen this problem on SAS controllers other than those of LSI?

I don't know about anything related. I've found only bug #493697

Comment 12 Pim Zandbergen 2009-08-07 17:22:59 UTC
Here is the part of the log where things go wrong.

Like I said, with this scratch build it takes a while to cause the problem to appear. For this crash to happen, I started smartd on all drives (1 ata and 8 sat drive) and ran "smartctl -a /dev/sdX" on each drive. I had a hit on /dev/sdo:

Aug  7 19:00:12 moog kernel: mptscsih: ioc0: attempting task abort! (sc=ffff8802255bac00)
Aug  7 19:00:12 moog kernel: sd 8:0:6:0: [sdo] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
Aug  7 19:00:17 moog kernel: mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Aug  7 19:00:17 moog kernel: mptsas: ioc0: removing sata device, channel 0, id 6, phy 6
Aug  7 19:00:17 moog kernel: sd 8:0:6:0: [sdo] Synchronizing SCSI cache
Aug  7 19:00:17 moog kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802255bac00)
Aug  7 19:00:17 moog kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880225495e00)
Aug  7 19:00:17 moog kernel: sd 8:0:6:0: [sdo] CDB: Read(10): 28 00 13 3e 75 d7 00 00 08 00
Aug  7 19:00:17 moog kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff880225495e00)
Aug  7 19:00:17 moog kernel: mptscsih: ioc0: attempting target reset! (sc=ffff8802255bac00)
Aug  7 19:00:17 moog kernel: sd 8:0:6:0: [sdo] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
Aug  7 19:00:17 moog kernel: mptscsih: ioc0: target reset: SUCCESS (sc=ffff8802255bac00)
Aug  7 19:00:17 moog kernel: mptscsih: ioc0: attempting bus reset! (sc=ffff8802255bac00)
Aug  7 19:00:17 moog kernel: sd 8:0:6:0: [sdo] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
Aug  7 19:00:18 moog kernel: mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8802255bac00)
Aug  7 19:00:28 moog kernel: mptscsih: ioc0: attempting host reset! (sc=ffff8802255bac00)
Aug  7 19:00:28 moog kernel: mptbase: ioc0: Initiating recovery
Aug  7 19:00:39 moog kernel: mptscsih: ioc0: host reset: SUCCESS (sc=ffff8802255bac00)
Aug  7 19:00:39 moog kernel: sd 8:0:6:0: Device offlined - not ready after error recovery
Aug  7 19:00:39 moog kernel: sd 8:0:6:0: Device offlined - not ready after error recovery
Aug  7 19:00:39 moog kernel: sd 8:0:6:0: [sdo] Unhandled error code
Aug  7 19:00:39 moog kernel: sd 8:0:6:0: [sdo] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug  7 19:00:39 moog kernel: end_request: I/O error, dev sdo, sector 322860503
Aug  7 19:00:39 moog kernel: sd 8:0:6:0: [sdo] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug  7 19:00:39 moog kernel: raid5: Disk failure on sdo1, disabling device.
Aug  7 19:00:39 moog kernel: raid5: Operation continuing on 6 devices.
Aug  7 19:00:39 moog kernel: end_request: I/O error, dev sdm, sector 1953519935
Aug  7 19:00:39 moog kernel: md: super_written gets error=-5, uptodate=0
Aug  7 19:00:39 moog kernel: raid5: Disk failure on sdm1, disabling device.
Aug  7 19:00:39 moog kernel: raid5: Operation continuing on 6 devices.
Aug  7 19:00:39 moog kernel: RAID5 conf printout:
Aug  7 19:00:39 moog kernel: --- rd:7 wd:6
Aug  7 19:00:39 moog kernel: disk 0, o:1, dev:sdj1
Aug  7 19:00:39 moog kernel: disk 1, o:1, dev:sdl1
Aug  7 19:00:39 moog kernel: disk 2, o:1, dev:sdp1
Aug  7 19:00:39 moog kernel: disk 3, o:1, dev:sdn1
Aug  7 19:00:39 moog kernel: disk 4, o:1, dev:sdk1
Aug  7 19:00:39 moog kernel: disk 5, o:1, dev:sdi1
Aug  7 19:00:39 moog kernel: disk 6, o:0, dev:sdo1
Aug  7 19:00:39 moog kernel: RAID5 conf printout:
Aug  7 19:00:39 moog kernel: --- rd:7 wd:6
Aug  7 19:00:39 moog kernel: disk 0, o:1, dev:sdj1
Aug  7 19:00:39 moog kernel: disk 1, o:1, dev:sdl1
Aug  7 19:00:39 moog kernel: disk 2, o:1, dev:sdp1
Aug  7 19:00:39 moog kernel: disk 3, o:1, dev:sdn1
Aug  7 19:00:39 moog kernel: disk 4, o:1, dev:sdk1
Aug  7 19:00:39 moog kernel: disk 5, o:1, dev:sdi1
Aug  7 19:00:45 moog kernel: JBD: barrier-based sync failed on dm-1:8 - disabling barriers

reloading mptsas kernel module brings all drives back, but the drives are marked as failed in mdadm. re-adding them starts a resync which takes three hours on this system.

Pim

Comment 13 Pim Zandbergen 2009-08-07 17:37:37 UTC
(In reply to comment #9)
> yes, scratch build lives only a few days before it's deleted
> 
> build:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=1583235

Can you share the .src.rpm?

I'd like to try building with different compiler flags as suggested by various sources
http://markmail.org/message/zow4jsbfl2zxhimx

Pim

Comment 14 Michal Hlavinka 2009-08-10 06:30:05 UTC
ok, srpm is here:

http://mhlavink.fedorapeople.org/smartmontools-5.39-0.svn20090721.fc11.src.rpm

I'll rebuild this package without '-fpie' if you can confirm it helps

Comment 15 Pim Zandbergen 2009-08-19 11:42:46 UTC
(In reply to comment #14)
> ok, srpm is here:
> 
> http://mhlavink.fedorapeople.org/smartmontools-5.39-0.svn20090721.fc11.src.rpm
> 
> I'll rebuild this package without '-fpie' if you can confirm it helps  

I haven't been able to test this release built without -fpie, but I did build 5.38-11, the current fedora package, without -fpie.

It did not make any difference. I don't think building this newer package without -fpie will make a difference.

Comment 16 Pim Zandbergen 2009-08-19 12:38:38 UTC
(In reply to comment #11)
>
> do you have any similar systems where it works? What are they?
> 

I am administering six systems with LSI 1068E based SAS controllers.
1 Fedora 9 with LSI SAS3800X, works
2 Fedora 10 with LSI SAS3081E-R, work
1 Fedora 10 with LSI SAS3081E-R, problems
1 Fedora 11 with IBM-branded LSI SAS3444E, problems
1 Fedora 11 with Fujitsu card similar to LSI SAS3081E-R, problems

All controllers are controlling SATA disks.
LSI's host-RAID bios is disabled or not used.
All disks are used for mdadm software RAID.
All systems are booting off a SATA flash disk
controlled by an Intel ICHxxx SATA controller.

The most peculiar thing is that two of these systems
are exactly the same, yet only one has the problem.
Exact same hardware, same age, same SAS controller,
same firmware, same Fedora 10 release.

Their only difference is their upgrade history.

The system without problems was installed with Fedora 8,
then upgraded to Fedora 10.

The system with problems was installed with Fedora 9,
then upgraded to Fedora 11 (where the problem started)
then downgraded to Fedora 10 (for other reasons than
smartmontools). The smartmontools problem persists
after downgrading.

The system with the Fujitsu card was the only one that
started exposing this problem rightaway in Fedora 10.
As this was the first one, I was suspecting the card
itself then. This system now runs Fedora 11 and still has
the problem.

Comment 17 Pim Zandbergen 2009-08-19 12:45:03 UTC
Upgrading firmware does not help either.

I have upgraded the firmware of the problematic systems
from 1.26 to 1.28 for the LSI branded cards and
from 1.24 to 1.27.82 for the IBM branded card.
The Fujitsu card happily accepts LSI firmware.

No difference in behaviour was noted after upgrading.

Comment 18 Michal Hlavinka 2009-08-19 14:05:36 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > ok, srpm is here:
> > 
> > http://mhlavink.fedorapeople.org/smartmontools-5.39-0.svn20090721.fc11.src.rpm
> > 
> > I'll rebuild this package without '-fpie' if you can confirm it helps  
> 
> I haven't been able to test this release built without -fpie, but I did build
> 5.38-11, the current fedora package, without -fpie.
> 
> It did not make any difference. I don't think building this newer package
> without -fpie will make a difference.  

well, I've meant without '-pie' and '-fpie' as mentioned in

https://bugzilla.redhat.com/show_bug.cgi?id=452389#c1
and
http://markmail.org/message/zow4jsbfl2zxhimx

in comment #3 you've said:

> disabling -Wl,-z,relro,-z,now makes a working executable.

but in original comment it's said:

>the problem is that Fedora builds with the following command:
>make CXXFLAGS="$RPM_OPT_FLAGS -fpie" LDFLAGS="-pie -Wl,-z,relro,-z,now"
> Rebuilding without the 'pie' flags makes the problem go away....

so it says keeping the flags you said you've removed (-Wl,-z,...) and removing just -fpie from CXXFLAGS and -pie from LDFLAGS

so... what options did you use and was it working without the patch you've attached or only with the patch?

Did you test combination: removed -fpie, -pie and without the patch? If not, could you test it?

I'll try to get some answers from some kernel-storage guy meanwhile.

Comment 19 Pim Zandbergen 2009-08-19 14:52:28 UTC
Sorry for the confusion. I will restate my previous findings:

I built smartmontools-5.38-11.fc11.src.rpm using
make CXXFLAGS="$RPM_OPT_FLAGS" LDFLAGS="-Wl,-z,relro,-z,now"

It did not make a difference.

I used smartmontools-5.39-0.svn20090721.fc11.x86_64.rpm as
built by yourself.

It appeared to take longer before the drives went offline.

I have not yet tried custom builds of
smartmontools-5.39-0.svn20090721.fc11.src.rpm

It's not so easy to test this. If I deactivate my volume
groups and stop the md arrays, nothing unusual happens
when I use smartd or smartctl.

I have to bring everything up for the problem to rise.
But then when they rise, things get very scary.
Multiple drives from an MD RAID go offline.
The ones that stay online often are marked as failed, because
they did not respond for a while. Repairing this means re-enabling
failed drives and forcing re-assembly of broken RAID arrays,
and resyncs that take hours. I did not lose any data yet,
but wonder how long my luck will last.

Apart from my home file server, all the other servers are
production servers at work...

Next time I will try to force the problem while using
non-RAID related disk activity, like reading all drives
simultaneously using dd.

Thanks,
Pim

Comment 20 Michal Hlavinka 2009-08-19 15:15:43 UTC
(In reply to comment #19)
> I built smartmontools-5.38-11.fc11.src.rpm using
> make CXXFLAGS="$RPM_OPT_FLAGS" LDFLAGS="-Wl,-z,relro,-z,now"

so, to be clear, there is no build option, that can be used. It simply does not work. Even build options mentioned in comment #3 do not help. right?

> I have not yet tried custom builds of
> smartmontools-5.39-0.svn20090721.fc11.src.rpm

I think there is no need to do this for now.

could you get me pci ids of that sas controllers? I'll look at our lab if I can find one.

thanks

Comment 21 Pim Zandbergen 2009-08-19 16:11:32 UTC
(In reply to comment #20)
> so, to be clear, there is no build option, that can be used. It simply does not
> work.

Simply put: on those systems where the problem appeared
there was nothing I could do to make it go away, other
than disabling smartd and not using smartctl.

Not using different versions of smartmontools.
Not using different build options.
Not by upgrading the firmware.
Not by downgrading Fedora 11 to Fedora 10.

> Even build options mentioned in comment #3 do not help. right?

I never tried the build options suggested in comment #3.

> could you get me pci ids of that sas controllers?

Fujitsu card is 1000:0058 Subsystem: 1734:1130
LSI SAS3081E is 1000:0058 Subsystem: 1000:3140
LSI SAS3800X is 1000:0054 Subsystem: 1000:3070
IBM SAS3444E is 1000:0058 Subsystem: 1000:3100

Comment 22 Tomas Henzl 2009-08-27 11:43:11 UTC
Kashyap,
this could be a problem with the controller, does it look familiar to you ?

thanks, tomas

Comment 23 kashyap 2009-08-27 12:45:20 UTC
what are the values for OS level IO timeout and Firmware level IO missing delay and Device missing delay ?
Can I have below values ?

1. cat /sys/class/scsi_host/hostX/device_delay
2. cat /sys/class/scsi_host/hostX/io_delay
2. cat /sys/class/scsi_device/B:T:L:C/device/timeout.


Thanks,
Kashyap

Comment 24 Pim Zandbergen 2009-08-27 15:46:03 UTC
On every system, device_delay and io_delay are 00

Timeout values differ on the systems described in previous comments:

Fujitsu card 30 (fedora 11, failing)
LSI SAS3081E 60 (fedora 10, 2 working, 1 failing)
LSI SAS3800X 60 (fedora 9, working)
IBM SAS3444E 30 (fedora 11, failing)

The conclusion could almost be that 60 is good and 30 is bad.
Except that there's one system that's failing with 60.

Thanks
Pim

Comment 25 Pim Zandbergen 2009-08-27 16:49:53 UTC
There might there be a relation with

http://bugzilla.kernel.org/show_bug.cgi?id=13594

Pim

Comment 26 kashyap 2009-08-28 13:41:35 UTC
In comment #2 I am seeing this log
mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing
Delay Retry}, SubCode(0x0000)

It indicates device missing delay is set to non-zero.

Thanks,
Kashyap

Comment 27 Michal Hlavinka 2009-09-01 13:13:00 UTC
Thanks for the answer, Kashyap. Could you explain me your last comment?
Does it mean:
a) controller is misconfigured?
b) smartmontools are using wrong ioctl/something
c) bug in the driver
d) something else?

Thanks,
Michal

Comment 28 kashyap 2009-09-01 14:05:39 UTC
See my comments inline.
a) controller is misconfigured?
-> Any configuration is valid. Why I have asked for this is just to get some pointers for me. When you found device missing delay is set to 0, I would like you to cross check again, since Logs are indicating me that there are some device missing delay values set.

b) smartmontools are using wrong ioctl/something
c) bug in the driver
d) something else?
-> Need investigation for #b, #c, $d

Please ON driver log level to 0x8188 (It can be set by changing value of /sys/class/scsih_host/hostX/logging_level)
This can help me to see things from MPT fusion side.



Thanks,
Kashyap

Comment 29 Richard Huddleston 2009-09-01 19:32:44 UTC
For what it's worth, I'm seeing this issue on Ubuntu Jaunty, Kernel 2.6.28 

although my drives and mdadm come up after boot

01:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064ET PCI-Express Fusion-MPT SAS (rev 08)
        Subsystem: LSI Logic / Symbios Logic Device 3090
        Flags: bus master, fast devsel, latency 0, IRQ 2294
        I/O ports at b000 [size=256]
        Memory at d0610000 (64-bit, non-prefetchable) [size=16K]
        Memory at d0600000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at d0400000 [disabled] [size=2M]
        Capabilities: [50] Power Management version 2
        Capabilities: [68] Express Endpoint, MSI 00
        Capabilities: [98] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
        Capabilities: [100] Advanced Error Reporting <?>
        Kernel driver in use: mptsas
        Kernel modules: mptsas

with 4 disks  ST3500630AS

[   13.259129] mptsas 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[   13.259225] mptbase: ioc0: Initiating bringup
[   14.980070] mptsas 0000:01:00.0: irq 2294 for MSI/MSI-X
[   14.980083] mptbase: ioc0: PCI-MSI enabled
[   14.980105] mptsas 0000:01:00.0: setting latency timer to 64
[   47.040043] mptscsih: ioc0: attempting task abort! (sc=ffff880238de5200)
[   47.514167] mptbase: ioc0: LogInfo(0x31140000): Originator={PL},
Code={IO Executed}, SubCode(0x0000)
[   47.563886] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880238de5200)
[   56.040032] mptscsih: ioc0: attempting task abort! (sc=ffff880238de5500)
[   56.513902] mptbase: ioc0: LogInfo(0x31140000): Originator={PL},
Code={IO Executed}, SubCode(0x0000)
[   56.560024] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880238de5500)
[   65.040048] mptscsih: ioc0: attempting task abort! (sc=ffff880238de5000)
[   65.509567] mptbase: ioc0: LogInfo(0x31140000): Originator={PL},
Code={IO Executed}, SubCode(0x0000)
[   65.560024] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880238de5000)
[   74.011304] mptscsih: ioc0: attempting task abort! (sc=ffff8802391cea00)
[   74.505256] mptbase: ioc0: LogInfo(0x31140000): Originator={PL},
Code={IO Executed}, SubCode(0x0000)
[   74.530021] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802391cea00)

Comment 30 Pim Zandbergen 2009-09-02 18:12:49 UTC
(In reply to comment #28)
> When you found device missing delay is set to 0, I would like
> you to cross check again, since Logs are indicating me that there are some
> device missing delay values set.

I think the confusion here is that the non-zero "device missing delay"
values were seen in Philippe Troin's report. Right?

I think my reports do not show this, and this would match the
/sys/class/scsi_host/hostX/device_delay values that I reported.


Thanks,
Pim

Comment 31 kashyap 2009-09-03 05:00:43 UTC
I got your point. It means with and without Device missing delay problem is still there. Right ?

We are targeting why Device is getting off lined by OS right ?
Is this problem still Alive ? Since I am not seeing Device Offline prints in Comment #30.

Can you try doing MSI disable and see what is behavior ?

Comment 32 Pim Zandbergen 2009-09-21 17:33:53 UTC
I haven't been able to answer questions posed in previous comments,
as I deemed it too dangerous to try more on production systems.

I now have assembled a test machine, a desktop PC with an IBM SAS HBA
(1000:0058 Subsystem: 1000:3100) in a secondary PEG slot. This works.
I configured 4 (different) laptop drives in an mdadm RAID-5.

But I now have trouble reproducing the problem. I can reproduce
I/O errors, card resets, mdadm failed disks, but I have not been
able to let drives actually go offline.

These laptop drives are much slower than the drives from
the production systems. Maybe I need faster drives, or
maybe I need to simulate more disk activity.

Or maybe the current symptopms (which are bad enough on
their own) suffice for continuing this bug report.

Pim

Comment 33 Pim Zandbergen 2009-09-21 17:39:29 UTC
There is yet another report of this problem on the linux-scsi
mailinglist
http://kerneltrap.org/mailarchive/linux-scsi/2009/9/14/6409773

Pim

Comment 34 Pim Zandbergen 2009-09-21 19:11:06 UTC
On my new test system, I see different values as described in comment #23

1. cat /sys/class/scsi_host/hostX/device_delay
144

2. cat /sys/class/scsi_host/hostX/io_delay
08

3. cat /sys/class/scsi_device/B:T:L:C/device/timeout.
30
30
30
30

This card has older firmware: FwRev=01175400h
Should I upgrade firmware first?

Comment 35 Pim Zandbergen 2009-09-21 19:59:33 UTC
I have upgraded my test system with yum update which means
it now runs kernel 2.6.30.5-43.fc11.x86_64

I have upgraded the firmware to 011b5200h as on production system.

I have found the delay values 144 and 8 in the BIOS setup of the adapter
and have changed them to zero

Before all these changes, the symptoms were the same as on the production systems, except that the drives would not disappear from /proc/scsi/scsi,
although they would be inaccessible.

Now after these changes, drives no longer fail.

I do see errors in the syslog, but they are different:

Sep 21 21:38:50 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012e5feb00)
Sep 21 21:38:50 botero kernel: sd 6:0:2:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
Sep 21 21:38:51 botero kernel: mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 21 21:38:51 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012e5feb00)
Sep 21 21:38:52 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012e439d00)
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] CDB: Read(10): 28 00 00 13 83 3f 00 00 08 00
Sep 21 21:38:52 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012e439d00)
Sep 21 21:38:52 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012e439600)
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] CDB: Read(10): 28 00 00 13 83 47 00 00 08 00
Sep 21 21:38:52 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012e439600)
Sep 21 21:38:52 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff8800bbd2ea00)
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] CDB: Read(10): 28 00 00 32 0c bf 00 00 08 00
Sep 21 21:38:52 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff8800bbd2ea00)
Sep 21 21:38:52 botero kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 21 21:38:52 botero kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:38:52 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:38:52 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:38:52 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:38:52 botero kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:38:52 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:38:52 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:38:52 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:38:52 botero ata_id[3196]: HDIO_GET_IDENTITY failed for '/dev/block/8:64'
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:38:52 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:38:52 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:38:52 botero kernel:        00 00 00 00 00 00
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:38:52 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:38:52 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:38:52 botero kernel:        00 00 00 00 00 00
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:38:52 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:38:52 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:38:52 botero kernel:        00 00 00 00 00 00
Sep 21 21:38:52 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:39:49 botero ntpd[2063]: synchronized to 82.204.76.2, stratum 2
Sep 21 21:39:49 botero ntpd[2063]: kernel time sync status change 2001
Sep 21 21:40:08 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012d9d1600)
Sep 21 21:40:08 botero kernel: sd 6:0:1:0: [sdd] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
Sep 21 21:40:09 botero kernel: mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 21 21:40:09 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012d9d1600)
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012e439700)
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] CDB: Write(10): 2a 00 00 21 71 3f 00 00 08 00
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012e439700)
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012e439e00)
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] CDB: Write(10): 2a 00 00 21 86 87 00 00 08 00
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012e439e00)
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012ddf9400)
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] CDB: Write(10): 2a 00 00 21 90 ef 00 00 08 00
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012ddf9400)
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012dd75d00)
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] CDB: Write(10): 2a 00 00 21 bb 7f 00 00 08 00
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012dd75d00)
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012df84800)
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] CDB: Read(10): 28 00 00 33 bd 77 00 00 08 00
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012df84800)
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880117338900)
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] CDB: Read(10): 28 00 00 33 bd 7f 00 00 08 00
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff880117338900)
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880117338000)
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] CDB: Read(10): 28 00 00 33 bd 87 00 00 08 00
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff880117338000)
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88012b499000)
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] CDB: Read(10): 28 00 00 33 bd 8f 00 00 08 00
Sep 21 21:40:10 botero kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff88012b499000)
Sep 21 21:40:10 botero kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 21 21:40:10 botero kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:40:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:40:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:40:10 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:40:10 botero kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:40:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:40:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:40:10 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:40:10 botero ata_id[3284]: HDIO_GET_IDENTITY failed for '/dev/block/8:48'
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:40:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:40:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:40:10 botero kernel:        00 00 00 00 00 00
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:40:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:40:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:40:10 botero kernel:        00 00 00 00 00 00
Sep 21 21:40:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:41:07 botero smartd[2206]: smartd received signal 15: Terminated
Sep 21 21:41:07 botero smartd[2206]: smartd is exiting (exit status 0)
Sep 21 21:41:07 botero smartd[3365]: smartd version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Sep 21 21:41:07 botero smartd[3365]: Home page is http://smartmontools.sourceforge.net/#012
Sep 21 21:41:07 botero smartd[3365]: Opened configuration file /etc/smartd.conf
Sep 21 21:41:07 botero smartd[3365]: Configuration file /etc/smartd.conf parsed.
Sep 21 21:41:07 botero smartd[3365]: Device: /dev/disk/by-id/ata-ST3500320AS_9QM48WNB, opened
Sep 21 21:41:07 botero smartd[3365]: Device: /dev/disk/by-id/ata-ST3500320AS_9QM48WNB, found in smartd database.
Sep 21 21:41:07 botero smartd[3365]: Device: /dev/disk/by-id/ata-ST3500320AS_9QM48WNB, is SMART capable. Adding to "monitor" list.
Sep 21 21:41:07 botero smartd[3365]: Device: /dev/disk/by-id/ata-ST3500320AS_9QM48X07, opened
Sep 21 21:41:07 botero smartd[3365]: Device: /dev/disk/by-id/ata-ST3500320AS_9QM48X07, found in smartd database.
Sep 21 21:41:07 botero smartd[3365]: Device: /dev/disk/by-id/ata-ST3500320AS_9QM48X07, is SMART capable. Adding to "monitor" list.
Sep 21 21:41:07 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_FUJITSU_MHV2100_NW04T6525F28, opened
Sep 21 21:41:07 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_FUJITSU_MHV2100_NW04T6525F28, found in smartd database.
Sep 21 21:41:07 botero kernel: sd 6:0:0:0: [sdc] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:07 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:07 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:07 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:41:07 botero kernel: sd 6:0:0:0: [sdc] Add. Sense: ATA pass through information available
Sep 21 21:41:08 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_FUJITSU_MHV2100_NW04T6525F28, is SMART capable. Adding to "monitor" list.
Sep 21 21:41:08 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS541010G9SA00_MP2ZX0X0K9WPBS, opened
Sep 21 21:41:08 botero ata_id[3369]: HDIO_GET_IDENTITY failed for '/dev/block/8:32'
Sep 21 21:41:08 botero kernel: sd 6:0:0:0: [sdc] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:08 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:08 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:08 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:08 botero kernel: sd 6:0:0:0: [sdc] Add. Sense: ATA pass through information available
Sep 21 21:41:08 botero kernel: sd 6:0:0:0: [sdc] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:08 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:08 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:08 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:08 botero kernel: sd 6:0:0:0: [sdc] Add. Sense: ATA pass through information available
Sep 21 21:41:08 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS541010G9SA00_MP2ZX0X0K9WPBS, found in smartd database.
Sep 21 21:41:09 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:09 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:09 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:41:09 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:41:09 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:41:09 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS541010G9SA00_MP2ZX0X0K9WPBS, is SMART capable. Adding to "monitor" list.
Sep 21 21:41:09 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS541080G9SA00_MPBDPAXNGR2PVM, opened
Sep 21 21:41:09 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS541080G9SA00_MPBDPAXNGR2PVM, found in smartd database.
Sep 21 21:41:09 botero kernel: sd 6:0:3:0: [sdf] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:09 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:09 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:41:09 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:41:09 botero kernel: sd 6:0:3:0: [sdf] Add. Sense: ATA pass through information available
Sep 21 21:41:09 botero ata_id[3382]: HDIO_GET_IDENTITY failed for '/dev/block/8:64'
Sep 21 21:41:09 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS541080G9SA00_MPBDPAXNGR2PVM, is SMART capable. Adding to "monitor" list.
Sep 21 21:41:09 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS721010G9SA00_MPCZN2Y0GPYPHH, opened
Sep 21 21:41:09 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS721010G9SA00_MPCZN2Y0GPYPHH, found in smartd database.
Sep 21 21:41:09 botero ata_id[3391]: HDIO_GET_IDENTITY failed for '/dev/block/8:80'
Sep 21 21:41:09 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:09 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:09 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:09 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:09 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:41:09 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:09 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:09 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:41:09 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:41:09 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:41:09 botero kernel: sd 6:0:3:0: [sdf] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:09 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:09 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:09 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:09 botero kernel: sd 6:0:3:0: [sdf] Add. Sense: ATA pass through information available
Sep 21 21:41:09 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:09 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:09 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:09 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:09 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:41:09 botero kernel: sd 6:0:3:0: [sdf] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:09 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:09 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:09 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:09 botero kernel: sd 6:0:3:0: [sdf] Add. Sense: ATA pass through information available
Sep 21 21:41:09 botero smartd[3365]: Device: /dev/disk/by-id/scsi-SATA_HTS721010G9SA00_MPCZN2Y0GPYPHH, is SMART capable. Adding to "monitor" list.
Sep 21 21:41:09 botero smartd[3365]: Monitoring 6 ATA and 0 SCSI devices
Sep 21 21:41:09 botero ata_id[3404]: HDIO_GET_IDENTITY failed for '/dev/block/8:48'
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:10 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:10 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:41:10 botero kernel: sd 6:0:0:0: [sdc] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:10 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:41:10 botero kernel: sd 6:0:0:0: [sdc] Add. Sense: ATA pass through information available
Sep 21 21:41:10 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:41:10 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:41:10 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 21:41:10 botero kernel: sd 6:0:3:0: [sdf] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:41:10 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:41:10 botero kernel: sd 6:0:3:0: [sdf] Add. Sense: ATA pass through information available
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 01 00 00 00
Sep 21 21:41:10 botero kernel:        00 4f 00 c2 00 50
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:41:10 botero smartd[3414]: smartd has fork()ed into background mode. New PID=3414.
Sep 21 21:41:10 botero ata_id[3418]: HDIO_GET_IDENTITY failed for '/dev/block/8:48'
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:10 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Sense Key : Recovered Error [current] [descriptor]
Sep 21 21:41:10 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 21:41:10 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Sep 21 21:41:10 botero kernel:        00 00 00 00 00 00
Sep 21 21:41:10 botero kernel: sd 6:0:1:0: [sdd] Add. Sense: ATA pass through information available

In spite of all these messages, this looks good.
Except that smartctl output is incomplete.

I will try this one of the production systems.

Pim

Comment 36 Pim Zandbergen 2009-09-21 20:05:22 UTC
smarctl -H -d scsi /dev/sdX

returns

SMART Health Status: OK

immediately, with this in the syslog

Sep 21 22:02:55 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 22:02:55 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 22:02:55 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 
Sep 21 22:02:55 botero kernel:        00 00 00 00 00 00 
Sep 21 22:02:55 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available
Sep 21 22:02:55 botero kernel: sd 6:0:2:0: [sde] Sense Key : Recovered Error [current] [descriptor]
Sep 21 22:02:55 botero kernel: Descriptor sense data with sense descriptors (in hex):
Sep 21 22:02:55 botero kernel:        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 
Sep 21 22:02:55 botero kernel:        00 00 00 00 00 00 
Sep 21 22:02:55 botero kernel: sd 6:0:2:0: [sde] Add. Sense: ATA pass through information available

So, faking "-d scsi" in /etc/smartd.conf might give a working SMART monitoring
without bringing the system down.

Pim


Pim

Comment 37 Pim Zandbergen 2009-09-21 20:07:14 UTC
(In reply to comment #31)

> Can you try doing MSI disable and see what is behavior ?  

Where and how do I do this?

Thanks,
Pim

Comment 38 Pim Zandbergen 2009-09-22 09:10:22 UTC
I have upgraded one of the production systems using yum update,
which upgrades the kernel to 2.6.30.5-43.fc11.x86_64

Everything seems to work now, both smartd and smartctl,
using "-d sat".

I do get syslog messages like in comment #35.
It assume these messages are related to
http://bugzilla.kernel.org/show_bug.cgi?id=13594

But that should probably be filed as a new bug.

Thanks,
Pim

Comment 39 Tomas Henzl 2009-09-22 12:25:25 UTC
(In reply to comment #37)
> (In reply to comment #31)
> 
> > Can you try doing MSI disable and see what is behavior ?  
> 
> Where and how do I do this?
> 
> Thanks,
> Pim  

There used to be mpt_msi_enable option, now replaced by (mpt_msi_enable_sas, mpt_msi_enable_spi, mpt_msi_enable_fc) - they do the same for the particular buses. 
Any  module  options  specified in /etc/modprobe.conf are passed to the modules as they are loaded by the initial ramdisk. So change the options in modprobe.conf and recreate the initrd.
I'm not sure, but I think that the default value is 0 (msi disabled) so this probably will not help much. You can try the option mpt_msi_enable_sas=0 to be sure.
tomash

Comment 40 Pim Zandbergen 2009-11-04 18:17:04 UTC
Do you really want me to try to disable MSI on an old kernel,
now that I reported that the problem is fixed in kernel 2.6.30 ?

I would suggest to close this bug.

Pim

Comment 41 Pim Zandbergen 2009-11-04 18:58:32 UTC
(In reply to comment #38)

> I do get syslog messages like in comment #35.
> It assume these messages are related to
> http://bugzilla.kernel.org/show_bug.cgi?id=13594
> 
> But that should probably be filed as a new bug.


I have filed this new problem as bug 533035
https://bugzilla.redhat.com/show_bug.cgi?id=533035

Pim

Comment 42 Michal Hlavinka 2009-11-05 12:40:08 UTC
(In reply to comment #40)
...
> the problem is fixed in kernel 2.6.30
...
> I would suggest to close this bug.

agreed


Note You need to log in before you can comment on or make changes to this bug.