Bug 329201 - scsi hot swapp mechanism not working with SATA HDD under RHEL4U5
Summary: scsi hot swapp mechanism not working with SATA HDD under RHEL4U5
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: All
OS: Linux
high
high
Target Milestone: rc
: ---
Assignee: David Milburn
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 430698 461297
TreeView+ depends on / blocked
 
Reported: 2007-10-12 08:46 UTC by Benedikt Schaefer
Modified: 2018-10-20 00:27 UTC (History)
14 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 19:16:36 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Patch to fix removing/adding hdd through /proc (1.59 KB, patch)
2008-11-11 14:25 UTC, David Milburn
no flags Details | Diff
dmesg.add (421 bytes, application/octet-stream)
2008-12-10 08:10 UTC, Benedikt Schaefer
no flags Details
dmesg.boot (25.62 KB, application/octet-stream)
2008-12-10 08:11 UTC, Benedikt Schaefer
no flags Details
dmesg.remove (58 bytes, application/octet-stream)
2008-12-10 08:11 UTC, Benedikt Schaefer
no flags Details
output of lspci (2.16 KB, text/plain)
2008-12-10 08:11 UTC, Benedikt Schaefer
no flags Details
scsi.log (1.35 KB, text/x-log)
2008-12-10 08:12 UTC, Benedikt Schaefer
no flags Details
dmesg3.boot (29.53 KB, application/octet-stream)
2008-12-18 08:46 UTC, Benedikt Schaefer
no flags Details
dmesg3.remove (150.23 KB, application/octet-stream)
2008-12-18 08:47 UTC, Benedikt Schaefer
no flags Details
dmesg3.add (953.97 KB, application/octet-stream)
2008-12-18 08:49 UTC, Benedikt Schaefer
no flags Details
dmesg4.add (953.42 KB, application/octet-stream)
2008-12-19 07:55 UTC, Benedikt Schaefer
no flags Details
dmesg6.boot (31.64 KB, application/octet-stream)
2009-01-16 12:18 UTC, Benedikt Schaefer
no flags Details
dmesg6.remove (313.02 KB, application/octet-stream)
2009-01-16 12:19 UTC, Benedikt Schaefer
no flags Details
dmesg6.add (954.71 KB, application/octet-stream)
2009-01-16 12:19 UTC, Benedikt Schaefer
no flags Details
dmesg7.boot (31.36 KB, application/octet-stream)
2009-01-21 07:59 UTC, Benedikt Schaefer
no flags Details
dmesg7.remove (282.29 KB, application/octet-stream)
2009-01-21 08:00 UTC, Benedikt Schaefer
no flags Details
dmesg7.add (954.97 KB, application/octet-stream)
2009-01-21 08:00 UTC, Benedikt Schaefer
no flags Details
Patch to support hdd hotplug (2.02 KB, patch)
2009-01-21 23:10 UTC, David Milburn
no flags Details | Diff
dmesg8.add (1.14 KB, application/octet-stream)
2009-01-22 07:56 UTC, Benedikt Schaefer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1024 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update 2009-05-18 14:57:26 UTC

Description Benedikt Schaefer 2007-10-12 08:46:19 UTC
Description of problem:
scsi hot swapp mechanism is not working with SATA HDDs

Version-Release number of selected component (if applicable):
Hosts: NEC SA2510R-1 / NEC Express11Rf-1
OS: RHEL4U5/SL4U5
ARCH: x86_64 / i386
KERNEL: 2.6.9-55.EL/2.6.9-55.ELsmp/2.6.9-55.0.2.ELsmp/2.6.9-55.0.9.ELsmp
HDDs: WDT WD1600JS-19NCB1 / Maxtor Diamond Max Plus9 6Y120M00

How reproducible:
every time

Steps to Reproduce:
1. boot with RHEL4U5 rescue
2. echo "scsi remove-single-device 1 0 0 0" > /proc/scsi/scsi
3. remove HDD physicaly and add again
4. echo "scsi add-single-device 1 0 0 0" > /proc/scsi/scsi

Actual results:
HDD not added 

Expected results:
HDD added

Additional info:
after removing the HDD (remove-single-device) I see in dmesg: ata1.00 disabled
with kernel 2.6.9-42.0.10.ELsmp (RHEL4U4) everything is working fine

Comment 1 Jun'ichi Nomura (Red Hat) 2007-10-15 22:30:02 UTC
Prarit,

the problem can be reproduced with Dell Precision WorkStation 380
and is likely a generic SATA problem, not specific to the reported hardware.

When /proc/sys/dev/scsi/logging_level is set to 448,
the following messages were observed after "add-single-device":
  scsi_scan_host_selected: <3:0:0:0>
  scsi scan: INQUIRY to host 3 channel 0 id 0 lun 0
  scsi scan: 1st INQUIRY failed with code 0x40000

that means DID_BAD_TARGET.
libata returns DID_BAD_TARGET because the device is disabled
("dev->class++" in ata_dev_disable()) in the course of
"remove-single-device".

A big change has made for libata in RHEL4 U5 and it causes this
difference in behaviour.
The problem doesn't occur on RHEL5 and FC7, where the device scan
is done through different path (scsi_transport_template->user_scan()).



Comment 2 Prarit Bhargava 2007-10-16 12:48:01 UTC
Jun'ichi -- Can you verify that this worked with RHEL4U4 (or any earlier version
of RHEL4)?

P.

Comment 4 Jun'ichi Nomura (Red Hat) 2007-10-16 14:01:39 UTC
Prarit, remove/add works with RHEL4 U4 (2.6.9-42.EL)
(device is renamed after add, though).

See below for samples.

2.6.9-42.ELsmp:
# cat /proc/scsi/scsi 
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: WDC WD800JD-19LS Rev: 06.0
  Type:   Direct-Access                    ANSI SCSI revision: 0
# echo 1 > /sys/block/sdd/device/delete 
# dmesg -c
Synchronizing SCSI cache for disk sdd: 
# cat /proc/scsi/scsi 
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
# echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi
# dmesg -c
  Vendor: ATA       Model: WDC WD800JD-19LS  Rev: 06.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB)
SCSI device sde: drive cache: write back
SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB)
SCSI device sde: drive cache: write back
 sde: sde1 sde2 sde3 sde4
Attached scsi disk sde at scsi3, channel 0, id 0, lun 0
# echo 1 > /sys/block/sde/device/delete
# dmesg -c
Synchronizing SCSI cache for disk sde: 
# echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi
# dmesg -c
  Vendor: ATA       Model: WDC WD800JD-19LS  Rev: 06.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB)
SCSI device sde: drive cache: write back
SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB)
SCSI device sde: drive cache: write back
 sde: sde1 sde2 sde3 sde4
Attached scsi disk sde at scsi3, channel 0, id 0, lun 0
# echo 1 > /sys/block/sde/device/delete
# echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi
# cat /proc/scsi/scsi 
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: WDC WD800JD-19LS Rev: 06.0
  Type:   Direct-Access                    ANSI SCSI revision: 05

Note that 'sdd' became 'sde' after the 1st "add".


BTW, "remove-single-device" in the original report works only
once with 2.6.9-42.ELsmp.

# echo "scsi remove-single-device 3 0 0 0" > /proc/scsi/scsi
# dmesg -c
Synchronizing SCSI cache for disk sdd: 
# echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi
# dmesg -c
  Vendor: ATA       Model: WDC WD800JD-19LS  Rev: 06.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB)
SCSI device sde: drive cache: write back
SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB)
SCSI device sde: drive cache: write back
 sde: sde1 sde2 sde3 sde4
Attached scsi disk sde at scsi3, channel 0, id 0, lun 0
# echo "scsi remove-single-device 3 0 0 0" > /proc/scsi/scsi
# dmesg -c
<nothing displayed and sde is not removed>


As for the not-working case (RHEL4 U5 and RHEL4 U6 beta),

2.6.9-62.ELsmp:
# cat /proc/scsi/scsi 
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: WDC WD800JD-19LS Rev: 06.0
  Type:   Direct-Access                    ANSI SCSI revision: 05
# echo 1 > /sys/block/sdd/device/delete
# dmesg -c
Synchronizing SCSI cache for disk sdd: 
ata4.00: disabled
# echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi
# dmesg -c
<nothing displayed>
# cat /proc/scsi/scsi 
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HDS728080PLA380  Rev: PF2O
  Type:   Direct-Access                    ANSI SCSI revision: 05



Comment 5 Prarit Bhargava 2007-10-16 14:13:44 UTC
Thanks jnomura :) -- one last question.  The information in comment #4 was from
the dell380-2.rhts.boston.redhat.com in RHTS?

P.

Comment 6 Jun'ichi Nomura (Red Hat) 2007-10-16 14:31:20 UTC
Your welcome, Prarit.
No, the information was taken on my desktop machine.
I'll let you know other machines available by e-mail.


Comment 9 RHEL Program Management 2007-10-16 15:46:15 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 13 Benedikt Schaefer 2007-11-05 12:18:52 UTC
Do you have any update for me ?

thanks
best regards
Benedikt

Comment 14 RHEL Program Management 2008-03-10 10:26:17 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request. 

Comment 17 Peter Martuccelli 2008-07-01 14:29:24 UTC
Development will review this issue for resolution in RHEL 4.8.

Comment 18 RHEL Program Management 2008-09-03 12:52:50 UTC
Updating PM score.

Comment 20 David Milburn 2008-11-11 14:25:50 UTC
Created attachment 323174 [details]
Patch to fix removing/adding hdd through /proc

The reason this fails is because ata_eh_detach_dev call ata_dev_disable which
changes the class from ATA_DEV_ATA or ATA_DEV_ATAPI to a disabled state,
which in turn causes ata_scsi_queuecmd to report DID_BAD_TARGET since 
ata_scsi_find_dev reports no associated ATA device. RHEL5 and upstream code
use the scsi_transport user_scan function to allow a user-initiated bus
scan, but, this would require pulling in many other commits into RHEL4.8.
Instead, this patch sets an additional flag in ata_scsi_slave_destroy and
will check for this flag in ata_scsi_find_dev if a device is disabled
and re-enable if it had previously been disabled through ata_scsi_slave_destroy.
And since the SCSI layer has removed the device it won't try to queue up any SCSI
commands to the ATA device until it has been re-added.

Comment 21 David Milburn 2008-11-11 14:29:26 UTC
Hi,

Would you please verify the kernel-2.6.9-78.17.EL.bz329201.2 test kernel? Thanks.

http://people.redhat.com/dmilburn/

Comment 22 David Milburn 2008-11-21 20:57:16 UTC
Hi,

Has NEC had a chance to verify the kernel-2.6.9-78.17.EL.bz329201.2 test kernel?

Thanks,
David

Comment 23 David Milburn 2008-12-05 22:57:43 UTC
Hello Jun'ichi,

Can you verify the kernel-2.6.9-78.17.EL.bz329201.2 test kernel?

http://people.redhat.com/dmilburn/

The patch in Comment#20 will need to be reviewed and accepted to make it into 4.8.

Thanks,
David

Comment 24 Jun'ichi NOMURA 2008-12-08 00:58:20 UTC
David,
sorry but I've left the on-site work and don't have access to the test machine..
So we have to ask the original reporter.

Benedikt, could you please respond to the comment #21?

Comment 25 Benedikt Schaefer 2008-12-08 07:51:01 UTC
Hi,

sorry for late reply.
I was on customers sites the last few weeks.
I haven't had time to test the kernel. I will see if I find a hardware to test it.

best regards
Benedikt

Comment 26 Benedikt Schaefer 2008-12-09 09:16:08 UTC
I have done a short test with kernel kernel-smp-2.6.9-78.17.EL.bz329201.2.x86_64.rpm.
But without success.

I disabled the HDD with command:
echo "scsi remove-single-device 1 0 0 0" > /proc/scsi/scsi

output from dmesg:
Synchronizing SCSI cache for disk sdb:
ata2.00: disabled
ata2: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
ata2: exception Emask 0x10 SAct 0x0 SErr 0x150000 action 0x2 frozen
ata2: hard resetting port

When I removed and added again HDD physicaly.

output from dmesg:
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-7, max UDMA/133, 160836480 sectors: LBA48 NCQ (depth 0/32)
ata2.00: ata2: dev 0 multi count 0
ata2.00: configured for UDMA/133
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete

I tried to add HDD to linux with:
echo "scsi add-single-device 1 0 0 0" > /proc/scsi/scsi
but nothing happens :-(

best regards
Benedikt

Comment 27 David Milburn 2008-12-09 21:13:27 UTC
Benedikt,

Thanks for retesting, I re-verified the test kernel on my ata_piix platform,
my drive re-attaches after the "scsi add-single-device" command.

Whick controller/driver are you using? Can you attach your dmesg output after booting?

Actually, for debug purposes, would you repeat your test without physically removing/adding the drive?

# dmesg -c > dmesg.boot
# cat /proc/scsi/scsi
# echo "scsi remove-single-device 1 0 0 0" > /proc/scsi/scsi
# dmesg -c > dmesg.remove
# cat /proc/scsi/scsi
# echo "scsi add-single-device 1 0 0 0" > /proc/scsi/scsi
# cat /proc/scsi/scsi
# dmesg -c > dmesg.add

I will also try and retest with a different controller/driver.

David

Comment 28 Benedikt Schaefer 2008-12-10 08:09:24 UTC
Hi David,

I have tested the following script on my machine:
------
#!/bin/bash
LOG=scsi.log
dmesg -c > dmesg.boot
echo --- Start --- >> $LOG
cat /proc/scsi/scsi >> $LOG
echo "scsi remove-single-device 1 0 0 0" > /proc/scsi/scsi
dmesg -c > dmesg.remove
echo --- sdb removed --- >> $LOG
cat /proc/scsi/scsi >> $LOG
echo "scsi add-single-device 1 0 0 0" > /proc/scsi/scsi
echo --- sdb added --- >> $LOG
cat /proc/scsi/scsi >> $LOG
dmesg -c > dmesg.add
echo --- Finished --- >> $LOG
------

This was working.

I will attached the log files also output from lspci from my machine.

best regards
Benedikt

Comment 29 Benedikt Schaefer 2008-12-10 08:10:42 UTC
Created attachment 326457 [details]
dmesg.add

Comment 30 Benedikt Schaefer 2008-12-10 08:11:03 UTC
Created attachment 326458 [details]
dmesg.boot

Comment 31 Benedikt Schaefer 2008-12-10 08:11:21 UTC
Created attachment 326459 [details]
dmesg.remove

Comment 32 Benedikt Schaefer 2008-12-10 08:11:42 UTC
Created attachment 326460 [details]
output of lspci

Comment 33 Benedikt Schaefer 2008-12-10 08:12:00 UTC
Created attachment 326461 [details]
scsi.log

Comment 34 David Milburn 2008-12-10 22:46:43 UTC
Hi Benedikt,

Thanks for the info, I see you are using ck804/sata_nv. I was able to successful
retest on a sata_promise system.

If you physically remove/add the drive in-between the "echo scsi-remove... and 
echo scsi-add...", are you consistently able to reproduce your results in
Comment#26? I also noticed that you got a link up message when you plugged the
drive back in, how many seconds are you waiting in-between physically removing
and physically adding back the drive?

Thanks again,
David

Comment 35 Benedikt Schaefer 2008-12-11 09:26:22 UTC
Dear David,

I repeated the test with physically remove/add the drive with the some result.
After remove/add I'm not able to add the drive by scsi add-single-device command.
Time during remove and add the drive was about 3 minutes.

best regards
Benedikt

Comment 36 David Milburn 2008-12-12 21:27:50 UTC
Hello Benedikt,

I was able to test the 2.6.9-78.17.EL.bz329201.2smp kernel on CK804 system

IDE interface: nVidia Corporation CK804 IDE (rev f3)
RAID bus controller: nVidia Corporation CK804 Serial ATA Controller (rev f3)
RAID bus controller: nVidia Corporation CK804 Serial ATA Controller (rev f4)

In this case, the controller/driver supports hotplug events so you should not
need to remove/add with the "echo scsi..." commands.

I mounted/unmounted a filesystem off one of my sata drives, then unplugged
the drive, and it becomes detached

ata3: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0x2 frozen
ata3: hard resetting port
ata3: SATA link down (SStatus 0 SControl 300)
ata3: failed to recover some devices, retrying in 5 secs
ata3: hard resetting port
ata3: SATA link down (SStatus 0 SControl 300)
ata3: failed to recover some devices, retrying in 5 secs
ata3: hard resetting port
ata3: SATA link down (SStatus 0 SControl 300)
ata3.00: disabled
ata3: EH complete
ata3.00: detaching (SCSI 2:0:0:0)
Synchronizing SCSI cache for disk sdb:
FAILED
  status = 0, message = 00, host = 4, driver = 00

The drive no longer shows up in /proc/scsi/scsi, then plugging back in

ata3: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata3: hard resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: ATA-8, max UDMA/133, 976773168 sectors: LBA48 NCQ (depth 0/32)
ata3.00: configured for UDMA/133
ata3: EH complete
  Vendor: ATA       Model: ST3500320AS       Rev: SD15
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB)
SCSI device sdb: drive cache: write back
SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB)
SCSI device sdb: drive cache: write back
 sdb: sdb1 sdb2
Attached scsi disk sdb at scsi2, channel 0, id 0, lun 0

The drive re-attaches and I was able to mount/unmount a filesystem on the
drive. I did try using the "echo scsi..." commands and "echo scsi add-..."
command didn't appear to do anything, this was because the drive had already
been re-attached when I plugged it back in.

So, would please repeat the test without the "echo scsi..." commands? (these
should only be needed if the controller/driver do not support hotplug events)

Thank you,
David

Comment 37 Benedikt Schaefer 2008-12-15 08:30:48 UTC
Dear David,

I have done the test.
removing the drive without scsi commands:
ata2: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: disabled
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
ata2.00: detaching (SCSI 1:0:0:0)
Synchronizing SCSI cache for disk sdb:
FAILED
  status = 0, message = 00, host = 1, driver = 00

drive not listed at /proc/scsi/scsi

plugged drive back
<3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-7, max UDMA/133, 160836480 sectors: LBA48 NCQ (depth 0/32)
ata2.00: ata2: dev 0 multi count 0
ata2.00: configured for UDMA/133
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete

But the drive was not activated.

best regards
Benedikt

Comment 38 David Milburn 2008-12-15 23:23:25 UTC
Hello Benedikt,

Would you please turn on scsi scan logging and see if the INQUIRY still fails
when you plug the drive back in (repeat steps in Comment #37), you should see
some scsi_scan messages in your dmesg output.

# echo 448 > /proc/sys/dev/scsi/logging_level

Thanks,
David

Comment 39 Benedikt Schaefer 2008-12-16 07:49:40 UTC
Dear David,

I have repeated the test.
- boot the machine
- set 'echo 448 > /proc/sys/dev/scsi/logging_level'
- remove sdb drive
dmesg output:
ata2: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: disabled
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
ata2.00: detaching (SCSI 1:0:0:0)
Synchronizing SCSI cache for disk sdb:
FAILED
  status = 0, message = 00, host = 1, driver = 00

- replug sdb drive
dmesg output:
<3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x150000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-7, max UDMA/133, 160836480 sectors: LBA48 NCQ (depth 0/32)
ata2.00: ata2: dev 0 multi count 0
ata2.00: configured for UDMA/133
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0
scsi scan: 1st INQUIRY failed with code 0x10000
scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0
scsi scan: 1st INQUIRY failed with code 0x10000
scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0
scsi scan: 1st INQUIRY failed with code 0x10000
scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0
scsi scan: 1st INQUIRY failed with code 0x10000
scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0
scsi scan: 1st INQUIRY failed with code 0x10000

scsi scan message repeats a lot of time.

drive is not listed under /proc/scsi/scsi.

best regards
Benedikt

Comment 40 David Milburn 2008-12-16 22:15:04 UTC
Hello Benedikt,

An inquiry failure code of 0x10000 means DID_NO_CONNECT, scsi_dispatch_cmd
is seeing the device state as SDEV_DEL and is setting the cmd result to
DID_NO_CONNECT which would explain the inquiry failures.

On my CK804 system, I see
ata_scsi_hotplug
 ata_scsi_scan_host
  __scsi_add_device
     scsi_probe_and_add_lun
      scsi_alloc_sdev - here device state is set to SDEV_CREATED

Will you attach your /proc/mdstat before and after unplugging the drive?

Also, before unplugging the device would you remove it using mdadm,
and see if the behavior is different?

mdadm -f /dev/md0 /dev/sdb
mdadm -r /dev/md0 /dev/sdb

Thanks,
David

Comment 41 Benedikt Schaefer 2008-12-17 07:40:48 UTC
Dear David,

sdb was not part of the software raid when I removed it.

best regards
Benedikt

Comment 42 David Milburn 2008-12-17 12:25:16 UTC
Ok, thanks Benedikt, I thought it might be related to md, I will need to 
build a second test kernel to gather more debug data since I am still unable
to reproduce the problem.

Comment 43 David Milburn 2008-12-17 23:24:17 UTC
Hello Benedikt,

Would you please install the kernel-smp-2.6.9-78.21.EL.bz329201.3?

http://people.redhat.com/dmilburn/

Please boot with kernel parameter "log_buf_len=1000000", please save off dmesg
output after booting

# dmesg -c > dmesg.boot
# echo 448 > /proc/sys/dev/scsi/logging_level

Unplug drive, make sure it no longer shows up in /proc/scsi/scsi

# dmesg -c > dmesg.remove

Plug drive in, 

# dmesg -c > dmesg.add

Please attach the dmesg output and I will compare to my platform.

Thanks for your patience,
David

Comment 44 Benedikt Schaefer 2008-12-18 08:46:09 UTC
Dear David,

I have done the test.

boot kernel-smp-2.6.9-78.21.EL.bz329201.3 with 
 ro root=/dev/md2 iommu=soft log_buf_len=1000000 console=tty0
dmesg -c > dmesg3.boot
echo 448 > /proc/sys/dev/scsi/logging_level
 Unplug drive, make sure it no longer shows up in /proc/scsi/scsi
dmesg -c > dmesg3.remove
 Plug drive in, 
dmesg -c > dmesg3.add

best regards
benedikt

Comment 45 Benedikt Schaefer 2008-12-18 08:46:53 UTC
Created attachment 327306 [details]
dmesg3.boot

Comment 46 Benedikt Schaefer 2008-12-18 08:47:37 UTC
Created attachment 327307 [details]
dmesg3.remove

Comment 47 Benedikt Schaefer 2008-12-18 08:49:35 UTC
Created attachment 327308 [details]
dmesg3.add

Comment 48 David Milburn 2008-12-18 19:57:51 UTC
Noting one difference I see when adding the drive, I am only seeing one
hotplug interrupt, and you are taking multiple, nv_ck804_freeze should 
have masked off interrupts for the port

===========dmesg from my sata_nv system===========================
NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6
ATA_PORT_FREEZE: nr_aborted 0
NV_CK804_FREEZE: port_no 1  mask 0xd
__ATA_PORT_FREEZE: ata4 port_no 1 frozen
NV_ERROR_HANDLER: port_no 1
==================================================================

Your system is taking many hotplug interrupts
==================================================================
4>__ATA_PORT_FREEZE: ata2 port_no 1 frozen
NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6
ATA_PORT_FREEZE: nr_aborted 0
NV_CK804_FREEZE: port_no 1  mask 0xd
__ATA_PORT_FREEZE: ata2 port_no 1 frozen
NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6
ATA_PORT_FREEZE: nr_aborted 0
NV_CK804_FREEZE: port_no 1  mask 0xd
__ATA_PORT_FREEZE: ata2 port_no 1 frozen
NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6
ATA_PORT_FREEZE: nr_aborted 0
NV_CK804_FREEZE: port_no 1  mask 0xd
__ATA_PORT_FREEZE: ata2 port_no 1 frozen
NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6
ATA_PORT_FREEZE: nr_aborted 0
NV_CK804_FREEZE: port_no 1  mask 0xd
__ATA_PORT_FREEZE: ata2 port_no 1 frozen
NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6
ATA_PORT_FREEZE: nr_aborted 0
NV_CK804_FREEZE: port_no 1  mask 0xd
__ATA_PORT_FREEZE: ata2 port_no 1 frozen

Comment 49 David Milburn 2008-12-19 01:13:28 UTC
Hello Benedikt,

Would you please retest with the kernel-smp-2.6.9-78.21.EL.bz329201.4 kernel?

This kernel will clear interrupts directly from nv_ck804_freeze, please attach
dmesg4.add.

Thank you,
David

Comment 50 Benedikt Schaefer 2008-12-19 07:54:31 UTC
Dear David 

I have repeated the test.
But nothing changed.

best regards
Benedikt

Comment 51 Benedikt Schaefer 2008-12-19 07:55:38 UTC
Created attachment 327418 [details]
dmesg4.add

Comment 52 David Milburn 2008-12-20 04:16:30 UTC
Hello Benedikt,

Please retest with the .5 test kernel, this time turn on additional scsi logging

# echo 509 > /proc/sys/dev/scsi/logging_level

And please attach dmesg.boot, dmesg.remove, and dmesg.add.

Thanks,
David

Comment 53 Benedikt Schaefer 2008-12-20 10:26:01 UTC
Dear David,

I'm at vacation until 12.01.2009.
I will do the test when I return.

best regards
Benedikt

Comment 54 Benedikt Schaefer 2009-01-16 12:17:52 UTC
Dear David,

I have done the test.
Same result as before.

best regards
Benedikt

Comment 55 Benedikt Schaefer 2009-01-16 12:18:38 UTC
Created attachment 329198 [details]
dmesg6.boot

Comment 56 Benedikt Schaefer 2009-01-16 12:19:03 UTC
Created attachment 329199 [details]
dmesg6.remove

Comment 57 Benedikt Schaefer 2009-01-16 12:19:31 UTC
Created attachment 329200 [details]
dmesg6.add

Comment 58 David Milburn 2009-01-16 22:04:39 UTC
Thank you Benedikt, I am looking over the current logs.

Comment 59 David Milburn 2009-01-20 20:48:24 UTC
Hello Benedikt,

When re-adding the drive, I can see from the logs that scsi_dispatch_cmd is
returning DID_NO_CONNECT because it finds the SHOST_CANCEL bit set, this bit
maybe set during the error recover to cancel outstanding IO. During the hotplug
process, if a device is found I have added code to clear this bit. Would you
please repeat the test using the .7 test kernel?

http://people.redhat.com/dmilburn/kernel-smp-2.6.9-78.21.EL.bz329201.7.x86_64.rpm 

Assuming that works would you also test the .8 test kernel? The .8 version is 
the same as .7 with no debug turned on, no need to test .8 if .7 fails.

http://people.redhat.com/dmilburn/kernel-smp-2.6.9-79.EL.bz329201.8.x86_64.rpm

Thank you,
David

Comment 60 Benedikt Schaefer 2009-01-21 07:58:58 UTC
Dear David,

I have done the test but it fails again.
I attached the logs.

Maybe it would be a good idea to do a fresh RHEL4.5 installation on my test machine to exclude side effects from tests before? 

best regards
Benedikt

Comment 61 Benedikt Schaefer 2009-01-21 07:59:38 UTC
Created attachment 329562 [details]
dmesg7.boot

Comment 62 Benedikt Schaefer 2009-01-21 08:00:05 UTC
Created attachment 329563 [details]
dmesg7.remove

Comment 63 Benedikt Schaefer 2009-01-21 08:00:33 UTC
Created attachment 329564 [details]
dmesg7.add

Comment 64 David Milburn 2009-01-21 23:08:13 UTC
Hello Benedikt,

From the logs, I can see that scsi_dispatch_cmd is still seeing the SHOST_CANCEL
bit set. 

ata_scsi_hotplug
 ata_scsi_scan_host
  __scsi_add_device
   scsi_probe_and_add_lun
    scsi_probe_lun <=INQUIRY fails because SHOST_CANCEL bit set in dispatch

The v8 test kernel clears this bit in ata_scsi_hotplug before ata_scsi_hotplug, 
please verify this works on your system. Thanks.

http://people.redhat.com/dmilburn/kernel-smp-2.6.9-79.EL.bz329201.8.x86_64.rpm

Comment 65 David Milburn 2009-01-21 23:10:29 UTC
Created attachment 329671 [details]
Patch to support hdd hotplug

Comment 66 Benedikt Schaefer 2009-01-22 07:56:23 UTC
Dear David,

I have done the test and it worked :-)

I will attached demsg8.add.

thanks.

best regards
Benedikt

Comment 67 Benedikt Schaefer 2009-01-22 07:56:54 UTC
Created attachment 329697 [details]
dmesg8.add

Comment 69 Vivek Goyal 2009-02-12 15:34:54 UTC
Committed in 81.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 71 Chris Ward 2009-03-27 14:18:40 UTC
~~ Attention Partners! Snap 1 Released ~~
RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. There should
be a fix present, which addresses this bug. NOTE: there is only a short time
left to test, please test and report back results on this bug
at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix. Please select your PartnerID from the
Verified field above. Please leave a comment with your test results details.
Include which arches tested, package version and any applicable logs.

 - Red Hat QE Partner Management

Comment 72 Chris Ward 2009-04-16 15:45:56 UTC
I confirmed that the patch NEC verified in comment #66 is included in the latest rhel 4.8 kernel (-88.EL)

Comment 74 errata-xmlrpc 2009-05-18 19:16:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html


Note You need to log in before you can comment on or make changes to this bug.