Bug 329201
Summary: | scsi hot swapp mechanism not working with SATA HDD under RHEL4U5 | ||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Benedikt Schaefer <bschaefer> | ||||||||||||||||||||||||||||||||||||||
Component: | kernel | Assignee: | David Milburn <dmilburn> | ||||||||||||||||||||||||||||||||||||||
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> | ||||||||||||||||||||||||||||||||||||||
Severity: | high | Docs Contact: | |||||||||||||||||||||||||||||||||||||||
Priority: | high | ||||||||||||||||||||||||||||||||||||||||
Version: | 4.5 | CC: | cward, dmilburn, duck, i-kitayama, jbaron, jgarzik, jim, jnomura, junichi.nomura, peterm, prarit, riek, syeghiay, tao | ||||||||||||||||||||||||||||||||||||||
Target Milestone: | rc | Keywords: | Regression, Reopened | ||||||||||||||||||||||||||||||||||||||
Target Release: | --- | ||||||||||||||||||||||||||||||||||||||||
Hardware: | All | ||||||||||||||||||||||||||||||||||||||||
OS: | Linux | ||||||||||||||||||||||||||||||||||||||||
Whiteboard: | |||||||||||||||||||||||||||||||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||||||||||||||||||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||||||||||||||||||||||||||||
Clone Of: | Environment: | ||||||||||||||||||||||||||||||||||||||||
Last Closed: | 2009-05-18 19:16:36 UTC | Type: | --- | ||||||||||||||||||||||||||||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||||||||||||||||||||||||||||
Documentation: | --- | CRM: | |||||||||||||||||||||||||||||||||||||||
Verified Versions: | Category: | --- | |||||||||||||||||||||||||||||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||||||||||||||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||||||||||||||||||||||||
Embargoed: | |||||||||||||||||||||||||||||||||||||||||
Bug Depends On: | |||||||||||||||||||||||||||||||||||||||||
Bug Blocks: | 430698, 461297 | ||||||||||||||||||||||||||||||||||||||||
Attachments: |
|
Description
Benedikt Schaefer
2007-10-12 08:46:19 UTC
Prarit, the problem can be reproduced with Dell Precision WorkStation 380 and is likely a generic SATA problem, not specific to the reported hardware. When /proc/sys/dev/scsi/logging_level is set to 448, the following messages were observed after "add-single-device": scsi_scan_host_selected: <3:0:0:0> scsi scan: INQUIRY to host 3 channel 0 id 0 lun 0 scsi scan: 1st INQUIRY failed with code 0x40000 that means DID_BAD_TARGET. libata returns DID_BAD_TARGET because the device is disabled ("dev->class++" in ata_dev_disable()) in the course of "remove-single-device". A big change has made for libata in RHEL4 U5 and it causes this difference in behaviour. The problem doesn't occur on RHEL5 and FC7, where the device scan is done through different path (scsi_transport_template->user_scan()). Jun'ichi -- Can you verify that this worked with RHEL4U4 (or any earlier version of RHEL4)? P. Prarit, remove/add works with RHEL4 U4 (2.6.9-42.EL) (device is renamed after add, though). See below for samples. 2.6.9-42.ELsmp: # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi2 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi3 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: WDC WD800JD-19LS Rev: 06.0 Type: Direct-Access ANSI SCSI revision: 0 # echo 1 > /sys/block/sdd/device/delete # dmesg -c Synchronizing SCSI cache for disk sdd: # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi2 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 # echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi # dmesg -c Vendor: ATA Model: WDC WD800JD-19LS Rev: 06.0 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB) SCSI device sde: drive cache: write back SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB) SCSI device sde: drive cache: write back sde: sde1 sde2 sde3 sde4 Attached scsi disk sde at scsi3, channel 0, id 0, lun 0 # echo 1 > /sys/block/sde/device/delete # dmesg -c Synchronizing SCSI cache for disk sde: # echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi # dmesg -c Vendor: ATA Model: WDC WD800JD-19LS Rev: 06.0 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB) SCSI device sde: drive cache: write back SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB) SCSI device sde: drive cache: write back sde: sde1 sde2 sde3 sde4 Attached scsi disk sde at scsi3, channel 0, id 0, lun 0 # echo 1 > /sys/block/sde/device/delete # echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi2 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi3 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: WDC WD800JD-19LS Rev: 06.0 Type: Direct-Access ANSI SCSI revision: 05 Note that 'sdd' became 'sde' after the 1st "add". BTW, "remove-single-device" in the original report works only once with 2.6.9-42.ELsmp. # echo "scsi remove-single-device 3 0 0 0" > /proc/scsi/scsi # dmesg -c Synchronizing SCSI cache for disk sdd: # echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi # dmesg -c Vendor: ATA Model: WDC WD800JD-19LS Rev: 06.0 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB) SCSI device sde: drive cache: write back SCSI device sde: 156250080 512-byte hdwr sectors (80000 MB) SCSI device sde: drive cache: write back sde: sde1 sde2 sde3 sde4 Attached scsi disk sde at scsi3, channel 0, id 0, lun 0 # echo "scsi remove-single-device 3 0 0 0" > /proc/scsi/scsi # dmesg -c <nothing displayed and sde is not removed> As for the not-working case (RHEL4 U5 and RHEL4 U6 beta), 2.6.9-62.ELsmp: # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi2 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi3 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: WDC WD800JD-19LS Rev: 06.0 Type: Direct-Access ANSI SCSI revision: 05 # echo 1 > /sys/block/sdd/device/delete # dmesg -c Synchronizing SCSI cache for disk sdd: ata4.00: disabled # echo 'scsi add-single-device 3 0 0 0' > /proc/scsi/scsi # dmesg -c <nothing displayed> # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Host: scsi2 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: HDS728080PLA380 Rev: PF2O Type: Direct-Access ANSI SCSI revision: 05 Thanks jnomura :) -- one last question. The information in comment #4 was from the dell380-2.rhts.boston.redhat.com in RHTS? P. Your welcome, Prarit. No, the information was taken on my desktop machine. I'll let you know other machines available by e-mail. This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP. Do you have any update for me ? thanks best regards Benedikt Development Management has reviewed and declined this request. You may appeal this decision by reopening this request. Development will review this issue for resolution in RHEL 4.8. Updating PM score. Created attachment 323174 [details]
Patch to fix removing/adding hdd through /proc
The reason this fails is because ata_eh_detach_dev call ata_dev_disable which
changes the class from ATA_DEV_ATA or ATA_DEV_ATAPI to a disabled state,
which in turn causes ata_scsi_queuecmd to report DID_BAD_TARGET since
ata_scsi_find_dev reports no associated ATA device. RHEL5 and upstream code
use the scsi_transport user_scan function to allow a user-initiated bus
scan, but, this would require pulling in many other commits into RHEL4.8.
Instead, this patch sets an additional flag in ata_scsi_slave_destroy and
will check for this flag in ata_scsi_find_dev if a device is disabled
and re-enable if it had previously been disabled through ata_scsi_slave_destroy.
And since the SCSI layer has removed the device it won't try to queue up any SCSI
commands to the ATA device until it has been re-added.
Hi, Would you please verify the kernel-2.6.9-78.17.EL.bz329201.2 test kernel? Thanks. http://people.redhat.com/dmilburn/ Hi, Has NEC had a chance to verify the kernel-2.6.9-78.17.EL.bz329201.2 test kernel? Thanks, David Hello Jun'ichi, Can you verify the kernel-2.6.9-78.17.EL.bz329201.2 test kernel? http://people.redhat.com/dmilburn/ The patch in Comment#20 will need to be reviewed and accepted to make it into 4.8. Thanks, David David, sorry but I've left the on-site work and don't have access to the test machine.. So we have to ask the original reporter. Benedikt, could you please respond to the comment #21? Hi, sorry for late reply. I was on customers sites the last few weeks. I haven't had time to test the kernel. I will see if I find a hardware to test it. best regards Benedikt I have done a short test with kernel kernel-smp-2.6.9-78.17.EL.bz329201.2.x86_64.rpm. But without success. I disabled the HDD with command: echo "scsi remove-single-device 1 0 0 0" > /proc/scsi/scsi output from dmesg: Synchronizing SCSI cache for disk sdb: ata2.00: disabled ata2: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete ata2: exception Emask 0x10 SAct 0x0 SErr 0x150000 action 0x2 frozen ata2: hard resetting port When I removed and added again HDD physicaly. output from dmesg: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata2.00: ATA-7, max UDMA/133, 160836480 sectors: LBA48 NCQ (depth 0/32) ata2.00: ata2: dev 0 multi count 0 ata2.00: configured for UDMA/133 ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete I tried to add HDD to linux with: echo "scsi add-single-device 1 0 0 0" > /proc/scsi/scsi but nothing happens :-( best regards Benedikt Benedikt, Thanks for retesting, I re-verified the test kernel on my ata_piix platform, my drive re-attaches after the "scsi add-single-device" command. Whick controller/driver are you using? Can you attach your dmesg output after booting? Actually, for debug purposes, would you repeat your test without physically removing/adding the drive? # dmesg -c > dmesg.boot # cat /proc/scsi/scsi # echo "scsi remove-single-device 1 0 0 0" > /proc/scsi/scsi # dmesg -c > dmesg.remove # cat /proc/scsi/scsi # echo "scsi add-single-device 1 0 0 0" > /proc/scsi/scsi # cat /proc/scsi/scsi # dmesg -c > dmesg.add I will also try and retest with a different controller/driver. David Hi David, I have tested the following script on my machine: ------ #!/bin/bash LOG=scsi.log dmesg -c > dmesg.boot echo --- Start --- >> $LOG cat /proc/scsi/scsi >> $LOG echo "scsi remove-single-device 1 0 0 0" > /proc/scsi/scsi dmesg -c > dmesg.remove echo --- sdb removed --- >> $LOG cat /proc/scsi/scsi >> $LOG echo "scsi add-single-device 1 0 0 0" > /proc/scsi/scsi echo --- sdb added --- >> $LOG cat /proc/scsi/scsi >> $LOG dmesg -c > dmesg.add echo --- Finished --- >> $LOG ------ This was working. I will attached the log files also output from lspci from my machine. best regards Benedikt Created attachment 326457 [details]
dmesg.add
Created attachment 326458 [details]
dmesg.boot
Created attachment 326459 [details]
dmesg.remove
Created attachment 326460 [details]
output of lspci
Created attachment 326461 [details]
scsi.log
Hi Benedikt, Thanks for the info, I see you are using ck804/sata_nv. I was able to successful retest on a sata_promise system. If you physically remove/add the drive in-between the "echo scsi-remove... and echo scsi-add...", are you consistently able to reproduce your results in Comment#26? I also noticed that you got a link up message when you plugged the drive back in, how many seconds are you waiting in-between physically removing and physically adding back the drive? Thanks again, David Dear David, I repeated the test with physically remove/add the drive with the some result. After remove/add I'm not able to add the drive by scsi add-single-device command. Time during remove and add the drive was about 3 minutes. best regards Benedikt Hello Benedikt, I was able to test the 2.6.9-78.17.EL.bz329201.2smp kernel on CK804 system IDE interface: nVidia Corporation CK804 IDE (rev f3) RAID bus controller: nVidia Corporation CK804 Serial ATA Controller (rev f3) RAID bus controller: nVidia Corporation CK804 Serial ATA Controller (rev f4) In this case, the controller/driver supports hotplug events so you should not need to remove/add with the "echo scsi..." commands. I mounted/unmounted a filesystem off one of my sata drives, then unplugged the drive, and it becomes detached ata3: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0x2 frozen ata3: hard resetting port ata3: SATA link down (SStatus 0 SControl 300) ata3: failed to recover some devices, retrying in 5 secs ata3: hard resetting port ata3: SATA link down (SStatus 0 SControl 300) ata3: failed to recover some devices, retrying in 5 secs ata3: hard resetting port ata3: SATA link down (SStatus 0 SControl 300) ata3.00: disabled ata3: EH complete ata3.00: detaching (SCSI 2:0:0:0) Synchronizing SCSI cache for disk sdb: FAILED status = 0, message = 00, host = 4, driver = 00 The drive no longer shows up in /proc/scsi/scsi, then plugging back in ata3: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ata3: hard resetting port ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata3.00: ATA-8, max UDMA/133, 976773168 sectors: LBA48 NCQ (depth 0/32) ata3.00: configured for UDMA/133 ata3: EH complete Vendor: ATA Model: ST3500320AS Rev: SD15 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB) SCSI device sdb: drive cache: write back SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB) SCSI device sdb: drive cache: write back sdb: sdb1 sdb2 Attached scsi disk sdb at scsi2, channel 0, id 0, lun 0 The drive re-attaches and I was able to mount/unmount a filesystem on the drive. I did try using the "echo scsi..." commands and "echo scsi add-..." command didn't appear to do anything, this was because the drive had already been re-attached when I plugged it back in. So, would please repeat the test without the "echo scsi..." commands? (these should only be needed if the controller/driver do not support hotplug events) Thank you, David Dear David, I have done the test. removing the drive without scsi commands: ata2: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: disabled ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete ata2.00: detaching (SCSI 1:0:0:0) Synchronizing SCSI cache for disk sdb: FAILED status = 0, message = 00, host = 1, driver = 00 drive not listed at /proc/scsi/scsi plugged drive back <3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ata2: hard resetting port ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata2.00: ATA-7, max UDMA/133, 160836480 sectors: LBA48 NCQ (depth 0/32) ata2.00: ata2: dev 0 multi count 0 ata2.00: configured for UDMA/133 ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete But the drive was not activated. best regards Benedikt Hello Benedikt, Would you please turn on scsi scan logging and see if the INQUIRY still fails when you plug the drive back in (repeat steps in Comment #37), you should see some scsi_scan messages in your dmesg output. # echo 448 > /proc/sys/dev/scsi/logging_level Thanks, David Dear David, I have repeated the test. - boot the machine - set 'echo 448 > /proc/sys/dev/scsi/logging_level' - remove sdb drive dmesg output: ata2: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: disabled ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete ata2.00: detaching (SCSI 1:0:0:0) Synchronizing SCSI cache for disk sdb: FAILED status = 0, message = 00, host = 1, driver = 00 - replug sdb drive dmesg output: <3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x150000 action 0x2 frozen ata2: hard resetting port ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata2.00: ATA-7, max UDMA/133, 160836480 sectors: LBA48 NCQ (depth 0/32) ata2.00: ata2: dev 0 multi count 0 ata2.00: configured for UDMA/133 ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0 scsi scan: 1st INQUIRY failed with code 0x10000 scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0 scsi scan: 1st INQUIRY failed with code 0x10000 scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0 scsi scan: 1st INQUIRY failed with code 0x10000 scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0 scsi scan: 1st INQUIRY failed with code 0x10000 scsi scan: INQUIRY to host 1 channel 0 id 0 lun 0 scsi scan: 1st INQUIRY failed with code 0x10000 scsi scan message repeats a lot of time. drive is not listed under /proc/scsi/scsi. best regards Benedikt Hello Benedikt, An inquiry failure code of 0x10000 means DID_NO_CONNECT, scsi_dispatch_cmd is seeing the device state as SDEV_DEL and is setting the cmd result to DID_NO_CONNECT which would explain the inquiry failures. On my CK804 system, I see ata_scsi_hotplug ata_scsi_scan_host __scsi_add_device scsi_probe_and_add_lun scsi_alloc_sdev - here device state is set to SDEV_CREATED Will you attach your /proc/mdstat before and after unplugging the drive? Also, before unplugging the device would you remove it using mdadm, and see if the behavior is different? mdadm -f /dev/md0 /dev/sdb mdadm -r /dev/md0 /dev/sdb Thanks, David Dear David, sdb was not part of the software raid when I removed it. best regards Benedikt Ok, thanks Benedikt, I thought it might be related to md, I will need to build a second test kernel to gather more debug data since I am still unable to reproduce the problem. Hello Benedikt, Would you please install the kernel-smp-2.6.9-78.21.EL.bz329201.3? http://people.redhat.com/dmilburn/ Please boot with kernel parameter "log_buf_len=1000000", please save off dmesg output after booting # dmesg -c > dmesg.boot # echo 448 > /proc/sys/dev/scsi/logging_level Unplug drive, make sure it no longer shows up in /proc/scsi/scsi # dmesg -c > dmesg.remove Plug drive in, # dmesg -c > dmesg.add Please attach the dmesg output and I will compare to my platform. Thanks for your patience, David Dear David, I have done the test. boot kernel-smp-2.6.9-78.21.EL.bz329201.3 with ro root=/dev/md2 iommu=soft log_buf_len=1000000 console=tty0 dmesg -c > dmesg3.boot echo 448 > /proc/sys/dev/scsi/logging_level Unplug drive, make sure it no longer shows up in /proc/scsi/scsi dmesg -c > dmesg3.remove Plug drive in, dmesg -c > dmesg3.add best regards benedikt Created attachment 327306 [details]
dmesg3.boot
Created attachment 327307 [details]
dmesg3.remove
Created attachment 327308 [details]
dmesg3.add
Noting one difference I see when adding the drive, I am only seeing one hotplug interrupt, and you are taking multiple, nv_ck804_freeze should have masked off interrupts for the port ===========dmesg from my sata_nv system=========================== NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6 ATA_PORT_FREEZE: nr_aborted 0 NV_CK804_FREEZE: port_no 1 mask 0xd __ATA_PORT_FREEZE: ata4 port_no 1 frozen NV_ERROR_HANDLER: port_no 1 ================================================================== Your system is taking many hotplug interrupts ================================================================== 4>__ATA_PORT_FREEZE: ata2 port_no 1 frozen NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6 ATA_PORT_FREEZE: nr_aborted 0 NV_CK804_FREEZE: port_no 1 mask 0xd __ATA_PORT_FREEZE: ata2 port_no 1 frozen NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6 ATA_PORT_FREEZE: nr_aborted 0 NV_CK804_FREEZE: port_no 1 mask 0xd __ATA_PORT_FREEZE: ata2 port_no 1 frozen NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6 ATA_PORT_FREEZE: nr_aborted 0 NV_CK804_FREEZE: port_no 1 mask 0xd __ATA_PORT_FREEZE: ata2 port_no 1 frozen NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6 ATA_PORT_FREEZE: nr_aborted 0 NV_CK804_FREEZE: port_no 1 mask 0xd __ATA_PORT_FREEZE: ata2 port_no 1 frozen NV_HOST_INTR: HOTPLUGGED, freezing port 1 irq_stat 0x6 ATA_PORT_FREEZE: nr_aborted 0 NV_CK804_FREEZE: port_no 1 mask 0xd __ATA_PORT_FREEZE: ata2 port_no 1 frozen Hello Benedikt, Would you please retest with the kernel-smp-2.6.9-78.21.EL.bz329201.4 kernel? This kernel will clear interrupts directly from nv_ck804_freeze, please attach dmesg4.add. Thank you, David Dear David I have repeated the test. But nothing changed. best regards Benedikt Created attachment 327418 [details]
dmesg4.add
Hello Benedikt, Please retest with the .5 test kernel, this time turn on additional scsi logging # echo 509 > /proc/sys/dev/scsi/logging_level And please attach dmesg.boot, dmesg.remove, and dmesg.add. Thanks, David Dear David, I'm at vacation until 12.01.2009. I will do the test when I return. best regards Benedikt Dear David, I have done the test. Same result as before. best regards Benedikt Created attachment 329198 [details]
dmesg6.boot
Created attachment 329199 [details]
dmesg6.remove
Created attachment 329200 [details]
dmesg6.add
Thank you Benedikt, I am looking over the current logs. Hello Benedikt, When re-adding the drive, I can see from the logs that scsi_dispatch_cmd is returning DID_NO_CONNECT because it finds the SHOST_CANCEL bit set, this bit maybe set during the error recover to cancel outstanding IO. During the hotplug process, if a device is found I have added code to clear this bit. Would you please repeat the test using the .7 test kernel? http://people.redhat.com/dmilburn/kernel-smp-2.6.9-78.21.EL.bz329201.7.x86_64.rpm Assuming that works would you also test the .8 test kernel? The .8 version is the same as .7 with no debug turned on, no need to test .8 if .7 fails. http://people.redhat.com/dmilburn/kernel-smp-2.6.9-79.EL.bz329201.8.x86_64.rpm Thank you, David Dear David, I have done the test but it fails again. I attached the logs. Maybe it would be a good idea to do a fresh RHEL4.5 installation on my test machine to exclude side effects from tests before? best regards Benedikt Created attachment 329562 [details]
dmesg7.boot
Created attachment 329563 [details]
dmesg7.remove
Created attachment 329564 [details]
dmesg7.add
Hello Benedikt, From the logs, I can see that scsi_dispatch_cmd is still seeing the SHOST_CANCEL bit set. ata_scsi_hotplug ata_scsi_scan_host __scsi_add_device scsi_probe_and_add_lun scsi_probe_lun <=INQUIRY fails because SHOST_CANCEL bit set in dispatch The v8 test kernel clears this bit in ata_scsi_hotplug before ata_scsi_hotplug, please verify this works on your system. Thanks. http://people.redhat.com/dmilburn/kernel-smp-2.6.9-79.EL.bz329201.8.x86_64.rpm Created attachment 329671 [details]
Patch to support hdd hotplug
Dear David, I have done the test and it worked :-) I will attached demsg8.add. thanks. best regards Benedikt Created attachment 329697 [details]
dmesg8.add
Committed in 81.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/ ~~ Attention Partners! Snap 1 Released ~~ RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. There should be a fix present, which addresses this bug. NOTE: there is only a short time left to test, please test and report back results on this bug at your earliest convenience. If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager. If you have VERIFIED the bug fix. Please select your PartnerID from the Verified field above. Please leave a comment with your test results details. Include which arches tested, package version and any applicable logs. - Red Hat QE Partner Management I confirmed that the patch NEC verified in comment #66 is included in the latest rhel 4.8 kernel (-88.EL) An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html |