49379 – Unable to open() a disconnected LUN

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-07-18 20:06 UTC by Wayne Berthiaume
Modified:	2008-08-01 16:22 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:39:05 UTC
Embargoed:

Description Wayne Berthiaume 2001-07-18 20:06:31 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.75 [en] (X11; U; Linux 2.2.17-14smp i686)

Description of problem:
Unable to open() an sg node associated with a disconnected LUN (PQ/PDType
001/00000). /var/log/messages says it is unable to read the partition
table. In RH 6.2 (lk 2.2.16-3 and 2.2.14-12) we were able to open the LUN.

RH 6.2 - /var/log/messages would report that it was unable to read the
partition table and READ CAPACITY FAILED; however, it would assign a default
block size of 512 and a disk size of 1GB. /proc/scsi/sg/debug would have
both an entry and data for the device.
RH 7.0 - /proc/scsi/sg/debug has an entry for the device, but no data.

Further, the SCSI targets (i.e., 3:0:0:0) are assigned an sd device node in
both RH 6.2 and 7.0; however, 7.0 has no device listed in /proc/partitions.

The disconnected LUN is reported by the disk array's system process. The
array is identified by the SCSI INQUIRY as a DGC RAID ANSI 4 device. Without
being able to open() the sg device node, the management software, EMC's
Navisphere, is unable to manage the array. This places you in a quandary if
the array has no data LUNs existing on both of its system processors to
start with.

Prior to RH 7.0 the EMC CLARiiON disk array was not listed in the SCSI scan
BLIST. It is a sparse LUN device, so it does belong there, and I don't
believe it is the reason the problem exists. The issue is that although a
device node is assigned by both sd() and sg(), you cannot open() the device.
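
For readers unfamiliar with the PQ/PDType notation used above, here is a minimal sketch of how byte 0 of standard SCSI INQUIRY data splits into a 3-bit peripheral qualifier and a 5-bit peripheral device type. The function names and the short descriptions in the table are illustrative, not taken from the kernel or from Navisphere; the value 0x20 corresponds to the 001/00000 case in this report.

```python
# Sketch: split byte 0 of standard SCSI INQUIRY data into its two fields.
# Helper names and the wording of the descriptions are illustrative only.
PQ_MEANINGS = {
    0b000: "device connected to this LUN",
    0b001: "device supported but not connected to this LUN",
    0b011: "target cannot support a device on this LUN",
}


def decode_inquiry_byte0(byte0):
    """Return (peripheral_qualifier, peripheral_device_type)."""
    pq = (byte0 >> 5) & 0x07   # bits 7..5
    pdt = byte0 & 0x1F         # bits 4..0
    return pq, pdt


def describe(byte0):
    """Format the two fields the way this report writes them (PQ/PDType)."""
    pq, pdt = decode_inquiry_byte0(byte0)
    meaning = PQ_MEANINGS.get(pq, "reserved or vendor specific")
    return f"PQ={pq:03b} PDType={pdt:05b}: {meaning}"


if __name__ == "__main__":
    # 0x20 is the disconnected-LUN case described in this report.
    print(describe(0x20))
```

A PDType of 00000 is a direct-access device, which is consistent with the "Direct-Access" type the kernel logs for the DGC array.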

How reproducible:
Always

Steps to Reproduce:
1. Load both the clariion-attach and navisphere RPM packages. I can supply
these, or you can obtain them from EMC Tech Support.
2. Install the driver and Navisphere per EMC documentation.
3. modprobe qla2x00smp (if SMP system) or modprobe qla2x00 (non-SMP)
4. /etc/rc.d/init.d/naviagent start
5. Examine /proc/scsi/sg/debug for the sg devices associated with the SCSI
targets assigned when the Qlogic QLA2x00 driver was initialized. There
should be several lines of information. If there is a single line, you will
be unable to open the device. strace() will verify that you were unable to
open the device.
	

Actual Results:  strace() shows open() failed with an ENXIO error - No such
device or address

Expected Results:  strace() shows open() succeeded in opening the
/dev/sg<alpha device name> as O_RDONLY and the /dev/sg<numeric name> as
O_RDONLY; O_RDWR | O_NONBLOCK; and O_RDONLY | O_NONBLOCK.
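
The open() attempts described above can be reproduced without Navisphere. The following is a minimal sketch, not EMC's code: it tries the three flag combinations named in the expected results and reports the errno name for each. The function name `probe` is an assumption made for illustration, and the device paths in the usage section are the ones that appear in this report.

```python
import errno
import os

# Flag combinations named in the expected results of this report.
FLAG_SETS = [
    ("O_RDONLY", os.O_RDONLY),
    ("O_RDWR | O_NONBLOCK", os.O_RDWR | os.O_NONBLOCK),
    ("O_RDONLY | O_NONBLOCK", os.O_RDONLY | os.O_NONBLOCK),
]


def probe(path):
    """Try each flag set on path; return [(label, 'ok' or errno name)]."""
    results = []
    for label, flags in FLAG_SETS:
        try:
            fd = os.open(path, flags)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            results.append((label, name))
        else:
            os.close(fd)
            results.append((label, "ok"))
    return results


if __name__ == "__main__":
    # Node names taken from this report; adjust for the system under test.
    for node in ("/dev/sg4", "/dev/sge"):
        for label, outcome in probe(node):
            print(f"{node} {label}: {outcome}")
```

On an affected kernel the failing nodes would be expected to report ENXIO, matching the actual results above; a missing node file would report ENOENT instead, which helps separate the two cases.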


Additional info:

/usr/src/linux/drivers/scsi/sg.c is version 3.00.10, which the EMC Navisphere
software is linked against during compilation.
/var/log/messages:
Jul 18 13:50:01 Linux88 kernel: (scsi): Found a QLA2300  @ bus 0, device 0x9, irq 19, iobase 0x2000
Jul 18 13:50:01 Linux88 kernel: scsi(2): Configure NVRAM parameters...
Jul 18 13:50:01 Linux88 kernel: scsi(2): Verifying loaded RISC code...
Jul 18 13:50:01 Linux88 kernel: scsi(2): Verifying chip...
Jul 18 13:50:01 Linux88 kernel: scsi(2): Waiting for LIP to complete...
Jul 18 13:50:01 Linux88 CROND[931]: (root) CMD (   /sbin/rmmod -as)
Jul 18 13:50:30 Linux88 kernel: scsi(2): LOOP UP detected
Jul 18 13:50:30 Linux88 kernel: scsi(2): Waiting for LIP to complete...
Jul 18 13:50:30 Linux88 kernel: scsi2: Topology - (F_Port), Host Loop address  0xffff
Jul 18 13:50:30 Linux88 kernel: qla2100: Performing ISP error recovery - ha= bf440078
Jul 18 13:50:30 Linux88 kernel: scsi(2): Waiting for LIP to complete...
Jul 18 13:50:30 Linux88 kernel: scsi(2): Waiting for LIP to complete...
Jul 18 13:50:30 Linux88 kernel: qla2100_configure_hba: [ERROR] Get host loop ID  failed
Jul 18 13:50:30 Linux88 kernel: scsi-qla0-adapter-node=200000e08b04cec4;
Jul 18 13:50:30 Linux88 kernel: scsi-qla0-adapter-port=210000e08b04cec4;
Jul 18 13:50:30 Linux88 kernel: scsi-qla0-target-0=500601608802b398;
Jul 18 13:50:30 Linux88 kernel: (scsi): Found a QLA2300  @ bus 0, device 0xb, irq 18, iobase 0x2400
Jul 18 13:50:30 Linux88 kernel: scsi(3): Configure NVRAM parameters...
Jul 18 13:50:35 Linux88 kernel: scsi(2): LOOP UP detected
Jul 18 13:50:35 Linux88 kernel: scsi(3): Verifying loaded RISC code...
Jul 18 13:50:35 Linux88 kernel: scsi(3): Verifying chip...
Jul 18 13:50:35 Linux88 kernel: scsi(3): Waiting for LIP to complete...
Jul 18 13:50:35 Linux88 kernel: scsi(3): LOOP UP detected
Jul 18 13:50:35 Linux88 kernel: scsi3: Topology - (F_Port), Host Loop address  0xffff
Jul 18 13:50:35 Linux88 kernel: scsi(2): Waiting for LIP to complete...
Jul 18 13:50:35 Linux88 kernel: scsi2: Topology - (F_Port), Host Loop address  0xffff
Jul 18 13:50:35 Linux88 kernel: scsi(3): Waiting for LIP to complete...
Jul 18 13:50:36 Linux88 kernel: scsi3: Topology - (F_Port), Host Loop address  0xffff
Jul 18 13:50:36 Linux88 kernel: scsi-qla1-adapter-node=200000e08b04cfc4;
Jul 18 13:50:36 Linux88 kernel: scsi-qla1-adapter-port=210000e08b04cfc4;
Jul 18 13:50:36 Linux88 kernel: scsi-qla1-target-0=500601688802b398;
Jul 18 13:50:36 Linux88 kernel: scsi2 : QLogic QLA2300 PCI to Fibre Channel Host Adapter: bus 0 device 9 irq 19
Jul 18 13:50:36 Linux88 kernel:         Firmware version:  3.00.23, Driver version 4.33b
Jul 18 13:50:36 Linux88 kernel: scsi3 : QLogic QLA2300 PCI to Fibre Channel Host Adapter: bus 0 device 11 irq 18
Jul 18 13:50:36 Linux88 kernel:         Firmware version:  3.00.23, Driver version 4.33b
Jul 18 13:50:36 Linux88 kernel: scsi : 4 hosts.
Jul 18 13:50:36 Linux88 kernel:   Vendor: DGC       Model:                   Rev: 0524
Jul 18 13:50:36 Linux88 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 04
Jul 18 13:50:36 Linux88 kernel: Detected scsi disk sdd at scsi2, channel 0, id 0, lun 0
Jul 18 13:50:36 Linux88 kernel: scsi(2:0:0:0): Enabled tagged queuing, queue depth 16.
Jul 18 13:50:36 Linux88 kernel:   Vendor: DGC       Model:                   Rev: 0524
Jul 18 13:50:36 Linux88 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 04
Jul 18 13:50:36 Linux88 kernel: Detected scsi disk sde at scsi3, channel 0, id 0, lun 0
Jul 18 13:50:36 Linux88 kernel: scsi(3:0:0:0): Enabled tagged queuing, queue depth 16.
Jul 18 13:50:36 Linux88 kernel:  sdd:scsidisk I/O error: dev 08:30, sector 0
Jul 18 13:50:36 Linux88 kernel:  unable to read partition table
Jul 18 13:50:36 Linux88 kernel:  sde:scsidisk I/O error: dev 08:40, sector 0
Jul 18 13:50:36 Linux88 kernel:  unable to read partition table

strace() of navisphere start:
1060  [2abd5354] open("/dev/sg4", O_RDONLY) = -1 ENXIO (No such device or address) <0.000018>
1060  [2abd5354] open("/dev/sge", O_RDONLY) = -1 ENXIO (No such device or address) <0.000013>
1060  [2abd5354] open("/dev/sg5", O_RDONLY) = -1 ENXIO (No such device or address) <0.000015>
1060  [2abd5354] open("/dev/sgf", O_RDONLY) = -1 ENXIO (No such device or address) <0.000011>
(The two disconnected LUNs - SPa and SPb of the array)
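
When the strace capture is long, the failed open() calls can be pulled out mechanically. This is a rough sketch that assumes the strace line layout shown above; the regular expression and the function name `failed_opens` are illustrative and may need adjusting for other strace versions or options.

```python
import re

# Matches e.g.:  open("/dev/sg4", O_RDONLY) = -1 ENXIO (No such device ...
# The errno name is optional because successful calls do not carry one.
OPEN_RE = re.compile(
    r'open\("(?P<path>[^"]+)",\s*(?P<flags>[^)]+)\)'
    r'\s*=\s*(?P<ret>-?\d+)(?:\s+(?P<err>E[A-Z]+))?'
)


def failed_opens(strace_text):
    """Return [(path, flags, errno_name)] for every open() that returned < 0."""
    failures = []
    for m in OPEN_RE.finditer(strace_text):
        if int(m.group("ret")) < 0:
            failures.append(
                (m.group("path"), m.group("flags").strip(), m.group("err"))
            )
    return failures


if __name__ == "__main__":
    sample = (
        '1060  [2abd5354] open("/dev/sg4", O_RDONLY) = -1 ENXIO '
        "(No such device or address) <0.000018>\n"
    )
    for path, flags, err in failed_opens(sample):
        print(path, flags, err)
```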

[root@Linux88 /root]# cat /proc/scsi/sg/debug 
dev_max=57 max_active_device=6 (origin 1)
 scsi_dma_free_sectors=144 sg_pool_secs_aval=320 def_reserved_size=32768
 >>> device=0(sga) scsi0 chan=0 id=2 lun=0   em=0 sg_tablesize=128 excl=0
   FD(1): timeout=6000 bufflen=32768 (res)sgat=0 low_dma=0
   cmd_q=0 f_packid=0 k_orphan=0 closed=0
     No requests active
   FD(2): timeout=6000 bufflen=32768 (res)sgat=0 low_dma=0
   cmd_q=0 f_packid=0 k_orphan=0 closed=0
     No requests active
 >>> device=1(sgb) scsi0 chan=0 id=3 lun=0   em=0 sg_tablesize=128 excl=0
   FD(1): timeout=6000 bufflen=32768 (res)sgat=0 low_dma=0
   cmd_q=0 f_packid=0 k_orphan=0 closed=0
     No requests active
   FD(2): timeout=6000 bufflen=32768 (res)sgat=0 low_dma=0
   cmd_q=0 f_packid=0 k_orphan=0 closed=0
     No requests active
 >>> device=2(sgc) scsi0 chan=0 id=4 lun=0   em=0 sg_tablesize=128 excl=0
   FD(1): timeout=6000 bufflen=32768 (res)sgat=0 low_dma=0
   cmd_q=0 f_packid=0 k_orphan=0 closed=0
     No requests active
   FD(2): timeout=6000 bufflen=32768 (res)sgat=0 low_dma=0
   cmd_q=0 f_packid=0 k_orphan=0 closed=0
     No requests active
 >>> device=3(sgd) scsi0 chan=0 id=9 lun=0   em=0 sg_tablesize=128 excl=0
   FD(1): timeout=6000 bufflen=32768 (res)sgat=0 low_dma=0
   cmd_q=0 f_packid=0 k_orphan=0 closed=0
     No requests active
   FD(2): timeout=6000 bufflen=32768 (res)sgat=0 low_dma=0
   cmd_q=0 f_packid=0 k_orphan=0 closed=0
     No requests active
 >>> device=4(sge) scsi2 chan=0 id=0 lun=0   em=0 sg_tablesize=32 excl=0
 >>> device=5(sgf) scsi3 chan=0 id=0 lun=0   em=0 sg_tablesize=32 excl=0
[root@Linux88 /root]# 
(sge and sgf are the disconnected LUNs.)
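
The check from step 5 (a device with only a single line in /proc/scsi/sg/debug) can be automated. The sketch below assumes the sg 3.00.x debug layout shown above, where each device starts with a ">>> device=" line and each open file handle adds an "FD(" line; the function name `devices_without_fds` is made up for illustration. Note that a device with no FD lines only shows that nothing holds it open at that moment, so this is a hint to follow up with strace, not proof of the bug.

```python
import re

# Start-of-device line, e.g. " >>> device=4(sge) scsi2 chan=0 id=0 lun=0 ..."
DEVICE_RE = re.compile(r"^\s*>>> device=(\d+)\((\w+)\)")


def devices_without_fds(debug_text):
    """Return sg names whose device line is not followed by any 'FD(' line."""
    idle = []
    current = None    # sg name of the device block being read
    has_fd = False
    for line in debug_text.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            # A new device block starts; close out the previous one.
            if current is not None and not has_fd:
                idle.append(current)
            current, has_fd = m.group(2), False
        elif current is not None and line.lstrip().startswith("FD("):
            has_fd = True
    if current is not None and not has_fd:
        idle.append(current)
    return idle


if __name__ == "__main__":
    with open("/proc/scsi/sg/debug") as fh:
        print(devices_without_fds(fh.read()))
```

Run against the output above, this would list sge and sgf, the two nodes that strace shows failing with ENXIO.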

Comment 1 Wayne Berthiaume 2001-07-20 19:09:39 UTC

Have just completed testing on RH6.2 lk 2.2.16-3 and a disconnected LUN can be
opened by sg(). Tested RH7.0 lk 2.2.16-22, 2.2.17-14, and 2.2.19-7.0.1 and they
all fail when sg tries to open the disconnected LUN. I further tested RH7.1 lk
2.4.2-2 and was unable to open the disconnected LUN. All failures were the same
as above. I still suspect the change that is causing the problem occurred in the
SCSI midlayer used in RH7.0 and 7.1. We've turned on SCSI logging in hopes of
gathering further information but can't seem to figure out how to get useful
information out of the debugging output. We're using the scan token,
believing the problem exists somewhere in this area of the code. One of the
problems we're encountering with SCSI logging is that we have multiple Qlogic
QLA/2200FC HBAs in the system, so the information that is pushed to
/var/log/messages from one HBA gets stepped on by the other HBA; it is not
complete and, at times, isn't intelligible. I hope this additional information
provides further insight into the problem.

Comment 2 Arjan van de Ven 2001-07-20 20:05:33 UTC

Doug: any ideas ?

Comment 3 Doug Ledford 2001-08-02 18:32:05 UTC

Yeah, I'm pretty sure what the problem is, and what patch exactly caused it.
The linux-2.4.2-scsi_scan.patch in the 2.4 kernel RPM is the cause of the
problem.  However, it went in specifically to solve another problem (some devices
report lots of offline drives in the sparse space, including the CLARiiON arrays
that Wayne is using, so that if you don't include this patch, you end up with
254 offline entries in the SCSI device list on some arrays).  In short, it's an
inconsistent usage of the offline status in the SCSI Inquiry data that is
causing this problem, and I don't see any good answer.  With the patch you have
problems, and without the patch you have problems.  My preferred choice is to
leave the patch and make configuration tools go through whatever device is at
LUN0 on the chassis for proper configuration, but I don't know enough about the
current setup Wayne is using to say if that's possible.

Comment 4 Bugzilla owner 2004-09-30 15:39:05 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
