Description of problem:

It has been observed that during fabric or storage failures, the RHEL 5.7.z host's device-mapper-multipath daemon (multipathd) fails to update the path status of the LUNs correctly. "multipath -ll" and multipathd -k"show paths" report the path status of a few paths as failed ([failed][ready]) even though the SCSI device actually exists and is active. In other cases, "multipath -ll" reports the path status as active ([active][ready]) while the daemon's maps (multipathd -k"show paths") report the same path as failed ([failed][ready]). Both cases are explained below; a cross-check of the daemon's view against the kernel's is sketched after this report.

Case 1: When fabric faults are injected while I/O is running on a RHEL 5.7.z host, the multipathd daemon reports the path status of a few paths as failed even though the SCSI device exists and is active. The "multipath -ll" output looks like the following in such a scenario. Note SCSI device sdal:

360a98000486e53636934694457326548 dm-11 NETAPP,LUN
[size=10G][features=1 queue_if_no_path][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 1:0:0:37 sdca 68:224  [active][ready]
 \_ 0:0:0:37 sdal 66:80   [failed][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 0:0:1:37 sdex 129:144 [active][ready]
 \_ 1:0:1:37 sdfe 130:0   [active][ready]

# multipathd -k"show paths" | grep sdal
0:0:0:37 sdal 66:80 50 [failed][ready] X......... 2/20

The dm status of the sdal device is ready and the device is accessible from the host:

# dd if=/dev/sdal of=/dev/null
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 19.5835 seconds, 548 MB/s

# sg_inq /dev/sdal
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]  [AERC=0]  [TrmTsk=0]
  NormACA=1  HiSUP=1  Resp_data_format=2  SCCS=0  ACC=0  TPGS=1  3PC=0
  Protect=0  BQue=0  EncServ=0  MultiP=1 (VS=0)  [MChngr=0]  [ACKREQQ=0]
  Addr16=0  [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x0  QAS=0  IUS=0]
  length=117 (0x75)   Peripheral device type: disk
 Vendor identification: NETAPP
 Product identification: LUN
 Product revision level: 8020
 Unit serial number: HnSci4iDW2eH

Case 2: For a few other paths, "multipath -ll" reports the path status as active whereas the daemon's maps report it as failed. Note SCSI device sdfe:

# multipath -ll | grep sdfe
 \_ 1:0:1:37 sdfe 130:0   [active][ready]

# multipathd -k"show paths" | grep sdfe
1:0:1:37 sdfe 130:0 10 [failed][ready] XXXXX..... 10/20

All the SCSI devices are accessible from the host, but the entries in the multipath daemon are wrong.

Version-Release number of selected component (if applicable):
kernel: 2.6.18-274.18.1.el5
device-mapper-event-1.02.63-4.el5
device-mapper-1.02.63-4.el5
device-mapper-multipath-0.4.7-46.el5_7.2

How reproducible:
Frequent.

Steps to Reproduce:
1. Map 10 LUNs with 4 paths each.
2. Create a few LVs.
3. Create a filesystem on the LVs and start I/O to them.
4. Run fabric/switch/storage failures.

Actual results:
Path states are reported incorrectly.

Expected results:
Path states should be correct.

Additional info:
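A minimal way to check whether the stale state is confined to the daemon (a sketch only, reusing the map name 360a98000486e53636934694457326548 and path device sdal from the output above; substitute the names seen on the affected host) is to compare multipathd's view with the kernel's while the mismatch is present:

# multipathd -k"show paths" | grep sdal
# dmsetup status 360a98000486e53636934694457326548
# cat /sys/block/sdal/device/state

If the "dmsetup status" line for the multipath target marks the path as A (active) and sysfs reports the SCSI device state as "running" while multipathd still shows [failed], the incorrect state exists only in the daemon, not in the kernel.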
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
Can you still reproduce this? If so, can you please attach the output of

# multipath -ll
# multipathd -k"show config"

and the syslog output from when this occurs. Also, while this is occurring, could you run

# dmsetup status <devname>

This will let me know whether it's simply multipathd that doesn't have the correct status, or whether the status is really wrong in the kernel.
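For reference, the requested data could be collected along these lines (a sketch only; the output file names are arbitrary, and <devname> should be replaced with the affected map name, e.g. the dm-11 map from the report):

# multipath -ll > /tmp/multipath-ll.txt
# multipathd -k"show config" > /tmp/multipathd-config.txt
# dmsetup status <devname> > /tmp/dmsetup-status.txt
# grep multipathd /var/log/messages > /tmp/multipathd-syslog.txt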
Are you still able to hit this?
I am not able to recreate this behavior. If you are not able to reproduce this on the current packages, I'm going to close this bug.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.