Bug 1502740
| Summary: | [iSCSI]: windows BSOD seen when OSD addition done during IO from windows | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Tejas <tchandra> |
| Component: | iSCSI | Assignee: | Mike Christie <mchristi> |
| Status: | CLOSED NOTABUG | QA Contact: | Tejas <tchandra> |
| Severity: | urgent | Docs Contact: | Erin Donnelly <edonnell> |
| Priority: | high | | |
| Version: | 3.0 | CC: | bniver, ceph-eng-bugs, ceph-qe-bugs, edonnell, hnallurv, jdillama, mchristi, tchandra |
| Target Milestone: | rc | | |
| Target Release: | 3.* | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |

Doc Text:

.Having more than one path from an initiator to an iSCSI gateway is not supported
In the iSCSI gateway, `tcmu-runner` might return the same inquiry and Asymmetric Logical Unit Access (ALUA) information for all iSCSI sessions to a target port group. This can cause the initiator or multipath layer to use the incorrect port information to reference the internal structures for paths and devices, which can result in failures, failover and failback failing, or incorrect multipath and SCSI log or tool output. Therefore, having more than one iSCSI session from an initiator to an iSCSI gateway is not supported.
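The unsupported configuration described in the Doc Text (more than one iSCSI session from an initiator to the same gateway) can be spotted on a Linux initiator by counting sessions per portal. A minimal sketch, assuming the usual `iscsiadm -m session` output of the form `tcp: [id] ip:port,tpgt iqn (type)`; the portal addresses and IQN in the sample are illustrative, not taken from this bug:

```python
import re
from collections import Counter

def duplicate_portals(iscsiadm_output):
    """Return (portal, target IQN) pairs that have more than one session.

    Parses lines like:
        tcp: [1] 192.168.122.10:3260,1 iqn.2003-01.com.redhat.iscsi-gw:ceph-igw (non-flash)
    """
    portals = []
    for line in iscsiadm_output.splitlines():
        m = re.match(r"\S+:\s+\[\d+\]\s+(\S+?),\d+\s+(\S+)", line)
        if m:
            portals.append((m.group(1), m.group(2)))  # (ip:port, target IQN)
    return [p for p, n in Counter(portals).items() if n > 1]

# Two sessions to the same gateway portal -- the unsupported case:
sample = """\
tcp: [1] 192.168.122.10:3260,1 iqn.2003-01.com.redhat.iscsi-gw:ceph-igw (non-flash)
tcp: [2] 192.168.122.10:3260,1 iqn.2003-01.com.redhat.iscsi-gw:ceph-igw (non-flash)
tcp: [3] 192.168.122.11:3260,2 iqn.2003-01.com.redhat.iscsi-gw:ceph-igw (non-flash)
"""
print(duplicate_portals(sample))
# → [('192.168.122.10:3260', 'iqn.2003-01.com.redhat.iscsi-gw:ceph-igw')]
```

An empty result means each gateway portal has exactly one session from this initiator, which is the supported layout.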
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-02-16 22:08:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1494421 | | |
Description
Tejas
2017-10-16 14:58:30 UTC
@Tejas: What does "pg remapping happening" mean? Are OSDs crashing, or are you manually doing things in the background to slow down the OSDs' responsiveness?

---

Hi Jason,

I added a new OSD node with 8 OSDs, so object redistribution is happening to the new OSDs. I did not manually do anything except add the OSDs.

> 3. After a while of IOs, also a windows initiator reboot, I am seeing 8 sessions for 4 TPGs. Not sure from where 4 extra sessions got created.

Seeing exactly 2 of each path.

---

Where do you see the extra sessions? The target side, the initiator side, or both? If on the target side, is it in gwcli or the configfs interface?

---

Tejas,

For the extra sessions: you have 4 extra sessions defined in the "Favorite Targets", so whenever you reboot or restart the iSCSI service you will get the extra sessions. Did you by any chance set up iSCSI targets and forget you had already set up some Favorite Targets?

We do not support multiple sessions to the same target port group from the same initiator, because tcmu-runner returns incorrect inquiry data. This will cause Windows failover/failback issues, but I am not sure if it would cause a crash. It could cause the wrong paths to be referenced, and it looks like during the test IO timed out and failovers were attempted.

Do you want me to fix up the Favorites? We should fix that and then rerun the test.

---

Mike,

Okay, let me try the same run tomorrow with just 4 sessions defined, and we can confirm if the crash was due to that.

---

Tejas,

Ok. Just FYI, I looked at the dmp, and it looks like the multiple sessions and bad inquiry data might be the cause of the crash. Here is the trace from the dmp. Of course we do not have the source, but going by the function names it seems like it might have been trying to update the ALUA TPG info, so we probably hit the bug I mentioned:

```
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x247
msdsm!DsmpUpdateTargetPortGroupEntry+0x3d9
msdsm!DsmpParseTargetPortGroupsInformation+0x18b
msdsm!DsmInquire+0xdc7
mpio!DsmPrx_INQUIRE_DRIVER+0x84
mpio!MPIOAddSingleDevice+0x1a6
mpio!MPIODeviceRegistration+0x94
mpio!MPIOFdoInternalDeviceControl+0xd4
mpio!MPIOFdoDispatch+0xa6
CLASSPNP!ClassSendIrpSynchronous+0x4d
CLASSPNP!ClassSendDeviceIoControlSynchronous+0xd9
CLASSPNP!ClasspMpdevStartDevice+0x165
CLASSPNP!ClassMpdevPnPDispatch+0x34e
nt!IoSynchronousCallDriver+0x51
nt!IoForwardIrpSynchronously+0x41
partmgr!PmStartDevice+0x70
partmgr!PmPnp+0x112
partmgr!PmGlobalDispatch+0x63
nt!PnpAsynchronousCall+0xe5
nt!PnpSendIrp+0x92
nt!PnpStartDevice+0x88
nt!PnpStartDeviceNode+0xdb
nt!PipProcessStartPhase1+0x53
nt!PipProcessDevNodeTree+0x401
nt!PiProcessReenumeration+0xa6
nt!PnpDeviceActionWorker+0x166
nt!ExpWorkerThread+0xe9
nt!PspSystemThreadStartup+0x41
nt!KiStartSystemThread+0x16
```

For the command timeout issue that started this, I think we might have to increase the command timers on the initiators.

---

@Mike: I thought the 25-second initiator timeout was chosen based upon ESX hard-coded limitations? Are you just suggesting increasing the timeout for Linux/Windows initiators?

---

OK -- so it sounds like we can close this as NOTABUG if it only occurs when Windows connects to the same target portal multiple times.

---

We can keep this open till MCS is implemented and then verify it.

---

Closing since it was a config issue that we have documented.
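For the "increase the command timers on the initiators" suggestion above: on a Linux open-iscsi initiator these timers live in `/etc/iscsi/iscsid.conf`. A sketch of the relevant knobs; the values shown are illustrative, not taken from this bug, and should be tuned against the gateway's failover timing:

```ini
# /etc/iscsi/iscsid.conf (Linux open-iscsi initiator) -- illustrative values.

# Seconds to wait for a dead/unresponsive path to recover before failing
# outstanding commands up to the multipath layer:
node.session.timeo.replacement_timeout = 25

# NOP-Out ping interval and timeout (seconds) used to detect an unresponsive
# target connection:
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
```

Changes apply to sessions created after the node records are (re)created; existing node records can be updated with `iscsiadm --op update`.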