Bug 1623601
| Summary: | Cannot execute multipath -F | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Yevhenii Shapovalov <yshapova> |
| Component: | iSCSI | Assignee: | Mike Christie <mchristi> |
| Status: | CLOSED ERRATA | QA Contact: | Manohar Murthy <mmurthy> |
| Severity: | high | Docs Contact: | Bara Ancincova <bancinco> |
| Priority: | high | | |
| Version: | 3.1 | CC: | agunn, ceph-eng-bugs, ceph-qe-bugs, dn-infra-peta-pers, hnallurv, jdillama, mchristi, mkasturi, tchandra, tpetr, tserlin, vakulkar, vumrao |
| Target Milestone: | z1 | Flags: | vakulkar: automate_bug+, mkasturi: needinfo+ |
| Target Release: | 3.1 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | tcmu-runner-1.4.0-0.3.el7cp | Doc Type: | Bug Fix |
Doc Text:

.The DM-Multipath device's path no longer bounces between the failed and active states, causing I/O failures, hangs, and performance issues

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during device discovery. As a consequence, the maximum number of retries was reached before the discovery process completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, where the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O and management operations on the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.

(A sketch for observing this behavior on an initiator follows the field table below.)
| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | | |
| | 1624040 (view as bug list) | Environment: | |
| Last Closed: | 2018-11-09 00:59:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1584264 | | |
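As referenced in the Doc Text above, the failure mode shows up on the initiator as paths flapping between the failed and active states while the kernel logs ALUA sense data. A minimal sketch for watching this, assuming a DM-Multipath map named `mpatha` and standard RHEL 7 tooling; the map name and exact log wording will differ per host and kernel version:

```bash
# Watch the multipath topology; paths cycling between "failed" and
# "active" across refreshes is the symptom described in the Doc Text.
watch -n 2 'multipath -ll mpatha'

# In a second shell, follow the kernel log for ALUA transition sense
# data and path events reported while the target holds the RBD
# exclusive lock (message wording varies by kernel version).
journalctl -k -f | grep -iE 'alua|asymmetric access state|path'
```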
Description
Yevhenii Shapovalov
2018-08-29 17:46:13 UTC
The complete sequence of logs can be viewed in Reportportal (log in as ceph/ceph):
http://cistatus.ceph.redhat.com/ui/#cephci/launches/New_filter%7Cpage.page=1&page.size=50&filter.cnt.name=iscsi&page.sort=start_time%2Cnumber%2CDESC/5b841e83b3cb1f000162d78d?page.page=1&page.size=50&page.sort=start_time%2CASC

Click on each test case for a logs link that shows the exact steps run: set up the cluster using ceph-ansible, create the iSCSI gateway using gwcli, then issue multipath commands and exercise I/O.

Jenkins run with the same issue:
https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/RHCS%203.x%20RHEL%207.5/job/ceph-iscsi-sanity-3.x-rhel7.5/31/consoleFull

How can I get the target and initiator side logs for these? I was going to say it looks like that recurring multipathd udev/systemd hang issue, but in the multipath -ll output you pasted, one of the devices has an APD and the other device has only one path up. Are you doing path-down tests during this test? Did multipathd crash by any chance?

(In reply to Yevhenii Shapovalov from comment #5)
> jenkins run with same issue
> https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/RHCS%203.x%20RHEL%207.5/job/ceph-iscsi-sanity-3.x-rhel7.5/31/consoleFull

This is a different issue, right? BZ https://bugzilla.redhat.com/show_bug.cgi?id=1623650

(In reply to Yevhenii Shapovalov from comment #5)
> jenkins run with same issue
> https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/RHCS%203.x%20RHEL%207.5/job/ceph-iscsi-sanity-3.x-rhel7.5/31/consoleFull

This looks like it is BZ https://bugzilla.redhat.com/show_bug.cgi?id=1623650

Mike, the failover test happens after this one. We purge the cluster and recreate the setup for the failover test, but the IPs remain the same. I have given you the iSCSI client IP and login details along with the Ceph cluster IP details, so we can keep the system in the same state for debugging. Thanks

It looks like this is caused by a RHEL 7.5 change to the alua driver:

    commit a553c7f7a6001c17d6575f182772add9a113ca92
    Author: Mike Snitzer <snitzer>
    Date:   Fri Oct 27 16:11:45 2017 -0400

        [scsi] scsi_dh_alua: Recheck state on unit attention

        Message-id: <1509120727-26247-37-git-send-email-snitzer>
        Patchwork-id: 195279
        O-Subject: [RHEL7.5 PATCH v2 36/58] scsi_dh_alua: Recheck state on unit attention
        Bugzilla: 1499107
        RH-Acked-by: Jerry Snitselaar <jsnitsel>
        RH-Acked-by: Ewan Milne <emilne>

        BZ: 1499107

        commit 2b35865e7a290d313c3d156c0c2074b4c4ffaf52
        Author: Hannes Reinecke <hare>
        Date:   Fri Feb 19 09:17:13 2016 +0100

            scsi_dh_alua: Recheck state on unit attention

            When we receive a unit attention code of 'ALUA state changed' we
            should recheck the state, as it might be due to an implicit ALUA
            state transition. This allows us to return NEEDS_RETRY instead of
            ADD_TO_MLQUEUE, allowing to terminate the retries after a certain
            time. At the same time a workqueue item might already be queued,
            which should be started immediately to avoid any delays.

            Reviewed-by: Bart Van Assche <bart.vanassche>
            Reviewed-by: Christoph Hellwig <hch>
            Signed-off-by: Hannes Reinecke <hare>
            Signed-off-by: Martin K. Petersen <martin.petersen>
            Signed-off-by: Rafael Aquini <aquini>

The problem is that we return "alua state transition" while grabbing the lock. This can take a second or two. In RHEL 7.4 the alua layer handled this error by retrying for up to around 5 minutes. With the change above we get 5 retries, but they are burned very quickly, in a matter of hundreds of milliseconds.

So what happens is: the I/O is failed, the path is failed, other paths are tried, and this repeats. multipathd then gets backed up handling path re-enablement, and this can go on for the entire test.

I can fix this with a one-line patch in tcmu-runner for 3.1.z. I am also going to work on an upstream/RHEL fix.
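For context, a minimal initiator-side sketch of the kind of run the comments above describe: log in to the gateways, drive I/O while gateway failover/failback happens, then attempt the flush the bug title refers to. The portal IPs, map name, and fio parameters are placeholders, and the ceph-ansible/gwcli gateway setup is assumed to already exist:

```bash
# Discover and log in to both iSCSI gateway portals (placeholder IPs).
iscsiadm -m discovery -t sendtargets -p 192.168.122.101
iscsiadm -m discovery -t sendtargets -p 192.168.122.102
iscsiadm -m node --login

# Confirm DM-Multipath assembled paths through both gateways.
multipath -ll

# Exercise I/O against the multipath device (placeholder map name)
# while failover/failback is exercised on the target side.
fio --name=iscsi-io --filename=/dev/mapper/mpatha --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=4k --iodepth=16 \
    --runtime=120 --time_based

# After I/O has stopped, try to flush the unused maps. On the affected
# builds multipathd is still busy failing and reinstating paths, so the
# flush from the bug title does not complete cleanly.
multipath -F
```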
*** Bug 1623650 has been marked as a duplicate of this bug. ***

@Mike, should this be in the 3.1 Release notes as a known issue?

*** Bug 1626836 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3530
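A short verification sketch, under the assumption of the Fixed In Version listed above, for checking that a gateway carries the fixed package and that the flush from the bug title now succeeds on an idle initiator:

```bash
# On each iSCSI gateway: confirm the errata build (or newer) is installed
# and that the service is running.
rpm -q tcmu-runner            # expect tcmu-runner-1.4.0-0.3.el7cp or later
systemctl status tcmu-runner --no-pager

# On the initiator, with no I/O running against the maps:
multipath -ll                 # paths should be steady, no flapping
multipath -F                  # flush should now return without error
multipath -ll                 # no maps remain after a successful flush
```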