Back to bug 1623601

Who When What Removed Added
Red Hat Bugzilla Rules Engine 2018-08-29 17:46:20 UTC Target Release 3.0 3.*
Vasu Kulkarni 2018-08-29 17:54:09 UTC Priority unspecified high
Target Release 3.* 3.1
CC vakulkar
Flags automate_bug+
Red Hat Bugzilla Rules Engine 2018-08-29 17:54:15 UTC Target Release 3.1 3.0
Vasu Kulkarni 2018-08-29 17:59:10 UTC Target Release 3.0 3.1
Jason Dillaman 2018-08-29 19:56:03 UTC CC jdillama
Target Milestone rc z1
Jason Dillaman 2018-08-29 19:56:44 UTC CC mkasturi
Flags needinfo?(mkasturi)
Madhavi Kasturi 2018-08-30 05:50:19 UTC Flags needinfo?(mkasturi) needinfo+
Mike Christie 2018-08-30 07:54:12 UTC Link ID Github https://github.com/open-iscsi/tcmu-runner/pull/471
Mike Christie 2018-08-30 17:41:14 UTC Status NEW POST
Mike Christie 2018-08-30 17:43:38 UTC Blocks 1624040
Mike Christie 2018-09-05 05:06:53 UTC Blocks 1624040
Harish NV Rao 2018-09-05 07:28:02 UTC CC hnallurv, mchristi
Flags needinfo?(mchristi)
Mike Christie 2018-09-05 16:01:33 UTC Doc Text Cause:

The RHEL 7.5 kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code ALUA State Transition. This is returned from the target side by tcmu-runner when it is taking the rbd exclusive lock during failover/failback and device discovery.

Consequence:

We can run out of retries before failover/discovery has completed, and the SCSI layer will return a failure to the multipath layer. The multipath layer will try another path and we can hit the same problem. The multipath layer will then bounce between paths, resulting in slow or failed IO and failing management operations to the multipath device. In the initiator side logs you will see messages about paths being failed and removed, then immediately re-added, while IO is being performed to the multipath device.

Workaround (if any):

The ALUA layer change was added in RHEL 7.5. Downgrading the initiator's kernel to the RHEL 7.4 kernel will work around the problem.

Result:

IO should not be failed from the SCSI layer to the multipath layer when performing IO and all paths are initially in the active and enabled dm-multipath state.
Doc Type If docs needed, set a value Known Issue
Flags needinfo?(mchristi)
Mike Christie 2018-09-05 20:30:37 UTC Status POST ASSIGNED
Vikhyat Umrao 2018-09-09 17:33:30 UTC CC vumrao
Tomas Petr 2018-09-12 06:40:56 UTC CC tpetr
Tomas Petr 2018-09-12 15:15:10 UTC CC tserlin
Flags needinfo?(mchristi)
Mike Christie 2018-09-12 15:42:46 UTC Link ID Github https://github.com/open-iscsi/tcmu-runner/pull/471 Github open-iscsi/tcmu-runner/pull/471
Flags needinfo?(mchristi)
Jason Dillaman 2018-09-12 17:54:05 UTC Flags needinfo?(mchristi)
Flags needinfo?(jdillama)
Flags needinfo?(mchristi) needinfo?(jdillama)
Mike Christie 2018-09-12 21:38:47 UTC Flags needinfo?(mchristi)
Flags needinfo?(mchristi)
Vikhyat Umrao 2018-09-12 22:14:02 UTC Flags needinfo?(mchristi)
Mike Christie 2018-09-13 06:29:09 UTC Flags needinfo?(mchristi)
Tomas Petr 2018-09-13 06:36:11 UTC Flags needinfo?(mchristi)
Vikhyat Umrao 2018-09-13 16:29:56 UTC Flags needinfo?(mchristi)
Mike Christie 2018-09-13 16:46:02 UTC Flags needinfo?(mchristi) needinfo?(mchristi)
Harish NV Rao 2018-09-17 12:19:14 UTC Blocks 1584264
Aron Gunn 2018-09-17 20:35:20 UTC CC agunn
Docs Contact agunn
Doc Text Cause:

The RHEL 7.5 kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code ALUA State Transition. This is returned from the target side by tcmu-runner when it is taking the rbd exclusive lock during failover/failback and device discovery.

Consequence:

We can run out of retries before failover/discovery has completed, and the SCSI layer will return a failure to the multipath layer. The multipath layer will try another path and we can hit the same problem. The multipath layer will then bounce between paths, resulting in slow or failed IO and failing management operations to the multipath device. In the initiator side logs you will see messages about paths being failed and removed, then immediately re-added, while IO is being performed to the multipath device.

Workaround (if any):

The ALUA layer change was added in RHEL 7.5. Downgrading the initiator's kernel to the RHEL 7.4 kernel will work around the problem.

Result:

IO should not be failed from the SCSI layer to the multipath layer when performing IO and all paths are initially in the active and enabled dm-multipath state.
.An iSCSI device is busy according to the `systemd-udevd` service

In the Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and when doing a device discovery. As a consequence, the maximum number of retries occurs before the discovery process has completed, and the SCSI layer will return a failure to the multipath IO layer. The multipath IO layer will try the next available path, and the same problem will occur. This causes a loop of path checking, resulting in failed IO and failed management operations to the multipath device. The logs on the initiator node will print messages about devices being removed and then re-added. To work around this issue, downgrade the initiator's kernel to Red Hat Enterprise Linux 7.4.
Aron Gunn 2018-09-17 20:43:07 UTC Doc Text .An iSCSI device is busy according to the `systemd-udevd` service

In the Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and when doing a device discovery. As a consequence, the maximum number of retries occurs before the discovery process has completed, and the SCSI layer will return a failure to the multipath IO layer. The multipath IO layer will try the next available path, and the same problem will occur. This causes a loop of path checking, resulting in failed IO and failed management operations to the multipath device. The logs on the initiator node will print messages about devices being removed and then re-added. To work around this issue, downgrade the initiator's kernel to Red Hat Enterprise Linux 7.4.
.An iSCSI device is busy according to the `systemd-udevd` service

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and when doing a device discovery. As a consequence, the maximum number of retries occurs before the discovery process has completed, and the SCSI layer will return a failure to the multipath IO layer. The multipath IO layer will try the next available path, and the same problem will occur. This causes a loop of path checking, resulting in failed IO and failed management operations to the multipath device. The logs on the initiator node will print messages about devices being removed and then re-added. To work around this issue, downgrade the initiator's kernel to Red Hat Enterprise Linux 7.4.
Mike Christie 2018-09-22 16:38:12 UTC Status ASSIGNED MODIFIED
Fixed In Version tcmu-runner-1.4.0-0.3.el7cp
errata-xmlrpc 2018-10-02 15:43:49 UTC CC dn-infra-peta-pers
Status MODIFIED ON_QA
Bara Ancincova 2018-10-10 17:08:03 UTC Docs Contact agunn bancinco
Doc Text .An iSCSI device is busy according to the `systemd-udevd` service

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and when doing a device discovery. As a consequence, the maximum number of retries occurs before the discovery process has completed, and the SCSI layer will return a failure to the multipath IO layer. The multipath IO layer will try the next available path, and the same problem will occur. This causes a loop of path checking, resulting in failed IO and failed management operations to the multipath device. The logs on the initiator node will print messages about devices being removed and then re-added. To work around this issue, downgrade the initiator's kernel to Red Hat Enterprise Linux 7.4.
In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
Doc Type Known Issue Bug Fix
Flags needinfo?(mchristi)
Bara Ancincova 2018-10-10 17:33:34 UTC Doc Text In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
.An iSCSI device is no longer busy according to the `systemd-udevd`

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
Mike Christie 2018-10-10 18:42:22 UTC Flags needinfo?(mchristi)
Bara Ancincova 2018-10-16 13:27:16 UTC Doc Text .An iSCSI device is no longer busy according to the `systemd-udevd`

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
.The `dm-multipath` device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
Bara Ancincova 2018-10-16 13:37:03 UTC Doc Text .The `dm-multipath` device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
.The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
Tejas 2018-10-22 06:01:34 UTC CC tchandra
QA Contact mkasturi mmurthy
Bara Ancincova 2018-10-23 16:59:19 UTC Doc Text .The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
.The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues _[fixed by 3.1z1]_

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
Manohar Murthy 2018-10-24 10:47:44 UTC Status ON_QA VERIFIED
Bara Ancincova 2018-11-05 18:58:30 UTC Doc Text .The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues _[fixed by 3.1z1]_

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
.The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues _[fixed in 3.1z1]_

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
Bara Ancincova 2018-11-06 16:34:31 UTC Doc Text .The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues _[fixed in 3.1z1]_

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
.The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues _[fixed by 3.1z1]_

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
Bara Ancincova 2018-11-07 19:12:45 UTC Doc Text .The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues _[fixed by 3.1z1]_

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
.The DM-Multipath device's path no longer bounces between the failed and active state causing I/O failures, hangs, and performance issues

In Red Hat Enterprise Linux 7.5, the kernel's ALUA layer reduced the number of times an initiator retries the SCSI sense code `ALUA State Transition`. This code is returned from the target side by the `tcmu-runner` service when taking the RBD exclusive lock during a failover or failback scenario and during a device discovery. As a consequence, the maximum number of retries had occurred before the discovery process was completed, and the SCSI layer returned a failure to the multipath I/O layer. The multipath I/O layer tried the next available path, and the same problem occurred. This behavior caused a loop of path checking, resulting in failed I/O operations and management operations to the multipath device. In addition, the logs on the initiator node printed messages about devices being removed and then re-added. This bug has been fixed, and the aforementioned operations no longer fail.
errata-xmlrpc 2018-11-08 18:50:16 UTC Status VERIFIED RELEASE_PENDING
errata-xmlrpc 2018-11-09 00:59:32 UTC Status RELEASE_PENDING CLOSED
Resolution --- ERRATA
Last Closed 2018-11-08 19:59:32 UTC
errata-xmlrpc 2018-11-09 01:00:30 UTC Link ID Red Hat Product Errata RHBA-2018:3530
