Bug 997570 - multipath may hang if tur checker thread becomes unresponsive
Summary: multipath may hang if tur checker thread becomes unresponsive
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ben Marzinski
QA Contact: yanfu,wang
URL:
Whiteboard:
Depends On:
Blocks: 1069434 1069435 1109995
 
Reported: 2013-08-15 15:50 UTC by Bryn M. Reeves
Modified: 2019-07-11 07:44 UTC (History)
CC List: 20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If the TUR checker thread becomes unresponsive, multipathd was starting a synchronous TUR check, which could also block. Consequence: Multipathd could hang when a device that uses the TUR checker failed. Fix: Multipathd no longer starts a synchronous TUR check when the TUR checker thread becomes unresponsive. It simply marks the path as failed. Result: Multipathd will no longer hang when a device using the TUR checker fails.
Clone Of:
Clones: 1109995
Environment:
Last Closed: 2014-10-14 07:41:25 UTC
Target Upstream Version:
Embargoed:


Attachments
Do not fall back to sync TURs when pthread_cancel fails (837 bytes, patch)
2013-08-15 15:51 UTC, Bryn M. Reeves


Links
System: Red Hat Product Errata
ID: RHBA-2014:1555
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: device-mapper-multipath bug fix and enhancement update
Last Updated: 2014-10-14 01:27:56 UTC

Description Bryn M. Reeves 2013-08-15 15:50:47 UTC
Description of problem:
Multipath currently has logic in the TUR checker's checker thread handling to fall back to synchronous mode in the event that the thread stops responding (i.e. a pthread_cancel() fails):

                if (ct->thread) {
                        /* pthread cancel failed. continue in sync mode */
                        pthread_mutex_unlock(&ct->lock);
                        condlog(3, "%d:%d: tur thread not responding, "
                                "using sync mode", TUR_DEVT(ct));
                        return tur_check(c->fd, c->timeout, c->message,
                                         ct->wwid);
                }

This turns out to be the wrong thing to do: if the async thread has gone off into the weeds (generally it's blocked in the kernel while SCSI EH work takes place) the last thing we want to do is to cause multipathd to block in the same place.

Instead just report the path as down.
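
The behaviour requested above can be sketched in isolation roughly as follows. This is only an illustration and not the attached patch: the struct, the enum values and the function name (tur_ctx_sketch, PATH_DOWN_SKETCH, PATH_UP_SKETCH, tur_poll_sketch) are hypothetical stand-ins for the real multipath-tools definitions, and the stderr message replaces condlog().

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-ins for the multipath-tools path states. */
enum path_state_sketch { PATH_DOWN_SKETCH = 0, PATH_UP_SKETCH = 1 };

/* Hypothetical, reduced version of the TUR checker context. */
struct tur_ctx_sketch {
	pthread_mutex_t lock;
	int thread_running;	/* nonzero while the async checker is still alive */
	int last_state;		/* result stored by the last completed async check */
};

/*
 * Poll the checker. If the async thread is still around (the cancel did
 * not take effect), report the path as down instead of issuing a
 * synchronous TUR that could block in the same place.
 */
static int tur_poll_sketch(struct tur_ctx_sketch *ct)
{
	int state;

	pthread_mutex_lock(&ct->lock);
	if (ct->thread_running) {
		pthread_mutex_unlock(&ct->lock);
		fprintf(stderr, "tur thread not responding, marking path down\n");
		return PATH_DOWN_SKETCH;
	}
	state = ct->last_state;
	pthread_mutex_unlock(&ct->lock);
	return state;
}
```

The point of the sketch is that the "thread still present" branch never touches the device: it only drops the lock, logs, and returns a failed state, so the caller cannot end up blocked behind the same I/O as the stuck thread.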

Version-Release number of selected component (if applicable):
multipath-tools-0.4.9-*.el6

How reproducible:
Difficult

Steps to Reproduce:
1. Requires fabric faults and driver handling that cause an SG_IO ioctl() from the daemon to block uninterruptibly for long enough that the next check sees the thread as stuck, attempts to cancel it, fails and then switches to sync mode.

Fault injection is probably the best way to recreate this under controlled conditions.
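
As a rough userspace stand-in for such fault injection, the "cancel has no effect" condition can be simulated without any storage hardware. The sketch below is an assumption-laden illustration, not a reproducer for the real bug: it does not issue SG_IO or put anything into D state. A worker thread disables cancellation and sleeps, which mimics a checker thread that pthread_cancel() cannot stop, and the helper reports whether the worker was still running after a grace period. All names (stuck_checker, cancel_left_thread_running) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

static atomic_int worker_done;

/* Simulated checker: ignores cancellation while "blocked" for block_ms. */
static void *stuck_checker(void *arg)
{
	long block_ms = *(long *)arg;
	struct timespec ts = { block_ms / 1000, (block_ms % 1000) * 1000000L };
	int old;

	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
	nanosleep(&ts, NULL);		/* stands in for the blocked ioctl */
	atomic_store(&worker_done, 1);
	return NULL;
}

/*
 * Start the simulated checker, request cancellation, wait grace_ms and
 * return 1 if the thread was still running at that point, 0 if it had
 * finished, or -1 if the thread could not be created.
 */
static int cancel_left_thread_running(long block_ms, long grace_ms)
{
	pthread_t tid;
	struct timespec grace = { grace_ms / 1000, (grace_ms % 1000) * 1000000L };
	int stuck;

	atomic_store(&worker_done, 0);
	if (pthread_create(&tid, NULL, stuck_checker, &block_ms) != 0)
		return -1;
	pthread_cancel(tid);		/* queued, but never acted upon */
	nanosleep(&grace, NULL);
	stuck = !atomic_load(&worker_done);
	pthread_join(tid, NULL);	/* block_ms stays valid until here */
	return stuck;
}
```

Calling cancel_left_thread_running() with a block time longer than the grace period corresponds to the case where the next check sees the thread as stuck; a real test would still need fabric faults or kernel-level fault injection to exercise the actual SG_IO path.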

Actual results:

multipathd switches to sync TUR and blocks in D state in SG_IO ioctl()

Expected results:
multipathd reports the path as unavailable and continues other normal activity.

Additional info:

Comment 1 Bryn M. Reeves 2013-08-15 15:51:43 UTC
Created attachment 786975
Do not fall back to sync TURs when pthread_cancel fails

Comment 7 Ben Marzinski 2014-02-10 21:28:45 UTC
RHEL-6.4.z packages are available at

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/997570/

Comment 23 Ben Marzinski 2014-02-24 18:25:41 UTC
Patch added. Thanks.

Comment 31 errata-xmlrpc 2014-10-14 07:41:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1555.html

