Bug 997570 - multipath may hang if tur checker thread becomes unresponsive
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.4
Hardware: Unspecified OS: Unspecified
Priority: urgent Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Ben Marzinski
QA Contact: yanfu,wang
Keywords: ZStream
Depends On:
Blocks: 1069434 1069435 1109995
Reported: 2013-08-15 11:50 EDT by Bryn M. Reeves
Modified: 2015-03-25 06:53 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If the TUR checker thread becomes unresponsive, multipathd was starting a synchronous TUR check, which could also block. Consequence: Multipathd could hang when a device that uses the TUR checker failed. Fix: Multipathd no longer starts a synchronous TUR check when the TUR checker thread becomes unresponsive. It simply marks the path as failed. Result: Multipathd will no longer hang when a device using the TUR checker fails.
Story Points: ---
Clone Of:
Clones: 1109995
Environment:
Last Closed: 2014-10-14 03:41:25 EDT
Type: Bug


Attachments
Do not fall back to sync TURs when pthread_cancel fails (837 bytes, patch)
2013-08-15 11:51 EDT, Bryn M. Reeves

Description Bryn M. Reeves 2013-08-15 11:50:47 EDT
Description of problem:
The thread handling in multipath's TUR checker currently falls back to synchronous mode if the checker thread stops responding (i.e. a pthread_cancel() fails):

                if (ct->thread) {
                        /* pthread cancel failed. continue in sync mode */
                        pthread_mutex_unlock(&ct->lock);
                        condlog(3, "%d:%d: tur thread not responding, "
                                "using sync mode", TUR_DEVT(ct));
                        return tur_check(c->fd, c->timeout, c->message,
                                         ct->wwid);
                }

This turns out to be the wrong thing to do: if the async thread has gone off into the weeds (generally it is blocked in the kernel while SCSI error handling (EH) work takes place), the last thing we want is for multipathd itself to block in the same place.

Instead, just report the path as down.
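
For illustration only, here is a minimal, self-contained sketch of the proposed behaviour. It is not the attached patch and does not use the multipath-tools types: struct sketch_ctx, the thread_stuck flag, sketch_blocking_check() and the SKETCH_PATH_* values are all hypothetical stand-ins. The point it tries to show is that when the earlier checker thread is still outstanding, the check returns "down" without issuing any further call that could block.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the multipath-tools definitions. */
enum sketch_state { SKETCH_PATH_DOWN, SKETCH_PATH_UP };

struct sketch_ctx {
	pthread_mutex_t lock;
	bool thread_stuck;   /* models ct->thread still being set after a cancel */
	int blocking_calls;  /* number of potentially blocking checks issued */
};

static void sketch_ctx_init(struct sketch_ctx *ct, bool stuck)
{
	assert(ct != NULL);
	pthread_mutex_init(&ct->lock, NULL);
	ct->thread_stuck = stuck;
	ct->blocking_calls = 0;
}

/* Stand-in for a check that issues I/O and may therefore block. */
static enum sketch_state sketch_blocking_check(struct sketch_ctx *ct)
{
	ct->blocking_calls++;
	return SKETCH_PATH_UP;
}

static enum sketch_state sketch_check(struct sketch_ctx *ct)
{
	bool stuck;

	assert(ct != NULL);
	pthread_mutex_lock(&ct->lock);
	stuck = ct->thread_stuck;
	pthread_mutex_unlock(&ct->lock);

	if (stuck) {
		/* The earlier checker thread never finished: issue no
		 * further I/O from this context, just fail the path. */
		return SKETCH_PATH_DOWN;
	}
	return sketch_blocking_check(ct);
}
```

With a context initialised as stuck, sketch_check() returns SKETCH_PATH_DOWN and blocking_calls stays at 0; with a healthy context, the stand-in check runs once.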

Version-Release number of selected component (if applicable):
multipath-tools-0.4.9-*.el6

How reproducible:
Difficult

Steps to Reproduce:
1. Requires fabric faults and driver handling that cause an SG_IO ioctl() from the daemon to block uninterruptibly for long enough that the next check sees the thread as stuck, attempts to cancel it, fails and then switches to sync mode.

Fault injection is probably the best way to recreate this under controlled conditions.

Actual results:

multipathd switches to sync TUR and blocks in D state in SG_IO ioctl()

Expected results:
multipathd reports the path as unavailable and continues other normal activity.

Additional info:
Comment 1 Bryn M. Reeves 2013-08-15 11:51:43 EDT
Created attachment 786975
Do not fall back to sync TURs when pthread_cancel fails
Comment 7 Ben Marzinski 2014-02-10 16:28:45 EST
RHEL-6.4.z packages are available at

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/997570/
Comment 23 Ben Marzinski 2014-02-24 13:25:41 EST
Patch added. Thanks.
Comment 31 errata-xmlrpc 2014-10-14 03:41:25 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1555.html
