997570 – multipath may hang if tur checker thread becomes unresponsive


Bug 997570 - multipath may hang if tur checker thread becomes unresponsive


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	device-mapper-multipath
Sub Component:
Version:	6.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Ben Marzinski
QA Contact:	yanfu,wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1069434 1069435 1109995

Reported:	2013-08-15 15:50 UTC by Bryn M. Reeves
Modified:	2019-07-11 07:44 UTC
CC List:	20 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When the TUR checker thread became unresponsive, multipathd started a synchronous TUR check, which could also block. Consequence: Multipathd could hang when a device that uses the TUR checker failed. Fix: Multipathd no longer starts a synchronous TUR check when the TUR checker thread becomes unresponsive. It simply marks the path as failed. Result: Multipathd no longer hangs when a device using the TUR checker fails.
Clone Of:
Clones:	1109995
Environment:
Last Closed:	2014-10-14 07:41:25 UTC
Target Upstream Version:
Embargoed:

Attachments
Do not fall back to sync TURs when pthread_cancel fails (837 bytes, patch), 2013-08-15 15:51 UTC, Bryn M. Reeves

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2014:1555	0	normal	SHIPPED_LIVE	device-mapper-multipath bug fix and enhancement update	2014-10-14 01:27:56 UTC

Description Bryn M. Reeves 2013-08-15 15:50:47 UTC

Description of problem:
The TUR checker's thread handling in multipath currently falls back to synchronous mode if the checker thread stops responding (i.e. a pthread_cancel() fails):

        if (ct->thread) {
                /* pthread cancel failed. continue in sync mode */
                pthread_mutex_unlock(&ct->lock);
                condlog(3, "%d:%d: tur thread not responding, "
                        "using sync mode", TUR_DEVT(ct));
                return tur_check(c->fd, c->timeout, c->message,
                                 ct->wwid);
        }

This turns out to be the wrong thing to do: if the async thread has gone off into the weeds (generally it's blocked in the kernel while SCSI EH work takes place), the last thing we want is for multipathd itself to block in the same place.

Instead just report the path as down.
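
The intended behaviour can be sketched as a small stand-alone model. This is not the attached patch and not the multipath-tools source: every name here (sketch_state, sketch_tur_ctx, sketch_blocking_tur, sketch_tur_result) is hypothetical, and the blocking TUR is replaced by a stub that only counts calls. It is meant only to show the decision, i.e. that a stuck thread leads to a "down" result without any blocking call being made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the checker's path states. */
enum sketch_state { SKETCH_PATH_DOWN, SKETCH_PATH_UP };

/* Hypothetical, trimmed-down model of the TUR checker context. */
struct sketch_tur_ctx {
    bool thread_stuck;   /* models ct->thread still set after the cancel attempt */
    int major;
    int minor;
    int blocking_checks; /* how many times the blocking check was run */
};

/* Stub for a TUR that could block in the kernel; here it only counts calls. */
static enum sketch_state sketch_blocking_tur(struct sketch_tur_ctx *ct)
{
    ct->blocking_checks++;
    return SKETCH_PATH_UP;
}

/* Decide the check result once the cancel attempt has been made. */
enum sketch_state sketch_tur_result(struct sketch_tur_ctx *ct)
{
    assert(ct != NULL);

    if (ct->thread_stuck) {
        /* Old thread never went away: do not issue another TUR that
         * could block the caller too. Report the path as down. */
        fprintf(stderr, "%d:%d: tur thread not responding, path down\n",
                ct->major, ct->minor);
        return SKETCH_PATH_DOWN;
    }

    /* No stuck thread. The real checker would start a new checker
     * thread here; this model just runs the stub directly. */
    return sketch_blocking_tur(ct);
}
```

Calling sketch_tur_result() on a context with thread_stuck set returns SKETCH_PATH_DOWN and leaves blocking_checks at zero, which is the property the fix is after: the daemon never enters the call that could put it in D state.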

Version-Release number of selected component (if applicable):
multipath-tools-0.4.9-*.el6

How reproducible:
Difficult

Steps to Reproduce:
1. Requires fabric faults and driver handling that cause an SG_IO ioctl() from the daemon to block uninterruptibly for long enough that the next check sees the thread as stuck, attempts to cancel it, fails, and then switches to sync mode.

Fault injection is probably the best way to recreate this under controlled conditions.
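
For reasoning about the sequence without a faulty fabric, the "cancel fails" step can be modelled in userspace. The sketch below is an assumption-laden model, not a reproducer: it does not put anything into D state and does not touch SG_IO. A worker thread disables cancellation and sleeps to imitate a call that cannot be interrupted, and the caller requests cancellation and then waits a bounded time to see whether the worker went away. All names (stuck_worker, stuck_worker_main, worker_finished_within) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical model of a checker thread that ignores cancellation. */
struct stuck_worker {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool blocked;          /* worker has entered its "uninterruptible" phase */
    bool done;             /* worker has finished */
    unsigned int block_ms; /* how long the simulated call blocks */
};

static void *stuck_worker_main(void *arg)
{
    struct stuck_worker *w = arg;
    int old;

    /* Imitate an uninterruptible call: cancel requests are not acted on. */
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);

    pthread_mutex_lock(&w->lock);
    w->blocked = true;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);

    struct timespec ts = { w->block_ms / 1000,
                           (long)(w->block_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);

    pthread_mutex_lock(&w->lock);
    w->done = true;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Returns true if the worker finished within wait_ms of the cancel
 * request, false if it was still "stuck" when the wait timed out. */
bool worker_finished_within(unsigned int block_ms, unsigned int wait_ms)
{
    struct stuck_worker w = { .blocked = false, .done = false,
                              .block_ms = block_ms };
    struct timespec deadline;
    pthread_t tid;
    int rc = 0;

    pthread_mutex_init(&w.lock, NULL);
    pthread_cond_init(&w.cond, NULL);
    if (pthread_create(&tid, NULL, stuck_worker_main, &w) != 0)
        return false;

    /* Wait until the worker is inside its blocking phase, then cancel. */
    pthread_mutex_lock(&w.lock);
    while (!w.blocked)
        pthread_cond_wait(&w.cond, &w.lock);
    pthread_mutex_unlock(&w.lock);
    pthread_cancel(tid); /* request is left pending: cancellation is disabled */

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += wait_ms / 1000;
    deadline.tv_nsec += (long)(wait_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&w.lock);
    while (!w.done && rc == 0)
        rc = pthread_cond_timedwait(&w.cond, &w.lock, &deadline);
    bool finished = w.done;
    pthread_mutex_unlock(&w.lock);

    /* Unlike a thread stuck in the kernel, the model's worker does return. */
    pthread_join(tid, NULL);
    pthread_cond_destroy(&w.cond);
    pthread_mutex_destroy(&w.lock);
    return finished;
}
```

A call such as worker_finished_within(600, 100) models the failing case: the cancel request has no effect and the bounded wait expires while the worker is still blocked. At that point the caller should report the path as down rather than issue its own blocking check.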

Actual results:

multipathd switches to sync TUR and blocks in D state in SG_IO ioctl()

Expected results:
multipathd reports the path as unavailable and continues other normal activity.

Additional info:

Comment 1 Bryn M. Reeves 2013-08-15 15:51:43 UTC

Created attachment 786975
Do not fall back to sync TURs when pthread_cancel fails

Comment 7 Ben Marzinski 2014-02-10 21:28:45 UTC

RHEL-6.4.z packages are available at

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/997570/

Comment 23 Ben Marzinski 2014-02-24 18:25:41 UTC

Patch added. Thanks.

Comment 31 errata-xmlrpc 2014-10-14 07:41:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1555.html

