Bug 1104801 - Storage Domain Monitor threads stopped responding
Summary: Storage Domain Monitor threads stopped responding
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: pre-dev-freeze
Target Release: 6.7
Assignee: Ben Marzinski
QA Contact: yanfu,wang
URL:
Whiteboard: storage
Duplicates: 1102829 (view as bug list)
Depends On:
Blocks: 1075802 1159933 1172231
 
Reported: 2014-06-04 17:43 UTC by Gordon Watson
Modified: 2020-05-14 14:55 UTC (History)
CC List: 20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-21 20:30:41 UTC
Target Upstream Version:
Embargoed:



Description Gordon Watson 2014-06-04 17:43:42 UTC
Description of problem:

A customer encountered a condition where their storage domain monitor threads stopped responding.

At the same time, the Data Center went 'Non-Responsive', all VMs on this host (the single host in the DC) went into a 'NotResponding' state, and the single data Storage Domain went 'Inactive'.

A few hours prior to this, a Fibre Channel LUN had been removed from the host, which resulted in the typical multipath and sd errors being reported in the 'messages' file.


Version-Release number of selected component (if applicable):

RHEV 3.1
RHEL 6.5 host with 'vdsm-4.13.2-0.11'


How reproducible:

Not reproducible.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Details in a subsequent update.

Comment 9 Nir Soffer 2014-06-11 13:07:54 UTC
What happens is this:

1. On 2014-05-29 14:01:28,423, vdsm is handling a getDeviceList request in
   Thread-4115067

   vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,422::
       BindingXMLRPC::167::vds::(wrapper) client [1.249.90.201]
   vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::
       task::579::TaskManager.Task::(_updateState)
       Task=`eefb2341-a993-4ee2-b4d6-835e692cc6cd`::moving from state init -> state
       preparing
   vdsm.log.7.xz:Thread-4115067::INFO::2014-05-29 14:01:28,423::
       logUtils::44::dispatcher::(wrapper) Run and protect:
       getDeviceList(storageType=3, options={})

2. Thread-4115067 enters storage.sdc.refreshStorage, taking the lock

   vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::
       misc::809::SamplingMethod::(__call__) Trying to enter sampling method
       (storage.sdc.refreshStorage)
   vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::
       misc::811::SamplingMethod::(__call__) Got in to sampling method

3. Thread-4115067 runs the multipath -r command

   vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:30,521::
       multipath::112::Storage.Misc.excCmd::(rescan) '/usr/bin/sudo -n
       /sbin/multipath -r' (cwd None)

4. multipath never returns. This causes all other threads to block forever when
   trying to enter storage.sdc.refreshStorage, as Gordon showed in comment 1
   (see the sketch after this list).

   Thread-23::DEBUG::2014-05-29 14:03:29,939::
       misc::809:: SamplingMethod::(__call__) Trying to enter sampling method
       (storage.sdc.refreshStorage)
   Thread-21::DEBUG::2014-05-29 14:04:28,308::
       misc::809::SamplingMethod::(__call__) Trying to enter sampling method
       (storage.sdc.refreshStorage)
   Thread-4115080::DEBUG::2014-05-29 14:04:40,958::
       misc::809::SamplingMethod::(__call__) Trying to enter sampling method
       (storage.sdc.refreshStorage)
   Thread-19::DEBUG::2014-05-29 14:05:25,198::
       misc::809::SamplingMethod::(__call__) Trying to enter sampling method
       (storage.sdc.refreshStorage)
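
(For context, a rough sketch of the sampling-method pattern behind these
"Trying to enter"/"Got in to" log lines. This is not vdsm's actual
implementation; the class and function names are illustrative only. It shows
why one stuck caller keeps every later caller waiting:)

    import threading

    class SamplingMethod(object):
        # Illustrative only: callers serialize on a single lock, so a caller
        # that never returns (e.g. one stuck in "multipath -r") keeps every
        # later caller blocked at "Trying to enter sampling method" forever.
        def __init__(self, func):
            self._func = func
            self._lock = threading.Lock()

        def __call__(self, *args, **kwargs):
            # log: "Trying to enter sampling method (storage.sdc.refreshStorage)"
            with self._lock:
                # log: "Got in to sampling method"
                return self._func(*args, **kwargs)

    @SamplingMethod
    def refreshStorage():
        # in vdsm this is where "multipath -r" and other rescan commands run
        pass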

vdsm runs many commands (such as multipath) without a timeout, assuming that
commands will never block forever. We may need to fix this, but even if we did,
I'm not sure that you could continue to work with the storage server correctly
in this situation.
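
(As an aside, a minimal sketch of what adding a timeout could look like,
assuming Python 2.6 as shipped with RHEL 6. This is not vdsm's code; the
helper name and the 60-second value are made up for illustration:)

    import subprocess
    import threading

    def run_with_timeout(cmd, timeout=60):
        # Run cmd and kill it if it has not finished within `timeout` seconds.
        # Killing only helps if the process is killable; a process stuck in an
        # uninterruptible (D state) wait in the kernel will not go away.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)

        def _kill(p):
            try:
                p.kill()
            except OSError:
                pass  # process already exited

        timer = threading.Timer(timeout, _kill, [proc])
        timer.start()
        try:
            out, err = proc.communicate()
        finally:
            timer.cancel()
        if proc.returncode < 0:
            raise RuntimeError("%r did not finish within %s seconds"
                               % (cmd, timeout))
        return proc.returncode, out, err

    # e.g. run_with_timeout(["/usr/bin/sudo", "-n", "/sbin/multipath", "-r"])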

The root cause is that a LUN was removed on the storage server without first
removing it on the host. Unfortunately, removing a LUN from a storage server
while a host is still connected to it is not currently supported: there is no
user interface for removing the device on the host before it is removed on the
storage server.

Ben: can you take a look at this? How should vdsm deal with a multipath
command that blocks forever?

Comment 10 Nir Soffer 2014-06-11 18:52:17 UTC
*** Bug 1102829 has been marked as a duplicate of this bug. ***

Comment 11 Ben Marzinski 2014-06-30 16:07:03 UTC
Is it possible to see where the multipath command is blocked?

Can you attach gdb to the process and get a backtrace?

Otherwise, could you try running it with either the command-line option "-v 3" or with "verbosity 3" set in the defaults section of multipath.conf?

What do you get from

# cat /proc/<pid>/stack
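
(If it happens again, something along these lines could capture both of the
above in one shot. A sketch only, assuming root access and gdb installed; the
function name is made up:)

    import subprocess

    def dump_stuck_process(pid):
        # Kernel-side view: shows whether the process is sitting in an
        # uninterruptible wait (requires root).
        with open("/proc/%d/stack" % pid) as f:
            kernel_stack = f.read()
        # Userspace backtrace via gdb in batch mode.
        gdb = subprocess.Popen(
            ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        user_backtrace = gdb.communicate()[0]
        return kernel_stack, user_backtrace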

Comment 12 Gordon Watson 2014-07-07 19:45:29 UTC
Ben,

If you're asking me (as the reporter of the bug) for this information then it is no longer available. The problem occurred weeks ago and the host is no longer in that state.

Regards, GFW.

Comment 13 Ben Marzinski 2014-07-09 14:21:41 UTC
The issue is that I'm not sure how multipath got into a state where it didn't return from running "-r". It is possible that, if there were devices that had recently failed, it may take a long time for SCSI commands to those devices to time out, and that could cause "multipath -r" to hang for a while (on the order of minutes).

Without some more information, it's really impossible to tell why multipath got stuck.  It may be possible to simply kill multipath with a signal if it takes too long, but it's always possible that it's stuck in some uninterruptible wait in the kernel.

Comment 14 Nir Soffer 2014-07-09 14:32:07 UTC
(In reply to Ben Marzinski from comment #13)
Ben, how do you suggest we deal with a multipath command that blocks forever, possibly stuck in an uninterruptible wait in the kernel?

Comment 15 Ben Marzinski 2014-08-13 16:19:54 UTC
Ideally, we should figure out why multipath is getting stuck and fix it. multipath -r is not a command that should block indefinitely. Do you know if it's interruptible or not? Is it possible to reproduce this? Without fixing multipath, I have no idea what vdsm should do. It could time out, try to kill the process, and then rerun the command. But unless we understand what's wrong with multipath, we can't guarantee that this will actually solve the problem.

Access to a system that can recreate this problem would be a big help in figuring out what's going wrong.

Comment 16 Nir Soffer 2014-09-03 08:31:15 UTC
This seems to be another duplicate of bug 880738.

Comment 17 Nir Soffer 2014-09-03 08:35:39 UTC
This issue is caused by multipath -r blocking forever. We don't have any info on why this happened, and we don't have access to a machine that reproduces it.

From the vdsm side, we can do nothing about this. From the multipath side, it seems there is insufficient data to handle this.

I suggest closing this, unless Ben wants to continue with it.

Comment 18 Ben Marzinski 2014-09-03 17:30:26 UTC
I'm fine with leaving this open to see if we can eventually figure out what's going on here.  But without more information or ideally a reproducer, I'm not sure what happened.

Comment 19 Ben Marzinski 2014-09-03 17:32:10 UTC
I'm moving this over to multipath, since that appears to be where the root problem is.

Comment 24 Ben Marzinski 2015-09-21 20:30:41 UTC
Closing, since we never got enough information to figure out what was going on here.

