Bug 1104801
| Summary: | Storage Domain Monitor threads stopped responding | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Gordon Watson <gwatson> |
| Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | yanfu,wang <yanwang> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.5 | CC: | agk, amureini, bazulay, bmarzins, dswegen, dwysocha, ebenahar, gwatson, heinzm, iheim, lpeer, msnitzer, nsoffer, prajnoha, prockai, rbalakri, scohen, yanwang, yeylon, zkabelac |
| Target Milestone: | pre-dev-freeze | | |
| Target Release: | 6.7 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-09-21 20:30:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1075802, 1159933, 1172231 | | |
Description
Gordon Watson
2014-06-04 17:43:42 UTC
What happens is this:

1. On 2014-05-29 14:01:28,423, vdsm is handling a getDeviceList request in Thread-4115067:

```
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,422::BindingXMLRPC::167::vds::(wrapper) client [1.249.90.201]
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::task::579::TaskManager.Task::(_updateState) Task=`eefb2341-a993-4ee2-b4d6-835e692cc6cd`::moving from state init -> state preparing
vdsm.log.7.xz:Thread-4115067::INFO::2014-05-29 14:01:28,423::logUtils::44::dispatcher::(wrapper) Run and protect: getDeviceList(storageType=3, options={})
```

2. Thread-4115067 enters storage.sdc.refreshStorage, taking its lock:

```
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::misc::811::SamplingMethod::(__call__) Got in to sampling method
```

3. Thread-4115067 runs the multipath -r command:

```
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:30,521::multipath::112::Storage.Misc.excCmd::(rescan) '/usr/bin/sudo -n /sbin/multipath -r' (cwd None)
```

4. multipath never returns. This causes all other threads to block forever when they try to enter storage.sdc.refreshStorage, as Gordon showed in comment 1 (the sketch after this comment shows why one hung sampler blocks them all):

```
Thread-23::DEBUG::2014-05-29 14:03:29,939::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-21::DEBUG::2014-05-29 14:04:28,308::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-4115080::DEBUG::2014-05-29 14:04:40,958::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-19::DEBUG::2014-05-29 14:05:25,198::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
```

vdsm runs many commands (such as multipath) without a timeout, assuming that commands never block forever. We may need to fix this, but even if we did, I'm not sure you could continue to work with the storage server correctly in this situation.

The root cause is removing a LUN on the storage server without first removing the LUN on the host. Unfortunately, removing a LUN on a storage server while a host is connected to it is not currently supported: there is no user interface for removing the device on the host before it is removed on the storage server.

Ben: can you take a look at this? How should vdsm deal with a multipath command that blocks forever?
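To make the deadlock concrete, here is a minimal sketch of the locking behavior behind the SamplingMethod log lines above. This is an illustration under assumptions, not vdsm's actual code; the real wrapper has more logic (for example, letting threads that waited reuse a just-computed result), but the failure mode is the same.

```python
import threading

class SamplingMethod(object):
    # Illustrative stand-in for the SamplingMethod wrapper seen in the
    # logs -- NOT vdsm's actual implementation, only the locking
    # behavior that matters for this bug.
    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # "Trying to enter sampling method": callers block here until
        # the current sampler releases the lock.
        with self._lock:
            # "Got in to sampling method": if func() never returns
            # (e.g. a hung "multipath -r"), the lock is never released
            # and every later caller blocks on the acquire above forever.
            return self._func(*args, **kwargs)
```

With storage.sdc.refreshStorage wrapped this way, Thread-4115067 holding the lock while multipath -r hangs is exactly what leaves Thread-19, Thread-21, and Thread-23 stuck at "Trying to enter sampling method".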
---

*** Bug 1102829 has been marked as a duplicate of this bug. ***

---

Is it possible to see where the multipath command is blocked? Can you attach gdb to the process and get a backtrace? Otherwise, could you try running it with either the command-line option "-v 3" or with "verbosity 3" set in the defaults section of multipath.conf? What do you get from:

```
# cat /proc/<pid>/stack
```

---

Ben,

If you're asking me (as the reporter of the bug) for this information, then it is no longer available. The problem occurred weeks ago and the host is no longer in that state.

Regards, GFW.

---

The issue is that I'm not sure how multipath got into a state where it didn't return from running "-r". It is possible that if there were devices that had recently failed, it could take a long time for SCSI commands to those devices to time out, and that could cause "multipath -r" to hang for a while (on the order of minutes). Without some more information, it's really impossible to tell why multipath got stuck.

It may be possible to simply kill multipath with a signal if it takes too long, but it's always possible that it's stuck in some uninterruptible wait in the kernel.

---

(In reply to Ben Marzinski from comment #13)

Ben, how do you suggest we deal with a multipath command that blocks forever, possibly stuck in an uninterruptible wait in the kernel?

---

Ideally, we should figure out why multipath is getting stuck and fix it; multipath -r is not a command that should block indefinitely. Do you know if it's interruptible or not? Is it possible to reproduce this? Without fixing multipath, I have no idea what vdsm should do. It could time out, try to kill the process, and then rerun the command (a sketch of this approach appears at the end of this report), but unless we understand what's wrong with multipath, we can't guarantee that this will actually solve the problem. Access to a system that can recreate this problem would be a big help in figuring out what's going wrong.

---

This seems to be another duplicate of bug 880738.

---

This issue is caused by multipath -r blocking forever; we have no information about why this happened, and no access to a machine that reproduces it. On the vdsm side, we can do nothing about this. On the multipath side, it seems there is insufficient data to handle it. I suggest closing this, unless Ben wants to continue with it.

---

I'm fine with leaving this open to see if we can eventually figure out what's going on here. But without more information, or ideally a reproducer, I'm not sure what happened.

---

I'm moving this over to multipath, since that appears to be where the root problem is.

---

Closing, since we never got enough information to figure out what was going on here.
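---

For reference, the timeout/kill/rerun idea discussed above might look roughly like the sketch below. This is not vdsm's code: the run_with_timeout name and the 180-second value are illustrative, and it uses the Python 3 subprocess timeout API (vdsm on RHEL 6 ran on Python 2 and would need its own timer). As noted in the thread, a SIGKILL is not acted on by a process stuck in an uninterruptible (D-state) wait until that wait ends, so even this cannot guarantee recovery.

```python
import subprocess

def run_with_timeout(cmd, timeout=180):
    # Hypothetical helper, not vdsm code: run an external command and,
    # if it exceeds the timeout, kill it and let the caller decide
    # whether to rerun it.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort; a D-state process dies only when its kernel wait ends
        raise
    return proc.returncode, out, err

# Illustrative call site, mirroring the command from the logs:
# rc, out, err = run_with_timeout(
#     ["/usr/bin/sudo", "-n", "/sbin/multipath", "-r"])
```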