Bug 1104801
| Summary: | Storage Domain Monitor threads stopped responding | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Gordon Watson <gwatson> |
| Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | yanfu,wang <yanwang> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.5 | CC: | agk, amureini, bazulay, bmarzins, dswegen, dwysocha, ebenahar, gwatson, heinzm, iheim, lpeer, msnitzer, nsoffer, prajnoha, prockai, rbalakri, scohen, yanwang, yeylon, zkabelac |
| Target Milestone: | pre-dev-freeze | | |
| Target Release: | 6.7 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-09-21 20:30:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1075802, 1159933, 1172231 | | |
Description
Gordon Watson
2014-06-04 17:43:42 UTC
What happens is this:

1. On 2014-05-29 14:01:28,423, vdsm is handling a getDeviceList request in Thread-4115067:

```
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,422::BindingXMLRPC::167::vds::(wrapper) client [1.249.90.201]
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::task::579::TaskManager.Task::(_updateState) Task=`eefb2341-a993-4ee2-b4d6-835e692cc6cd`::moving from state init -> state preparing
vdsm.log.7.xz:Thread-4115067::INFO::2014-05-29 14:01:28,423::logUtils::44::dispatcher::(wrapper) Run and protect: getDeviceList(storageType=3, options={})
```

2. Thread-4115067 enters storage.sdc.refreshStorage, taking its lock:

```
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:28,423::misc::811::SamplingMethod::(__call__) Got in to sampling method
```

3. Thread-4115067 runs the multipath -r command:

```
vdsm.log.7.xz:Thread-4115067::DEBUG::2014-05-29 14:01:30,521::multipath::112::Storage.Misc.excCmd::(rescan) '/usr/bin/sudo -n /sbin/multipath -r' (cwd None)
```

4. multipath never returns. This causes all other threads to block forever when they try to enter storage.sdc.refreshStorage, as Gordon showed in comment 1 (the sketch after this comment shows why one hung sampler blocks them all):

```
Thread-23::DEBUG::2014-05-29 14:03:29,939::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-21::DEBUG::2014-05-29 14:04:28,308::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-4115080::DEBUG::2014-05-29 14:04:40,958::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-19::DEBUG::2014-05-29 14:05:25,198::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
```

vdsm runs many commands (such as multipath) without a timeout, assuming that commands never block forever. We may need to fix this, but even if we did, I'm not sure you could continue to work with the storage server correctly in this situation.

The root cause is removing a LUN on the storage server without first removing the LUN on the host. Unfortunately, removing a LUN on a storage server while a host is connected to it is not currently supported: there is no user interface for removing the device on the host before it is removed on the storage server.

Ben: can you take a look at this? How should vdsm deal with a multipath command that blocks forever?
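To make the deadlock concrete, here is a minimal sketch of the locking behavior behind the SamplingMethod log lines above. This is an illustration under assumptions, not vdsm's actual code; the real wrapper has more logic (for example, letting threads that waited reuse a just-computed result), but the failure mode is the same.

```python
import threading

class SamplingMethod(object):
    # Illustrative stand-in for the SamplingMethod wrapper seen in the
    # logs -- NOT vdsm's actual implementation, only the locking
    # behavior that matters for this bug.
    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # "Trying to enter sampling method": callers block here until
        # the current sampler releases the lock.
        with self._lock:
            # "Got in to sampling method": if func() never returns
            # (e.g. a hung "multipath -r"), the lock is never released
            # and every later caller blocks on the acquire above forever.
            return self._func(*args, **kwargs)
```

With storage.sdc.refreshStorage wrapped this way, Thread-4115067 holding the lock while multipath -r hangs is exactly what leaves Thread-19, Thread-21, and Thread-23 stuck at "Trying to enter sampling method".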
---

*** Bug 1102829 has been marked as a duplicate of this bug. ***

---

Is it possible to see where the multipath command is blocked? Can you attach gdb to the process and get a backtrace? Otherwise, could you try running it with either the command-line option "-v 3" or with "verbosity 3" set in the defaults section of multipath.conf? What do you get from:

```
# cat /proc/<pid>/stack
```

---

Ben,

If you're asking me (as the reporter of the bug) for this information, then it is no longer available. The problem occurred weeks ago and the host is no longer in that state.

Regards, GFW.

---

The issue is that I'm not sure how multipath got into a state where it didn't return from running "-r". It is possible that if there were devices that had recently failed, it could take a long time for SCSI commands to those devices to time out, and that could cause "multipath -r" to hang for a while (on the order of minutes). Without some more information, it's really impossible to tell why multipath got stuck.

It may be possible to simply kill multipath with a signal if it takes too long, but it's always possible that it's stuck in some uninterruptible wait in the kernel.

---

(In reply to Ben Marzinski from comment #13)

Ben, how do you suggest we deal with a multipath command that blocks forever, possibly stuck in an uninterruptible wait in the kernel?

---

Ideally, we should figure out why multipath is getting stuck and fix it; multipath -r is not a command that should block indefinitely. Do you know if it's interruptible or not? Is it possible to reproduce this? Without fixing multipath, I have no idea what vdsm should do. It could time out, try to kill the process, and then rerun the command (a sketch of this approach appears at the end of this report), but unless we understand what's wrong with multipath, we can't guarantee that this will actually solve the problem. Access to a system that can recreate this problem would be a big help in figuring out what's going wrong.

---

This seems to be another duplicate of bug 880738.

---

This issue is caused by multipath -r blocking forever; we have no information about why this happened, and no access to a machine that reproduces it. On the vdsm side, we can do nothing about this. On the multipath side, it seems there is insufficient data to handle it. I suggest closing this, unless Ben wants to continue with it.

---

I'm fine with leaving this open to see if we can eventually figure out what's going on here. But without more information, or ideally a reproducer, I'm not sure what happened.

---

I'm moving this over to multipath, since that appears to be where the root problem is.

---

Closing, since we never got enough information to figure out what was going on here.
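---

For reference, the timeout/kill/rerun idea discussed above might look roughly like the sketch below. This is not vdsm's code: the run_with_timeout name and the 180-second value are illustrative, and it uses the Python 3 subprocess timeout API (vdsm on RHEL 6 ran on Python 2 and would need its own timer). As noted in the thread, a SIGKILL is not acted on by a process stuck in an uninterruptible (D-state) wait until that wait ends, so even this cannot guarantee recovery.

```python
import subprocess

def run_with_timeout(cmd, timeout=180):
    # Hypothetical helper, not vdsm code: run an external command and,
    # if it exceeds the timeout, kill it and let the caller decide
    # whether to rerun it.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort; a D-state process dies only when its kernel wait ends
        raise
    return proc.returncode, out, err

# Illustrative call site, mirroring the command from the logs:
# rc, out, err = run_with_timeout(
#     ["/usr/bin/sudo", "-n", "/sbin/multipath", "-r"])
```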