Bug 722955 - [device-mapper-multipath] [scale] multipath scan takes 10 minutes to return at certain occasions
Summary: [device-mapper-multipath] [scale] multipath scan takes 10 minutes to return a...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.1
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: rc
: ---
Assignee: Ben Marzinski
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks: 682015
TreeView+ depends on / blocked
 
Reported: 2011-07-18 15:31 UTC by Haim
Modified: 2014-01-13 00:49 UTC (History)
16 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-09-22 12:13:42 UTC
Target Upstream Version:


Attachments (Terms of Use)
strace dump (1.09 MB, application/x-gzip)
2011-07-18 15:31 UTC, Haim
no flags Details

Description Haim 2011-07-18 15:31:48 UTC
Created attachment 513639 [details]
strace dump

Description of problem:

host connected to 500 Luns via FC, sometimes, we see that it takes 10 minutes and more for command /sbin/multipath to return. 
running strace, it appears to be hanging on the following command: 

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fae7147fa70) = 9438
--- SIGCHLD (Child exited) @ 0 (0) ---
close(29)                               = 0
read(28, "36006016066102900cc07b2daf2ade01"..., 127) = 34
read(28, "", 93)                        = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 9438
close(28)                               = 0
ioctl(27, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[6]=[12, 01, c0, 00, 80, 00], mx_sb_len=128, iovec_count=0, dxfer_len=256, timeout=60000, flags=0

attached full strace of /sbin/multipath

multiapth.conf:



# RHEV REVISION 0.6

defaults {
    polling_interval        5
    getuid_callout          "/sbin/scsi_id -g -u -d /dev/%n"
    no_path_retry           fail
    user_friendly_names     no
    flush_on_last_del       yes
    fast_io_fail_tmo        5
    dev_loss_tmo            30
    max_fds		    4096
}

device-mapper-multipath-0.4.9-41.el6.x86_64

Comment 3 Ben Marzinski 2011-08-04 23:18:57 UTC
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x7fae7147fa70) = 9438
> --- SIGCHLD (Child exited) @ 0 (0) ---
> close(29)                               = 0
> read(28, "36006016066102900cc07b2daf2ade01"..., 127) = 34
> read(28, "", 93)                        = 0
> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 9438
> close(28)                               = 0

This is the getuid_callout.  You can manually run this to verify how long it
takes.  It should return almost instantly.

> ioctl(27, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[6]=[12, 01, c0, 00, 80, 00],
> mx_sb_len=128, iovec_count=0, dxfer_len=256, timeout=60000, flags=0

This is the prio checker function. Again, it should return quickly, but the ioctl itself has a timeout of 1 minute which, looking at the return value, isn't getting hit.

Do you know which one of these is the slow one?

Comment 4 Denise Dumas 2011-09-20 17:16:06 UTC
Ping Haim - if this is a blocker, we need more information here.


Note You need to log in before you can comment on or make changes to this bug.