Bug 722955

Summary: [device-mapper-multipath] [scale] multipath scan takes 10 minutes to return at certain occasions
Product: Red Hat Enterprise Linux 6 Reporter: Haim <hateya>
Component: device-mapper-multipathAssignee: Ben Marzinski <bmarzins>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Storage QE <storage-qe>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1CC: agk, bmarzins, coughlan, cpelland, czhang, ddumas, dwysocha, heinzm, mbroz, mgoldboi, prajnoha, prockai, smizrahi, yeylon, ykaul, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-09-22 12:13:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 682015    
Attachments:
Description Flags
strace dump none

Description Haim 2011-07-18 15:31:48 UTC
Created attachment 513639 [details]
strace dump

Description of problem:

host connected to 500 Luns via FC, sometimes, we see that it takes 10 minutes and more for command /sbin/multipath to return. 
running strace, it appears to be hanging on the following command: 

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fae7147fa70) = 9438
--- SIGCHLD (Child exited) @ 0 (0) ---
close(29)                               = 0
read(28, "36006016066102900cc07b2daf2ade01"..., 127) = 34
read(28, "", 93)                        = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 9438
close(28)                               = 0
ioctl(27, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[6]=[12, 01, c0, 00, 80, 00], mx_sb_len=128, iovec_count=0, dxfer_len=256, timeout=60000, flags=0

attached full strace of /sbin/multipath

multiapth.conf:



# RHEV REVISION 0.6

defaults {
    polling_interval        5
    getuid_callout          "/sbin/scsi_id -g -u -d /dev/%n"
    no_path_retry           fail
    user_friendly_names     no
    flush_on_last_del       yes
    fast_io_fail_tmo        5
    dev_loss_tmo            30
    max_fds		    4096
}

device-mapper-multipath-0.4.9-41.el6.x86_64

Comment 3 Ben Marzinski 2011-08-04 23:18:57 UTC
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x7fae7147fa70) = 9438
> --- SIGCHLD (Child exited) @ 0 (0) ---
> close(29)                               = 0
> read(28, "36006016066102900cc07b2daf2ade01"..., 127) = 34
> read(28, "", 93)                        = 0
> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 9438
> close(28)                               = 0

This is the getuid_callout.  You can manually run this to verify how long it
takes.  It should return almost instantly.

> ioctl(27, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[6]=[12, 01, c0, 00, 80, 00],
> mx_sb_len=128, iovec_count=0, dxfer_len=256, timeout=60000, flags=0

This is the prio checker function. Again, it should return quickly, but the ioctl itself has a timeout of 1 minute which, looking at the return value, isn't getting hit.

Do you know which one of these is the slow one?

Comment 4 Denise Dumas 2011-09-20 17:16:06 UTC
Ping Haim - if this is a blocker, we need more information here.