Created attachment 513639 [details]
strace dump

Description of problem:
The host is connected to 500 LUNs via FC. Sometimes /sbin/multipath takes 10 minutes or more to return. Running strace, it appears to hang at the following calls:

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fae7147fa70) = 9438
--- SIGCHLD (Child exited) @ 0 (0) ---
close(29) = 0
read(28, "36006016066102900cc07b2daf2ade01"..., 127) = 34
read(28, "", 93) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 9438
close(28) = 0
ioctl(27, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[6]=[12, 01, c0, 00, 80, 00], mx_sb_len=128, iovec_count=0, dxfer_len=256, timeout=60000, flags=0

The full strace of /sbin/multipath is attached.

multipath.conf:

# RHEV REVISION 0.6
defaults {
    polling_interval 5
    getuid_callout "/sbin/scsi_id -g -u -d /dev/%n"
    no_path_retry fail
    user_friendly_names no
    flush_on_last_del yes
    fast_io_fail_tmo 5
    dev_loss_tmo 30
    max_fds 4096
}

device-mapper-multipath-0.4.9-41.el6.x86_64
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x7fae7147fa70) = 9438
> --- SIGCHLD (Child exited) @ 0 (0) ---
> close(29) = 0
> read(28, "36006016066102900cc07b2daf2ade01"..., 127) = 34
> read(28, "", 93) = 0
> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 9438
> close(28) = 0

This is the getuid_callout. You can run it manually to verify how long it takes. It should return almost instantly.

> ioctl(27, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[6]=[12, 01, c0, 00, 80, 00],
> mx_sb_len=128, iovec_count=0, dxfer_len=256, timeout=60000, flags=0

This is the prio checker function. Again, it should return quickly, but the ioctl itself has a timeout of 1 minute which, looking at the return value, isn't getting hit.

Do you know which one of these is the slow one?
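One way to check the getuid_callout in isolation is to time it once per path device and sort by duration. The following is only a minimal sketch, assuming Python 3.5 or later is available on the host. The helper names (`time_command`, `time_callout`) and the device names in the usage note are hypothetical; only the scsi_id command line comes from the multipath.conf above, with `%n` replaced by a `{name}` placeholder.

```python
import subprocess
import time

# Command line taken from getuid_callout in multipath.conf; "{name}" stands in
# for multipath's %n (the kernel device name, e.g. sdb).
SCSI_ID_TEMPLATE = ["/sbin/scsi_id", "-g", "-u", "-d", "/dev/{name}"]


def time_command(argv, timeout=120):
    """Run argv once and return (elapsed_seconds, returncode, output_text)."""
    start = time.monotonic()
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    elapsed = time.monotonic() - start
    return elapsed, proc.returncode, proc.stdout.strip()


def time_callout(devices, template=SCSI_ID_TEMPLATE):
    """Time the callout for each device name; slowest results come first."""
    results = []
    for name in devices:
        argv = [part.replace("{name}", name) for part in template]
        elapsed, rc, output = time_command(argv)
        results.append((name, elapsed, rc, output))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

For example, `for name, secs, rc, wwid in time_callout(["sdb", "sdc"]): print(name, round(secs, 3), rc, wwid)` would list the slowest devices first (the device names here are placeholders). If every call returns almost instantly, the callout is probably not the slow step.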
Ping Haim - if this is a blocker, we need more information here.