Bug 716468

Summary: device-mapper good path checks not adhering to scsi block timeout settings
Product: Red Hat Enterprise Linux 5
Reporter: Dave Sullivan <dsulliva>
Component: device-mapper-multipath
Assignee: LVM and device-mapper development team <lvm-team>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: unspecified
Version: 5.6
CC: agk, bmarzins, bmr, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-06-24 20:15:51 UTC

Description Dave Sullivan 2011-06-24 14:49:02 UTC
Description of problem:

From my understanding (I haven't looked at the code yet), the multipath check interval for good paths was modified to use the scsi block timeout whenever that timeout is below the default 20-second good path check.

The typical scsi block timeout is 60 seconds, so by default the multipath check interval for good paths is 20 seconds. This can be validated with:

multipathd -k
multipathd>show paths

2:0:0:16 sdq  65:0    0   [active][ready] XXXXXXX... 15/20
2:0:0:17 sdr  65:16   0   [active][ready] XXXXXXX... 15/20
2:0:0:18 sds  65:32   1   [active][ready] XXXXXXX... 15/20
2:0:0:19 sdt  65:48   0   [active][ready] XXXXXXX... 15/20
2:0:0:20 sdu  65:64   0   [active][ready] XXXXXXX... 15/20

We changed the scsi block timeout to 10 seconds using /etc/rc.local:

for i in /sys/block/sd*/device/timeout; do echo 10 > "$i"; done

# validate that they have changed
cat /sys/block/*/device/timeout
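
The set-and-verify step above can be folded into one function. This is only a sketch: the function name and the SYSFS_BLOCK override are hypothetical, the latter added so the loop can be tried against a scratch directory instead of the real /sys/block.

```shell
#!/bin/sh
# Sketch: write the requested SCSI command timeout to every sd* device,
# read it back, and complain about any device where it did not stick.
# SYSFS_BLOCK is a hypothetical override for testing; leave it unset on
# a real system so that /sys/block is used.
set_scsi_timeouts() {
    want=$1
    root=${SYSFS_BLOCK:-/sys/block}
    rc=0
    for f in "$root"/sd*/device/timeout; do
        [ -e "$f" ] || continue              # glob matched nothing
        echo "$want" > "$f" || { rc=1; continue; }
        got=$(cat "$f")
        if [ "$got" != "$want" ]; then
            echo "mismatch: $f is $got, expected $want" >&2
            rc=1
        fi
    done
    return $rc
}
```

Calling "set_scsi_timeouts 10" from /etc/rc.local would then replace both the loop and the manual cat.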

Then restart multipathd:
/etc/init.d/multipathd restart

multipathd -k
multipathd>show paths
2:0:0:15 sdp  8:240   0   [active][ready] XXX....... 3/10
2:0:0:16 sdq  65:0    0   [active][ready] XXX....... 3/10
2:0:0:17 sdr  65:16   0   [active][ready] XXX....... 3/10
2:0:0:18 sds  65:32   1   [active][ready] XXX....... 3/10
2:0:0:19 sdt  65:48   0   [active][ready] XXX....... 3/10
2:0:0:20 sdu  65:64   0   [active][ready] XXX....... 3/10
2:0:0:21 sdv  65:80   1   [active][ready] XXX....... 3/10
2:0:0:22 sdw  65:96   1   [active][ready] XXX....... 3/10
2:0:0:23 sdx  65:112  1   [active][ready] XXX....... 3/10
2:0:0:24 sdy  65:128  1   [active][ready] XXX....... 3/10
2:0:1:0  sdz  65:144  0   [active][ready] XXX....... 3/10

After the system has been running for a while, the interval flips back to the default of 20 seconds. There must be something in the code that is flipping it back, such as a hardware handler recheck.
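
To catch the moment the value changes, the last column of the "show paths" output can be reduced to one number per device. A minimal sketch, assuming the final field always has the counter/interval form shown above (3/10, 15/20); the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: read "show paths" output on stdin and print "<device> <interval>"
# for every path line, where <interval> is the denominator of the final
# "N/M" column. Lines without such a column (prompts, blanks) are skipped.
path_check_intervals() {
    awk '$NF ~ /^[0-9]+\/[0-9]+$/ { split($NF, c, "/"); print $2, c[2] }'
}
```

If the installed multipathd accepts a one-shot command (assumed here, not verified on this system), running multipathd -k"show paths" | path_check_intervals periodically and logging the result with a timestamp should show when the 10 turns into 20.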

Additional info:

RHEL5u6; backend storage is an EMC CX4-240.

Comment 1 Ben Marzinski 2011-06-24 20:15:51 UTC
This is not correct.  The only thing that determines how often multipathd checks the paths is the multipath.conf polling_interval parameter.  This defaults to 5 seconds, but if a path is active, it will increase to 4 * polling_interval (or 20 seconds).  You are thinking of the checker_timeout.  If this parameter is not set in multipath.conf, multipathd will use the scsi timeout.  This is used as a timeout for scsi commands.  If the device doesn't respond to a scsi command within this time, the checker assumes the device has failed.
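
The two settings described in this comment can be written out as a small sketch. It only restates the rules given above (good-path interval is 4 * polling_interval; the checker timeout falls back to the scsi timeout when checker_timeout is unset) and is not taken from the multipathd source. Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch of the rules as described in the comment, not multipathd code.

# good_path_interval [POLLING_INTERVAL]
# Interval between checks of an active path: 4 * polling_interval,
# with polling_interval defaulting to 5 seconds.
good_path_interval() {
    echo $(( ${1:-5} * 4 ))
}

# effective_checker_timeout SCSI_TIMEOUT [CHECKER_TIMEOUT]
# How long the checker waits for a scsi command: checker_timeout if it
# is set in multipath.conf, otherwise the device's scsi timeout.
effective_checker_timeout() {
    scsi_timeout=$1
    checker_timeout=${2:-}
    if [ -n "$checker_timeout" ]; then
        echo "$checker_timeout"
    else
        echo "$scsi_timeout"
    fi
}
```

With the reporter's settings (scsi timeout 10, no checker_timeout, default polling_interval), good_path_interval prints 20 and effective_checker_timeout 10 prints 10: the two values are independent of each other.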

Comment 2 Dave Sullivan 2011-06-24 20:43:13 UTC
Sorry if I was not clear; I was referring to the active path check, checker_timeout, 4 * polling_interval. We don't set checker_timeout, which means it should use the scsi timeout.

As soon as I restart multipath I see 10 seconds, but after a while it switches back to 20 seconds automatically.

I'll take a look at setting the checker_timeout.