Description of problem:
The directio timeout is hardcoded to be 30 seconds, which can cause false failures on storage systems with longer timeouts. As it says in checker.h:

 * Overloaded storage response time can be very long.
 * SG_IO timouts after DEF_TIMEOUT milliseconds, and checkers interprets this
 * as a path failure. multipathd then proactively evicts the path from the DM
 * multipath table in this case.
 *
 * This generaly snow balls and ends up in full eviction and IO errors for end
 * users. Bad. This may also cause SCSI bus resets, causing disruption for all
 * local and external storage hardware users.
 *
 * Provision a long timeout. Longer than any real-world application would cope
 * with.

I propose redefining DIRECTIO_TIMEOUT to be DEF_TIMEOUT.

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.7-17
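For illustration, the proposed change could look roughly like the sketch below. The numeric value given for DEF_TIMEOUT and the helper directio_timed_out() are assumptions made up for this example, not copied from the multipath-tools source. The only thing taken from checker.h is that DEF_TIMEOUT is expressed in milliseconds. If the existing DIRECTIO_TIMEOUT uses a different unit, a conversion would be needed as well.

```c
/* Assumed value for illustration only; checker.h documents the unit as
 * milliseconds, but this number is not taken from the real header. */
#define DEF_TIMEOUT 300000

/* Proposed change: follow DEF_TIMEOUT instead of a hardcoded 30 seconds. */
#define DIRECTIO_TIMEOUT DEF_TIMEOUT

/* Hypothetical helper: returns 1 if an outstanding direct I/O has been
 * pending for longer than DIRECTIO_TIMEOUT, 0 otherwise. */
static int directio_timed_out(long elapsed_ms)
{
	return elapsed_ms > DIRECTIO_TIMEOUT;
}
```

With these assumed values, an I/O that has been pending for a little over 30 seconds is still treated as in flight rather than as a path failure.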
Now that the synchronous checkers have a configurable timeout, directio could use the same value for its asynchronous timeout. On most machines it defaults to 60 seconds, but it can be changed in multipath.conf.
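A minimal sketch of that idea, assuming the configured timeout reaches the checker as a number of seconds and that 0 means "not set". The names FALLBACK_TIMEOUT_SEC, effective_timeout_sec() and async_io_expired() are hypothetical and only illustrate the logic. They are not the actual directio checker code.

```c
#include <time.h>

/* Hypothetical fallback, matching the 60 second default mentioned above. */
#define FALLBACK_TIMEOUT_SEC 60

/* Use the configured checker timeout if one is set, otherwise fall back. */
static unsigned int effective_timeout_sec(unsigned int configured_sec)
{
	return configured_sec ? configured_sec : FALLBACK_TIMEOUT_SEC;
}

/* Returns 1 once an asynchronous I/O submitted at 'started' has been
 * outstanding for longer than 'timeout_sec' seconds, 0 otherwise. */
static int async_io_expired(time_t started, time_t now,
			    unsigned int timeout_sec)
{
	return difftime(now, started) > (double)timeout_sec;
}
```

The point of the design is that the asynchronous path no longer carries its own constant: whatever timeout the administrator sets for the synchronous checkers would apply to directio too.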
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
This bug hasn't had any activity since 2009. I don't like the idea of changing the default timeout for the directio checker in the last RHEL 5 release. If someone has a good reason why we should be doing this, let me know. The code change itself is simple.