Bug 156691
Summary: | multipath-tools: add configurable timer for queue_if_no_path | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Lan Tran <tranlan> |
Component: | device-mapper-multipath | Assignee: | Alasdair Kergon <agk> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.0 | CC: | agk, christophe.varoqui, dmo, lmb, tranlan |
Target Milestone: | --- | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | U3 | Doc Type: | Enhancement |
Last Closed: | 2006-03-08 15:44:50 UTC | Type: | --- |
Description
Lan Tran
2005-05-03 13:15:01 UTC
I would expect that most hardware vendors will want to use queue_if_no_path to deal with potential all-path transient errors. My concern is that without a mechanism in place to limit the queueing, a permanent all-path failure, e.g. storage goes down (which should technically never happen, right?! :), may cause queueing to continue indefinitely until the system resources are all consumed and the system could potentially hang. I don't know if there is some mechanism at the layer above or below dm-mpath to prevent this, but if not, this seems like a very serious problem to me. Especially in system OOM situations, how would the system be able to prevent this infinite queueing if it was occurring? I think I would rather have a single I/O fail during a transient error than potentially hang a customer's system during a permanent failure. Any thoughts?

Per a discussion in one of the weekly meetings, the idea was brought up to provide a timer mechanism to limit queueing in dm. Two minutes was the suggested default, at least for the interim. Any chance of getting this into RHEL4 U2?

This should be a configurable timer, and be handled by multipath-tools setting/clearing the queue_if_no_path flag.

I'd probably be able to take a patch still for SP2. It should be a configurable timer.

As I recall from the discussion, the concern with relying on userspace to turn off queueing is that in a low-memory scenario, userspace may not be able to get the message to dm-multipath to stop queueing, so putting a configurable mechanism in the kernel would be a more reliable way of preventing and recovering from this situation. Although I guess you could argue that by the time that happens, your system may be hosed anyways.

I'm planning to start working on adding a configurable userspace timer to multipath-tools. At a high level, I was just thinking of adding a userspace configurable parameter (i.e. through hwtable or multipath.conf) that can optionally be set and works only if queue_if_no_path is enabled. If set, then each multipath will have an associated timer. multipathd would keep a global list of multipath timers that is checked at intervals (either in the checkerloop thread or maybe a separate thread) to see if any timers have expired. If any have expired, then update the map in dm to disable queue_if_no_path for that multipath device. (At some point later, maybe when any paths are restored, queue_if_no_path can be re-enabled.) A multipath timer gets added to this list and starts timing when all the paths are detected as failed, but if any paths are restored, the timer is stopped by removing it from the timer list. This seems pretty simple for now. Any thoughts? Thanks!

That's fine, however I think there's some virtue in keeping this in the kernel, or else the kernel might never be able to recover that memory if multipathd dies.

Yeah, I think that's a very good point; it would appear to be more reliable to have a timer mechanism in the kernel versus relying on userspace. I think the userspace vs. kernelspace mechanism was discussed before in the Thurs. meetings, and at the last one I believe Alasdair had mentioned that he wanted to first see how well a userspace timing mechanism worked out...

Indeed - this goes in the kernel if we're unable to make it work effectively in userspace. Whyever would multipathd die? :-)

This already exists in the upstream code, in the form of the no_path_retry option. With this option set to fail, this works like "fail_if_no_path". With this option set to queue, this works like "queue_if_no_path". With this option set to a number, this option queues IOs for the specified number of retries. To get the number of seconds that this will queue for, multiply this number by the check interval (default 5 sec); after that, it turns off queueing, which fails the IOs.

Is the option "no_path_retry" present in the RHEL4 U2 release?

Nope. It might make it into U3.

This bug associated with RHEL

The upstream code has been pulled in.

Did this code make it into RHEL 4.0 U3? Thanks. -H
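
For readers following the thread, the no_path_retry behaviour described above (a retry count multiplied by the check interval, after which queueing is turned off) can be modelled with a short C sketch. This is not the multipath-tools implementation: the struct, the constants, and the function names `queue_seconds` and `checker_tick` are all hypothetical, and the sketch assumes the countdown starts on the first checker pass that finds every path failed.

```c
/* Hypothetical model of the behaviour discussed in this bug.
 * None of these names come from multipath-tools. */
#define NO_PATH_RETRY_FAIL   0    /* fail I/O at once when no path is usable */
#define NO_PATH_RETRY_QUEUE (-1)  /* queue indefinitely (like queue_if_no_path) */

struct mpath_model {
    int no_path_retry;   /* FAIL, QUEUE, or a retry count greater than 0 */
    int retry_ticks;     /* checker passes left before queueing is disabled */
    int queueing;        /* 1 while I/O is queued instead of failed */
    int all_paths_down;  /* 1 if the previous pass saw no usable path */
};

/* Seconds of queueing for a numeric setting: retries times check interval. */
static int queue_seconds(int no_path_retry, int check_interval)
{
    return no_path_retry * check_interval;
}

/* Called once per check interval with the number of usable paths. */
static void checker_tick(struct mpath_model *m, int active_paths)
{
    if (active_paths > 0) {
        /* A path came back: cancel the countdown, restore configured queueing. */
        m->all_paths_down = 0;
        m->retry_ticks = 0;
        m->queueing = (m->no_path_retry != NO_PATH_RETRY_FAIL);
        return;
    }
    if (!m->all_paths_down) {
        /* First pass that finds every path failed: start the countdown. */
        m->all_paths_down = 1;
        m->retry_ticks = (m->no_path_retry > 0) ? m->no_path_retry : 0;
        m->queueing = (m->no_path_retry != NO_PATH_RETRY_FAIL);
        return;
    }
    /* Still no paths: count down, then stop queueing so pending I/O fails. */
    if (m->retry_ticks > 0 && --m->retry_ticks == 0)
        m->queueing = 0;
}
```

Under these assumptions, the two-minute default suggested earlier in the thread would correspond to a retry count of 24 with the default 5-second check interval, since `queue_seconds(24, 5)` is 120.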