Bug 722754

Summary: [vdsm][error-handling][lvm-conf]vdsm should add disable_after_error_count in lvm.conf
Product: Red Hat Enterprise Linux 6 Reporter: Moran Goldboim <mgoldboi>
Component: vdsm Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA QA Contact: Moran Goldboim <mgoldboi>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.3 CC: abaron, bazulay, danken, fsimonce, iheim, tdosek, ykaul
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: vdsm-4.9-87 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-12-06 07:31:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Moran Goldboim 2011-07-17 11:31:52 UTC
Description of problem:
In case of a multipath path error, lvm commands take a long time (4.5 minutes in my deployment). Adding the disable_after_error_count parameter to lvm.conf should mark the device as disabled after a set number of errors has occurred, making lvm and vdsm respond more quickly.

Version-Release number of selected component (if applicable):
vdsm-4.9-81.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. Make a multipath path faulty.
2. Run lvm pvs and lvm vgs; they take around 4.5-5 minutes (50 SDs connected).
  
Actual results:


Expected results:


Additional info:
Changing this config value in lvm.conf makes operations run much faster.
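For anyone who wants to compare timings without editing lvm.conf, the following is a rough sketch that passes the setting through lvm's --config option for a single invocation. The helper names lvm_argv and timed_run are hypothetical, and this is not necessarily how vdsm applies the setting; it also assumes an lvm2 build that supports disable_after_error_count.

```python
import subprocess
import time


def lvm_argv(subcommand, error_count):
    """Build an lvm command line that overrides disable_after_error_count.

    The --config option overrides lvm.conf settings for one invocation.
    Hypothetical helper, not taken from vdsm.
    """
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    override = "devices { disable_after_error_count=%d }" % error_count
    return ["lvm", subcommand, "--config", override]


def timed_run(argv):
    """Run argv and return (exit status, elapsed seconds)."""
    start = time.monotonic()
    status = subprocess.call(argv)
    return status, time.monotonic() - start
```

Calling timed_run(lvm_argv("pvs", 0)) and then timed_run(lvm_argv("pvs", 3)) as root on a host with a faulty path should show whether the limit makes a difference; a value of 0 leaves the error counting switched off.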

Comment 2 Dan Kenigsberg 2011-07-17 11:59:22 UTC
How and when would the device be re-enabled?

Comment 3 Federico Simoncelli 2011-07-20 13:41:11 UTC
Some additional notes: I just talked with Moran, and "multipath path error" should be understood as "the storage is unreachable through all its paths".
Looking at the disable_after_error_count code introduced with the patch:

http://sources.redhat.com/git/gitweb.cgi?p=lvm2.git;a=commitdiff;h=74b228ee945934c3b979cbb70a29b3a721f5c683

The error_count is a one-shot value for each lvm command.
Summarizing: using disable_after_error_count has no side effects (e.g. it does not permanently disable a device) and would improve lvm responsiveness when a storage device is completely unreachable.
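To illustrate the one-shot behaviour, here is a toy model in Python. It is not the lvm2 code: the Device class, run_command and the exact threshold comparison are assumptions made for the illustration. It only shows that the counters exist per command, so an unreachable device is skipped after a few failed reads and is tried again by the next command.

```python
class Device:
    """A fake block device whose reads always fail when unreachable."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.read_attempts = 0

    def read(self):
        self.read_attempts += 1
        if not self.reachable:
            raise IOError("read failed on %s" % self.name)


def run_command(devices, reads_per_device, disable_after_error_count):
    """Simulate one command; return the names of devices it stopped using.

    The error counters live only inside this call, so nothing carries
    over to the next command. A limit of 0 turns the counting off.
    """
    errors = {}        # per-command counters, discarded on return
    disabled = set()
    for dev in devices:
        for _ in range(reads_per_device):
            if dev.name in disabled:
                break  # no further I/O to this device in this command
            try:
                dev.read()
            except IOError:
                errors[dev.name] = errors.get(dev.name, 0) + 1
                limit = disable_after_error_count
                if limit and errors[dev.name] >= limit:
                    disabled.add(dev.name)
    return disabled
```

With a limit of 3, a command that would otherwise issue ten failing reads to an unreachable device stops after three, while reachable devices are unaffected.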

Comment 4 Dan Kenigsberg 2011-07-21 07:00:15 UTC
Which value should we put in disable_after_error_count so that we do not have too many false negatives? Moran, this question is also directed to your team :-)

Comment 5 Federico Simoncelli 2011-07-21 07:55:30 UTC
BZ#722754 Limit lvm retries to broken devices
Change-Id: I74dfdea05943f72c7b89eba42246fc8f26bf0035

http://gerrit.usersys.redhat.com/730

Comment 7 Tomas Dosek 2011-08-10 06:12:16 UTC
Verified - vdsm-4.9-91 - disable_after_error_count parameter is now set to 3.

Comment 8 errata-xmlrpc 2011-12-06 07:31:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html