Bug 1695879

Summary: large number of luns reports "failed errno 24"/"Too many open files" when running lvm commands on RHEL7.6 [rhel-7.6.z]
Product: Red Hat Enterprise Linux 7 Reporter: RAD team bot copy to z-stream <autobot-eus-copy>
Component: lvm2    Assignee: Marian Csontos <mcsontos>
lvm2 sub component: Default / Unclassified QA Contact: cluster-qe <cluster-qe>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: agk, cmarthal, heinzm, jbrassow, jmagrini, kailas.t.kadam, mcsontos, mjuricek, msnitzer, pdwyer, prajnoha, teigland, yhuang, zkabelac
Version: 7.6    Keywords: ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.180-10.el7_6.7 Doc Type: If docs needed, set a value
Doc Text:
On a system with around 1000 devices, LVM would fail to open devices because of the default open file limit. This fix avoids the problem by closing non-LVM devices before reaching the limit.
Story Points: ---
Clone Of: 1691277 Environment:
Last Closed: 2019-04-23 14:28:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1691277    
Bug Blocks:    

Description RAD team bot copy to z-stream 2019-04-03 20:21:57 UTC
This bug has been copied from bug #1691277 and has been proposed to be backported to 7.6 z-stream (EUS).

Comment 6 David Teigland 2019-04-04 14:38:53 UTC
I've not been able to reproduce this with or without lvmetad.  I suspect it could be a side effect of the udev problems; the EMFILE errors first appear immediately after those udev issues.  Could you try setting obtain_device_list_from_udev=0 in lvm.conf and see if this still happens?
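For reference, that setting lives in the devices section of /etc/lvm/lvm.conf (a minimal sketch; the surrounding section braces are shown only for context):

```
# /etc/lvm/lvm.conf
devices {
    # Scan /dev directly instead of asking udev for the device list
    obtain_device_list_from_udev = 0
}
```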

Comment 8 David Teigland 2019-04-04 16:06:26 UTC
There are other commits in the stable branch that fix this issue (so stable and 7.7 do not have this problem).  So it looks like your testing has validated the original fix for bug 1691277 (there's no more problem when not using lvmetad), but has also uncovered other fixes that would be needed to make lvmetad work with this many devices.

One or more of the following commits from stable would need to be backported to 7.6.z, but it's not clear how many of them can be cherry-picked directly.  Some may depend on other unrelated changes.  This may become more backporting than is appropriate for zstream.


commit 9799c8da07b77844451c64bcbbce0d9d43ce2552
Author: David Teigland <teigland>
Date:   Tue Nov 6 16:03:17 2018 -0600

    devices: reuse bcache fd when getting block size
    
    This avoids an unnecessary open() on the device.

commit f7ffba204e06ae432ae2c7943cb41eec5b8e8bb1
Author: David Teigland <teigland>
Date:   Tue Jun 26 12:05:39 2018 -0500

    devs: use bcache fd for read ahead ioctl
    
    to avoid an unnecessary open of the device in
    most cases.

commit 73578e36faa78c616716617a83083cc3a31ba03f
Author: David Teigland <teigland>
Date:   Fri May 11 14:28:46 2018 -0500

    dev_cache: remove the lvmcache check when closing fd
    
    This is no longer used since devices are not held
    open in dev_cache.

commit 3e3cb22f2a115f71f883a75c7840ab271bd83454
Author: David Teigland <teigland>
Date:   Fri May 11 14:25:08 2018 -0500

    dev_cache: fix close in utility functions
    
    All these functions are now used as utilities,
    e.g. for ioctl (not for io), and need to
    open/close the device each time they are called.
    (Many of the opens can probably be eliminated by
    just using the bcache fd for the ioctl.)

commit ccab54677c9f92cf1bd11895251799c043a57602
Author: David Teigland <teigland>
Date:   Fri May 11 13:53:19 2018 -0500

    dev_cache: fix close in dev_get_block_size

Comment 9 Marian Csontos 2019-04-09 12:42:17 UTC
To summarize, when there are many PVs in the system:

- async io, supposed to speed things up, cannot be used because of Bug 1656498,
- and lvmetad, supposed to speed things up, cannot be used because of this bug.

So sync io without lvmetad is the only option. Isn't that hurting performance badly?

Comment 10 David Teigland 2019-04-09 15:05:16 UTC
(In reply to Marian Csontos from comment #9)
> To summarize, when there are many PVs in the system:
> 
> - async io, supposed to speed things up, cannot be used because of Bug 1656498,

That shouldn't be related to the number of devices.  It's caused by other unknown software that's using all the aio contexts.  A user can simply increase the number of aio contexts on the system if they want lvm to use aio instead of falling back to sync io.
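The aio context limits referred to here are exposed under /proc; a quick way to inspect them is below (the value in the commented write is illustrative, not a recommendation):

```shell
# Contexts currently allocated vs. the system-wide maximum; lvm falls
# back to sync io when a new aio context cannot be allocated.
cat /proc/sys/fs/aio-nr
cat /proc/sys/fs/aio-max-nr

# To raise the maximum (requires root; persist it via /etc/sysctl.d/):
#   sysctl -w fs.aio-max-nr=1048576
```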

> - and lvmetad, supposed to speed things up, cannot be used because of this bug.

Just to clarify, this specific bug appears to be fixed, but there are other issues mentioned in comment 8 that will cause similar problems at around 1000 devices.  That issue can also be avoided by simply increasing the open fd limit.  If this is a problem we could open a new bug to do zstream backports of some of the other commits in comment 8.
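As a stopgap, the per-process open fd limit can be checked and raised from a shell before running lvm commands (the 4096 value is only an example):

```shell
# Current soft limit on open files for this shell
ulimit -Sn

# Hard limit (the ceiling a non-root process may raise the soft limit to)
ulimit -Hn

# Raise the soft limit for commands run from this shell:
#   ulimit -Sn 4096
```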

> So sync io without lvmetad is the only option. Isn't that hurting performance badly?

Comment 11 Marian Csontos 2019-04-10 14:15:11 UTC
Martin, are you happy with the above explanation?

Comment 13 Marian Csontos 2019-04-11 12:48:25 UTC
David, could you provide a doc string, please?

Comment 15 kailas 2019-04-22 15:25:56 UTC
Hello,

When are we going to release the patch for this?

Comment 17 errata-xmlrpc 2019-04-23 14:28:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0814