Bug 261521
Summary: | pvdisplay of 250 luns with 4 paths each (1000 paths) takes many hours or days and consumes 4+GB of RAM
---|---
Product: | Red Hat Enterprise Linux 4
Reporter: | Dave Wysochanski <dwysocha>
Component: | lvm2
Assignee: | Milan Broz <mbroz>
Status: | CLOSED ERRATA
QA Contact: | Corey Marthaler <cmarthal>
Severity: | urgent
Docs Contact: |
Priority: | urgent
Version: | 4.5
CC: | agk, ahecox, andriusb, berthiaume_wayne, bhinson, bmr, coughlan, csm, dwysocha, evuraan, jbrassow, marting, mbroz, prockai, pvrabec, rjones, rsarraf, tao
Target Milestone: | rc
Keywords: | ZStream
Target Release: | ---
Hardware: | All
OS: | All
Whiteboard: |
Fixed In Version: | RHBA-2008-0776
Doc Type: | Bug Fix
Doc Text: |
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2008-07-24 20:07:34 UTC
Type: | ---
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 442308, 442309
Attachments: |
Description
Dave Wysochanski
2007-08-28 19:35:40 UTC
Created attachment 177301 [details]
lvmdump file of system with the problem
This problem continues to haunt us here at Bloomberg. Has there been any progress on this issue? More importantly, the issue also causes extremely long boot times (1-2 hours) because the vgscan in the init scripts shows similar behavior. It appears to perform (number of devices)^2 device stats (roughly 800k in our case): for every device it finds, it appears to recheck every device, including the ones already checked. Also, the filter in lvm.conf is configured to ignore /dev/sd* devices (the LUNs); however, this appears to apply only to whether it will consider any metadata on the devices. It still stats each device regardless.

I have changed the priority on this to medium instead of low. Given the nature of the problem and the fact that the machine is out of service for so long on boot, that seems merited. Please change it back if I am wrong.

Created attachment 207341 [details]
lvmdump file
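The rescanning pattern described in the comment above can be illustrated with a small toy model. This is only a sketch, not LVM2 code: it assumes one stat per (device found, device rechecked) pair, it counts simulated calls instead of touching real devices, and the function names and `/dev/sdN` device names are hypothetical. The cached variant stands in for the kind of internal caching discussed later in this thread.

```python
# Toy model only: counts simulated stat() calls, never touches real devices.
# Function names and device names are hypothetical, not taken from LVM2.

def scan_without_cache(devices):
    """For every device found, re-check every device (the reported behaviour)."""
    stat_calls = 0
    for _found in devices:
        for _dev in devices:
            stat_calls += 1  # stands in for one stat() on _dev
    return stat_calls


def scan_with_cache(devices):
    """Same loops, but each device is stat'ed once and the result remembered."""
    stat_calls = 0
    seen = set()
    for _found in devices:
        for dev in devices:
            if dev not in seen:
                stat_calls += 1  # first (and only) stat() for this device
                seen.add(dev)
    return stat_calls


if __name__ == "__main__":
    # 250 LUNs x 4 paths each, as in the bug summary.
    devices = [f"/dev/sd{i}" for i in range(250 * 4)]
    print("without cache:", scan_without_cache(devices))  # 1000000
    print("with cache:   ", scan_with_cache(devices))     # 1000
```

With 1000 paths the uncached model performs 1,000,000 stats; the reporter's figure of about 800k would correspond to roughly 900 devices under the same assumption.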
I have not made any progress on this. I am about to leave on a short vacation but will try to take at least a brief look when I get back next week. I think I know roughly why this happens but am not sure how hard it is to fix. Probably not easy, but maybe there is something we can do to improve the situation.

If there is anything you would like us to provide or test, please let us know.

Presuming that vacation is over, do we have anything to report about this?

Not yet; other things are getting in the way, sorry.

Did you set up the VG specifically to contain a large number of PVs or are you just using the default settings? (See man pvcreate --metadatacopies and --metadatasize etc.) [We know about the two performance enhancements needed (lack of internal metadata caching so operations are repeated needlessly; lack of automated VG metadata area management).]

In our testing here the first point you make about repeated operations seems to be our likely problem. I am working on getting answers to how this was created; since I didn't do it, I really don't know.

I have confirmed that the default settings were used in creation of the PVs.

Customer in IT 133260 seeing this as well.

I've reproduced this internally with about 500 (small) PVs created with default options. Using pvcreate with --metadatacopies 0 gets rid of the huge delays on VG/LV/PV operations.

Does the suggestion in bug 229560 make sense here (add VG name to .cache file)?

What progress do we see on this? Customer wants to know.

Largely the answer is removing the majority of MDAs as discussed. It's the direction upstream seems to be taking. There are also tool updates coming down the pipe which will help with managing such a setup.

Anything new to report on this? It's a month on from the last update and I am sure to get hammered soon!

----- Additional Comments From thoss.com 2008-02-01 03:39 EDT -------
Is there any update at the Red Hat site for that Bugzilla? Do you need any assistance from IBM?
This event sent from IssueTracker by jkachuck issue 136514

There are basically two steps we are working on to speed up this process:
1) use internal cache for device labels
2) use internal cache for metadata areas
A solution for problem 1) was just submitted in upstream code (but needs some subsequent patches for non-mda PVs); we are working on issue 2). I will update this bugzilla when patches are ready. Some testing on the affected configuration would then be welcome, of course.

Any further updates regarding the availability of patches?

So it's almost 2 months from the last update at this point, the Solaris and AIX people are laughing about how long this is taking and the lack of patch availability. I have to admit that this is less than optimal in terms of support for an "Enterprise" solution.

The fix for this BZ is planned for RHEL 4.7. A prerequisite is to get the change reviewed and accepted upstream, and thoroughly tested. This work is underway, and continues to be a high priority.

Setting this bug to POST status because the crucial patch (solving the activation time) is now in upstream CVS. (Several previous commits were already in the tree and solved partial problems, such as caching of device labels; see comment #36.) Several steps are still needed to prepare a test package for RHEL4; I will update this bugzilla when we have packages ready.

A testing build for RHEL4 now exists. If anyone wants to test it before it reaches the public beta testing phase, please contact Red Hat support. (For reference, the upstream package containing the fixes is the LVM2 2.02.35 release.) Thanks for your patience.

Added storage-related partners for their heads-up and request for testing.

----- Additional Comments From mgrf.com 2008-07-01 05:33 EDT -------
Hello Red Hat, Can you please post your test results for the improved fix? Thx

This event sent from IssueTracker by jkachuck issue 136514

An advisory has been issued which should help the problem described in this bug report.
This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2008-0776.html
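For readers on releases without the erratum, the interim workaround mentioned earlier in this thread (creating most PVs with `pvcreate --metadatacopies 0`) might be planned along the following lines. This is a hedged sketch only: the helper name `plan_pvcreate`, the `with_metadata` parameter, and the device paths are hypothetical, and the script only prints command lines without running them. It assumes a few PVs should keep their default metadata areas, since a volume group needs at least one PV that carries metadata.

```python
# Sketch of the "--metadatacopies 0" workaround from this thread.
# plan_pvcreate, with_metadata and the device paths are hypothetical;
# commands are only built and printed, never executed.

def plan_pvcreate(devices, with_metadata=2):
    """Return one pvcreate argv list per device.

    The first `with_metadata` devices keep the default metadata areas;
    all remaining devices are created with no metadata area.
    """
    if with_metadata < 1:
        raise ValueError("at least one PV must keep a metadata area")
    commands = []
    for index, dev in enumerate(devices):
        if index < with_metadata:
            commands.append(["pvcreate", dev])
        else:
            commands.append(["pvcreate", "--metadatacopies", "0", dev])
    return commands


if __name__ == "__main__":
    # Hypothetical multipath device names, for illustration only.
    devices = [f"/dev/mapper/mpath{i}" for i in range(5)]
    for argv in plan_pvcreate(devices):
        print(" ".join(argv))
```

Note that running pvcreate on a device already in use destroys its contents, so a plan like this only makes sense when laying out new PVs.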