Bug 851146
Summary: | 3.1 - VDSM [Scalability] When performing storage actions vdsm stops sampling Storage Domains and engine moves host to non-operational
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Omri Hochman <ohochman>
Component: | vdsm | Assignee: | Federico Simoncelli <fsimonce>
Status: | CLOSED ERRATA | QA Contact: | Rami Vaknin <rvaknin>
Severity: | high | Docs Contact: |
Priority: | high
Version: | 6.3 | CC: | abaron, aburden, bazulay, chetan, cpelland, iheim, jbiddle, lpeer, rvaknin, yeylon, ykaul
Target Milestone: | rc | Keywords: | Regression
Target Release: | ---
Hardware: | x86_64
OS: | Linux
Whiteboard: | storage scale
Fixed In Version: | vdsm-4.9.6-39.0 | Doc Type: | Bug Fix
Doc Text: |
Previously, a single expensive action involving findDomain in the Storage Pool Manager would lock the storage pool, preventing VDSM from sampling the storage domain connections and causing the host to be moved to 'Non-operational' status.
Now, findDomain calls for different storage domains run in parallel to avoid this lock.
|
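A minimal sketch of the idea behind the fix, with hypothetical names (this is not the actual vdsm sdCache code): resolving a storage domain takes a per-domain lock instead of a single cache-wide lock, so one slow findDomain no longer stalls lookups, and therefore monitoring, for every other domain.

```python
import threading


class DomainCacheSketch(object):
    """Illustrative cache of storage domain objects (hypothetical names).

    produce() takes a per-domain lock instead of one cache-wide lock, so a
    slow find_domain() for one domain does not block the monitor threads of
    all the other domains.
    """

    def __init__(self, find_domain):
        self._find_domain = find_domain     # expensive lookup (LVM scan, mount, ...)
        self._cache = {}                    # sdUUID -> domain object
        self._cache_lock = threading.Lock() # protects the dictionaries only
        self._domain_locks = {}             # sdUUID -> per-domain lock

    def _lock_for(self, sd_uuid):
        # Only the cheap dictionary update is serialized globally.
        with self._cache_lock:
            return self._domain_locks.setdefault(sd_uuid, threading.Lock())

    def produce(self, sd_uuid):
        # The expensive lookup runs under the per-domain lock only, so
        # lookups for different domains proceed in parallel.
        with self._lock_for(sd_uuid):
            if sd_uuid not in self._cache:
                self._cache[sd_uuid] = self._find_domain(sd_uuid)
            return self._cache[sd_uuid]
```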
Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2012-12-04 19:08:04 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 613180
Attachments: |
Description (Omri Hochman, 2012-08-23 11:01:39 UTC)
Created attachment 606513 [details]
vdsm.log
I've encountered this issue while testing the patch that fixed BZ#844656 (the domain monitor crash). I noticed that with the patch the domain monitor stayed alive (no crash occurred), but it stopped sampling the storage domain connections while vdsm ran storage actions (described above).

Just to clarify: the issue reproduces on a "clean" host with vdsm build vdsm-4.9.6-29, so it has nothing to do with the patch I tested.

Regression?

(In reply to comment #5)
> Regression?

Yes, it's a regression. I tested this scenario with the older RHEVM 3.0 / vdsm-4.9-113.3.el6_3.x86_64, and there lastCheck kept being updated while running storage actions such as activating/deactivating a storage domain.

[root@puma13 ~]# rpm -qa | grep vdsm
vdsm-cli-4.9-113.3.el6_3.x86_64
vdsm-4.9-113.3.el6_3.x86_64

Every 2.0s: vdsClient -s 0 getVdsStats | grep lastCheck        Sun Aug 26 16:26:48 2012
storageDomains = {'4cf818d1-e11a-424b-8dfe-adbdfd098fd3': {'delay': '0.000579118728638', 'lastCheck': '9.6', 'code': 0, 'valid': True}, '807dfa85-033a-4819-9d9c-ae9a9c95485c': {'delay': '0.000559091567993', 'lastCheck': '7.7', 'code': 0, 'valid': True}, 'a52f81dd-ed17-4f32-b609-cdc64ee8bf7f': {'delay': '0.000572204589844', 'lastCheck': '1.3', 'code': 0, 'valid': True}}

It still reproduces on vdsm-4.9.6-31.0.5.git94d6da5.el6.x86_64: most of the hosts move to non-operational during the above vdsm operations, and the lastCheck value is too high on a few hosts (more than 150).

[root@puma06 vdsm]# vdsClient -s 0 getVdsStats | grep --color last | tr "{" "\n"
storageDomains =
'6170bb13-f896-4054-9638-998833d724b3': 'delay': '0.029284954071', 'lastCheck': '134.1', 'code': 0, 'valid': True},
'421bcb9f-b8b1-4f12-a00d-b42382cd7944': 'delay': '0.0311081409454', 'lastCheck': '6.6', 'code': 0, 'valid': True},
'5d6acf3d-e264-4ada-958e-ec49eeab928f': 'delay': '0.0390121936798', 'lastCheck': '138.9', 'code': 0, 'valid': True},
'b4705bf4-5779-4c0b-b22d-43841a51c780': 'delay': '0.00913619995117', 'lastCheck': '138.1', 'code': 0, 'valid': True},
'f220e7ca-ce77-46a3-8296-bc4302135f77': 'delay': '0.00961494445801', 'lastCheck': '131.9', 'code': 0, 'valid': True},
'1450c107-f1e3-458e-9e02-60f6185ff8c9': 'delay': '0.34281206131', 'lastCheck': '2.0', 'code': 0, 'valid': True},
'c8463ae1-e9d4-4e51-a582-7f7752830fce': 'delay': '0.0121309757233', 'lastCheck': '0.1', 'code': 0, 'valid': True},
'8843f2bf-8652-45b3-8bf3-bc525c67bffa': 'delay': '0.0129961967468', 'lastCheck': '135.2', 'code': 0, 'valid': True},
'7fc8a0ce-f05f-4f5c-be38-3be4cec4655d': 'delay': '0.00928497314453', 'lastCheck': '134.7', 'code': 0, 'valid': True},
'd807283b-5249-418c-b722-d20616eb9937': 'delay': '0.00939893722534', 'lastCheck': '133.8', 'code': 0, 'valid': True},
'd607a030-05cd-4b36-b645-b1fd5efb8f4d': 'delay': '0.0229818820953', 'lastCheck': '137.1', 'code': 0, 'valid': True},
'e5eb1f01-2c70-4390-b887-d48b91ebecba': 'delay': '0.0772261619568', 'lastCheck': '138.2', 'code': 0, 'valid': True},
'b9662432-b05c-44a8-9bf1-9107d5b779c8': 'delay': '0.132785081863', 'lastCheck': '138.2', 'code': 0, 'valid': True},
'41ac0a4e-f5f9-4782-b7c2-e036539d4398': 'delay': '0.0223870277405', 'lastCheck': '136.2', 'code': 0, 'valid': True},
'3f63f56a-9a66-4867-a9c8-d97901eee31b': 'delay': '0.677000045776', 'lastCheck': '2.0', 'code': 0, 'valid': True},
'393f44a7-1295-4538-a29b-a4cc3ee217b5': 'delay': '0.00942301750183', 'lastCheck': '137.0', 'code': 0, 'valid': True},
'fb33c8fb-0c38-4021-8853-03045498b2fe': 'delay': '0.102398872375', 'lastCheck': '138.2', 'code': 0, 'valid': True},
'022ff07a-8559-4ad5-bb0a-7edfeb600930': 'delay': '0.111602067947', 'lastCheck': '138.2', 'code': 0, 'valid': True},
'512bbcd8-437b-4fdd-a132-287c5142ff87': 'delay': '0.318045139313', 'lastCheck': '3.4', 'code': 0, 'valid': True},
'b996502d-d247-4e3d-9920-db5c0d4d47f5': 'delay': '0.0286719799042', 'lastCheck': '133.4', 'code': 0, 'valid': True},
'4fdf1fc5-b213-4c48-a155-361ea03db093': 'delay': '0.0291049480438', 'lastCheck': '134.1', 'code': 0, 'valid': True},
'f54a095b-3f75-4399-a365-9fac046a3314':
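For reference, a small sketch of the check performed manually above: parse the storageDomains dictionary returned by getVdsStats and flag domains whose lastCheck is stale. The 30-second threshold and the helper name are illustrative only, not the engine's actual policy.

```python
# Flag storage domains whose lastCheck exceeds a threshold, the same figure
# inspected above with "vdsClient -s 0 getVdsStats | grep lastCheck".
STALE_SECONDS = 30.0  # illustrative threshold, not the engine's real policy


def stale_domains(storage_domains, threshold=STALE_SECONDS):
    """storage_domains: the 'storageDomains' dict from getVdsStats."""
    return sorted(
        (sd_uuid, float(stats['lastCheck']))
        for sd_uuid, stats in storage_domains.items()
        if float(stats['lastCheck']) > threshold
    )


# Example with the kind of values reported above:
stats = {
    '6170bb13-f896-4054-9638-998833d724b3':
        {'delay': '0.029', 'lastCheck': '134.1', 'code': 0, 'valid': True},
    'c8463ae1-e9d4-4e51-a582-7f7752830fce':
        {'delay': '0.012', 'lastCheck': '0.1', 'code': 0, 'valid': True},
}
for sd_uuid, last_check in stale_domains(stats):
    print('%s not sampled for %.1f seconds' % (sd_uuid, last_check))
```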
The bug also reproduces in a data center of multiple hosts with only one iSCSI storage domain.

This bug was initially created by me and Omri to tackle a problem in the SPM where a single expensive action involving findDomain would block all the stats threads, because they were all stuck on a single lock present in sdCache.produce (a sketch of this serialization appears at the end of this report). At that time it was particularly evident, and in fact Omri even verified an early version of the fix at:
http://gerrit.ovirt.org/#/c/6822/

Now, sadly, we are stuck on verifying this, because the bug description was vague (not mentioning the SPM) and because of an LVM scalability issue that is now more evident. Anyway, the current fix for this is at:
http://gerrit.ovirt.org/#/c/7511/

As far as I understand, the LVM scalability bug is tracked elsewhere (bug 838602, I suppose), so the options are:
1. We close this as a duplicate of bug 838602 (because its fix would resolve this too).
2. The original issue can't be exposed, so it can't be verified, and therefore we close this as NOTABUG or WORKSFORME.
3. The problem can be exposed, therefore we should match the title/description to what was originally intended and proceed to commit my patch.

Number 3 looks like the way to go if we can indeed reproduce the problem and show that the patch solves it. Reducing the number of LVM calls is dealt with elsewhere.

Unable to reproduce with vdsm-4.9.6-40.0.el6_3.x86_64, RHEVM (Build SI23).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-1508.html
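For illustration, a toy sketch of the serialization described in the SPM comment above (made-up domain names and timings, not vdsm code): when every monitor goes through one shared lock, a single slow findDomain stalls the sampling of all the other domains, and their lastCheck values climb together.

```python
import threading
import time

cache_lock = threading.Lock()  # the single lock that serializes everyone
last_check = {}                # domain -> time its monitor last completed


def slow_find_domain(sd_uuid):
    time.sleep(5)  # stands in for an expensive findDomain on the SPM


def monitor(sd_uuid, expensive=False):
    with cache_lock:
        if expensive:
            slow_find_domain(sd_uuid)
        last_check[sd_uuid] = time.time()


start = time.time()
threads = [threading.Thread(target=monitor, args=('spm-domain', True))]
threads += [threading.Thread(target=monitor, args=('domain-%d' % i,))
            for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every fast domain is sampled only after the slow one releases the lock,
# i.e. roughly 5 seconds late.
for sd_uuid, ts in sorted(last_check.items()):
    print('%s sampled after %.1fs' % (sd_uuid, ts - start))
```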