Bug 851146 - 3.1 - VDSM [Scalability] When performing storage actions vdsm stop sampling Storage Domains and engine moves host to non-operational
Summary: 3.1 - VDSM [Scalability] When performing storage actions vdsm stop sampling S...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Federico Simoncelli
QA Contact: Rami Vaknin
URL:
Whiteboard: storage scale
Depends On:
Blocks: 613180
 
Reported: 2012-08-23 11:01 UTC by Omri Hochman
Modified: 2022-07-09 05:38 UTC (History)
CC List: 11 users

Fixed In Version: vdsm-4.9.6-39.0
Doc Type: Bug Fix
Doc Text:
Previously, a single expensive action involving findDomain in the Storage Pool Manager would lock the storage pool, preventing VDSM from sampling the storage domain connections and switching the host to a status of 'Non-operational'. Now, findDomain calls for different storage domains run in parallel, so one slow call no longer locks the whole storage pool.
Clone Of:
Environment:
Last Closed: 2012-12-04 19:08:04 UTC
Target Upstream Version:
Embargoed:


Attachments
vdsm.log (535.01 KB, application/octet-stream)
2012-08-23 11:02 UTC, Omri Hochman


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2012:1508 0 normal SHIPPED_LIVE Important: rhev-3.1.0 vdsm security, bug fix, and enhancement update 2012-12-04 23:48:05 UTC

Description Omri Hochman 2012-08-23 11:01:39 UTC
VDSM [Scalability]: when performing storage actions, vdsm stops sampling Storage Domain connections, which causes lastCheck to increase above 60.

Environment:
*************
Two RHEL6.3 Hosts (vdsm-4.9.6-29), 35 iSCSI SD's (EMC Storage machine),
No VM's.     

Description:
*************
When performing storage actions such as Attach/Activate/Deactivate/Detach on Storage Domains, VDSM takes a lock on the Storage Pool. The lock prevents VDSM from sampling the Storage Domain connections, so the getVdsStats 'lastCheck' value increases above '60'. In this situation (lastCheck>60), rhevm-engine behaves as if there is a storage problem and switches the Host to Non-operational.

scenario: 
**********
(environment described above)
- Move SDs to 'maintenance' or 'Activate' SDs (or run any other storage action)
- While the storage actions are in progress, run on the HSM Host:
watch -d 'vdsClient -s 0 getVdsStats | grep last'  

Results: 
*********
lastCheck increases above 60.


vdsClient -s 0 getVdsStats | grep last   Thu Aug 23 13:30:02 2012
*****************************************************
storageDomains = {'8c18aa7c-a140-4a59-92a4-570079156688': {'delay': '0.150921106339', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '7f985d40-96dc-4e80-9ff0-7f
511eac0e61': {'delay': '0.0909550189972', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '88fb922c-e870-4a3b-87b8-97bdb88ae87d': {'delay': '0.0091450214386', 'lastCheck
': '171.9', 'code': 0, 'valid': True}, '8ab6b418-9b49-4276-b61f-9dc31e0e25d0': {'delay': '0.0926730632782', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'cf3c32cc-09f
9-4ad0-a043-a2418ba4cfe6': {'delay': '0.02006483078', 'lastCheck': '168.1', 'code': 0, 'valid': True}, 'f7795d92-b5b0-413a-ab2e-4768ec230607': {'delay': '0.186009168625',
 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'c303bdc6-b05b-4581-a0e1-60f0c9810b61': {'delay': '0.00875878334045', 'lastCheck': '174.7', 'code': 0, 'valid': True}, '
6a6aae93-07be-476e-9e42-f1422062ffc3': {'delay': '0.0902950763702', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '0f8ee367-1453-49f5-a91b-23b85b852f9d': {'delay': '0.
0912981033325', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'fa367aa6-7b97-4a87-93c7-707dfb052f5d': {'delay': '0.147267103195', 'lastCheck': '174.5', 'code': 0, 'val
id': True}, 'faf85e15-58c5-4dd7-aae7-300c0ce1c255': {'delay': '0.180124998093', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '652e6392-d1ce-4f7f-8f65-b5a33ae41daa': {
'delay': '0.130205869675', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '64029faa-83ad-4768-bc6b-a52805390cb5': {'delay': '0.183727025986', 'lastCheck': '174.5', 'cod
e': 0, 'valid': True}, 'c9e0f15d-a481-461f-b086-506470cdec20': {'delay': '0.169247865677', 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'faef9299-8994-4707-92ab-7674b
fbd4968': {'delay': '0.0870730876923', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'acd956b0-bcaa-4bca-8a53-a17be1056889': {'delay': '0.139814138412', 'lastCheck': '
174.5', 'code': 0, 'valid': True}, 'a81a6bb0-5d5b-49cc-9a83-387a505c7264': {'delay': '0.1811439991', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '668be28b-acd5-4a93-
849c-c52b584db885': {'delay': '0.120023965836', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '41a29501-b371-4402-9a64-8072e9acf457': {'delay': '0.186011075974', 'last
Check': '174.5', 'code': 0, 'valid': True}, '8a42bc9d-a81d-4558-8697-c59f098ce41e': {'delay': '0.0894820690155', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'ee07cd4
d-3466-4de6-946c-8d4ff20303d6': {'delay': '0.18048620224', 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'ea351e0c-d84e-4d3f-85e4-e88c3289fa80': {'delay': '0.158120870
59', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '7edfde48-dc68-410e-8bf7-fc794fa06e75': {'delay': '0.153728961945', 'lastCheck': '174.5', 'code': 0, 'valid': True},
 'de270b27-3c16-4330-a5ec-af3bb9fe7a37': {'delay': '0.0910458564758', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '79af18ce-4e62-433a-a4f9-07d3f2ee7e3f': {'delay': '
0.0916359424591', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '43f04908-d944-467f-89aa-8391a6c335d2': {'delay': '0.0200650691986', 'lastCheck': '168.1', 'code': 0, '
valid': True}, '192cf765-05bf-4814-af51-ae9f9df819fb': {'delay': '0.183037042618', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '8caa14e2-8f08-415f-89f0-d47f5e154faf'
: {'delay': '0.159591197968', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '97938cca-58f8-41ed-a4bb-314caf7726cd': {'delay': '0.00953507423401', 'lastCheck': '172.6',
 'code': 0, 'valid': True}, 'b2f58b0e-3b71-441c-b127-a10588b98332': {'delay': '0.00966286659241', 'lastCheck': '174.7', 'code': 0, 'valid': True}, '4824db5f-af6b-4525-9ce
7-aa2ef1489a64': {'delay': '0.174260854721', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '411bac76-921f-47c1-8107-aad113f2cc17': {'delay': '0.0112969875336', 'lastCh
eck': '174.7', 'code': 0, 'valid': True}, 'c6caad53-65e3-4135-a11c-f5811e1fa412': {'delay': '0.141134977341', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '9852d729-8
8b5-40e7-be22-2d80c3aff734': {'delay': '0.00971794128418', 'lastCheck': '174.9', 'code': 0, 'valid': True}, '5d75e8bc-a608-4269-8b37-a1b36b2dd4be': {'delay': '0.145969152
451', 'lastCheck': '174.5', 'code': 0, 'valid': True}}
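
The lastCheck>60 condition described above can be checked offline against a captured `storageDomains` line. The following is only a sketch, not code from vdsm or the engine: the function name `stale_domains` is hypothetical, the 60-second threshold is the one quoted in this report, and it assumes the line is a Python dict literal as in the output above. The sample line combines two entries copied from outputs in this report.

```python
import ast


def stale_domains(stats_line, threshold=60.0):
    """Return {domain_uuid: lastCheck} for domains whose lastCheck exceeds threshold.

    stats_line is assumed to look like "storageDomains = {...}", i.e. the line
    printed by `vdsClient -s 0 getVdsStats | grep last`.
    """
    # Keep only the dict literal to the right of the first "=".
    _, _, literal = stats_line.partition("=")
    domains = ast.literal_eval(literal.strip())
    return {
        uuid: float(info["lastCheck"])  # lastCheck is reported as a string
        for uuid, info in domains.items()
        if float(info["lastCheck"]) > threshold
    }


sample = (
    "storageDomains = {'8c18aa7c-a140-4a59-92a4-570079156688': "
    "{'delay': '0.150921106339', 'lastCheck': '174.5', 'code': 0, 'valid': True}, "
    "'421bcb9f-b8b1-4f12-a00d-b42382cd7944': "
    "{'delay': '0.0311081409454', 'lastCheck': '6.6', 'code': 0, 'valid': True}}"
)

print(stale_domains(sample))
```

Note that output wrapped by the terminal, as above, would first need to be joined back into a single line before it can be parsed this way.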

Comment 1 Omri Hochman 2012-08-23 11:02:16 UTC
Created attachment 606513 [details]
vdsm.log

Comment 3 Omri Hochman 2012-08-23 11:11:08 UTC
- I encountered this issue while testing the patch that fixed BZ#844656
(Domain Monitor that crashed). With the patch the domain monitor stayed alive (no crash occurred), but it stopped sampling storage domain connections while vdsm ran storage actions (described above).

Comment 4 Omri Hochman 2012-08-23 14:00:53 UTC
Just to clarify: the issue reproduces on a "clean" host with vdsm build vdsm-4.9.6-29, and it has nothing to do with the patch I tested.

Comment 5 Ayal Baron 2012-08-23 21:03:10 UTC
Regression?

Comment 6 Omri Hochman 2012-08-26 15:32:00 UTC
(In reply to comment #5)
> Regression?

Yes, it's a regression. I tested this scenario with an older rhevm3.0 / vdsm-4.9-113.3.el6_3.x86_64, and lastCheck was still being updated after running storage actions such as Activate/Deactivate storage domain.

[root@puma13 ~]# rpm -qa | grep vdsm
vdsm-cli-4.9-113.3.el6_3.x86_64
vdsm-4.9-113.3.el6_3.x86_64

   
Every 2.0s: vdsClient -s 0 getVdsStats | grep lastCheck                                                Sun Aug 26 16:26:48 2012

        storageDomains = {'4cf818d1-e11a-424b-8dfe-adbdfd098fd3': {'delay': '0.000579118728638', 'lastCheck': '9.6', 'code': 0,
 'valid': True}, '807dfa85-033a-4819-9d9c-ae9a9c95485c': {'delay': '0.000559091567993', 'lastCheck': '7.7', 'code': 0, 'valid':
 True}, 'a52f81dd-ed17-4f32-b609-cdc64ee8bf7f': {'delay': '0.000572204589844', 'lastCheck': '1.3', 'code': 0, 'valid': True}}

Comment 8 Rami Vaknin 2012-08-31 20:37:32 UTC
Looks like it still reproduces on vdsm-4.9.6-31.0.5.git94d6da5.el6.x86_64; most of the hosts move to non-operational during the above vdsm operations.

The lastCheck value is too high on a few hosts (more than 150).

Comment 9 Rami Vaknin 2012-08-31 20:45:13 UTC
[root@puma06 vdsm]# vdsClient -s 0 getVdsStats | grep --color last | tr "{" "\n"
	storageDomains = 
'6170bb13-f896-4054-9638-998833d724b3': 
'delay': '0.029284954071', 'lastCheck': '134.1', 'code': 0, 'valid': True}, '421bcb9f-b8b1-4f12-a00d-b42382cd7944': 
'delay': '0.0311081409454', 'lastCheck': '6.6', 'code': 0, 'valid': True}, '5d6acf3d-e264-4ada-958e-ec49eeab928f': 
'delay': '0.0390121936798', 'lastCheck': '138.9', 'code': 0, 'valid': True}, 'b4705bf4-5779-4c0b-b22d-43841a51c780': 
'delay': '0.00913619995117', 'lastCheck': '138.1', 'code': 0, 'valid': True}, 'f220e7ca-ce77-46a3-8296-bc4302135f77': 
'delay': '0.00961494445801', 'lastCheck': '131.9', 'code': 0, 'valid': True}, '1450c107-f1e3-458e-9e02-60f6185ff8c9': 
'delay': '0.34281206131', 'lastCheck': '2.0', 'code': 0, 'valid': True}, 'c8463ae1-e9d4-4e51-a582-7f7752830fce': 
'delay': '0.0121309757233', 'lastCheck': '0.1', 'code': 0, 'valid': True}, '8843f2bf-8652-45b3-8bf3-bc525c67bffa': 
'delay': '0.0129961967468', 'lastCheck': '135.2', 'code': 0, 'valid': True}, '7fc8a0ce-f05f-4f5c-be38-3be4cec4655d': 
'delay': '0.00928497314453', 'lastCheck': '134.7', 'code': 0, 'valid': True}, 'd807283b-5249-418c-b722-d20616eb9937': 
'delay': '0.00939893722534', 'lastCheck': '133.8', 'code': 0, 'valid': True}, 'd607a030-05cd-4b36-b645-b1fd5efb8f4d': 
'delay': '0.0229818820953', 'lastCheck': '137.1', 'code': 0, 'valid': True}, 'e5eb1f01-2c70-4390-b887-d48b91ebecba': 
'delay': '0.0772261619568', 'lastCheck': '138.2', 'code': 0, 'valid': True}, 'b9662432-b05c-44a8-9bf1-9107d5b779c8': 
'delay': '0.132785081863', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '41ac0a4e-f5f9-4782-b7c2-e036539d4398': 
'delay': '0.0223870277405', 'lastCheck': '136.2', 'code': 0, 'valid': True}, '3f63f56a-9a66-4867-a9c8-d97901eee31b': 
'delay': '0.677000045776', 'lastCheck': '2.0', 'code': 0, 'valid': True}, '393f44a7-1295-4538-a29b-a4cc3ee217b5': 
'delay': '0.00942301750183', 'lastCheck': '137.0', 'code': 0, 'valid': True}, 'fb33c8fb-0c38-4021-8853-03045498b2fe': 
'delay': '0.102398872375', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '022ff07a-8559-4ad5-bb0a-7edfeb600930': 
'delay': '0.111602067947', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '512bbcd8-437b-4fdd-a132-287c5142ff87': 
'delay': '0.318045139313', 'lastCheck': '3.4', 'code': 0, 'valid': True}, 'b996502d-d247-4e3d-9920-db5c0d4d47f5': 
'delay': '0.0286719799042', 'lastCheck': '133.4', 'code': 0, 'valid': True}, '4fdf1fc5-b213-4c48-a155-361ea03db093': 
'delay': '0.0291049480438', 'lastCheck': '134.1', 'code': 0, 'valid': True}, 'f54a095b-3f75-4399-a365-9fac046a3314':

Comment 10 Rami Vaknin 2012-09-13 14:06:26 UTC
The bug also reproduces in a data center with multiple hosts and only 1 iSCSI storage domain.

Comment 11 Federico Simoncelli 2012-09-30 14:47:11 UTC
This bug was initially created by me and Omri to tackle a problem in the SPM where a single expensive action involving findDomain would block all the stats threads (because they were all stuck on a single lock in sdCache.produce).

At that time it was particularly evident and in fact Omri even verified an early version of the fix at: http://gerrit.ovirt.org/#/c/6822/

Now sadly we are stuck on verifying this because the bug description was vague (not mentioning the SPM) and because of an LVM scalability issue that is now more evident.

Anyway the current fix for this is at:

http://gerrit.ovirt.org/#/c/7511/

As far as I understand the LVM scalability bug is tracked somewhere else (bug 838602 I suppose), so the options are:

1. we close this as a duplicate of bug 838602 (because its fix would resolve this too)
2. the original issue can't be exposed, it can't be verified and therefore we close this as NOTABUG or WORKSFORME
3. the problem can be exposed, therefore we should match the title/description to what was originally intended and proceed to commit my patch.
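
The contention described in Comment 11 and in the Doc Text (one slow findDomain holding a lock that every stats thread needs) can be illustrated with a per-domain lock. This is a minimal sketch under stated assumptions and is not the patch at the gerrit links above: `PerDomainCache`, `fake_find_domain`, and the "slow"/"fast" domain names are hypothetical, and the real sdCache may be structured differently.

```python
import threading
import time


class PerDomainCache:
    """Hypothetical cache using one lock per domain UUID instead of one global lock."""

    def __init__(self, find_domain):
        self._find_domain = find_domain
        self._guard = threading.Lock()  # protects only the lock table below
        self._locks = {}
        self._domains = {}

    def produce(self, sd_uuid):
        # Hold the short-lived guard just long enough to fetch this domain's lock.
        with self._guard:
            lock = self._locks.setdefault(sd_uuid, threading.Lock())
        # A slow lookup now serializes callers of the same domain only.
        with lock:
            if sd_uuid not in self._domains:
                self._domains[sd_uuid] = self._find_domain(sd_uuid)
            return self._domains[sd_uuid]


def fake_find_domain(sd_uuid):
    # Stand-in for an expensive lookup; only the "slow" domain takes time.
    if sd_uuid == "slow":
        time.sleep(0.5)
    return "domain-" + sd_uuid


cache = PerDomainCache(fake_find_domain)
worker = threading.Thread(target=cache.produce, args=("slow",))
worker.start()
time.sleep(0.05)  # give the worker time to start its slow lookup

start = time.monotonic()
fast_result = cache.produce("fast")  # not blocked by the slow lookup
elapsed = time.monotonic() - start
worker.join()

print(fast_result, "returned in %.3f s" % elapsed)
```

With a single lock around the whole of `produce`, the "fast" call would instead wait for the slow lookup to finish, which is the behavior that let lastCheck grow in this report.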

Comment 12 Ayal Baron 2012-09-30 21:26:25 UTC
Number 3 looks like the way to go if we can indeed reproduce and show the patch solves it.
Reducing number of LVM calls is dealt with elsewhere.

Comment 16 Omri Hochman 2012-11-04 13:20:36 UTC
Unable to reproduce with vdsm-4.9.6-40.0.el6_3.x86_64, RHEVM (Build SI23).

Comment 20 errata-xmlrpc 2012-12-04 19:08:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html

