Bug 851146

Summary: 3.1 - VDSM [Scalability] When performing storage actions vdsm stop sampling Storage Domains and engine moves host to non-operational
Product: Red Hat Enterprise Linux 6 Reporter: Omri Hochman <ohochman>
Component: vdsmAssignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA QA Contact: Rami Vaknin <rvaknin>
Severity: high Docs Contact:
Priority: high    
Version: 6.3CC: abaron, aburden, bazulay, chetan, cpelland, iheim, jbiddle, lpeer, rvaknin, yeylon, ykaul
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage scale
Fixed In Version: vdsm-4.9.6-39.0 Doc Type: Bug Fix
Doc Text:
Previously, a single expensive action involving findDomain in the Storage Pool Manager would lock the storage pool, preventing VDSM to sample the storage domain connections and switching the host to a status of 'Non-operational'. Now, findDomain calls for different storage domains run in parallel to prevent the storage pool lock.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-12-04 19:08:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 613180    
Attachments:
Description Flags
vdsm.log none

Description Omri Hochman 2012-08-23 11:01:39 UTC
VDSM [Scalability] When performing storage actions vdsm stop sampling Storage Domains connections it cause lastCheck to increase over 60.

Environment:
*************
Two RHEL6.3 Hosts (vdsm-4.9.6-29), 35 iSCSI SD's (EMC Storage machine),
No VM's.     

Description:
*************
When performing storage action such Attach/Activate/Deactivate/Detach to Storage Domains - VDSM takes lock on the Storage Pool, that prevent from VDSM to sample the Storage Domains connections and  getVdsStats  'lastCheck' increases over '60' --> In this situation (lastCheck>60) - rhevm-engine behave like there is a storage problem and switch the Host to Non-operational.   

scenario: 
**********
(environment described above)
- 'maintenance' SD's / Or 'Activate'  SD's (or any other Storage actions) 
- while When performing storage actions run on the HSM Host: 
watch -d 'vdsClient -s 0 getVdsStats | grep last'  

Results: 
*********
lastCheck increase over 60.


vdsClient -s 0 getVdsStats | grep last   Thu Aug 23 13:30:02 2012
*****************************************************
storageDomains = {'8c18aa7c-a140-4a59-92a4-570079156688': {'delay': '0.150921106339', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '7f985d40-96dc-4e80-9ff0-7f
511eac0e61': {'delay': '0.0909550189972', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '88fb922c-e870-4a3b-87b8-97bdb88ae87d': {'delay': '0.0091450214386', 'lastCheck
': '171.9', 'code': 0, 'valid': True}, '8ab6b418-9b49-4276-b61f-9dc31e0e25d0': {'delay': '0.0926730632782', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'cf3c32cc-09f
9-4ad0-a043-a2418ba4cfe6': {'delay': '0.02006483078', 'lastCheck': '168.1', 'code': 0, 'valid': True}, 'f7795d92-b5b0-413a-ab2e-4768ec230607': {'delay': '0.186009168625',
 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'c303bdc6-b05b-4581-a0e1-60f0c9810b61': {'delay': '0.00875878334045', 'lastCheck': '174.7', 'code': 0, 'valid': True}, '
6a6aae93-07be-476e-9e42-f1422062ffc3': {'delay': '0.0902950763702', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '0f8ee367-1453-49f5-a91b-23b85b852f9d': {'delay': '0.
0912981033325', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'fa367aa6-7b97-4a87-93c7-707dfb052f5d': {'delay': '0.147267103195', 'lastCheck': '174.5', 'code': 0, 'val
id': True}, 'faf85e15-58c5-4dd7-aae7-300c0ce1c255': {'delay': '0.180124998093', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '652e6392-d1ce-4f7f-8f65-b5a33ae41daa': {
'delay': '0.130205869675', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '64029faa-83ad-4768-bc6b-a52805390cb5': {'delay': '0.183727025986', 'lastCheck': '174.5', 'cod
e': 0, 'valid': True}, 'c9e0f15d-a481-461f-b086-506470cdec20': {'delay': '0.169247865677', 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'faef9299-8994-4707-92ab-7674b
fbd4968': {'delay': '0.0870730876923', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'acd956b0-bcaa-4bca-8a53-a17be1056889': {'delay': '0.139814138412', 'lastCheck': '
174.5', 'code': 0, 'valid': True}, 'a81a6bb0-5d5b-49cc-9a83-387a505c7264': {'delay': '0.1811439991', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '668be28b-acd5-4a93-
849c-c52b584db885': {'delay': '0.120023965836', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '41a29501-b371-4402-9a64-8072e9acf457': {'delay': '0.186011075974', 'last
Check': '174.5', 'code': 0, 'valid': True}, '8a42bc9d-a81d-4558-8697-c59f098ce41e': {'delay': '0.0894820690155', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'ee07cd4
d-3466-4de6-946c-8d4ff20303d6': {'delay': '0.18048620224', 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'ea351e0c-d84e-4d3f-85e4-e88c3289fa80': {'delay': '0.158120870
59', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '7edfde48-dc68-410e-8bf7-fc794fa06e75': {'delay': '0.153728961945', 'lastCheck': '174.5', 'code': 0, 'valid': True},
 'de270b27-3c16-4330-a5ec-af3bb9fe7a37': {'delay': '0.0910458564758', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '79af18ce-4e62-433a-a4f9-07d3f2ee7e3f': {'delay': '
0.0916359424591', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '43f04908-d944-467f-89aa-8391a6c335d2': {'delay': '0.0200650691986', 'lastCheck': '168.1', 'code': 0, '
valid': True}, '192cf765-05bf-4814-af51-ae9f9df819fb': {'delay': '0.183037042618', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '8caa14e2-8f08-415f-89f0-d47f5e154faf'
: {'delay': '0.159591197968', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '97938cca-58f8-41ed-a4bb-314caf7726cd': {'delay': '0.00953507423401', 'lastCheck': '172.6',
 'code': 0, 'valid': True}, 'b2f58b0e-3b71-441c-b127-a10588b98332': {'delay': '0.00966286659241', 'lastCheck': '174.7', 'code': 0, 'valid': True}, '4824db5f-af6b-4525-9ce
7-aa2ef1489a64': {'delay': '0.174260854721', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '411bac76-921f-47c1-8107-aad113f2cc17': {'delay': '0.0112969875336', 'lastCh
eck': '174.7', 'code': 0, 'valid': True}, 'c6caad53-65e3-4135-a11c-f5811e1fa412': {'delay': '0.141134977341', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '9852d729-8
8b5-40e7-be22-2d80c3aff734': {'delay': '0.00971794128418', 'lastCheck': '174.9', 'code': 0, 'valid': True}, '5d75e8bc-a608-4269-8b37-a1b36b2dd4be': {'delay': '0.145969152
451', 'lastCheck': '174.5', 'code': 0, 'valid': True}}

Comment 1 Omri Hochman 2012-08-23 11:02:16 UTC
Created attachment 606513 [details]
vdsm.log

Comment 3 Omri Hochman 2012-08-23 11:11:08 UTC
- I've encountered this issue when I tested the patch that Fixed BZ#844656 
(Domain Monitor that crashed) -  I noticed that with the patch the domain monitor stayed alive (no crash occurred), but domain monitor stopped sampling storage domains connections while vdsm ran storage actions (described above).

Comment 4 Omri Hochman 2012-08-23 14:00:53 UTC
Just to clarify things - the issue reproduces on "clean" host with vdsm build vdsm-4.9.6-29 - and it got nothing to do with the patch I've tested.

Comment 5 Ayal Baron 2012-08-23 21:03:10 UTC
Regression?

Comment 6 Omri Hochman 2012-08-26 15:32:00 UTC
(In reply to comment #5)
> Regression?

Yes,It's a regression, I've tested this scenario with: older rhevm3.0 / vdsm-4.9-113.3.el6_3.x86_64, and the lastCheck was being updated after running storage actions, such as Active/Deactive storage domain.

[root@puma13 ~]# rpm -qa | grep vdsm
vdsm-cli-4.9-113.3.el6_3.x86_64
vdsm-4.9-113.3.el6_3.x86_64

   
Every 2.0s: vdsClient -s 0 getVdsStats | grep lastCheck                                                Sun Aug 26 16:26:48 2012

        storageDomains = {'4cf818d1-e11a-424b-8dfe-adbdfd098fd3': {'delay': '0.000579118728638', 'lastCheck': '9.6', 'code': 0,
 'valid': True}, '807dfa85-033a-4819-9d9c-ae9a9c95485c': {'delay': '0.000559091567993', 'lastCheck': '7.7', 'code': 0, 'valid':
 True}, 'a52f81dd-ed17-4f32-b609-cdc64ee8bf7f': {'delay': '0.000572204589844', 'lastCheck': '1.3', 'code': 0, 'valid': True}}

Comment 8 Rami Vaknin 2012-08-31 20:37:32 UTC
Looks like it still reproduces on vdsm-4.9.6-31.0.5.git94d6da5.el6.x86_64, most of hosts moving to non-operational durihg the above vdsm operations.

The last check value is too high in few hosts (more that 150)

Comment 9 Rami Vaknin 2012-08-31 20:45:13 UTC
[root@puma06 vdsm]# vdsClient -s 0 getVdsStats | grep --color last | tr "{" "\n"
	storageDomains = 
'6170bb13-f896-4054-9638-998833d724b3': 
'delay': '0.029284954071', 'lastCheck': '134.1', 'code': 0, 'valid': True}, '421bcb9f-b8b1-4f12-a00d-b42382cd7944': 
'delay': '0.0311081409454', 'lastCheck': '6.6', 'code': 0, 'valid': True}, '5d6acf3d-e264-4ada-958e-ec49eeab928f': 
'delay': '0.0390121936798', 'lastCheck': '138.9', 'code': 0, 'valid': True}, 'b4705bf4-5779-4c0b-b22d-43841a51c780': 
'delay': '0.00913619995117', 'lastCheck': '138.1', 'code': 0, 'valid': True}, 'f220e7ca-ce77-46a3-8296-bc4302135f77': 
'delay': '0.00961494445801', 'lastCheck': '131.9', 'code': 0, 'valid': True}, '1450c107-f1e3-458e-9e02-60f6185ff8c9': 
'delay': '0.34281206131', 'lastCheck': '2.0', 'code': 0, 'valid': True}, 'c8463ae1-e9d4-4e51-a582-7f7752830fce': 
'delay': '0.0121309757233', 'lastCheck': '0.1', 'code': 0, 'valid': True}, '8843f2bf-8652-45b3-8bf3-bc525c67bffa': 
'delay': '0.0129961967468', 'lastCheck': '135.2', 'code': 0, 'valid': True}, '7fc8a0ce-f05f-4f5c-be38-3be4cec4655d': 
'delay': '0.00928497314453', 'lastCheck': '134.7', 'code': 0, 'valid': True}, 'd807283b-5249-418c-b722-d20616eb9937': 
'delay': '0.00939893722534', 'lastCheck': '133.8', 'code': 0, 'valid': True}, 'd607a030-05cd-4b36-b645-b1fd5efb8f4d': 
'delay': '0.0229818820953', 'lastCheck': '137.1', 'code': 0, 'valid': True}, 'e5eb1f01-2c70-4390-b887-d48b91ebecba': 
'delay': '0.0772261619568', 'lastCheck': '138.2', 'code': 0, 'valid': True}, 'b9662432-b05c-44a8-9bf1-9107d5b779c8': 
'delay': '0.132785081863', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '41ac0a4e-f5f9-4782-b7c2-e036539d4398': 
'delay': '0.0223870277405', 'lastCheck': '136.2', 'code': 0, 'valid': True}, '3f63f56a-9a66-4867-a9c8-d97901eee31b': 
'delay': '0.677000045776', 'lastCheck': '2.0', 'code': 0, 'valid': True}, '393f44a7-1295-4538-a29b-a4cc3ee217b5': 
'delay': '0.00942301750183', 'lastCheck': '137.0', 'code': 0, 'valid': True}, 'fb33c8fb-0c38-4021-8853-03045498b2fe': 
'delay': '0.102398872375', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '022ff07a-8559-4ad5-bb0a-7edfeb600930': 
'delay': '0.111602067947', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '512bbcd8-437b-4fdd-a132-287c5142ff87': 
'delay': '0.318045139313', 'lastCheck': '3.4', 'code': 0, 'valid': True}, 'b996502d-d247-4e3d-9920-db5c0d4d47f5': 
'delay': '0.0286719799042', 'lastCheck': '133.4', 'code': 0, 'valid': True}, '4fdf1fc5-b213-4c48-a155-361ea03db093': 
'delay': '0.0291049480438', 'lastCheck': '134.1', 'code': 0, 'valid': True}, 'f54a095b-3f75-4399-a365-9fac046a3314':

Comment 10 Rami Vaknin 2012-09-13 14:06:26 UTC
The bug also reproduces in data center of multiple host with only 1 iscsi storage domain.

Comment 11 Federico Simoncelli 2012-09-30 14:47:11 UTC
This bug was initially created by me and Omri to tackle a problem in the SPM where a single expensive action involving findDomain would block all the stats thread (because they were all stuck on a single lock present in the sdCache.produce).

At that time it was particularly evident and in fact Omri even verified an early version of the fix at: http://gerrit.ovirt.org/#/c/6822/

Now sadly we are stuck on verifying this because the bug description was vague (not mentioning the SPM) and because of an LVM scalability issue that is now more evident.

Anyway the current fix for this is at:

http://gerrit.ovirt.org/#/c/7511/

As far as I understand the LVM scalability bug is tracked somewhere else (bug 838602 I suppose), so the options are:

1. we close this as a duplicate of bug 838602 (because its fix would resolve this too)
2. the original issue can't be exposed, it can't be verified and therefore we close this as NOTABUG or WORKSFORME
3. the problem can be exposed, therefore we should match the title/description to what was originally intended and proceed to commit my patch.

Comment 12 Ayal Baron 2012-09-30 21:26:25 UTC
Number 3 looks like the way to go if we can indeed reproduce and show the patch solves it.
Reducing number of LVM calls is dealt with elsewhere.

Comment 16 Omri Hochman 2012-11-04 13:20:36 UTC
Unable to reproduce:  vdsm-4.9.6-40.0.el6_3.x86_64,  RHEVM (Build SI23).

Comment 20 errata-xmlrpc 2012-12-04 19:08:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html