Bug 851146

Summary:

3.1 - VDSM [Scalability] When performing storage actions vdsm stop sampling Storage Domains and engine moves host to non-operational

Product:

Red Hat Enterprise Linux 6

Reporter:

Omri Hochman <ohochman>

Component:

vdsm

Assignee:

Federico Simoncelli <fsimonce>

Status:

CLOSED ERRATA

QA Contact:

Rami Vaknin <rvaknin>

Severity:

high

Docs Contact:

Priority:

high

Version:

6.3

CC:

abaron, aburden, bazulay, chetan, cpelland, iheim, jbiddle, lpeer, rvaknin, yeylon, ykaul

Target Milestone:

Keywords:

Regression

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

storage scale

Fixed In Version:

vdsm-4.9.6-39.0

Doc Type:

Bug Fix

Doc Text:

Previously, a single expensive action involving findDomain in the Storage Pool Manager would lock the storage pool, preventing VDSM to sample the storage domain connections and switching the host to a status of 'Non-operational'. Now, findDomain calls for different storage domains run in parallel to prevent the storage pool lock.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2012-12-04 19:08:04 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

613180

Attachments:

Description	Flags
vdsm.log	none

Description Omri Hochman 2012-08-23 11:01:39 UTC

VDSM [Scalability] When performing storage actions vdsm stop sampling Storage Domains connections it cause lastCheck to increase over 60.

Environment:
*************
Two RHEL6.3 Hosts (vdsm-4.9.6-29), 35 iSCSI SD's (EMC Storage machine),
No VM's.     

Description:
*************
When performing storage action such Attach/Activate/Deactivate/Detach to Storage Domains - VDSM takes lock on the Storage Pool, that prevent from VDSM to sample the Storage Domains connections and  getVdsStats  'lastCheck' increases over '60' --> In this situation (lastCheck>60) - rhevm-engine behave like there is a storage problem and switch the Host to Non-operational.   

scenario: 
**********
(environment described above)
- 'maintenance' SD's / Or 'Activate'  SD's (or any other Storage actions) 
- while When performing storage actions run on the HSM Host: 
watch -d 'vdsClient -s 0 getVdsStats | grep last'  

Results: 
*********
lastCheck increase over 60.


vdsClient -s 0 getVdsStats | grep last   Thu Aug 23 13:30:02 2012
*****************************************************
storageDomains = {'8c18aa7c-a140-4a59-92a4-570079156688': {'delay': '0.150921106339', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '7f985d40-96dc-4e80-9ff0-7f
511eac0e61': {'delay': '0.0909550189972', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '88fb922c-e870-4a3b-87b8-97bdb88ae87d': {'delay': '0.0091450214386', 'lastCheck
': '171.9', 'code': 0, 'valid': True}, '8ab6b418-9b49-4276-b61f-9dc31e0e25d0': {'delay': '0.0926730632782', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'cf3c32cc-09f
9-4ad0-a043-a2418ba4cfe6': {'delay': '0.02006483078', 'lastCheck': '168.1', 'code': 0, 'valid': True}, 'f7795d92-b5b0-413a-ab2e-4768ec230607': {'delay': '0.186009168625',
 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'c303bdc6-b05b-4581-a0e1-60f0c9810b61': {'delay': '0.00875878334045', 'lastCheck': '174.7', 'code': 0, 'valid': True}, '
6a6aae93-07be-476e-9e42-f1422062ffc3': {'delay': '0.0902950763702', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '0f8ee367-1453-49f5-a91b-23b85b852f9d': {'delay': '0.
0912981033325', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'fa367aa6-7b97-4a87-93c7-707dfb052f5d': {'delay': '0.147267103195', 'lastCheck': '174.5', 'code': 0, 'val
id': True}, 'faf85e15-58c5-4dd7-aae7-300c0ce1c255': {'delay': '0.180124998093', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '652e6392-d1ce-4f7f-8f65-b5a33ae41daa': {
'delay': '0.130205869675', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '64029faa-83ad-4768-bc6b-a52805390cb5': {'delay': '0.183727025986', 'lastCheck': '174.5', 'cod
e': 0, 'valid': True}, 'c9e0f15d-a481-461f-b086-506470cdec20': {'delay': '0.169247865677', 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'faef9299-8994-4707-92ab-7674b
fbd4968': {'delay': '0.0870730876923', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'acd956b0-bcaa-4bca-8a53-a17be1056889': {'delay': '0.139814138412', 'lastCheck': '
174.5', 'code': 0, 'valid': True}, 'a81a6bb0-5d5b-49cc-9a83-387a505c7264': {'delay': '0.1811439991', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '668be28b-acd5-4a93-
849c-c52b584db885': {'delay': '0.120023965836', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '41a29501-b371-4402-9a64-8072e9acf457': {'delay': '0.186011075974', 'last
Check': '174.5', 'code': 0, 'valid': True}, '8a42bc9d-a81d-4558-8697-c59f098ce41e': {'delay': '0.0894820690155', 'lastCheck': '167.4', 'code': 0, 'valid': True}, 'ee07cd4
d-3466-4de6-946c-8d4ff20303d6': {'delay': '0.18048620224', 'lastCheck': '174.5', 'code': 0, 'valid': True}, 'ea351e0c-d84e-4d3f-85e4-e88c3289fa80': {'delay': '0.158120870
59', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '7edfde48-dc68-410e-8bf7-fc794fa06e75': {'delay': '0.153728961945', 'lastCheck': '174.5', 'code': 0, 'valid': True},
 'de270b27-3c16-4330-a5ec-af3bb9fe7a37': {'delay': '0.0910458564758', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '79af18ce-4e62-433a-a4f9-07d3f2ee7e3f': {'delay': '
0.0916359424591', 'lastCheck': '167.4', 'code': 0, 'valid': True}, '43f04908-d944-467f-89aa-8391a6c335d2': {'delay': '0.0200650691986', 'lastCheck': '168.1', 'code': 0, '
valid': True}, '192cf765-05bf-4814-af51-ae9f9df819fb': {'delay': '0.183037042618', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '8caa14e2-8f08-415f-89f0-d47f5e154faf'
: {'delay': '0.159591197968', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '97938cca-58f8-41ed-a4bb-314caf7726cd': {'delay': '0.00953507423401', 'lastCheck': '172.6',
 'code': 0, 'valid': True}, 'b2f58b0e-3b71-441c-b127-a10588b98332': {'delay': '0.00966286659241', 'lastCheck': '174.7', 'code': 0, 'valid': True}, '4824db5f-af6b-4525-9ce
7-aa2ef1489a64': {'delay': '0.174260854721', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '411bac76-921f-47c1-8107-aad113f2cc17': {'delay': '0.0112969875336', 'lastCh
eck': '174.7', 'code': 0, 'valid': True}, 'c6caad53-65e3-4135-a11c-f5811e1fa412': {'delay': '0.141134977341', 'lastCheck': '174.5', 'code': 0, 'valid': True}, '9852d729-8
8b5-40e7-be22-2d80c3aff734': {'delay': '0.00971794128418', 'lastCheck': '174.9', 'code': 0, 'valid': True}, '5d75e8bc-a608-4269-8b37-a1b36b2dd4be': {'delay': '0.145969152
451', 'lastCheck': '174.5', 'code': 0, 'valid': True}}

Comment 1 Omri Hochman 2012-08-23 11:02:16 UTC

Created attachment 606513 [details]
vdsm.log

Comment 3 Omri Hochman 2012-08-23 11:11:08 UTC

- I've encountered this issue when I tested the patch that Fixed BZ#844656 
(Domain Monitor that crashed) -  I noticed that with the patch the domain monitor stayed alive (no crash occurred), but domain monitor stopped sampling storage domains connections while vdsm ran storage actions (described above).

Comment 4 Omri Hochman 2012-08-23 14:00:53 UTC

Just to clarify things - the issue reproduces on "clean" host with vdsm build vdsm-4.9.6-29 - and it got nothing to do with the patch I've tested.

Comment 5 Ayal Baron 2012-08-23 21:03:10 UTC

Regression?

Comment 6 Omri Hochman 2012-08-26 15:32:00 UTC

(In reply to comment #5)
> Regression?

Yes,It's a regression, I've tested this scenario with: older rhevm3.0 / vdsm-4.9-113.3.el6_3.x86_64, and the lastCheck was being updated after running storage actions, such as Active/Deactive storage domain.

[root@puma13 ~]# rpm -qa | grep vdsm
vdsm-cli-4.9-113.3.el6_3.x86_64
vdsm-4.9-113.3.el6_3.x86_64

   
Every 2.0s: vdsClient -s 0 getVdsStats | grep lastCheck                                                Sun Aug 26 16:26:48 2012

        storageDomains = {'4cf818d1-e11a-424b-8dfe-adbdfd098fd3': {'delay': '0.000579118728638', 'lastCheck': '9.6', 'code': 0,
 'valid': True}, '807dfa85-033a-4819-9d9c-ae9a9c95485c': {'delay': '0.000559091567993', 'lastCheck': '7.7', 'code': 0, 'valid':
 True}, 'a52f81dd-ed17-4f32-b609-cdc64ee8bf7f': {'delay': '0.000572204589844', 'lastCheck': '1.3', 'code': 0, 'valid': True}}

Comment 8 Rami Vaknin 2012-08-31 20:37:32 UTC

Looks like it still reproduces on vdsm-4.9.6-31.0.5.git94d6da5.el6.x86_64, most of hosts moving to non-operational durihg the above vdsm operations.

The last check value is too high in few hosts (more that 150)

Comment 9 Rami Vaknin 2012-08-31 20:45:13 UTC

[root@puma06 vdsm]# vdsClient -s 0 getVdsStats | grep --color last | tr "{" "\n"
	storageDomains = 
'6170bb13-f896-4054-9638-998833d724b3': 
'delay': '0.029284954071', 'lastCheck': '134.1', 'code': 0, 'valid': True}, '421bcb9f-b8b1-4f12-a00d-b42382cd7944': 
'delay': '0.0311081409454', 'lastCheck': '6.6', 'code': 0, 'valid': True}, '5d6acf3d-e264-4ada-958e-ec49eeab928f': 
'delay': '0.0390121936798', 'lastCheck': '138.9', 'code': 0, 'valid': True}, 'b4705bf4-5779-4c0b-b22d-43841a51c780': 
'delay': '0.00913619995117', 'lastCheck': '138.1', 'code': 0, 'valid': True}, 'f220e7ca-ce77-46a3-8296-bc4302135f77': 
'delay': '0.00961494445801', 'lastCheck': '131.9', 'code': 0, 'valid': True}, '1450c107-f1e3-458e-9e02-60f6185ff8c9': 
'delay': '0.34281206131', 'lastCheck': '2.0', 'code': 0, 'valid': True}, 'c8463ae1-e9d4-4e51-a582-7f7752830fce': 
'delay': '0.0121309757233', 'lastCheck': '0.1', 'code': 0, 'valid': True}, '8843f2bf-8652-45b3-8bf3-bc525c67bffa': 
'delay': '0.0129961967468', 'lastCheck': '135.2', 'code': 0, 'valid': True}, '7fc8a0ce-f05f-4f5c-be38-3be4cec4655d': 
'delay': '0.00928497314453', 'lastCheck': '134.7', 'code': 0, 'valid': True}, 'd807283b-5249-418c-b722-d20616eb9937': 
'delay': '0.00939893722534', 'lastCheck': '133.8', 'code': 0, 'valid': True}, 'd607a030-05cd-4b36-b645-b1fd5efb8f4d': 
'delay': '0.0229818820953', 'lastCheck': '137.1', 'code': 0, 'valid': True}, 'e5eb1f01-2c70-4390-b887-d48b91ebecba': 
'delay': '0.0772261619568', 'lastCheck': '138.2', 'code': 0, 'valid': True}, 'b9662432-b05c-44a8-9bf1-9107d5b779c8': 
'delay': '0.132785081863', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '41ac0a4e-f5f9-4782-b7c2-e036539d4398': 
'delay': '0.0223870277405', 'lastCheck': '136.2', 'code': 0, 'valid': True}, '3f63f56a-9a66-4867-a9c8-d97901eee31b': 
'delay': '0.677000045776', 'lastCheck': '2.0', 'code': 0, 'valid': True}, '393f44a7-1295-4538-a29b-a4cc3ee217b5': 
'delay': '0.00942301750183', 'lastCheck': '137.0', 'code': 0, 'valid': True}, 'fb33c8fb-0c38-4021-8853-03045498b2fe': 
'delay': '0.102398872375', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '022ff07a-8559-4ad5-bb0a-7edfeb600930': 
'delay': '0.111602067947', 'lastCheck': '138.2', 'code': 0, 'valid': True}, '512bbcd8-437b-4fdd-a132-287c5142ff87': 
'delay': '0.318045139313', 'lastCheck': '3.4', 'code': 0, 'valid': True}, 'b996502d-d247-4e3d-9920-db5c0d4d47f5': 
'delay': '0.0286719799042', 'lastCheck': '133.4', 'code': 0, 'valid': True}, '4fdf1fc5-b213-4c48-a155-361ea03db093': 
'delay': '0.0291049480438', 'lastCheck': '134.1', 'code': 0, 'valid': True}, 'f54a095b-3f75-4399-a365-9fac046a3314':

Comment 10 Rami Vaknin 2012-09-13 14:06:26 UTC

The bug also reproduces in data center of multiple host with only 1 iscsi storage domain.

Comment 11 Federico Simoncelli 2012-09-30 14:47:11 UTC

This bug was initially created by me and Omri to tackle a problem in the SPM where a single expensive action involving findDomain would block all the stats thread (because they were all stuck on a single lock present in the sdCache.produce).

At that time it was particularly evident and in fact Omri even verified an early version of the fix at: http://gerrit.ovirt.org/#/c/6822/

Now sadly we are stuck on verifying this because the bug description was vague (not mentioning the SPM) and because of an LVM scalability issue that is now more evident.

Anyway the current fix for this is at:

http://gerrit.ovirt.org/#/c/7511/

As far as I understand the LVM scalability bug is tracked somewhere else (bug 838602 I suppose), so the options are:

1. we close this as a duplicate of bug 838602 (because its fix would resolve this too)
2. the original issue can't be exposed, it can't be verified and therefore we close this as NOTABUG or WORKSFORME
3. the problem can be exposed, therefore we should match the title/description to what was originally intended and proceed to commit my patch.

Comment 12 Ayal Baron 2012-09-30 21:26:25 UTC

Number 3 looks like the way to go if we can indeed reproduce and show the patch solves it.
Reducing number of LVM calls is dealt with elsewhere.

Comment 16 Omri Hochman 2012-11-04 13:20:36 UTC

Unable to reproduce:  vdsm-4.9.6-40.0.el6_3.x86_64,  RHEVM (Build SI23).

Comment 20 errata-xmlrpc 2012-12-04 19:08:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html