
Bug 879930

Summary: ovirt-engine-backend [Scalability]: The queries getstorage_domains_by_storagepoolid and getdisksvmguid cause postmaster processes to constantly consume 100% CPU.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Omri Hochman <ohochman>
Component: ovirt-engine
Assignee: mkublin <mkublin>
Status: CLOSED ERRATA
QA Contact: vvyazmin <vvyazmin>
Severity: urgent
Priority: high
Version: 3.1.0
CC: bazulay, dyasny, hateya, iheim, lpeer, mkublin, Rhev-m-bugs, sgrinber, tvvcox, yeylon, ykaul, yzaslavs
Target Milestone: ---
Keywords: TestBlocker
Target Release: 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version: sf1
Doc Type: Bug Fix
Doc Text:
The getdisksvmguid query from GetVmStatsVDSCommand was run for every running virtual machine, causing the postmaster processes on the remote database to consume 100% CPU. This query is no longer run, which reduces CPU usage to 30% when GetVmStatsVDSCommand runs.
Story Points: ---
Last Closed: 2013-06-10 21:23:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Attachments:
  pg_log (flags: none)
  engine.log (flags: none)

Description Omri Hochman 2012-11-25 15:33:03 UTC
Created attachment 651553 [details]
pg_log

ovirt-engine-backend [Scalability]: The queries getstorage_domains_by_storagepoolid and getdisksvmguid cause postmaster processes to constantly consume 100% CPU.

Description:
*************
On a scale environment (details below), the postmaster processes on the remote DB physical machine constantly consume 100% CPU. Investigating pg_log for queries that take more than 1 second to return showed that getstorage_domains_by_storagepoolid and getdisksvmguid are run very frequently and take a long time.
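
For reference, statements this slow are typically captured by setting log_min_duration_statement in postgresql.conf on the DB machine; a minimal sketch (the exact threshold and reload method depend on the installation):

  # postgresql.conf -- log any statement that runs longer than 1 second
  log_min_duration_statement = 1000   # milliseconds; -1 disables, 0 logs everything

  -- then reload the configuration without a restart (run as a superuser):
  SELECT pg_reload_conf();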

RHEVM Environment:
*******************
- RHEVM (Build IC24.4) installed on a physical machine.
- PostgreSQL remote DB on another physical machine.

Objects in RHEVM: 
*****************
- Total 31 Hosts. 
- Total 50 iSCSI Storage Domains + 1 ISO + 1 Export.
- Total 1400+ running XP VMs (1 NIC, 1 HD).
- 2300 Users/Groups.

pg_log (queries that took more than 1 second) :
**********************************************
LOG:  duration: 3339.596 ms  execute S_8: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '4c67e140-55ad-4371-a1c1-dd0f80ae5623', $2 = NULL, $3 = 'f'
LOG:  duration: 3681.452 ms  execute S_2: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '8ddcec08-f7ed-4890-98ec-d510dac88c5f', $2 = NULL, $3 = 'f'
LOG:  duration: 5153.521 ms  execute S_8: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '1f116259-f701-4620-bd78-bdbba46da6c7', $2 = NULL, $3 = 'f'
LOG:  duration: 1607.459 ms  execute S_2: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '4700471d-6d82-4bb3-ba94-09f28879c880', $2 = NULL, $3 = 'f'
LOG:  duration: 4932.313 ms  execute S_1: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '76d207cd-eee9-41df-beb4-d59cfea75ed8', $2 = NULL, $3 = 'f'
LOG:  duration: 4033.496 ms  execute S_2: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '8ac38784-4a6c-4a9e-8be2-c98128c2a297', $2 = NULL, $3 = 'f'
LOG:  duration: 5088.718 ms  execute S_10: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '921b887f-0110-472a-ad2f-439bdd61cfc5', $2 = NULL, $3 = 'f'
LOG:  duration: 1996.767 ms  execute S_3: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = 'e52b853c-fd23-49c0-8182-90e5c5726277', $2 = NULL, $3 = 'f'
LOG:  duration: 1414.949 ms  execute S_11: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '69c6fbca-cd24-4a42-8e6a-accadf55436d', $2 = NULL, $3 = 'f'
LOG:  duration: 1699.736 ms  execute S_2: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '98fdadf7-4bea-489f-b3ff-96c25e662a2d', $2 = NULL, $3 = 'f'
LOG:  duration: 2528.628 ms  execute S_2: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = 'd9cf59cc-a8f0-4b29-af5a-2ba0c03fdec7', $2 = NULL, $3 = 'f'
LOG:  duration: 1034.817 ms  execute S_1: select * from  getdisksvmguid($1, $2, $3)
DETAIL:  parameters: $1 = '5e3204fc-1271-4317-a516-d2958eae3cd6', $2 = NULL, $3 = 'f'
LOG:  duration: 1109.804 ms  execute <unnamed>: select * from  getstorage_domains_by_storagepoolid($1, $2, $3)
DETAIL:  parameters: $1 = '4a90a284-adbb-465c-bffd-e1703b2c5a66', $2 = NULL, $3 = 'f'
LOG:  duration: 1071.572 ms  execute S_25: select * from  getstorage_domains_by_storagepoolid($1, $2, $3)
DETAIL:  parameters: $1 = '4a90a284-adbb-465c-bffd-e1703b2c5a66', $2 = NULL, $3 = 'f'
LOG:  duration: 1262.678 ms  execute S_25: select * from  getstorage_domains_by_storagepoolid($1, $2, $3)
DETAIL:  parameters: $1 = '4a90a284-adbb-465c-bffd-e1703b2c5a66', $2 = NULL, $3 = 'f'
LOG:  duration: 1680.993 ms  execute S_27: select * from  getstorage_domains_by_storagepoolid($1, $2, $3)
DETAIL:  parameters: $1 = '4a90a284-adbb-465c-bffd-e1703b2c5a66', $2 = NULL, $3 = 'f'
LOG:  duration: 1221.055 ms  execute S_24: select * from  getstorage_domains_by_storagepoolid($1, $2, $3)
DETAIL:  parameters: $1 = '4a90a284-adbb-465c-bffd-e1703b2c5a66', $2 = NULL, $3 = 'f'
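
For illustration, an individual call can be timed by hand from psql with EXPLAIN ANALYZE; the UUID below is taken from the log above, and the (vm_guid, user_id, is_filtered) parameter meanings are assumed from the $1/$2/$3 values shown:

  -- time one execution of the stored function (a sketch, not the engine's own code path)
  EXPLAIN ANALYZE
  SELECT * FROM getdisksvmguid('4c67e140-55ad-4371-a1c1-dd0f80ae5623', NULL, false);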



TOP - on Remote DB Machine:
********************************
top - 15:11:26 up 5 days,  5:04,  4 users,  load average: 16.58, 16.64, 16.89
Tasks: 369 total,  16 running, 353 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.5%us,  0.1%sy,  0.0%ni,  0.3%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  32870284k total,  3148328k used, 29721956k free,   154756k buffers
Swap: 16506872k total,        0k used, 16506872k free,   887716k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                      
27399 postgres  20   0  255m  64m  25m R 90.7  0.2  24:33.73 postmaster                                                                                                   
27405 postgres  20   0  239m  56m  26m R 88.7  0.2  19:18.91 postmaster                                                                                                   
27390 postgres  20   0  255m  73m  28m R 81.5  0.2  18:56.06 postmaster                                                                                                   
27394 postgres  20   0  239m  54m  25m R 80.8  0.2  22:02.85 postmaster                                                                                                   
27402 postgres  20   0  256m  72m  26m R 80.8  0.2  23:10.96 postmaster                                                                                                   
27408 postgres  20   0  255m  72m  26m R 80.8  0.2  22:51.95 postmaster                                                                                                   
27415 postgres  20   0  256m  72m  26m R 80.1  0.2  21:09.54 postmaster                                                                                                   
27395 postgres  20   0  256m  72m  26m R 78.8  0.2  22:50.98 postmaster                                                                                                   
27409 postgres  20   0  239m  55m  26m R 77.8  0.2  16:50.49 postmaster                                                                                                   
27398 postgres  20   0  240m  56m  26m R 77.5  0.2  23:00.27 postmaster                                                                                                   
27411 postgres  20   0  254m  70m  25m R 75.8  0.2  22:34.93 postmaster                                                                                                   
27416 postgres  20   0  252m  68m  26m R 73.2  0.2  21:04.19 postmaster                                                                                                   
27417 postgres  20   0  235m  50m  25m R 70.2  0.2  16:26.87 postmaster                                                                                                   
27400 postgres  20   0  252m  68m  26m S 53.3  0.2  19:51.41 postmaster                                                                                                   
27410 postgres  20   0  249m  62m  25m R 48.7  0.2   4:43.03 postmaster                                                                                                   
27413 postgres  20   0  250m  65m  26m R 38.7  0.2  24:23.52 postmaster                                                                                                   
27406 postgres  20   0  255m  68m  25m S 15.6  0.2  19:50.49 postmaster
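
To tie the busy postmaster PIDs above back to the statements they are running, pg_stat_activity can be consulted (a sketch; on PostgreSQL releases before 9.2, such as the 8.4 shipped with RHEL 6, the columns are procpid and current_query rather than pid and query):

  -- list active backends and their statements (pre-9.2 column names)
  SELECT procpid, usename, current_query
  FROM pg_stat_activity
  WHERE current_query <> '<IDLE>'
  ORDER BY procpid;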

Comment 1 Omri Hochman 2012-11-25 15:34:36 UTC
Created attachment 651554 [details]
engine.log

Comment 2 mkublin 2012-11-26 12:54:19 UTC
I have a patch that was tested on that environment; we checked it with Omri and it shows good results.

Comment 3 mkublin 2012-11-27 07:39:02 UTC
Patch posted upstream: http://gerrit.ovirt.org/#/c/9468/

Comment 4 mkublin 2012-11-28 07:40:37 UTC
Merged upstream

Comment 6 mkublin 2012-12-12 07:34:13 UTC
I can easily backport these changes to the downstream 3.0.x and 3.1.x versions.

Comment 12 Cheryn Tan 2013-04-03 06:51:16 UTC
This bug is currently attached to errata RHEA-2013:14491. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.

* Consequence: What happens when the bug presents.

* Fix: What was done to fix the bug.

* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.

Comment 13 vvyazmin@redhat.com 2013-04-19 05:14:37 UTC
No issues were found.

Verified on RHEVM 3.2 - SF13.1 environment:

RHEVM: rhevm-3.2.0-10.19.beta2.el6ev.noarch
VDSM: vdsm-4.10.2-15.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.3.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.2.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64


Tested on an environment with 800 VMs and 52 hosts (50 of them fake hosts) on FC and iSCSI Data Centers.

Comment 14 errata-xmlrpc 2013-06-10 21:23:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html