Bug 1666610 - [scale] RHV-M causes high load on PostgreSQL database
Summary: [scale] RHV-M causes high load on PostgreSQL database
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Michal Skrivanek
QA Contact: Daniel Gur
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-16 08:14 UTC by Roman Hodain
Modified: 2023-09-07 19:39 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-16 14:13:25 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Article) 3641742 0 None None None 2020-03-06 20:40:14 UTC

Description Roman Hodain 2019-01-16 08:14:29 UTC
Description of problem:
The engine causes a high load on the system; the load may reach 40, driven by the database. It is not clear what is causing this issue.

Version-Release number of selected component (if applicable):
4.1.11

How reproducible:
Big environment with thousands of VMs

Steps to Reproduce:
Not clear

Actual results:
The system is overloaded and the engine stops working.

Expected results:
The load is kept under control.

Comment 10 Michal Skrivanek 2019-01-16 13:30:36 UTC
There seem to be far too many issues with the pools, and the restart attempt could be heavy. It is happening several times a second, like the excerpt below, just repeated for various pools:

2019-01-15 02:53:14,347+01 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] VmPool 'f14b40a4-ecfc-4196-8788-c96762c18b8e' is missing 1 prestarted VMs, attempting to prestart 1 VMs
2019-01-15 02:53:14,348+01 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] Failed to prestart any VMs for VmPool 'f14b40a4-ecfc-4196-8788-c96762c18b8e'
2019-01-15 02:53:14,348+01 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] No VMs available for prestarting

Comment 11 Michal Skrivanek 2019-01-16 13:33:17 UTC
Roman, the messages do not seem to be at debug level, so I can't see why it is failing. Was the debug level set correctly?

Comment 12 Roy Golan 2019-01-17 08:03:21 UTC
A possible suspect is the DWH; according to pg_stat_activity, I see it takes a lot of time to complete, probably consuming a lot of CPU. The rest of the queries look like they complete well under a second.

If you can completely stop the DWH and see how the system reacts, that would be good.
A vacuum will also help; it doesn't have to be a full vacuum, since reclaiming disk space isn't our problem, I guess.
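
A minimal sketch of the suggested checks, assuming they are run against the engine database and that only a plain (non-FULL) vacuum is wanted:

    -- Plain vacuum plus fresh planner statistics; unlike VACUUM FULL this does
    -- not take exclusive locks or rewrite the tables.
    VACUUM ANALYZE;

    -- Long-running statements as seen in pg_stat_activity (referenced above).
    SELECT pid, now() - query_start AS runtime, state, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC;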

Comment 14 Roman Hodain 2019-01-17 14:23:09 UTC
(In reply to Michal Skrivanek from comment #10)
> There seem to be far too many issues with the pools, and the restart attempt
> could be heavy. It is happening several times a second, like the excerpt
> below, just repeated for various pools:
> 
> 2019-01-15 02:53:14,347+01 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] VmPool 'f14b40a4-ecfc-4196-8788-c96762c18b8e' is missing 1 prestarted VMs, attempting to prestart 1 VMs
> 2019-01-15 02:53:14,348+01 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] Failed to prestart any VMs for VmPool 'f14b40a4-ecfc-4196-8788-c96762c18b8e'
> 2019-01-15 02:53:14,348+01 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] No VMs available for prestarting

I do not know why the VMs are not prestarted, but I expect that some of the VMs are not in status Down. There is no logging in that area; we just did not find any VM that can be started. We use this to get the VMs:
    // Only pool VMs in status Down that pass the "non-prestarted and free"
    // check are considered; if nothing matches, the stream is simply empty
    // and nothing is logged.
    public Stream<Guid> selectNonPrestartedVms(Guid vmPoolId, ErrorProcessor errorProcessor) {
        return selectVms(vmPoolId,
                VMStatus.Down,
                vmId -> isNonPrestartedVmFree(vmId, errorProcessor),
                true);
    }

If there is no VM in the Down state, then we do not log anything.
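
A way to cross-check this from the database side is sketched below; it assumes the standard engine schema (vm_pool_map joined to vm_dynamic) and that status 0 corresponds to Down, so treat it as a diagnostic sketch rather than a verified query:

    -- Hedged sketch: VM status distribution for the pool from the log above,
    -- assuming vm_pool_map/vm_dynamic exist as in the standard engine schema
    -- and that status 0 = Down.
    SELECT d.status, count(*) AS vms
    FROM vm_pool_map p
    JOIN vm_dynamic d ON d.vm_guid = p.vm_guid
    WHERE p.vm_pool_id = 'f14b40a4-ecfc-4196-8788-c96762c18b8e'
    GROUP BY d.status
    ORDER BY vms DESC;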

Anyway, this happens even when the load is about 1, so it is most probably not related.

The DEBUG logs were enabled from 2019-01-14 14:02:02 till the end of the file.

Comment 16 Roman Hodain 2019-01-17 14:44:47 UTC
(In reply to Roy Golan from comment #12)
> A possible suspect is the DWH; according to pg_stat_activity, I see it takes
> a lot of time to complete, probably consuming a lot of CPU. The rest of the
> queries look like they complete well under a second.
> 
> If you can completely stop the DWH and see how the system reacts, that would
> be good. A vacuum will also help; it doesn't have to be a full vacuum, since
> reclaiming disk space isn't our problem, I guess.

Yes, I can ask the customer to disable the DWH. It should not be a problem, but I am not sure if that helps. The selects finish in under one second, but the number of selects is very high. Here is the list of statements that happened in one minute; I do not paste SQL that occurred fewer than 100 times in that minute. The first number is the number of occurrences:

    106  select * from public.updatenumanodestatistics(CAST ($1 AS uuid),CAST ($2 AS int8),CAST ($3 AS int4),CAST ($4 AS numeric),CAST ($5 AS numeric),CAST ($6 AS numeric),CAST ($7 AS int4)) as result
    265  select * from  getstoragedomainidsbystoragepoolidandstatus($1, $2)
    269  select * from public.updatevds_interface_statistics(CAST ($1 AS uuid),CAST ($2 AS numeric),CAST ($3 AS numeric),CAST ($4 AS int8),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS int8),CAST ($9 AS int8),CAST ($10 AS int4),CAST ($11 AS float8),CAST ($12 AS uuid)) as result
    271  select * from  getvdsbyvdsid($1, $2, $3)
    286  select * from  getqosbyqosid($1)
    392  select * from  getvdsstaticbyvdsid($1)
    443  select * from  getvmdynamicbyvmguid($1)
    445  select * from ismemballoonenabled($1,$2) as result
    445  SELECT n.nspname,p.proname,p.prorettype,p.proargtypes, t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes  , p.oid  FROM pg_catalog.pg_proc p, pg_catalog.pg_namespace n, pg_catalog.pg_type t  WHERE p.pronamespace=n.oid AND p.prorettype=t.oid  AND n.nspname LIKE 'public' AND p.proname LIKE 'ismemballoonenabled' ORDER BY n.nspname, p.proname, p.oid::text 
    445  SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS, 2 AS PROCEDURE_TYPE , p.proname || '_' || p.oid AS SPECIFIC_NAME  FROM pg_catalog.pg_namespace n, pg_catalog.pg_proc p  LEFT JOIN pg_catalog.pg_description d ON (p.oid=d.objoid)  LEFT JOIN pg_catalog.pg_class c ON (d.classoid=c.oid AND c.relname='pg_proc')  LEFT JOIN pg_catalog.pg_namespace pn ON (c.relnamespace=pn.oid AND pn.nspname='pg_catalog')  WHERE p.pronamespace=n.oid  AND n.nspname LIKE 'public' AND p.proname LIKE 'ismemballoonenabled' ORDER BY PROCEDURE_SCHEM, PROCEDURE_NAME, p.oid::text 
    446  select * from  getdiskvmelementsforvm($1, $2, $3)
    446  select * from  getsnapshotbyvmidandtype($1, $2, $3, $4)
    450  select * from  getsnapshotbyleafguid($1)
    451  select * from  getdbgeneration($1)
    451  select * from  getvmbyvmguid($1, $2, $3)
    454  select * from  getanyadminrolebyuseridandgroupids($1, $2, $3)
    457  select * from  getdisksvmguid($1, $2, $3, $4)
    463  select * from  getdiskvmelementbydiskvmelementid($1, $2, $3, $4)
    504  select * from  getclusterbyclusterid($1, $2, $3)
    892  select * from  getvmdevicebyvmidtypeanddevice($1, $2, $3, $4, $5)
    909  select * from  getvmnetworkinterfaceviewbyvmid($1, $2, $3)
   1328  select * from  getvmdevicebyvmidandtype($1, $2)
   2027  select * from  checkifsnapshotexistsbyvmidandtype($1, $2)
   2155  select * from  getusersbyvmguid($1)
   2740  select * from public.updatevmdynamic(CAST ($1 AS text),CAST ($2 AS varchar),CAST ($3 AS varchar),CAST ($4 AS uuid),CAST ($5 AS varchar),CAST ($6 AS uuid),CAST ($7 AS uuid),CAST ($8 AS int4),CAST ($9 AS uuid),CAST ($10 AS varchar),CAST ($11 AS varchar),CAST ($12 AS varchar),CAST ($13 AS timestamptz),CAST ($14 AS timestamptz),CAST ($15 AS bool),CAST ($16 AS int4),CAST ($17 AS int4),CAST ($18 AS int4),CAST ($19 AS varchar),CAST ($20 AS int4),CAST ($21 AS int4),CAST ($22 AS int4),CAST ($23 AS varchar),CAST ($24 AS int4),CAST ($25 AS numeric),CAST ($26 AS varchar),CAST ($27 AS bool),CAST ($28 AS varchar),CAST ($29 AS varchar),CAST ($30 AS varchar),CAST ($31 AS varchar),CAST ($32 AS int4),CAST ($33 AS int4),CAST ($34 AS int4),CAST ($35 AS int4),CAST ($36 AS varchar),CAST ($37 AS int4),CAST ($38 AS varchar),CAST ($39 AS int4),CAST ($40 AS int4),CAST ($41 AS varchar),CAST ($42 AS int4),CAST ($43 AS varchar),CAST ($44 AS varchar),CAST ($45 AS varchar),CAST ($46 AS varchar),CAST ($47 AS varchar),CAST ($48 AS text)) as result
   4612  select * from  getvmnetworkinterfacetomonitorbyvmid($1)
   4613  select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result
   4613  select * from public.updatevm_interface_statistics(CAST ($1 AS uuid),CAST ($2 AS numeric),CAST ($3 AS numeric),CAST ($4 AS int8),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS int8),CAST ($9 AS int8),CAST ($10 AS int4),CAST ($11 AS float8),CAST ($12 AS uuid)) as result
   4794  select * from public.updatevmstatistics(CAST ($1 AS numeric),CAST ($2 AS numeric),CAST ($3 AS numeric),CAST ($4 AS int4),CAST ($5 AS int4),CAST ($6 AS int4),CAST ($7 AS text),CAST ($8 AS uuid),CAST ($9 AS int8),CAST ($10 AS int8)) as result
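
For the record, a hedged sketch of getting this kind of per-statement breakdown directly from the database, assuming the pg_stat_statements extension is available (it is not loaded by default and needs to be added to shared_preload_libraries):

    -- Requires pg_stat_statements in shared_preload_libraries and the
    -- extension created in the engine database.
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- Top statements by call count since the counters were last reset.
    SELECT calls, round(total_time::numeric, 1) AS total_ms, query
    FROM pg_stat_statements
    ORDER BY calls DESC
    LIMIT 20;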

Comment 17 Roy Golan 2019-01-17 14:58:36 UTC
How many hosts, by the way? Consider increasing the PostgreSQL connections and the engine pool accordingly. We have seen that cause issues many times (along with the DWH hogging the DB).
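
A quick, hedged way to check whether the connection limit is actually being exhausted, using only standard PostgreSQL views (no engine-specific assumptions):

    -- Configured limit versus connections currently in use.
    SHOW max_connections;
    SELECT count(*) AS connections_in_use FROM pg_stat_activity;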

Comment 18 Ryan Barry 2019-01-21 14:53:56 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

Comment 20 Roman Hodain 2019-01-22 13:20:22 UTC
(In reply to Roy Golan from comment #17)
> How many hosts, by the way? Consider increasing the PostgreSQL connections
> and the engine pool accordingly. We have seen that cause issues many times
> (along with the DWH hogging the DB).

They have only 31 hosts. That is not that many.

I can also confirm that the DWH does not have an impact on the behaviour. It was enabled only for a short period of time. The issue happens regardless of whether the DWH is running or not.

Comment 26 Roy Golan 2019-04-22 06:41:57 UTC
(In reply to mlehrer from comment #24)

Summary of the facts pointed out by Roman and Mordechai:
1. 31 hosts
2. DWH wasn't running
3. happens on 4.2 as well.
4. engine is running on an ESX VM, 4 cores (threaded?) 40GB
5. 3600 VMs, in several vm pools (how many?)

We need to find a reproducer using a hosted engine with regular settings and no config modification (4 cores, 16 GB, default connection pool on the DB and the engine), DWH disabled, and the same number of VMs and hosts.

According to Roman's findings in comment 16, this is the list of queries the engine produces more than 1,000 times in 1 minute: roughly 26,000 selects per minute, or ~450 selects per second.

   1328  select * from   getvmdevicebyvmidandtype
   2027  select * from   checkifsnapshotexistsbyvmidandtype
   2155  select * from   getusersbyvmguid
   2740  select * from   public.updatevmdynamic
   4612  select * from   getvmnetworkinterfacetomonitorbyvmid
   4613  select * from   public.updatedisk_image_dynamic_by_disk_id_and_vm_id
   4613  select * from   public.updatevm_interface_statistics
   4794  select * from   public.updatevmstatistics

The highest number of calls, ~18,000, are UPDATEs. I wonder why there is an update for both vm_dynamic and vm_statistics; also, disk_image_dynamic has far too many updates (why?).
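
A hedged way to confirm which tables actually take the bulk of those updates, using the standard statistics views:

    -- Tables ordered by row updates since the statistics were last reset;
    -- this should show whether vm_dynamic, vm_statistics and
    -- disk_image_dynamic really dominate the write load.
    SELECT relname, n_tup_upd, n_tup_hot_upd, n_live_tup
    FROM pg_stat_user_tables
    ORDER BY n_tup_upd DESC
    LIMIT 10;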

Comment 29 Daniel Gur 2019-08-28 13:12:40 UTC
sync2jira

Comment 30 Daniel Gur 2019-08-28 13:16:53 UTC
sync2jira

