Description of problem: Engine causes high load on the system. It may reach load 40 caused by the database. It is not clear what is causing this issue. Version-Release number of selected component (if applicable): 4.1.11 How reproducible: Big environment with thousands of VMs Steps to Reproduce: Not clear Actual results: The system is overloaded and the engine stops working. Expected results: The laod is kept under control
There seems to be way too many issues with pool, and the restart attempt could be heavy. It's happening several times a second. Like the below, just repeated for various pools 2019-01-15 02:53:14,347+01 INFO [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] VmPool 'f14b40a4-ecfc-4196-8788-c96762c18b8e' is missin g 1 prestarted VMs, attempting to prestart 1 VMs 2019-01-15 02:53:14,348+01 WARN [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] Failed to prestart any VMs for VmPool 'f14b40a4-ecfc-41 96-8788-c96762c18b8e' 2019-01-15 02:53:14,348+01 INFO [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler4) [5b19d260] No VMs available for prestarting
Roman, the messages do not seem to be in debug, I can't see why is it failing. Was the debug level set correctly?
A possible suspect can be the DWH, according to the pg_stat activity I see it takes a lot of time to complete, probably taking a lot of CPU to complete. The rest of the queries do look like the complete well under a second. If you can completely stop DWH and see how the system react that will be good. Also a vacuum will help, donesn't have to be a full vacuum, reclaiming disk space isn't our problem I guess.
(In reply to Michal Skrivanek from comment #10) > There seems to be way too many issues with pool, and the restart attempt > could be heavy. It's happening several times a second. Like the below, just > repeated for various pools > > 2019-01-15 02:53:14,347+01 INFO [org.ovirt.engine.core.bll.VmPoolMonitor] > (DefaultQuartzScheduler4) [5b19d260] VmPool > 'f14b40a4-ecfc-4196-8788-c96762c18b8e' is missin > g 1 prestarted VMs, attempting to prestart 1 VMs > 2019-01-15 02:53:14,348+01 WARN [org.ovirt.engine.core.bll.VmPoolMonitor] > (DefaultQuartzScheduler4) [5b19d260] Failed to prestart any VMs for VmPool > 'f14b40a4-ecfc-41 > 96-8788-c96762c18b8e' > 2019-01-15 02:53:14,348+01 INFO [org.ovirt.engine.core.bll.VmPoolMonitor] > (DefaultQuartzScheduler4) [5b19d260] No VMs available for prestarting I do not know why the VMs are not prestarted, but I expect that some of the VMs are not is status down. There is no logging in that area. We just did not find any VM that can be started. We use this to get the VMs: public Stream<Guid> selectNonPrestartedVms(Guid vmPoolId, ErrorProcessor errorProcessor) { return selectVms(vmPoolId, VMStatus.Down, vmId -> isNonPrestartedVmFree(vmId, errorProcessor), true); } If there is not VM in down state then we do not print anything. Anyway this happens even when the load is about 1, so it is most probably not related. The DEBUG logs were enabled from 2019-01-14 14:02:02 till the end of the file.
(In reply to Roy Golan from comment #12) > A possible suspect can be the DWH, according to the pg_stat activity I see > it takes a lot of time to complete, probably taking a lot of CPU > to complete. The rest of the queries do look like the complete well under a > second. > > If you can completely stop DWH and see how the system react that will be > good. > Also a vacuum will help, donesn't have to be a full vacuum, reclaiming disk > space isn't our problem I guess. Yes, I can ask the customer to disable the DWH.. It should not be a problem, bu I am not sure if that helps. The selects end under one sec, but the amount of selects is very high. HEre is the list of selects that happened in one minute. I do not paste sql that are under 100 per second. The first number is the number of occurrences: 106 select * from public.updatenumanodestatistics(CAST ($1 AS uuid),CAST ($2 AS int8),CAST ($3 AS int4),CAST ($4 AS numeric),CAST ($5 AS numeric),CAST ($6 AS numeric),CAST ($7 AS int4)) as result 265 select * from getstoragedomainidsbystoragepoolidandstatus($1, $2) 269 select * from public.updatevds_interface_statistics(CAST ($1 AS uuid),CAST ($2 AS numeric),CAST ($3 AS numeric),CAST ($4 AS int8),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS int8),CAST ($9 AS int8),CAST ($10 AS int4),CAST ($11 AS float8),CAST ($12 AS uuid)) as result 271 select * from getvdsbyvdsid($1, $2, $3) 286 select * from getqosbyqosid($1) 392 select * from getvdsstaticbyvdsid($1) 443 select * from getvmdynamicbyvmguid($1) 445 select * from ismemballoonenabled($1,$2) as result 445 SELECT n.nspname,p.proname,p.prorettype,p.proargtypes, t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes , p.oid FROM pg_catalog.pg_proc p, pg_catalog.pg_namespace n, pg_catalog.pg_type t WHERE p.pronamespace=n.oid AND p.prorettype=t.oid AND n.nspname LIKE 'public' AND p.proname LIKE 'ismemballoonenabled' ORDER BY n.nspname, p.proname, p.oid::text 445 SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS, 2 AS PROCEDURE_TYPE , p.proname || '_' || p.oid AS SPECIFIC_NAME FROM pg_catalog.pg_namespace n, pg_catalog.pg_proc p LEFT JOIN pg_catalog.pg_description d ON (p.oid=d.objoid) LEFT JOIN pg_catalog.pg_class c ON (d.classoid=c.oid AND c.relname='pg_proc') LEFT JOIN pg_catalog.pg_namespace pn ON (c.relnamespace=pn.oid AND pn.nspname='pg_catalog') WHERE p.pronamespace=n.oid AND n.nspname LIKE 'public' AND p.proname LIKE 'ismemballoonenabled' ORDER BY PROCEDURE_SCHEM, PROCEDURE_NAME, p.oid::text 446 select * from getdiskvmelementsforvm($1, $2, $3) 446 select * from getsnapshotbyvmidandtype($1, $2, $3, $4) 450 select * from getsnapshotbyleafguid($1) 451 select * from getdbgeneration($1) 451 select * from getvmbyvmguid($1, $2, $3) 454 select * from getanyadminrolebyuseridandgroupids($1, $2, $3) 457 select * from getdisksvmguid($1, $2, $3, $4) 463 select * from getdiskvmelementbydiskvmelementid($1, $2, $3, $4) 504 select * from getclusterbyclusterid($1, $2, $3) 892 select * from getvmdevicebyvmidtypeanddevice($1, $2, $3, $4, $5) 909 select * from getvmnetworkinterfaceviewbyvmid($1, $2, $3) 1328 select * from getvmdevicebyvmidandtype($1, $2) 2027 select * from checkifsnapshotexistsbyvmidandtype($1, $2) 2155 select * from getusersbyvmguid($1) 2740 select * from public.updatevmdynamic(CAST ($1 AS text),CAST ($2 AS varchar),CAST ($3 AS varchar),CAST ($4 AS uuid),CAST ($5 AS varchar),CAST ($6 AS uuid),CAST ($7 AS uuid),CAST ($8 AS int4),CAST ($9 AS uuid),CAST ($10 AS varchar),CAST ($11 AS varchar),CAST ($12 AS varchar),CAST ($13 AS timestamptz),CAST ($14 AS timestamptz),CAST ($15 AS bool),CAST ($16 AS int4),CAST ($17 AS int4),CAST ($18 AS int4),CAST ($19 AS varchar),CAST ($20 AS int4),CAST ($21 AS int4),CAST ($22 AS int4),CAST ($23 AS varchar),CAST ($24 AS int4),CAST ($25 AS numeric),CAST ($26 AS varchar),CAST ($27 AS bool),CAST ($28 AS varchar),CAST ($29 AS varchar),CAST ($30 AS varchar),CAST ($31 AS varchar),CAST ($32 AS int4),CAST ($33 AS int4),CAST ($34 AS int4),CAST ($35 AS int4),CAST ($36 AS varchar),CAST ($37 AS int4),CAST ($38 AS varchar),CAST ($39 AS int4),CAST ($40 AS int4),CAST ($41 AS varchar),CAST ($42 AS int4),CAST ($43 AS varchar),CAST ($44 AS varchar),CAST ($45 AS varchar),CAST ($46 AS varchar),CAST ($47 AS varchar),CAST ($48 AS text)) as result 4612 select * from getvmnetworkinterfacetomonitorbyvmid($1) 4613 select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result 4613 select * from public.updatevm_interface_statistics(CAST ($1 AS uuid),CAST ($2 AS numeric),CAST ($3 AS numeric),CAST ($4 AS int8),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS int8),CAST ($9 AS int8),CAST ($10 AS int4),CAST ($11 AS float8),CAST ($12 AS uuid)) as result 4794 select * from public.updatevmstatistics(CAST ($1 AS numeric),CAST ($2 AS numeric),CAST ($3 AS numeric),CAST ($4 AS int4),CAST ($5 AS int4),CAST ($6 AS int4),CAST ($7 AS text),CAST ($8 AS uuid),CAST ($9 AS int8),CAST ($10 AS int8)) as result
How many hosts btw? concider increasing the postgres connections and the engine pool accordingly. We seen that cause issues many times (along with DWH hogging the db)
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both
(In reply to Roy Golan from comment #17) > How many hosts btw? concider increasing the postgres connections and the > engine pool accordingly. We seen that cause issues many times (along with > DWH hogging the db) The have on 31 hosts. Tath is not that many. I can also confirm that the DWH does not have impact on the behaviour. It was enabled only for short period of time. The issue happens doe snot matter if DWH is running or not.
(In reply to mlehrer from comment #24) Summary of the facts pointed out by Roman and Mordechai 1. 31 hosts 2. DWH wasn't running 3. happens on 4.2 as well. 4. engine is running on an ESX VM, 4 cores (threaded?) 40GB 5. 3600 VMs, in several vm pools (how many?) We need to find a reproducer using hosted engine, regular settings with no config modification(4cores 16GB, default connection pool on db and engine), disable DWH , and same amount of vms and hosts. According to Roman's finding on comment 16, this is the list of > 1000 queries the engine produces in 1 minute, ~26,000 selects/minute ~450 selects/second 1328 select * from getvmdevicebyvmidandtype 2027 select * from checkifsnapshotexistsbyvmidandtype 2155 select * from getusersbyvmguid 2740 select * from public.updatevmdynamic 4612 select * from getvmnetworkinterfacetomonitorbyvmid 4613 select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id 4613 select * from public.updatevm_interface_statistics 4794 select * from public.updatevmstatistics The highest amount of queries, ~18,000, are UPDATES. I wonder why there is an update for both vm_dynamic and vm_statistics. also image_dynamic has way too much updates (why?)
sync2jira