Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1766815

Summary: [Tracker] Slowdown after upgrading from 4.1 to 4.2
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine
Assignee: Nobody <nobody>
Status: CLOSED CURRENTRELEASE
QA Contact: Michael Burman <mburman>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2.8-4
CC: dholler, mburman, michal.skrivanek, mlehrer, mperina, pelauter, rgolan
Target Milestone: ovirt-4.5.0
Keywords: Performance, Tracking
Target Release: 4.5.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-20 09:47:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1754363, 1769463, 1790746, 1811866
Bug Blocks:

Description Germano Veit Michel 2019-10-30 00:39:44 UTC
Description of problem:

After upgrading from RHV 4.1 to 4.2, a customer with a large environment reports a significant slowdown in the Admin Portal, particularly when checking the Hosts tab.

Upon checking postgres, there are always several queries being executed, most of them related to networking.
In addition to the problem described in BZ1754363, which shows up even in small environments, there seem to be
more problems in 4.2.

Also, on every engine thread dump there are always a few of these executing and waiting on postgres:

        at org.ovirt.engine.core.vdsbroker.NetworkImplementationDetailsUtils.build(NetworkImplementationDetailsUtils.java:103)
        at org.ovirt.engine.core.vdsbroker.NetworkImplementationDetailsUtils.calculateNetworkImplementationDetails(NetworkImplementationDetailsUtils.java:69)
        at org.ovirt.engine.core.bll.network.host.HostNicsUtil.findHostNics(HostNicsUtil.java:93)
        at org.ovirt.engine.core.bll.network.host.HostNicsUtil.findHostNics(HostNicsUtil.java:47)
        at org.ovirt.engine.core.bll.network.host.GetVdsInterfacesByVdsIdQuery.executeQueryCommand(GetVdsInterfacesByVdsIdQuery.java:24)

As time goes by, the function progresses and does something different, but it is always executing some postgres query. For example, further up the stack:

Sometimes it is doing this query:
        at org.ovirt.engine.core.dao.network.NetworkAttachmentDaoImpl.getNetworkAttachmentByNicIdAndNetworkId(NetworkAttachmentDaoImpl.java:96)   [1]

Or this:
        at org.ovirt.engine.core.dao.VdsDynamicDaoImpl.get(VdsDynamicDaoImpl.java:135) [2]

Or this:
        at org.ovirt.engine.core.dao.ClusterDaoImpl.get(ClusterDaoImpl.java:47) [3]

Or:
        at org.ovirt.engine.core.dao.network.InterfaceDaoImpl.get(InterfaceDaoImpl.java:243)  [5]
        at org.ovirt.engine.core.vdsbroker.CalculateBaseNic.getNicByName(CalculateBaseNic.java:

This repeats several times on every thread dump. There are always a few workers with getNetworkAttachmentByNicIdAndNetworkId down the stack.

The customer complains about slow UI on some tabs (e.g. the Hosts tab), and the stacks look like this:
        Some of the SQL above, [1], [2], [3], [4] or [5]
        [...]
        at org.ovirt.engine.core.vdsbroker.NetworkImplementationDetailsUtils.build(NetworkImplementationDetailsUtils.java:88)
        at org.ovirt.engine.core.vdsbroker.NetworkImplementationDetailsUtils.build(NetworkImplementationDetailsUtils.java:103)
        at org.ovirt.engine.core.vdsbroker.NetworkImplementationDetailsUtils.calculateNetworkImplementationDetails(NetworkImplementationDetailsUtils.java:69)
        at org.ovirt.engine.core.bll.network.host.HostNicsUtil.findHostNics(HostNicsUtil.java:93)
        at org.ovirt.engine.core.bll.network.host.HostNicsUtil.findHostNics(HostNicsUtil.java:47)
        at org.ovirt.engine.core.bll.network.host.GetVdsInterfacesByVdsIdQuery.executeQueryCommand(GetVdsInterfacesByVdsIdQuery.java:24)
        [...]
        at org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl.runQuery(GenericApiGWTServiceImpl.java:90)

One interesting thing is that vds_interface_view has 12k to 14k entries, depending on the environment. This comes from the size of the environments: they have between 136 and 150 hosts, with roughly 100 networks attached to each host. So I assume this is probably related to the thread dumps above, as those functions read this table.
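
For reference, a rough way to confirm that distribution is to count rows per host in that view. This is only a sketch (run against the engine DB, e.g. via engine-psql.sh), not output from the customer:

-- Sketch: interface/network rows per host behind the Hosts tab queries.
SELECT vds_id, count(*) AS rows_per_host
FROM vds_interface_view
GROUP BY vds_id
ORDER BY rows_per_host DESC
LIMIT 10;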

Version-Release number of selected component (if applicable):
4.2.8

How reproducible:
Unknown

Steps to Reproduce:
1. 100+ hosts
2. 100+ networks
3.

Actual results:
Slowdown

Expected results:
Similar performance to 4.1

Additional info:
- No signs of performance contention on RHV-M. Memory usage is around 65%, CPUs are 80% idle, iowait is 0.1%, %steal is 0. Several environments running on multiple platforms (VMware, KVM); they all present the problem after upgrading to 4.2.
- Mostly 16 vCPUs + 16GB of RAM.
- RHV 4.1 was fine even with 8 vCPUs.

Comment 18 Germano Veit Michel 2019-11-28 23:05:20 UTC
I think this explains the slow network-related SQLs; look at this:

We manually ran the query behind GetVdsInterfacesByNetworkId(), which we have seen slowing down VM start, migration and more several times:

EXPLAIN ANALYZE SELECT vds_interface_view.* FROM vds_interface_view INNER JOIN vds ON vds.vds_id = vds_interface_view.vds_id INNER JOIN network_cluster ON network_cluster.cluster_id = vds.cluster_id INNER JOIN network ON network.id = network_cluster.network_id AND network.name = vds_interface_view.network_name WHERE network.id = $network_id;

                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1684.64..539466.72 rows=1 width=439) (actual time=279.250..154050.623 rows=138 loops=1)
   Join Filter: (vds_interface.vds_id = vds_static_1.vds_id)
   Rows Removed by Join Filter: 18906
   ->  Hash Join  (cost=62.93..299.53 rows=1 width=64) (actual time=0.381..4.403 rows=138 loops=1)
         Hash Cond: (cluster.cluster_id = network_cluster.cluster_id)
         ->  Hash Left Join  (cost=58.60..294.67 rows=138 width=80) (actual time=0.325..3.851 rows=138 loops=1)
               Hash Cond: (vds_static_1.vds_id = vds_spm_id_map.vds_id)
               ->  Hash Join  (cost=53.49..287.67 rows=138 width=80) (actual time=0.278..3.382 rows=138 loops=1)
                     Hash Cond: (vds_static_1.cluster_id = cluster.cluster_id)
                     ->  Hash Join  (cost=52.33..284.61 rows=138 width=64) (actual time=0.260..2.902 rows=138 loops=1)
                           Hash Cond: (vds_statistics.vds_id = vds_static_1.vds_id)
                           ->  Seq Scan on vds_statistics  (cost=0.00..230.38 rows=138 width=16) (actual time=0.006..1.778 rows=138 loops=1)
                           ->  Hash  (cost=50.61..50.61 rows=138 width=48) (actual time=0.244..0.244 rows=138 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 19kB
                                 ->  Hash Join  (cost=15.38..50.61 rows=138 width=48) (actual time=0.112..0.225 rows=138 loops=1)
                                       Hash Cond: (vds_dynamic.vds_id = vds_static_1.vds_id)
                                       ->  Index Only Scan using pk_vds_dynamic on vds_dynamic  (cost=0.27..32.95 rows=312 width=16) (actual time=0.012..0.095 rows=138 loops=1)
                                             Heap Fetches: 21
                                       ->  Hash  (cost=13.38..13.38 rows=138 width=32) (actual time=0.089..0.089 rows=138 loops=1)
                                             Buckets: 1024  Batches: 1  Memory Usage: 17kB
                                             ->  Seq Scan on vds_static vds_static_1  (cost=0.00..13.38 rows=138 width=32) (actual time=0.007..0.070 rows=138 loops=1)
                     ->  Hash  (cost=1.07..1.07 rows=7 width=32) (actual time=0.005..0.005 rows=7 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on cluster  (cost=0.00..1.07 rows=7 width=32) (actual time=0.002..0.002 rows=7 loops=1)
               ->  Hash  (cost=3.38..3.38 rows=138 width=16) (actual time=0.037..0.037 rows=138 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 15kB
                     ->  Seq Scan on vds_spm_id_map  (cost=0.00..3.38 rows=138 width=16) (actual time=0.006..0.024 rows=138 loops=1)
         ->  Hash  (cost=4.31..4.31 rows=2 width=32) (actual time=0.037..0.037 rows=7 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Index Only Scan using pk_network_cluster on network_cluster  (cost=0.27..4.31 rows=2 width=32) (actual time=0.032..0.033 rows=7 loops=1)
                     Index Cond: (network_id = '8d174c8a-1360-420b-a400-82d7d49ebb71'::uuid)
                     Heap Fetches: 0
   ->  Hash Join  (cost=1621.71..539166.40 rows=63 width=471) (actual time=50.255..1116.246 rows=138 loops=138)
         Hash Cond: ((vds_interface.network_name)::text = (network.name)::text)
         ->  Hash Join  (cost=1613.41..539125.38 rows=8559 width=455) (actual time=39.214..1115.249 rows=12636 loops=138)
               Hash Cond: (vds_interface.vds_id = vds_static.vds_id)
               ->  Hash Join  (cost=1598.31..538992.59 rows=8559 width=392) (actual time=39.210..1112.320 rows=12636 loops=138)
                     Hash Cond: (vds_interface_statistics.id = vds_interface.id)
                     ->  Seq Scan on vds_interface_statistics  (cost=0.00..537276.59 rows=8559 width=72) (actual time=39.133..1106.252 rows=12636 loops=138)
                     ->  Hash  (cost=1440.36..1440.36 rows=12636 width=336) (actual time=9.589..9.589 rows=12636 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 1953kB
                           ->  Seq Scan on vds_interface  (cost=0.00..1440.36 rows=12636 width=336) (actual time=0.035..5.640 rows=12636 loops=1)
               ->  Hash  (cost=13.38..13.38 rows=138 width=63) (actual time=0.053..0.053 rows=138 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 21kB
                     ->  Seq Scan on vds_static  (cost=0.00..13.38 rows=138 width=63) (actual time=0.001..0.034 rows=138 loops=1)
         ->  Hash  (cost=8.29..8.29 rows=1 width=29) (actual time=0.016..0.016 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Index Scan using pk_network on network  (cost=0.27..8.29 rows=1 width=29) (actual time=0.013..0.013 rows=1 loops=1)
                     Index Cond: (id = '8d174c8a-1360-420b-a400-82d7d49ebb71'::uuid)
 Planning time: 4.336 ms
 Execution time: 154050.945 ms

The problem seems to be related to the 'statistics' table:

               ->  Hash Join  (cost=1598.31..538992.59 rows=8559 width=392) (actual time=39.210..1112.320 rows=12636 loops=138)
                     Hash Cond: (vds_interface_statistics.id = vds_interface.id)
                     ->  Seq Scan on vds_interface_statistics  (cost=0.00..537276.59 rows=8559 width=72) (actual time=39.133..1106.252 rows=12636 loops=138)
                     ->  Hash  (cost=1440.36..1440.36 rows=12636 width=336) (actual time=9.589..9.589 rows=12636 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 1953kB

The table looks bloated...

SELECT n_live_tup, n_dead_tup from pg_stat_user_tables where relname = 'vds_interface_statistics';

 n_live_tup | n_dead_tup
------------+------------
       8559 |    1555781

However, the customer was on 4.1, which was working fine before; the problem started in 4.2. Both should have similar vacuum settings, as both have Roy's patch "packaging: setup: More aggressive autovacuum configuration".

Does this ring any bells? Are there more updates on this table in 4.2 than in 4.1?
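
A sketch of how we could check that from the standard statistics counters (nothing we have run yet): sample the cumulative tuple counters twice, a few minutes apart, and diff them to estimate the update rate.

-- Sketch: cumulative insert/update/delete counters since the last stats reset.
SELECT now() AS sampled_at,
       n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'vds_interface_statistics';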

Comment 19 Germano Veit Michel 2019-11-28 23:29:02 UTC
Customer's vacuum settings look the same on 4.1 and 4.2:

autovacuum_max_workers = 6
autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.075
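
(The effective values can also be double-checked on the running database; just a convenience sketch:)

-- Sketch: show the autovacuum/vacuum-cost settings PostgreSQL is actually using.
SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'autovacuum%' OR name LIKE 'vacuum_cost%';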

Comment 20 Germano Veit Michel 2019-12-03 00:14:52 UTC
2 tables are heavily bloated, for example:

INFO:  "vds_interface_statistics": scanned 30000 of 571234 pages, containing 626 live rows and 1789412 dead rows; 626 rows in sample, 11920 estimated total rows
INFO:   "vm_interface_statistics": scanned 30000 of 141793 pages, containing 555 live rows and 1581668 dead rows; 555 rows in sample,  2623 estimated total rows

I think these may explain some very slow-running SQLs (~3min) that slow down everything.

I'll try to find out if this started in 4.2, hopefully the customer still has a 4.1 env.

Does this ring any bells in relation to any change in 4.2?

Comment 21 Germano Veit Michel 2019-12-03 00:27:14 UTC
Note that for comment #20, we already have vdsRefreshRate at 13.

Mordechai, I recall you built a similar environment with a high number of interfaces and hosts. If you still have it, can you check if the vds_interface_statistics table is getting bloated as time goes by?

Comment 22 Germano Veit Michel 2019-12-03 05:29:51 UTC
(In reply to Germano Veit Michel from comment #20)
> I'll try to find out if this started in 4.2, hopefully the customer still
> has a 4.1 env.

Big environments were already upgraded to 4.2, so we won't know this.

Comment 23 Roy Golan 2019-12-03 07:21:14 UTC
(In reply to Germano Veit Michel from comment #21)
> Note that for comment #20, we already have vdsRefreshRate at 13.
> 
> Mordechai, I recall you built a similar environment with a high number of
> interfaces and hosts. If you still have it, can you check if the
> vds_interface_statistics table is getting bloated as time goes by?

I expect the stats table to get bloated over time because we are bombarding it with reads and writes, so we
don't give the autovacuum enough time to kick in. The autovacuum workers have a work limit per table,
and when that expires they go to sleep and wake up again later.

Just so we don't lose focus, the root cause is too many reads/writes, as identified by mlehrer, which cause
massive needless reads (maybe also writes?).

To ease the situation a bit I can offer
a periodic full vacuum using the provided bin/engine-vacuum tool (please have a look).

I can think about tuning the stats table only for more aggressive autovacuum, but I don't want to do
that without some testing. More aggressive vacuum takes CPU time. If the machine has enough cores idling 
while the engine is running, we can consider this.
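
For reference, per-table autovacuum tuning would be done via storage parameters, roughly as sketched below; the values are placeholders that would need the testing mentioned above, not recommendations:

-- Sketch: trigger autovacuum on a fixed number of dead rows for the statistics
-- tables instead of a fraction of the (small) live row count. Values are placeholders.
ALTER TABLE vds_interface_statistics
  SET (autovacuum_vacuum_scale_factor = 0.0,
       autovacuum_vacuum_threshold = 5000,
       autovacuum_vacuum_cost_limit = 1000);
ALTER TABLE vm_interface_statistics
  SET (autovacuum_vacuum_scale_factor = 0.0,
       autovacuum_vacuum_threshold = 5000,
       autovacuum_vacuum_cost_limit = 1000);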

Comment 24 mlehrer 2019-12-03 08:52:11 UTC
(In reply to Germano Veit Michel from comment #21)
> Note that for comment #20, we already have vdsRefreshRate at 13.
> 
> Mordechai, I recall you built a similar environment with a high number of
> interfaces and hosts. If you still have it, can you check if the
> vds_interface_statistics table is getting bloated as time goes by?

#Including vds_interface_view numbers as we had to remove some hosts for 4.4 testing efforts

[root@rhev-green-01 ~]#  /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT count(*) from vds_interface_view"
 count 
-------
 25496
(1 row)




               relname               | n_dead_tup | n_live_tup 
-------------------------------------+------------+------------
 vds_interface_statistics            |       2469 |      25496
 disk_image_dynamic                  |        796 |       3886
 vm_interface_statistics             |        789 |       4275
 vm_statistics                       |        747 |       3764
 vm_guest_agent_interfaces           |        497 |      59080
 vm_dynamic                          |        422 |       3764
 vds_interface                       |        318 |      42027
 network_attachments                 |        176 |      24456
 vm_device                           |        133 |      61103
 snapshots                           |         72 |       3764
 permissions                         |         71 |       6853
 base_disks                          |         70 |       3885
 image_storage_domain_map            |         70 |       3890
 disk_vm_element                     |         67 |       3828
 vm_interface                        |         66 |       4300
 vm_static                           |         58 |       3804
 storage_domain_dynamic              |         57 |         16
 vm_icon_defaults                    |         50 |         50
 numa_node                           |         50 |        364
 numa_node_cpu_map                   |         48 |       1348
 storage_domains_ovf_info            |         43 |         24
 vm_init                             |         42 |         28
 dwh_history_timekeeping             |         38 |         11
 job                                 |         33 |          0
 step                                |         32 |          0
 storage_pool_iso_map                |         32 |         13
 dns_resolver_configuration          |         32 |        338
 vds_statistics                      |         27 |        337
 command_entities                    |         26 |          0
 storage_domain_static               |         24 |         16
 engine_sessions                     |         23 |          0
 external_variable                   |         23 |          1
 name_server                         |         22 |        365
 images                              |         20 |       3888
 storage_pool                        |         19 |         10
 image_transfers                     |         19 |          0
 vds_spm_id_map                      |         17 |        337


#table sizes

        table_name        | table_size 
--------------------------+------------
 audit_log                | 115 MB
 vm_ovf_generations       | 21 MB
 vds_interface            | 14 MB
 vm_device                | 12 MB
 vds_interface_statistics | 11 MB
(5 rows)


INFO:  vacuuming "public.vds_interface_statistics"
INFO:  scanned index "pk_vds_interface_statistics" to remove 1 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  scanned index "idx_vds_interface_statistics_vds_id" to remove 1 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  "vds_interface_statistics": removed 1 row versions in 1 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "pk_vds_interface_statistics" now contains 25496 row versions in 253 pages
DETAIL:  1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "idx_vds_interface_statistics_vds_id" now contains 25496 row versions in 553 pages
DETAIL:  1 index row versions were removed.
59 index pages have been deleted, 59 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "vds_interface_statistics": found 2987 removable, 21259 nonremovable row versions in 919 out of 1402 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 965030376
There were 64837 unused item pointers.
Skipped 0 pages due to buffer pins, 226 frozen pages.
0 pages are entirely empty.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s.
INFO:  analyzing "public.vds_interface_statistics"
INFO:  "vds_interface_statistics": scanned 1402 of 1402 pages, containing 25496 live rows and 0 dead rows; 25496 rows in sample, 25496 estimated total rows
VACUUM
engine=# 


engine=# SELECT n_live_tup, n_dead_tup from pg_stat_user_tables where relname = 'vm_interface_statistics';
 n_live_tup | n_dead_tup 
------------+------------
       4275 |       1340
(1 row)

engine=# 


engine=# VACUUM ANALYZE VERBOSE vm_interface_statistics ;
INFO:  vacuuming "public.vm_interface_statistics"
INFO:  index "pk_vm_interface_statistics" now contains 4275 row versions in 42 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "idx_vm_interface_statistics_vm_id" now contains 4275 row versions in 53 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "vm_interface_statistics": found 1075 removable, 4275 nonremovable row versions in 186 out of 186 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 965031104
There were 11201 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  analyzing "public.vm_interface_statistics"
INFO:  "vm_interface_statistics": scanned 186 of 186 pages, containing 4275 live rows and 0 dead rows; 4275 rows in sample, 4275 estimated total rows
VACUUM
engine=# 


To clean up the dead rows, a full vacuum is really necessary, as described in the previous comment (https://bugzilla.redhat.com/show_bug.cgi?id=1766815#c23), using bin/engine-vacuum.
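
If engine-vacuum is not convenient, the direct SQL equivalent is sketched below; note that VACUUM FULL rewrites the table under an exclusive lock, so it should run in a quiet window:

-- Sketch: reclaim the space held by dead rows in the two bloated tables.
VACUUM (FULL, VERBOSE, ANALYZE) vds_interface_statistics;
VACUUM (FULL, VERBOSE, ANALYZE) vm_interface_statistics;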

Comment 25 Germano Veit Michel 2019-12-04 00:13:41 UTC
Thanks Roy and Mordechai!

The problem is that we have already performed a full vacuum before (not just once).
The bloat comes back very quickly; see how it escalates and then stabilizes once the table is already very bloated:

$ /usr/share/ovirt-engine/bin/engine-vacuum.sh -f -v
~~~
INFO:  vacuuming "public.vds_interface_statistics"
INFO:  "vds_interface_statistics": found 505839 removable, 12633 nonremovable row versions in 532927 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 0.63s/0.52u sec elapsed 1.16 sec.
~~~

$ egrep 'vds_interface_statistics' *00.log | grep scanned | awk -F ' ' '{print $1,$2,$9,$10,$13,$14}' | column -t
vacuum-2019-11-29_17:00.log:INFO:  "vds_interface_statistics":  12633  live  361794   dead
vacuum-2019-11-29_18:00.log:INFO:  "vds_interface_statistics":  12633  live  568347   dead
vacuum-2019-11-29_19:00.log:INFO:  "vds_interface_statistics":  12633  live  1188109  dead
vacuum-2019-11-29_20:00.log:INFO:  "vds_interface_statistics":  9892   live  1454671  dead
vacuum-2019-11-29_21:00.log:INFO:  "vds_interface_statistics":  7201   live  1537761  dead
vacuum-2019-11-29_22:00.log:INFO:  "vds_interface_statistics":  6723   live  1585418  dead
vacuum-2019-11-29_23:00.log:INFO:  "vds_interface_statistics":  5455   live  1621861  dead
vacuum-2019-11-30_00:00.log:INFO:  "vds_interface_statistics":  4315   live  1645450  dead
vacuum-2019-11-30_01:00.log:INFO:  "vds_interface_statistics":  4848   live  1657870  dead
vacuum-2019-11-30_02:00.log:INFO:  "vds_interface_statistics":  4154   live  1674611  dead
vacuum-2019-11-30_03:00.log:INFO:  "vds_interface_statistics":  3337   live  1679935  dead
vacuum-2019-11-30_04:00.log:INFO:  "vds_interface_statistics":  2536   live  1699765  dead
vacuum-2019-11-30_05:00.log:INFO:  "vds_interface_statistics":  3206   live  1708461  dead
vacuum-2019-11-30_06:00.log:INFO:  "vds_interface_statistics":  2656   live  1717015  dead
vacuum-2019-11-30_07:00.log:INFO:  "vds_interface_statistics":  2881   live  1720743  dead
vacuum-2019-11-30_08:00.log:INFO:  "vds_interface_statistics":  2208   live  1728987  dead
vacuum-2019-11-30_09:00.log:INFO:  "vds_interface_statistics":  2315   live  1734696  dead
vacuum-2019-11-30_10:00.log:INFO:  "vds_interface_statistics":  2259   live  1742178  dead
vacuum-2019-11-30_11:00.log:INFO:  "vds_interface_statistics":  2165   live  1745003  dead
vacuum-2019-11-30_12:00.log:INFO:  "vds_interface_statistics":  1957   live  1751855  dead
vacuum-2019-11-30_13:00.log:INFO:  "vds_interface_statistics":  1745   live  1754312  dead
vacuum-2019-11-30_14:00.log:INFO:  "vds_interface_statistics":  2015   live  1759946  dead
vacuum-2019-11-30_15:00.log:INFO:  "vds_interface_statistics":  1917   live  1759559  dead
vacuum-2019-11-30_16:00.log:INFO:  "vds_interface_statistics":  1298   live  1764150  dead
vacuum-2019-11-30_17:00.log:INFO:  "vds_interface_statistics":  1259   live  1766954  dead
vacuum-2019-11-30_18:00.log:INFO:  "vds_interface_statistics":  1628   live  1771477  dead
vacuum-2019-11-30_19:00.log:INFO:  "vds_interface_statistics":  1353   live  1771258  dead
vacuum-2019-11-30_20:00.log:INFO:  "vds_interface_statistics":  1838   live  1770989  dead
vacuum-2019-11-30_21:00.log:INFO:  "vds_interface_statistics":  1354   live  1772730  dead
vacuum-2019-11-30_22:00.log:INFO:  "vds_interface_statistics":  888    live  1775524  dead
vacuum-2019-11-30_23:00.log:INFO:  "vds_interface_statistics":  1391   live  1778192  dead
vacuum-2019-12-01_00:00.log:INFO:  "vds_interface_statistics":  851    live  1772568  dead
vacuum-2019-12-01_01:00.log:INFO:  "vds_interface_statistics":  1208   live  1757525  dead
vacuum-2019-12-01_02:00.log:INFO:  "vds_interface_statistics":  1667   live  1761451  dead
vacuum-2019-12-01_03:00.log:INFO:  "vds_interface_statistics":  1133   live  1758314  dead
vacuum-2019-12-01_04:00.log:INFO:  "vds_interface_statistics":  880    live  1762851  dead
vacuum-2019-12-01_05:00.log:INFO:  "vds_interface_statistics":  1057   live  1764992  dead
vacuum-2019-12-01_06:00.log:INFO:  "vds_interface_statistics":  560    live  1768721  dead
vacuum-2019-12-01_07:00.log:INFO:  "vds_interface_statistics":  717    live  1768880  dead
vacuum-2019-12-01_08:00.log:INFO:  "vds_interface_statistics":  1174   live  1769432  dead
vacuum-2019-12-01_09:00.log:INFO:  "vds_interface_statistics":  917    live  1771564  dead
vacuum-2019-12-01_10:00.log:INFO:  "vds_interface_statistics":  1024   live  1771444  dead
vacuum-2019-12-01_11:00.log:INFO:  "vds_interface_statistics":  1200   live  1771819  dead
vacuum-2019-12-01_12:00.log:INFO:  "vds_interface_statistics":  814    live  1774972  dead
vacuum-2019-12-01_13:00.log:INFO:  "vds_interface_statistics":  930    live  1775267  dead
vacuum-2019-12-01_14:00.log:INFO:  "vds_interface_statistics":  632    live  1778116  dead
vacuum-2019-12-01_15:00.log:INFO:  "vds_interface_statistics":  675    live  1777556  dead
vacuum-2019-12-01_16:00.log:INFO:  "vds_interface_statistics":  563    live  1778871  dead
vacuum-2019-12-01_17:00.log:INFO:  "vds_interface_statistics":  920    live  1780616  dead
vacuum-2019-12-01_18:00.log:INFO:  "vds_interface_statistics":  797    live  1780982  dead
vacuum-2019-12-01_19:00.log:INFO:  "vds_interface_statistics":  625    live  1783784  dead
vacuum-2019-12-01_20:00.log:INFO:  "vds_interface_statistics":  329    live  1785584  dead
vacuum-2019-12-01_21:00.log:INFO:  "vds_interface_statistics":  647    live  1784801  dead
vacuum-2019-12-01_22:00.log:INFO:  "vds_interface_statistics":  1015   live  1782021  dead
vacuum-2019-12-01_23:00.log:INFO:  "vds_interface_statistics":  338    live  1786441  dead
vacuum-2019-12-02_00:00.log:INFO:  "vds_interface_statistics":  526    live  1785436  dead
vacuum-2019-12-02_01:00.log:INFO:  "vds_interface_statistics":  601    live  1763758  dead
vacuum-2019-12-02_02:00.log:INFO:  "vds_interface_statistics":  448    live  1767185  dead
vacuum-2019-12-02_03:00.log:INFO:  "vds_interface_statistics":  539    live  1767878  dead
vacuum-2019-12-02_04:00.log:INFO:  "vds_interface_statistics":  417    live  1767778  dead
vacuum-2019-12-02_05:00.log:INFO:  "vds_interface_statistics":  1125   live  1768370  dead
vacuum-2019-12-02_06:00.log:INFO:  "vds_interface_statistics":  504    live  1765809  dead
vacuum-2019-12-02_07:00.log:INFO:  "vds_interface_statistics":  735    live  1765032  dead
vacuum-2019-12-02_08:00.log:INFO:  "vds_interface_statistics":  660    live  1766448  dead
vacuum-2019-12-02_09:00.log:INFO:  "vds_interface_statistics":  322    live  1767497  dead
vacuum-2019-12-02_10:00.log:INFO:  "vds_interface_statistics":  655    live  1764297  dead
vacuum-2019-12-02_11:00.log:INFO:  "vds_interface_statistics":  881    live  1769607  dead
vacuum-2019-12-02_12:00.log:INFO:  "vds_interface_statistics":  524    live  1767738  dead
vacuum-2019-12-02_13:00.log:INFO:  "vds_interface_statistics":  383    live  1767235  dead
vacuum-2019-12-02_14:00.log:INFO:  "vds_interface_statistics":  1156   live  1772126  dead
vacuum-2019-12-02_15:00.log:INFO:  "vds_interface_statistics":  611    live  1770586  dead
vacuum-2019-12-02_16:00.log:INFO:  "vds_interface_statistics":  677    live  1771297  dead

(In reply to Roy Golan from comment #23)
> I can think about tuning the stats table only for more aggressive
> autovacuum, but I don't want to do
> that without some testing. More aggressive vacuum takes CPU time. If the
> machine has enough cores idling 
> while the engine is running, we can consider this.

I agree, but I'm afraid to tune this by myself and make it worse, as these parameters were chosen very carefully and with a lot of testing in your patch that went into 4.1.
The machine has plenty of available resources; I'm more worried about contention due to vacuum than about CPU usage.

Comment 26 Germano Veit Michel 2019-12-04 00:22:32 UTC
A few more samples from before the last exact hour, to complement comment #25.

vacuum-2019-11-29_16:30.log:INFO:  "vds_interface_statistics":  12633  live  47737   dead
vacuum-2019-11-29_16:35.log:INFO:  "vds_interface_statistics":  12633  live  103014  dead
vacuum-2019-11-29_16:40.log:INFO:  "vds_interface_statistics":  12633  live  160244  dead
vacuum-2019-11-29_16:45.log:INFO:  "vds_interface_statistics":  12633  live  206028  dead
vacuum-2019-11-29_16:50.log:INFO:  "vds_interface_statistics":  12633  live  258780  dead
vacuum-2019-11-29_16:55.log:INFO:  "vds_interface_statistics":  12633  live  316010  dead
...

Comment 27 Germano Veit Michel 2019-12-04 00:28:43 UTC
Despite the removed hosts, the dead rows in Mordechai's lab do not seem to grow anywhere near as fast as the customer's, and the customer has VdsRefreshRate set to 13.
The customer's dead rows are growing at about 166 per second just in the 30k pages of the scanning sample, probably about 3k per second in total.

There seems to be a difference between the customer's environment and the scale lab. Could API calls be responsible?

Comment 29 Germano Veit Michel 2019-12-09 00:02:42 UTC
Hi Roy,

Would really appreciate your thoughts here to provide a workaround to the customer.

Given the ugly numbers we see (comment #25 and comment #26), I think autovacuum_*_scale_factor and autovacuum_*_threshold are not to blame; with these numbers they should always evaluate to the table needing an analyze/vacuum.
Do you think raising vacuum_cost_limit from the default 200 to something like 1000 or 2000 would help?

Perhaps autovacuum is running regularly as intended but stopping too soon, so the dead tuples accumulate. Do you agree?
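
A sketch of how to check that theory from the standard statistics views (if autovacuum_count keeps rising while n_dead_tup stays high, autovacuum is running but not keeping up, e.g. throttled by the cost limit):

-- Sketch: autovacuum activity vs. remaining dead tuples.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
WHERE relname IN ('vds_interface_statistics', 'vm_interface_statistics');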

Comment 30 Roy Golan 2019-12-09 13:33:15 UTC
As a workaround and given that you say we have CPUs available, we can definitely consider this:

1. increase the workers which are reserved for autovacuum

  autovacuum_max_workers=8 (increased by 2)


2. tune the vacuum gauges of the vds_statistics table to vacuum more aggressively
  autovacuum_vacuum_cost_limit=?   (now that we have more workers sharing the cost limit it makes sense to increase this)

  I don't have a number at hand. Increase it by 2x and examine the behaviour of overall CPU usage and the dead tuple count.

3. lower the nap time
  autovacuum_naptime=30  (30 secs; a sketch of applying all three settings follows this list)
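
A sketch of applying the three settings above, assuming the engine database is PostgreSQL 9.5 as shipped with RHV 4.2 (otherwise the same values go into postgresql.conf); the cost limit value is a placeholder to be tuned as described in item 2:

-- Sketch: apply the workaround settings cluster-wide.
ALTER SYSTEM SET autovacuum_max_workers = 8;          -- needs a postmaster restart
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 400;  -- placeholder: ~2x the effective default of 200
ALTER SYSTEM SET autovacuum_naptime = '30s';
SELECT pg_reload_conf();                              -- picks up the reloadable settings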


Pass criteria:
- the engine remains serviceable, without getting sluggish on regular actions (login, monitoring...)
- the dead tuple count does not increase over time


Mordechai, Dominik also FYI

Comment 31 Germano Veit Michel 2020-01-03 01:43:12 UTC
Just a quick update.

The table bloat issue seems solved and actions like VM Start are now OK, but some other actions are still slow (e.g. the VM Migration dialog, the Hosts tab).
I've requested more data to investigate those.

Comment 32 Roy Golan 2020-01-05 07:36:37 UTC
(In reply to Germano Veit Michel from comment #31)
> Just a quick update.
> 
> The table bloat issue seems solved and actions like VM Start are now OK, but
> some other actions are still slow (i.e. VM Migration Dialogue, Hosts Tab).
> I've requested more data to investigate those.

I guess we can attribute those to slow query execution on the cluster view and its related views (network, host NICs and so on),
which is the main issue of this bug. Waiting for the data to validate this.
BTW, a simple API query to clusters should reveal this - if you can perform this just as a small test it will be an indicator:

```
time curl -k -u user@internal:pass https://enginefqdn/ovirt-engine/api/clusters
```


Germano, I think I may not be up to date: was there any code change made on this environment?

Comment 33 Germano Veit Michel 2020-01-06 03:44:49 UTC
(In reply to Roy Golan from comment #32)
> (In reply to Germano Veit Michel from comment #31)
> > Just a quick update.
> > 
> > The table bloat issue seems solved and actions like VM Start are now OK, but
> > some other actions are still slow (i.e. VM Migration Dialogue, Hosts Tab).
> > I've requested more data to investigate those.
> 
> I guess we can attribute those to slow queries execution on the cluster
> view, and its related views (network, host nics and so on),
> which is the main issue of that bug. Waiting for the data to validate this. 
> BTW a simple api query to clusters should reveal this - if you can perform
> this just as a small test it will be an indicator:
> 
> ```
> time curl -k -u user@internal:pass
> https://enginefqdn/ovirt-engine/api/clusters
> ```

It might be, but from the data we had I couldn't see that as the root cause.
I've requested more extensive tests (including a HAR capture from Firefox).

I'll update once we receive the data.

> Germano I think I may not be updated, is there any code change made on this
> environment?

Sorry but I'm a bit confused. Do you want confirmation that the customer has not modified the code and rebuilt ovirt-engine?

Comment 34 Germano Veit Michel 2020-01-08 01:39:10 UTC
Going back to this, I've got more data now...

The customer is now complaining mainly about 3 problems:
1) Hosts tab slow to load
2) Manage Storage Domain slow to load
3) Migrate VM dialog slow to load

(In reply to Roy Golan from comment #32)
> I guess we can attribute those to slow queries execution on the cluster
> view, and its related views (network, host nics and so on),
> which is the main issue of that bug. Waiting for the data to validate this. 
> BTW a simple api query to clusters should reveal this - if you can perform
> this just as a small test it will be an indicator:
> 
> ```
> time curl -k -u user@internal:pass
> https://enginefqdn/ovirt-engine/api/clusters
> ```

This would be problem 1. But I'm not sure it's related to the problems already found; see this from the customer's logs, with cluster_view being quite fast:
2020-01-06 10:00:05,456+09 DEBUG [org.ovirt.engine.core.bll.SearchQuery] (default task-11) [a90fc4a1-4c8a-4bcf-b664-4979d1e2c04c] Executing generic query: SELECT * FROM ((SELECT  cluster_view.* FROM  cluster_view  )  ORDER BY name ASC) as T1 OFFSET (1 -1) LIMIT 2147483647
2020-01-06 10:00:05,461+09 DEBUG [org.ovirt.engine.core.bll.SearchQuery] (default task-11) [a90fc4a1-4c8a-4bcf-b664-4979d1e2c04c] Query SearchQuery took 5 ms

When browsing the Hosts tab, the customer complains it still takes a long time to load. It seems the admin portal requests a GetVdsInterfacesByVdsId for each of the 100 hosts shown per page, and this part takes 15s to return:

2020-01-06 11:23:41,217+09 DEBUG [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Server: RunMultipleQuery invoked! [amount of queries: 100]
2020-01-06 11:23:41,217+09 DEBUG [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Query type 'GetVdsInterfacesByVdsId', Parameters 'IdQueryParameters:{refresh='false', filtered='false'}'
...
2020-01-06 11:23:58,076+09 DEBUG [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default task-42) [] Server: RunMultipleQuery result [amount of queries: 100]

Each of these can apparently take up to 0.25s (note the size of the env):
2020-01-06 11:23:57,745+09 DEBUG [org.ovirt.engine.core.bll.network.host.GetVdsInterfacesByVdsIdQuery] (default task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Query GetVdsInterfacesByVdsIdQuery took 248 ms
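
To see where those 248 ms go on the database side, the underlying per-host query can be timed directly; just a sketch (replace the UUID with a real host id), since the stacks earlier in this bug show these queries reading vds_interface_view for a single host:

-- Sketch: time one per-host interface query server-side.
EXPLAIN ANALYZE
SELECT * FROM vds_interface_view
WHERE vds_id = '00000000-0000-0000-0000-000000000000';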

Regarding problem 2, I found some slower GetDeviceList commands; apparently the engine runs 4 of these sequentially to load the Manage Domain tab, but that doesn't explain all of the delay.

I'm struggling to explain this slowness and would appreciate some help to identify whether we have already found all the problems or there are more.

I'm attaching engine debug logs and timestamps of the problems:
1) Hosts tab slow to load             - Jan  6 11:22:42 to Jan  6 11:23:56
2) Manage Storage Domain slow to load - Jan  6 11:04:48 to Jan  6 11:10:29
3) Migrate VM dialog slow to load     - Jan  6 10:47:33 to Jan  6 10:53:04

Can someone please take a look and see if there are indications of additional problems we haven't found yet? I think we may not have uncovered everything.

Comment 38 Roy Golan 2020-01-08 09:51:27 UTC
(In reply to Germano Veit Michel from comment #34)
> Going back to this, I've got more data now...
> 
> The customer is now complaining mainly of 3 problems.
> 1) Hosts tab slow to load
> 2) Manage Storage Domain slow to load
> 3) Migrate VM dialog slow to load
> 
> (In reply to Roy Golan from comment #32)
> > I guess we can attribute those to slow queries execution on the cluster
> > view, and its related views (network, host nics and so on),
> > which is the main issue of that bug. Waiting for the data to validate this. 
> > BTW a simple api query to clusters should reveal this - if you can perform
> > this just as a small test it will be an indicator:
> > 
> > ```
> > time curl -k -u user@internal:pass
> > https://enginefqdn/ovirt-engine/api/clusters
> > ```
> 
> This would be problem 1. But I'm not sure its related to the already found
> problems, see this from customer's logs, cluster_view being quite fast:
> 2020-01-06 10:00:05,456+09 DEBUG [org.ovirt.engine.core.bll.SearchQuery]
> (default task-11) [a90fc4a1-4c8a-4bcf-b664-4979d1e2c04c] Executing generic
> query: SELECT * FROM ((SELECT  cluster_view.* FROM  cluster_view  )  ORDER
> BY name ASC) as T1 OFFSET (1 -1) LIMIT 2147483647
> 2020-01-06 10:00:05,461+09 DEBUG [org.ovirt.engine.core.bll.SearchQuery]
> (default task-11) [a90fc4a1-4c8a-4bcf-b664-4979d1e2c04c] Query SearchQuery
> took 5 ms
> 
> When browsing the hosts tab, the customer complains it still takes a long
> time to load. It seems the admin portal will request a
> GetVdsInterfacesByVdsId for each of the 100 hosts being shown per page, this
> part takes 15s to return: 
> 
> 2020-01-06 11:23:41,217+09 DEBUG
> [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default
> task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Server: RunMultipleQuery
> invoked! [amount of queries: 100]
> 2020-01-06 11:23:41,217+09 DEBUG
> [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default
> task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Query type
> 'GetVdsInterfacesByVdsId', Parameters 'IdQueryParameters:{refresh='false',
> filtered='false'}'
> ...
> 2020-01-06 11:23:58,076+09 DEBUG
> [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default
> task-42) [] Server: RunMultipleQuery result [amount of queries: 100]

100 queries instead of one bulk select. We probably have tons of these.
The UI needs to stop doing that.



> 
> This apparently can sometimes take up to 0.25s each (note the size of the
> env)
> 2020-01-06 11:23:57,745+09 DEBUG
> [org.ovirt.engine.core.bll.network.host.GetVdsInterfacesByVdsIdQuery]
> (default task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Query
> GetVdsInterfacesByVdsIdQuery took 248 ms
> 
> Regarding problem 2 I found some slower GetDeviceList commands, and
> apparently the engine does 4 of these sequentially to load the Manage Domain
> tab, but it doesn't explain all the delay.
> 
> I'm struggling to explain this slowness and would appreciate some help to
> indentify if we already found all the problems or there is more.
> 
> I'm attaching engine debug logs and timestamps of the problems:
> 1) Hosts tab slow to load             - Jan  6 11:22:42 to Jan  6 11:23:56
> 2) Manage Storage Domain slow to load - Jan  6 11:04:48 to Jan  6 11:10:29
> 3) Migrate VM dialog slow to load     - Jan  6 10:47:33 to Jan  6 10:53:04
> 
> Can someone please take a look if there are indications of additional
> problems we haven't found yet? I think we may not have uncovered everything
> yet.

Comment 39 Germano Veit Michel 2020-01-08 22:52:15 UTC
(In reply to Roy Golan from comment #38)
> 
> 100 queries instead of a bulk select. We probabaly have tons of these.
> The UI need to stop doing that.
> 
Should I open a separate bug to track it?

If you had time to check the logs, did you find anything else that might be contributing?

Comment 40 Germano Veit Michel 2020-01-14 05:48:45 UTC
(In reply to Germano Veit Michel from comment #39)
> (In reply to Roy Golan from comment #38)
> > 
> > 100 queries instead of a bulk select. We probabaly have tons of these.
> > The UI need to stop doing that.
> > 
> Should I open a separate bug to track it?

Opened BZ1790746 to track webadmin queries on hosts tab.

Comment 45 Martin Perina 2021-05-20 09:47:55 UTC
We have finished several fixes which improved scale performance enough, so we are closing this tracker. If you find any other important scale issue, please open a new, concrete bug about it.