Bug 1766815
| Summary: | [Tracker] Slowdown after upgrading from 4.1 to 4.2 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | ovirt-engine | Assignee: | Nobody <nobody> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Michael Burman <mburman> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.2.8-4 | CC: | dholler, mburman, michal.skrivanek, mlehrer, mperina, pelauter, rgolan |
| Target Milestone: | ovirt-4.5.0 | Keywords: | Performance, Tracking |
| Target Release: | 4.5.0 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-20 09:47:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1754363, 1769463, 1790746, 1811866 | ||
| Bug Blocks: | |||
Description
Germano Veit Michel
2019-10-30 00:39:44 UTC
I think this explains the slow network-related SQLs, look at this:
We ran a manual GetVdsInterfacesByNetworkId(), which we have seen slowing down VM Start, Migration and other operations several times:
EXPLAIN ANALYZE
SELECT vds_interface_view.*
FROM vds_interface_view
INNER JOIN vds ON vds.vds_id = vds_interface_view.vds_id
INNER JOIN network_cluster ON network_cluster.cluster_id = vds.cluster_id
INNER JOIN network ON network.id = network_cluster.network_id AND network.name = vds_interface_view.network_name
WHERE network.id = $network_id
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1684.64..539466.72 rows=1 width=439) (actual time=279.250..154050.623 rows=138 loops=1)
Join Filter: (vds_interface.vds_id = vds_static_1.vds_id)
Rows Removed by Join Filter: 18906
-> Hash Join (cost=62.93..299.53 rows=1 width=64) (actual time=0.381..4.403 rows=138 loops=1)
Hash Cond: (cluster.cluster_id = network_cluster.cluster_id)
-> Hash Left Join (cost=58.60..294.67 rows=138 width=80) (actual time=0.325..3.851 rows=138 loops=1)
Hash Cond: (vds_static_1.vds_id = vds_spm_id_map.vds_id)
-> Hash Join (cost=53.49..287.67 rows=138 width=80) (actual time=0.278..3.382 rows=138 loops=1)
Hash Cond: (vds_static_1.cluster_id = cluster.cluster_id)
-> Hash Join (cost=52.33..284.61 rows=138 width=64) (actual time=0.260..2.902 rows=138 loops=1)
Hash Cond: (vds_statistics.vds_id = vds_static_1.vds_id)
-> Seq Scan on vds_statistics (cost=0.00..230.38 rows=138 width=16) (actual time=0.006..1.778 rows=138 loops=1)
-> Hash (cost=50.61..50.61 rows=138 width=48) (actual time=0.244..0.244 rows=138 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 19kB
-> Hash Join (cost=15.38..50.61 rows=138 width=48) (actual time=0.112..0.225 rows=138 loops=1)
Hash Cond: (vds_dynamic.vds_id = vds_static_1.vds_id)
-> Index Only Scan using pk_vds_dynamic on vds_dynamic (cost=0.27..32.95 rows=312 width=16) (actual time=0.012..0.095 rows=138 loops=1)
Heap Fetches: 21
-> Hash (cost=13.38..13.38 rows=138 width=32) (actual time=0.089..0.089 rows=138 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 17kB
-> Seq Scan on vds_static vds_static_1 (cost=0.00..13.38 rows=138 width=32) (actual time=0.007..0.070 rows=138 loops=1)
-> Hash (cost=1.07..1.07 rows=7 width=32) (actual time=0.005..0.005 rows=7 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on cluster (cost=0.00..1.07 rows=7 width=32) (actual time=0.002..0.002 rows=7 loops=1)
-> Hash (cost=3.38..3.38 rows=138 width=16) (actual time=0.037..0.037 rows=138 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 15kB
-> Seq Scan on vds_spm_id_map (cost=0.00..3.38 rows=138 width=16) (actual time=0.006..0.024 rows=138 loops=1)
-> Hash (cost=4.31..4.31 rows=2 width=32) (actual time=0.037..0.037 rows=7 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Only Scan using pk_network_cluster on network_cluster (cost=0.27..4.31 rows=2 width=32) (actual time=0.032..0.033 rows=7 loops=1)
Index Cond: (network_id = '8d174c8a-1360-420b-a400-82d7d49ebb71'::uuid)
Heap Fetches: 0
-> Hash Join (cost=1621.71..539166.40 rows=63 width=471) (actual time=50.255..1116.246 rows=138 loops=138)
Hash Cond: ((vds_interface.network_name)::text = (network.name)::text)
-> Hash Join (cost=1613.41..539125.38 rows=8559 width=455) (actual time=39.214..1115.249 rows=12636 loops=138)
Hash Cond: (vds_interface.vds_id = vds_static.vds_id)
-> Hash Join (cost=1598.31..538992.59 rows=8559 width=392) (actual time=39.210..1112.320 rows=12636 loops=138)
Hash Cond: (vds_interface_statistics.id = vds_interface.id)
-> Seq Scan on vds_interface_statistics (cost=0.00..537276.59 rows=8559 width=72) (actual time=39.133..1106.252 rows=12636 loops=138)
-> Hash (cost=1440.36..1440.36 rows=12636 width=336) (actual time=9.589..9.589 rows=12636 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1953kB
-> Seq Scan on vds_interface (cost=0.00..1440.36 rows=12636 width=336) (actual time=0.035..5.640 rows=12636 loops=1)
-> Hash (cost=13.38..13.38 rows=138 width=63) (actual time=0.053..0.053 rows=138 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 21kB
-> Seq Scan on vds_static (cost=0.00..13.38 rows=138 width=63) (actual time=0.001..0.034 rows=138 loops=1)
-> Hash (cost=8.29..8.29 rows=1 width=29) (actual time=0.016..0.016 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using pk_network on network (cost=0.27..8.29 rows=1 width=29) (actual time=0.013..0.013 rows=1 loops=1)
Index Cond: (id = '8d174c8a-1360-420b-a400-82d7d49ebb71'::uuid)
Planning time: 4.336 ms
Execution time: 154050.945 ms
The problem seems to be related to the 'statistics' table:
-> Hash Join (cost=1598.31..538992.59 rows=8559 width=392) (actual time=39.210..1112.320 rows=12636 loops=138)
Hash Cond: (vds_interface_statistics.id = vds_interface.id)
-> Seq Scan on vds_interface_statistics (cost=0.00..537276.59 rows=8559 width=72) (actual time=39.133..1106.252 rows=12636 loops=138)
-> Hash (cost=1440.36..1440.36 rows=12636 width=336) (actual time=9.589..9.589 rows=12636 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1953kB
The table looks bloated...
SELECT n_live_tup, n_dead_tup from pg_stat_user_tables where relname = 'vds_interface_statistics';
n_live_tup | n_dead_tup
------------+------------
8559 | 1555781
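As a side note, the same check can be broadened to see which tables carry the most dead tuples and how big they are on disk. This is a minimal sketch using standard pg_stat_user_tables columns and catalog functions, not a query taken from the customer environment:

```sql
-- Sketch: rank user tables by dead tuples and show their physical size,
-- so heavily bloated tables such as vds_interface_statistics stand out.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid::regclass)) AS total_size,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```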
However, the customer was on 4.1, which was working fine before; the problem started in 4.2. Both should have similar vacuum settings, as both have Roy's patch "packaging: setup: More aggressive autovacuum configuration".
Does this ring any bells? Are there more updates on this table in 4.2 than in 4.1?
Customer's vacuum settings look the same on 4.1 and 4.2:

autovacuum_max_workers = 6
autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.075

Two tables are heavily bloated, for example:

INFO: "vds_interface_statistics": scanned 30000 of 571234 pages, containing 626 live rows and 1789412 dead rows; 626 rows in sample, 11920 estimated total rows
INFO: "vm_interface_statistics": scanned 30000 of 141793 pages, containing 555 live rows and 1581668 dead rows; 555 rows in sample, 2623 estimated total rows

I think these may explain some very slow running SQLs (~3 min) that slow everything down. I'll try to find out if this started in 4.2, hopefully the customer still has a 4.1 env. Does this ring any bells in relation to any change in 4.2?

Note that for comment #20, we already have vdsRefreshRate at 13.

Mordechai, I recall you built a similar environment with a high number of interfaces and hosts. If you still have it, can you check if the vds_interface_statistics table is getting bloated as time goes by?

(In reply to Germano Veit Michel from comment #20)
> I'll try to find out if this started in 4.2, hopefully the customer still
> has a 4.1 env.

Big environments were already upgraded to 4.2, so we won't know this.

(In reply to Germano Veit Michel from comment #21)
> Note that for comment #20, we already have vdsRefreshRate at 13.
>
> Mordechai, I recall you built a similar environment with a high number of
> interfaces and hosts. If you still have it, can you check if the
> vds_interface_statistics table is getting bloated as time goes by?

I expect the stats table to get bloated over time because we are bombarding it with reads and writes, so we don't give autovacuum enough time to kick in. The autovacuum workers have a limit when they work on a table, and when that expires they go to sleep and wake up again later.

Just so we don't lose focus: the root cause is too many reads/writes, as identified by mlehrer, which causes massive needless reads (maybe also writes?). To ease the situation a bit I can offer a periodic full vacuum using the provided bin/engine-vacuum tool (please have a look).

I can think about tuning the stats table only for more aggressive autovacuum, but I don't want to do that without some testing. More aggressive vacuum takes CPU time. If the machine has enough cores idling while the engine is running, we can consider this.

(In reply to Germano Veit Michel from comment #21)
> Note that for comment #20, we already have vdsRefreshRate at 13.
>
> Mordechai, I recall you built a similar environment with a high number of
> interfaces and hosts. If you still have it, can you check if the
> vds_interface_statistics table is getting bloated as time goes by?
# Including vds_interface_view numbers as we had to remove some hosts for 4.4 testing efforts

[root@rhev-green-01 ~]# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT count(*) from vds_interface_view"
 count
-------
 25496
(1 row)

 relname                     | n_dead_tup | n_live_tup
-----------------------------+------------+------------
 vds_interface_statistics    |       2469 |      25496
 disk_image_dynamic          |        796 |       3886
 vm_interface_statistics     |        789 |       4275
 vm_statistics               |        747 |       3764
 vm_guest_agent_interfaces   |        497 |      59080
 vm_dynamic                  |        422 |       3764
 vds_interface               |        318 |      42027
 network_attachments         |        176 |      24456
 vm_device                   |        133 |      61103
 snapshots                   |         72 |       3764
 permissions                 |         71 |       6853
 base_disks                  |         70 |       3885
 image_storage_domain_map    |         70 |       3890
 disk_vm_element             |         67 |       3828
 vm_interface                |         66 |       4300
 vm_static                   |         58 |       3804
 storage_domain_dynamic      |         57 |         16
 vm_icon_defaults            |         50 |         50
 numa_node                   |         50 |        364
 numa_node_cpu_map           |         48 |       1348
 storage_domains_ovf_info    |         43 |         24
 vm_init                     |         42 |         28
 dwh_history_timekeeping     |         38 |         11
 job                         |         33 |          0
 step                        |         32 |          0
 storage_pool_iso_map        |         32 |         13
 dns_resolver_configuration  |         32 |        338
 vds_statistics              |         27 |        337
 command_entities            |         26 |          0
 storage_domain_static       |         24 |         16
 engine_sessions             |         23 |          0
 external_variable           |         23 |          1
 name_server                 |         22 |        365
 images                      |         20 |       3888
 storage_pool                |         19 |         10
 image_transfers             |         19 |          0
 vds_spm_id_map              |         17 |        337

# table sizes
 table_name                | table_size
---------------------------+------------
 audit_log                 | 115 MB
 vm_ovf_generations        | 21 MB
 vds_interface             | 14 MB
 vm_device                 | 12 MB
 vds_interface_statistics  | 11 MB
(5 rows)

INFO: vacuuming "public.vds_interface_statistics"
INFO: scanned index "pk_vds_interface_statistics" to remove 1 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "idx_vds_interface_statistics_vds_id" to remove 1 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: "vds_interface_statistics": removed 1 row versions in 1 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: index "pk_vds_interface_statistics" now contains 25496 row versions in 253 pages
DETAIL: 1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "idx_vds_interface_statistics_vds_id" now contains 25496 row versions in 553 pages
DETAIL: 1 index row versions were removed.
59 index pages have been deleted, 59 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "vds_interface_statistics": found 2987 removable, 21259 nonremovable row versions in 919 out of 1402 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 965030376
There were 64837 unused item pointers.
Skipped 0 pages due to buffer pins, 226 frozen pages.
0 pages are entirely empty.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s.
INFO: analyzing "public.vds_interface_statistics"
INFO: "vds_interface_statistics": scanned 1402 of 1402 pages, containing 25496 live rows and 0 dead rows; 25496 rows in sample, 25496 estimated total rows
VACUUM

engine=# SELECT n_live_tup, n_dead_tup from pg_stat_user_tables where relname = 'vm_interface_statistics';
 n_live_tup | n_dead_tup
------------+------------
       4275 |       1340
(1 row)

engine=# VACUUM ANALYZE VERBOSE vm_interface_statistics;
INFO: vacuuming "public.vm_interface_statistics"
INFO: index "pk_vm_interface_statistics" now contains 4275 row versions in 42 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "idx_vm_interface_statistics_vm_id" now contains 4275 row versions in 53 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "vm_interface_statistics": found 1075 removable, 4275 nonremovable row versions in 186 out of 186 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 965031104
There were 11201 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: analyzing "public.vm_interface_statistics"
INFO: "vm_interface_statistics": scanned 186 of 186 pages, containing 4275 live rows and 0 dead rows; 4275 rows in sample, 4275 estimated total rows
VACUUM

To clean up the dead rows, a full vacuum is really necessary, as described in the previous comment https://bugzilla.redhat.com/show_bug.cgi?id=1766815#c23, using bin/engine-vacuum.

Thanks Roy and Mordechai!
The problem is that we already performed a full vacuum before (more than once).
The bloat comes back very quickly; see how it escalates and then stabilizes once the table is already very bloated:
$ /usr/share/ovirt-engine/bin/engine-vacuum.sh -f -v
~~~
INFO: vacuuming "public.vds_interface_statistics"
INFO: "vds_interface_statistics": found 505839 removable, 12633 nonremovable row versions in 532927 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU 0.63s/0.52u sec elapsed 1.16 sec.
~~~
$ egrep 'vds_interface_statistics' *00.log | grep scanned | awk -F ' ' '{print $1,$2,$9,$10,$13,$14}' | column -t
vacuum-2019-11-29_17:00.log:INFO: "vds_interface_statistics": 12633 live 361794 dead
vacuum-2019-11-29_18:00.log:INFO: "vds_interface_statistics": 12633 live 568347 dead
vacuum-2019-11-29_19:00.log:INFO: "vds_interface_statistics": 12633 live 1188109 dead
vacuum-2019-11-29_20:00.log:INFO: "vds_interface_statistics": 9892 live 1454671 dead
vacuum-2019-11-29_21:00.log:INFO: "vds_interface_statistics": 7201 live 1537761 dead
vacuum-2019-11-29_22:00.log:INFO: "vds_interface_statistics": 6723 live 1585418 dead
vacuum-2019-11-29_23:00.log:INFO: "vds_interface_statistics": 5455 live 1621861 dead
vacuum-2019-11-30_00:00.log:INFO: "vds_interface_statistics": 4315 live 1645450 dead
vacuum-2019-11-30_01:00.log:INFO: "vds_interface_statistics": 4848 live 1657870 dead
vacuum-2019-11-30_02:00.log:INFO: "vds_interface_statistics": 4154 live 1674611 dead
vacuum-2019-11-30_03:00.log:INFO: "vds_interface_statistics": 3337 live 1679935 dead
vacuum-2019-11-30_04:00.log:INFO: "vds_interface_statistics": 2536 live 1699765 dead
vacuum-2019-11-30_05:00.log:INFO: "vds_interface_statistics": 3206 live 1708461 dead
vacuum-2019-11-30_06:00.log:INFO: "vds_interface_statistics": 2656 live 1717015 dead
vacuum-2019-11-30_07:00.log:INFO: "vds_interface_statistics": 2881 live 1720743 dead
vacuum-2019-11-30_08:00.log:INFO: "vds_interface_statistics": 2208 live 1728987 dead
vacuum-2019-11-30_09:00.log:INFO: "vds_interface_statistics": 2315 live 1734696 dead
vacuum-2019-11-30_10:00.log:INFO: "vds_interface_statistics": 2259 live 1742178 dead
vacuum-2019-11-30_11:00.log:INFO: "vds_interface_statistics": 2165 live 1745003 dead
vacuum-2019-11-30_12:00.log:INFO: "vds_interface_statistics": 1957 live 1751855 dead
vacuum-2019-11-30_13:00.log:INFO: "vds_interface_statistics": 1745 live 1754312 dead
vacuum-2019-11-30_14:00.log:INFO: "vds_interface_statistics": 2015 live 1759946 dead
vacuum-2019-11-30_15:00.log:INFO: "vds_interface_statistics": 1917 live 1759559 dead
vacuum-2019-11-30_16:00.log:INFO: "vds_interface_statistics": 1298 live 1764150 dead
vacuum-2019-11-30_17:00.log:INFO: "vds_interface_statistics": 1259 live 1766954 dead
vacuum-2019-11-30_18:00.log:INFO: "vds_interface_statistics": 1628 live 1771477 dead
vacuum-2019-11-30_19:00.log:INFO: "vds_interface_statistics": 1353 live 1771258 dead
vacuum-2019-11-30_20:00.log:INFO: "vds_interface_statistics": 1838 live 1770989 dead
vacuum-2019-11-30_21:00.log:INFO: "vds_interface_statistics": 1354 live 1772730 dead
vacuum-2019-11-30_22:00.log:INFO: "vds_interface_statistics": 888 live 1775524 dead
vacuum-2019-11-30_23:00.log:INFO: "vds_interface_statistics": 1391 live 1778192 dead
vacuum-2019-12-01_00:00.log:INFO: "vds_interface_statistics": 851 live 1772568 dead
vacuum-2019-12-01_01:00.log:INFO: "vds_interface_statistics": 1208 live 1757525 dead
vacuum-2019-12-01_02:00.log:INFO: "vds_interface_statistics": 1667 live 1761451 dead
vacuum-2019-12-01_03:00.log:INFO: "vds_interface_statistics": 1133 live 1758314 dead
vacuum-2019-12-01_04:00.log:INFO: "vds_interface_statistics": 880 live 1762851 dead
vacuum-2019-12-01_05:00.log:INFO: "vds_interface_statistics": 1057 live 1764992 dead
vacuum-2019-12-01_06:00.log:INFO: "vds_interface_statistics": 560 live 1768721 dead
vacuum-2019-12-01_07:00.log:INFO: "vds_interface_statistics": 717 live 1768880 dead
vacuum-2019-12-01_08:00.log:INFO: "vds_interface_statistics": 1174 live 1769432 dead
vacuum-2019-12-01_09:00.log:INFO: "vds_interface_statistics": 917 live 1771564 dead
vacuum-2019-12-01_10:00.log:INFO: "vds_interface_statistics": 1024 live 1771444 dead
vacuum-2019-12-01_11:00.log:INFO: "vds_interface_statistics": 1200 live 1771819 dead
vacuum-2019-12-01_12:00.log:INFO: "vds_interface_statistics": 814 live 1774972 dead
vacuum-2019-12-01_13:00.log:INFO: "vds_interface_statistics": 930 live 1775267 dead
vacuum-2019-12-01_14:00.log:INFO: "vds_interface_statistics": 632 live 1778116 dead
vacuum-2019-12-01_15:00.log:INFO: "vds_interface_statistics": 675 live 1777556 dead
vacuum-2019-12-01_16:00.log:INFO: "vds_interface_statistics": 563 live 1778871 dead
vacuum-2019-12-01_17:00.log:INFO: "vds_interface_statistics": 920 live 1780616 dead
vacuum-2019-12-01_18:00.log:INFO: "vds_interface_statistics": 797 live 1780982 dead
vacuum-2019-12-01_19:00.log:INFO: "vds_interface_statistics": 625 live 1783784 dead
vacuum-2019-12-01_20:00.log:INFO: "vds_interface_statistics": 329 live 1785584 dead
vacuum-2019-12-01_21:00.log:INFO: "vds_interface_statistics": 647 live 1784801 dead
vacuum-2019-12-01_22:00.log:INFO: "vds_interface_statistics": 1015 live 1782021 dead
vacuum-2019-12-01_23:00.log:INFO: "vds_interface_statistics": 338 live 1786441 dead
vacuum-2019-12-02_00:00.log:INFO: "vds_interface_statistics": 526 live 1785436 dead
vacuum-2019-12-02_01:00.log:INFO: "vds_interface_statistics": 601 live 1763758 dead
vacuum-2019-12-02_02:00.log:INFO: "vds_interface_statistics": 448 live 1767185 dead
vacuum-2019-12-02_03:00.log:INFO: "vds_interface_statistics": 539 live 1767878 dead
vacuum-2019-12-02_04:00.log:INFO: "vds_interface_statistics": 417 live 1767778 dead
vacuum-2019-12-02_05:00.log:INFO: "vds_interface_statistics": 1125 live 1768370 dead
vacuum-2019-12-02_06:00.log:INFO: "vds_interface_statistics": 504 live 1765809 dead
vacuum-2019-12-02_07:00.log:INFO: "vds_interface_statistics": 735 live 1765032 dead
vacuum-2019-12-02_08:00.log:INFO: "vds_interface_statistics": 660 live 1766448 dead
vacuum-2019-12-02_09:00.log:INFO: "vds_interface_statistics": 322 live 1767497 dead
vacuum-2019-12-02_10:00.log:INFO: "vds_interface_statistics": 655 live 1764297 dead
vacuum-2019-12-02_11:00.log:INFO: "vds_interface_statistics": 881 live 1769607 dead
vacuum-2019-12-02_12:00.log:INFO: "vds_interface_statistics": 524 live 1767738 dead
vacuum-2019-12-02_13:00.log:INFO: "vds_interface_statistics": 383 live 1767235 dead
vacuum-2019-12-02_14:00.log:INFO: "vds_interface_statistics": 1156 live 1772126 dead
vacuum-2019-12-02_15:00.log:INFO: "vds_interface_statistics": 611 live 1770586 dead
vacuum-2019-12-02_16:00.log:INFO: "vds_interface_statistics": 677 live 1771297 dead
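To complement the numbers above, this is a minimal sketch (assuming psql access to the engine database) of how one might check whether autovacuum is firing on these tables at all, and how often. The columns are standard pg_stat_user_tables fields; nothing here is taken from the customer's data:

```sql
-- Sketch: distinguish "autovacuum never runs" from "autovacuum runs but
-- cannot keep up" by looking at the last run time and the run counters.
SELECT relname,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       vacuum_count,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname IN ('vds_interface_statistics', 'vm_interface_statistics');
```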
(In reply to Roy Golan from comment #23)
> I can think about tuning the stats table only for more aggressive
> autovacuum, but I don't want to do
> that without some testing. More aggressive vacuum takes CPU time. If the
> machine has enough cores idling
> while the engine is running, we can consider this.
I agree, but I'm afraid to tune this myself and make it worse, as I've seen these parameters were chosen very carefully, with a lot of testing, in your patch that went into 4.1.
The machine has plenty of available resources; I'm more afraid of contention due to vacuuming than of CPU usage.
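For illustration, the per-table autovacuum tuning Roy mentions in comment #23 could look roughly like the sketch below. The storage parameters are standard PostgreSQL per-table settings, but the values are placeholders rather than tested recommendations, and would need the kind of testing discussed above:

```sql
-- Sketch only (placeholder values): vacuum the statistics table once a fixed
-- number of dead rows accumulates, instead of waiting for the default
-- scale-factor threshold, and give its autovacuum a larger cost budget.
ALTER TABLE vds_interface_statistics SET (
    autovacuum_vacuum_scale_factor = 0.0,
    autovacuum_vacuum_threshold    = 5000,
    autovacuum_vacuum_cost_limit   = 1000
);

-- Revert to the global defaults if it makes things worse:
-- ALTER TABLE vds_interface_statistics RESET (
--     autovacuum_vacuum_scale_factor,
--     autovacuum_vacuum_threshold,
--     autovacuum_vacuum_cost_limit
-- );
```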
Few more from before the last exact hour, to complement comment #25.

vacuum-2019-11-29_16:30.log:INFO: "vds_interface_statistics": 12633 live 47737 dead
vacuum-2019-11-29_16:35.log:INFO: "vds_interface_statistics": 12633 live 103014 dead
vacuum-2019-11-29_16:40.log:INFO: "vds_interface_statistics": 12633 live 160244 dead
vacuum-2019-11-29_16:45.log:INFO: "vds_interface_statistics": 12633 live 206028 dead
vacuum-2019-11-29_16:50.log:INFO: "vds_interface_statistics": 12633 live 258780 dead
vacuum-2019-11-29_16:55.log:INFO: "vds_interface_statistics": 12633 live 316010 dead
...

Despite the removed hosts, the dead rows in Mordechai's lab do not seem to grow anywhere near the customer's, and the customer has VdsRefreshRate set to 13. The customer's dead rows are growing at about 166 per second just in the 30k pages of the scanning sample, probably about 3k per second in total. There seems to be a difference between the customer's environment and the scale lab. Could the API calls be responsible?

Hi Roy,

Would really appreciate your thoughts here to provide a workaround to the customer. Given the ugly numbers we see (comment #25 and comment #26), I think autovacuum_*_scale_factor and autovacuum_*_threshold are not to blame; they should always evaluate to the table needing an analyze/vacuum.

Do you think raising vacuum_cost_limit from the default 200 to something like 1000 or 2000 would help? Perhaps autovacuum is running regularly as intended but stopping too soon, so the dead tuples accumulate. Do you agree?

As a workaround, and given that you say we have CPUs available, we can definitely consider this:

1. Increase the workers which are reserved for autovacuum: autovacuum_max_workers=8 (increased by 2)
2. Tune the vacuum gauges of the vds_statistics table to vacuum more aggressively: autovacuum_vacuum_cost_limit=? (now that we have more workers that eat the shared limit it makes sense to increase this). I don't have a number off hand; increase by x2 and examine the behaviour of overall CPUs and dead tuple count.
3. Lower nap time: autovacuum_naptime=30 (30 secs)

Pass criteria:
- engine remains serviceable, without getting sluggish on regular actions (login, monitoring...)
- dead tuple count does not increase over time

Mordechai, Dominik also FYI.

Just a quick update. The table bloat issue seems solved and actions like VM Start are now OK, but some other actions are still slow (i.e. the VM Migration dialog and the Hosts tab). I've requested more data to investigate those.

(In reply to Germano Veit Michel from comment #31)
> Just a quick update.
>
> The table bloat issue seems solved and actions like VM Start are now OK, but
> some other actions are still slow (i.e. VM Migration Dialogue, Hosts Tab).
> I've requested more data to investigate those.

I guess we can attribute those to slow query execution on the cluster view and its related views (network, host NICs and so on), which is the main issue of that bug. Waiting for the data to validate this. BTW, a simple API query to clusters should reveal this - if you can perform this just as a small test it will be an indicator:

```
time curl -k -u user@internal:pass https://enginefqdn/ovirt-engine/api/clusters
```

Germano, I think I may not be updated - is there any code change made on this environment?

(In reply to Roy Golan from comment #32)
> (In reply to Germano Veit Michel from comment #31)
> > Just a quick update.
> >
> > The table bloat issue seems solved and actions like VM Start are now OK, but
> > some other actions are still slow (i.e. VM Migration Dialogue, Hosts Tab).
> > I've requested more data to investigate those.
>
> I guess we can attribute those to slow queries execution on the cluster
> view, and its related views (network, host nics and so on),
> which is the main issue of that bug. Waiting for the data to validate this.
> BTW a simple api query to clusters should reveal this - if you can perform
> this just as a small test it will be an indicator:
>
> ```
> time curl -k -u user@internal:pass
> https://enginefqdn/ovirt-engine/api/clusters
> ```

It might be it, but from the data we had I couldn't see that as the root cause. I've requested more extensive tests (including a HAR capture from Firefox). I'll update once we receive the data.

> Germano I think I may not be updated, is there any code change made on this
> environment?

Sorry, but I'm a bit confused. Do you want confirmation that the customer has not modified the code and rebuilt ovirt-engine?

Going back to this, I've got more data now...

The customer is now complaining mainly of 3 problems:
1) Hosts tab slow to load
2) Manage Storage Domain slow to load
3) Migrate VM dialog slow to load

(In reply to Roy Golan from comment #32)
> I guess we can attribute those to slow queries execution on the cluster
> view, and its related views (network, host nics and so on),
> which is the main issue of that bug. Waiting for the data to validate this.
> BTW a simple api query to clusters should reveal this - if you can perform
> this just as a small test it will be an indicator:
>
> ```
> time curl -k -u user@internal:pass
> https://enginefqdn/ovirt-engine/api/clusters
> ```

This would be problem 1. But I'm not sure it's related to the already found problems; see this from the customer's logs, with cluster_view being quite fast:

2020-01-06 10:00:05,456+09 DEBUG [org.ovirt.engine.core.bll.SearchQuery] (default task-11) [a90fc4a1-4c8a-4bcf-b664-4979d1e2c04c] Executing generic query: SELECT * FROM ((SELECT cluster_view.* FROM cluster_view ) ORDER BY name ASC) as T1 OFFSET (1 -1) LIMIT 2147483647
2020-01-06 10:00:05,461+09 DEBUG [org.ovirt.engine.core.bll.SearchQuery] (default task-11) [a90fc4a1-4c8a-4bcf-b664-4979d1e2c04c] Query SearchQuery took 5 ms

When browsing the Hosts tab, the customer complains it still takes a long time to load. It seems the admin portal will request a GetVdsInterfacesByVdsId for each of the 100 hosts being shown per page, and this part takes 15s to return:

2020-01-06 11:23:41,217+09 DEBUG [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Server: RunMultipleQuery invoked! [amount of queries: 100]
2020-01-06 11:23:41,217+09 DEBUG [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Query type 'GetVdsInterfacesByVdsId', Parameters 'IdQueryParameters:{refresh='false', filtered='false'}'
...
2020-01-06 11:23:58,076+09 DEBUG [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default task-42) [] Server: RunMultipleQuery result [amount of queries: 100]

Each of these can apparently take up to 0.25s (note the size of the env):

2020-01-06 11:23:57,745+09 DEBUG [org.ovirt.engine.core.bll.network.host.GetVdsInterfacesByVdsIdQuery] (default task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Query GetVdsInterfacesByVdsIdQuery took 248 ms

Regarding problem 2, I found some slower GetDeviceList commands, and apparently the engine does 4 of these sequentially to load the Manage Domain tab, but it doesn't explain all the delay.
I'm struggling to explain this slowness and would appreciate some help to identify whether we have already found all the problems or there are more.

I'm attaching engine debug logs and timestamps of the problems:
1) Hosts tab slow to load - Jan 6 11:22:42 to Jan 6 11:23:56
2) Manage Storage Domain slow to load - Jan 6 11:04:48 to Jan 6 11:10:29
3) Migrate VM dialog slow to load - Jan 6 10:47:33 to Jan 6 10:53:04

Can someone please take a look if there are indications of additional problems we haven't found yet? I think we may not have uncovered everything yet.

(In reply to Germano Veit Michel from comment #34)
> Going back to this, I've got more data now...
>
> The customer is now complaining mainly of 3 problems.
> 1) Hosts tab slow to load
> 2) Manage Storage Domain slow to load
> 3) Migrate VM dialog slow to load
>
> (In reply to Roy Golan from comment #32)
> > I guess we can attribute those to slow queries execution on the cluster
> > view, and its related views (network, host nics and so on),
> > which is the main issue of that bug. Waiting for the data to validate this.
> > BTW a simple api query to clusters should reveal this - if you can perform
> > this just as a small test it will be an indicator:
> >
> > ```
> > time curl -k -u user@internal:pass
> > https://enginefqdn/ovirt-engine/api/clusters
> > ```
>
> This would be problem 1. But I'm not sure its related to the already found
> problems, see this from customer's logs, cluster_view being quite fast:
> 2020-01-06 10:00:05,456+09 DEBUG [org.ovirt.engine.core.bll.SearchQuery]
> (default task-11) [a90fc4a1-4c8a-4bcf-b664-4979d1e2c04c] Executing generic
> query: SELECT * FROM ((SELECT cluster_view.* FROM cluster_view ) ORDER
> BY name ASC) as T1 OFFSET (1 -1) LIMIT 2147483647
> 2020-01-06 10:00:05,461+09 DEBUG [org.ovirt.engine.core.bll.SearchQuery]
> (default task-11) [a90fc4a1-4c8a-4bcf-b664-4979d1e2c04c] Query SearchQuery
> took 5 ms
>
> When browsing the hosts tab, the customer complains it still takes a long
> time to load. It seems the admin portal will request a
> GetVdsInterfacesByVdsId for each of the 100 hosts being shown per page, this
> part takes 15s to return:
>
> 2020-01-06 11:23:41,217+09 DEBUG
> [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default
> task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Server: RunMultipleQuery
> invoked! [amount of queries: 100]
> 2020-01-06 11:23:41,217+09 DEBUG
> [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default
> task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Query type
> 'GetVdsInterfacesByVdsId', Parameters 'IdQueryParameters:{refresh='false',
> filtered='false'}'
> ...
> 2020-01-06 11:23:58,076+09 DEBUG
> [org.ovirt.engine.ui.frontend.server.gwt.GenericApiGWTServiceImpl] (default
> task-42) [] Server: RunMultipleQuery result [amount of queries: 100]

100 queries instead of a bulk select. We probably have tons of these. The UI needs to stop doing that.

> This apparently can sometimes take up to 0.25s each (note the size of the
> env)
> 2020-01-06 11:23:57,745+09 DEBUG
> [org.ovirt.engine.core.bll.network.host.GetVdsInterfacesByVdsIdQuery]
> (default task-42) [33a10816-2611-4297-b64c-123ba1d361a9] Query
> GetVdsInterfacesByVdsIdQuery took 248 ms
>
> Regarding problem 2 I found some slower GetDeviceList commands, and
> apparently the engine does 4 of these sequentially to load the Manage Domain
> tab, but it doesn't explain all the delay.
>
> I'm struggling to explain this slowness and would appreciate some help to
> identify if we already found all the problems or there is more.
>
> I'm attaching engine debug logs and timestamps of the problems:
> 1) Hosts tab slow to load - Jan 6 11:22:42 to Jan 6 11:23:56
> 2) Manage Storage Domain slow to load - Jan 6 11:04:48 to Jan 6 11:10:29
> 3) Migrate VM dialog slow to load - Jan 6 10:47:33 to Jan 6 10:53:04
>
> Can someone please take a look if there are indications of additional
> problems we haven't found yet? I think we may not have uncovered everything
> yet.

(In reply to Roy Golan from comment #38)
>
> 100 queries instead of a bulk select. We probably have tons of these.
> The UI needs to stop doing that.
>

Should I open a separate bug to track it? If you had time to check the logs, did you find anything else that might be contributing?

(In reply to Germano Veit Michel from comment #39)
> (In reply to Roy Golan from comment #38)
> >
> > 100 queries instead of a bulk select. We probably have tons of these.
> > The UI needs to stop doing that.
> >
> Should I open a separate bug to track it?

Opened BZ1790746 to track webadmin queries on the Hosts tab.

We have finished several fixes, which improved scale performance enough, so we are closing this tracker. If you find any other important scale issue, please open a new, concrete bug about it.
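For reference, the global part of the workaround proposed in comment #30 (more autovacuum workers, shorter nap time, higher cost limit) could be applied on the engine database server roughly as sketched below. ALTER SYSTEM and pg_reload_conf() are standard PostgreSQL; the values are only the starting points suggested in the thread and would still need to be validated against the pass criteria there:

```sql
-- Sketch only: apply the suggested autovacuum workaround settings.
ALTER SYSTEM SET autovacuum_naptime = '30s';           -- lower nap time
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 400;   -- ~2x the effective default of 200
ALTER SYSTEM SET autovacuum_max_workers = 8;           -- up from 6; requires a server restart
SELECT pg_reload_conf();                               -- reloads the first two settings
```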