New commit detected on ManageIQ/manageiq/ivanchuk:

https://github.com/ManageIQ/manageiq/commit/7cbeb1dc06ca8aadc2bf14b63a31acbecd56fbbd
commit 7cbeb1dc06ca8aadc2bf14b63a31acbecd56fbbd
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Thu Aug 29 07:52:27 2019 -0400
Commit:     Gregg Tanzillo <gtanzill>
CommitDate: Thu Aug 29 07:52:27 2019 -0400

    Merge pull request #19219 from agrare/add_tool_to_cleanup_duplicate_host_guest_devices

    Add a tool to cleanup duplicate host guest_devices

    (cherry picked from commit 775ae0231932b28b637a1861e76019c44c3af640)

    https://bugzilla.redhat.com/show_bug.cgi?id=1767819

 tools/cleanup_duplicate_host_guest_devices.rb | 50 +
 1 file changed, 50 insertions(+)

https://github.com/ManageIQ/manageiq/commit/0af0f6a571d6180713e61429785206a5f318cf0f
commit 0af0f6a571d6180713e61429785206a5f318cf0f
Author:     Keenan Brock <keenan>
AuthorDate: Tue Oct 15 11:24:28 2019 -0400
Commit:     Keenan Brock <keenan>
CommitDate: Tue Oct 15 11:24:28 2019 -0400

    Merge pull request #19235 from agrare/turbo_button_for_guest_device_cleanup

    Make destroying guest_devices faster

    (cherry picked from commit 236cbdf91adad47c9e6ddd07f62848f482996098)

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1767819

 tools/cleanup_duplicate_host_guest_devices.rb | 25 +-
 1 file changed, 24 insertions(+), 1 deletion(-)
I used /var/www/miq/vmdb/tools/cleanup_duplicate_host_guest_devices.rb -e 1000000000041 --no-dry-run and that reduced the number of devices from:

irb(main):010:0> GuestDevice.count
=> 1621452

to

irb(main):022:0> GuestDevice.count
=> 526018

in a couple of minutes. So we are at roughly a third of the original value, but the number is still quite high. I am not sure whether we destroyed all the devices we should have.
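A cheap way to see whether duplicate rows remain, without another full run of the tool, is to count duplicate groups from the appliance Rails console. The grouping key below (hardware_id, device_type, uid_ems) is my guess and may not match what the tool actually deduplicates on:

# Sketch: count groups of guest devices that still look like duplicates.
# The grouping columns are an assumption, not taken from the cleanup tool.
dup_groups = GuestDevice.group(:hardware_id, :device_type, :uid_ems)
                        .having("COUNT(*) > 1")
                        .count

puts "groups with duplicates: #{dup_groups.size}"
puts "excess rows:            #{dup_groups.values.sum - dup_groups.size}"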
If you run it again does it find anything to delete?
Adam helped me with this. He found out that the remaining devices are associated with an archived host. I found out that attempting to delete that host apparently causes the whole relation to be loaded into memory, which is quite slow and certainly not scalable, so we cannot delete the remaining junk that way. Adam said there is a way to find the rest of the junk to be removed, and I conclude we may need to modify the tool to do a better job here. I will have to create an RFE for this.
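For the RFE, a minimal sketch of what the tool could do for the archived-host leftovers, run from the appliance Rails console. It assumes that archived hosts are those with a nil ems_id and that host guest devices hang off the host's hardware row; both assumptions would need checking against the real schema:

# Sketch only: remove guest devices of archived hosts without instantiating
# the relation in memory. Assumes archived hosts have ems_id == nil and that
# their guest devices reference the host's hardware via hardware_id.
archived_hardware_ids = Hardware.where(:host_id => Host.where(:ems_id => nil)).pluck(:id)

GuestDevice.where(:hardware_id => archived_hardware_ids).in_batches(:of => 10_000) do |batch|
  batch.delete_all # one SQL DELETE per batch, no callbacks, nothing loaded into Ruby
end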
I tried to display the archived host in the web UI and I got a Proxy error after a while of loading (and I think this may have caused my further problems). I was told that the main problem we are fixing here is that the high number of devices made provider refreshes stop working. With the db I got there are two RHV providers, each pointing to one of the customer's systems. I have no access to those, so I decided to work around this by pointing the providers at our own system (it may be too hackish, but I directed both of them to the same RHV system by fooling the hostname check using /etc/hosts).

One of the providers did refresh, though it took around 5 minutes, which is quite long. The other did not refresh for a long time, so I attempted to stop evmserverd with systemctl. The systemctl call was blocking, so I started killing the evm processes; I had to use kill -9. After a fresh start of evmserverd, both RHV providers were refreshed. From that I would conclude that running the system with ~500k devices seems risky, as some code may attempt to load them all into memory and that can block the worker. We need a tool to deal with the rest of the junk.
I am now checking again and I have problems with a non-responding worker:

[----] E, [2019-11-14T06:24:57.649031 #46671:2ab683c105c0] ERROR -- : MIQ(MiqServer#validate_worker) Worker [ManageIQ::Providers::Redhat::InfraManager::RefreshWorker] with ID: [1000000367905], PID: [51355], GUID: [49b3089a-e38e-419b-9e81-31a9c7b02f0e] has not responded in 128.822146317 seconds, restarting worker

In top, one of the processes grew steadily to this size:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
51403 root      27   7 6593580   5.3g   2228 D   0.3 45.9  26:55.71 ruby

The EVM status:

[root@CENSORED vmdb]# bin/rails evm:status
Checking EVM status...
 Region |  Zone   | Server | Status  |  PID  | SPID  | Workers | Version  |   Started   |  Heartbeat  | MB Usage | Roles
--------+---------+--------+---------+-------+-------+---------+----------+-------------+-------------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------
      1 | default | EVM*   | started | 46671 | 46814 |      20 | 5.11.1.0 | 09:59:17UTC | 11:38:32UTC |      229 | automate:database_operations:database_owner:ems_inventory:ems_operations:event:remote_console:reporting:scheduler:smartstate:user_interface:web_services

* marks a master appliance

             Type              |    Status    |  PID  | SPID  |       Queue       |   Started   |  Heartbeat  | MB Usage
-------------------------------+--------------+-------+-------+-------------------+-------------+-------------+-----------
 EventHandler                  | started      | 54211 | 54241 | ems               | 11:17:05UTC | 11:38:54UTC |  234/500
 Generic                       | started      | 47147 | 47253 | generic           | 09:59:20UTC | 11:38:53UTC |  368/500
 Generic                       | started      | 47156 | 47286 | generic           | 09:59:20UTC | 11:38:53UTC |  328/500
 Priority                      | started      | 47165 | 47216 | generic           | 09:59:20UTC | 11:38:54UTC |  311/600
 Priority                      | started      | 47174 | 47214 | generic           | 09:59:20UTC | 11:38:54UTC |  301/600
 Redhat::Infra::EventCatcher   | started      | 54673 | 54689 | ems_1000000000044 | 11:22:20UTC | 11:38:40UTC |  234/2048
 Redhat::Infra::EventCatcher   | started      | 54722 | 54743 | ems_1000000000041 | 11:22:38UTC | 11:38:54UTC |  230/2048
 Redhat::Infra::Refresh        | started      |  3432 |  3444 | ems_1000000000044 | 11:25:15UTC | 11:38:51UTC |  323/2048
 Redhat::Infra::Refresh        | stop pending | 51403 | 51444 | ems_1000000000041 | 10:50:07UTC | 10:50:08UTC | 2133/2048
 Redhat::Infra::Refresh        | started      | 52479 | 52494 | ems_1000000000041 | 10:58:32UTC | 10:58:35UTC |    5/2048
 Redhat::Network::EventCatcher | started      | 54730 | 54742 | ems_1000000000045 | 11:22:38UTC | 11:38:42UTC |  231/2048
 Redhat::Network::EventCatcher | started      | 57010 | 57967 | ems_1000000000042 | 11:23:12UTC | 11:38:45UTC |  233/2048
 Redhat::Network::Refresh      | started      | 57020 | 57758 | ems_1000000000045 | 11:23:12UTC | 11:38:48UTC |  289/2048
 Redhat::Network::Refresh      | started      | 60561 | 61017 | ems_1000000000042 | 11:23:17UTC | 11:38:48UTC |  290/2048
 RemoteConsole                 | started      | 54115 |       | http:5000         | 11:16:35UTC | 11:38:50UTC |  237/1024
 Reporting                     | started      |  4022 |  4033 | reporting         | 11:30:44UTC | 11:38:53UTC |  256/500
 Reporting                     | started      | 54220 | 54243 | reporting         | 11:17:05UTC | 11:38:52UTC |  302/500
 Schedule                      | started      | 47201 | 47252 |                   | 09:59:21UTC | 11:38:53UTC |  224/500
 Ui                            | started      | 53831 |       | http:3000         | 11:13:16UTC | 11:38:43UTC |  323/1024
 WebService                    | started      | 54229 |       | http:4000         | 11:17:05UTC | 11:38:44UTC |  326/1024

All rows have the values: Region=1, Zone=default, Server=EVM

I think the remedy tool is not yet helping as much as it should.
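As a side note, a similar memory picture can be pulled straight from the miq_workers table in the Rails console; the memory_usage column name below is how I remember the schema and may need adjusting:

# Sketch: list the workers using the most memory, straight from the miq_workers
# table. memory_usage (bytes) is an assumption about the column name.
MiqWorker.order(:memory_usage => :desc).limit(5)
         .pluck(:type, :pid, :status, :memory_usage)
         .each do |type, pid, status, mem|
  puts format("%-55s pid=%-6d %-12s %6.1f MB", type, pid, status, mem.to_f / 1.megabyte)
end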
I am checking with each provider redirected to a separate system in our lab. The ovirtm provider did refresh after I changed it to the default zone. The other one still fails to refresh.
In the comment above, one can easily match the evm:status output with the top output and see that the refresh worker process (PID 51403) grew quite large. I think this is proof that the remaining 526018 guest devices on the archived host are still an issue.
Yet another clue: I enabled the postgres query log (https://tableplus.com/blog/2018/10/how-to-show-queries-log-in-postgresql.html) and I see a line like this:

2019-11-14 07:13:43 EST:::1(45140):5dcd42cb.19ba:root@vmdb_production:[6586]:LOG: execute <unnamed>: SELECT "guest_devices".* FROM "guest_devices" WHERE "guest_devices"."parent_device_id" = $1 AND "guest_devices"."parent_device_id" IN (1000000000306, 1000000000307, 1000000000308, 1000000000309, 1000000000352, 1000000000354, 1000000000355, 1000000000356, 1000000000357, 1000000000358, ... 1000001635547, 1000001635548, 1000001635549, 1000001635550, 1000001635551, 1000001635552, 1000001635553)

The file containing that single log line is 7.6 MB.
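If I read this correctly (my interpretation, not confirmed in the code), the giant IN (...) list is the usual ActiveRecord pattern of bulk-loading child rows of the self-referential parent_device_id relation by passing every parent id in one array. A sketch of the pattern, and of a SQL-only alternative, reusing the hypothetical archived_hardware_ids from the earlier sketch:

# Illustration of the pattern only, not the actual ManageIQ code path.
parent_ids = GuestDevice.where(:hardware_id => archived_hardware_ids).pluck(:id)

# Passing ~500k ids as an array builds the multi-megabyte
# "parent_device_id IN (...)" statement seen in the postgres log and then
# instantiates every matching row in the worker process:
GuestDevice.where(:parent_device_id => parent_ids).to_a

# A SQL-only variant avoids both the huge literal id list and the object
# instantiation by using a subquery and delete_all:
GuestDevice.where(
  :parent_device_id => GuestDevice.where(:hardware_id => archived_hardware_ids).select(:id)
).delete_all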
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:4201
We are not going to cover this, as the devices were getting created as a result of a former CFME bug.