Bug 1767819 - unable to remove duplicate guest devices due to memory
Summary: unable to remove duplicate guest devices due to memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.10.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.11.1
Assignee: Boriso
QA Contact: Jaroslav Henner
Docs Contact: Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On: 1746600
Blocks:
 
Reported: 2019-11-01 13:33 UTC by Satoe Imaishi
Modified: 2022-07-09 10:57 UTC (History)
13 users

Fixed In Version: 5.11.1.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1746600
Environment:
Last Closed: 2019-12-13 00:35:40 UTC
Category: ---
Cloudforms Team: RHEVM
Target Upstream Version:
Embargoed:
simaishi: cfme-5.11.z+




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2019:4201 0 None None None 2019-12-13 00:35:48 UTC

Comment 2 CFME Bot 2019-11-01 19:32:58 UTC
New commit detected on ManageIQ/manageiq/ivanchuk:

https://github.com/ManageIQ/manageiq/commit/7cbeb1dc06ca8aadc2bf14b63a31acbecd56fbbd
commit 7cbeb1dc06ca8aadc2bf14b63a31acbecd56fbbd
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Thu Aug 29 07:52:27 2019 -0400
Commit:     Gregg Tanzillo <gtanzill>
CommitDate: Thu Aug 29 07:52:27 2019 -0400

    Merge pull request #19219 from agrare/add_tool_to_cleanup_duplicate_host_guest_devices

    Add a tool to cleanup duplicate host guest_devices

    (cherry picked from commit 775ae0231932b28b637a1861e76019c44c3af640)

    https://bugzilla.redhat.com/show_bug.cgi?id=1767819

 tools/cleanup_duplicate_host_guest_devices.rb | 50 +
 1 file changed, 50 insertions(+)
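The tool's approach, per the commit title, is to find duplicate host guest_devices and remove them. A minimal sketch of that idea in plain Ruby (the Struct fields and the grouping key are assumptions for illustration, not the actual schema or the tool's code):

```ruby
# Sketch of duplicate removal: group records by an identity key, keep the
# oldest row (lowest id) in each group, and collect the rest for deletion.
# Field names here are illustrative, not the real GuestDevice schema.
Device = Struct.new(:id, :host_id, :device_name, :uid_ems)

def duplicate_ids(devices)
  devices
    .group_by { |d| [d.host_id, d.device_name, d.uid_ems] }
    .flat_map { |_key, group| group.sort_by(&:id).drop(1).map(&:id) }
end

devices = [
  Device.new(1, 10, "nic0", "uid-a"),
  Device.new(2, 10, "nic0", "uid-a"),  # duplicate of id 1
  Device.new(3, 10, "nic1", "uid-b"),
  Device.new(4, 11, "nic0", "uid-a"),  # same name, different host: not a dup
]

duplicate_ids(devices)  # => [2]
```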


https://github.com/ManageIQ/manageiq/commit/0af0f6a571d6180713e61429785206a5f318cf0f
commit 0af0f6a571d6180713e61429785206a5f318cf0f
Author:     Keenan Brock <keenan>
AuthorDate: Tue Oct 15 11:24:28 2019 -0400
Commit:     Keenan Brock <keenan>
CommitDate: Tue Oct 15 11:24:28 2019 -0400

    Merge pull request #19235 from agrare/turbo_button_for_guest_device_cleanup

    Make destroying guest_devices faster

    (cherry picked from commit 236cbdf91adad47c9e6ddd07f62848f482996098)

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1767819

 tools/cleanup_duplicate_host_guest_devices.rb | 25 +-
 1 file changed, 24 insertions(+), 1 deletion(-)
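The second commit's title says it makes destroying guest_devices faster. A common way to speed up bulk removal, sketched here with an in-memory stand-in (this is illustrative, not the actual ManageIQ change), is to replace per-record destruction, which pays a per-row cost such as callbacks, with a single bulk delete:

```ruby
# Sketch: per-record destroy runs a callback for every row, while a bulk
# delete removes all matching rows in one operation. Store is a stand-in
# for the database table, not ManageIQ code.
class Store
  attr_reader :rows, :callbacks_run

  def initialize(rows)
    @rows = rows
    @callbacks_run = 0
  end

  def destroy_each(ids)   # slow path: one callback per row
    ids.each do |id|
      @rows.delete(id)
      @callbacks_run += 1
    end
  end

  def delete_all(ids)     # fast path: single bulk operation, no callbacks
    @rows -= ids
  end
end

slow = Store.new((1..5).to_a)
slow.destroy_each([2, 4])
slow.callbacks_run  # => 2

fast = Store.new((1..5).to_a)
fast.delete_all([2, 4])
fast.callbacks_run  # => 0
```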

Comment 3 Jaroslav Henner 2019-11-13 13:35:50 UTC
I used /var/www/miq/vmdb/tools/cleanup_duplicate_host_guest_devices.rb -e 1000000000041 --no-dry-run, which reduced the number of devices from:

irb(main):010:0> GuestDevice.count
=> 1621452

to

irb(main):022:0> GuestDevice.count
=> 526018

in a couple of minutes, so we are at about a third of the original value. But the number is still quite high, and I am not sure whether we destroyed all the devices we should have.
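A quick sanity check of the reported reduction (plain arithmetic, nothing CFME-specific):

```ruby
# Sanity check on the device counts reported above.
before  = 1_621_452
after   =   526_018
removed = before - after           # => 1095434 devices removed
(after.to_f / before).round(2)     # => 0.32, i.e. roughly one third remains
```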

Comment 4 Adam Grare 2019-11-13 14:26:03 UTC
If you run it again does it find anything to delete?

Comment 5 Jaroslav Henner 2019-11-13 16:40:27 UTC
Adam helped me with this. He found out that the remaining devices are associated with an archived host. I found out that attempting to delete that host apparently causes the whole relation to be loaded into memory, which is quite slow and certainly not scalable. Thus we cannot delete the remaining junk.

Adam said there is a way, though, to find the rest of the junk to be removed, and I conclude we may need to modify the tool to do a better job here. I will have to create an RFE for this.
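One general way to avoid loading a whole relation into memory is to work on primary keys in fixed-size batches, so no single query or in-memory collection grows with the total row count. A sketch under assumed names, with the database call simulated by a block (this is the batching idea, not the actual tool change):

```ruby
# Sketch: remove a large set of rows in bounded batches. Instead of
# materializing every associated record, iterate over ids in slices;
# each slice would become one bounded DELETE ... WHERE id IN (...) query.
def delete_in_batches(ids, batch_size: 1000)
  batches = 0
  ids.each_slice(batch_size) do |slice|
    yield slice          # stand-in for the per-batch DELETE statement
    batches += 1
  end
  batches
end

deleted = []
batch_count = delete_in_batches((1..2500).to_a, batch_size: 1000) do |slice|
  deleted.concat(slice)
end
batch_count   # => 3
deleted.size  # => 2500
```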

Comment 6 Jaroslav Henner 2019-11-13 20:49:31 UTC
I tried to display the archived host in the web UI and got a Proxy error after a while of loading (and I think this may have caused my further problems).

I was told that the main problem we are fixing here is that the high number of devices prevented provider refreshes from working.

The db I got contains two RHV providers, each pointing to a customer system. I have no access to those, so I decided to try working around this by pointing the providers at our own system (it may be too hackish, but I directed both of them to the same RHV system, fooling the hostname check via /etc/hosts). One of the providers did refresh, though it took about 5 minutes, which is quite long. The other didn't refresh for a long time, so I attempted to stop evmserverd with systemctl. The systemctl command blocked, so I started killing the evm processes; I had to use kill -9. After restarting evmserverd, both RHV providers refreshed.

From that I would conclude that running the system with ~500k devices is risky, as some code may attempt to load them all into memory, and this can block the worker. We need a tool to deal with the rest of the junk.

Comment 7 Jaroslav Henner 2019-11-14 11:32:48 UTC
I am now checking again and I have problems with a non-responding worker:
[----] E, [2019-11-14T06:24:57.649031 #46671:2ab683c105c0] ERROR -- : MIQ(MiqServer#validate_worker) Worker [ManageIQ::Providers::Redhat::InfraManager::RefreshWorker] with ID: [1000000367905], PID: [51355], GUID: [49b3089a-e38e-419b-9e81-31a9c7b02f0e] has not responded in 128.822146317 seconds, restarting worker

In top, one of the processes grew steadily to this size:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                             
51403 root      27   7 6593580   5.3g   2228 D   0.3  45.9  26:55.71 ruby                                

The EVM status:
[root@CENSORED vmdb]# bin/rails evm:status
Checking EVM status...
 Region | Zone    | Server | Status  |   PID |  SPID | Workers | Version  | Started     | Heartbeat   | MB Usage | Roles
--------+---------+--------+---------+-------+-------+---------+----------+-------------+-------------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------
      1 | default | EVM*   | started | 46671 | 46814 |      20 | 5.11.1.0 | 09:59:17UTC | 11:38:32UTC |      229 | automate:database_operations:database_owner:ems_inventory:ems_operations:event:remote_console:reporting:scheduler:smartstate:user_interface:web_services

* marks a master appliance

 Type                          | Status       |   PID | SPID  | Queue             | Started     | Heartbeat   | MB Usage
-------------------------------+--------------+-------+-------+-------------------+-------------+-------------+-----------
 EventHandler                  | started      | 54211 | 54241 | ems               | 11:17:05UTC | 11:38:54UTC | 234/500
 Generic                       | started      | 47147 | 47253 | generic           | 09:59:20UTC | 11:38:53UTC | 368/500
 Generic                       | started      | 47156 | 47286 | generic           | 09:59:20UTC | 11:38:53UTC | 328/500
 Priority                      | started      | 47165 | 47216 | generic           | 09:59:20UTC | 11:38:54UTC | 311/600
 Priority                      | started      | 47174 | 47214 | generic           | 09:59:20UTC | 11:38:54UTC | 301/600
 Redhat::Infra::EventCatcher   | started      | 54673 | 54689 | ems_1000000000044 | 11:22:20UTC | 11:38:40UTC | 234/2048
 Redhat::Infra::EventCatcher   | started      | 54722 | 54743 | ems_1000000000041 | 11:22:38UTC | 11:38:54UTC | 230/2048
 Redhat::Infra::Refresh        | started      |  3432 | 3444  | ems_1000000000044 | 11:25:15UTC | 11:38:51UTC | 323/2048
 Redhat::Infra::Refresh        | stop pending | 51403 | 51444 | ems_1000000000041 | 10:50:07UTC | 10:50:08UTC | 2133/2048
 Redhat::Infra::Refresh        | started      | 52479 | 52494 | ems_1000000000041 | 10:58:32UTC | 10:58:35UTC | 5/2048
 Redhat::Network::EventCatcher | started      | 54730 | 54742 | ems_1000000000045 | 11:22:38UTC | 11:38:42UTC | 231/2048
 Redhat::Network::EventCatcher | started      | 57010 | 57967 | ems_1000000000042 | 11:23:12UTC | 11:38:45UTC | 233/2048
 Redhat::Network::Refresh      | started      | 57020 | 57758 | ems_1000000000045 | 11:23:12UTC | 11:38:48UTC | 289/2048
 Redhat::Network::Refresh      | started      | 60561 | 61017 | ems_1000000000042 | 11:23:17UTC | 11:38:48UTC | 290/2048
 RemoteConsole                 | started      | 54115 |       | http:5000         | 11:16:35UTC | 11:38:50UTC | 237/1024
 Reporting                     | started      |  4022 | 4033  | reporting         | 11:30:44UTC | 11:38:53UTC | 256/500
 Reporting                     | started      | 54220 | 54243 | reporting         | 11:17:05UTC | 11:38:52UTC | 302/500
 Schedule                      | started      | 47201 | 47252 |                   | 09:59:21UTC | 11:38:53UTC | 224/500
 Ui                            | started      | 53831 |       | http:3000         | 11:13:16UTC | 11:38:43UTC | 323/1024
 WebService                    | started      | 54229 |       | http:4000         | 11:17:05UTC | 11:38:44UTC | 326/1024

All rows have the values: Region=1, Zone=default, Server=EVM


I think the remedy tool is not yet helping as it should.

I am checking with each provider redirected to a separate system in our lab. The ovirtm provider did refresh after I changed it to the default zone. The other one still fails to refresh.

Comment 8 Jaroslav Henner 2019-11-14 11:54:40 UTC
In the comment above, one can easily match the evm:status output against the top output and see that the refresh worker's process grew quite large. I think this is proof that the remaining 526018 GuestDevice records on the archived host are still an issue.

Comment 9 Jaroslav Henner 2019-11-14 12:21:51 UTC
Yet another clue: I enabled the postgres query log (https://tableplus.com/blog/2018/10/how-to-show-queries-log-in-postgresql.html) and I see this line:

2019-11-14 07:13:43 EST:::1(45140):5dcd42cb.19ba:root@vmdb_production:[6586]:LOG:  execute <unnamed>: SELECT "guest_devices".* FROM "guest_devices" WHERE "guest_devices"."parent_device_id" = $1 AND "guest_devices"."parent_device_id" IN (1000000000306, 1000000000307, 1000000000308, 1000000000309, 1000000000352, 1000000000354, 1000000000355, 1000000000356, 1000000000357, 1000000000358,
...
1000001635547, 1000001635548, 1000001635549, 1000001635550, 1000001635551, 1000001635552, 1000001635553)

The file containing that single log line is 7.6 MB.

Comment 21 errata-xmlrpc 2019-12-13 00:35:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:4201

Comment 22 Jaroslav Henner 2020-07-13 20:52:12 UTC
We are not going to cover this, as the devices were created as a result of a former CFME bug.

