Bug 1767866
Summary: | Resource Pools for 1 RHV Infra provider no longer have any Clusters, Hosts or VMs | |
---|---|---|---
Product: | Red Hat CloudForms Management Engine | Reporter: | mheppler
Component: | Appliance | Assignee: | Gregg Tanzillo <gtanzill>
Status: | CLOSED DUPLICATE | QA Contact: | Angelina Vasileva <anikifor>
Severity: | high | Docs Contact: | Red Hat CloudForms Documentation <cloudforms-docs>
Priority: | high | |
Version: | 5.10.11 | CC: | abellott, agrare, anikifor, dmetzger, gtanzill, jfrey, jhardy, jrafanie, obarenbo, sigbjorn.lie, sigbjorn
Target Milestone: | GA | |
Target Release: | cfme-future | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-11-11 15:17:57 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
mheppler
2019-11-01 14:44:56 UTC
> [----] E, [2019-10-10T08:58:16.142975 #35944:986f5c] ERROR -- : Error when saving InventoryCollection:<ManageIQ::Providers::Redhat::InfraManager::Host> with strategy: , saver_strategy: default, targeted: false. Message:
> PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK
This sounds like an appliance/database connection issue, +Gregg Tanzillo
Swap is severely invaded... this might be causing the timeouts and various errors. Redhat::InfraManager::RefreshWorker is growing to around 11 GB. Adam, can you take a look? Please see comment #4 through #7.

Was this recently upgraded by any chance? There was a bug in previous versions that created a huge number of duplicate guest_devices, and after upgrading, trying to clean these up takes a lot of time and memory, and the worker often gets killed. Try running tools/cleanup_duplicate_host_guest_devices.rb. Also, it looks like this appliance is running inventory, metrics, operations, notifier, reporting, UI, API, and websocket roles all on the same box, leading to 49 of our MIQ processes at one time.

It looks like the system was updated to 5.10.6 on July 2nd and then upgraded to 5.10.11.0 after this issue occurred.

See comment #9. This system was configured with too many roles/workers, and the Red Hat refresh worker is using up to 11 GB of memory. Can we investigate the guest devices issue that Adam mentioned in that comment? Also, they'll need to redistribute the roles more evenly, as the system keeps invading swap; running closer to 15-30 MIQ workers would probably improve performance significantly.

Please, can you be more specific? I can ask the customer for anything, but I do not understand what exactly is needed.

According to the diagnostics, the Red Hat refresh worker is growing to 11 GB RSS. According to comment #9, this could be a previously seen issue where there are many duplicate guest devices that we then can't delete without exceeding memory in the worker. Please check this out. The script is available in the master/ivanchuk/hammer branches of manageiq: https://github.com/ManageIQ/manageiq/blob/hammer/tools/cleanup_duplicate_host_guest_devices.rb Additionally, the roles are misconfigured, with too many roles on the same appliance. There were ~49 MIQ processes running on the same appliance, all competing with each other and with the bloated refresh worker for system resources.
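The cleanup script's core idea, grouping guest devices by their identifying attributes and destroying all but one record per group, can be sketched in plain Ruby. This is illustrative only: the real tools/cleanup_duplicate_host_guest_devices.rb works against the ManageIQ database via ActiveRecord, and the attribute names used for the uniqueness key here are assumptions for the example, not the script's actual schema.

```ruby
# Illustrative sketch of the deduplication approach: group records by a
# uniqueness key and collect the ids of every record after the first in
# each group. Attribute names are assumed for the example.
GuestDevice = Struct.new(:id, :host_id, :device_name, :uid_ems)

def duplicate_ids(devices)
  devices
    .group_by { |d| [d.host_id, d.device_name, d.uid_ems] } # assumed uniqueness key
    .values
    .flat_map { |group| group.sort_by(&:id).drop(1).map(&:id) } # keep the oldest record
end

devices = [
  GuestDevice.new(1, 10, "nic0", "abc"),
  GuestDevice.new(2, 10, "nic0", "abc"), # duplicate of id 1
  GuestDevice.new(3, 10, "nic1", "def"),
]
duplicate_ids(devices) # => [2]
```

The production script additionally destroys the duplicates in slices ("Destroying slice 1 of 10560..." below) so that a single transaction never holds millions of rows in memory.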
There should be few enough processes that there is still free memory available on the system to handle any spikes in memory usage.

The cleanup_duplicate_host_guest_devices.rb script fails:

> vmdb]# tools/cleanup_duplicate_host_guest_devices.rb --ems-name=MRO-RHV1 --no-dry-run
> Found 1055980 duplicate Guest Devices...
> **** THIS WILL MODIFY YOUR DATABASE ****
> Press Enter to Continue:
> Destroying slice 1 of 10560...
> /opt/rh/cfme-gemset/gems/activerecord-5.0.7.2/lib/active_record/reflection.rb:173:in `join_keys': wrong number of arguments (given 0, expected 1) (ArgumentError)
>         from tools/cleanup_duplicate_host_guest_devices.rb:59:in `block (2 levels) in <main>'
>         from tools/cleanup_duplicate_host_guest_devices.rb:57:in `each'
>         from tools/cleanup_duplicate_host_guest_devices.rb:57:in `block in <main>'
>         from tools/cleanup_duplicate_host_guest_devices.rb:53:in `each'
>         from tools/cleanup_duplicate_host_guest_devices.rb:53:in `each_slice'
>         from tools/cleanup_duplicate_host_guest_devices.rb:53:in `with_index'
>         from tools/cleanup_duplicate_host_guest_devices.rb:53:in `<main>'

On 5.10 you need this fix for an ActiveRecord API change: https://github.com/ManageIQ/manageiq/pull/19447 You can `wget https://raw.githubusercontent.com/ManageIQ/manageiq/3062fcaecccb3f01474ed9be43f4e082fbb6338a/tools/cleanup_duplicate_host_guest_devices.rb`

That worked fine, thanks. I ran this on both RHV providers; both had a high number of duplicated devices to remove:

> Destroyed 953680 duplicate Guest Devices
> Destroyed 1625939 duplicate Guest Devices

The Resource Pools for the provider with the highest number of duplicated guest devices now have VMs attached again. Thank you.

Awesome. In that case, @jrafanie, I recommend closing this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1746600

Thank you for the updates. I'm marking this as a duplicate of the bug Adam mentioned.
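The ArgumentError above comes from a method whose arity changed between ActiveRecord versions: in the 5.0 series bundled with CFME 5.10, `join_keys` takes one argument, while later series dropped it, which is why a script written against the newer API raises "wrong number of arguments (given 0, expected 1)" on this appliance. A version-tolerant call can be sketched by branching on arity. This is an illustrative sketch with stand-in objects, not the actual change made in the linked PR.

```ruby
# Illustrative: calling a method compatibly when its arity changed across
# library versions. Stand-in structs model the old and new reflection APIs;
# they are not real ActiveRecord classes.
def compat_join_keys(reflection, association_klass)
  if reflection.method(:join_keys).arity.zero?
    reflection.join_keys                    # newer ActiveRecord: no argument
  else
    reflection.join_keys(association_klass) # ActiveRecord 5.0 (CFME 5.10)
  end
end

OldReflection = Struct.new(:name) do
  def join_keys(klass)
    "old-api keys for #{klass}"
  end
end

NewReflection = Struct.new(:name) do
  def join_keys
    "new-api keys"
  end
end

compat_join_keys(OldReflection.new(:host), "Host") # => "old-api keys for Host"
compat_join_keys(NewReflection.new(:host), "Host") # => "new-api keys"
```

In practice the simpler route, as taken here, is to fetch the script revision that matches your appliance's ActiveRecord version rather than shim it.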
Please keep in mind, it's still not recommended to have all or most roles enabled on the same appliance, especially when you have worker processes collecting inventory and metrics from providers. Once things are stable, please ensure there is plenty of free system memory on the appliance after all the workers have been running for a few days. Keep an eye on swap usage: you may need to shift roles to different appliances or decrease the number of workers if swap is often in use. Frequent swapping will severely hurt the performance of all tasks on the appliance. Support can help you adjust roles to better balance the load.

*** This bug has been marked as a duplicate of bug 1746600 ***

Thank you for the advice. The roles are currently configured this way for debugging purposes. This single error made the system swap, as the refresh worker did not stop consuming memory; usually there is enough memory available and swap is not used.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days