Bug 1767866
Summary: | Resource Pools for 1 RHV Infra provider no longer have any Clusters, Hosts or VMs | |
---|---|---|---
Product: | Red Hat CloudForms Management Engine | Reporter: | mheppler
Component: | Appliance | Assignee: | Gregg Tanzillo <gtanzill>
Status: | CLOSED DUPLICATE | QA Contact: | Angelina Vasileva <anikifor>
Severity: | high | Docs Contact: | Red Hat CloudForms Documentation <cloudforms-docs>
Priority: | high | |
Version: | 5.10.11 | CC: | abellott, agrare, anikifor, dmetzger, gtanzill, jfrey, jhardy, jrafanie, obarenbo, sigbjorn.lie, sigbjorn
Target Milestone: | GA | |
Target Release: | cfme-future | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-11-11 15:17:57 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
mheppler
2019-11-01 14:44:56 UTC
> [----] E, [2019-10-10T08:58:16.142975 #35944:986f5c] ERROR -- : Error when saving InventoryCollection:<ManageIQ::Providers::Redhat::InfraManager::Host> with strategy: , saver_strategy: default, targeted: false. Message:
> PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK
This sounds like an appliance/database connection issue, +Gregg Tanzillo
Swap is severely invaded... this might be causing the timeouts and various errors. Redhat::InfraManager::RefreshWorker is growing to around 11 GB. Adam, can you take a look? Please see comment #4 through #7.

Was this recently upgraded by any chance? There was a bug in previous versions that created a huge number of duplicate guest_devices, and after upgrading, trying to clean these up takes a lot of time and memory, and the worker often gets killed. Try running tools/cleanup_duplicate_host_guest_devices.rb. Also, it looks like this appliance is running inventory, metrics, operations, notifier, reporting, UI, API, and websocket roles all on the same box, leading to 49 of our MIQ processes at one time.

It looks like the system was updated to 5.10.6 on July 2nd and then upgraded to 5.10.11.0 after this issue occurred.

See comment #9. This system was configured with too many roles/workers, and the Red Hat refresh worker is using up to 11 GB of memory. Can we investigate the guest devices issue that Adam mentioned in that comment? Also, they'll need to redistribute the roles more evenly, as the system keeps invading swap; running closer to 15-30 MIQ workers would probably improve performance significantly.

Please, can you be more specific? I can ask the customer for anything, but I do not understand what exactly is needed.

According to the diagnostics, the Red Hat refresh worker is growing to 11 GB RSS. According to comment #9, this could be a previously seen issue where there are many duplicate guest devices that we then can't delete without exceeding memory in the worker. Please check this out. The script is available in the master/ivanchuk/hammer branches of manageiq: https://github.com/ManageIQ/manageiq/blob/hammer/tools/cleanup_duplicate_host_guest_devices.rb Additionally, the roles are misconfigured, with too many roles on the same appliance. There were ~49 MIQ processes running on the same appliance, all competing with each other and with the bloated refresh worker for system resources.
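The cleanup script's core idea, grouping guest devices by their identifying attributes and destroying all but one record per group, can be sketched in plain Ruby. This is illustrative only: the real tools/cleanup_duplicate_host_guest_devices.rb works against the ManageIQ database via ActiveRecord, and the attribute names used for the uniqueness key here are assumptions for the example, not the script's actual schema.

```ruby
# Illustrative sketch of the deduplication approach: group records by a
# uniqueness key and collect the ids of every record after the first in
# each group. Attribute names are assumed for the example.
GuestDevice = Struct.new(:id, :host_id, :device_name, :uid_ems)

def duplicate_ids(devices)
  devices
    .group_by { |d| [d.host_id, d.device_name, d.uid_ems] } # assumed uniqueness key
    .values
    .flat_map { |group| group.sort_by(&:id).drop(1).map(&:id) } # keep the oldest record
end

devices = [
  GuestDevice.new(1, 10, "nic0", "abc"),
  GuestDevice.new(2, 10, "nic0", "abc"), # duplicate of id 1
  GuestDevice.new(3, 10, "nic1", "def"),
]
duplicate_ids(devices) # => [2]
```

The production script additionally destroys the duplicates in slices ("Destroying slice 1 of 10560..." below) so that a single transaction never holds millions of rows in memory.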
There should be few enough processes that there is still free memory available on the system to handle any spikes in memory usage.

The cleanup_duplicate_host_guest_devices.rb script fails:

> vmdb]# tools/cleanup_duplicate_host_guest_devices.rb --ems-name=MRO-RHV1 --no-dry-run
> Found 1055980 duplicate Guest Devices...
> **** THIS WILL MODIFY YOUR DATABASE ****
> Press Enter to Continue:
> Destroying slice 1 of 10560...
> /opt/rh/cfme-gemset/gems/activerecord-5.0.7.2/lib/active_record/reflection.rb:173:in `join_keys': wrong number of arguments (given 0, expected 1) (ArgumentError)
>         from tools/cleanup_duplicate_host_guest_devices.rb:59:in `block (2 levels) in <main>'
>         from tools/cleanup_duplicate_host_guest_devices.rb:57:in `each'
>         from tools/cleanup_duplicate_host_guest_devices.rb:57:in `block in <main>'
>         from tools/cleanup_duplicate_host_guest_devices.rb:53:in `each'
>         from tools/cleanup_duplicate_host_guest_devices.rb:53:in `each_slice'
>         from tools/cleanup_duplicate_host_guest_devices.rb:53:in `with_index'
>         from tools/cleanup_duplicate_host_guest_devices.rb:53:in `<main>'

On 5.10 you need this fix for an ActiveRecord API change: https://github.com/ManageIQ/manageiq/pull/19447 You can `wget https://raw.githubusercontent.com/ManageIQ/manageiq/3062fcaecccb3f01474ed9be43f4e082fbb6338a/tools/cleanup_duplicate_host_guest_devices.rb`

That worked fine, thanks. I ran this on both RHV providers; both had a high number of duplicated devices to remove:

> Destroyed 953680 duplicate Guest Devices
> Destroyed 1625939 duplicate Guest Devices

The Resource Pools for the provider with the highest number of duplicated guest devices now have VMs attached again. Thank you.

Awesome. In that case, @jrafanie, I recommend closing this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1746600

Thank you for the updates. I'm marking this as a duplicate of the bug Adam mentioned.
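The ArgumentError above comes from a method whose arity changed between ActiveRecord versions: in the 5.0 series bundled with CFME 5.10, `join_keys` takes one argument, while later series dropped it, which is why a script written against the newer API raises "wrong number of arguments (given 0, expected 1)" on this appliance. A version-tolerant call can be sketched by branching on arity. This is an illustrative sketch with stand-in objects, not the actual change made in the linked PR.

```ruby
# Illustrative: calling a method compatibly when its arity changed across
# library versions. Stand-in structs model the old and new reflection APIs;
# they are not real ActiveRecord classes.
def compat_join_keys(reflection, association_klass)
  if reflection.method(:join_keys).arity.zero?
    reflection.join_keys                    # newer ActiveRecord: no argument
  else
    reflection.join_keys(association_klass) # ActiveRecord 5.0 (CFME 5.10)
  end
end

OldReflection = Struct.new(:name) do
  def join_keys(klass)
    "old-api keys for #{klass}"
  end
end

NewReflection = Struct.new(:name) do
  def join_keys
    "new-api keys"
  end
end

compat_join_keys(OldReflection.new(:host), "Host") # => "old-api keys for Host"
compat_join_keys(NewReflection.new(:host), "Host") # => "new-api keys"
```

In practice the simpler route, as taken here, is to fetch the script revision that matches your appliance's ActiveRecord version rather than shim it.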
Please keep in mind, it's still not recommended to have all or most roles enabled on the same appliance, especially when you have worker processes collecting inventory and metrics from providers. Once things are stable, please ensure there is plenty of free system memory on the appliance after all the workers have been running for a few days. Keep an eye on swap usage: you may need to shift roles to different appliances or decrease the number of workers if swap is often in use. Frequent swapping will severely hurt the performance of all tasks on the appliance. Support can help you adjust roles to better balance the load.

*** This bug has been marked as a duplicate of bug 1746600 ***

Thank you for the advice. The roles are currently configured this way for debugging purposes. This single error made the system swap, as the refresh worker did not stop consuming memory; usually there is enough memory available and swap is not used.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days