Summarizing the previous private comment from last night: looking at the logs, we're seeing occurrences where the UI worker servicing "/vm_infra/report_data" requests runs for well over a minute and exceeds the memory threshold. The UI worker is eventually killed by the server:

[----] W, [2019-07-25T13:00:41.034781 #31695:71ebfc] WARN -- : MIQ(MiqUiWorker::Runner#log_long_running_requests) Long running http(s) request: '/vm_infra/report_data' handled by #31695:3cf7580, running for 60.64 seconds
...
[----] W, [2019-07-25T13:00:57.110946 #13751:486f5c] WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqUiWorker] with ID: [XXX], PID: [XXX], GUID: [XXX] process memory usage [5233468000] exceeded limit [1073741824], requesting worker to exit

The last entry for this PID in the top log shows 1.9g of memory usage:

31695.pid:./evm_current_XXX_20190725_181201/log/top_output.log:31695 13751 root 21 1 2335996 1.9g 4436 S 4.6 3.0 1:33.36 puma 3.7.1 (tcp://127.0.0.1:3008) [MIQ: Web Server Worker]

1) We might want to see whether we have this database, or can get a similar one, for testing internally.
2) Regardless, we should ask them to do the following: set up an appliance (production or a new one) and run this again with Rails debug logging, which will help us isolate the issue.
They would need to:
* ensure the User Interface and Web Services roles are enabled on this appliance
* change "level_rails" from info to debug in the log section of the Advanced settings
* once debug is saved, use tail -f log/production.log to confirm it starts logging DEBUG messages (this may take a minute or two)
* once it starts logging DEBUG messages, copy off log/production.log and log/evm.log, then truncate these logs to keep their size down
* try to load the child tenant VMs screen to recreate the reported issue
* provide the logs from this appliance 5-10 minutes later
* then put the level_rails setting back to info, save, and verify that no more DEBUG messages appear after several seconds
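The log triage and the capture steps above can be scripted. The following is a minimal bash sketch, not a supported tool. It assumes the appliance's Rails root is /var/www/miq/vmdb, that both WARN messages quoted above land in log/evm.log in exactly that format, and that GNU grep, sed, awk and coreutils are available. The function names are hypothetical helpers, not part of ManageIQ, and changing level_rails itself still happens in the Advanced settings.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: Rails root is /var/www/miq/vmdb (override via arguments).

# Print "<path> <seconds>" for each "Long running http(s) request" warning.
# Reads the files given as arguments, or stdin when none are given.
long_running_requests() {
  grep -h -o "Long running http(s) request: '[^']*' handled by [^,]*, running for [0-9.]* seconds" "$@" |
    sed -E "s/.*request: '([^']*)'.*running for ([0-9.]+) seconds/\1 \2/"
}

# Print "<worker> usage=<bytes> limit=<bytes> ratio=<usage/limit>" for each
# MiqServer#validate_worker "exceeded limit" warning.
memory_limit_exceeded() {
  sed -E -n 's/.*Worker \[([^]]*)\].*process memory usage \[([0-9]+)\] exceeded limit \[([0-9]+)\].*/\1 \2 \3/p' "$@" |
    awk '{ printf "%s usage=%s limit=%s ratio=%.1f\n", $1, $2, $3, $2 / $3 }'
}

# Poll production.log once per second, for up to max_tries seconds,
# until it contains DEBUG lines (i.e. the level_rails change took effect).
wait_for_debug() {
  local log=${1:-/var/www/miq/vmdb/log/production.log}
  local max_tries=${2:-180}
  local i
  for ((i = 0; i < max_tries; i++)); do
    grep -q DEBUG "$log" 2>/dev/null && return 0
    sleep 1
  done
  return 1
}

# Copy production.log and evm.log aside, then truncate the originals in place
# (rather than deleting them) so processes holding them open keep logging to them.
snapshot_and_truncate_logs() {
  local log_dir=${1:-/var/www/miq/vmdb/log}
  local dest=${2:-$HOME/log_snapshot_$(date +%Y%m%d%H%M%S)}
  local f
  mkdir -p "$dest" || return 1
  for f in production.log evm.log; do
    cp -p "$log_dir/$f" "$dest/$f" || return 1
    truncate -s 0 "$log_dir/$f" || return 1
  done
  printf '%s\n' "$dest"
}
```

For example, after switching level_rails to debug, `wait_for_debug && snapshot_and_truncate_logs` saves and clears the logs before the issue is reproduced; afterwards, `long_running_requests /var/www/miq/vmdb/log/evm.log | awk '$2 >= 60'` lists requests that ran for at least a minute, and `memory_limit_exceeded` on the same file shows how far each killed worker was over its limit.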
Tuan,

Based on what we see in the appliance and PG logs, there is clearly a performance issue causing the VM list view to fail to load. We are trying to understand the specific cause; however, with the limited diagnostic information we have, we can only speculate. We would need the customer's database (which you've already requested) to fully troubleshoot this issue.

We do have a fairly strong theory about one contributor to this performance problem. We think that including the last compliance status in the list view causes a huge amount of data to come back from SQL. Based on other things we've seen, we think they have a very large number of rows in the compliances table. We would expect the number of compliance records to differ significantly between the regions where the view loads and the regions where it doesn't.

If the customer needs a short-term fix while we continue to troubleshoot, they can remove the last compliance status from the view by editing a file on the appliance. These are the instructions:

On both UI appliances, the file named VmOrTemplate.yaml needs to be edited to remove the offending column:

    vmdb
    cd `bundle show manageiq-ui-classic`/product/views
    # make a backup copy of the file first
    cp -p VmOrTemplate.yaml VmOrTemplate.yaml.org
    vi VmOrTemplate.yaml
    # remove all occurrences of "last_compliance_status" and its header "Compliant"

This is what the diff should look like after the change:

    diff VmOrTemplate.yaml{.org,}
    25d24
    < - last_compliance_status
    58d56
    < - last_compliance_status
    72d69
    < - Compliant
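If editing with vi on both appliances is impractical, the same edit can be done non-interactively. This is a hedged sketch, assuming GNU sed and assuming the entries appear exactly as in the diff above (YAML list items "- last_compliance_status" and "- Compliant"); remove_compliance_column is a hypothetical helper name. Compare the result against the expected diff before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: non-interactive version of the manual VmOrTemplate.yaml edit.
# Deletes every YAML list item that is exactly "last_compliance_status" or
# "Compliant". Assumes GNU sed (for -i without a suffix argument).
remove_compliance_column() {
  local view=${1:-VmOrTemplate.yaml}
  [ -f "$view" ] || { echo "no such file: $view" >&2; return 1; }
  # Keep a .org backup like the manual steps, but never overwrite an existing
  # one, so a second run cannot clobber the pristine copy.
  [ -e "$view.org" ] || cp -p "$view" "$view.org" || return 1
  sed -i \
    -e '/^[[:space:]]*-[[:space:]]*last_compliance_status[[:space:]]*$/d' \
    -e '/^[[:space:]]*-[[:space:]]*Compliant[[:space:]]*$/d' \
    "$view"
}
```

Run it from the product/views directory, then check with `diff VmOrTemplate.yaml{.org,}`; only the three deletions shown above should appear.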
Hi Tuan,

Here's the latest status on this issue: we are in the process of creating a fix that will allow the VM views to render in an acceptable amount of time without removing the "Compliant" column. We do not yet have an ETA for a hotfix containing this change.

In the meantime, we have a workaround that removes the "Compliant" column from the affected views to alleviate the issue. We have successfully tested this with the customer's DB. Here are the steps necessary to apply the workaround (note: the patch step below should quietly reject an already applied change):

    yum install patch
    cd ~
    wget https://github.com/ManageIQ/manageiq-ui-classic/compare/hammer...jrafanie:remove_compliance_from_vm_template_accordions.patch
    cd `bundle show manageiq-ui-classic`
    patch -p1 -N < ~/hammer...jrafanie\:remove_compliance_from_vm_template_accordions.patch
    systemctl restart evmserverd

To reverse the patch, if necessary:

    patch --reverse -p1 < ~/hammer...jrafanie\:remove_compliance_from_vm_template_accordions.patch
    systemctl restart evmserverd
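Before or after running the steps above, it can help to check whether the workaround patch is already present on an appliance. This is a minimal sketch, assuming GNU patch (installed by the yum step); patch_state is a hypothetical helper name. It only performs a reverse dry run and then a forward dry run, so no files are modified, and it uses --fuzz=0 so a short hunk cannot match in the wrong place.

```shell
#!/usr/bin/env bash
# Sketch: print "applied", "applicable" or "conflict" for a -p1 patch,
# using only --dry-run calls so the tree is never modified.
patch_state() {
  local patch_file=$1 dir=${2:-.}
  [ -r "$patch_file" ] || { echo "cannot read: $patch_file" >&2; return 2; }
  # Redirect stdin outside the subshell so a relative patch path still resolves
  # against the caller's directory, not against $dir.
  if (cd "$dir" && patch -p1 -R -f -s --dry-run --fuzz=0) < "$patch_file" >/dev/null 2>&1; then
    echo applied      # reversing succeeds, so the change is already in place
  elif (cd "$dir" && patch -p1 -f -s --dry-run --fuzz=0) < "$patch_file" >/dev/null 2>&1; then
    echo applicable   # applies cleanly in the forward direction
  else
    echo conflict     # neither direction matches the files in $dir
  fi
}
```

Run it from the `bundle show manageiq-ui-classic` directory, passing the same downloaded .patch file used in the patch command. "conflict" means neither direction matches the installed files, for example because the installed manageiq-ui-classic differs from the code the patch was made against.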
https://github.com/ManageIQ/manageiq-ui-classic/pull/5919
New commit detected on ManageIQ/manageiq-ui-classic/master:

https://github.com/ManageIQ/manageiq-ui-classic/commit/174f9beb37e3a17145312f089c4b79011f149858
commit 174f9beb37e3a17145312f089c4b79011f149858
Author: Keenan Brock <keenan>
AuthorDate: Wed Jul 31 11:46:42 2019 -0400
Commit: Keenan Brock <keenan>
CommitDate: Wed Jul 31 11:46:42 2019 -0400

    views: remove unneeded compliances join

    For some databases, this brings back way too many records.
    It is unnecessary. And if it were necessary, it will get automatically anyway

    See also https://github.com/ManageIQ/manageiq-ui-classic/pull/5283

    https://bugzilla.redhat.com/show_bug.cgi?id=1733351

 product/views/ManageIQ_Providers_CloudManager_Template-all_vms_and_templates.yaml | 1 -
 product/views/MiqTemplate.yaml | 1 -
 product/views/ProvisionCloudTemplates.yaml | 1 -
 product/views/Vm-all_vms.yaml | 1 -
 product/views/VmOrTemplate-all_archived.yaml | 1 -
 5 files changed, 5 deletions(-)
To fix this:

This is merged and needs to be backported:
https://github.com/ManageIQ/manageiq/pull/17475

These need to be merged and backported:
https://github.com/ManageIQ/manageiq-ui-classic/pull/5919
https://github.com/ManageIQ/manageiq-ui-classic/pull/5926
New commit detected on ManageIQ/manageiq-ui-classic/ivanchuk:

https://github.com/ManageIQ/manageiq-ui-classic/commit/3e8082d63a70e9911b6c75b7551ae22cccd4fb2b
commit 3e8082d63a70e9911b6c75b7551ae22cccd4fb2b
Author: Milan Zázrivec <mzazrivec>
AuthorDate: Thu Aug 1 04:09:29 2019 -0400
Commit: Milan Zázrivec <mzazrivec>
CommitDate: Thu Aug 1 04:09:29 2019 -0400

    Merge pull request #5919 from kbrock/fixup_vm_includes

    views: remove unneeded compliances join

    (cherry picked from commit fc8c4911aa5c6a358566078ab26dc19f2d6d253b)

    https://bugzilla.redhat.com/show_bug.cgi?id=1733351

 product/views/ManageIQ_Providers_CloudManager_Template-all_vms_and_templates.yaml | 1 -
 product/views/MiqTemplate.yaml | 1 -
 product/views/ProvisionCloudTemplates.yaml | 1 -
 product/views/Vm-all_vms.yaml | 1 -
 product/views/VmOrTemplate-all_archived.yaml | 1 -
 5 files changed, 5 deletions(-)
New commit detected on ManageIQ/manageiq-ui-classic/ivanchuk:

https://github.com/ManageIQ/manageiq-ui-classic/commit/87ce65b6c7cb64ce576d0b3afaa521c32718fb67
commit 87ce65b6c7cb64ce576d0b3afaa521c32718fb67
Author: Milan Zázrivec <mzazrivec>
AuthorDate: Fri Aug 2 05:06:09 2019 -0400
Commit: Milan Zázrivec <mzazrivec>
CommitDate: Fri Aug 2 05:06:09 2019 -0400

    Merge pull request #5926 from kbrock/compliance_statuses

    Circling back and removing compliance join for all

    (cherry picked from commit f7dcdc6fd248f53ceb0780a76c9f8c24f9d99238)

    https://bugzilla.redhat.com/show_bug.cgi?id=1733351

 product/views/InstanceOrImage.yaml | 1 -
 product/views/ManageIQ_Providers_CloudManager_Template.yaml | 1 -
 product/views/ManageIQ_Providers_CloudManager_Vm-all_vms_and_templates.yaml | 1 -
 product/views/ManageIQ_Providers_CloudManager_Vm-vms.yaml | 1 -
 product/views/ManageIQ_Providers_CloudManager_Vm.yaml | 1 -
 product/views/ManageIQ_Providers_InfraManager_Template.yaml | 1 -
 product/views/ManageIQ_Providers_InfraManager_Vm.yaml | 1 -
 product/views/MiqTemplate-all_miq_templates.yaml | 1 -
 product/views/ProvisionInfraTemplates.yaml | 1 -
 product/views/Vm.yaml | 1 -
 product/views/VmOrTemplate-all_orphaned.yaml | 1 -
 product/views/VmOrTemplate-all_vms_and_templates.yaml | 1 -
 12 files changed, 12 deletions(-)
We see the same "Long running http request" issue when expanding a report tree. Will the last commit also fix the issue for reports?

[----] W, [2019-08-07T09:36:42.443589 #53073:154b0a4] WARN -- : MIQ(MiqUiWorker::Runner#log_long_running_requests) Long running http(s) request: '/report/tree_autoload' handled by #53073:537ff78, running for 113.72 seconds
The customer DB is huge and QE doesn't have enough resources to reproduce the issue. Hence, this BZ will be marked as verified.
Moving it to verified. Reference BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1738266