Bug 1625966

Summary: 502 Proxy Error - while checking the provider details on UI
Product: Red Hat CloudForms Management Engine
Reporter: Avinash Kumar Dasoundhi <adasound>
Component: UI - OPS
Assignee: dmetzger
Status: CLOSED NOTABUG
QA Contact: Jad Haj Yahya <jhajyahy>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.9.4
CC: adasound, dmetzger, gblomqui, hkataria, jfrey, jhardy, lavenel, mpovolny, obarenbo, psuriset
Target Milestone: GA
Target Release: 5.9.6
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-07 18:10:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                         Flags
evm log                             none
production.log                      none
ssl error log                       none
proxy error UI screenshot           none
proxy error browser screenshot      none
Log review summary                  none
Max worker USS                      none
Runnable process count over time    none

Description Avinash Kumar Dasoundhi 2018-09-06 10:00:06 UTC
Created attachment 1481248 [details]
evm log

Description of problem:
A 502 proxy error appears when checking the details of an added provider.

The provider is VMware and has 10k VMs in it. After adding the provider and running Refresh Relationships and Power States on it, opening the provider's details shows a popup containing a 502 proxy error. Accessing the same URL directly from the browser shows the same proxy error (screenshots attached).

Version-Release number of selected component (if applicable):
CFME 5.9.4.7

How reproducible:
Every time

Steps to Reproduce:
1. Add the provider (in this case VMware, with 10k VMs in it).
2. Select the provider and run Refresh Relationships and Power States.
3. Click on the provider to see its details (hosts, VMs, etc.).

Actual results:
After clicking on the provider to check the host and VM counts, a popup appears in the UI reporting a 502 proxy error.

Expected results:
The UI should show the provider's details, such as the number of hosts and VMs.

Additional info:
Attaching the evm.log, production.log, and apache/ssl* logs with the error screenshots.

Comment 2 Avinash Kumar Dasoundhi 2018-09-06 10:01:06 UTC
Created attachment 1481249 [details]
production.log

Comment 3 Avinash Kumar Dasoundhi 2018-09-06 10:01:40 UTC
Created attachment 1481250 [details]
ssl error log

Comment 4 Avinash Kumar Dasoundhi 2018-09-06 10:02:29 UTC
Created attachment 1481251 [details]
proxy error UI screenshot

Comment 5 Avinash Kumar Dasoundhi 2018-09-06 10:03:23 UTC
Created attachment 1481252 [details]
proxy error browser screenshot

Comment 9 dmetzger 2018-09-07 18:03:17 UTC
Created attachment 1481632 [details]
Log review summary

Comment 10 dmetzger 2018-09-07 18:04:24 UTC
Created attachment 1481633 [details]
Max worker USS

Comment 11 dmetzger 2018-09-07 18:05:04 UTC
Created attachment 1481634 [details]
Runnable process count over time

Comment 12 dmetzger 2018-09-07 18:05:46 UTC
Based on a review of the logs from the referenced worker appliance, the underlying issue appears to be the workload configured for this appliance.

Based on log review (summary in analysis.txt) we can see:
- The Web Services (UI), Reporting, and Metrics Collector workers were exceeding their configured maximum memory usage (max USS usage shown in max_workerP_uss.png) and being restarted.
- The Metrics Collectors were not able to keep up with the data, resulting in many misses.
- All worker appliances are in a single zone; 10K is a very large number of VMs for a single zone.
- The number of runnable processes (workers) over time (vmstat_runnable.png) was often twice the CPU count, resulting in a high system load average and causing high application latency.

Overall, the logs reflect an appliance running in a large environment that requires additional deployment configuration in order to run error-free.
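
For anyone wanting to repeat the runnable-process check on their own appliance, the comparison against the CPU count can be sketched roughly as below. This is a minimal illustration, not the tooling used for the analysis in this bug: it assumes plain `vmstat` text output, where the first column (`r`) is the number of runnable processes, and the names `parse_runnable`, `overloaded_fraction`, `SAMPLE`, and the 4-CPU figure are hypothetical.

```python
# Sketch only: function names, sample data, and CPU count are illustrative assumptions.

SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812345  10240 904321    0    0     5    12  310  540 12  3 84  1  0
 9  1      0 640112  10240 910004    0    0     9    40  880 1520 71 14 10  5  0
 8  0      0 633870  10240 911876    0    0     2    33  905 1610 78 12  8  2  0
 1  0      0 802001  10240 905112    0    0     1     8  295  510 10  2 87  1  0
"""


def parse_runnable(vmstat_text):
    """Return the 'r' (runnable processes) value from each vmstat data row."""
    counts = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Data rows start with an integer; the two header rows do not.
        if fields and fields[0].isdigit():
            counts.append(int(fields[0]))
    return counts


def overloaded_fraction(counts, cpu_count, factor=2.0):
    """Fraction of samples whose runnable count is at least factor * cpu_count."""
    if not counts:
        return 0.0
    threshold = factor * cpu_count
    over = sum(1 for c in counts if c >= threshold)
    return over / len(counts)


if __name__ == "__main__":
    counts = parse_runnable(SAMPLE)
    # Assumed 4 CPUs for the sample; on a real host use os.cpu_count().
    print(counts, overloaded_fraction(counts, cpu_count=4))
```

A consistently high fraction would point to the same condition described above: more runnable workers than the CPUs can service, and therefore high load average and latency.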

Comment 13 dmetzger 2018-09-07 18:10:12 UTC
Closing: triage of this ticket indicates that the problem appears to be an environment / configuration issue.

Please see https://access.redhat.com/documentation/en-us/reference_architectures/2017/html/deploying_cloudforms_at_scale/index for guidance on configuring CFME at scale.