Created attachment 1481248 [details] evm log

Description of problem:
A 502 proxy error appears when viewing the details of an added provider. The provider in this case is VMware with 10k VMs. Opening the provider details shows a popup containing a 502 proxy error. The error also appears after running "Refresh Relationships and Power States" for the provider. Accessing the same URL directly from the browser shows the same proxy error (screenshots attached).

Version-Release number of selected component (if applicable):
CFME 5.9.4.7

How reproducible:
Every time

Steps to Reproduce:
1. Add the provider (in this case VMware, with 10k VMs).
2. Select the provider and run "Refresh Relationships and Power States".
3. Click on the provider to see details such as hosts and VMs.

Actual results:
After clicking on the provider to check the host and VM counts, a popup appears in the UI reporting a 502 proxy error.

Expected results:
The provider details, such as the number of hosts and VMs, should be displayed.

Additional info:
Attaching evm.log, production.log, and the apache/ssl* logs, along with screenshots of the error.
Created attachment 1481249 [details] production.log
Created attachment 1481250 [details] ssl error log
Created attachment 1481251 [details] proxy error UI screenshot
Created attachment 1481252 [details] proxy error browser screenshot
Created attachment 1481632 [details] Log review summary
Created attachment 1481633 [details] Max worker USS
Created attachment 1481634 [details] Runnable process count over time
Based on a review of the logs from the reference worker appliance, the underlying issue appears to be the workload configured for this appliance. The log review (summary in analysis.txt) shows:

- The Web Services (UI), Reporting, and Metrics Collector workers were exceeding their configured maximum memory usage (max USS usage shown in max_workerP_uss.png) and being restarted.
- The Metrics Collectors were not able to keep up with the data, resulting in many misses.
- All worker appliances are in a single zone; 10K is a very large number of VMs for a single zone.
- The number of runnable processes (workers) over time (vmstat_runnable.png) was often twice the CPU count, resulting in a high system load average and causing high application latency.

Overall, the logs reflect an appliance running in a large environment that requires additional deployment configuration in order to run error-free.
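For anyone wanting to repeat the runnable-process check on their own appliance, the following is a minimal sketch, not the script used for this review. It assumes plain `vmstat` output saved as text, where the first column of each data row is `r` (runnable processes). The function name `overloaded_samples`, the `SAMPLE` data, the CPU count of 4, and the factor of 2 are all hypothetical and chosen only to illustrate the "twice the CPU count" observation above.

```python
# Hypothetical vmstat capture; real data would come from the appliance.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812340  10240 402112    0    0     5    12  310  620 10  5 85  0  0
 9  1      0 801120  10240 402200    0    0     8    40  900 1800 70 20  5  5  0
 8  0      0 799900  10240 402300    0    0     4    22  850 1700 65 20 10  5  0
"""


def overloaded_samples(vmstat_text, cpu_count, factor=2.0):
    """Return (sample_index, runnable) pairs where r >= factor * cpu_count."""
    flagged = []
    index = 0
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Header and blank lines do not start with an integer, so skip them.
        if not fields or not fields[0].isdigit():
            continue
        runnable = int(fields[0])  # first vmstat column is "r"
        if runnable >= factor * cpu_count:
            flagged.append((index, runnable))
        index += 1
    return flagged


if __name__ == "__main__":
    # Assumed 4 CPUs; substitute the appliance's real CPU count.
    for idx, runnable in overloaded_samples(SAMPLE, cpu_count=4):
        print(f"sample {idx}: {runnable} runnable processes")
```

If a large share of samples is flagged, that is consistent with the sustained high load average described above, though the threshold itself is a rule of thumb rather than a documented limit.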
Closing, as triage of this ticket indicates the problem is an environment/configuration issue. Please see https://access.redhat.com/documentation/en-us/reference_architectures/2017/html/deploying_cloudforms_at_scale/index for guidance on configuring CFME at scale.