Red Hat Bugzilla – Bug 1479339
Memory leak in MetricsProcessor Worker
Last modified: 2018-01-19 18:56:38 EST
Created attachment 1310596 [details]
Priority worker leakage on all appliances
Description of problem:
Observed memory leaks in the MetricsProcessor Worker and GenericWorker with a 10k-VM VMware infra provider connected.
The test ran for over 3 days, but the leaks appeared within the first day.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up 6 appliances: 1 DB and 5 workers.
2. Turn on C&U on all worker appliances (and the cluster-wide C&U collection settings in the configuration), and keep server roles to a minimum on the DB appliance.
3. Connect a VMware infra provider with 10k VMs and let it run for 2-3 days while keeping an eye on C&U data collector worker memory usage.
MetricsProcessorWorker memory grew from about 1.5G to 2.8G.
Little or no memory growth after the initial C&U/refresh period.
Attaching a screenshot for reference.
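For anyone who wants to track worker memory while reproducing this, here is a minimal Python sketch that sums RSS and PSS for one process from /proc/<pid>/smaps on Linux. This is not the tooling used to produce the attached graphs; the function names are hypothetical, and how you find the worker PID is left to you.

```python
import re

# Matches only the exact "Rss:" and "Pss:" fields, not variants like "Pss_Anon:".
_FIELD = re.compile(r"^(Rss|Pss):\s+(\d+)\s+kB")


def parse_smaps(text):
    """Sum the Rss and Pss fields (in kB) over every mapping in smaps text."""
    totals = {"Rss": 0, "Pss": 0}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals


def read_worker_memory(pid):
    """Read /proc/<pid>/smaps and return total RSS and PSS in kB.

    Assumes a Linux host and permission to read the target process's smaps.
    """
    with open("/proc/%d/smaps" % pid) as handle:
        return parse_smaps(handle.read())


# Example (hypothetical PID): sample this periodically and log the result.
# print(read_worker_memory(12345))
```

Sampling both values is useful because PSS divides shared pages among the processes that map them, so it tends to be a fairer per-worker number than RSS when workers are forked from a common parent.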
Original comment (From BZ about MetricsCollector worker leak for multiple providers): https://bugzilla.redhat.com/show_bug.cgi?id=1456775#c29
Created attachment 1310597 [details]
Generic worker leakage w.r.t stable processor queue and powered on/off vms
Created attachment 1310598 [details]
all worker types memory usage comparison
To add to the 'steps to reproduce' in the description: I increased the memory thresholds / counts for specific worker processes on all appliances, just enough to accommodate that many VMs on a 6-appliance setup.
- Generic - 2, 500 MB
- Priority - 2, 600 MB
# Worker appliances
- Generic - 4, 500 MB
- Priority - 2, 800 MB
- C&U Data Collectors - 6, 600 MB
- C&U Data Processors - 4, 800 MB
- Refresh - 2 GB
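As a rough sanity check on these settings, the sketch below adds up the worst-case memory the configured workers could use before hitting their thresholds. It assumes each threshold applies per worker process, that there is a single refresh worker (the list gives no count), and that 2 GB means 2048 MB; none of that is confirmed in this report.

```python
# (worker count, per-worker memory threshold in MB), copied from the lists above.
FIRST_LIST = {  # the appliance this list belongs to is not named in the comment
    "Generic": (2, 500),
    "Priority": (2, 600),
}

WORKER_APPLIANCE = {
    "Generic": (4, 500),
    "Priority": (2, 800),
    "C&U Data Collectors": (6, 600),
    "C&U Data Processors": (4, 800),
    "Refresh": (1, 2048),  # assumption: one refresh worker, 2 GB = 2048 MB
}


def memory_budget_mb(workers):
    """Worst-case total in MB if every worker sits at its threshold."""
    return sum(count * threshold for count, threshold in workers.values())


print(memory_budget_mb(FIRST_LIST))        # 2200
print(memory_budget_mb(WORKER_APPLIANCE))  # 12448
```

Under those assumptions a worker appliance could reach roughly 12 GB across these worker types alone, which is worth comparing against the appliance's available RAM.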
The refresh worker's (leaked?) memory grew by a few MB. Its RSS memory growth is included in the attachment https://bugzilla.redhat.com/attachment.cgi?id=1310598
Created attachment 1311340 [details]
PSS & RSS utilization - 4+ day test run
Single Metrics Processor Worker
1.5 GB Memory Threshold
Type: VMware VC 5.5.0
Based on some talks with Dennis about similar tickets, I think enabling metrics collection is the root cause of some of the "leaks" that we are seeing.
Most of my commenting will probably be done on:
Will update here when I have more to share.
A possible fix has been proposed in this related BZ:
That fix is targeted at the MiqServer, and we have high confidence that it will fix the leak there. Updates will probably happen there more regularly until we determine whether there is a different leak in the MetricsProcessor Worker; there is a high probability that this was a leak across all workers.
The fix above has been backported to 5.8:
As well as for future releases here:
We are going to do some testing ourselves to see if this is fixing the issue with the MetricsProcessor as well, and will update with those results.