Bug 1883971 - [metric-daemon] memory usage high at scale
Summary: [metric-daemon] memory usage high at scale
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Federico Paolinelli
QA Contact: Weibin Liang
URL:
Whiteboard: aos-scalability-46
Depends On:
Blocks:
Reported: 2020-09-30 16:06 UTC by Joe Talerico
Modified: 2020-10-27 16:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:47:06 UTC
Target Upstream Version:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift network-metrics-daemon pull 28 | 0 | None | closed | Bug 1883971: Use a filtered shared informer. | 2020-11-06 13:59:29 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:47:22 UTC

Description Joe Talerico 2020-09-30 16:06:56 UTC
Description of problem:
While running other networking tests at scale, we noticed that the network-metrics-daemon was consuming more memory than we expected [1]. We provided the pprof data to Federico [2].

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-22-130743

How reproducible:
N/A 

Steps to Reproduce:
1. Deploy 100 node cluster, launch some pods across the cluster.

Actual results:
[1] https://snapshot.raintank.io/dashboard/snapshot/yLMK4nYAsCnQN73HI41gb2eUcDCsAd7K?orgId=2

Additional info:
[2] https://coreos.slack.com/archives/CFFSAHWHF/p1601476713051200?thread_ts=1601405444.014700&cid=CFFSAHWHF

Comment 1 Federico Paolinelli 2020-09-30 16:38:41 UTC
I think the reason is that the shared informer caches all pods from all nodes, even though each daemon pod only processes the pods on the node it runs on.
I sent Joe a temporary build that caches only the local node's pods; it seems to be working.
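
To illustrate why restricting the cache to the local node helps, here is a minimal Go sketch using only the standard library. It is not the code from the linked pull request: `Pod`, `PodCache`, and `clusterPods` are hypothetical names, and the figure of 20 pods per node is an assumption. It models a per-daemon cache that either stores every pod in the cluster or only the pods scheduled on one node. In client-go the same restriction is usually expressed server-side, as a field selector on `spec.nodeName` applied to the informer's list and watch calls, so that remote pods never reach the daemon at all.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod object.
type Pod struct {
	Name     string
	NodeName string
}

// PodCache models an informer's local store. With an empty nodeFilter it
// keeps every pod it is offered; otherwise it keeps only pods on that node.
type PodCache struct {
	nodeFilter string
	pods       map[string]Pod
}

// NewPodCache returns an empty cache; pass "" to disable filtering.
func NewPodCache(nodeFilter string) *PodCache {
	return &PodCache{nodeFilter: nodeFilter, pods: make(map[string]Pod)}
}

// Add stores the pod unless the filter rejects it, and reports whether
// the pod was stored.
func (c *PodCache) Add(p Pod) bool {
	if c.nodeFilter != "" && p.NodeName != c.nodeFilter {
		return false
	}
	c.pods[p.Name] = p
	return true
}

// Len returns the number of pods currently held in the cache.
func (c *PodCache) Len() int { return len(c.pods) }

// clusterPods builds podsPerNode pods on each of the given number of nodes.
func clusterPods(nodes, podsPerNode int) []Pod {
	pods := make([]Pod, 0, nodes*podsPerNode)
	for n := 0; n < nodes; n++ {
		for i := 0; i < podsPerNode; i++ {
			pods = append(pods, Pod{
				Name:     fmt.Sprintf("pod-%d-%d", n, i),
				NodeName: fmt.Sprintf("node-%d", n),
			})
		}
	}
	return pods
}

func main() {
	// 100 nodes as in the report; 20 pods per node is an assumption.
	pods := clusterPods(100, 20)

	unfiltered := NewPodCache("")     // caches every pod in the cluster
	filtered := NewPodCache("node-7") // caches only pods on one node

	for _, p := range pods {
		unfiltered.Add(p)
		filtered.Add(p)
	}

	fmt.Printf("unfiltered cache: %d pods\n", unfiltered.Len())
	fmt.Printf("filtered cache:   %d pods\n", filtered.Len())
}
```

In this toy model the unfiltered cache holds 2000 pods and the filtered one holds 20, so the per-daemon cache size no longer grows with the number of nodes in the cluster.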

Comment 4 Anurag saxena 2020-10-09 08:00:44 UTC
@Mifiedle, could we verify this in your next scale setup, if possible?

Comment 5 Mike Fiedler 2020-10-09 15:16:27 UTC
@rook Is this something you can re-test on 4.6, or should I spin up a new env?

Comment 6 Joe Talerico 2020-10-12 17:59:07 UTC
(In reply to Mike Fiedler from comment #5)
> @rook Is this something you can re-test on 4.6, or should I spin up a new env?

Hey Mike - I have already verified @Federico's patch. Do we know if this landed in rc0? If so, I can check an active deployment...

Comment 8 Mike Fiedler 2020-10-14 15:08:15 UTC
@rook It is in rc0. Missed your ping. Let me know if you can take this one; otherwise I will spin up a cluster.

Comment 9 Joe Talerico 2020-10-14 15:23:43 UTC
Verified on RC2.

Comment 12 errata-xmlrpc 2020-10-27 16:47:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
