
Bug 1883971

Summary:	[metric-daemon] memory usage high at scale
Product:	OpenShift Container Platform	Reporter:	Joe Talerico <jtaleric>
Component:	Networking	Assignee:	Federico Paolinelli <fpaoline>
Networking sub component:	multus	QA Contact:	Weibin Liang <weliang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	unspecified	CC:	anusaxen, fiezzi, fpaoline, mifiedle, msheth, nelluri, rsevilla, xtian
Version:	4.6
Target Milestone:	---
Target Release:	4.6.0
Hardware:	All
OS:	All
Whiteboard:	aos-scalability-46
Last Closed:	2020-10-27 16:47:06 UTC	Type:	Bug

Description Joe Talerico 2020-09-30 16:06:56 UTC

Description of problem:
While running other networking tests at scale, we noticed that the networking-daemon was consuming more memory than expected [1]. We provided the pprof data to Federico [2].

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-22-130743

How reproducible:
N/A 

Steps to Reproduce:
1. Deploy a 100-node cluster and launch some pods across it.

Actual results:
[1] https://snapshot.raintank.io/dashboard/snapshot/yLMK4nYAsCnQN73HI41gb2eUcDCsAd7K?orgId=2

Additional info:
[2] https://coreos.slack.com/archives/CFFSAHWHF/p1601476713051200?thread_ts=1601405444.014700&cid=CFFSAHWHF

Comment 1 Federico Paolinelli 2020-09-30 16:38:41 UTC

I think the reason is that the shared informer is caching all pods from all the nodes, even though each daemon pod only acts on the pods of the node it runs on.
I sent Joe a temporary build that caches only the local node's pods, and it seems to be working.
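
To illustrate why this matters at scale, here is a small stand-alone model of the difference between a cluster-wide cache and a node-scoped one. It is only a sketch under stated assumptions: Pod, podCache, and the node names are hypothetical stand-ins, not the daemon's types. With client-go, a similar effect is commonly obtained by giving the pod informer a field selector on spec.nodeName; this report does not say whether the actual patch does it that way.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod object.
type Pod struct {
	Name     string
	NodeName string
}

// podCache is a hypothetical local cache. When nodeName is non-empty,
// only pods scheduled on that node are stored.
type podCache struct {
	nodeName string
	pods     map[string]Pod
}

func newPodCache(nodeName string) *podCache {
	return &podCache{nodeName: nodeName, pods: make(map[string]Pod)}
}

// Add stores the pod unless the cache is node-scoped and the pod belongs
// to a different node. It reports whether the pod was stored.
func (c *podCache) Add(p Pod) bool {
	if c.nodeName != "" && p.NodeName != c.nodeName {
		return false
	}
	c.pods[p.Name] = p
	return true
}

// Len returns the number of cached pods.
func (c *podCache) Len() int {
	return len(c.pods)
}

func main() {
	all := newPodCache("")         // caches every pod in the cluster
	local := newPodCache("node-7") // caches only pods on node-7

	// Simulate 100 nodes with 10 pods each.
	for n := 0; n < 100; n++ {
		for i := 0; i < 10; i++ {
			p := Pod{
				Name:     fmt.Sprintf("pod-%d-%d", n, i),
				NodeName: fmt.Sprintf("node-%d", n),
			}
			all.Add(p)
			local.Add(p)
		}
	}

	fmt.Println(all.Len(), local.Len()) // prints "1000 10"
}
```

In this model the unscoped cache grows with the total number of pods in the cluster, while the node-scoped cache grows only with the pods on one node, which matches the behaviour described in this comment.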

Comment 4 Anurag saxena 2020-10-09 08:00:44 UTC

@Mifiedle, can we try to verify this in your next scale setup, if possible?

Comment 5 Mike Fiedler 2020-10-09 15:16:27 UTC

@rook Is this something you can re-test on 4.6, or should I spin up a new env?

Comment 6 Joe Talerico 2020-10-12 17:59:07 UTC

(In reply to Mike Fiedler from comment #5)
> @rook Is this something you can re-test on 4.6, or should I spin up a new env?

Hey Mike - I have already verified @Federico's patch. Do we know if this landed in rc0? If so, I can check an active deployment...

Comment 8 Mike Fiedler 2020-10-14 15:08:15 UTC

@rook it is in rc0. Missed your ping. Let me know if you can take this one; otherwise I will spin up a cluster.

Comment 9 Joe Talerico 2020-10-14 15:23:43 UTC

Verified on RC2.

Comment 12 errata-xmlrpc 2020-10-27 16:47:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196