Description of problem:
While running some other networking tests at scale, we noticed the networking-daemon was consuming more memory than we expected [1]. We provided the pprof data to Federico [2].

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-22-130743

How reproducible:
N/A

Steps to Reproduce:
1. Deploy a 100-node cluster and launch pods across the cluster.

Actual results:
[1] https://snapshot.raintank.io/dashboard/snapshot/yLMK4nYAsCnQN73HI41gb2eUcDCsAd7K?orgId=2

Additional info:
[2] https://coreos.slack.com/archives/CFFSAHWHF/p1601476713051200?thread_ts=1601405444.014700&cid=CFFSAHWHF
I think the reason is that the shared informer is caching all pods from all nodes, even though each daemon instance only operates on the pods of the node it runs on. I sent a temporary build to Joe that caches only the local node's pods; it seems to be working.
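For context, in client-go this kind of fix is typically done by filtering the informer's list/watch with a field selector such as spec.nodeName=<node> (e.g. via informers.WithTweakListOptions), so the watch cache never holds pods from other nodes. Here is a minimal self-contained sketch of the memory effect, not the actual patch; all names (Pod, all_pods, local_cache, the node counts) are hypothetical:

```python
# Toy sketch (hypothetical names): why caching only the local node's pods
# shrinks a per-node daemon's in-memory cache.
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    node: str


def all_pods(num_nodes: int, pods_per_node: int) -> list[Pod]:
    """Simulate the full cluster pod list an unfiltered shared informer caches."""
    return [
        Pod(f"pod-{n}-{i}", f"node-{n}")
        for n in range(num_nodes)
        for i in range(pods_per_node)
    ]


def local_cache(pods: list[Pod], node: str) -> list[Pod]:
    """Mimic a field-selector-filtered informer: keep only this node's pods."""
    return [p for p in pods if p.node == node]


cluster = all_pods(num_nodes=100, pods_per_node=50)
local = local_cache(cluster, "node-0")
print(len(cluster), len(local))  # 5000 vs 50: the per-node cache is 100x smaller
```

With 100 nodes each running the daemon, the unfiltered design costs every instance the full cluster's pods; filtering to the local node makes the cache proportional to that node's pod count only.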
@Mifiedle, can you try this in your next scale setup, if possible?
@rook Is this something you can re-test on 4.6, or should I spin up a new env?
(In reply to Mike Fiedler from comment #5) > @rook Is this something you can re-test on 4.6, or should I spin up a new env? Hey Mike - I have already verified @Federico's patch. Do we know if this landed in rc0? If so, I can check an active deployment...
@rook It is in rc0. Missed your ping. Let me know if you can take this one; otherwise I will spin up a cluster.
Verified on RC2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196