Bug 1980437 - High memory usage by crio.service in worker nodes
Summary: High memory usage by crio.service in worker nodes
Keywords:
Status: CLOSED DUPLICATE of bug 2000092
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Release
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-08 16:31 UTC by Alexey Kazakov
Modified: 2023-09-15 01:11 UTC
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-01 16:18:26 UTC
Target Upstream Version:
Embargoed:


Attachments
container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"} (173.55 KB, image/png), uploaded 2021-07-08 19:17 UTC by Alexey Kazakov
System memory usage for the last two weeks (94.76 KB, image/png), uploaded 2021-07-08 20:18 UTC by Alexey Kazakov
crio working set after restart (44.76 KB, image/png), uploaded 2021-07-08 21:07 UTC by Alexey Kazakov
crio rss after restart (42.15 KB, image/png), uploaded 2021-07-08 21:08 UTC by Alexey Kazakov


Links
Red Hat Knowledge Base (Solution) 6304881 (last updated 2021-09-03 14:48:16 UTC)

Description Alexey Kazakov 2021-07-08 16:31:53 UTC
Description of problem:

We have a few OSD clusters that run Dev Sandbox: https://developers.redhat.com/developer-sandbox

These are 4.7.16 clusters running in AWS.

One of these clusters shows high system memory usage on its worker nodes.
On one node in particular we see very high memory usage by crio.service.

The node has about 100 running pods.

container_memory_rss{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}

Top three:
/system.slice/crio.service - ~2.9 GB
/system.slice/kubelet.service - ~550 MB
/system.slice/systemd-journald.service - ~80 MB
The rest is less than 50 MB


container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}

Top three:
/system.slice/crio.service - ~3.6 GB
/system.slice/kubelet.service - ~700 MB
/system.slice/systemd-journald.service - 180 MB
The rest is less than 100 MB

The pods on this node experience various issues, including network connections failing with timeouts. Our operators happen to run on that node, so the entire Dev Sandbox is unstable.

Comment 2 Peter Hunt 2021-07-08 19:04:58 UTC
I am also interested in the metrics for the number of bytes pulled by image pulls over time. Specifically, it'd be interesting to see how
`container_memory_working_set_bytes` correlates with the `crio_image_pulls_by_name` metrics.

are you able to get that information for me?

Comment 3 Alexey Kazakov 2021-07-08 19:17:59 UTC
Created attachment 1799755 [details]
container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}

I've added the screenshot from Prometheus for:
container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}

I don't see any crio_image_pulls_by_name metric. I must be doing something wrong. Can you assist me with this, please?

Comment 4 Peter Hunt 2021-07-08 19:30:51 UTC
Ah, the metric is actually named `container_runtime_crio_image_pulls_by_name`.
We may not be scraping it with Prometheus, as I cannot see it in the dashboard of a node I have up. If not, you can ssh to the node and grab the current value with:
`curl localhost:9537/metrics`
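
For reference, a minimal way to pull just the image-pull counters from the node (a sketch: it assumes the default CRI-O metrics port 9537 mentioned above and that you already have a shell on the node, e.g. via `oc debug node/<node>`):

```
# Fetch the CRI-O metrics endpoint and keep only the image-pull series
curl -s localhost:9537/metrics | grep container_runtime_crio_image_pulls
```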

Comment 6 Peter Hunt 2021-07-08 19:54:02 UTC
I think it is pretty likely it's due to multiple concurrent pulls bumping up cri-o's rss. Note the following script:

```
# Extract the size="..." label from every *_by_digest image-pull sample CRI-O reported
grep '_digest{' /tmp/mozilla_pehunt0/crio-metrics.out | grep -o 'size=".*"' | sed 's/"/ /g' | awk '{ print $2 }' > /tmp/bytes
# Add the per-image sizes together
paste -sd+ /tmp/bytes | bc
```

I get 385648780939 (385 GB)

This number is the sum of the sizes of all images that node has pulled.
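
The same sum can be computed in one pass (a sketch that assumes the metrics were saved to a file as above and that every *_by_digest sample carries a size="..." label):

```
# Sum the size labels of all *_by_digest samples in a single awk pass
awk -F'size="' '/_digest[{]/ { split($2, a, "\""); total += a[1] } END { print total }' /tmp/mozilla_pehunt0/crio-metrics.out
```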

We have a few options:
- periodically force a Go GC (which may be inefficient)
- lower the Go GC threshold so it collects more often (see the sketch after this list)
- cap the number of concurrent pulls to limit how much memory cri-o needs at once
- bump the system-reserved memory on nodes where we expect tons of image pulls
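
As a node-level experiment for the GC-threshold option, something like the following systemd drop-in would make the Go runtime in cri-o collect more aggressively (a sketch only: GOGC=50 and the drop-in file name are arbitrary choices, and on OpenShift a MachineConfig would be the supported way to persist such a change):

```
# Lower the Go GC target for crio.service via a systemd drop-in, then restart it
sudo mkdir -p /etc/systemd/system/crio.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/crio.service.d/10-gogc.conf
[Service]
Environment=GOGC=50
EOF
sudo systemctl daemon-reload
sudo systemctl restart crio
```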

Comment 7 Alexey Kazakov 2021-07-08 20:00:00 UTC
The last option "bump the system reserved for nodes" is what we could do right now to stabilize our nodes, right?
We have a ticket for the OSD sre folks for that: https://issues.redhat.com/browse/OHSS-5120
Do you think setting it to 4GB would be reasonable (instead of the current 1GB)?

Our nodes are open to our users and we have tons of users starting different pods. So, yes, there can be a big number of concurrent pulls.
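
For what it's worth, a sketch of how the bump could look as a KubeletConfig (the 4Gi value comes from the question above; the resource name and the pool selector label are assumptions and would need to match the cluster's worker MachineConfigPool, and on OSD this would go through the SRE team anyway):

```
# Raise system-reserved memory on worker nodes (hypothetical example)
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      memory: 4Gi
EOF
```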

Comment 8 Peter Hunt 2021-07-08 20:05:19 UTC
It's hard to tell what's "reasonable", since it's hard to predict the maximum number of concurrent pulls (and how much memory they'd consume). I would hope we'd stay under that, but it's hard to say given how much memory usage has ballooned already. I do believe we'll want some sort of cri-o fix as well.

Comment 9 Alexey Kazakov 2021-07-08 20:18:36 UTC
Created attachment 1799782 [details]
System memory usage for the last two weeks

I'm attaching the system memory usage for this node for the last two weeks. You can see how it was growing.

And sure, we would love to get something fixed on the crio side, but meanwhile we need to stabilize our nodes ASAP since they have been choking for a week already.

So I'm thinking about increasing the reservation and maybe restarting the nodes to provide some temporary relief while we wait for a more robust solution on the crio/OCP end.

Any thoughts?

Comment 10 Peter Hunt 2021-07-08 20:24:40 UTC
that sounds great. If you could also periodically check the total image pull size (as shown above) along with the crio rss to get a correlation, I'd love to have that information.
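
A rough way to collect that correlation on the node (a sketch: the interval, log path, and metric filtering are assumptions, and the awk is the same size-summing trick from comment 6):

```
# Every 10 minutes, log CRI-O's RSS next to the cumulative image-pull bytes
while true; do
  date
  grep -i rss /proc/$(pidof crio)/status
  curl -s localhost:9537/metrics \
    | awk -F'size="' '/_digest[{]/ { split($2, a, "\""); t += a[1] } END { print "total image pull bytes:", t }'
  sleep 600
done >> /var/tmp/crio-rss-vs-pulls.log
```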

Comment 11 Peter Hunt 2021-07-08 20:25:23 UTC
fyi, you should be able to just restart crio and it'll relinquish the hoarded rss. full node reboot shouldn't be necessary

Comment 12 Alexey Kazakov 2021-07-08 20:29:42 UTC
How do I restart crio? Just "systemctl start crio"? Or should I take care of something else?

Comment 13 Peter Hunt 2021-07-08 20:32:31 UTC
"systemctl restart crio" but yes essentially :)

Comment 14 Alexey Kazakov 2021-07-08 21:07:50 UTC
Created attachment 1799788 [details]
crio working set after restart

Comment 15 Alexey Kazakov 2021-07-08 21:08:34 UTC
Created attachment 1799789 [details]
crio rss after restart

Comment 16 Alexey Kazakov 2021-07-08 22:30:24 UTC
Just for the record: the system memory usage dropped significantly after restarting crio (see the attached screenshots above), but we still see network issues, at least in the pods on this node :(((

Comment 17 Mrunal Patel 2021-07-09 17:53:15 UTC
Another thing to try is forcing a golang GC by sending SIGUSR2 to the crio process. I am interested to see how much that helps.

Check RSS before and after:
grep -i rss /proc/$(pidof crio)/status

kill -USR2 $(pidof crio)

grep -i rss /proc/$(pidof crio)/status

Comment 18 Alexey Kazakov 2021-07-09 18:22:20 UTC
We just updated our clusters to 4.7.19, so we will have to wait for a while until the crio memory usage builds up again. Then we will try the SIGUSR2 signal.

Comment 19 W. Trevor King 2021-07-17 03:34:25 UTC
> I think it is pretty likely it's due to multiple concurrent pulls bumping up cri-o's rss.

Does CRI-O hold image layers in memory?  I would have expected it to stream them from the network right onto the disk, while teeing the bytes into a hasher to confirm that we got the expected digest.

Comment 33 Sascha Grunert 2021-09-01 16:18:26 UTC
I think we can consider this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2000092. Feel free to reopen if the issue is not resolved.

*** This bug has been marked as a duplicate of bug 2000092 ***

Comment 34 Red Hat Bugzilla 2023-09-15 01:11:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

