Bug 1980437
| Summary: | High memory usage by crio.service in worker nodes | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alexey Kazakov <alkazako> |
| Component: | Release | Assignee: | Peter Hunt <pehunt> |
| Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | aos-bugs, dawilson, dmunneor, dofinn, jokerman, kgordeev, mharri, mhofmann, mpatel, mrobson, pehunt, sgrunert, ssonigra, wking |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-01 16:18:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Alexey Kazakov, 2021-07-08 16:31:53 UTC)
I am also interested in the metrics for the number of bytes pulled during image pulls as time goes on. Specifically, it would be interesting to see how `container_memory_working_set_bytes` correlates with the `crio_image_pulls_by_name` metrics. Are you able to get that information for me?

Created attachment 1799755 [details]
container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}
I've added the screenshot from Prometheus for:
container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}
I don't see any crio_image_pulls_by_name metric. I must be doing something wrong. Can you assist me with this, please?
Ah, the metric is actually named `container_runtime_crio_image_pulls_by_name`. We may not be scraping it with Prometheus, as I cannot see it in the dashboard of a node I have up. If not, you can ssh to the node and grab the current value with `curl localhost:9537/metrics`.

I think it is pretty likely it's due to multiple concurrent pulls bumping up cri-o's rss. Note the following script:

```
grep '_digest{' /tmp/mozilla_pehunt0/crio-metrics.out | grep -o 'size=".*"' | sed 's/"/ /g' | awk '{ printf $2"\n" }' > /tmp/bytes
paste -sd+ /tmp/bytes | bc
```

I get 385648780939 (385 GB). This number is the sum of the sizes of all images that node has pulled. We have a couple of options:
- periodically force a Go GC (which may be inefficient)
- set the GC threshold lower so Go GCs sooner
- cap the number of concurrent pulls to limit how much memory cri-o needs at once
- bump the system reserved for nodes where we expect tons of image pulls

The last option, "bump the system reserved for nodes", is what we could do right now to stabilize our nodes, right? We have a ticket for the OSD SRE folks for that: https://issues.redhat.com/browse/OHSS-5120. Do you think setting it to 4 GB would be reasonable (instead of the current 1 GB)? Our nodes are open to our users and we have tons of users starting different pods. So, yes, there can be a big number of concurrent pulls.

It's hard to tell what's "reasonable", as it's hard to tell the max number of concurrent pulls (and how much memory they'd consume). I would hope we'd stay under it, but it is hard to tell given how much it's ballooned already. I do believe we'll want some sort of cri-o fix as well.

Created attachment 1799782 [details]
System memory usage for the last two weeks
I'm attaching the system memory usage for this node for the last two weeks. You can see how it was growing.
And sure. We would love to get something fixed on the cri-o side, but meanwhile we need to stabilize our nodes ASAP, since they seem to have been choking for a week already.
So, I'm thinking about increasing the reservation and maybe restarting the nodes, so it provides some temporary relief while we wait for a more robust solution on the cri-o/OCP end.
Any thoughts?
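For reference, raising the system reservation on OpenShift worker nodes is typically expressed as a KubeletConfig custom resource. This is only a sketch: the resource name and the `custom-kubelet` label (which must also be applied to the worker MachineConfigPool) are assumptions, not taken from this bug.

```yaml
# Hypothetical KubeletConfig raising systemReserved memory to 4Gi.
# Assumes the worker MachineConfigPool has been labeled
# custom-kubelet: set-system-reserved.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-system-reserved
  kubeletConfig:
    systemReserved:
      memory: 4Gi
```

Applying a KubeletConfig rolls out a new rendered MachineConfig, so nodes in the pool restart as the change lands.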
That sounds great. If you could also periodically check the total image pull size (as shown above) along with the cri-o RSS to get a correlation, I'd love to have that information. FYI, you should be able to just restart cri-o and it'll relinquish the hoarded RSS; a full node reboot shouldn't be necessary.

How do I restart crio? Just "systemctl start crio"? Or should I take care of something else?

"systemctl restart crio", but yes, essentially :)

Created attachment 1799788 [details]
crio working set after restart
Created attachment 1799789 [details]
crio rss after restart
Just for the record: the system memory usage dropped significantly after restarting crio (see the attached screenshots above), but we still see network issues, at least in pods on this node :(((

Another thing to try is forcing a golang GC by sending SIGUSR2 to the crio process. I am interested to see how much that helps. Check RSS before and after:

```
grep -i rss /proc/$(pidof crio)/status
kill -USR2 $(pidof crio)
grep -i rss /proc/$(pidof crio)/status
```

We just updated our clusters to 4.7.19, so we will have to wait for a while until the crio memory usage builds up again. Then we will try the SIGUSR2 signal.

> I think it is pretty likely it's due to multiple concurrent pulls bumping up cri-o's rss.
Does CRI-O hold image layers in memory? I would have expected it to stream them from the network right onto the disk, while teeing the bytes into a hasher to confirm that we got the expected digest.
I think we can consider this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2000092; feel free to reopen if the issue is not resolved.

*** This bug has been marked as a duplicate of bug 2000092 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days