Description of problem:

We have a few OSD clusters running the Dev Sandbox: https://developers.redhat.com/developer-sandbox These are 4.7.16 clusters running in AWS. One of our clusters shows high system memory usage on worker nodes. Specifically, on one node (with about 100 running pods) we see very high memory usage by crio.service.

container_memory_rss{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}

Top three:
- /system.slice/crio.service - ~2.9 GB
- /system.slice/kubelet.service - ~550 MB
- /system.slice/systemd-journald.service - ~80 MB
The rest are less than 50 MB.

container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}

Top three:
- /system.slice/crio.service - ~3.6 GB
- /system.slice/kubelet.service - ~700 MB
- /system.slice/systemd-journald.service - ~180 MB
The rest are less than 100 MB.

The pods on this node experience various issues, including network connections failing with timeouts. Our operators happen to run on that node and the entire Dev Sandbox is unstable.
I am also interested in the metrics for the number of bytes pulled during image pulls over time. Specifically, it would be interesting to see how `container_memory_working_set_bytes` correlates with the `crio_image_pulls_by_name` metrics. Are you able to get that information for me?
Created attachment 1799755 [details] container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}

I've added a screenshot from Prometheus for: container_memory_working_set_bytes{id=~"/system.slice/.*",node="ip-10-0-228-21.us-east-2.compute.internal"}

I don't see any crio_image_pulls_by_name metric, though. I must be doing something wrong. Can you assist me with this, please?
Ah, the metric is actually named `container_runtime_crio_image_pulls_by_name`. We may not be scraping it with Prometheus, as I cannot see it in the dashboard of a node I have up. If not, you can ssh to the node and grab the value now with: `curl localhost:9537/metrics`
I think it is pretty likely it's due to multiple concurrent pulls bumping up CRI-O's RSS. Note the following script:
```
# Sum the size= labels from the image-pull-by-digest metrics dump.
grep '_digest{' /tmp/mozilla_pehunt0/crio-metrics.out \
  | grep -o 'size="[0-9]*"' \
  | grep -o '[0-9]*' > /tmp/bytes
paste -sd+ /tmp/bytes | bc
```
I get 385648780939 (385 GB). This number is the sum of the sizes of all images that node has pulled. We have a couple of options:
- periodically force a Go GC (which may be inefficient)
- set the GC threshold lower so Go GCs sooner
- cap the number of concurrent pulls to limit how much memory CRI-O needs at once
- bump the system reserved for nodes where we expect tons of image pulls
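For the "lower the GC threshold" option, one possible shape is a systemd drop-in lowering GOGC for the crio service. This is only a sketch: the value 50 is illustrative, and whether tuning GOGC this way is appropriate for CRI-O would need verification.
```
# /etc/systemd/system/crio.service.d/10-gogc.conf
[Service]
# Lower Go's GC target percentage from the default 100 so the runtime
# collects more aggressively (value is illustrative, not a recommendation).
Environment=GOGC=50
```
After dropping this in you would run `systemctl daemon-reload && systemctl restart crio` for it to take effect.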
The last option, "bump the system reserved for nodes", is what we could do right now to stabilize our nodes, right? We have a ticket with the OSD SRE folks for that: https://issues.redhat.com/browse/OHSS-5120 Do you think setting it to 4GB would be reasonable (instead of the current 1GB)? Our nodes are open to our users and we have tons of users starting different pods. So, yes, there can be a big number of concurrent pulls.
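For reference, on OCP 4.x a reservation bump like this could be expressed as a KubeletConfig targeting the worker pool. This is a sketch, not a validated change: the resource name is invented and the 4Gi value is the one under discussion above.
```
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      memory: 4Gi
```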
it's hard to tell what's "reasonable" as it's hard to tell the max concurrent pulls (and how much memory they'd consume). I would hope we'd stay under it, but it is hard to tell given how much it's ballooned already. I do believe we'll want some sort of cri-o fix as well.
Created attachment 1799782 [details] System memory usage for the last two weeks

I'm attaching the system memory usage for this node for the last two weeks; you can see how it was growing. And sure, we would love to get something fixed on the CRI-O side, but meanwhile we need to stabilize our nodes ASAP since they have been choking for a week already. So, I'm thinking about increasing the reservation and maybe restarting the nodes to provide some temporary relief while we wait for a more robust solution on the CRI-O/OCP end. Any thoughts?
that sounds great. If you could also periodically check the total image pull size (as shown above) along with the crio rss to get a correlation, I'd love to have that information.
fyi, you should be able to just restart crio and it'll relinquish the hoarded rss. full node reboot shouldn't be necessary
How do I restart crio? Just "systemctl start crio"? Or should I take care of something else?
"systemctl restart crio" but yes essentially :)
Created attachment 1799788 [details] crio working set after restart
Created attachment 1799789 [details] crio rss after restart
Just for the record: the system memory usage dropped significantly after restarting crio (see the attached screenshots above), but we still see network issues, at least in pods on this node :(((
Another thing to try is forcing a golang GC by sending SIGUSR2 to the crio process. I am interested to see how much that helps. Check RSS before and after:
```
grep -i rss /proc/$(pidof crio)/status
kill -USR2 $(pidof crio)
grep -i rss /proc/$(pidof crio)/status
```
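Wrapped up as a small helper, that check might look like this. The script name and structure are hypothetical; it defaults to its own PID so it can be tried safely, and on a node you would pass `$(pidof crio)` instead.
```shell
#!/bin/sh
# gc_rss.sh [pid]: print VmRSS before and after sending SIGUSR2, which
# makes CRI-O's Go runtime force a garbage collection.
pid=${1:-$$}
trap '' USR2                 # ignore the signal if we target ourselves
rss() { awk '/VmRSS/ {print $2}' "/proc/$pid/status"; }
echo "before: $(rss) kB"
kill -USR2 "$pid"
sleep 2                      # give the runtime a moment to release pages
echo "after: $(rss) kB"
```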
We just updated our clusters to 4.7.19, so we will have to wait for a while until the crio memory usage builds up again. Then we will try the SIGUSR2 signal.
> I think it is pretty likely it's due to multiple concurrent pulls bumping up cri-o's rss. Does CRI-O hold image layers in memory? I would have expected it to stream them from the network right onto the disk, while teeing the bytes into a hasher to confirm that we got the expected digest.
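The streaming pattern described above can be sketched in shell with `tee`: the bytes go to disk and into the hasher at the same time, so nothing requires holding a whole layer in memory. The `printf` stands in for a registry download, and the paths are invented.
```shell
#!/bin/sh
# Write a blob to disk while hashing the same byte stream.
printf 'fake layer bytes' \
  | tee /tmp/layer.blob \
  | sha256sum > /tmp/layer.sum
# Digest computed from the stream as it passed through.
digest=$(cut -d' ' -f1 /tmp/layer.sum)
# Digest recomputed from what landed on disk; the two should match.
ondisk=$(sha256sum /tmp/layer.blob | cut -d' ' -f1)
echo "streamed digest: $digest"
```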
I think we can consider this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2000092, feel free to reopen if the issue is not resolved. *** This bug has been marked as a duplicate of bug 2000092 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days