Description of problem:
I am not sure what is causing this issue, so I will set the severity to low for now until we can get further debug data. I am continuing to see major dips in memory utilization on pods in my cluster, and I see it on different pods throughout the cluster outside of my mapreduce project.

Some info on what is running. I have a workload that consists of:
1. a datagen pod, which does about 2 GB of inserts into a mongodb db/collection
2. the mongodb pod
3. a mapreduce pod, which performs mapreduce queries against three different collections in mongodb in parallel

The inserts happen every 30 minutes, while mapreduce runs every 15. I cannot tell yet whether the memory is actually no longer in use or whether something is not being reported correctly. With inserts you will see a spike in CP, I/O, and network utilization; with mapreduce you will see a spike in CP utilization. Can you offer any insight and suggest how to debug this further?

What is strange is that this has not been consistent: I have one run happening now on KVM where I am not seeing the dips, and I had a run on z/VM where they were happening, whereas previously memory was steady on z/VM and the dips occurred on KVM, so I don't know what the difference is yet. It may be pod placement, which I will see if I can experiment with further. I will attach some screen shots of my metrics from both KVM and z/VM.

While watching the current run I noticed that on some refreshes a dip may disappear and then reappear on the next refresh, both in Grafana and on the OCP Console dashboard. I am not sure why that would be happening. I will include a shot of that as well.

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-fc.3
Server Version: 4.8.0-fc.3
Kubernetes Version: v1.21.0-rc.0+291e731

How reproducible:
Not sure yet. If my workload is causing the issue, then you would need something that creates spikes of CP, network, and I/O utilization every 15 to 30 minutes.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
I would expect the memory usage to be fairly steady, especially on my mongodb pod. I would also expect the refreshed data in Grafana and the OCP dashboard to be reported consistently.

Additional info:
Please let me know what info I can pull or anything else I can do for more debug. Thank you for any guidance!
Created attachment 1784203 [details] screen shots of dashboard and grafana ui
@jhusta could you perhaps check whether the memory metrics are reported steadily? Do you see any dips when running the following query?

```
sum(container_memory_working_set_bytes{pod="mongodb-XXXXXXXX"}) by (container)
```

If you are seeing dips, do they affect all containers of the pod or only some of them? Also, if there are dips, could you check the kubelet logs and search for failures related to the container for which we are seeing dips in the metrics? In this particular case, my hunch would be that the kubelet can't get memory insights about the container during a mapreduce run.

From the screenshots you shared, I assumed that you are only seeing dips in the mongodb memory usage; can you please confirm that this is correct?
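If it helps distinguish missing samples from real drops, something along these lines could be compared against the graph above (just a sketch, reusing the mongodb-XXXXXXXX placeholder; the 5m window is an arbitrary choice):

```
# Number of working-set samples Prometheus stored per series over the last 5m.
# If this drops while memory appears to dip, the scrapes are missing rather
# than the memory usage actually falling.
count_over_time(container_memory_working_set_bytes{pod="mongodb-XXXXXXXX"}[5m])
```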
Thank you @dgrisonn I started a couple of runs today and this is what I am seeing:
- On my mapreduce application I see a dip on the mongodb and mgodatagen containers. I am still watching additional runs to check for consistency.
- This does not seem to be limited to just my mapreduce app, as I am seeing it on other pods in the cluster, for example tuned; I have attached a screen shot.
- The memory dips in some of the projects are not at the same time, so I will see dips at different times for different projects.
- Finally, I got access to a newly installed z/VM environment that has nothing running on it, and I captured a pod in the openshift-apiserver namespace with one of its containers dipping.

I am working through the logs to see if anything stands out, but it is starting to look like it cannot pull the data from the container intermittently for some of the pods in the cluster. I will attach my current screen shots and will update with any log info.
Created attachment 1784861 [details] new screen shots with requested metrics on mapreduce project
> it cannot pull the data from the container intermittently for some of the pods in the cluster.

To me it looks like this bug isn't actually caused by the monitoring stack; the issue lies in the kubelet not being able to get the memory usage of some containers consistently. Thus, I am sending this bug over to the Node team to investigate further.
It might be a duplicate of bug 1950993.
@jhusta can you ask the customer to send us some screenshots of the metric `container_scrape_error` from Prometheus? The dips in Grafana appear to occur because missing data is rendered as the value 0 or connected across the gap. This can be configured to instead render null (no value) when data is missing: https://grafana.com/docs/grafana/latest/panels/visualizations/graph-panel/#stacking-and-null-value
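For reference, a query along these lines should show the scrape errors per node over time (a sketch; the assumption that Prometheus attaches the usual instance label to the kubelet/cAdvisor target is mine):

```
# container_scrape_error is exposed by cAdvisor: 1 if container stats could not
# be collected for that scrape, 0 otherwise. Grouping by instance shows which
# nodes are affected and when.
max by (instance) (container_scrape_error)
```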
Hi @ehashman, I pulled container_scrape_error from both a z/VM and a KVM environment. Both are running the workload, which does not seem to be a factor, but I also posted the memory usage for that pod.
Created attachment 1785356 [details] container scrape error
@ehashman Hi there! Regarding the request for more info, what else do you need besides the container_scrape_error screen shot I added in my last post? Thanks!
@jhusta the screenshot you attached is for an AWS node. However, the bug as filed says it is for an s390x cluster, and you referenced z/VMs in your report. I need you to check this metric on the s390x cluster where the issue was reported, as you appear to have attached a duplicate of the screenshot from https://bugzilla.redhat.com/show_bug.cgi?id=1950993
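As a cross-check once you have that data, something like the following could show whether the memory gaps line up with scrape errors on the node hosting the pod (only a sketch; it assumes both series carry a matching instance label and reuses the placeholder pod name from earlier):

```
# Working-set usage for the pod, kept only at times when the node's cAdvisor
# scrape reported no error; gaps that remain and match the dips would suggest
# the dips coincide with failed scrapes.
sum by (container) (
  container_memory_working_set_bytes{pod="mongodb-XXXXXXXX"}
  and on (instance) (container_scrape_error == 0)
)
```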
Created attachment 1787268 [details] s390x scrape error @ehashman My apologies, I must have somehow mixed up the files when investigating the other bug. Please see the new attachment, which has samples from both z/VM and z/KVM. There is a new run on z/KVM if you need additional information.
Not all the timestamps in the screenshots line up, but for those that do, it appears this is a duplicate of bug 1950993. Metrics aren't available because the container is not getting scraped. As I mentioned earlier, the customer may want to configure missing data to render as null rather than as 0.

*** This bug has been marked as a duplicate of bug 1950993 ***