Bug 1961395 - Seeing memory utilization stop on my workload pods and services pods from time to time
Summary: Seeing memory utilization stop on my workload pods and services pods from time to time
Keywords:
Status: CLOSED DUPLICATE of bug 1950993
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Elana Hashman
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-17 19:53 UTC by jhusta
Modified: 2021-06-01 18:56 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-01 18:56:57 UTC
Target Upstream Version:
Embargoed:


Attachments
screen shots of dashboard and grafana ui (775.38 KB, application/pdf)
2021-05-17 19:56 UTC, jhusta
new screen shots with requested metrics on mapreduce project (172.57 KB, application/zip)
2021-05-19 15:30 UTC, jhusta
container scrape error (78.84 KB, image/png)
2021-05-20 20:25 UTC, jhusta
s390x scrape error (323.27 KB, application/pdf)
2021-05-26 13:42 UTC, jhusta

Description jhusta 2021-05-17 19:53:46 UTC
Description of problem:
I am not sure what is causing this issue, so I will set severity to low for now until we can debug further. But I am continuing to see major dips in memory utilization on pods on my cluster. I see it on different pods throughout the cluster, including pods outside of my mapreduce project.

Some info as to what is running:
I have a workload running that consists of:
1. a datagen pod, which does about 2 GB of inserts to a mongodb db/collection
2. the mongodb pod
3. a mapreduce pod, which performs mapreduce queries against the three different collections in mongodb in parallel

The inserts happen every 30 minutes while mapreduce runs every 15. I cannot tell yet whether the memory is actually no longer in use or whether something is not being reported correctly.
With inserts you will see a spike in CP, I/O, and network utilization; with mapreduce you will see a spike in CP utilization.

Can you offer any insight on how to debug this further? What is strange is that this has not been consistent: I have one run happening now on KVM where I am not seeing the dips, and I had a run on z/VM where they were happening. Prior to that I had seen it solid on z/VM and the dips occurring on KVM, so I don't know what the difference is yet. Maybe pod placement, which I will see if I can play with further.

I will attach some screen shots of my metrics from both KVM and z/VM

On my current watch I noticed that on some refreshes a dip may disappear and then reappear on the next refresh, both in grafana and on the OCP Console Dashboard. I am not sure why that would be happening. I will include a shot of that as well.

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-fc.3
Server Version: 4.8.0-fc.3
Kubernetes Version: v1.21.0-rc.0+291e731


How reproducible:
Not sure yet. If it is my workload causing the issue, then you would need something that creates spikes of CP, network, and I/O utilization every 15 to 30 minutes.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
I would expect the memory to be somewhat steady, especially on my mongodb pod.
I would also expect the refresh data for grafana and the OCP dashboard to be reported consistently.


Additional info:
Please let me know what info I can pull or anything else I can do for more debug. Thank you for any guidance!

Comment 1 jhusta 2021-05-17 19:56:22 UTC
Created attachment 1784203 [details]
screen shots of dashboard and grafana ui

Comment 2 Damien Grisonnet 2021-05-19 08:07:35 UTC
@jhusta could you perhaps check whether the memory metrics are reported steadily? Do you see any dips when running the following query:

```
sum(container_memory_working_set_bytes{pod="mongodb-XXXXXXXX"}) by (container)
```

If you are seeing dips, do they affect all containers of the pod or only some of them?

Also, if there are dips, could you check the kubelet logs and search for failures related to the containers for which we are seeing dips in the metrics? In this particular case, my hunch would be that the kubelet can't get memory insights about the container during a mapreduce run.
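
A minimal sketch of how to pull those logs on OpenShift (assuming cluster-admin access; the grep pattern is only a starting point):

```
# List the nodes, then stream the kubelet journal from the node hosting the
# affected pod and search for stats/cAdvisor failures:
oc get nodes
oc adm node-logs <node-name> -u kubelet | grep -iE 'stats|cadvisor|memory'
```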

From the screenshots you shared, I assumed that you are only seeing dips in the mongodb memory usage; can you please confirm that this is correct?

Comment 3 jhusta 2021-05-19 15:28:15 UTC
Thank you @dgrisonn, I started a couple of runs today and this is what I am seeing:
- On my mapreduce application I see a dip on the mongodb and mgodatagen containers; I am still watching additional runs to check for consistency.
- This does not seem to be limited to just my mapreduce app, as I am seeing it on other pods on the cluster, for example tuned; I have attached a screen shot.
- The memory dips in some of the projects are not at the same time, so I see dips at different times for different projects.
- Finally, I was able to get access to a newly installed z/VM environment that has nothing running on it, and I captured a pod in the openshift-apiserver namespace with one of its containers dipping.

I am working on going through the logs to see if anything stands out, but it is starting to look like it cannot pull the data from the containers intermittently for some of the pods in the cluster.

I will attach my current screen shots and will update with any log info.

Comment 4 jhusta 2021-05-19 15:30:41 UTC
Created attachment 1784861 [details]
new screen shots with requested metrics on mapreduce project

Comment 5 Damien Grisonnet 2021-05-20 14:06:07 UTC
> it cannot pull the data from the containers intermittently for some of the pods in the cluster.

To me it looks like this bug isn't actually caused by the monitoring stack; rather, the issue lies in the kubelet not being able to get the memory usage of some containers consistently.

Thus, I am sending this bug over to the Node team to further investigate.

Comment 6 Simon Pasquier 2021-05-20 15:54:04 UTC
It might be a duplicate of bug 1950993.

Comment 7 Elana Hashman 2021-05-20 16:53:45 UTC
@jhusta can you ask the customer to send us some screenshots of the metric `container_scrape_error` from Prometheus?
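
A query along these lines in the Prometheus UI should surface them (a sketch; the label set on this metric may vary by version):

```
# cAdvisor sets this metric to 1 when it hit an error collecting container stats
container_scrape_error == 1

# or, as a per-node overview over time:
max by (instance) (container_scrape_error)
```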

The dips in Grafana appear to be because missing data is rendered as the value 0 or connected across the gap. This can be configured to instead render null (no value) when data is missing: https://grafana.com/docs/grafana/latest/panels/visualizations/graph-panel/#stacking-and-null-value
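
In the graph panel JSON this corresponds to the `nullPointMode` field; a minimal sketch (exact placement depends on the Grafana version):

```
{
  "type": "graph",
  "nullPointMode": "null"
}
```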

Comment 8 jhusta 2021-05-20 20:24:53 UTC
Hi @ehashman I pulled container_scrape_error from a z/VM and a KVM environment. Both are running the workload, which does not seem to be a factor, but I also posted the memory usage on that pod.

Comment 9 jhusta 2021-05-20 20:25:33 UTC
Created attachment 1785356 [details]
container scrape error

Comment 11 jhusta 2021-05-25 13:57:48 UTC
@ehashman Hi there! Regarding the need-more-info: what else do you need besides the container_scrape_error screen shot, which I added in my last post? Thanks!

Comment 12 Elana Hashman 2021-05-25 18:31:55 UTC
@jhusta the screenshot you attached is for an AWS node. However, the bug as filed says it is for an s390x cluster, and you referenced z/VM in your report. I need you to check this metric on the s390x cluster where the issue was reported, as you appear to have attached a duplicate of the screenshot from https://bugzilla.redhat.com/show_bug.cgi?id=1950993

Comment 13 jhusta 2021-05-26 13:42:23 UTC
Created attachment 1787268 [details]
s390x scrape error

@ehashman My apologies, I must have somehow gotten the files mixed up when investigating the other bug. Please see the new attachment, which has samples from both z/VM and z/KVM. There is a new run on z/KVM if you need additional information.

Comment 14 Elana Hashman 2021-06-01 18:56:57 UTC
Not all the timestamps in the screenshots line up, but for those that do, it appears this is a duplicate of bug 1950993. Metrics aren't available because the container is not getting scraped.

As I mentioned earlier, the customer may want to configure missing data to display as null rather than have it render as 0.

*** This bug has been marked as a duplicate of bug 1950993 ***

