Bug 1961395
| Summary: | Seeing memory utilization stop on my workload pods and services pods from time to time | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | jhusta <jhusta> |
| Component: | Node | Assignee: | Elana Hashman <ehashman> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | low | | |
| Priority: | unspecified | CC: | alegrand, alklein, anpicker, aos-bugs, brueckner, dgrisonn, erooth, Holger.Wolf, jhusta, kakkoyun, krmoser, lcosic, pkrupa, rphillips, spasquie, wolfgang.voesch |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-01 18:56:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (jhusta, 2021-05-17 19:53:46 UTC)
Created attachment 1784203 [details]
Screenshots of dashboard and Grafana UI
@jhusta could you perhaps check whether the memory metrics are reported steadily? Do you see any dips when running the following query:
```
sum(container_memory_working_set_bytes{pod="mongodb-XXXXXXXX"}) by (container)
```
If you are seeing any dips, does it affect all containers of the pod or only some of them?
Also, if there are dips, could you check the kubelet logs and search for failures related to the container for which we are seeing dips in the metrics? In this particular case, my hunch would be that the kubelet can't get memory stats for the container while a mapreduce is running.
From the screenshots you shared, I assumed that you are only seeing dips for the mongodb memory usage, can you please confirm that this is correct?
Thank you @dgrisonn. I started a couple of runs today and this is what I am seeing:
- On my mapreduce application I see a dip on the mongodb and mgodatagen containers. I am still watching additional runs to check for consistency.
- This does not seem to be limited to just my mapreduce app, as I am seeing it on other pods on the cluster, for example tuned. I have attached a screenshot.
- The memory dips on some of the projects do not occur at the same time, so I see dips at different times for different projects.
- Finally, I was able to get access to a newly installed z/VM environment that has nothing running on it, and I captured one of the containers in the openshift-apiserver namespace dipping.

I am working through the logs to see if anything stands out, but it is starting to look like the data cannot be pulled from the containers intermittently across some of the pods in the cluster. I will attach my current screenshots and will update with any log info.

Created attachment 1784861 [details]
new screen shots with requested metrics on mapreduce project
> it cannot pull the data from the container intermittently throughout some of the pods in the cluster.
To me it looks like this bug isn't actually caused by the monitoring stack; rather, the issue lies in the kubelet not being able to get the memory usage of some containers consistently.
Thus, I am sending this bug over to the Node team to further investigate.
It might be a duplicate of bug 1950993. @jhusta can you ask the customer to send us some screenshots of the metric `container_scrape_error` from Prometheus? The dips in Grafana appear to occur because missing data is rendered as the value 0 or connected. This can be configured to instead render null (no value) when data is missing: https://grafana.com/docs/grafana/latest/panels/visualizations/graph-panel/#stacking-and-null-value

Hi @ehashman, I pulled container_scrape_error from a z/VM and a KVM environment. Both are running the workload, which does not seem to be a factor, but I posted the memory usage on that pod as well.

Created attachment 1785356 [details]
container scrape error
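For anyone reproducing this check, a query along these lines can surface the scrape failures over the affected window (a sketch only; `container_scrape_error` is the cAdvisor metric named in the comments, but the exact label set available depends on the kubelet/cAdvisor version, so verify against your environment):

```
# container_scrape_error is 1 when cAdvisor failed to collect container stats
# for a node during a scrape, 0 otherwise; group by instance to see which
# nodes are affected and use max_over_time to catch intermittent failures
max by (instance) (max_over_time(container_scrape_error[1h]))
```

Values of 1 lining up with the dips in `container_memory_working_set_bytes` would support the duplicate diagnosis below rather than an actual drop in memory usage.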
@ehashman Hi there! Regarding the needinfo, what else do you need besides the container_scrape_error screenshot which I added in my last post? Thanks!

@jhusta the screenshot you attached is for an AWS node. However, the bug as filed says it is for an s390x cluster, and you referenced z/VMs in your report. I need you to check this metric on the s390x cluster where the issue was reported, as you appear to have attached a duplicate of the screenshot from https://bugzilla.redhat.com/show_bug.cgi?id=1950993

Created attachment 1787268 [details]
s390x scrape error
@ehashman My apologies, I must have somehow gotten the files mixed up when investigating the other bug. Please see the new attachment, which has samples from both z/VM and z/KVM. There is a new run on z/KVM if you need additional information.
Not all the timestamps in the screenshots line up, but for those that do, it appears this is a duplicate of bug 1950993. Metrics aren't available because the container is not getting scraped. As I mentioned earlier, the customer may want to configure missing data to render as null rather than as 0.

*** This bug has been marked as a duplicate of bug 1950993 ***
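As a sketch of the Grafana setting recommended above (field name per the classic graph panel JSON model; newer Grafana versions expose this as "Connect null values" in the panel options, so verify against the version in use), the panel can be told to leave gaps for missing samples instead of drawing 0 or connecting across them:

```json
{
  "type": "graph",
  "nullPointMode": "null"
}
```

With this setting, scrape failures show up as visible gaps in the graph rather than as misleading dips to zero.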