Bug 1857676
| Summary: | [BUG] Constant system crash due to OOM events on all masters. Kube-api and etcd seem to be consuming huge amounts of CPU and Memory | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Andre Costa <andcosta> | ||||
| Component: | RHCOS | Assignee: | Colin Walters <walters> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 4.4 | CC: | bbreard, imcleod, jligon, nstielau, sbatsche, sttts | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.6.0 | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2020-07-17 10:39:03 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
The kubelet logs are good for signaling that OOM kills are happening, but we would like the journals from the affected nodes for additional triage. Setting NEEDINFO on Stefan and Sam to see whether they have any guesses based on the provided information, or requests for additional information. Keeping high priority for now and targeting 4.6.
Created attachment 1701362 [details]
Nodes_info

Description of problem:
Customer is seeing their masters crash randomly due to OOMKilled events. The kube-apiserver and etcd processes seem to be consuming an unusual amount of CPU and memory, but it is not clear whether this is an OS-level issue. After a reboot, the nodes temporarily return to normal. Uploading some logs gathered from the nodes; not sure whether kdump can be used on RHCOS to gather more information about what may be causing these constant crashes.

Version-Release number of selected component (if applicable):
OCP 4.4.12

How reproducible:
Constantly, in the customer's environment

Steps to Reproduce:
Unknown
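
For triaging journals from the affected masters, a minimal Python sketch that tallies which processes the kernel OOM killer reported killing may help confirm whether kube-apiserver and etcd are the usual victims. It assumes the kernel messages contain a "Killed process <pid> (<name>)" fragment; the exact wording varies between kernel versions, so the pattern may need adjusting. The function name `count_oom_victims` and the script name below are illustrative and not part of any existing tool.

```python
import re
import sys
from collections import Counter

# Assumed message shape: "... Killed process 1234 (etcd) total-vm:...".
# Kernel versions differ in wording; adjust the pattern if nothing matches.
OOM_KILL_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)")


def count_oom_victims(journal_lines):
    """Return a Counter mapping process name to number of OOM kills seen."""
    victims = Counter()
    for line in journal_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            victims[match.group("comm")] += 1
    return victims


if __name__ == "__main__":
    # Reads journal text from stdin and prints victims, most frequent first.
    for comm, kills in count_oom_victims(sys.stdin).most_common():
        print(f"{kills:5d}  {comm}")
```

One possible use is to save the kernel journal from a node (for example with `journalctl -k`) and pipe the saved text into the script on a workstation, e.g. `python3 oom_victims.py < master-0-kernel.log`, where both file names are placeholders.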