Bug 2094865
| Summary: | INIT container stuck forever |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Node |
| Node sub component: | CRI-O |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | urgent |
| Version: | 4.10 |
| Target Release: | 4.12.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | cri-o-1.25.0-25.rhaos4.12.git20f639c.el8 |
| Doc Type: | No Doc Update |
| Reporter: | Chen <cchen> |
| Assignee: | Peter Hunt <pehunt> |
| QA Contact: | Sunil Choudhary <schoudha> |
| CC: | ahoness, cgaynor, dblack, eglottma, nagrawal, openshift-bugs-escalate, pehunt, rphillips, yasingh |
| Clones: | 2115435 (view as bug list) |
| Last Closed: | 2023-01-17 19:49:58 UTC |
| Type: | Bug |
| Bug Blocks: | 2115435, 2117346 |
Description
Chen
2022-06-08 13:47:17 UTC
> can you attach a must-gather or sos_report from a node on which an affected pod is scheduled?

Hi Ryan,

Thank you for your reply.
> I suspect the pod memory limits are set too low. Can you request the CU to raise the pod memory limits?

Do you mean the memory limits of the initContainer?

The "config-init" initContainer in the cran2/po-cran2-rcpsdl-0 pod does have a low limit (50Mi). However, this issue also happens on other pods, for example pod="cran1/po-cran1-rcpfm-0" containerName="alarmdata-init-1", as the following log shows:
```
Jun 02 16:40:52 hz25f-pz-10-106-244-34-master01 hyperkube[8356]: E0602 16:40:52.980366 8356 cpu_manager.go:470] "ReconcileState: failed to update container" err="rpc error: code = Unknown desc = updating resources for container \"3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28\" failed: time=\"2022-06-02T16:40:52+08:00\" level=warning msg=\"Setting back cgroup configs failed due to error: Unit crio-3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28.scope not found., your state.json and actual configs might be inconsistent.\"\ntime=\"2022-06-02T16:40:52+08:00\" level=error msg=\"Unit crio-3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28.scope not found.\"\n (exit status 1)" pod="cran1/po-cran1-rcpfm-0" containerName="alarmdata-init-1" containerID="3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28" cpuSet="0-63"
```
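The error in the log says the transient systemd scope for the container no longer exists. As a hedged sketch (not from the bug report, and the `systemctl` invocations are only illustrative), one way to check this on the affected node is:

```shell
# Full container ID taken from the kubelet log line above.
CONTAINER_ID=3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28
# On the node itself (requires systemd access) you could run:
#   systemctl status "crio-${CONTAINER_ID}.scope"
#   systemctl list-units --all 'crio-*.scope'
# The unit name the runtime is looking for is constructed like this:
echo "crio-${CONTAINER_ID}.scope"
```

If the scope is gone while CRI-O still tracks the container, that would match the "state.json and actual configs might be inconsistent" warning.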
Checking the alarmdata-init-1 limits, it has 512Mi. Do you think 512Mi is still too low?
```yaml
name: alarmdata-init-1
resources:
  limits:
    cpu: 100m
    memory: 512Mi
  requests:
    cpu: 50m
    memory: 512Mi
```
Also, I thought that if memory usage exceeded the memory limit, the OOM killer would kill the container and there would be logs indicating this, but I couldn't find any OOM logs in the journal.
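As a hedged sketch (not part of the original report; the time window and commands are illustrative), OOM-killer activity can be confirmed or ruled out on the node like this:

```shell
# On the node you would search the kernel log around the failure window:
#   journalctl -k --since "2022-06-02 16:30" | grep -iE 'oom|out of memory'
#   dmesg | grep -iE 'oom|out of memory'
# The grep pattern, demonstrated here against a sample kernel OOM line:
printf 'kernel: Out of memory: Killed process 1234 (alarmdata-init)\n' \
  | grep -iE 'oom|out of memory'
```

An absence of such lines would support the observation that the container was not OOM-killed.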
Best Regards,
Chen
Hi Peter,

Today I had a remote session with the customer. There is one system that can reproduce the problem every time the application is deployed. Here is a brief summary of what we saw:

1. Remove Released PVs
2. Deploy the application
3. We see the "CreatedAt is not set" error
4. Uninstall the application
5. Two of the PVs remain in Released status

Since the error suggests the kubelet cannot verify container status from CRI-O, we restarted the crio service:

$ sudo systemctl restart crio

After the restart, we no longer see the symptom. So, in order to find the root cause, what kind of information do we need? Is https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines enough?

Thank you so much!

Best Regards,
Chen

*** Bug 2096179 has been marked as a duplicate of this bug. ***

Yeah, the goroutine stacks should be enough (assuming the node is currently hitting the problem). Another thing I've just thought of: it could be useful to set CRI-O to debug log level with a ContainerRuntimeConfig [1] and get the logs.

1: https://docs.openshift.com/container-platform/4.7/post_installation_configuration/machine-configuration-tasks.html#create-a-containerruntimeconfig_post-install-machine-configuration-tasks

*** Bug 2099942 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
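For reference, the debug-log-level ContainerRuntimeConfig suggested in the discussion above can be sketched roughly as follows (a hedged example based on the linked documentation; the object name and the worker-pool selector label are illustrative and should be matched to the affected nodes' MachineConfigPool):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: crio-debug-loglevel   # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      # adjust to the pool containing the affected nodes
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug
```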