Description of problem:
An init container gets stuck forever even though its script finishes. One of the initContainers, config-init, was created 55 minutes ago and has been stuck since. We initially suspected the init container itself was hung, but the customer's app team analyzed the init container log and found it had finished running (the container printed its final log line).

Pod events:
55m  Normal   Created     pod/po-cran3-rcpsdl-2  Created container config-init
55m  Normal   Started     pod/po-cran3-rcpsdl-2  Started container config-init
50s  Warning  FailedSync  pod/po-cran3-rcpsdl-2  error determining status: status.CreatedAt is not set

Version-Release number of selected component (if applicable):
4.10.10

How reproducible:
Quite often in the customer's environment; several times per day, as they are doing rapid redeployment testing.

Steps to Reproduce:
1. Delete and redeploy the customer's pod
2. Sometimes one of the pods will get stuck in the Init status
3. The issue can hit random application pods, even pods with different initContainer images and commands

Actual results:

Expected results:

Additional info:
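For triage on the node, CRI-O's own view of the container can be compared against what the kubelet reports. A minimal sketch, assuming direct node access and that the stuck init container is named config-init as in this report:

```shell
# On the affected node: list all containers, including exited init
# containers, filtered by name ("config-init" is from this report;
# substitute your own init container name).
sudo crictl ps -a --name config-init

# The kubelet error complains that status.CreatedAt is not set, so check
# which timestamps CRI-O actually reports for the container.
sudo crictl inspect "$(sudo crictl ps -aq --name config-init | head -n1)" \
  | grep -E '"createdAt"|"startedAt"|"finishedAt"|"state"'
```

If CRI-O shows the container as exited with valid timestamps while the kubelet keeps logging "status.CreatedAt is not set", the inconsistency is on the kubelet/CRI-O status path rather than in the init container itself.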
Can you attach a must-gather or an sosreport from a node on which an affected pod is scheduled?
Hi Ryan,

Thank you for your reply.

> I suspect the pod memory limits are set too low. Can you request the CU to raise the pod memory limits?

Do you mean the memory limits of the initContainer? The "config-init" initContainer in the cran2/po-cran2-rcpsdl-0 pod does have a low limit, 50Mi. However, this issue also happens on other pods, for example pod="cran1/po-cran1-rcpfm-0" containerName="alarmdata-init-1", as the following log shows:

Jun 02 16:40:52 hz25f-pz-10-106-244-34-master01 hyperkube[8356]: E0602 16:40:52.980366 8356 cpu_manager.go:470] "ReconcileState: failed to update container" err="rpc error: code = Unknown desc = updating resources for container \"3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28\" failed: time=\"2022-06-02T16:40:52+08:00\" level=warning msg=\"Setting back cgroup configs failed due to error: Unit crio-3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28.scope not found., your state.json and actual configs might be inconsistent.\"\ntime=\"2022-06-02T16:40:52+08:00\" level=error msg=\"Unit crio-3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28.scope not found.\"\n (exit status 1)" pod="cran1/po-cran1-rcpfm-0" containerName="alarmdata-init-1" containerID="3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28" cpuSet="0-63"

Checking the alarmdata-init-1 limits, it has 512Mi. Do you think 512Mi is still too low for them?

  name: alarmdata-init-1
  resources:
    limits:
      cpu: 100m
      memory: 512Mi
    requests:
      cpu: 50m
      memory: 512Mi

Also, I thought that if memory usage exceeded the memory limit, the OOM killer would kill the pod and there would be logs indicating this, but I couldn't find any OOM messages in the journal.

Best Regards,
Chen
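To confirm or rule out OOM kills independently of the pod events, the kernel log can be checked directly on the node. A sketch, assuming a cgroup v1 node with the usual kubepods.slice layout (the paths and time window here are assumptions, not taken from this report):

```shell
# OOM kills are logged by the kernel, not by the container runtime, so
# search the kernel journal on the node the pod ran on.
sudo journalctl -k --since "2 hours ago" | grep -iE 'out of memory|oom-kill|killed process'

# On cgroup v1 with a reasonably recent kernel, the memory controller also
# keeps a per-cgroup oom_kill counter; a non-zero value means a kill
# happened in that cgroup.
grep -H oom_kill /sys/fs/cgroup/memory/kubepods.slice/*/memory.oom_control 2>/dev/null
```

An empty result from both strengthens the case that the stuck init container is not an OOM problem, which matches the "Unit crio-....scope not found" runtime error above pointing at a runtime/cgroup state inconsistency instead.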
Hi Peter,

Today I had a remote session with the customer. There is one system that can reproduce the problem every time the application is deployed. Here is a brief summary of what we saw:

1. Remove the Released PVs
2. Deploy the application
3. The "CreatedAt is not set" error appears
4. Uninstall the application
5. Two of the PVs end up in Released status

Since the error suggests the kubelet cannot obtain the container status from CRI-O, we restarted the crio service:

$ sudo systemctl restart crio

After the restart, we no longer see the symptom. So, to find the root cause, what kind of information do we need? Is https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines enough?

Thank you so much!

Best Regards,
Chen
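For reference, the goroutine dump described in the linked tutorial can be triggered roughly like this while the node is hitting the problem (the dump file location is my assumption and can vary by CRI-O version, so also check the crio journal):

```shell
# Ask crio to dump its goroutine stacks by sending SIGUSR1 (this is the
# mechanism the CRI-O debugging tutorial linked above describes).
sudo kill -USR1 "$(pgrep -x crio)"
sleep 1

# Recent CRI-O versions write the stacks to a file under /tmp; fall back
# to the crio journal if no file appears.
ls -l /tmp/crio-goroutine-stacks-*.log 2>/dev/null \
  || sudo journalctl -u crio --since "1 minute ago" | tail -n 50
```

Capturing this before restarting crio matters: the restart clears the wedged state, so the stacks are the main evidence of where crio was blocked.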
*** Bug 2096179 has been marked as a duplicate of this bug. ***
Yeah, the goroutine stacks should be enough (assuming the node is currently hitting the problem).
Another thing I've just thought of: it could be useful to set CRI-O to the debug log level with a ContainerRuntimeConfig [1] and collect the logs.

1: https://docs.openshift.com/container-platform/4.7/post_installation_configuration/machine-configuration-tasks.html#create-a-containerruntimeconfig_post-install-machine-configuration-tasks
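A minimal ContainerRuntimeConfig sketch for this, assuming the affected nodes are in the worker pool (the resource name and pool label below are assumptions; adjust the selector to match the relevant MachineConfigPool):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: crio-debug-loglevel
spec:
  machineConfigPoolSelector:
    matchLabels:
      # Assumption: affected nodes are in the worker pool.
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug
```

Note that applying this rolls out a new MachineConfig to the selected pool, which reboots or restarts the nodes, so the debug logs will only cover reproductions after the rollout.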
*** Bug 2099942 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days