Bug 2094865

Summary: INIT container stuck forever
Product: OpenShift Container Platform
Reporter: Chen <cchen>
Component: Node
Sub Component: CRI-O
Assignee: Peter Hunt <pehunt>
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: urgent
CC: ahoness, cgaynor, dblack, eglottma, nagrawal, openshift-bugs-escalate, pehunt, rphillips, yasingh
Version: 4.10
Target Milestone: ---
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: cri-o-1.25.0-25.rhaos4.12.git20f639c.el8
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Clones: 2115435 (view as bug list)
Environment:
Last Closed: 2023-01-17 19:49:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2115435, 2117346

Description Chen 2022-06-08 13:47:17 UTC
Description of problem:

Init container stuck forever even though it has finished its script.

One of the initContainers is called config-init. It was created 55 minutes ago and has been stuck ever since. We suspected the init container might be hanging somewhere, but the customer's application team analyzed the init container log and found that the container had actually finished running, since it printed its final log message.

55m        Normal   Created                 pod/po-cran3-rcpsdl-2                                      Created container config-init
55m        Normal   Started                 pod/po-cran3-rcpsdl-2                                      Started container config-init
50s        Warning  FailedSync              pod/po-cran3-rcpsdl-2                                      error determining status: status.CreatedAt is not set

Version-Release number of selected component (if applicable):

4.10.10

How reproducible:

Quite often in the customer's environment
Several times per day, as they are doing rapid redeployment testing

Steps to Reproduce:
1. Delete and redeploy the customer's pods (see the sketch after this list)
2. Sometimes one of the pods gets stuck in Init status
3. The issue can hit a random application pod, even pods with different initContainer images and commands
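
A minimal detection sketch, assuming hypothetical namespace and pod names (the real namespace and workload names come from the customer's application):

# Hypothetical namespace; substitute the customer's values.
$ NS=<namespace>

# Delete the pod so its controller recreates it.
$ oc -n "$NS" delete pod po-cran3-rcpsdl-2

# Watch for a pod that stays in Init:* far longer than expected.
$ oc -n "$NS" get pods -w | grep --line-buffered 'Init:'

# Cross-check the events for the error seen in this bug.
$ oc -n "$NS" get events --field-selector reason=FailedSync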

Actual results:


Expected results:


Additional info:

Comment 1 Peter Hunt 2022-06-08 13:51:57 UTC
can you attach a must-gather or sos_report from a node on which an affected pod is scheduled?
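
For reference, a rough sketch of collecting those (the node name is a placeholder; use the node where the affected pod is scheduled):

# Cluster-level must-gather.
$ oc adm must-gather --dest-dir=./must-gather

# Node-level sos report: open a debug shell on the affected node,
# then run sos from the toolbox container.
$ oc debug node/<node-name>
# inside the debug pod:
# chroot /host
# toolbox
# sos report   (or "sosreport" on older toolbox images)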

Comment 5 Chen 2022-06-10 01:20:18 UTC
Hi Ryan,

Thank you for your reply.

>I suspect the pod memory limits are set too low. Can you request the CU to raise the pod memory limits?

Do you mean the memory limits of the initContainer?

The "config-init" initContainer in the cran2/po-cran2-rcpsdl-0 pod does indeed have a low limit, 50Mi. However, the issue also happens on other pods, for example pod="cran1/po-cran1-rcpfm-0" containerName="alarmdata-init-1", as the following log shows:

Jun 02 16:40:52 hz25f-pz-10-106-244-34-master01 hyperkube[8356]: E0602 16:40:52.980366    8356 cpu_manager.go:470] "ReconcileState: failed to update container" err="rpc error: code = Unknown desc = updating resources for container \"3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28\" failed: time=\"2022-06-02T16:40:52+08:00\" level=warning msg=\"Setting back cgroup configs failed due to error: Unit crio-3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28.scope not found., your state.json and actual configs might be inconsistent.\"\ntime=\"2022-06-02T16:40:52+08:00\" level=error msg=\"Unit crio-3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28.scope not found.\"\n  (exit status 1)" pod="cran1/po-cran1-rcpfm-0" containerName="alarmdata-init-1" containerID="3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28" cpuSet="0-63"

Checking the alarmdata-init-1 limits, it has 512Mi. Do you think 512Mi is still too low?

    name: alarmdata-init-1
    resources:
      limits:
        cpu: 100m
        memory: 512Mi
      requests:
        cpu: 50m
        memory: 512Mi

Also, I thought that if memory usage exceeded the memory limit, the OOM killer would kill the container and there would be log entries indicating this, but I couldn't find any OOM messages in the journal.
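
For completeness, this is roughly how I looked for OOM evidence (the namespace is a placeholder; the patterns are just what I would expect from the kernel and kubelet):

# Kernel OOM killer messages on the node.
$ journalctl -k | grep -iE 'out of memory|oom'

# Kubernetes events mentioning OOM in the affected namespace.
$ oc -n <namespace> get events | grep -i oom

# Last state of the init containers; this would show OOMKilled if one was killed.
$ oc -n <namespace> get pod po-cran1-rcpfm-0 -o jsonpath='{.status.initContainerStatuses[*].lastState}'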

Best Regards,
Chen

Comment 7 Chen 2022-06-21 08:56:56 UTC
Hi Peter,

Today I had a remote session with the customer. There is one system that reproduces the problem every time the application is deployed. Here is a brief summary of what we saw:

1. Remove the Released PVs
2. Deploy the application
3. We see the "CreatedAt is not set" error
4. Uninstall the application
5. Two of the PVs end up in Released status

Since the error suggests that the kubelet cannot determine the container status from CRI-O, we restarted the crio service.

$ sudo systemctl restart crio

After the restart, we no longer see the symptom.

So, in order to find the root cause, what kind of information do we need? Is https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines enough?

Thank you so much!

Best Regards,
Chen

Comment 8 Peter Hunt 2022-06-21 14:13:15 UTC
*** Bug 2096179 has been marked as a duplicate of this bug. ***

Comment 9 Peter Hunt 2022-06-21 14:17:36 UTC
yeah the goroutine stacks should be enough (assuming the node is currently hitting the problem)
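
For reference, roughly what collecting them looks like per that tutorial (please double-check the signal and output path against the doc, as they may differ between versions):

# On the affected node: ask CRI-O to dump its goroutine stacks (the tutorial uses SIGUSR1).
$ sudo kill -USR1 "$(pgrep -x crio)"

# The dump should land in a file under /tmp on the node; grab the newest one.
$ ls -lt /tmp | head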

Comment 10 Peter Hunt 2022-06-21 20:08:23 UTC
another thing I've just thought of: it could be useful to set cri-o to debug log level with a ContainerRuntimeConfig [1] and get the logs from the affected node

1: https://docs.openshift.com/container-platform/4.7/post_installation_configuration/machine-configuration-tasks.html#create-a-containerruntimeconfig_post-install-machine-configuration-tasks
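
A minimal sketch of such a ContainerRuntimeConfig (the object name here is arbitrary and the pool selector targets workers; adjust it to the pool where the affected pods run):

$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: crio-debug-loglevel
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug
EOF

# once the MachineConfigPool finishes rolling out, pull the CRI-O logs, e.g.:
$ oc adm node-logs <node-name> -u crio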

Comment 18 Peter Hunt 2022-06-30 20:30:35 UTC
*** Bug 2099942 has been marked as a duplicate of this bug. ***

Comment 67 errata-xmlrpc 2023-01-17 19:49:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

Comment 68 Red Hat Bugzilla 2023-09-18 04:38:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days