Bug 2094865 - INIT container stuck forever
Summary: INIT container stuck forever
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 2096179, 2099942
Depends On:
Blocks: 2115435 2117346
 
Reported: 2022-06-08 13:47 UTC by Chen
Modified: 2023-09-18 04:38 UTC
CC List: 9 users

Fixed In Version: cri-o-1.25.0-25.rhaos4.12.git20f639c.el8
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2115435
Environment:
Last Closed: 2023-01-17 19:49:58 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github cri-o cri-o pull 6122 0 None open oci: take opLock for UpdateContainer 2022-08-03 19:32:46 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:50:17 UTC

Description Chen 2022-06-08 13:47:17 UTC
Description of problem:

An init container is stuck forever even though it has finished running its script.

One of the init containers is called config-init. It was created 55 minutes ago and has been stuck ever since. We suspected the init container might be hung somewhere, but the customer's application team analyzed the init container log and found that it had actually finished running, because the container had printed its final log line.

55m        Normal   Created                 pod/po-cran3-rcpsdl-2                                      Created container config-init
55m        Normal   Started                 pod/po-cran3-rcpsdl-2                                      Started container config-init
50s        Warning  FailedSync              pod/po-cran3-rcpsdl-2                                      error determining status: status.CreatedAt is not set

Version-Release number of selected component (if applicable):

4.10.10

How reproducible:

Quite often in the customer's environment: several times per day, as they are doing rapid redeployment testing.

Steps to Reproduce:
1. Delete and redeploy the customer's pod
2. Sometimes one of the pods gets stuck in the Init status
3. The issue can hit a random application pod, even pods with different initContainer images and commands

Actual results:


Expected results:


Additional info:

Comment 1 Peter Hunt 2022-06-08 13:51:57 UTC
can you attach a must-gather or sos_report from a node on which an affected pod is scheduled?
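
For reference, a must-gather can be collected roughly like this (the destination directory is only an example):

$ oc adm must-gather --dest-dir=/tmp/must-gather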

Comment 5 Chen 2022-06-10 01:20:18 UTC
Hi Ryan,

Thank you for your reply.

>I suspect the pod memory limits are set too low. Can you request the CU to raise the pod memory limits?

Do you mean the memory limits of the initContainer?

The "config-init" initContainer in cran2/po-cran2-rcpsdl-0 pod indeed has a low limits, which is 50Mi. However this issue also happens on other PODs and for example, pod="cran1/po-cran1-rcpfm-0" containerName="alarmdata-init-1" as following log shows:

Jun 02 16:40:52 hz25f-pz-10-106-244-34-master01 hyperkube[8356]: E0602 16:40:52.980366    8356 cpu_manager.go:470] "ReconcileState: failed to update container" err="rpc error: code = Unknown desc = updating resources for container \"3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28\" failed: time=\"2022-06-02T16:40:52+08:00\" level=warning msg=\"Setting back cgroup configs failed due to error: Unit crio-3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28.scope not found., your state.json and actual configs might be inconsistent.\"\ntime=\"2022-06-02T16:40:52+08:00\" level=error msg=\"Unit crio-3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28.scope not found.\"\n  (exit status 1)" pod="cran1/po-cran1-rcpfm-0" containerName="alarmdata-init-1" containerID="3e5a6ba24d38fb4ed823ecd8cfe9eb636e741291a02c1a713974b60771c75b28" cpuSet="0-63"

Checking the alarmdata-init-1 limits, it has 512Mi. Do you think 512Mi is still too low for them?

    name: alarmdata-init-1
    resources:
      limits:
        cpu: 100m
        memory: 512Mi
      requests:
        cpu: 50m
        memory: 512Mi

Also, I thought that if memory usage exceeded the memory limits, the OOM killer would kill the pod and there would be logs indicating this, but I couldn't find any OOM logs in the journal.
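
For reference, a quick way to confirm whether the kernel logged any OOM kills on that node is something like the following (the time window is illustrative):

$ journalctl -k --since "2 hours ago" | grep -iE "out of memory|oom-kill"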

Best Regards,
Chen

Comment 7 Chen 2022-06-21 08:56:56 UTC
Hi Peter,

Today I had a remote session with the customer. There is one system that reproduces the problem every time the application is deployed. Here is a brief summary of what we saw:

1. Remove the Released PV
2. Deploy the application
3. We see the "CreatedAt is not set" error
4. Uninstall the application
5. Two of the PVs end up in the Released status

Since the error suggests that the kubelet cannot determine the container status from CRI-O, we restarted the crio service:

$ sudo systemctl restart crio

After the restart, we don't see the symptom anymore.

So, in order to find the root cause, what information do we need? Is https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines enough?

Thank you so much!

Best Regards,
Chen

Comment 8 Peter Hunt 2022-06-21 14:13:15 UTC
*** Bug 2096179 has been marked as a duplicate of this bug. ***

Comment 9 Peter Hunt 2022-06-21 14:17:36 UTC
yeah the goroutine stacks should be enough (assuming the node is currently hitting the problem)
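
For reference, the tutorial linked in comment 7 collects the stacks by signalling the crio process, roughly like this (assuming the SIGUSR1 handler described in that tutorial; the dump is written to a temporary file on the node whose exact name may vary):

$ sudo kill -USR1 "$(pidof crio)"     # ask CRI-O to dump its goroutine stacks
$ sudo ls -lt /tmp | head             # illustrative: the dump lands in a temp file under TMPDIR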

Comment 10 Peter Hunt 2022-06-21 20:08:23 UTC
Another thing I've just thought of: it could be useful to set CRI-O to the debug log level with a ContainerRuntimeConfig [1] and gather the logs.

1: https://docs.openshift.com/container-platform/4.7/post_installation_configuration/machine-configuration-tasks.html#create-a-containerruntimeconfig_post-install-machine-configuration-tasks
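
For reference, a minimal sketch of such a ContainerRuntimeConfig (the object name and pool selector below are illustrative, not taken from this bug):

$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: crio-debug-loglevel
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug
EOF

Once the change rolls out to the pool, the CRI-O logs can be pulled from the affected node with journalctl -u crio.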

Comment 18 Peter Hunt 2022-06-30 20:30:35 UTC
*** Bug 2099942 has been marked as a duplicate of this bug. ***

Comment 67 errata-xmlrpc 2023-01-17 19:49:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

Comment 68 Red Hat Bugzilla 2023-09-18 04:38:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

