Bug 2092940 - Occasionally pods stuck in ContainerStatusUnknown prevent installation from completing
Summary: Occasionally pods stuck in ContainerStatusUnknown prevent installation from completing
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Abu Kashem
QA Contact: Rahul Gangwar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-02 14:53 UTC by Omer Tuchfeld
Modified: 2022-08-18 14:26 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-18 14:26:03 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Omer Tuchfeld 2022-06-02 14:53:55 UTC
Description of problem:
Occasionally, pods stuck in ContainerStatusUnknown prevent installation from completing. This seems to happen to either openshift-apiserver pods or openshift-oauth-apiserver pods.

Version-Release number of selected component (if applicable):
4.9.36

How reproducible:
Around 0.35% of SNO (single-node OpenShift) installations

Steps to Reproduce:
1. Install thousands of SNO clusters
2. Note that a small number of them fail to complete installation 

Actual results:
Installation cannot finish: a pod stuck in ContainerStatusUnknown still counts against the deployment's pod anti-affinity rules, so the replacement pod cannot be scheduled in its place.
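
For reference, the blocking rule and the stuck pod can be inspected on a live cluster roughly like this (both affected namespaces use a deployment named "apiserver"; treat the exact jsonpath as a sketch):

  # Show the pod anti-affinity rule the scheduler is enforcing
  oc -n openshift-apiserver get deployment apiserver \
    -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}'

  # The replacement pod sits in Pending while the dead pod still matches the rule
  oc -n openshift-apiserver get pods -o wide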

Expected results:
Pods should not be in ContainerStatusUnknown

Additional info:
Attached are must-gathers from clusters demonstrating this issue: one showing it in openshift-apiserver and one in openshift-oauth-apiserver.

Comment 3 Sai Ramesh Vanka 2022-07-21 13:31:56 UTC
Hi Ian,

Here is my analysis.

The issue appears to be associated with the "fix-audit-permissions" init container.
From the node's perspective, the container started successfully, per the CRI-O log below (found in the openshift-oauth-apiserver attachment):

Log Location: host_service_logs/masters/crio_service.log:1079

"Jun 02 00:30:28.080377 sno01929 bash[3231]: time="2022-06-02 00:30:28.080317532Z" level=info msg="Started container" PID=19349 containerID=029e56989ff05ba8f9644068ef6acb2b1e1a6ead089511f499b944a406637fe4 description=openshift-apiserver/apiserver-68545dc6b7-pnz8s/fix-audit-permissions id=789636a8-eda4-40b3-b90c-8c802a01952e name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=5fdbd53c41d59e372270a43d566f0c55e37bdebcd098376a22cac283690aa5d5"

It looks like the container exited with "exitCode: 137", per the initContainerStatuses below:
Log Location: namespaces/openshift-oauth-apiserver/pods/apiserver-74bfdcd8f8-ppdmb/apiserver-74bfdcd8f8-ppdmb.yaml
initContainerStatuses:
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:81924ba3dca440744d3c26cca1d39fa75c112fc290e6827979ee3eade6d1736d
    imageID: ""
    lastState: {}
    name: fix-audit-permissions
    ready: false
    restartCount: 0
    state:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        startedAt: null
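
For reference, exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL rather than exiting on its own. This is easy to confirm in any shell:

  sh -c 'kill -9 $$'; echo $?   # prints 137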

This init container does nothing but fix file permissions, per its command configuration below:
 - command:
      - sh
      - -c
      - chmod 0700 /var/log/openshift-apiserver && touch /var/log/openshift-apiserver/audit.log
        && chmod 0600 /var/log/openshift-apiserver/*

I don't think this command could cause the container to exit with code 137 on its own, so I would suggest checking with the API folks to see whether they have any suggestions on the issue.

Thanks,
Ramesh

Comment 4 Sai Ramesh Vanka 2022-07-21 13:41:21 UTC
(In reply to Sai Ramesh Vanka from comment #3)

Oops... wrong addressee, sorry.
FYI @otuchfel

Comment 5 Omer Tuchfeld 2022-07-21 14:23:43 UTC
Why has this been moved to openshift-apiserver specifically? The same bug happens in openshift-oauth-apiserver as well; the ticket has must-gathers that show the same problem in both components.

Also, why are the container and its logs for the init container missing? That sounds like a kubelet/CRI-O problem.
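
(On the node itself, that would normally be traceable in the runtime and kubelet journals, e.g.:

  journalctl -u crio.service | grep 029e56989ff0
  journalctl -u kubelet.service | grep fix-audit-permissions )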

Comment 6 Michal Fojtik 2022-08-18 14:26:03 UTC
Dear reporter, 

As part of the migration of all OpenShift bugs to Red Hat Jira, we are evaluating all bugs; stale issues and those without high or urgent priority will be closed. If you believe this bug still requires an engineering resolution, please follow the link below [1] and continue working with us in Jira by recreating the issue and providing the necessary information. Please also include a link to the original Bugzilla in the description.

To create an issue, follow this link:

[1] https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&priority=10300&components=12367637

