Bug 2092940

Summary: Occasionally pods stuck in ContainerStatusUnknown prevent installation from completing
Product: OpenShift Container Platform
Reporter: Omer Tuchfeld <otuchfel>
Component: openshift-apiserver
Assignee: Abu Kashem <akashem>
Status: CLOSED WONTFIX
QA Contact: Rahul Gangwar <rgangwar>
Severity: low
Priority: low
Version: 4.9
CC: harpatil, imiller, mfojtik
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-08-18 14:26:03 UTC
Type: Bug

Description Omer Tuchfeld 2022-06-02 14:53:55 UTC
Description of problem:
Occasionally, pods stuck in ContainerStatusUnknown prevent installation from completing. This seems to happen to either openshift-apiserver pods or openshift-oauth-apiserver pods.

Version-Release number of selected component (if applicable):
4.9.36

How reproducible:
Around 0.35% of SNO (Single Node OpenShift) installations

Steps to Reproduce:
1. Install thousands of SNO clusters
2. Note that a small number of them fail to complete installation 

Actual results:
Installation cannot finish: a pod stuck in ContainerStatusUnknown prevents a replacement pod from being scheduled in its place due to anti-affinity rules.

Expected results:
Pods should not be in ContainerStatusUnknown

Additional info:
Attached are must-gathers from clusters demonstrating this issue - one showing the problem in openshift-apiserver and one in openshift-oauth-apiserver.
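
For anyone triaging a cluster in this state, a rough sketch of the checks involved (pod names are placeholders and will vary per cluster):

  # The stuck pod shows up with the ContainerStatusUnknown reason in one of these namespaces
  oc get pods -n openshift-apiserver
  oc get pods -n openshift-oauth-apiserver

  # The replacement pod's events should show the anti-affinity scheduling failure
  oc describe pod <replacement-pod-name> -n openshift-apiserver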

Comment 3 Sai Ramesh Vanka 2022-07-21 13:31:56 UTC
Hi Ian,

Here is my analysis.

The issue appears to be associated with the "fix-audit-permissions" container, which is an initContainer.
From the node's perspective, the container started successfully, per the crio log below (found in the openshift-oauth-apiserver attachment):

Log Location: host_service_logs/masters/crio_service.log:1079

"Jun 02 00:30:28.080377 sno01929 bash[3231]: time="2022-06-02 00:30:28.080317532Z" level=info msg="Started container" PID=19349 containerID=029e56989ff05ba8f9644068ef6acb2b1e1a6ead089511f499b944a406637fe4 description=openshift-apiserver/apiserver-68545dc6b7-pnz8s/fix-audit-permissions id=789636a8-eda4-40b3-b90c-8c802a01952e name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=5fdbd53c41d59e372270a43d566f0c55e37bdebcd098376a22cac283690aa5d5"

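To locate this entry in a must-gather yourself, a grep along these lines should work (assuming the attachment layout noted above):

  grep "Started container" host_service_logs/masters/crio_service.log | grep fix-audit-permissions
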
However, the container appears to have exited with exitCode 137, as seen in the initContainerStatuses below.
Log Location: namespaces/openshift-oauth-apiserver/pods/apiserver-74bfdcd8f8-ppdmb/apiserver-74bfdcd8f8-ppdmb.yaml
initContainerStatuses:
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:81924ba3dca440744d3c26cca1d39fa75c112fc290e6827979ee3eade6d1736d
    imageID: ""
    lastState: {}
    name: fix-audit-permissions
    ready: false
    restartCount: 0
    state:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        startedAt: null
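
Pods in this state can be listed across all namespaces with a jsonpath query along these lines (a sketch; the field paths follow the Pod API as in the status above, the output formatting is illustrative):

  oc get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.initContainerStatuses[*].state.terminated.reason}{"\n"}{end}' | grep ContainerStatusUnknown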

This initContainer does nothing but fix permissions, per the command configuration below:
 - command:
      - sh
      - -c
      - chmod 0700 /var/log/openshift-apiserver && touch /var/log/openshift-apiserver/audit.log
        && chmod 0600 /var/log/openshift-apiserver/*
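
Run in isolation, the same sequence exits 0, which supports the point below (a quick local check, substituting a scratch directory for /var/log/openshift-apiserver):

  # Same commands as the initContainer, pointed at a throwaway directory
  mkdir -p /tmp/audit-test
  sh -c 'chmod 0700 /tmp/audit-test && touch /tmp/audit-test/audit.log && chmod 0600 /tmp/audit-test/*'
  echo $?   # prints 0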

I don't think this command would itself return exit code 137 - that code typically means the process was killed by SIGKILL (137 = 128 + 9), e.g. by the runtime during pod termination, rather than that the command failed on its own. I would suggest checking with the API folks to see whether they have any suggestions on the issue.

Thanks,
Ramesh

Comment 4 Sai Ramesh Vanka 2022-07-21 13:41:21 UTC
(In reply to Sai Ramesh Vanka from comment #3)

Oops... wrong addressing, sorry.
FYI @otuchfel

Comment 5 Omer Tuchfeld 2022-07-21 14:23:43 UTC
Why has this been moved to openshift-apiserver specifically? The same bug happens in openshift-oauth-apiserver as well; the ticket has must-gathers that show the same problem in both.

Also, why are the init container itself and its logs missing? That sounds like a kubelet/cri-o problem.
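
If someone with node access wants to dig further, grepping the kubelet and crio journals for the container ID from the crio log in comment #3 might show when and why the container record disappeared (a sketch for a live node; in a must-gather, grep the collected service logs instead):

  journalctl -u kubelet | grep 029e56989ff05ba8f9644068ef6acb2b1e1a6ead089511f499b944a406637fe4
  journalctl -u crio | grep 029e56989ff05ba8f9644068ef6acb2b1e1a6ead089511f499b944a406637fe4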

Comment 6 Michal Fojtik 2022-08-18 14:26:03 UTC
Dear reporter, 

As part of the migration of all OpenShift bugs to Red Hat Jira, we are evaluating all bugs; stale issues and those without high or urgent priority will be closed. If you believe this bug still requires engineering resolution, we kindly ask you to follow this link [1] and continue working with us in Jira by recreating the issue and providing the necessary information. Please also include a link to the original Bugzilla in the description.

To create an issue, follow this link:

[1] https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&priority=10300&components=12367637