Bug 2068047 - "Debug container" in Web Console fails with "pods ... not found" or "The debug pod failed"
Keywords:
Status: CLOSED DUPLICATE of bug 2064744
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.z
Assignee: Zac Herman
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-03-24 10:49 UTC by Simon Krenger
Modified: 2022-03-24 22:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-24 22:57:10 UTC
Target Upstream Version:
Embargoed:



Description Simon Krenger 2022-03-24 10:49:02 UTC
Description of problem:

In OpenShift Container Platform 4.10, there is a new feature to debug a Pod in the Web Console. On a Pod that is in "CrashLoopBackOff", customers can click "Debug container <NAME>" to get a debug Pod.

However, on multiple clusters this fails with either 'pods "fedora-6b5c67b55c-gdqlk-debug-cpq2b" not found' or "The debug pod failed.".

Checking the Pods in the namespace shows that the debug Pod is started but terminates immediately:

~~~
$ oc get pods -w
NAME                      READY   STATUS             RESTARTS      AGE
fedora-6b5c67b55c-gdqlk   0/1     CrashLoopBackOff   4 (14s ago)   94s
fedora-6b5c67b55c-gdqlk-debug-6rnt2   0/1     Pending            0             0s
fedora-6b5c67b55c-gdqlk-debug-6rnt2   0/1     Pending            0             0s
fedora-6b5c67b55c-gdqlk-debug-6rnt2   0/1     Terminating        0             0s
~~~

Events show the same behaviour:

~~~
$ oc get events -w
LAST SEEN   TYPE     REASON      OBJECT                                    MESSAGE
0s          Normal   Scheduled   pod/fedora-6b5c67b55c-gdqlk-debug-bgsfz   Successfully assigned fedora/fedora-6b5c67b55c-gdqlk-debug-bgsfz to ip-10-0-133-207.eu-central-1.compute.internal
0s          Normal   SuccessfulDelete   replicaset/fedora-6b5c67b55c              Deleted pod: fedora-6b5c67b55c-gdqlk-debug-bgsfz
0s          Normal   AddedInterface     pod/fedora-6b5c67b55c-gdqlk-debug-bgsfz   Add eth0 [10.129.2.16/23] from openshift-sdn
0s          Normal   Pulled             pod/fedora-6b5c67b55c-gdqlk-debug-bgsfz   Container image "registry.fedoraproject.org/fedora:35" already present on machine
0s          Normal   Created            pod/fedora-6b5c67b55c-gdqlk-debug-bgsfz   Created container fedora
0s          Normal   Started            pod/fedora-6b5c67b55c-gdqlk-debug-bgsfz   Started container fedora
0s          Normal   Killing            pod/fedora-6b5c67b55c-gdqlk-debug-bgsfz   Stopping container fedora
~~~
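
The `SuccessfulDelete` event above shows the ReplicaSet, not a user, deleting the debug Pod. One possible explanation, which this report does not confirm, is that the debug Pod carries the same labels as the Pod it was copied from. A ReplicaSet counts every Pod that matches its selector, so a second matching Pod would look like one replica too many. The following is a minimal Python sketch of that selector logic. The function names are hypothetical, and the labels on the debug Pod are an assumption:

```python
from typing import Dict


def selector_matches(match_labels: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Equality-based matchLabels check: every selector pair must be on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


def surplus_pods(replicas: int, match_labels: Dict[str, str],
                 pods: Dict[str, Dict[str, str]]) -> int:
    """Number of matching pods beyond the desired replica count."""
    matching = [name for name, labels in pods.items()
                if selector_matches(match_labels, labels)]
    return max(0, len(matching) - replicas)


# Selector taken from the Deployment in the steps to reproduce.
selector = {"app": "fedora"}

pods = {
    "fedora-6b5c67b55c-gdqlk": {"app": "fedora"},
    # ASSUMPTION: the debug pod inherits the labels of the original pod.
    "fedora-6b5c67b55c-gdqlk-debug-bgsfz": {"app": "fedora"},
}

# With replicas: 1 and two matching pods, one pod is surplus and gets deleted.
print(surplus_pods(1, selector, pods))
```

If this is what happens, it would also be consistent with a debug Pod surviving when it carries no labels that match the selector. The sketch does not model which of the matching Pods the controller chooses to delete.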

The kube-apiserver logs show the following errors, which may or may not be related:

~~~
I0324 10:43:42.349455      16 node_authorizer.go:203] "NODE DENY" err="node 'ip-10-0-133-207.eu-central-1.compute.internal' cannot get configmap fedora/kube-root-ca.crt, no relationship to this object was found in the node authorizer graph"
I0324 10:43:42.349612      16 node_authorizer.go:203] "NODE DENY" err="node 'ip-10-0-133-207.eu-central-1.compute.internal' cannot get configmap fedora/openshift-service-ca.crt, no relationship to this object was found in the node authorizer graph"
I0324 10:43:42.349707      16 node_authorizer.go:203] "NODE DENY" err="node 'ip-10-0-133-207.eu-central-1.compute.internal' cannot get secret fedora/default-dockercfg-fcs76, no relationship to this object was found in the node authorizer graph"
E0324 10:44:18.969927      16 apiaccess_count_controller.go:161] invalid resource name ".": [may not be '.']
~~~

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.10.3
Server Version: 4.10.6
Kubernetes Version: v1.23.5+b0357ed

How reproducible:

Always

Steps to Reproduce:
1. Create a Deployment that will CrashLoopBackOff. For example, use the following definition:

~~~
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fedora
  labels:
    app: fedora
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fedora
  template:
    metadata:
      labels:
        app: fedora
    spec:
      containers:
      - image: registry.fedoraproject.org/fedora:35
        name: fedora
        command: ['cat','/this-file-does-not-exist']
~~~

2. In the Web Console, navigate to "Workloads" -> "Pods" and then locate the Pod that is in "CrashLoopBackOff"
3. Click on the "CrashLoopBackOff" status in the list and click on "Debug container fedora"

Actual results:

A new page opens and, after some time, fails with either 'pods "fedora-6b5c67b55c-gdqlk-debug-kzsw8" not found' or "The debug pod failed."

Expected results:

Debug Terminal is shown

Additional info:

- Reproduced it with OpenShift Container Platform 4.10.6
- May or may not be related to Bug 2065672

Comment 1 Zac Herman 2022-03-24 15:35:40 UTC
This is the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=2064744. I will leave this bug open because it refers to 4.10 and is customer specific. As for the issue itself: for some reason, pods inside certain deployments are terminated immediately, but if I create the pod directly, the debug feature works. Still investigating.

Comment 2 Zac Herman 2022-03-24 22:57:10 UTC
A fix has been created in PR https://github.com/openshift/console/pull/11229. Closing this as a duplicate; we will update the 4.10 branch after the PR merges.

*** This bug has been marked as a duplicate of bug 2064744 ***

