Bug 1795277

Summary: oc debug node/<node-name> tries to pull non-s390x rhel7/support-tools image
Product: OpenShift Container Platform
Reporter: Alexander Klein <alklein>
Component: Multi-Arch
Assignee: David Benoit <dbenoit>
Status: CLOSED CURRENTRELEASE
QA Contact: Barry Donahue <bdonahue>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.2.z
CC: crawford, dbenoit, dorzel, epasch, Holger.Wolf, jpoulin, wvoesch
Target Milestone: ---
Target Release: ---
Hardware: s390x
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-13 15:36:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1777031

Description Alexander Klein 2020-01-27 15:51:21 UTC
Description of problem:
oc debug node/<node-name> does not start a remote shell. It hangs for a while before reporting an error. Looking at the pod it created, you can see that it tries to pull an image that is not available for s390x.

Failed to pull image "registry.redhat.io/rhel7/support-tools": rpc error: code = Unknown desc = no image found in manifest list for architecture s390x, OS linux
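
One way to confirm this kind of mismatch is to look at the architectures listed in the image's manifest list. The following is a minimal sketch, assuming the raw manifest list JSON is supplied on standard input (for example from `skopeo inspect --raw`, a tool not mentioned in this report). `manifest_has_arch` is a hypothetical helper name, and the check is a plain pattern match rather than a real JSON parser.

```shell
# Hypothetical helper: succeeds if the manifest list JSON read from stdin
# contains an entry for the given architecture (for example s390x).
# This is a plain pattern match, not a full JSON parser.
manifest_has_arch() {
    arch="$1"
    grep -Eq "\"architecture\"[[:space:]]*:[[:space:]]*\"${arch}\""
}
```

Possible usage, assuming skopeo is installed and any required registry login has been done: skopeo inspect --raw docker://registry.redhat.io/rhel7/support-tools | manifest_has_arch s390x && echo "s390x image available"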

Version-Release number of selected component (if applicable):
# oc version
Client Version: openshift-clients-4.2.2-201910250432-12-g72076900
Server Version: 4.2.12-s390x
Kubernetes Version: v1.14.6+32dc4a0

Comment 1 David Benoit 2020-01-28 15:20:22 UTC
We will need to mention in the documentation that s390x users need to specify the RHEL 8 support-tools image.

oc debug node/<node-name> --image=registry.redhat.io/rhel8/support-tools
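
The workaround could be wrapped so that the image override is applied only to s390x nodes. This is a rough sketch, not an official tool: `debug_node_command` is a hypothetical helper that only prints the command line to run, and it assumes the caller already knows the node's architecture.

```shell
# Hypothetical helper: prints the oc debug command line for a node.
# The RHEL 8 support-tools override from this comment is added only for
# s390x nodes; other architectures keep the default debug image.
debug_node_command() {
    node="$1"
    arch="$2"
    if [ "$arch" = "s390x" ]; then
        echo "oc debug node/${node} --image=registry.redhat.io/rhel8/support-tools"
    else
        echo "oc debug node/${node}"
    fi
}
```

The architecture can be read from the node object, for example with oc get node <node-name> -o jsonpath='{.status.nodeInfo.architecture}'.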

Comment 3 Alexander Klein 2020-01-28 16:48:16 UTC
This works for me.

Comment 4 Alex Crawford 2020-01-31 00:02:11 UTC
Also reported in https://bugzilla.redhat.com/show_bug.cgi?id=1777030. Let's use this one though, since it has more information.

Comment 5 Alex Crawford 2020-01-31 00:02:38 UTC
*** Bug 1777030 has been marked as a duplicate of this bug. ***

Comment 6 Alex Crawford 2020-01-31 00:03:55 UTC
*** Bug 1777031 has been marked as a duplicate of this bug. ***

Comment 7 Alex Crawford 2020-01-31 00:04:10 UTC
*** Bug 1778807 has been marked as a duplicate of this bug. ***

Comment 8 Alex Crawford 2020-02-04 19:30:24 UTC
https://access.redhat.com/containers/#/registry.access.redhat.com/rhel7/support-tools now reports that it supports AMD64, ppc64le, and s390x.

Comment 10 wvoesch 2020-02-10 09:49:34 UTC
I tested oc debug node/<node-name> on four different clusters. It worked on three out of the four clusters. 

I noticed that the clusters on which the command worked run version 4.2.16, and the one on which it didn't work runs version 4.2.12. I think the cluster version should not influence whether the command works, since the registry is external. Do you have any idea why the command works on only three of the four clusters?

Comment 11 Eberhard Pasch 2020-02-11 10:42:44 UTC
(In reply to wvoesch from comment #10)
> I tested oc debug node/<node-name> on four different clusters. It worked on
> three out of the four clusters. 
> 

Not acceptable - please reopen the bug!

Comment 12 wvoesch 2020-02-11 12:30:42 UTC
I tested again, and the command now generally works on all four clusters.

However, this time I once got a different error:


# oc debug node/<node-name>
Starting pod/<node-name-debug> ...
To use host binaries, run `chroot /host`

Removing debug pod ...
error: unable to create the debug pod "<node-name-debug> "


I tried to start the debug container on another node and it worked, so I retried on the node where it had failed the first time. The second time it worked:

# oc debug node/<node-name>
Starting pod/<node-name-debug>  ...
To use host binaries, run `chroot /host`
Pod IP: <pod-ip>
If you don't see a command prompt, try pressing enter.
sh-4.2#

Comment 13 Eberhard Pasch 2020-02-12 12:53:17 UTC
So it still doesn't work reliably. Can you reopen this bug? 

For the future: can you automate this test case? Try all the clusters that you have, and for each cluster try 100 times....

We probably should keep this in the regression test bucket for a while....
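
A repeated-attempt check along these lines might look like the sketch below. It is an illustration under assumptions, not an existing test: `count_debug_failures` and the `OC_BIN` override are hypothetical names, and it assumes the debug image provides a `true` command so that each session exits immediately.

```shell
# Hypothetical harness: runs a non-interactive debug session against one
# node a given number of times and prints how many attempts failed.
# OC_BIN is an assumed override hook so the client binary can be swapped out.
count_debug_failures() {
    node="$1"
    attempts="$2"
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # "-- true" asks the debug pod to run a trivial command and exit.
        if ! "${OC_BIN:-oc}" debug "node/${node}" -- true </dev/null >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

For the scenario described here, this would be called as count_debug_failures <node-name> 100 once per cluster, after logging in to each cluster in turn.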

Comment 14 Jeremy Poulin 2020-02-12 14:01:35 UTC
So, the underlying Multi-Arch related issue here was that the container being pulled did not support non-x86_64 architectures. This has been fixed and verified.

If you're now hitting an intermittent issue with oc debug node, that is a new issue altogether, since it is not related to the change of the underlying issue that this bug was tracking. If you can reproduce the intermittent issue on your clusters, I suggest you file a new bug with instructions on how to reproduce, so that we can narrow in on the new problem.

Comment 15 Holger Wolf 2020-02-13 15:36:11 UTC
As discussed in the bugzapper call, we are closing this issue because it was resolved, and we will open a new one as soon as we have re-tested whether the reliability needs to be improved.