Bug 1840643 - [Must-gather] Logging improvement - don't fail on TO when there is another issue with log collection
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Yuval Turgeman
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-27 11:30 UTC by Nelly Credi
Modified: 2020-10-27 16:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:01:07 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift oc pull 577 0 None closed Bug 1840643: handle pull errors in must-gather 2021-01-27 16:57:19 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:01:21 UTC

Description Nelly Credi 2020-05-27 11:30:20 UTC
Description of problem:
When running must-gather with the CNV image and a wrong image path (for example, due to a typo), the log doesn't indicate the actual problem;
instead you see a timeout failure.


Version-Release number of selected component (if applicable):
2.3.0-45

How reproducible:
100%

Steps to Reproduce:
1. Run must-gather with an image that will cause an InspectFailed error
2. Let it run until it reports a failure


Actual results:

executed this on a disconnected env:
oc adm must-gather --image=registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45
[must-gather      ] OUT Using must-gather plugin-in image: registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45
[must-gather      ] OUT namespace/openshift-must-gather-xhgkc created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-zqlgv created
[must-gather      ] OUT pod for plug-in image registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45 created
[must-gather-5hcxc] OUT gather did not start: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-zqlgv deleted
[must-gather      ] OUT namespace/openshift-must-gather-xhgkc deleted
error: gather did not start for pod must-gather-5hcxc: timed out waiting for the condition


Expected results:
Instead of a timeout (TO),
I would like to get an error message indicating that must-gather wasn't even able to start.
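
One way to get the behaviour requested above is to remember the last waiting state seen while polling and attach it to the timeout error. The sketch below is illustrative Python, not the `oc` implementation; `wait_for_start`, `get_waiting_state`, and the `(reason, message)` tuple shape are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical shape: (reason, message) while the container is waiting,
# or None once it has started.
WaitingState = Optional[Tuple[str, str]]


def wait_for_start(
    get_waiting_state: Callable[[], WaitingState],
    timeout_s: float,
    interval_s: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the container starts; on timeout, report the last waiting state."""
    deadline = clock() + timeout_s
    last: WaitingState = None
    while True:
        state = get_waiting_state()
        if state is None:
            return  # container started
        last = state
        if clock() >= deadline:
            break
        sleep(interval_s)
    reason, message = last
    raise TimeoutError(
        "gather did not start: timed out waiting for the condition "
        f"(last state: {reason}: {message})"
    )
```

With this shape, a pod stuck in `InvalidImageName` would still time out, but the error would name the reason instead of hiding it.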


Additional info:
While must-gather is running, if you go to the must-gather namespace it created and get or describe the pods, you can see the error,
but once the must-gather command fails, that namespace is removed, so we lose the ability to understand the issue.

If you run describe, it is very easy to see where the problem is. This data should be reflected in the must-gather command output.
Example:

Events:
  Type     Reason         Age               From                         Message
  ----     ------         ----              ----                         -------
  Normal   Scheduled      <unknown>         default-scheduler            Successfully assigned openshift-must-gather-c9kdk/must-gather-dq9gq to openshift-worker-1
  Warning  InspectFailed  7s (x9 over 82s)  kubelet, openshift-worker-1  Failed to apply default image tag "registry=registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45": couldn't parse image reference "registry=registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45": invalid reference format
  Warning  Failed         7s (x9 over 82s)  kubelet, openshift-worker-1  Error: InvalidImageName
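
The event table above already contains what a user needs to diagnose the failure. As a hedged sketch, such events could be condensed into one line per distinct warning before the namespace is deleted. The dict keys (`type`, `reason`, `message`) mirror the table columns, the sample messages are abbreviated from the table, and `summarize_warnings` is a hypothetical helper, not part of `oc`.

```python
from typing import Dict, List


def summarize_warnings(events: List[Dict[str, str]]) -> List[str]:
    """Return 'Reason: Message' for each distinct Warning event, in first-seen order."""
    seen = set()
    lines = []
    for event in events:
        if event.get("type") != "Warning":
            continue  # Normal events (e.g. Scheduled) are not failures
        line = f"{event.get('reason', '')}: {event.get('message', '')}"
        if line not in seen:
            seen.add(line)
            lines.append(line)
    return lines


# Sample input modeled on the events shown above (messages shortened).
events = [
    {"type": "Normal", "reason": "Scheduled", "message": "Successfully assigned pod to node"},
    {"type": "Warning", "reason": "InspectFailed", "message": "invalid reference format"},
    {"type": "Warning", "reason": "Failed", "message": "Error: InvalidImageName"},
    {"type": "Warning", "reason": "Failed", "message": "Error: InvalidImageName"},
]
for line in summarize_warnings(events):
    print(line)
```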


I do want to mention that in some cases the error is reported, so this is probably a corner case which wasn't handled.
Examples of good outputs I do see:
1.
error: gather did not start for pod must-gather-97nsp: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = error pinging docker registry registry.bla.com:5000: Get https://registry.bla.com:5000/v2/: dial tcp: lookup registry.bla.com on 10.46.29.133:53: no such host

2. 
error: gather did not start for pod must-gather-m6xpg: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = Error reading manifest latest in registry.ocp-edge.lab.eng.tlv2.redhat.com:5000/container-native-virtualization/cnv-must-gather-rhel9: manifest unknown: manifest unknown

3. 
error: gather did not start for pod must-gather-6qhns: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication

Comment 6 RamaKasturi 2020-09-28 07:20:43 UTC
Verified with the payload below. I no longer see "timed out waiting for the condition"; instead I see an ImagePullBackOff error, as shown below and as described in the patch: "Treat ImagePullBackOff and InvalidImageName the same as ErrImagePull".

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-27-075304]$ ./oc version
Client Version: 4.6.0-0.nightly-2020-09-27-075304
Server Version: 4.6.0-0.nightly-2020-09-27-075304
Kubernetes Version: v1.19.0+e465e66


[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-27-075304]$ ./oc adm must-gather --image=registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45
[must-gather      ] OUT Using must-gather plugin-in image: registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45
[must-gather      ] OUT namespace/openshift-must-gather-5c5sb created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-qjz7b created
[must-gather      ] OUT pod for plug-in image registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45 created
[must-gather-7cct8] OUT gather did not start: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45"
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-qjz7b deleted
[must-gather      ] OUT namespace/openshift-must-gather-5c5sb deleted
error: gather did not start for pod must-gather-7cct8: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.io/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45"

@yuval, can you please let me know if any other validation is required here beyond the above? Thanks!
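
The patch summary quoted in this comment, treating `ImagePullBackOff` and `InvalidImageName` the same as `ErrImagePull`, amounts to a membership check on the container's waiting reason. A minimal sketch in Python follows; the real change is in the `oc` Go code and may differ, and `PULL_FAILURE_REASONS` and `start_error` are hypothetical names.

```python
from typing import Optional

# Waiting reasons treated as image pull failures, per the patch summary above.
PULL_FAILURE_REASONS = frozenset({"ErrImagePull", "ImagePullBackOff", "InvalidImageName"})


def start_error(reason: str, message: str) -> Optional[str]:
    """Return an error string for a pull failure, or None to keep waiting."""
    if reason in PULL_FAILURE_REASONS:
        return f"unable to pull image: {reason}: {message}"
    return None
```

Any other waiting reason (for example `ContainerCreating`) returns `None`, so the caller keeps polling until its normal timeout.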

Comment 7 Yuval Turgeman 2020-09-29 10:09:47 UTC
Nice, you could verify with an InvalidImageName (capital letters for the image registry should do the trick).  Other than this, I think you covered it, unless @Nelly wants to test anything else here.
Thanks !

Comment 8 RamaKasturi 2020-09-29 12:39:57 UTC
Thanks Yuval, tried again as you suggested and I see that it returns an ImagePullBackOff error.

[ramakasturinarra@dhcp35-60 ~]$ oc adm must-gather --image=registry.redhat.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45
[must-gather      ] OUT Using must-gather plugin-in image: registry.redhat.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45
[must-gather      ] OUT namespace/openshift-must-gather-hcd66 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-98rln created
[must-gather      ] OUT pod for plug-in image registry.redhat.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45 created
[must-gather-2lcqv] OUT gather did not start: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45"
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-98rln deleted
[must-gather      ] OUT namespace/openshift-must-gather-hcd66 deleted
error: gather did not start for pod must-gather-2lcqv: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45"
[ramakasturinarra@dhcp35-60 ~]$ oc adm must-gather --image=REGISTRY.REDHAT.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45
[must-gather      ] OUT Using must-gather plugin-in image: REGISTRY.REDHAT.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45
[must-gather      ] OUT namespace/openshift-must-gather-l4bnv created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-pndm7 created
[must-gather      ] OUT pod for plug-in image REGISTRY.REDHAT.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45 created
[must-gather-fcnlw] OUT gather did not start: unable to pull image: ImagePullBackOff: Back-off pulling image "REGISTRY.REDHAT.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45"
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-pndm7 deleted
[must-gather      ] OUT namespace/openshift-must-gather-l4bnv deleted
error: gather did not start for pod must-gather-fcnlw: unable to pull image: ImagePullBackOff: Back-off pulling image "REGISTRY.REDHAT.IO/container-native-virtualization-cnv/must-gather-rhel8:v2.3.0-45"

Will wait for Nelly before moving the bug to verified state.

Comment 9 Yuval Turgeman 2020-09-29 13:05:31 UTC
I tried with something like ./oc adm must-gather --image=AAAAA

Comment 10 RamaKasturi 2020-09-29 13:14:17 UTC
Trying the above gives me the below error.

[ramakasturinarra@dhcp35-60 ~]$ oc adm must-gather --image=AAAAA
[must-gather      ] OUT Using must-gather plugin-in image: AAAAA
[must-gather      ] OUT namespace/openshift-must-gather-2qc9w created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-vkzmv created
[must-gather      ] OUT unable to parse image reference AAAAA: repository name must be lowercase
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-vkzmv deleted
[must-gather      ] OUT namespace/openshift-must-gather-2qc9w deleted
error: repository name must be lowercase
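
The difference between this result and Comment 8 is consistent with how Docker-style image references are commonly split: a first path component containing `.` or `:` (or equal to `localhost`) is taken as the registry host, and only the remaining repository path has to be lowercase. The sketch below is a simplified illustration under those assumptions, not the parser `oc` uses; `check_repository_case` is a hypothetical name, and tags and digests are handled only crudely.

```python
def check_repository_case(image: str) -> None:
    """Raise ValueError if the repository part of an image reference has uppercase letters."""
    first, sep, rest = image.partition("/")
    # Assumption: the first component is a registry host only if it looks like one.
    if sep and ("." in first or ":" in first or first == "localhost"):
        remainder = rest
    else:
        remainder = image
    # Drop a digest, then a tag, so only the repository path is checked.
    repository = remainder.split("@", 1)[0]
    last_slash = repository.rfind("/")
    colon = repository.rfind(":")
    if colon > last_slash:
        repository = repository[:colon]
    if repository != repository.lower():
        raise ValueError("repository name must be lowercase")
```

Under this rule `REGISTRY.REDHAT.IO/...` passes the client-side check because the uppercase letters are in the host part, while `AAAAA` has no host part and is rejected as a repository name.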

Comment 11 RamaKasturi 2020-09-30 06:06:41 UTC
Based on the above comments, moving the bug to verified state.

@Nelly please feel free to reopen if you find something obvious is missing, in case you get a chance to test the same. Thanks!

Comment 14 errata-xmlrpc 2020-10-27 16:01:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

