Bug 1827694 - oc adm must-gather tests failing on s390x & ppc64le
Summary: oc adm must-gather tests failing on s390x & ppc64le
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.3.z
Hardware: s390x
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Prashanth Sundararaman
QA Contact: Jeremy Poulin
URL:
Whiteboard: multi-arch LifecycleStale
Depends On:
Blocks:
 
Reported: 2020-04-24 14:44 UTC by Jeremy Poulin
Modified: 2021-05-14 14:08 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-26 20:33:03 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Jeremy Poulin 2020-04-24 14:44:06 UTC
Description of problem:
The oc adm must-gather tests in the openshift/conformance/parallel suite fail on 4.3.z s390x and ppc64le clusters: the must-gather pod never completes because the fallback plug-in image cannot run on those architectures.

Version-Release number of selected component (if applicable):
4.3.z on ppc64le & s390x.
This is a relatively recent issue, first seen in CI builds from mid-April onward.


How reproducible:
Run the following oc adm must-gather tests from the openshift/conformance/parallel suite on 4.3.z deployments on s390x or ppc64le:
[cli] oc adm must-gather runs successfully [Suite:openshift/conformance/parallel]
[cli] oc adm must-gather runs successfully with options [Suite:openshift/conformance/parallel]
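
For reference, the failure can also be reproduced outside the test suite by running the same command the test wraps. A minimal sketch, with the namespace and --dest-dir values as illustrative placeholders (the real test generates them randomly; see the failure output below):

oc --namespace=<test-namespace> --config=/tmp/admin.kubeconfig adm must-gather --dest-dir /tmp/test.oc-adm-must-gather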



Actual results:
fail [github.com/openshift/origin/test/extended/cli/mustgather.go:65]: Expected success, but got an error:
<*util.ExitError | 0xc002a4ddd0>: {
Cmd: "oc --namespace=e2e-test-oc-adm-must-gather-shl77 --config=/tmp/admin.kubeconfig adm must-gather --dest-dir /tmp/test.oc-adm-must-gather.152582883",
StdErr: "[must-gather ] OUT unable to resolve the imagestream tag openshift/must-gather:latest\n[must-gather ] OUT \n[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest\n[must-gather ] OUT namespace/openshift-must-gather-c4zt7 created\n[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-7blm6 created\n[must-gather ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created\n[must-gather-zd4zz] POD standard_init_linux.go:211: exec user process caused \"exec format error\"\n[must-gather-zd4zz] OUT waiting for gather to complete\n[must-gather-zd4zz] OUT gather never finished: pod is not running: Failed\n[must-gather-zd4zz] OUT \n[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-7blm6 deleted\n[must-gather ] OUT namespace/openshift-must-gather-c4zt7 deleted\nerror: gather never finished for pod must-gather-zd4zz: pod is not running: Failed",
ExitError: {
ProcessState: {
pid: 6697,
status: 256,
rusage: {
Utime: {Sec: 0, Usec: 217463},
Stime: {Sec: 0, Usec: 59410},
Maxrss: 86452,
Ixrss: 0,
Idrss: 0,
Isrss: 0,
Minflt: 11932,
Majflt: 0,
Nswap: 0,
Inblock: 0,
Oublock: 0,
Msgsnd: 0,
Msgrcv: 0,
Nsignals: 0,
Nvcsw: 1109,
Nivcsw: 2,
},
},
Stderr: nil,
},
}
exit status 1


Expected results:
Tests pass

Comment 1 Stephen Cuppett 2020-04-24 15:18:25 UTC
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 2 Prashanth Sundararaman 2020-04-24 15:27:39 UTC
Some additional info from my debugging:

When I execute the oc adm must-gather command:

[must-gather      ] OUT unable to resolve the imagestream tag openshift/must-gather:latest
[must-gather      ] OUT 
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-vnxjz created

It falls back to using the default x86 image, which fails on ppc64le/s390x.
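
To confirm the architecture mismatch, the fallback image's manifest can be inspected, and an image built for the node architecture can be passed explicitly as a possible workaround. This is a sketch only; the --image value is a placeholder (not a known-good image), and it assumes oc image info is available in this oc version:

# Inspect the OS/arch the fallback image was built for:
./oc image info quay.io/openshift/origin-must-gather:latest
# Possible workaround: run must-gather with an architecture-specific image (placeholder value):
./oc adm must-gather --image=<must-gather image built for ppc64le/s390x>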

Also, when I describe the imagestream:

[psundara@rock-kvmlp-1 ~]$ ./oc -n openshift describe is/must-gather
Name:			must-gather
Namespace:		openshift
Created:		About an hour ago
Labels:			<none>
Annotations:		<none>
Image Repository:	image-registry.openshift-image-registry.svc:5000/openshift/must-gather
Image Lookup:		local=false
Unique Images:		0
Tags:			1
latest
  updates automatically from registry quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:93b354ba71ecc71ee4758c84a7d1d114f2c81d1776e546d387fe9492547c9c0d
  ! error: Import failed (Unauthorized): you may not have access to the container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:93b354ba71ecc71ee4758c84a7d1d114f2c81d1776e546d387fe9492547c9c0d"
      About an hour ago


This suggests that the imagestream does not have the pull secret needed to access the image. Listing the secrets in the openshift namespace, I do not see samples-registry-credentials, which is a copy of the pull secret from the openshift-config namespace and is required for the imagestreams to be imported:

[psundara@rock-kvmlp-1 ~]$ ./oc get secrets -n openshift
NAME                       TYPE                                  DATA   AGE
builder-dockercfg-p7zgm    kubernetes.io/dockercfg               1      19h
builder-token-n7g8r        kubernetes.io/service-account-token   4      19h
builder-token-vptj7        kubernetes.io/service-account-token   4      19h
default-dockercfg-5zzws    kubernetes.io/dockercfg               1      19h
default-token-htjhd        kubernetes.io/service-account-token   4      19h
default-token-qgw5b        kubernetes.io/service-account-token   4      19h
deployer-dockercfg-l4vqr   kubernetes.io/dockercfg               1      19h
deployer-token-q6ndv       kubernetes.io/service-account-token   4      19h
deployer-token-tj756       kubernetes.io/service-account-token   4      19h
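
To confirm that the source secret the operator is supposed to copy is actually present, the openshift-config namespace can be checked directly (pull-secret is the name the operator logs further down refer to):

./oc get secret pull-secret -n openshift-config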

Also, this issue is multi-arch specific. On x86 I do see must-gather work correctly, and the pull secret is present in the openshift namespace:

[root@dell-per730-02 ~]# ./oc get secrets -n openshift
NAME                           TYPE                                  DATA   AGE
builder-dockercfg-f45rm        kubernetes.io/dockercfg               1      43m
builder-token-djbml            kubernetes.io/service-account-token   4      43m
builder-token-gxzjx            kubernetes.io/service-account-token   4      43m
default-dockercfg-bdhpd        kubernetes.io/dockercfg               1      43m
default-token-2x2nc            kubernetes.io/service-account-token   4      43m
default-token-jkfxd            kubernetes.io/service-account-token   4      43m
deployer-dockercfg-hnrgz       kubernetes.io/dockercfg               1      43m
deployer-token-9sqcz           kubernetes.io/service-account-token   4      43m
deployer-token-pr5cz           kubernetes.io/service-account-token   4      43m
samples-registry-credentials   kubernetes.io/dockerconfigjson        1      41m


Furthermore, in the cluster-samples-operator logs I see that the pull secret is copied on x86:

time="2020-04-24T14:24:42Z" level=info msg="test connection to registry.redhat.io successful"
time="2020-04-24T14:24:42Z" level=info msg="creating default Config"
time="2020-04-24T14:24:45Z" level=info msg="waiting for informer caches to sync"
time="2020-04-24T14:24:45Z" level=info msg="started events processor"
time="2020-04-24T14:24:45Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2020-04-24T14:24:45Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"

But on s390x/ppc64le, I do not see it being copied:

time="2020-04-24T15:20:59Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc0004c55c0)}"
time="2020-04-24T15:20:59Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc0004c5680)}"
time="2020-04-24T15:20:59Z" level=info msg="creating default Config"
time="2020-04-24T15:21:02Z" level=info msg="waiting for informer caches to sync"
time="2020-04-24T15:21:02Z" level=info msg="started events processor"
time="2020-04-24T15:21:02Z" level=info msg="processing secret watch event while in Removed state; deletion event: false"
time="2020-04-24T15:21:02Z" level=info msg="Attempting stage 1 Removed management state: RemovePending == true"
time="2020-04-24T15:21:02Z" level=info msg="CRDUPDATE process mgmt update spec Removed status "

Comment 3 Prashanth Sundararaman 2020-04-27 15:07:24 UTC
I added a debug print where copyDefaultClusterPullSecret is called, and I see this:

time="2020-04-26T17:14:48Z" level=info msg="Go Version: go1.12.12"
time="2020-04-26T17:14:48Z" level=info msg="Go OS/Arch: linux/ppc64le"
time="2020-04-26T17:14:48Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc0002c0780)}"
time="2020-04-26T17:14:48Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc0002c0840)}"
time="2020-04-26T17:14:48Z" level=info msg="copyDefaultClusterPullSecret secret \"pull-secret\" not found"
time="2020-04-26T17:14:48Z" level=info msg="createDefaultResourceIfNeeded copy pull secret secret \"pull-secret\" not found"
time="2020-04-26T17:14:48Z" level=info msg="creating default Config"
time="2020-04-26T17:14:52Z" level=info msg="got already exists error on create default"
time="2020-04-26T17:14:52Z" level=info msg="waiting for informer caches to sync"
time="2020-04-26T17:14:55Z" level=info msg="started events processor"
time="2020-04-26T17:14:55Z" level=info msg="processing secret watch event while in Removed state; deletion event: false"
time="2020-04-26T17:14:55Z" level=info msg="creation/update of credential in openshift namespace recognized"
time="2020-04-26T17:14:55Z" level=info msg="processing secret watch event while in Removed state; deletion event: false"

Looks like it is trying to copy the pull secret even before the informer can run and populate the secrets from the openshift-config namespace. I tried an older ppc64le 4.3 build from a month ago and this used to work. These are the logs I see in the successful case:

time="2020-04-26T15:43:04Z" level=info msg="Go Version: go1.12.12"
time="2020-04-26T15:43:04Z" level=info msg="Go OS/Arch: linux/ppc64le"
time="2020-04-26T15:43:04Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc0002cc840)}"
time="2020-04-26T15:43:04Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc0002cc900)}"
time="2020-04-26T15:43:04Z" level=info msg="creating default Config"
time="2020-04-26T15:43:12Z" level=info msg="waiting for informer caches to sync"
E0426 15:43:14.101561      13 reflector.go:153] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: Failed to list *v1.Template: Unauthorized
E0426 15:43:14.101561      13 reflector.go:153] github.com/openshift/cluster-samples-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.Config: Unauthorized
E0426 15:43:14.101618      13 reflector.go:153] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Secret: Unauthorized
E0426 15:44:34.511848      13 reflector.go:153] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: Failed to list *v1.ImageStream: Get https://172.30.0.1:443/apis/image.openshift.io/v1/namespaces/openshift/imagestreams?limit=500&resourceVersion=0: unexpected EOF
E0426 15:44:34.511896      13 reflector.go:153] github.com/openshift/cluster-samples-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.Config: Get https://172.30.0.1:443/apis/samples.operator.openshift.io/v1/configs?limit=500&resourceVersion=0: unexpected EOF
E0426 15:44:34.511937      13 reflector.go:153] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Secret: Get https://172.30.0.1:443/api/v1/namespaces/openshift-config/secrets?limit=500&resourceVersion=0: unexpected EOF
E0426 15:44:34.511979      13 reflector.go:153] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Secret: Get https://172.30.0.1:443/api/v1/namespaces/openshift/secrets?limit=500&resourceVersion=0: unexpected EOF
time="2020-04-26T15:44:36Z" level=info msg="started events processor"
time="2020-04-26T15:44:36Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2020-04-26T15:44:40Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:40Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:40Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:41Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:41Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:42Z" level=error msg="unable to sync: Received secret pull-secret but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:43Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:45Z" level=info msg="processing secret watch event while in Removed state; deletion event: false"
time="2020-04-26T15:44:45Z" level=info msg="creation/update of credential in openshift namespace recognized"

Looks like in this case it worked. I am not sure what changed between the two builds, as nothing seems to have changed in the cluster-samples-operator.
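
As an untested, hypothetical workaround while the race is investigated, the install pull secret could be copied into the openshift namespace by hand and the tag import retried. The secret name, file path, and import command below are illustrative and were not verified on these clusters:

# Hypothetical manual copy of the install pull secret into the openshift namespace:
./oc get secret pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > /tmp/pull-secret.json
./oc create secret generic samples-registry-credentials -n openshift --type=kubernetes.io/dockerconfigjson --from-file=.dockerconfigjson=/tmp/pull-secret.json
# Retry the tag import so the must-gather imagestream can resolve (flags may need adjusting):
./oc import-image must-gather -n openshift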

Comment 6 Maciej Szulik 2020-06-18 09:29:46 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.
As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.
If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs,
that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 7 Prashanth Sundararaman 2020-06-18 12:02:43 UTC
Jeremy,

I believe this issue was fixed by the backports from Gabe. Can you confirm?

Thanks
Prashanth

Comment 8 Jeremy Poulin 2020-06-26 20:07:14 UTC
I feel like this was fixed by a backport at some point. I also know it's working in 4.4+, because I was just running those tests today.
I will kick off a suite of tests to confirm if this issue is still present on 4.3.z.

@Gabe, do you happen to know offhand which backport may have fixed this? It would have been a while back, since we haven't looked at this in a while.

Comment 9 Gabe Montero 2020-06-26 20:29:49 UTC
There was https://bugzilla.redhat.com/show_bug.cgi?id=1818476, fixed in 4.5, where we were getting unauthorized errors pulling images for the must-gather imagestream because the samples operator did not copy the install pull secret.

That is in 4.5

The samples operator code itself does not create the must-gather imagestream. The install creates it as a result of this file in the samples operator manifest: https://github.com/openshift/cluster-samples-operator/blob/master/manifests/08-openshift-imagestreams.yaml

With the various error messages posted here, I cannot tell if something else is going on with oc adm must-gather, or if the imagestream not getting populated was the root cause.

A simple check would be to run 

oc get is must-gather -o yaml -n openshift

I would tend to agree with closing this out, and reopening it with the output of oc get is must-gather -o yaml -n openshift along with the other debug data noted here if it is seen again.

Comment 10 Jeremy Poulin 2020-06-26 20:33:03 UTC
Closing this out. Will reopen if seen again, as per the comment above.

