Description of problem:

Version-Release number of selected component (if applicable): 4.3.z on ppc64le & s390x. Relatively recent issue, found in CI in builds from mid April+.

How reproducible: Run the oc adm must-gather tests in the openshift/conformance/parallel suite on 4.3.z deploys on s390x or power:

[cli] oc adm must-gather runs successfully [Suite:openshift/conformance/parallel]
[cli] oc adm must-gather runs successfully with options [Suite:openshift/conformance/parallel]

Actual results: fail

[github.com/openshift/origin/test/extended/cli/mustgather.go:65]: Expected success, but got an error:
<*util.ExitError | 0xc002a4ddd0>: {
  Cmd: "oc --namespace=e2e-test-oc-adm-must-gather-shl77 --config=/tmp/admin.kubeconfig adm must-gather --dest-dir /tmp/test.oc-adm-must-gather.152582883",
  StdErr:
[must-gather      ] OUT unable to resolve the imagestream tag openshift/must-gather:latest
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-c4zt7 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-7blm6 created
[must-gather      ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-zd4zz] POD standard_init_linux.go:211: exec user process caused "exec format error"
[must-gather-zd4zz] OUT waiting for gather to complete
[must-gather-zd4zz] OUT gather never finished: pod is not running: Failed
[must-gather-zd4zz] OUT
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-7blm6 deleted
[must-gather      ] OUT namespace/openshift-must-gather-c4zt7 deleted
error: gather never finished for pod must-gather-zd4zz: pod is not running: Failed
  ExitError: { ProcessState: { pid: 6697, status: 256, rusage: { Utime: {Sec: 0, Usec: 217463}, Stime: {Sec: 0, Usec: 59410}, Maxrss: 86452, Ixrss: 0, Idrss: 0, Isrss: 0, Minflt: 11932, Majflt: 0, Nswap: 0, Inblock: 0, Oublock: 0, Msgsnd: 0, Msgrcv: 0, Nsignals: 0, Nvcsw: 1109, Nivcsw: 2, }, }, Stderr: nil, },
}
exit status 1

Expected results: Tests pass
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
Some additional info from my debugging. When I execute the oc adm must-gather command:

[must-gather      ] OUT unable to resolve the imagestream tag openshift/must-gather:latest
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-vnxjz created

it falls back to using the default x86 image, which fails on ppc64le/s390x.

Also, when I do a describe on the imagestream:

[psundara@rock-kvmlp-1 ~]$ ./oc -n openshift describe is/must-gather
Name:			must-gather
Namespace:		openshift
Created:		About an hour ago
Labels:			<none>
Annotations:		<none>
Image Repository:	image-registry.openshift-image-registry.svc:5000/openshift/must-gather
Image Lookup:		local=false
Unique Images:		0
Tags:			1

latest
  updates automatically from registry quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:93b354ba71ecc71ee4758c84a7d1d114f2c81d1776e546d387fe9492547c9c0d

  ! error: Import failed (Unauthorized): you may not have access to the container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:93b354ba71ecc71ee4758c84a7d1d114f2c81d1776e546d387fe9492547c9c0d"
      About an hour ago

This suggests that the imagestream does not have the pull secret needed to access the image.
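To illustrate the failure mode: a stdlib-only toy sketch (not the actual oc source; resolveMustGatherImage and its map argument are hypothetical names) of the fallback behavior visible in the output above. When the imagestream tag cannot be resolved, oc falls back to a hard-coded default image, and since that default is an x86_64 image, running it on ppc64le/s390x yields the "exec format error" seen in the test failure.

```go
package main

import "fmt"

// The hard-coded fallback used when the imagestream tag cannot be resolved.
// This image is x86_64-only, hence "exec format error" on other arches.
const defaultImage = "quay.io/openshift/origin-must-gather:latest"

// resolveMustGatherImage mimics, in toy form, the resolution oc performs:
// prefer the cluster's imagestream tag (which points at an arch-appropriate
// release payload image); fall back to the hard-coded default otherwise.
func resolveMustGatherImage(imagestreamTags map[string]string) string {
	if img, ok := imagestreamTags["openshift/must-gather:latest"]; ok {
		return img
	}
	// Import of the imagestream failed (here: Unauthorized), so the tag is
	// unresolvable and the wrong-architecture default is silently used.
	return defaultImage
}

func main() {
	// Healthy cluster: the tag resolved to the release payload image.
	healthy := map[string]string{
		"openshift/must-gather:latest": "image-registry.openshift-image-registry.svc:5000/openshift/must-gather:latest",
	}
	fmt.Println(resolveMustGatherImage(healthy))

	// Broken cluster (this bug): import failed, tag absent, fallback used.
	fmt.Println(resolveMustGatherImage(nil))
}
```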
On dumping the secrets in the openshift namespace, I do not see samples-registry-credentials, which is a copy of the pull secret from the openshift-config namespace required for the imagestreams to be accessed:

[psundara@rock-kvmlp-1 ~]$ ./oc get secrets -n openshift
NAME                       TYPE                                  DATA   AGE
builder-dockercfg-p7zgm    kubernetes.io/dockercfg               1      19h
builder-token-n7g8r        kubernetes.io/service-account-token   4      19h
builder-token-vptj7        kubernetes.io/service-account-token   4      19h
default-dockercfg-5zzws    kubernetes.io/dockercfg               1      19h
default-token-htjhd        kubernetes.io/service-account-token   4      19h
default-token-qgw5b        kubernetes.io/service-account-token   4      19h
deployer-dockercfg-l4vqr   kubernetes.io/dockercfg               1      19h
deployer-token-q6ndv       kubernetes.io/service-account-token   4      19h
deployer-token-tj756       kubernetes.io/service-account-token   4      19h

Also, this issue is multi-arch specific. On x86 I do see must-gather work correctly, and the pull secret is present in the openshift namespace:

[root@dell-per730-02 ~]# ./oc get secrets -n openshift
NAME                           TYPE                                  DATA   AGE
builder-dockercfg-f45rm        kubernetes.io/dockercfg               1      43m
builder-token-djbml            kubernetes.io/service-account-token   4      43m
builder-token-gxzjx            kubernetes.io/service-account-token   4      43m
default-dockercfg-bdhpd       kubernetes.io/dockercfg                1      43m
default-token-2x2nc            kubernetes.io/service-account-token   4      43m
default-token-jkfxd            kubernetes.io/service-account-token   4      43m
deployer-dockercfg-hnrgz       kubernetes.io/dockercfg               1      43m
deployer-token-9sqcz           kubernetes.io/service-account-token   4      43m
deployer-token-pr5cz           kubernetes.io/service-account-token   4      43m
samples-registry-credentials   kubernetes.io/dockerconfigjson        1      41m

Furthermore, in the cluster-samples-operator logs I see that the pull secret is copied on x86:

time="2020-04-24T14:24:42Z" level=info msg="test connection to registry.redhat.io successful"
time="2020-04-24T14:24:42Z" level=info msg="creating default Config"
time="2020-04-24T14:24:45Z" level=info msg="waiting for informer caches to sync"
time="2020-04-24T14:24:45Z" level=info msg="started events processor"
time="2020-04-24T14:24:45Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2020-04-24T14:24:45Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"

But on s390x/ppc64le, I do not see it being copied:

time="2020-04-24T15:20:59Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc0004c55c0)}"
time="2020-04-24T15:20:59Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc0004c5680)}"
time="2020-04-24T15:20:59Z" level=info msg="creating default Config"
time="2020-04-24T15:21:02Z" level=info msg="waiting for informer caches to sync"
time="2020-04-24T15:21:02Z" level=info msg="started events processor"
time="2020-04-24T15:21:02Z" level=info msg="processing secret watch event while in Removed state; deletion event: false"
time="2020-04-24T15:21:02Z" level=info msg="Attempting stage 1 Removed management state: RemovePending == true"
time="2020-04-24T15:21:02Z" level=info msg="CRDUPDATE process mgmt update spec Removed status "
Added a debug print where copyDefaultClusterPullSecret is called, and I see this:

time="2020-04-26T17:14:48Z" level=info msg="Go Version: go1.12.12"
time="2020-04-26T17:14:48Z" level=info msg="Go OS/Arch: linux/ppc64le"
time="2020-04-26T17:14:48Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc0002c0780)}"
time="2020-04-26T17:14:48Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc0002c0840)}"
time="2020-04-26T17:14:48Z" level=info msg="copyDefaultClusterPullSecret secret \"pull-secret\" not found"
time="2020-04-26T17:14:48Z" level=info msg="createDefaultResourceIfNeeded copy pull secret secret \"pull-secret\" not found"
time="2020-04-26T17:14:48Z" level=info msg="creating default Config"
time="2020-04-26T17:14:52Z" level=info msg="got already exists error on create default"
time="2020-04-26T17:14:52Z" level=info msg="waiting for informer caches to sync"
time="2020-04-26T17:14:55Z" level=info msg="started events processor"
time="2020-04-26T17:14:55Z" level=info msg="processing secret watch event while in Removed state; deletion event: false"
time="2020-04-26T17:14:55Z" level=info msg="creation/update of credential in openshift namespace recognized"
time="2020-04-26T17:14:55Z" level=info msg="processing secret watch event while in Removed state; deletion event: false"

It looks like it is trying to copy the pull secret even before the informer can run and populate the secrets from the openshift-config namespace. I tried an older ppc64le 4.3 build from a month ago, and this used to work.
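The ordering problem described above can be sketched with a minimal, stdlib-only toy (this is not the operator's actual informer machinery; toyCache and waitForSync are illustrative stand-ins for a client-go lister cache and WaitForCacheSync): a lookup served from an informer cache before the initial list completes reports "not found" even though the secret exists in the cluster.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyCache stands in for an informer's local store: lookups are served from
// the cache rather than the API server, so a lookup made before the initial
// list has populated the cache reports "not found" even though the secret
// exists in the openshift-config namespace.
type toyCache struct {
	mu      sync.Mutex
	secrets map[string]string
}

func (c *toyCache) get(name string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.secrets[name]
	if !ok {
		return "", errors.New("secret \"" + name + "\" not found")
	}
	return v, nil
}

// waitForSync stands in for waiting on the informer cache sync: only after
// it completes does the cache reflect what is actually in the cluster.
func (c *toyCache) waitForSync(fromAPI map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.secrets = fromAPI
}

func main() {
	api := map[string]string{"pull-secret": "<dockerconfigjson>"}
	cache := &toyCache{}

	// Buggy ordering (matches the debug log above): the copy is attempted
	// before the cache has synced, so the secret appears to be missing.
	if _, err := cache.get("pull-secret"); err != nil {
		fmt.Println("copyDefaultClusterPullSecret", err)
	}

	// Correct ordering: wait for the cache to sync, then copy.
	cache.waitForSync(api)
	if _, err := cache.get("pull-secret"); err == nil {
		fmt.Println("pull-secret found; safe to copy into the openshift namespace")
	}
}
```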
These are the logs I see in the successful case:

time="2020-04-26T15:43:04Z" level=info msg="Go Version: go1.12.12"
time="2020-04-26T15:43:04Z" level=info msg="Go OS/Arch: linux/ppc64le"
time="2020-04-26T15:43:04Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc0002cc840)}"
time="2020-04-26T15:43:04Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc0002cc900)}"
time="2020-04-26T15:43:04Z" level=info msg="creating default Config"
time="2020-04-26T15:43:12Z" level=info msg="waiting for informer caches to sync"
E0426 15:43:14.101561      13 reflector.go:153] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: Failed to list *v1.Template: Unauthorized
E0426 15:43:14.101561      13 reflector.go:153] github.com/openshift/cluster-samples-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.Config: Unauthorized
E0426 15:43:14.101618      13 reflector.go:153] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Secret: Unauthorized
E0426 15:44:34.511848      13 reflector.go:153] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: Failed to list *v1.ImageStream: Get https://172.30.0.1:443/apis/image.openshift.io/v1/namespaces/openshift/imagestreams?limit=500&resourceVersion=0: unexpected EOF
E0426 15:44:34.511896      13 reflector.go:153] github.com/openshift/cluster-samples-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.Config: Get https://172.30.0.1:443/apis/samples.operator.openshift.io/v1/configs?limit=500&resourceVersion=0: unexpected EOF
E0426 15:44:34.511937      13 reflector.go:153] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Secret: Get https://172.30.0.1:443/api/v1/namespaces/openshift-config/secrets?limit=500&resourceVersion=0: unexpected EOF
E0426 15:44:34.511979      13 reflector.go:153] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Secret: Get https://172.30.0.1:443/api/v1/namespaces/openshift/secrets?limit=500&resourceVersion=0: unexpected EOF
time="2020-04-26T15:44:36Z" level=info msg="started events processor"
time="2020-04-26T15:44:36Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2020-04-26T15:44:40Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:40Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:40Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:41Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:41Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:42Z" level=error msg="unable to sync: Received secret pull-secret but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:43Z" level=error msg="unable to sync: Received secret samples-registry-credentials but do not have the Config yet, requeuing, requeuing"
time="2020-04-26T15:44:45Z" level=info msg="processing secret watch event while in Removed state; deletion event: false"
time="2020-04-26T15:44:45Z" level=info msg="creation/update of credential in openshift namespace recognized"

It looks like in this case it worked. I am not sure what changed between the two builds, as nothing seems to have changed in the cluster-samples-operator.
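For what it's worth, the "requeuing" errors above are benign: the operator puts the secret event back on its queue until the Config object exists, so the copy eventually succeeds. A toy, stdlib-only sketch of that control flow (the real operator uses a rate-limited client-go workqueue; drain and readyAfter are illustrative names, with readyAfter standing in for the watch event that delivers the Config):

```go
package main

import "fmt"

// drain models the requeue behavior in the logs: events that arrive before
// the Config exists are appended back onto the queue; once the Config
// appears (after readyAfter attempts here), every pending secret is finally
// synced. It returns the emitted log lines.
func drain(queue []string, readyAfter int) []string {
	var logLines []string
	attempts := 0
	for len(queue) > 0 {
		event := queue[0]
		queue = queue[1:]
		attempts++
		if attempts < readyAfter {
			logLines = append(logLines, fmt.Sprintf(
				"unable to sync: Received secret %s but do not have the Config yet, requeuing", event))
			queue = append(queue, event) // put the event back; retry later
			continue
		}
		logLines = append(logLines, "synced secret "+event)
	}
	return logLines
}

func main() {
	for _, line := range drain([]string{"samples-registry-credentials", "pull-secret"}, 4) {
		fmt.Println(line)
	}
}
```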
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
Jeremy, I believe this issue is fixed after the backports from Gabe? Can you confirm? Thanks, Prashanth
I feel like this was fixed by a backport at some point. I also know it's working in 4.4+, because I was just running those tests today. I will kick off a suite of tests to confirm whether this issue is still present on 4.3.z. @Gabe, do you happen to know offhand what backport may have fixed this? It would have been a while back, since we haven't looked at this in a while.
There was https://bugzilla.redhat.com/show_bug.cgi?id=1818476, fixed in 4.5, where we were getting unauthorized errors pulling images for the must-gather imagestream because the samples operator did not copy the install pull secret. That fix is in 4.5.

The samples operator code itself does not create the must-gather imagestream. The install does, as a result of this file in the samples operator manifest: https://github.com/openshift/cluster-samples-operator/blob/master/manifests/08-openshift-imagestreams.yaml

With the various error messages posted here, I cannot tell if something else is going on with oc adm must-gather, or if the imagestream not getting populated was the root cause. A simple check would be to run:

oc get is must-gather -o yaml -n openshift

I would tend to agree on closing this out, and reopening (with the output of the command above along with the other debug data noted here) if it is seen again.
Closing this out. Will reopen if seen again, as per the comment above.