Description of problem:
We're seeing a test consistently fail due to a problem pulling an image. The error seen in the test output is that a secrets volume mount failed, and later in the build log there's an image pull error message.

Version-Release number of selected component (if applicable): n/a

How reproducible: easily

Steps to Reproduce:
Run the e2e test bucket (so far only tested/seen in the libvirt CI).

Actual results:
The test "[sig-operator] an end user can use OLM can subscribe to the operator [Suite:openshift/conformance/parallel]" times out:

fail [github.com/openshift/origin/test/extended/operators/olm.go:271]: Timed out after 300.000s.

Expected results:
The test passes.

Additional info:
Errors from https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.6/1308677555380817920#1:build-log.txt%3A2161

Sep 23 08:59:07.420: INFO: At 2020-09-23 08:54:15 +0000 UTC - event for amq-streams-cluster-operator-v1.5.3-7fb46d478f-wpwcw: {kubelet ci-op-4h0y2gv4-694a5-gdfpt-worker-0-dgpzc} FailedMount: MountVolume.SetUp failed for volume "strimzi-cluster-operator-token-pvmph" : failed to sync secret cache: timed out waiting for the condition caused \\\\\\\"stat /var/lib/kubelet/pods/612cedd7-f531-4e24-a807-3d1977d3db2f/volumes/kubernetes.io~secret/default-token-7n76z: no such file or directory\\\\\\\"\\\"\""
container_linux.go:348: starting container process caused "process_linux.go:438: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/612cedd7-f531-4e24-a807-3d1977d3db2f/volumes/kubernetes.io~secret/default-token-7n76z\\\" to rootfs \\\"/var/lib/containers/storage/overlay/ec67262d54d15a0d362e7c7acd60f1ccd14b5a30e1c2bbf47a923398ea78d88d/merged\\\" at \\\"/var/run/secrets/kubernetes.io/serviceaccount\\\" caused \\\"stat /var/lib/kubelet/pods/612cedd7-f531-4e24-a807-3d1977d3db2f/volumes/kubernetes.io~secret/default-token-7n76z: no such file or directory\\\"\""

Sep 23 09:12:12.045 W ns/e2e-container-runtime-1802 pod/image-pull-test2d3292ad-b58a-41e1-912c-e6a94ce09945 node/ci-op-4h0y2gv4-694a5-gdfpt-worker-0-dgpzc reason/Failed Failed to pull image "gcr.io/authenticated-image-pulling/alpine:3.7": rpc error: code = Unknown desc = Error reading manifest 3.7 in gcr.io/authenticated-image-pulling/alpine: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication

Sep 23 09:12:12.058 W ns/e2e-container-runtime-1802 pod/image-pull-test2d3292ad-b58a-41e1-912c-e6a94ce09945 node/ci-op-4h0y2gv4-694a5-gdfpt-worker-0-dgpzc reason/Failed Error: ErrImagePull
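One thing worth checking on a cluster that hits this: whether the cluster-wide pull secret even has an auth entry for the registry named in the error (gcr.io here), and what the failing pod's events say while it still exists. A minimal sketch, assuming the standard pull-secret location in openshift-config and that jq is available on the workstation (the pod and namespace names below are the ones from the log above):

$ oc get secret/pull-secret -n openshift-config --template='{{index .data ".dockerconfigjson" | base64decode}}' | jq -r '.auths | keys[]'
$ oc describe pod/image-pull-test2d3292ad-b58a-41e1-912c-e6a94ce09945 -n e2e-container-runtime-1802

Caveat: if I recall correctly, the gcr.io/authenticated-image-pulling e2e test supplies its own pull credentials, so a missing gcr.io entry in the cluster pull secret isn't necessarily the root cause; this is just a quick way to rule it out.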
We're able to reliably reproduce this locally (on Power). Will look into it.
Running this again and grabbing some logs before the pod was deleted shows an exec format error, i.e. the container binary was built for a different architecture. According to @jpoulin, the OLM payloads in the image index bundle will be x86-only until GA. Should we leave this as a 4.6 bug and mark it RELEASE_PENDING or some other resolved-sounding status?
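For reference, a quick way to confirm whether an image is actually published for the test architecture is to look at its manifest list. A minimal sketch, assuming skopeo and jq are available on the host and using a placeholder image name, since the exact index image isn't named in this bug:

$ skopeo inspect --raw docker://example.registry.io/olm/example-operator-index:latest | jq -r '.manifests[]?.platform | "\(.os)/\(.architecture)"'

If the image is a plain single-arch manifest rather than a manifest list, the command above prints nothing; in that case `skopeo inspect docker://<image> | jq .Architecture` shows the one architecture it was built for.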
Hi Christy, what is your estimate of the "Target Release" for this bug (is it 4.6 or 4.7)? I am trying to triage this bug with a target release.
Hi Dan. 4.6 please and thank you. :D I am not able to set a target milestone.
Thank you, Christy. Setting the target release to 4.6.
Hi Christy (sorry, my last logistics question on this bug for the day): do you think this bug will be resolved by the end of this sprint (before October 3rd)? If it will be fixed after this week, I would like to add the "UpcomingSprint" label to this bug.
This one, as I understood it, will resolve itself *at* GA -- so no -- it will not be fixed after this week. However, the test passed today, so I am perplexed as to the reason it was failing. Maybe there was a change in process and the index bundle has been multi-arched. TBD. But for now I think I answered your question.
(In reply to Christy Norman from comment #7) > This one, as I understood it, will resolve itself *at* GA -- so no -- it > will not be fixed after this week. > > However, the test passed today, so I am perplexed as to the reason it was > failing. Maybe there was a change in process and the index bundle has been > multi-arched. TBD. But for now I think I answered your question. Typo. It *will* be fixed after this week.
It's still failing on s390x:

fail [github.com/openshift/origin/test/extended/operators/olm.go:211]: Unexpected error:
    <*errors.errorString | 0xc001c96990>: {
        s: "Error unmarshalling operatorhub spec: map[]",
    }
    Error unmarshalling operatorhub spec: map[]
occurred

and indeed when I run

$ oc get operatorhub/cluster -o=jsonpath={.spec}
map[]
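Side note for anyone debugging the same thing: the unmarshal error above quotes the spec it got (map[]), so it can help to look at the whole OperatorHub object rather than just .spec. A minimal sketch, assuming the config.openshift.io/v1 OperatorHub API, which reports the default catalog sources under .status.sources:

$ oc get operatorhub/cluster -o yaml
$ oc get operatorhub/cluster -o jsonpath='{.status.sources[*].name}'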
Thank you. Adding the "UpcomingSprint" label.
(In reply to Rafael Fonseca from comment #9)
> It's still failing on s390x:
>
> fail [github.com/openshift/origin/test/extended/operators/olm.go:211]:
> Unexpected error:
>     <*errors.errorString | 0xc001c96990>: {
>         s: "Error unmarshalling operatorhub spec: map[]",
>     }
>     Error unmarshalling operatorhub spec: map[]
> occurred
>
> and indeed when I run
>
> $ oc get operatorhub/cluster -o=jsonpath={.spec}
> map[]

Correction: it's passing in CI but failing locally.
Rafael, is this still failing for you? I closed my PR to skip it since it's passing in CI. I'm okay with closing this bz unless you still need it open for s390x.
It's been passing in CI for all the latest runs, so feel free to close it.
Closing as WORKSFORME. Dan or anyone who cares, feel free to change the close reason if it's important. :)