Bug 1882487 - ppc64le/s390x: e2e olm operator test fails due to invalid secret cache
Summary: ppc64le/s390x: e2e olm operator test fails due to invalid secret cache
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Christy Norman
QA Contact: Jeremy Poulin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-24 17:42 UTC by Christy Norman
Modified: 2020-10-06 15:47 UTC (History)
3 users

Fixed In Version: 4.6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-06 15:47:07 UTC
Target Upstream Version:



Description Christy Norman 2020-09-24 17:42:28 UTC
Description of problem:

We're seeing a test consistently fail due to a problem pulling an image. The error in the test output is that a secret volume mount failed, and later in the build log there is an image pull error message.


Version-Release number of selected component (if applicable):
n/a

How reproducible:
easily


Steps to Reproduce:

run the e2e test bucket (so far only tested/seen in the libvirt CI)

Actual results:
[sig-operator] an end user can use OLM can subscribe to the operator [Suite:openshift/conformance/parallel]  test times out

fail [github.com/openshift/origin/test/extended/operators/olm.go:271]: Timed out after 300.000s.


Expected results:
test pass


Additional info:

errors from https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.6/1308677555380817920#1:build-log.txt%3A2161

 Sep 23 08:59:07.420: INFO: At 2020-09-23 08:54:15 +0000 UTC - event for amq-streams-cluster-operator-v1.5.3-7fb46d478f-wpwcw: {kubelet ci-op-4h0y2gv4-694a5-gdfpt-worker-0-dgpzc} FailedMount: MountVolume.SetUp failed for volume "strimzi-cluster-operator-token-pvmph" : failed to sync secret cache: timed out waiting for the condition 

caused \\\\\\\"stat /var/lib/kubelet/pods/612cedd7-f531-4e24-a807-3d1977d3db2f/volumes/kubernetes.io~secret/default-token-7n76z: no such file or directory\\\\\\\"\\\"\""\ncontainer_linux.go:348: starting container process caused "process_linux.go:438: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/612cedd7-f531-4e24-a807-3d1977d3db2f/volumes/kubernetes.io~secret/default-token-7n76z\\\" to rootfs \\\"/var/lib/containers/storage/overlay/ec67262d54d15a0d362e7c7acd60f1ccd14b5a30e1c2bbf47a923398ea78d88d/merged\\\" at \\\"/var/run/secrets/kubernetes.io/serviceaccount\\\" caused \\\"stat /var/lib/kubelet/pods/612cedd7-f531-4e24-a807-3d1977d3db2f/volumes/kubernetes.io~secret/default-token-7n76z: no such file or directory\\\"\""\n

Sep 23 09:12:12.045 W ns/e2e-container-runtime-1802 pod/image-pull-test2d3292ad-b58a-41e1-912c-e6a94ce09945 node/ci-op-4h0y2gv4-694a5-gdfpt-worker-0-dgpzc reason/Failed Failed to pull image "gcr.io/authenticated-image-pulling/alpine:3.7": rpc error: code = Unknown desc = Error reading manifest 3.7 in gcr.io/authenticated-image-pulling/alpine: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
Sep 23 09:12:12.058 W ns/e2e-container-runtime-1802 pod/image-pull-test2d3292ad-b58a-41e1-912c-e6a94ce09945 node/ci-op-4h0y2gv4-694a5-gdfpt-worker-0-dgpzc reason/Failed Error: ErrImagePull

Comment 1 Christy Norman 2020-09-25 16:27:13 UTC
We're able to reliably reproduce this locally (on Power). Will look into it.

Comment 2 Christy Norman 2020-09-25 18:04:33 UTC
Running this again and grabbing some logs before the pod was deleted shows an exec format error. According to @jpoulin, the OLM payloads in the image index bundle will be the x86 ones until GA. Should we leave this as a 4.6 bug and mark it as RELEASE_PENDING or some other resolved-sounding Status?
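An exec format error means the binary in the image doesn't match the node's architecture, consistent with x86-only OLM payloads landing on a ppc64le node. A rough way to check is to look at which architectures the image's manifest list actually ships. The sketch below uses a stand-in manifest list written to a hypothetical path; against a live registry you would fetch the real one with `skopeo inspect --raw docker://<index-image>`:

```shell
#!/bin/sh
# Stand-in manifest list (hypothetical content and path); in practice:
#   skopeo inspect --raw docker://<index-image> > /tmp/manifest-list.json
cat > /tmp/manifest-list.json <<'EOF'
{
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}}
  ]
}
EOF

# Print the architectures the image ships; if the node arch (ppc64le,
# s390x, ...) is missing, running a container from it yields
# "exec format error".
python3 - <<'EOF'
import json
with open("/tmp/manifest-list.json") as f:
    doc = json.load(f)
archs = [m["platform"]["architecture"] for m in doc.get("manifests", [])]
print("architectures:", ",".join(archs))
EOF
```

With only `amd64` in the list, any Power or Z node pulling that image would hit the exec format error described above.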

Comment 3 Dan Li 2020-09-28 13:21:25 UTC
Hi Christy, what is your estimation of the "Target Release" of this bug (is it 4.6 or 4.7)? I am trying to triage this bug with a target release.

Comment 4 Christy Norman 2020-09-28 14:29:06 UTC
Hi Dan. 4.6 please, and thank you. :D I am not able to set a target milestone.

Comment 5 Dan Li 2020-09-28 14:58:33 UTC
Thank you Christy. Setting target release as 4.6

Comment 6 Dan Li 2020-09-28 15:35:08 UTC
Hi Christy (sorry - my last logistics question on this bug for the day). Do you think this bug will be resolved by the end of this sprint (before October 3rd)? If it will be fixed after this week, I would like to add an "UpcomingSprint" label to this bug.

Comment 7 Christy Norman 2020-09-29 22:08:49 UTC
This one, as I understood it, will resolve itself *at* GA -- so no -- it will not be fixed after this week. 

However, the test passed today, so I am perplexed as to the reason it was failing. Maybe there was a change in process and the index bundle has been multi-arched. TBD. But for now I think I answered your question.

Comment 8 Christy Norman 2020-09-29 22:10:13 UTC
(In reply to Christy Norman from comment #7)
> This one, as I understood it, will resolve itself *at* GA -- so no -- it
> will not be fixed after this week. 
> 
> However, the test passed today, so I am perplexed as to the reason it was
> failing. Maybe there was a change in process and the index bundle has been
> multi-arched. TBD. But for now I think I answered your question.

Typo. It *will* be fixed after this week.

Comment 9 Rafael Fonseca 2020-09-30 08:39:04 UTC
It's still failing on s390x:

fail [github.com/openshift/origin/test/extended/operators/olm.go:211]: Unexpected error:
    <*errors.errorString | 0xc001c96990>: {
        s: "Error unmarshalling operatorhub spec: map[]",
    }
    Error unmarshalling operatorhub spec: map[]
occurred

and indeed when I run

$ oc get operatorhub/cluster -o=jsonpath={.spec}
map[]
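The `map[]` output above is simply how jsonpath renders an empty OperatorHub spec, which is the resource's default. A minimal sketch of distinguishing "spec is empty" from "spec has content" is below, using a stand-in JSON file at a hypothetical path; on a live cluster the input would come from `oc get operatorhub/cluster -o json`:

```shell
#!/bin/sh
# Stand-in OperatorHub resource (hypothetical file); on a real cluster:
#   oc get operatorhub/cluster -o json > /tmp/operatorhub.json
cat > /tmp/operatorhub.json <<'EOF'
{"apiVersion": "config.openshift.io/v1",
 "kind": "OperatorHub",
 "metadata": {"name": "cluster"},
 "spec": {}}
EOF

# jsonpath prints an empty object as "map[]"; check whether the spec is
# actually empty before reading the unmarshalling failure as corruption.
python3 - <<'EOF'
import json
spec = json.load(open("/tmp/operatorhub.json"))["spec"]
print("spec empty" if not spec else "spec keys: " + ",".join(sorted(spec)))
EOF
```

So the `map[]` seen locally is not by itself proof of a broken cluster; whether the e2e test should tolerate an empty spec is a separate question about olm.go:211.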

Comment 10 Dan Li 2020-09-30 11:30:53 UTC
Thank you. Adding "UpcomingSprint" label

Comment 11 Rafael Fonseca 2020-09-30 11:57:34 UTC
(In reply to Rafael Fonseca from comment #9)
> It's still failing on s390x:
> 
> fail [github.com/openshift/origin/test/extended/operators/olm.go:211]:
> Unexpected error:
>     <*errors.errorString | 0xc001c96990>: {
>         s: "Error unmarshalling operatorhub spec: map[]",
>     }
>     Error unmarshalling operatorhub spec: map[]
> occurred
> 
> and indeed when I run
> 
> $ oc get operatorhub/cluster -o=jsonpath={.spec}
> map[]

Correction: it's passing in CI but failing locally.

Comment 12 Christy Norman 2020-10-06 15:00:38 UTC
Rafael, is this still failing for you? I closed my PR to skip it since it's passing CI. I'm okay with closing this bz unless you still need it for s390x.

Comment 13 Rafael Fonseca 2020-10-06 15:15:59 UTC
It's been passing in CI for all the latest runs, so feel free to close it.

Comment 14 Christy Norman 2020-10-06 15:47:07 UTC
Closing as WORKSFORME. Dan or anyone who cares, feel free to change the close reason if it's important. :)

