Bug 1779933
| Field | Value |
|---|---|
| Summary | Install fails on Z due to timeout in cluster-samples-operator |
| Product | OpenShift Container Platform |
| Component | Samples |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Version | 4.2.z |
| Target Milestone | --- |
| Target Release | 4.4.0 |
| Hardware | s390x |
| OS | Linux |
| Reporter | Yaakov Selkowitz <yselkowi> |
| Assignee | Gabe Montero <gmontero> |
| QA Contact | Yaakov Selkowitz <yselkowi> |
| CC | adam.kaplan, amccrae, bparees, crawford, dbenoit, gmontero, ssadeghi, wking, wsun |
| Doc Type | Bug Fix |
Doc Text:

Cause: The samples operator failed to report its version when running on s390x or ppc64le.
Consequence: Installs on those architectures would not complete successfully.
Fix: The samples operator now reports its version correctly on s390x and ppc64le.
Result: The samples operator no longer prevents installs on s390x and ppc64le from completing.
| Field | Value |
|---|---|
| Story Points | --- |
| Clones | 1779934 (view as bug list) |
| Bug Blocks | 1779934, 1779935 |
| Type | Bug |
| Last Closed | 2020-05-04 11:18:31 UTC |
Description (Yaakov Selkowitz, 2019-12-05 02:58:18 UTC):
```
[root@ocp-z-dev-2-9 ocp4-workdir]# oc logs cluster-samples-operator-66dcb6fddf-npdlc
time="2019-12-05T00:55:29Z" level=info msg="Go Version: go1.11.13"
time="2019-12-05T00:55:29Z" level=info msg="Go OS/Arch: linux/s390x"
time="2019-12-05T00:55:29Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc0003d0300)}"
time="2019-12-05T00:55:29Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc0003d03c0)}"
time="2019-12-05T00:55:29Z" level=info msg="creating default Config"
time="2019-12-05T00:55:32Z" level=info msg="got already exists error on create default"
time="2019-12-05T00:55:32Z" level=info msg="waiting for informer caches to sync"
time="2019-12-05T00:55:32Z" level=info msg="started events processor"
time="2019-12-05T00:55:32Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2019-12-05T00:55:32Z" level=info msg="creation/update of credential in openshift namespace recognized"
time="2019-12-05T00:55:32Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2019-12-05T00:55:32Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2019-12-05T00:55:32Z" level=info msg="management state set to managed"
time="2019-12-05T00:55:32Z" level=info msg="Spec is valid because this operator has not processed a config yet"
time="2019-12-05T00:55:32Z" level=info msg="samples are not installed on non-x86 architectures"
time="2019-12-05T01:05:32Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2019-12-05T01:05:32Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2019-12-05T01:05:32Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2019-12-05T01:05:32Z" level=info msg="creation/update of credential in openshift namespace recognized"
time="2019-12-05T01:05:32Z" level=info msg="management state set to managed"
time="2019-12-05T01:05:32Z" level=info msg="Spec is valid because this operator has not processed a config yet"
time="2019-12-05T01:05:32Z" level=info msg="samples are not installed on non-x86 architectures"
```

Can you provide `oc get clusteroperator/openshift-samples -o yaml`?

The yaml was provided in Slack:

```yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-12-05T00:22:00Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "10094"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: 40e8ab8d-16f5-11ea-868b-0200000c2211
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-05T00:22:00Z"
    reason: NonX86Platform
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-12-05T00:22:00Z"
    reason: NonX86Platform
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-12-05T00:22:03Z"
    reason: NonX86Platform
    status: "True"
    type: Available
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
```

https://coreos.slack.com/files/UFHEG5WQ3/FRCGZA431/untitled

as part of discussion: https://coreos.slack.com/archives/CFFJUNP6C/p1575522976131700

Not seeing anything obviously wrong w/ it, so possibly this is a CVO problem?
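(Note that the status dump above carries no `versions` stanza at all. On an install where the operator does report its version, the ClusterOperator status would also include something along these lines; this is a sketch, and the version string is a placeholder for the cluster's release payload version.)

```yaml
status:
  versions:
  - name: operator
    version: "<release-payload-version>"  # placeholder; reported by the operator from its release payload
```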
Or did the samples operator status update to healthy after the failure? I bet the issue is the missing version. Compare

    versions:
    - name: operator
      version: 0.0.1-2019-12-05-035621

from this random, successful CI job [1]. Docs in [2].

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2753/pull-ci-openshift-installer-master-e2e-aws/8850/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-tytlx80s-stable-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/cluster-scoped-resources/config.openshift.io/clusteroperators/openshift-samples.yaml
[2]: https://github.com/openshift/cluster-version-operator/blame/98d173e9f8679a7db87877cbdb177bc309dda6a2/docs/user/reconciliation.md#L120

The samples operator is saying "yes, expect me to set an 'operator' version" [1].

[1]: https://github.com/openshift/cluster-samples-operator/blob/c8d02cb18cf94dd774c9391292ae1fd27ba32346/manifests/07-clusteroperator.yaml#L7-L9

Yeah, that would do it. Thanks Trevor. Hopefully Gabe can fix this in the morning.

The version-setting code might be [1,2]. Not sure where the multi-arch handling is guarding against that.

[1]: https://github.com/openshift/cluster-samples-operator/blob/9d88c47dc607029e6ea48256697fea837dd0df40/pkg/operatorstatus/operatorstatus.go#L177
[2]: https://github.com/openshift/cluster-samples-operator/blob/9d88c47dc607029e6ea48256697fea837dd0df40/pkg/operatorstatus/operatorstatus.go#L203-L207

Ah, the guard is [1], but [2] is not setting a version.

[1]: https://github.com/openshift/cluster-samples-operator/blob/c8d02cb18cf94dd774c9391292ae1fd27ba32346/pkg/operatorstatus/operatorstatus.go#L90-L93
[2]: https://github.com/openshift/cluster-samples-operator/blob/c8d02cb18cf94dd774c9391292ae1fd27ba32346/pkg/operatorstatus/operatorstatus.go#L66-L82

Please provide the must-gather info, which contains the logs for the samples operator. This code _should_ be setting the operator version, but if it is failing to do so we would see errors in the log.

I don't need must-gather ... I believe I know why the version is not getting set in our special-case path for s390. I should have a PR up soon.

Hi Gabe, per comment 14 and comment 15, should this bug be moved to 4.5 and set to ASSIGNED status?

No, Wei Sun, we should mark this as verified: what we did in 4.4 was not attempt to install x86 samples on s390/ppc that were doomed to fail. However, the samples operator was originally failing to set the version it was at as part of this, and thus the install complained. The PR attached to this bug addressed that. Comment 14 and comment 15 speak to the next step, which is installing samples on s390/ppc that reference images that work on those platforms. Specifically:

1) https://issues.redhat.com/browse/DEVEXP-465 and https://github.com/openshift/cluster-samples-operator/pull/225 will result in samples getting installed
2) https://issues.redhat.com/browse/MULTIARCH-149 is the work on the multi-arch side to enable testing of those samples in CI, to verify that the imagestreams/images and templates from the non-OpenShift teams are functional

We will merge 1) once 2) is ready.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
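For reference, a minimal sketch of the shape of the fix discussed above: reporting an "operator" version even on the non-x86 short-circuit path. It assumes the release version is injected via the RELEASE_VERSION environment variable, as is typical for CVO-managed operators; the helper name is hypothetical and this is not the actual patch.

```go
package operatorstatus

import (
	"os"

	configv1 "github.com/openshift/api/config/v1"
)

// setOperatorVersion (hypothetical helper) ensures the ClusterOperator reports
// an "operator" version so the CVO can consider the operator complete, even on
// code paths that skip sample installation (e.g. s390x/ppc64le).
// Assumption: the release version is provided via the RELEASE_VERSION env var.
func setOperatorVersion(co *configv1.ClusterOperator) {
	version := os.Getenv("RELEASE_VERSION")
	if version == "" {
		// No version available; leave any previously reported value untouched.
		return
	}
	// Update an existing "operator" entry if present.
	for i := range co.Status.Versions {
		if co.Status.Versions[i].Name == "operator" {
			co.Status.Versions[i].Version = version
			return
		}
	}
	// Otherwise append a new entry.
	co.Status.Versions = append(co.Status.Versions, configv1.OperandVersion{
		Name:    "operator",
		Version: version,
	})
}
```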