Bug 1779933 - Install fails on Z due to timeout in cluster-samples-operator
Summary: Install fails on Z due to timeout in cluster-samples-operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.2.z
Hardware: s390x
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Gabe Montero
QA Contact: Yaakov Selkowitz
URL:
Whiteboard:
Depends On:
Blocks: 1779934 1779935
Reported: 2019-12-05 02:58 UTC by Yaakov Selkowitz
Modified: 2020-06-25 02:06 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The samples operator was failing to report its version when running on s390x or ppc64le.
Consequence: Installs on those architectures would not complete successfully.
Fix: The samples operator now reports its version correctly on s390x and ppc64le.
Result: The samples operator no longer prevents installs on s390x and ppc64le from completing.
Clone Of:
: 1779934 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:18:31 UTC
Target Upstream Version:


Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift cluster-samples-operator pull 205 | 'None' | closed | Bug 1779933: set version to start for s390/ppc until we have actual samples | 2020-09-03 13:51:32 UTC
Red Hat Product Errata | RHBA-2020:0581 | None | None | None | 2020-05-04 11:19:07 UTC

Description Yaakov Selkowitz 2019-12-05 02:58:18 UTC
On the latest commit on s390x, the cluster samples operator reports that it is available and has finished progressing. However, the OpenShift installer does not detect that the operator has finished updating, and ends up timing out.

```
DEBUG Built from commit 6ed04f65b0f6a1e11f10afe658465ba8195ac459
INFO Waiting up to 30m0s for the cluster at https://api.test.example.com:6443 to initialize...
DEBUG Still waiting for the cluster to initialize: Cluster operator openshift-samples is still updating
```

The offending commit is https://github.com/openshift/cluster-samples-operator/pull/187 which we do need overall but apparently isn't quite right.

Comment 2 David Benoit 2019-12-05 03:49:18 UTC
```
[root@ocp-z-dev-2-9 ocp4-workdir]# oc logs cluster-samples-operator-66dcb6fddf-npdlc
time="2019-12-05T00:55:29Z" level=info msg="Go Version: go1.11.13"                                                                                                                             
time="2019-12-05T00:55:29Z" level=info msg="Go OS/Arch: linux/s390x"                                                                                                                           
time="2019-12-05T00:55:29Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc0003d0300)}"                                                                 
time="2019-12-05T00:55:29Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc0003d03c0)}"                                                                       
time="2019-12-05T00:55:29Z" level=info msg="creating default Config"                                                                                                                           
time="2019-12-05T00:55:32Z" level=info msg="got already exists error on create default"                                                                                                        
time="2019-12-05T00:55:32Z" level=info msg="waiting for informer caches to sync"                                                                                                               
time="2019-12-05T00:55:32Z" level=info msg="started events processor"                                                                                                                          
time="2019-12-05T00:55:32Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"                                                                       
time="2019-12-05T00:55:32Z" level=info msg="creation/update of credential in openshift namespace recognized"                                                                                   
time="2019-12-05T00:55:32Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"                                                                       
time="2019-12-05T00:55:32Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"                                                      
time="2019-12-05T00:55:32Z" level=info msg="management state set to managed"                                                                                                                   
time="2019-12-05T00:55:32Z" level=info msg="Spec is valid because this operator has not processed a config yet"                                                                                
time="2019-12-05T00:55:32Z" level=info msg="samples are not installed on non-x86 architectures"                                                                                                
time="2019-12-05T01:05:32Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"                                                                       
time="2019-12-05T01:05:32Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"                                                      
time="2019-12-05T01:05:32Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2019-12-05T01:05:32Z" level=info msg="creation/update of credential in openshift namespace recognized"
time="2019-12-05T01:05:32Z" level=info msg="management state set to managed"
time="2019-12-05T01:05:32Z" level=info msg="Spec is valid because this operator has not processed a config yet"
time="2019-12-05T01:05:32Z" level=info msg="samples are not installed on non-x86 architectures"
```

Comment 3 Ben Parees 2019-12-05 05:34:36 UTC
Can you provide the output of "oc get clusteroperator/openshift-samples -o yaml"?

Comment 4 Ben Parees 2019-12-05 05:40:28 UTC
The YAML was provided in Slack:
```
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-12-05T00:22:00Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "10094"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: 40e8ab8d-16f5-11ea-868b-0200000c2211
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-05T00:22:00Z"
    reason: NonX86Platform
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-12-05T00:22:00Z"
    reason: NonX86Platform
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-12-05T00:22:03Z"
    reason: NonX86Platform
    status: "True"
    type: Available
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
```


https://coreos.slack.com/files/UFHEG5WQ3/FRCGZA431/untitled

As part of the discussion:
https://coreos.slack.com/archives/CFFJUNP6C/p1575522976131700

I'm not seeing anything obviously wrong with it, so possibly this is a CVO problem? Or did the samples operator status update to healthy after the failure?

Comment 7 W. Trevor King 2019-12-05 06:16:37 UTC
The samples operator is saying "yes, expect me to set an 'operator' version" [1].

[1]: https://github.com/openshift/cluster-samples-operator/blob/c8d02cb18cf94dd774c9391292ae1fd27ba32346/manifests/07-clusteroperator.yaml#L7-L9
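
The mismatch can be sketched roughly as follows. This is a simplified model, not the cluster-version operator's actual code: it assumes the CVO keeps treating an operator as "still updating" until every version name declared in the operator's manifest is also reported, with the same value, under `status.versions`. The function name `missing_versions` and the `0.0.1-snapshot` value are hypothetical and used only for illustration. The `live` status mirrors the conditions from comment 4, which contain no `versions` entry at all.

```python
# Simplified model of a manifest-vs-status version check.
# Not the real CVO logic; names and values are illustrative assumptions.

def missing_versions(manifest_status, live_status):
    """Return version names the manifest declares but the live status
    either omits or reports with a different value."""
    reported = {v["name"]: v["version"] for v in live_status.get("versions") or []}
    return [
        v["name"]
        for v in manifest_status.get("versions") or []
        if reported.get(v["name"]) != v["version"]
    ]

# The manifest declares an "operator" version (placeholder value, assumed).
manifest = {"versions": [{"name": "operator", "version": "0.0.1-snapshot"}]}

# Live status as in comment 4: healthy conditions, but no "versions" key.
live = {
    "conditions": [
        {"type": "Progressing", "status": "False", "reason": "NonX86Platform"},
        {"type": "Degraded", "status": "False", "reason": "NonX86Platform"},
        {"type": "Available", "status": "True", "reason": "NonX86Platform"},
    ],
}

print(missing_versions(manifest, live))  # ['operator']
```

Under this model the conditions alone look healthy, which is consistent with the status appearing fine on inspection while the installer still waits.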

Comment 8 Ben Parees 2019-12-05 06:18:07 UTC
Yeah, that would do it. Thanks, Trevor. Hopefully Gabe can fix this in the morning.

Comment 11 Adam Kaplan 2019-12-05 13:45:59 UTC
Please provide the must-gather info, which contains the logs for the samples operator. This code _should_ be setting the operator version, but if it is failing to do so we would see errors in the log.

Comment 12 Gabe Montero 2019-12-05 14:34:00 UTC
I don't need the must-gather. I believe I know why the version is not getting set in our special-case path for s390.

I should have a PR up soon.

Comment 16 Wei Sun 2020-04-17 02:13:37 UTC
Hi Gabe,
So per comment 14 and comment 15, should this bug be moved to 4.5 and set to assigned status?

Comment 17 Gabe Montero 2020-04-17 13:49:39 UTC
No, Wei Sun, we should mark this as verified. What we did in 4.4 was stop attempting to install x86 samples on s390/ppc, since those installs were doomed to fail.

However, as part of this, the samples operator was originally failing to set the version it was at, and thus the install complained.

The PR with this bug addressed that.
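
As a rough illustration of that change, here is a minimal sketch in Python rather than the operator's Go code. It is not taken from the PR: `non_x86_status` and the `"4.4.0"` version string are hypothetical. The point it shows is that the special-case path for architectures without samples reports an `operator` version alongside the conditions, instead of the conditions alone.

```python
# Hypothetical sketch: the special-case (non-x86) path also reports a version.
# Function name and version value are illustrative, not from the operator source.

def non_x86_status(operator_version):
    """Build a ClusterOperator-like status for a platform with no samples."""
    reason = "NonX86Platform"
    return {
        "conditions": [
            {"type": "Progressing", "status": "False", "reason": reason},
            {"type": "Degraded", "status": "False", "reason": reason},
            {"type": "Available", "status": "True", "reason": reason},
        ],
        # The piece that was originally missing on this path.
        "versions": [{"name": "operator", "version": operator_version}],
    }

status = non_x86_status("4.4.0")
print(status["versions"])  # [{'name': 'operator', 'version': '4.4.0'}]
```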

Comment 14 and comment 15 speak to the next step, which is installing samples on s390/ppc that reference images that work on those platforms.

Specifically
1) https://issues.redhat.com/browse/DEVEXP-465 and https://github.com/openshift/cluster-samples-operator/pull/225 will result in samples getting installed
2) https://issues.redhat.com/browse/MULTIARCH-149 is the work on the multi-arch side to enable testing of those samples in CI, to verify that those imagestreams/images and templates from the non-openshift teams are functional

We will merge 1) once 2) is ready.

Comment 19 errata-xmlrpc 2020-05-04 11:18:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

