Bug 1772178

Summary: [Disconnected] openshift-samples operator is reporting bad status, causing the whole installation to fail.
Product: OpenShift Container Platform
Component: Samples
Reporter: Gabe Montero <gmontero>
Assignee: Gabe Montero <gmontero>
QA Contact: XiuJuan Wang <xiuwang>
CC: bparees, jialiu, wzheng, xiuwang
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Version: 4.2.z
Target Release: 4.2.z
Target Milestone: ---
Keywords: Regression
Hardware: Unspecified
OS: Unspecified
Clone Of: 1771321
Cloned To: 1782683
Bug Depends On: 1771321
Last Closed: 2019-12-11 22:36:10 UTC

Description Gabe Montero 2019-11-13 20:36:19 UTC
+++ This bug was initially created as a clone of Bug #1771321 +++

Description of problem:


Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-12-000306

How reproducible:

Steps to Reproduce:
1. Trigger a disconnected install on bare metal
2.
3.

Actual results:
Installation failed.

$ openshift-install wait-for install-complete --dir '/home/installer2/workspace/Launch Environment Flexy/workdir/install-dir'
level=info msg="Waiting up to 30m0s for the cluster at https://api.jialiu43-dis2.qe.devcluster.openshift.com:6443 to initialize..."

E1111 23:40:23.358556    2370 reflector.go:280] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: the server is currently unable to handle the request (get clusterversions.config.openshift.io)

level=info msg="Cluster operator insights Disabled is True with Disabled: Health reporting is disabled"
level=info msg="Cluster operator openshift-samples Progressing is True with : Samples processing to 4.3.0-0.nightly-2019-11-12-000306"
level=fatal msg="failed to initialize the cluster: Cluster operator openshift-samples is still updating"

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          172m    Unable to apply 4.3.0-0.nightly-2019-11-12-000306: the cluster operator openshift-samples has not yet successfully rolled out

# oc get co openshift-samples
NAME                VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-samples             True        True          False      154m

openshift-samples was reporting a bad status to the CVO, which would lead to the `wait-for install-complete` failure.

# oc get co openshift-samples -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-11-12T04:37:11Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "14353"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: ef102837-e4a8-4f5a-83d0-592da6145680
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-11-12T04:37:33Z"
    message: Samples installation successful at 4.3.0-0.nightly-2019-11-12-000306
    status: "True"
    type: Available
  - lastTransitionTime: "2019-11-12T04:37:30Z"
    message: Samples processing to 4.3.0-0.nightly-2019-11-12-000306
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-12T04:37:30Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces


Expected results:
In a disconnected env, openshift-samples should not fail the whole installation process. Maybe the ideal state is that 'VERSION' is available, with only PROGRESSING=True shown to notify the user that the operator is still in progress. The user can then mirror the images as a day-2 operation to get openshift-samples ready, as sketched below.
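
A minimal sketch of that day-2 flow, assuming a reachable mirror registry at mirror.example.com:5000 (hypothetical host) and ruby-25-rhel7 as an example image; spec.samplesRegistry is the samples Config field that redirects imagestream imports to the mirror:

$ oc image mirror registry.redhat.io/rhscl/ruby-25-rhel7:latest mirror.example.com:5000/rhscl/ruby-25-rhel7:latest
$ oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec":{"samplesRegistry":"mirror.example.com:5000"}}'

After the patch, the operator should re-import the streams from the mirror on its next retry.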


Additional info:
1. This issue does not happen on 4.2 (for example, 4.2.0-0.nightly-2019-11-11-110741).
2. This is blocking QE's CI job.

--- Additional comment from Gabe Montero on 2019-11-12 15:42:33 UTC ---

Given that 4.3.0-0.nightly-2019-11-12-000306 was Accepted by CI, this would appear to be either an intermittent issue or a case of checking too soon.

I am not seeing it locally when I try that level.

Based on what is provided, with samples getting created at 

creationTimestamp: "2019-11-12T04:37:11Z"

and the progressing last transition time only a few seconds later

lastTransitionTime: "2019-11-12T04:37:30Z"

either the updates are not happening, or you checked too soon.

I'm going to need 

a) the results from oc get configs.samples -o yaml

b) the results from oc logs -f `oc get pods -o name` -c cluster-samples-operator -n openshift-cluster-samples-operator 

to clarify which it is.

--- Additional comment from Johnny Liu on 2019-11-13 10:26:05 UTC ---



--- Additional comment from Johnny Liu on 2019-11-13 10:28:23 UTC ---

Maybe my comment 0 was not clear enough: this issue only happens in restricted network installs.

Yesterday's cluster was terminated; I launched a new one and reproduced the same behaviour.


a). # oc get configs.samples -o yaml
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
  kind: Config
  metadata:
    creationTimestamp: "2019-11-13T08:51:22Z"
    finalizers:
    - samples.operator.openshift.io/finalizer
    generation: 1
    name: cluster
    resourceVersion: "9481"
    selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
    uid: eaad671d-6948-49ef-b435-3da5628741db
  spec:
    architectures:
    - x86_64
    managementState: Managed
  status:
    architectures:
    - x86_64
    conditions:
    - lastTransitionTime: "2019-11-13T08:51:23Z"
      lastUpdateTime: "2019-11-13T08:51:23Z"
      status: "True"
      type: ImportCredentialsExist
    - lastTransitionTime: "2019-11-13T08:51:33Z"
      lastUpdateTime: "2019-11-13T08:51:33Z"
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: "2019-11-13T08:51:24Z"
      lastUpdateTime: "2019-11-13T08:51:24Z"
      status: "False"
      type: ImportImageErrorsExist
    - lastTransitionTime: "2019-11-13T08:51:33Z"
      lastUpdateTime: "2019-11-13T08:51:33Z"
      reason: 'fuse-apicurito-generator jboss-fuse70-eap-openshift jboss-processserver64-openshift
        rhpam74-businesscentral-monitoring-openshift jboss-datagrid65-client-openshift
        jboss-datagrid65-openshift jboss-eap71-openshift redhat-sso72-openshift jboss-webserver30-tomcat7-openshift
        jboss-datavirt64-openshift php ruby jboss-decisionserver64-openshift dotnet
        mariadb rhdm74-decisioncentral-openshift jboss-datagrid73-openshift dotnet-runtime
        openjdk-11-rhel7 jboss-eap64-openshift redhat-sso70-openshift fis-karaf-openshift
        httpd mongodb apicast-gateway jboss-datagrid72-openshift jboss-eap70-openshift
        rhpam74-businesscentral-openshift jboss-webserver31-tomcat7-openshift jboss-amq-63
        java mysql redis jboss-datagrid71-client-openshift apicurito-ui nginx fuse7-java-openshift
        redhat-openjdk18-openshift eap-cd-openshift jboss-webserver31-tomcat8-openshift
        jboss-fuse70-karaf-openshift jenkins-agent-maven nodejs jboss-amq-62 jboss-datagrid71-openshift
        jboss-fuse70-console rhpam74-kieserver-openshift redhat-sso73-openshift fuse7-karaf-openshift
        jenkins postgresql perl jboss-eap72-openshift fuse7-eap-openshift jboss-fuse70-java-openshift
        golang jenkins-agent-nodejs python rhdm74-kieserver-openshift redhat-sso71-openshift
        jboss-datavirt64-driver-openshift fis-java-openshift fuse7-console jboss-webserver30-tomcat8-openshift
        jboss-webserver50-tomcat9-openshift modern-webapp rhdm74-optaweb-employee-rostering-openshift
        rhpam74-smartrouter-openshift '
      status: "True"
      type: ImageChangesInProgress
    - lastTransitionTime: "2019-11-13T08:51:36Z"
      lastUpdateTime: "2019-11-13T08:51:36Z"
      status: "True"
      type: SamplesExist
    - lastTransitionTime: "2019-11-13T08:51:33Z"
      lastUpdateTime: "2019-11-13T08:51:33Z"
      status: "False"
      type: RemovePending
    - lastTransitionTime: "2019-11-13T08:51:33Z"
      lastUpdateTime: "2019-11-13T08:51:33Z"
      status: "False"
      type: MigrationInProgress
    managementState: Managed
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

b). The log is attached.

--- Additional comment from Gabe Montero on 2019-11-13 15:45:15 UTC ---

OK the data helped.

Basically Johnny:

- in 4.2, we were only retrying failed imagestream imports once, so in your disconnected env we gave up quickly and samples Progressing went to False ... basically you got lucky

- in 4.3, based on customer telemeter data, it was decided that we need to retry on a more continual basis (with a 10 minute gap between retries), but with so many imagestreams we end up always in progress with at least one imagestream in your disconnected env, where attempts to access registry.redhat.io will always fail

So what do we tell customers to do in a disconnected env for samples? Per https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-preparations.html#installation-restricted-network-samples_installing-restricted-networks-preparations, if you are not going to mirror in registry.redhat.io content, you should set the samples operator to Removed, as sketched below.
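
For reference, a minimal sketch of that setting (assuming the default samples Config object, named "cluster"):

$ oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Removed"}}'

With managementState set to Removed, the operator deletes its samples and stops attempting imports, so nothing is left to keep Progressing flapping.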

That said, Ben Parees and I have been discussing the situation this morning: should we allow samples to stay Progressing==True and cause the `wait-for install-complete` failure you noted in the description?

As such, he and I have agreed to adjust the samples operator's interpretation of Progressing wrt imagestream import. We'll essentially no longer count an in-flight image import attempt toward Progressing.
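
Once that change lands, Progressing should settle to False even while failed imports are still being retried in the background; a quick way to watch it (plain jsonpath, nothing assumed beyond the operator name):

$ oc get co openshift-samples -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}{"\n"}'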

Comment 2 XiuJuan Wang 2019-11-29 08:32:02 UTC
Installed a disconnected cluster with 4.2.0-0.nightly-2019-11-28-230858.
The installation is no longer blocked when some imagestreams report import failures.

$ oc get co  openshift-samples  
NAME                VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-samples   4.2.0-0.nightly-2019-11-28-230858   True        False         False      22m

$ oc get co  openshift-samples  -o yaml 
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-11-29T08:08:57Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "16913"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: 7e2d51bf-127f-11ea-a664-02d5cfbb615e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-11-29T08:09:13Z"
    message: Samples installation successful at 4.2.0-0.nightly-2019-11-28-230858
    status: "True"
    type: Available
  - lastTransitionTime: "2019-11-29T08:22:16Z"
    message: 'Samples installed at 4.2.0-0.nightly-2019-11-28-230858, with image import
      failures for these imagestreams: redhat-sso71-openshift fis-karaf-openshift
      fuse7-java-openshift rhpam73-kieserver-openshift redhat-sso73-openshift jboss-eap72-openshift
      jboss-processserver64-openshift jboss-fuse70-eap-openshift jboss-amq-62 fuse7-karaf-openshift
      jboss-datagrid65-client-openshift ruby mariadb jboss-eap70-openshift redhat-sso70-openshift
      rhdm73-decisioncentral-indexing-openshift jboss-webserver50-tomcat9-openshift
      rhpam73-smartrouter-openshift nodejs rhdm73-kieserver-openshift rhpam73-businesscentral-indexing-openshift
      fuse7-eap-openshift jboss-datagrid71-client-openshift jboss-eap64-openshift
      ; last import attempt 2019-11-29 08:22:13 +0000 UTC'
    reason: FailedImageImports
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-11-29T08:09:04Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
  versions:
  - name: operator
    version: 4.2.0-0.nightly-2019-11-28-230858

Comment 4 errata-xmlrpc 2019-12-11 22:36:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4093