Bug 1772178 - [Disconnected] openshift-samples operator is reporting bad status to make the whole installation get failed.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.2.z
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On: 1771321
Blocks:
Reported: 2019-11-13 20:36 UTC by Gabe Montero
Modified: 2019-12-11 22:36 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1771321
: 1782683 (view as bug list)
Environment:
Last Closed: 2019-12-11 22:36:10 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-samples-operator pull 199 | 0 | 'None' | closed | [release-4.2] Bug 1772178: no longer gate setting progressing to false on in flight imagestream … | 2020-10-22 07:22:30 UTC
Red Hat Product Errata | RHBA-2019:4093 | 0 | None | None | None | 2019-12-11 22:36:21 UTC

Description Gabe Montero 2019-11-13 20:36:19 UTC
+++ This bug was initially created as a clone of Bug #1771321 +++

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
4.3.0-0.nightly-2019-11-12-000306

Steps to Reproduce:
1. Trigger a disconnected install on bare metal

Actual results:
Installation failed.

$ openshift-install wait-for install-complete --dir '/home/installer2/workspace/Launch Environment Flexy/workdir/install-dir'
level=info msg="Waiting up to 30m0s for the cluster at https://api.jialiu43-dis2.qe.devcluster.openshift.com:6443 to initialize..."

E1111 23:40:23.358556    2370 reflector.go:280] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: the server is currently unable to handle the request (get clusterversions.config.openshift.io)

level=info msg="Cluster operator insights Disabled is True with Disabled: Health reporting is disabled"
level=info msg="Cluster operator openshift-samples Progressing is True with : Samples processing to 4.3.0-0.nightly-2019-11-12-000306"
level=fatal msg="failed to initialize the cluster: Cluster operator openshift-samples is still updating"

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          172m    Unable to apply 4.3.0-0.nightly-2019-11-12-000306: the cluster operator openshift-samples has not yet successfully rolled out

# oc get co openshift-samples
NAME                VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-samples             True        True          False      154m

openshift-samples was reporting a bad status to the CVO, which led to the `wait-for install-complete` failure.

# oc get co openshift-samples -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-11-12T04:37:11Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "14353"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: ef102837-e4a8-4f5a-83d0-592da6145680
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-11-12T04:37:33Z"
    message: Samples installation successful at 4.3.0-0.nightly-2019-11-12-000306
    status: "True"
    type: Available
  - lastTransitionTime: "2019-11-12T04:37:30Z"
    message: Samples processing to 4.3.0-0.nightly-2019-11-12-000306
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-12T04:37:30Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
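
For readers comparing this YAML with the `oc get co` table, here is a minimal Python sketch of how the three condition statuses map to the AVAILABLE / PROGRESSING / DEGRADED columns. The helper names (`condition_status`, `summarize`, `looks_settled`) are hypothetical, and `looks_settled` is a simplified assumption, not the actual check performed by the CVO or the installer.

```python
# Sketch only: read the three standard conditions out of a ClusterOperator
# object that has been loaded into a dict (for example from `-o json` output).

def condition_status(cluster_operator, cond_type):
    """Return the status string of the named condition, or "Unknown" if absent."""
    for cond in cluster_operator.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status", "Unknown")
    return "Unknown"


def summarize(cluster_operator):
    """Collect the statuses shown as columns by `oc get co`."""
    return {
        cond_type: condition_status(cluster_operator, cond_type)
        for cond_type in ("Available", "Progressing", "Degraded")
    }


def looks_settled(cluster_operator):
    # Assumption: a simplified reading of "rolled out", not the CVO's real logic.
    return summarize(cluster_operator) == {
        "Available": "True",
        "Progressing": "False",
        "Degraded": "False",
    }


# Condition types and statuses copied from the YAML in this report.
reported = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {"type": "Progressing", "status": "True"},
            {"type": "Degraded", "status": "False"},
        ]
    }
}

print(summarize(reported))
print(looks_settled(reported))
```

With the conditions reported here, Progressing stays "True", so this simplified check never considers the operator settled.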


Expected results:
In a disconnected env, openshift-samples should not fail the whole installation process. Maybe the ideal state would be for 'VERSION' to be available, with only PROGRESSING=True shown to notify the user that the operator is still in progress. The user can then mirror images as a day 2 operation to get openshift-samples ready.


Additional info:
1. This issue does not happen on 4.2, such as: 4.2.0-0.nightly-2019-11-11-110741
2. This is blocking QE's ci job.

--- Additional comment from Gabe Montero on 2019-11-12 15:42:33 UTC ---

Given that I see that 4.3.0-0.nightly-2019-11-12-000306 was Accepted by CI, this would appear to be either an intermittent thing or a case of checking too soon.

I am not seeing it locally when I try that level.

Based on what is provided, with samples getting created at 

creationTimestamp: "2019-11-12T04:37:11Z"

and the progressing last transition time only a few seconds later

lastTransitionTime: "2019-11-12T04:37:30Z"

either the updates are not happening, or you checked too soon.

I'm going to need 

a) the results from oc get configs.samples -o yaml

b) the results from oc logs -f `oc get pods -o name` -c cluster-samples-operator -n openshift-cluster-samples-operator 

to clarify which it is.
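
To put a number on "only a few seconds later", the two timestamps quoted above can be subtracted directly. This is a small Python sketch; `parse_k8s_time` is a hypothetical helper, and it assumes the timestamps are always in the UTC `...Z` form shown in the YAML.

```python
from datetime import datetime, timezone


def parse_k8s_time(value):
    """Parse a UTC timestamp in the form used by the YAML above."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Values copied from the ClusterOperator YAML in the description.
created = parse_k8s_time("2019-11-12T04:37:11Z")
progressing_since = parse_k8s_time("2019-11-12T04:37:30Z")

gap_seconds = (progressing_since - created).total_seconds()
print(gap_seconds)  # 19.0
```

The Progressing condition last changed 19 seconds after the object was created, which fits either reading: no later updates happened, or the status was sampled very early.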

--- Additional comment from Johnny Liu on 2019-11-13 10:26:05 UTC ---



--- Additional comment from Johnny Liu on 2019-11-13 10:28:23 UTC ---

Maybe my comment 0 was not clear enough.
This issue only happens in a restricted network install.

Yesterday's cluster was terminated, so I launched a new one and reproduced the same behaviour.


a). # oc get configs.samples -o yaml
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
  kind: Config
  metadata:
    creationTimestamp: "2019-11-13T08:51:22Z"
    finalizers:
    - samples.operator.openshift.io/finalizer
    generation: 1
    name: cluster
    resourceVersion: "9481"
    selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
    uid: eaad671d-6948-49ef-b435-3da5628741db
  spec:
    architectures:
    - x86_64
    managementState: Managed
  status:
    architectures:
    - x86_64
    conditions:
    - lastTransitionTime: "2019-11-13T08:51:23Z"
      lastUpdateTime: "2019-11-13T08:51:23Z"
      status: "True"
      type: ImportCredentialsExist
    - lastTransitionTime: "2019-11-13T08:51:33Z"
      lastUpdateTime: "2019-11-13T08:51:33Z"
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: "2019-11-13T08:51:24Z"
      lastUpdateTime: "2019-11-13T08:51:24Z"
      status: "False"
      type: ImportImageErrorsExist
    - lastTransitionTime: "2019-11-13T08:51:33Z"
      lastUpdateTime: "2019-11-13T08:51:33Z"
      reason: 'fuse-apicurito-generator jboss-fuse70-eap-openshift jboss-processserver64-openshift
        rhpam74-businesscentral-monitoring-openshift jboss-datagrid65-client-openshift
        jboss-datagrid65-openshift jboss-eap71-openshift redhat-sso72-openshift jboss-webserver30-tomcat7-openshift
        jboss-datavirt64-openshift php ruby jboss-decisionserver64-openshift dotnet
        mariadb rhdm74-decisioncentral-openshift jboss-datagrid73-openshift dotnet-runtime
        openjdk-11-rhel7 jboss-eap64-openshift redhat-sso70-openshift fis-karaf-openshift
        httpd mongodb apicast-gateway jboss-datagrid72-openshift jboss-eap70-openshift
        rhpam74-businesscentral-openshift jboss-webserver31-tomcat7-openshift jboss-amq-63
        java mysql redis jboss-datagrid71-client-openshift apicurito-ui nginx fuse7-java-openshift
        redhat-openjdk18-openshift eap-cd-openshift jboss-webserver31-tomcat8-openshift
        jboss-fuse70-karaf-openshift jenkins-agent-maven nodejs jboss-amq-62 jboss-datagrid71-openshift
        jboss-fuse70-console rhpam74-kieserver-openshift redhat-sso73-openshift fuse7-karaf-openshift
        jenkins postgresql perl jboss-eap72-openshift fuse7-eap-openshift jboss-fuse70-java-openshift
        golang jenkins-agent-nodejs python rhdm74-kieserver-openshift redhat-sso71-openshift
        jboss-datavirt64-driver-openshift fis-java-openshift fuse7-console jboss-webserver30-tomcat8-openshift
        jboss-webserver50-tomcat9-openshift modern-webapp rhdm74-optaweb-employee-rostering-openshift
        rhpam74-smartrouter-openshift '
      status: "True"
      type: ImageChangesInProgress
    - lastTransitionTime: "2019-11-13T08:51:36Z"
      lastUpdateTime: "2019-11-13T08:51:36Z"
      status: "True"
      type: SamplesExist
    - lastTransitionTime: "2019-11-13T08:51:33Z"
      lastUpdateTime: "2019-11-13T08:51:33Z"
      status: "False"
      type: RemovePending
    - lastTransitionTime: "2019-11-13T08:51:33Z"
      lastUpdateTime: "2019-11-13T08:51:33Z"
      status: "False"
      type: MigrationInProgress
    managementState: Managed
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

b). log is attached.
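
The `ImageChangesInProgress` condition above carries the pending imagestream names as a space-separated `reason` string. A minimal Python sketch for pulling that list out is shown below; it assumes a single Config item already loaded into a dict (for example from `-o json` output), and `in_progress_imagestreams` is a hypothetical helper name.

```python
def in_progress_imagestreams(samples_config):
    """Return the imagestream names listed in the ImageChangesInProgress reason."""
    for cond in samples_config.get("status", {}).get("conditions", []):
        if cond.get("type") == "ImageChangesInProgress" and cond.get("status") == "True":
            # The reason is a whitespace-separated list of imagestream names.
            return cond.get("reason", "").split()
    return []


# Shortened sample: only three of the names from the output above are used.
sample_config = {
    "status": {
        "conditions": [
            {"type": "ImportImageErrorsExist", "status": "False"},
            {
                "type": "ImageChangesInProgress",
                "status": "True",
                "reason": "php ruby mariadb ",
            },
        ]
    }
}

pending = in_progress_imagestreams(sample_config)
print(len(pending), pending)
```

A list that never shrinks across repeated checks would be consistent with imports that cannot complete in a disconnected environment.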

--- Additional comment from Gabe Montero on 2019-11-13 15:45:15 UTC ---

OK the data helped.

Basically Johnny:

- in 4.2, we were only retrying failed imagestream imports once ... so we gave up in your disconnected env quickly, and samples progressing went to false ... so basically you got lucky

- in 4.3, based on customer telemeter data, it was decided that we need to retry on a more continual basis (10 minute gap between retries). But with so many imagestreams, your disconnected env ends up always having at least one imagestream in progress, because the attempts to access registry.redhat.io will always fail

So what do we tell customers to do in a disconnected env for samples?  If you see https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-preparations.html#installation-restricted-network-samples_installing-restricted-networks-preparations we say that if you are not going to mirror in registry.redhat.io content, you should set the samples operator to Removed 
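
As an illustration of the documented workaround, the Python sketch below builds a merge patch body that sets `spec.managementState` (the field visible in the Config output earlier in this bug). `management_state_patch` is a hypothetical helper, and the `oc patch` invocation in the trailing comment is an assumption about how the patch would be applied; check it against the linked documentation before use.

```python
import json

# Management states assumed to be accepted by the samples operator Config.
VALID_STATES = ("Managed", "Unmanaged", "Removed")


def management_state_patch(state):
    """Build a JSON merge patch body that sets spec.managementState."""
    if state not in VALID_STATES:
        raise ValueError(f"unsupported managementState: {state!r}")
    return json.dumps({"spec": {"managementState": state}})


patch = management_state_patch("Removed")
print(patch)

# Assumed usage, with the printed JSON substituted for <patch>:
#   oc patch configs.samples.operator.openshift.io cluster --type merge --patch '<patch>'
```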

That said, Ben Parees and I have been discussing the situation this morning: do we allow samples to stay progressing==true and cause the `wait-for install-complete` failure you noted in the description?

As such, he and I have agreed to adjust the samples operator's interpretation of progressing wrt imagestream import. We'll essentially no longer tie in-flight image import attempts to progressing.
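
The change in interpretation can be sketched roughly as follows. This is an illustrative Python model only, not the operator's code (the operator is not written in Python); both function names and the reduction to two inputs are assumptions made for the example.

```python
def progressing_before_fix(samples_exist, in_flight_imports):
    # Assumed old behavior: any in-flight imagestream import holds Progressing at True.
    return (not samples_exist) or len(in_flight_imports) > 0


def progressing_after_fix(samples_exist, in_flight_imports):
    # Assumed new behavior: in-flight imports no longer gate Progressing going to False.
    return not samples_exist


# Disconnected cluster: samples were created, but imports keep being retried.
in_flight = ["php", "ruby", "mariadb"]

print(progressing_before_fix(True, in_flight))  # True
print(progressing_after_fix(True, in_flight))   # False
```

Under this model a disconnected install that keeps retrying imports stays Progressing=True indefinitely with the old rule, and reports Progressing=False with the new one.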

Comment 2 XiuJuan Wang 2019-11-29 08:32:02 UTC
Installed a disconnected cluster with 4.2.0-0.nightly-2019-11-28-230858.
The installation is not blocked when some imagestreams report failed imports.

$ oc get co  openshift-samples  
NAME                VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-samples   4.2.0-0.nightly-2019-11-28-230858   True        False         False      22m

$ oc get co  openshift-samples  -o yaml 
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-11-29T08:08:57Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "16913"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: 7e2d51bf-127f-11ea-a664-02d5cfbb615e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-11-29T08:09:13Z"
    message: Samples installation successful at 4.2.0-0.nightly-2019-11-28-230858
    status: "True"
    type: Available
  - lastTransitionTime: "2019-11-29T08:22:16Z"
    message: 'Samples installed at 4.2.0-0.nightly-2019-11-28-230858, with image import
      failures for these imagestreams: redhat-sso71-openshift fis-karaf-openshift
      fuse7-java-openshift rhpam73-kieserver-openshift redhat-sso73-openshift jboss-eap72-openshift
      jboss-processserver64-openshift jboss-fuse70-eap-openshift jboss-amq-62 fuse7-karaf-openshift
      jboss-datagrid65-client-openshift ruby mariadb jboss-eap70-openshift redhat-sso70-openshift
      rhdm73-decisioncentral-indexing-openshift jboss-webserver50-tomcat9-openshift
      rhpam73-smartrouter-openshift nodejs rhdm73-kieserver-openshift rhpam73-businesscentral-indexing-openshift
      fuse7-eap-openshift jboss-datagrid71-client-openshift jboss-eap64-openshift
      ; last import attempt 2019-11-29 08:22:13 +0000 UTC'
    reason: FailedImageImports
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-11-29T08:09:04Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
  versions:
  - name: operator
    version: 4.2.0-0.nightly-2019-11-28-230858
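
With the fix, the failed imagestreams are reported in the Progressing condition's message instead of holding the status at True. The Python sketch below extracts the names and the last attempt time from a message of that shape. The regular expression is inferred from the single message shown above, so the format is an assumption, and `failed_imports` is a hypothetical helper.

```python
import re

# Pattern inferred from the one message in this report; the real format may vary.
FAILURES = re.compile(
    r"failures for these imagestreams:\s*(?P<names>.*?)\s*;"
    r"\s*last import attempt\s+(?P<when>.+)$",
    re.DOTALL,
)


def failed_imports(message):
    """Return (imagestream names, last attempt text) parsed from the message."""
    match = FAILURES.search(message)
    if match is None:
        return [], None
    return match.group("names").split(), match.group("when").strip()


# Shortened copy of the message above: only two imagestream names are kept.
message = (
    "Samples installed at 4.2.0-0.nightly-2019-11-28-230858, with image import "
    "failures for these imagestreams: redhat-sso71-openshift fis-karaf-openshift "
    "; last import attempt 2019-11-29 08:22:13 +0000 UTC"
)

names, last_attempt = failed_imports(message)
print(names)
print(last_attempt)
```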

Comment 4 errata-xmlrpc 2019-12-11 22:36:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4093

