Bug 1828065 - After upgrade, openshift-samples is moving between ready and "Progressing" status
Summary: After upgrade, openshift-samples is moving between ready and "Progressing" status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 1832344
 
Reported: 2020-04-26 15:22 UTC by Aviad Polak
Modified: 2023-10-06 19:46 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If a sample template available in a prior release was removed in a subsequent release, then during upgrade to that subsequent release the removed template could be incorrectly tracked as needing to be updated. The subsequent failure to read that template from the current payload leads to confusing 'no such file or directory error reading file []' messages and apiserver errors reported in the samples clusteroperator status.
Consequence: The overall upgrade could be marked as failed if the problematic timing window in the samples operator occurred.
Fix: The samples operator was updated to not attempt to update templates which existed in a prior release but not in the release being upgraded to.
Result: Templates that are removed from release to release should not cause the samples operator to fail an upgrade or report confusing clusteroperator status.
Clone Of:
: 1832344 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:31:39 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
samples operator log (92.95 KB, application/gzip)
2020-04-27 09:26 UTC, XiuJuan Wang


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-samples-operator pull 265 0 None closed Bug 1828065: correctly handle removed templates watch events as part of an upgrade 2021-01-13 13:34:18 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:31:56 UTC

Description Aviad Polak 2020-04-26 15:22:58 UTC
Description of problem:
After upgrading from a 4.3 nightly build to 4.4-rc11,
openshift-samples keeps moving between Ready and "Progressing" status.
This keeps happening 1-2 hours after the upgrade completed.


Version-Release number of selected component (if applicable):
oc version
Client Version: 4.3.0-0.nightly-2020-03-19-220321
Server Version: 4.4.0-rc.11
Kubernetes Version: v1.17.1


Steps to Reproduce:
1. Install OCP 4.3
2. Install OCS 4.4 on an LSO setup (VMware with direct-attached storage)
3. Run the OCP upgrade (patch the channel and upgrade)

Actual results:
The upgrade appears to be successful, but openshift-samples keeps moving between Ready and "Progressing" status. This is visible via the web UI or the CLI.

Expected results:
All cluster operators need to be Ready.

must-gather:
https://drive.google.com/file/d/1koNffYVx3J-dIUhIXZDoQ_1MYj-53S9q/view?usp=sharing
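
For reference, a minimal way to watch the flapping from the CLI (a sketch assuming standard oc access to the cluster; not part of the original report):

```
# Watch the openshift-samples ClusterOperator; PROGRESSING should settle at False
# once the upgrade has truly finished.
oc get clusteroperator openshift-samples -w

# One-off check of just the Progressing condition
oc get clusteroperator openshift-samples \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}{"\n"}'
```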

Comment 1 XiuJuan Wang 2020-04-27 09:26:04 UTC
Created attachment 1682087 [details]
samples operator log

Imagestreams have imported successfully, but ImageChangesInProgress stays True for a long time.
$oc get config.samples -o yaml 
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
  kind: Config
  metadata:
    creationTimestamp: "2020-04-27T03:08:16Z"
    finalizers:
    - samples.operator.openshift.io/finalizer
    generation: 1
    name: cluster
    resourceVersion: "137599"
    selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
    uid: 28ec00ff-52d9-4dab-bdac-ad9ec0675979
  spec:
    architectures:
    - x86_64
    managementState: Managed
  status:
    architectures:
    - x86_64
    conditions:
    - lastTransitionTime: "2020-04-27T03:08:16Z"
      lastUpdateTime: "2020-04-27T03:08:16Z"
      status: "True"
      type: ImportCredentialsExist
    - lastTransitionTime: "2020-04-27T03:08:19Z"
      lastUpdateTime: "2020-04-27T03:08:19Z"
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: "2020-04-27T03:08:17Z"
      lastUpdateTime: "2020-04-27T09:18:23Z"
      status: "False"
      type: ImportImageErrorsExist
    - lastTransitionTime: "2020-04-27T09:18:25Z"
      lastUpdateTime: "2020-04-27T09:18:25Z"
      reason: 'jboss-datagrid71-openshift jboss-datavirt64-driver-openshift fuse-apicurito-generator
        jboss-fuse70-eap-openshift redhat-sso71-openshift jboss-datagrid65-client-openshift
        redhat-sso70-openshift eap-cd-openshift openjdk-11-rhel8 jenkins-agent-maven
        nodejs jboss-webserver30-tomcat7-openshift jboss-webserver31-tomcat7-openshift
        jboss-datagrid65-openshift fuse7-console fuse7-karaf-openshift nginx rhdm-optaweb-employee-rostering-rhel8
        rhpam-businesscentral-monitoring-rhel8 dotnet-runtime mysql perl jboss-amq-62
        jboss-decisionserver64-openshift python redis jboss-eap72-openshift mariadb
        rhdm-decisioncentral-rhel8 rhpam-businesscentral-rhel8 redhat-sso73-openshift
        jboss-amq-63 fuse7-java-openshift jenkins-agent-nodejs postgresql jboss-fuse70-console
        httpd openjdk-8-rhel8 dotnet jboss-fuse70-java-openshift jenkins jboss-processserver64-openshift
        rhpam-kieserver-rhel8 jboss-datagrid72-openshift openjdk-11-rhel7 mongodb
        rhpam-smartrouter-rhel8 golang java jboss-datagrid71-client-openshift jboss-datagrid73-openshift
        jboss-eap70-openshift apicurito-ui fis-karaf-openshift fuse7-eap-openshift
        redhat-openjdk18-openshift jboss-webserver31-tomcat8-openshift jboss-datavirt64-openshift
        jboss-eap71-openshift ruby jboss-webserver30-tomcat8-openshift apicast-gateway
        jboss-eap64-openshift jboss-fuse70-karaf-openshift php rhdm-kieserver-rhel8
        redhat-sso72-openshift fis-java-openshift modern-webapp jboss-webserver50-tomcat9-openshift '
      status: "True"
      type: ImageChangesInProgress
    - lastTransitionTime: "2020-04-27T09:18:31Z"
      lastUpdateTime: "2020-04-27T09:18:31Z"
      status: "True"
      type: SamplesExist
    - lastTransitionTime: "2020-04-27T03:08:19Z"
      lastUpdateTime: "2020-04-27T03:08:19Z"
      status: "False"
      type: RemovePending
    - lastTransitionTime: "2020-04-27T06:03:34Z"
      lastUpdateTime: "2020-04-27T06:03:34Z"
      status: "False"
      type: MigrationInProgress
    managementState: Managed
    version: 4.4.0-rc.12
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$oc describe is jboss-datagrid71-openshift -n openshift 
Name:			jboss-datagrid71-openshift
Namespace:		openshift
Created:		6 hours ago
Labels:			samples.operator.openshift.io/managed=true
Annotations:		openshift.io/display-name=Red Hat JBoss Data Grid 7.1
			openshift.io/provider-display-name=Red Hat, Inc.
			samples.operator.openshift.io/version=4.4.0-rc.12
			version=1.4.16
Image Repository:	image-registry.openshift-image-registry.svc:5000/openshift/jboss-datagrid71-openshift
Image Lookup:		local=false
Unique Images:		4
Tags:			4

1.3
  tagged from registry.redhat.io/jboss-datagrid-7/datagrid71-openshift:1.3
    prefer registry pullthrough when referencing this tag

  JBoss Data Grid 7.1 S2I images.
  Tags: datagrid, jboss, hidden
  Supports: datagrid:7.1

  * registry.redhat.io/jboss-datagrid-7/datagrid71-openshift@sha256:f1355e0714431109c6b4c64fe8bcc565c096395ccce0690bb21ce5fb49d51d6c
      6 hours ago

1.2
  tagged from registry.redhat.io/jboss-datagrid-7/datagrid71-openshift:1.2
    prefer registry pullthrough when referencing this tag

  JBoss Data Grid 7.1 S2I images.
  Tags: datagrid, jboss, hidden
  Supports: datagrid:7.1

  * registry.redhat.io/jboss-datagrid-7/datagrid71-openshift@sha256:5909bd1b08fd6b4428e259c4262dd70ca928089e5a2949cc57edfab019b503d0
      6 hours ago

1.1
  tagged from registry.redhat.io/jboss-datagrid-7/datagrid71-openshift:1.1
    prefer registry pullthrough when referencing this tag

  JBoss Data Grid 7.1 S2I images.
  Tags: datagrid, jboss, hidden
  Supports: datagrid:7.1

  * registry.redhat.io/jboss-datagrid-7/datagrid71-openshift@sha256:8c7690d21bedc1aa822661b5e5b72626e52178ea8295ab734d2f9c9bc7548e19
      6 hours ago

1.0
  tagged from registry.redhat.io/jboss-datagrid-7/datagrid71-openshift:1.0
    prefer registry pullthrough when referencing this tag

  JBoss Data Grid 7.1 S2I images.
  Tags: datagrid, jboss, hidden
  Supports: datagrid:7.1

  * registry.redhat.io/jboss-datagrid-7/datagrid71-openshift@sha256:df13fa689944b028595f593480bc54b17e3f81ca685a6b4290089e6a3c4e3a8e
      6 hours ago

Comment 2 Stephen Cuppett 2020-04-27 12:37:54 UTC
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 3 Gabe Montero 2020-04-27 14:05:37 UTC
I suspect this is a dup of existing bugs on 4.3 / 4.4, but I will triage the data today and report back.

Comment 4 Gabe Montero 2020-04-27 16:22:12 UTC
I'm working off of the logs @XiuJuan posted for now, but the must-gather logs attachment from Aviad is getting errors when I attempt to process it using multiple
tools, including gunzip/tar/archive manager.

Can a valid must-gather archive file be resubmitted?

Comment 5 Gabe Montero 2020-04-27 16:52:56 UTC
OK, I'm seeing a common thread between XiuJuan's data and the data from https://bugzilla.redhat.com/show_bug.cgi?id=1827325#c11

This is definitely a timing bug related to some of the upgrade issues previously seen where sample providers have started deleting samples between releases, just in a different spot this time.

Comment 6 Gabe Montero 2020-04-27 17:21:24 UTC
I have a handle on this. Essentially, https://github.com/openshift/cluster-samples-operator/pull/240 was incomplete and did not include templates in the set of samples that were removed and needed to be ignored.

We ended up with a bunch of messages like:

time="2020-04-27T06:43:12Z" level=info msg="open : no such file or directory error reading file []"
time="2020-04-27T06:43:12Z" level=info msg="CRDUPDATE event temp udpate err"


That slows down getting to Progressing == false.
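
As an illustration (not from the original triage), a quick way to spot this symptom in a live cluster, assuming the default samples operator namespace and deployment name:

```
# Grep the samples operator logs for the failed template reads that keep
# Progressing from settling (the "udpate" typo is verbatim in the operator's output).
oc logs -n openshift-cluster-samples-operator deployment/cluster-samples-operator \
  | grep -E 'no such file or directory error reading file|event temp udpate err'
```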

Comment 7 Aviad Polak 2020-04-27 17:54:29 UTC
Added a new must-gather; it is not zipped this time.
It is from today's upgrade from rc-11 to rc-12; the issue is still here.

https://drive.google.com/open?id=171vGyhchGcvC598UtSdlqxRm415L2oLW

Comment 8 Gabe Montero 2020-04-27 18:14:01 UTC
The must gather Aviad just posted showed the same root cause as the other 2 examples.

Comment 9 Gabe Montero 2020-04-27 19:25:13 UTC
Upgrade-related e2e failures are also showing this:

[sig-devex][Feature:Templates] templateinstance readiness test should report failed soon after an annotated objects has failed [Suite:openshift/conformance/parallel]

in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci/48

for example

Comment 10 Gabe Montero 2020-04-27 19:25:38 UTC
[sig-devex][Feature:Templates] templateinstance readiness test should report failed soon after an annotated objects has failed [Suite:openshift/conformance/parallel]

Comment 13 W. Trevor King 2020-04-28 21:03:03 UTC
Cross-linking bug 1811143, which is also about 4.5 samples Progressing.

Comment 14 XiuJuan Wang 2020-04-29 10:49:35 UTC
Upgrading 4.4.0-0.nightly-2020-04-27-013217 to 4.5.0-0.nightly-2020-04-28-110828 (which does not include the fix) could reproduce this issue.
Continuing the upgrade to 4.5.0-0.nightly-2020-04-29-040854, the fix works: openshift-samples Progressing changes to False sooner and does not flip back to True without an actual change.
Also, there is no such error "open : no such file or directory error reading file []" in the samples operator logs.

Marking this bug as verified.
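
For what it's worth, a minimal sketch of the same check from the CLI (the resource name matches the config shown in Comment 1; the jsonpath filter is illustrative):

```
# The ImageChangesInProgress condition should settle to False once the upgrade is done,
# and the samples operator logs should no longer show the failed template reads.
oc get configs.samples.operator.openshift.io cluster \
  -o jsonpath='{.status.conditions[?(@.type=="ImageChangesInProgress")].status}{"\n"}'
```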

Comment 15 Hongkai Liu 2020-05-01 20:42:02 UTC
We saw this too in the CI build cluster.

time="2020-05-01T20:37:50Z" level=info msg="CRDERROR event temp udpate err"
time="2020-05-01T20:37:50Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-01T20:37:51Z" level=info msg="CRDERROR event temp udpate err"
time="2020-05-01T20:37:51Z" level=info msg="open : no such file or directory error reading file []"

oc --context build01 get  clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.2     True        False         6h38m   Cluster version is 4.4.2


What is the impact of the bug?

Comment 16 Gabe Montero 2020-05-01 20:57:45 UTC
It can slow down, by a non-trivial number of minutes, the samples operator reporting to the CVO that it has reached the Progressing==false state as part of an upgrade.

That can ultimately slow down when an upgrade says it completes.

Once your upgrade is finished, as your 

oc --context build01 get  clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.2     True        False         6h38m   Cluster version is 4.4.2

notes, you are OK.

Also, if you do not care about samples, you are OK.  I would think for CI you do not care about using the samples.
So you might want to mark its management state as Removed, which will remove the samples and effectively disable the operator.
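
A minimal sketch of doing that (the resource and field names match the config shown earlier in this bug; apply at your own discretion):

```
# Set the samples config to Removed: the operator deletes its samples and
# effectively idles itself until the state is set back to Managed.
oc patch configs.samples.operator.openshift.io cluster --type merge \
  --patch '{"spec":{"managementState":"Removed"}}'
```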

Comment 17 jooho lee 2020-05-05 18:08:41 UTC
This is still there in OCP 4.4.3

```
openshift-samples                          4.4.3     True        False         False      5m50s
operator-lifecycle-manager                 4.4.3     True        False         False      48d
```
==>

```
openshift-samples                          4.4.3     True        True          False      2m39s
operator-lifecycle-manager                 4.4.3     True        False         False      48d
```

There are a lot of error messages in the operator logs:
~~~
time="2020-05-05T17:49:46Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
time="2020-05-05T17:49:46Z" level=info msg="CRDUPDATE exist true update"
time="2020-05-05T17:49:46Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing"
time="2020-05-05T17:49:46Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing"
time="2020-05-05T17:49:48Z" level=info msg="CRDERROR event temp udpate err"
time="2020-05-05T17:49:48Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-05T17:49:48Z" level=info msg="CRDUPDATE event temp udpate err"
time="2020-05-05T17:49:48Z" level=info msg="CRDERROR event temp udpate err"

~~~

I believe the operator keeps restarting because of some errors.

~~~
Client Version: v4.3.0
Server Version: 4.4.3
Kubernetes Version: v1.17.1
[root@ocp4 home]# 
~~~

Comment 18 Gabe Montero 2020-05-06 14:53:25 UTC
After catching up with Vadim in Slack, he informs me that Comment 17 is related to a broad set of 4.3 to 4.4 upgrade issues being triaged in https://app.slack.com/client/T027F3GAJ/C0108DW4J7P and https://app.slack.com/client/T027F3GAJ/C0134TXLGG3

For samples operator, I will be improving on the 'no such file or directory error reading file []' via https://bugzilla.redhat.com/show_bug.cgi?id=1832193

Now, this bug has yet to be cherry-picked back to 4.4 ... it is only fixed in 4.5 at the moment.  Hence what was seen in Comment 17.

I will be initiating clone / cherry pick now.

Comment 19 Gabe Montero 2020-05-06 18:30:33 UTC
Turns out the cherry pick associated with the 4.4.z clone of this bug, https://bugzilla.redhat.com/show_bug.cgi?id=1832344, is sufficient to address the 'no such file or directory error reading file []' message.

Once we stop trying to read in template files for templates that are no longer in our current inventory (but were in past inventories), we no longer produce the message.

Comment 20 miwilson@redhat.com 2020-06-02 13:05:06 UTC
All, just wanted to circle around on this. We're still seeing this in the latest 4.4 nightlies (openshift-v4.4.0-0.nightly-2020-06-01-211921, more detailed job results here: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/osde2e-int-aws-e2e-upgrade-to-osd-default-plus-one-nightly/335).

Comment 21 Gabe Montero 2020-06-02 13:58:55 UTC
Looking at that run in Comment 20, it is a different cause from this bug:

2020/06/02 11:40:03 Operator openshift-samples type Progressing is True: Samples installation in error at 4.4.0-0.nightly-2020-06-01-211921: FailedImageImports

that one stems from the imagestream imports failing.

Looking at the artifacts for that run, I do not see the must-gather data, so I can't pull up the imagestream status for the precise reason, but those failed image imports
typically occur either because

a) registry.redhat.io has some sort of outage or intermittent issues preventing image download by the image registry, or
b) there is a problem with the credentials supplied during install for registry.redhat.io (where those issues could be "user" error or, again, more problems on the infrastructure side of registry.redhat.io).

We do have an open issue with that team for intermittent outages:  https://projects.engineering.redhat.com/browse/ITESPLAT-1817

If a must-gather could be collected for that particular nightly job, I could get more precise in the analysis and conclude whether the failure should be added to https://projects.engineering.redhat.com/browse/ITESPLAT-1817 or whether a separate item should be filed.
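
In the absence of must-gather data, a rough way to pull the failing import details directly from a live cluster (an illustrative sketch; <imagestream-name> is a placeholder for one of the streams reported in the condition):

```
# Which imagestreams does the samples operator report as having import errors?
oc get configs.samples.operator.openshift.io cluster \
  -o jsonpath='{.status.conditions[?(@.type=="ImportImageErrorsExist")].reason}{"\n"}'

# Then inspect one of the reported imagestreams for the exact import failure message.
oc describe is <imagestream-name> -n openshift
```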

Comment 22 errata-xmlrpc 2020-07-13 17:31:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

