Bug 1811206 - upgrades where entire sample imagestreams were removed in the new version can get stuck in progressing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3.z
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On: 1811204
Blocks:
Reported: 2020-03-06 20:28 UTC by Gabe Montero
Modified: 2020-06-17 20:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If a sample imagestream available in a prior release was removed in a subsequent release, then during an upgrade to that subsequent release the removed imagestream could be incorrectly tracked as needing imagestreamimports to complete. Since no imagestreamimports occur for it, the samples operator would not report its upgrade as complete.
Consequence: The overall upgrade would be marked as failed if the problematic timing windows in the samples operator occurred.
Fix: The samples operator was updated to not track imagestreams that existed in a prior release but not in the release being upgraded to.
Result: Imagestreams that are removed from one release to the next should no longer cause the samples operator to fail an upgrade.
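The idea behind the fix can be sketched in a few lines. This is only an illustration under stated assumptions, not the operator's actual code (the operator is written in Go): the function names, the list-based bookkeeping, and the imagestream names are all hypothetical.

```python
def prune_removed_imagestreams(tracked, current_release):
    """Split tracked names into those still shipped and those removed.

    Hypothetical inputs: `tracked` is the set of imagestream names being
    waited on, `current_release` is the set shipped in the new release.
    """
    current = set(current_release)
    kept = [name for name in tracked if name in current]
    removed = [name for name in tracked if name not in current]
    return kept, removed


def still_pending(tracked, imported):
    """Names that are tracked but have not finished importing."""
    done = set(imported)
    return [name for name in tracked if name not in done]


# Hypothetical names, for illustration only.
tracked = ["ruby", "nodejs", "legacy-sample"]
new_release = ["ruby", "nodejs"]
imported = ["ruby", "nodejs"]  # no import ever runs for a removed stream

kept, removed = prune_removed_imagestreams(tracked, new_release)

print(still_pending(tracked, imported))  # ['legacy-sample'] -> never completes
print(still_pending(kept, imported))     # [] -> upgrade can complete
```

Without the pruning step, the removed imagestream stays pending indefinitely, which matches the "stuck in progressing" symptom in the bug title.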
Clone Of: 1811143
Environment:
Last Closed: 2020-06-17 20:27:11 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-samples-operator pull 243 | 0 | None | closed | [release-4.3] Bug 1811206: purge removed imagestreams as a part of upgrade from progessing/impor… | 2020-06-05 17:49:05 UTC
Github | openshift cluster-samples-operator pull 280 | 0 | None | closed | [release-4.3] Bug 1841996: correctly handle removed templates watch events as part of an upgrade | 2020-06-08 02:04:42 UTC
Red Hat Product Errata | RHBA-2020:2436 | 0 | None | None | None | 2020-06-17 20:28:16 UTC

Comment 1 Gabe Montero 2020-03-19 15:23:48 UTC
PR https://github.com/openshift/cluster-samples-operator/pull/243 is up, but the Bugzilla bot is complaining that the dependent bug targets 4.5 instead of 4.4.

Comment 4 XiuJuan Wang 2020-05-06 07:56:34 UTC
We still hit https://bugzilla.redhat.com/show_bug.cgi?id=1828065; its fix should be backported to 4.3.z.

After upgrading from 4.2.0-0.nightly-2020-05-05-113123 to 4.3.0-0.nightly-2020-05-04-051714:
$ oc get co openshift-samples -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-05-06T04:21:09Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "65829"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: 02e57dae-8f51-11ea-8c83-0a9cd2bbc20c
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-05-06T07:02:03Z"
    status: "False"
    type: Available
  - lastTransitionTime: "2020-05-06T07:02:03Z"
    message: 'Samples installation in error at 4.3.0-0.nightly-2020-05-04-051714:
      APIServerError'
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-05-06T07:02:03Z"
    message: 'open : no such file or directory error reading file [];'
    reason: APIServerError
    status: "True"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
  versions:
  - name: operator
    version: 4.3.0-0.nightly-2020-05-04-051714

$ oc logs -f cluster-samples-operator-746f66c95f-vc67w -c cluster-samples-operator
time="2020-05-06T07:48:23Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2020-05-06T07:48:23Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2020-05-06T07:48:23Z" level=info msg="creation/update of credential in openshift namespace recognized"
time="2020-05-06T07:49:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-06T07:49:46Z" level=info msg="CRDUPDATE event temp udpate err"
time="2020-05-06T07:49:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-06T07:49:46Z" level=info msg="CRDUPDATE event temp udpate err"
time="2020-05-06T07:49:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-06T07:49:46Z" level=info msg="CRDUPDATE event temp udpate err"
time="2020-05-06T07:49:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-06T07:49:46Z" level=info msg="CRDUPDATE event temp udpate err"

Comment 5 Gabe Montero 2020-05-06 17:25:58 UTC
OK so that bug is getting in the way

The 4.4 clone of https://bugzilla.redhat.com/show_bug.cgi?id=1828065 is https://bugzilla.redhat.com/show_bug.cgi?id=1832344 with PR https://github.com/openshift/cluster-samples-operator/pull/269
currently up.

Once it merges, I'll clone / cherry pick back to 4.3.z

Comment 7 Gabe Montero 2020-05-08 13:29:32 UTC
The PR is already linked and marked bz valid but the bot was unable to move this to POST (probably got hung up during a recent git outage)

Manually moving to POST

Comment 8 Gabe Montero 2020-05-29 02:46:49 UTC
concur with Adam's analysis in https://github.com/openshift/cluster-samples-operator/pull/243#issuecomment-623593592

moving back to Modified

Comment 11 XiuJuan Wang 2020-06-01 08:48:43 UTC
I can still reproduce https://bugzilla.redhat.com/show_bug.cgi?id=1828065

After upgrading the cluster from 4.2.24 to 4.3.0-0.nightly-2020-06-01-043839, openshift-samples keeps reporting Progressing:
$ oc get co openshift-samples -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-06-01T03:08:51Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "109520"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: 38592dce-a3b5-11ea-9a14-000d3a9c36e3
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-06-01T08:40:52Z"
    message: Samples installation successful at 4.3.0-0.nightly-2020-06-01-043839
    status: "True"
    type: Available
  - lastTransitionTime: "2020-06-01T08:40:52Z"
    message: Samples processing to 4.3.0-0.nightly-2020-06-01-043839
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-06-01T08:40:52Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
  versions:
  - name: operator
    version: 4.3.0-0.nightly-2020-06-01-043839
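
The conditions above show Available and Progressing both set to "True". A minimal sketch of how such a status could be checked in a script follows; the helper names are hypothetical, and the rule that an upgrade counts as settled only when Available is true and Progressing and Degraded are both false is an assumption for illustration, not something the operator or the CVO is confirmed to use verbatim.

```python
def condition_states(conditions):
    """Return {type: bool} for a ClusterOperator status.conditions list."""
    return {c["type"]: c["status"] == "True" for c in conditions}


def upgrade_settled(conditions):
    """True only when Available, not Progressing, and not Degraded.

    A missing condition is treated as "not settled" (an assumption).
    """
    states = condition_states(conditions)
    return (states.get("Available", False)
            and not states.get("Progressing", True)
            and not states.get("Degraded", True))


# Condition types and statuses copied from the output above.
reported = [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "True"},
    {"type": "Degraded", "status": "False"},
]

print(upgrade_settled(reported))  # False: Progressing is still "True"
```

The same dictionaries can be obtained by loading the output of `oc get co openshift-samples -o json` with the standard `json` module and reading `status.conditions`.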

$ oc logs -f cluster-samples-operator-544c95bdbd-m694r -c cluster-samples-operator | grep -i such
========snip=======================
time="2020-06-01T08:40:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-06-01T08:40:48Z" level=info msg="open : no such file or directory error reading file []"
time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"
time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"
time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"
========snip=======================
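
For verification it can help to pull out when the error was logged rather than just grep for it. The sketch below parses lines in the `time="..." level=... msg="..."` layout shown above; the regular expression and the helper name are assumptions made for illustration, and the sample lines are the ones from the snippet.

```python
import re

# Matches the key="value" layout seen in the operator log lines above.
LOG_LINE = re.compile(
    r'^time="(?P<time>[^"]*)" level=(?P<level>\w+) msg="(?P<msg>.*)"$'
)


def timestamps_matching(lines, needle):
    """Return the timestamps of log lines whose msg contains `needle`."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and needle in match.group("msg"):
            hits.append(match.group("time"))
    return hits


# Lines copied from the snippet above.
sample = [
    'time="2020-06-01T08:40:46Z" level=info msg="open : no such file or directory error reading file []"',
    'time="2020-06-01T08:40:48Z" level=info msg="open : no such file or directory error reading file []"',
    'time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"',
    'time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"',
    'time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"',
]

hits = timestamps_matching(sample, "no such file or directory")
print(len(hits), hits[0], hits[-1])
# 5 2020-06-01T08:40:46Z 2020-06-01T08:45:24Z
```

An empty result on a fixed build would correspond to the "no such file or directory" error no longer appearing in the pod log.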

Comment 12 Gabe Montero 2020-06-01 12:09:01 UTC
Hmmm ... this was the bug with the inadvertent merge and removal from the advisory @XiuJuan.

Given the snippet you posted from the logs, I suspect there was a disconnect with the commit 
getting into dist git and the nightlies.

We had another instance of this recently.

I'll dig up my aos-art threads and investigate / pursue with them.

Comment 13 Gabe Montero 2020-06-01 12:52:20 UTC
I remember now... this is not an issue with the inadvertent merge.

Back in master it took 2 attempts at fixing upgrades with both removed imagestreams *AND* removed templates.

For the upgrade to miss all the timing windows, we need both this BZ/PR in 4.3.z *AND*

https://github.com/openshift/cluster-samples-operator/pull/280
https://bugzilla.redhat.com/show_bug.cgi?id=1841996

That PR is awaiting cherry-pick approval, which should happen this week.

Essentially, we have to verify both upgrade bugs together depending on which timing window gets hit.

Moving this to POST for now.  Will move to modified once PR 280 merges.

Comment 16 Scott Dodson 2020-06-03 00:09:09 UTC
I'm just going to drop this bug from the errata since we're waiting on https://github.com/openshift/cluster-samples-operator/pull/280

Comment 18 Gabe Montero 2020-06-05 17:51:19 UTC
OK 280 has merged 

moving back to MODIFIED

Comment 21 XiuJuan Wang 2020-06-08 02:22:22 UTC
After upgrading from 4.2 to 4.3.0-0.nightly-2020-06-05-003056, openshift-samples no longer stays in Progressing, and the "open : no such file or directory error reading file []" error no longer appears in the samples operator pod log.

Comment 23 errata-xmlrpc 2020-06-17 20:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2436

