Bug 1811206

Summary: upgrades where entire sample imagestreams were removed in the new version can get stuck in progressing
Product: OpenShift Container Platform Reporter: Gabe Montero <gmontero>
Component: Samples Assignee: Gabe Montero <gmontero>
Status: CLOSED ERRATA QA Contact: XiuJuan Wang <xiuwang>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.3.z CC: eparis, xiuwang
Target Milestone: --- Keywords: Upgrades
Target Release: 4.3.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: If a sample imagestream available in a prior release was removed in a subsequent release, then during an upgrade to that subsequent release the removed imagestream could be incorrectly tracked as needing imagestreamimports to complete. Since no imagestreamimports occur for it, the samples operator never reports its upgrade as complete.
Consequence: The overall upgrade would be marked as failed if the problematic timing windows in the samples operator occurred.
Fix: The samples operator was updated to not track imagestreams that existed in a prior release but not in the release being upgraded to.
Result: Imagestreams that are removed from one release to the next should no longer cause the samples operator to fail an upgrade.
Story Points: ---
Clone Of: 1811143 Environment:
Last Closed: 2020-06-17 20:27:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1811204    
Bug Blocks:    
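
The behavior described in the Doc Text can be illustrated with a minimal Python sketch. This is not the operator's actual code (the real change is in the PR linked in the comments); the function names and the imagestream names below are hypothetical and exist only to show why tracking a removed imagestream prevents completion.

```python
def imagestreams_to_track(on_cluster, in_new_release):
    """Return the imagestreams whose imports should be awaited during upgrade.

    Only imagestreams that ship in the release being upgraded to are kept;
    ones that existed only in the prior release are dropped.
    """
    shipped = set(in_new_release)
    return [name for name in on_cluster if name in shipped]


def upgrade_complete(tracked, imported):
    """True once every tracked imagestream has finished importing."""
    done = set(imported)
    return all(name in done for name in tracked)


# Hypothetical names: "legacy-sample" stands for an imagestream removed in the new release.
on_cluster = ["ruby", "nodejs", "legacy-sample"]
in_new_release = ["ruby", "nodejs"]
imported = ["ruby", "nodejs"]  # no import ever happens for the removed imagestream

# Without filtering, the removed imagestream is waited on forever.
print(upgrade_complete(on_cluster, imported))  # False

# With filtering, only imagestreams in the new release are tracked.
print(upgrade_complete(imagestreams_to_track(on_cluster, in_new_release), imported))  # True
```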

Comment 1 Gabe Montero 2020-03-19 15:23:48 UTC
PR https://github.com/openshift/cluster-samples-operator/pull/243 is up but we have bugzilla bot hoopla complaining the dependent bug is 4.5 instead of 4.4

Comment 4 XiuJuan Wang 2020-05-06 07:56:34 UTC
We still hit https://bugzilla.redhat.com/show_bug.cgi?id=1828065; the fix for it should be backported to 4.3.z.

After upgrading from 4.2.0-0.nightly-2020-05-05-113123 to 4.3.0-0.nightly-2020-05-04-051714:
$ oc get co openshift-samples -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-05-06T04:21:09Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "65829"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: 02e57dae-8f51-11ea-8c83-0a9cd2bbc20c
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-05-06T07:02:03Z"
    status: "False"
    type: Available
  - lastTransitionTime: "2020-05-06T07:02:03Z"
    message: 'Samples installation in error at 4.3.0-0.nightly-2020-05-04-051714:
      APIServerError'
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-05-06T07:02:03Z"
    message: 'open : no such file or directory error reading file [];'
    reason: APIServerError
    status: "True"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
  versions:
  - name: operator
    version: 4.3.0-0.nightly-2020-05-04-051714

$ oc logs -f cluster-samples-operator-746f66c95f-vc67w -c cluster-samples-operator
time="2020-05-06T07:48:23Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2020-05-06T07:48:23Z" level=info msg="processing secret watch event while in Managed state; deletion event: false"
time="2020-05-06T07:48:23Z" level=info msg="creation/update of credential in openshift namespace recognized"
time="2020-05-06T07:49:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-06T07:49:46Z" level=info msg="CRDUPDATE event temp udpate err"
time="2020-05-06T07:49:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-06T07:49:46Z" level=info msg="CRDUPDATE event temp udpate err"
time="2020-05-06T07:49:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-06T07:49:46Z" level=info msg="CRDUPDATE event temp udpate err"
time="2020-05-06T07:49:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-06T07:49:46Z" level=info msg="CRDUPDATE event temp udpate err"

Comment 5 Gabe Montero 2020-05-06 17:25:58 UTC
OK so that bug is getting in the way

The 4.4 clone of https://bugzilla.redhat.com/show_bug.cgi?id=1828065 is https://bugzilla.redhat.com/show_bug.cgi?id=1832344 with PR https://github.com/openshift/cluster-samples-operator/pull/269
currently up.

Once it merges, I'll clone / cherry pick back to 4.3.z

Comment 7 Gabe Montero 2020-05-08 13:29:32 UTC
The PR is already linked and marked bz valid but the bot was unable to move this to POST (probably got hung up during a recent git outage)

Manually moving to POST

Comment 8 Gabe Montero 2020-05-29 02:46:49 UTC
concur with Adam's analysis in https://github.com/openshift/cluster-samples-operator/pull/243#issuecomment-623593592

moving back to Modified

Comment 11 XiuJuan Wang 2020-06-01 08:48:43 UTC
I can still reproduce https://bugzilla.redhat.com/show_bug.cgi?id=1828065

After upgrading the cluster from 4.2.24 to 4.3.0-0.nightly-2020-06-01-043839, openshift-samples keeps returning to the Progressing state:
$ oc get co openshift-samples -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-06-01T03:08:51Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "109520"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: 38592dce-a3b5-11ea-9a14-000d3a9c36e3
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-06-01T08:40:52Z"
    message: Samples installation successful at 4.3.0-0.nightly-2020-06-01-043839
    status: "True"
    type: Available
  - lastTransitionTime: "2020-06-01T08:40:52Z"
    message: Samples processing to 4.3.0-0.nightly-2020-06-01-043839
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-06-01T08:40:52Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
  versions:
  - name: operator
    version: 4.3.0-0.nightly-2020-06-01-043839

$ oc logs -f cluster-samples-operator-544c95bdbd-m694r -c cluster-samples-operator | grep -i such
========snip=======================
time="2020-06-01T08:40:46Z" level=info msg="open : no such file or directory error reading file []"
time="2020-06-01T08:40:48Z" level=info msg="open : no such file or directory error reading file []"
time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"
time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"
time="2020-06-01T08:45:24Z" level=info msg="open : no such file or directory error reading file []"
========snip=======================
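
When comparing ClusterOperator dumps like the ones pasted in this bug, it can help to reduce them to the three condition statuses. The following is a minimal Python sketch that assumes the object was fetched as JSON (for example with `oc get co openshift-samples -o json`); the helper names and the "not settled" heuristic are illustrative assumptions, not part of any OpenShift tooling.

```python
import json


def summarize_conditions(cluster_operator):
    """Map each condition type to its status string for a ClusterOperator dict."""
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    return {c["type"]: c["status"] for c in conditions}


def not_settled(cluster_operator):
    """Heuristic (an assumption): Progressing=True or Degraded=True means not settled."""
    summary = summarize_conditions(cluster_operator)
    return summary.get("Progressing") == "True" or summary.get("Degraded") == "True"


# Trimmed to the condition fields shown in the YAML output above.
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Available", "status": "True"},
      {"type": "Progressing", "status": "True"},
      {"type": "Degraded", "status": "False"}
    ]
  }
}
""")

print(summarize_conditions(sample))
print(not_settled(sample))  # True
```

A single snapshot with Progressing=True does not prove the operator is stuck; the check would need to be repeated over time to distinguish a stuck upgrade from one that is still legitimately in progress.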

Comment 12 Gabe Montero 2020-06-01 12:09:01 UTC
Hmmm ... this was the bug with the inadvertent merge and removal from the advisory @XiuJuan.

Given the snippet you posted from the logs, I suspect there was a disconnect with the commit 
getting into dist git and the nightlies.

We had another instance of this recently.

I'll dig up my aos-art threads and investigate / pursue with them.

Comment 13 Gabe Montero 2020-06-01 12:52:20 UTC
I remember now ... this is not an issue with the inadvertent merge.

Back in master it took 2 attempts at fixing upgrades with both removed imagestreams *AND* removed templates.

For the upgrade to miss all the timing windows, we need both this BZ/PR in 4.3.z *AND*

https://github.com/openshift/cluster-samples-operator/pull/280
https://bugzilla.redhat.com/show_bug.cgi?id=1841996

That PR is waiting cherrypick approval, which should happen this week.

Essentially, we have to verify both upgrade bugs together depending on which timing window gets hit.

Moving this to POST for now.  Will move to modified once PR 280 merges.

Comment 16 Scott Dodson 2020-06-03 00:09:09 UTC
I'm just going to drop this bug from the errata since we're waiting on https://github.com/openshift/cluster-samples-operator/pull/280

Comment 18 Gabe Montero 2020-06-05 17:51:19 UTC
OK 280 has merged 

moving back to MODIFIED

Comment 21 XiuJuan Wang 2020-06-08 02:22:22 UTC
After upgrading from 4.2 to 4.3.0-0.nightly-2020-06-05-003056, openshift-samples no longer stays in the Progressing state, and the "open : no such file or directory error reading file []" error no longer appears in the samples operator pod logs.

Comment 23 errata-xmlrpc 2020-06-17 20:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2436