Cause: When updating a Catalog Source, a Get call is immediately followed by a Delete call on a number of resources related to the Catalog Source.
Consequence: In some instances, the resource has already been deleted from the cluster but still exists in the cache. The Get call therefore succeeds, but the Delete call that follows fails because the resource no longer exists on the cluster. As a result, the catalog address is not updated to the new source.
Fix: Updated OLM to ignore the error returned by the Delete call if the resource is not found.
Result: OLM no longer reports an error when updating a catalog due to a caching issue that results in a "Resource Not Found" error from the delete call.
Created attachment 1792003
openshift-operator-lifecycle-manager log
Description of problem:
After upgrading from 4.5.40-x86_64 to 4.6.35-x86_64, the .status.connectionState.address of catsrc certified-operators is not correct:
[root@preserve-olm-agent-test ~]# oc get catsrc certified-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    operatorframework.io/managed-by: marketplace-operator
  creationTimestamp: "2021-06-18T01:08:51Z"
  generation: 2
  labels:
    olm-visibility: hidden
    openshift-marketplace: "true"
    opsrc-datastore: "true"
    opsrc-provider: certified
  name: certified-operators
  namespace: openshift-marketplace
  resourceVersion: "211535"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/openshift-marketplace/catalogsources/certified-operators
  uid: 92c1031b-245b-4292-92fc-d958019fc1c5
spec:
  displayName: Certified Operators
  icon:
    base64data: ""
    mediatype: ""
  image: registry.redhat.io/redhat/certified-operator-index:v4.6
  priority: -200
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
status:
  connectionState:
    address: '..svc:'
    lastConnect: "2021-06-18T08:01:33Z"
    lastObservedState: TRANSIENT_FAILURE
  latestImageRegistryPoll: "2021-06-18T07:53:29Z"
  registryService:
    createdAt: "2021-06-18T01:08:52Z"
    protocol: grpc
time="2021-06-18T04:12:45Z" level=error msg="failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: lookup ..svc: no such host\"" catalog="{certified-operators openshift-marketplace}"
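One plausible reading of the '..svc:' address, offered here as an assumption and not confirmed by this report: if the address is built by joining a service name, namespace, and port in the form name.namespace.svc:port, then a registryService status with those three fields empty (as in the output above, which shows only createdAt and protocol) yields exactly '..svc:'. The struct, its field names, and the port value below are hypothetical and only illustrate the formatting.

```go
package main

import "fmt"

// registryService is a hypothetical mirror of .status.registryService.
// The field names are assumptions made for this illustration.
type registryService struct {
	ServiceName      string
	ServiceNamespace string
	Port             string
}

// address joins the fields as name.namespace.svc:port (assumed format).
func (s registryService) address() string {
	return fmt.Sprintf("%s.%s.svc:%s", s.ServiceName, s.ServiceNamespace, s.Port)
}

func main() {
	// Empty fields reproduce the malformed address seen in the status.
	fmt.Println(registryService{}.address())

	// Populated fields give a resolvable in-cluster address.
	// The port value here is an illustrative assumption.
	ok := registryService{
		ServiceName:      "certified-operators",
		ServiceNamespace: "openshift-marketplace",
		Port:             "50051",
	}
	fmt.Println(ok.address())
}
```

Under that assumption, the "lookup ..svc: no such host" error in the log is the gRPC client trying to resolve the empty-field address.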
Version-Release number of selected component (if applicable):
upgrade from 4.5.40-x86_64 to 4.6.35-x86_64
[root@preserve-olm-agent-test ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.35    True        False         4h30m   Cluster version is 4.6.35
How reproducible:
not always
Steps to Reproduce:
1. Upgrade from 4.5.40-x86_64 to 4.6.35-x86_64
Actual results:
.status.connectionState.address of catsrc certified-operators is not correct
Expected results:
.status.connectionState.address of catsrc certified-operators is correct
Additional info:
Attached is the log from the openshift-operator-lifecycle-manager namespace.
Checked the upgrade CI results; they look good so far, and the issue was not found on release 4.9. Marking as verified.
Test case "[upgrade] Check the marketplace status" passed.
LGTM, verified.
(In reply to Kevin Rizza from comment #5)
> Looks like https://bugzilla.redhat.com/show_bug.cgi?id=1967621 was resolved.
> We believe this is likely the same issue. Can QE confirm and, if so, mark
> this one as a duplicate?
(In reply to xzha from comment #7)
> LGTM, verified.
Do we really want both this bug and bug 1967621 in the 4.9 errata? I thought the confirmation from comment 7 would lead to this being closed as a dup, per the request in comment 5.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:3759