Bug 1905599 - Errant change to lastupdatetime in copied CSV status can trigger runaway csv syncs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ben Luddy
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1905624
Depends On:
Blocks: 1906416
Reported: 2020-12-08 15:53 UTC by Evan Cordell
Modified: 2022-10-11 06:35 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1906416
Environment:
Last Closed: 2021-02-24 15:41:14 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | operator-framework operator-lifecycle-manager pull 1892 | 0 | None | closed | Bug 1905599: Preserve original .status.lastUpdateTime in copied CSVs. | 2021-02-16 17:07:58 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:41:37 UTC

Description Evan Cordell 2020-12-08 15:53:00 UTC
Description of problem:

When a CSV is created and copied across target namespaces, the lastUpdateTime timestamp on a copy may not match that of the original CSV. This triggers a runaway sync in which the copied CSV never converges to match the original.


Version-Release number of selected component (if applicable): 4.7


How reproducible: Always


Steps to Reproduce:
1. Create a large number of namespaces
2. Install an AllNamespaces operator

Actual results:

Copied CSVs are constantly reconciled and never settle.


Expected results:

A small spike to copy CSVs followed by no further changes.


Additional info:

This is triggered only when the CSV copy takes place at a different time than the original CSV was last updated. This is more likely on clusters with many namespaces.
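
The non-convergence can be illustrated with a toy model. This is a minimal Python sketch, not OLM's actual Go code: the `Status` type, the two copy functions, and the one-pass-per-second sync loop are hypothetical and exist only for illustration. It assumes the sync rewrites a copy whenever the copy's status differs from the original's, and contrasts stamping the copy with the current time against preserving the original's lastUpdateTime, which is what the title of the linked fix describes.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Status:
    """Hypothetical stand-in for the relevant part of a CSV's .status."""
    phase: str
    last_update_time: int  # seconds; stands in for a real timestamp


def copy_stamping_now(original: Status, now: int) -> Status:
    # Assumed buggy behaviour: the copy gets the time of the copy operation.
    return replace(original, last_update_time=now)


def copy_preserving_time(original: Status, now: int) -> Status:
    # Assumed fixed behaviour: the original's lastUpdateTime is kept as-is.
    return replace(original)


def count_writes(copy_fn: Callable[[Status, int], Status],
                 original: Status, start: int, passes: int) -> int:
    """Run one sync pass per second and count how often the copy is rewritten."""
    copied: Optional[Status] = None
    writes = 0
    for now in range(start, start + passes):
        if copied != original:  # copy is missing or out of sync
            copied = copy_fn(original, now)
            writes += 1
    return writes


original = Status(phase="Succeeded", last_update_time=100)
runaway = count_writes(copy_stamping_now, original, start=105, passes=5)
settled = count_writes(copy_preserving_time, original, start=105, passes=5)
print(runaway, settled)
```

In this model the time-stamping copy is rewritten on every pass, while the time-preserving copy is written once and then left alone. If the first copy happens in the same second as the original's last update (`start=100`), even the time-stamping variant settles, which mirrors the condition described above.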

Comment 2 Evan Cordell 2020-12-09 14:45:13 UTC
*** Bug 1905624 has been marked as a duplicate of this bug. ***

Comment 3 Jian Zhang 2020-12-10 04:02:03 UTC
Cluster version is 4.7.0-0.nightly-2020-12-09-112139
[root@preserve-olm-env data]# oc -n openshift-operator-lifecycle-manager exec catalog-operator-5bff7985dc-bc764  -- olm --version
OLM version: 0.17.0
git commit: 2294bcc907c834c160c5b99fbf15988d0706853c

LGTM, verifying it.

1. Subscribe to a cluster-scoped operator, such as etcd.
[root@preserve-olm-env data]# oc get sub -A
NAMESPACE                    NAME                     PACKAGE                  SOURCE                CHANNEL
openshift-operators          etcd                     etcd                     community-operators   clusterwide-alpha

[root@preserve-olm-env data]# oc get csv -n openshift-operators 
NAME                                           DISPLAY                            VERSION                 REPLACES                                       PHASE
etcdoperator.v0.9.4-clusterwide                etcd                               0.9.4-clusterwide       etcdoperator.v0.9.2-clusterwide                Succeeded

2. Create many namespaces.
3. Check whether the lastUpdateTime of the copied CSV is the same as that of the original CSV.
[root@preserve-olm-env data]# oc get csv -n jian4 etcdoperator.v0.9.4-clusterwide -o yaml
...
  - lastTransitionTime: "2020-12-09T08:59:52Z"
    lastUpdateTime: "2020-12-09T08:59:52Z"
    message: install strategy completed with no errors
    phase: Succeeded
    reason: InstallSucceeded

[root@preserve-olm-env data]# oc get csv -n openshift-operators etcdoperator.v0.9.4-clusterwide  -o yaml
...
  - lastTransitionTime: "2020-12-09T08:59:52Z"
    lastUpdateTime: "2020-12-09T08:59:52Z"
    message: install strategy completed with no errors
    phase: Succeeded
    reason: InstallSucceeded
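
The manual comparison above could be automated along these lines. This is a minimal Python sketch that works on the text of `oc get csv ... -o yaml` output saved to strings; the function names are hypothetical, and it relies on a simple regular expression instead of a YAML parser, so it assumes `lastUpdateTime` appears on its own line as in the output shown.

```python
import re
from typing import List

# Matches lines such as:    lastUpdateTime: "2020-12-09T08:59:52Z"
_LAST_UPDATE = re.compile(r'^\s*lastUpdateTime:\s*"?([^"\s]+)"?\s*$', re.MULTILINE)


def last_update_times(csv_yaml: str) -> List[str]:
    """Return every lastUpdateTime value found in the YAML text, in order."""
    return _LAST_UPDATE.findall(csv_yaml)


def copy_matches_original(original_yaml: str, copy_yaml: str) -> bool:
    """True when both documents carry the same, non-empty list of timestamps."""
    original_times = last_update_times(original_yaml)
    return bool(original_times) and original_times == last_update_times(copy_yaml)


# Excerpt taken from the output above.
sample = '''  - lastTransitionTime: "2020-12-09T08:59:52Z"
    lastUpdateTime: "2020-12-09T08:59:52Z"
    message: install strategy completed with no errors
    phase: Succeeded
    reason: InstallSucceeded
'''
print(copy_matches_original(sample, sample))
```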

Comment 4 Ruairi Hayes 2020-12-11 16:26:35 UTC
Hi ecordell,

I've tested the CSV update frequency issue and it still seems to be present in OCP 4.6.8.
See the attached OCP 4.6.8 Cluster settings page:
https://bugzilla.redhat.com/attachment.cgi?id=1738461


Over 5 minutes there were 6549 PUT operations on etcd, and 3899 of those were to CSVs:

sh-4.4# cat etcd_watch.log | grep "Key" | wc -l
6549
sh-4.4# cat etcd_watch.log | grep "Key" | grep "clusterserviceversions" | wc -l
3899
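
The two grep pipelines above can be reproduced in one pass, with the share of CSV writes computed alongside. This is a minimal Python sketch; the function name is hypothetical, and it assumes only what the grep commands imply about the log (a write is any line containing "Key", and a CSV write additionally contains "clusterserviceversions").

```python
from typing import Iterable, Tuple


def csv_write_share(lines: Iterable[str]) -> Tuple[int, int, float]:
    """Count lines containing "Key" and those that also mention CSVs."""
    total = 0
    csv_writes = 0
    for line in lines:
        if "Key" in line:
            total += 1
            if "clusterserviceversions" in line:
                csv_writes += 1
    share = csv_writes / total if total else 0.0
    return total, csv_writes, share


# Share implied by the counts reported above (3899 of 6549).
reported_share = 3899 / 6549
print(f"{reported_share:.1%}")
```

With the counts above, CSV writes account for roughly 60% of all PUT operations observed in the 5-minute window.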

In one namespace I can see the lastUpdateTime of the CSV still incrementing, so I suspect the fix is not in place:

[ruairi@localhost ibm-apicatalog]$ oc get csv ibm-apiconnect.v2.1.0 -o yaml | grep lastUpdateTime
  lastUpdateTime: "2020-12-11T16:08:55Z"
[ruairi@localhost ibm-apicatalog]$ oc get csv ibm-apiconnect.v2.1.0 -o yaml | grep lastUpdateTime
  lastUpdateTime: "2020-12-11T16:09:15Z"
[ruairi@localhost ibm-apicatalog]$ oc get csv ibm-apiconnect.v2.1.0 -o yaml | grep lastUpdateTime
  lastUpdateTime: "2020-12-11T16:10:13Z"
[ruairi@localhost ibm-apicatalog]$ oc get csv ibm-apiconnect.v2.1.0 -o yaml | grep lastUpdateTime
  lastUpdateTime: "2020-12-11T16:10:52Z"
[ruairi@localhost ibm-apicatalog]$ oc get csv ibm-apiconnect.v2.1.0 -o yaml | grep lastUpdateTime
  lastUpdateTime: "2020-12-11T16:11:28Z"
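
The samples above can be turned into update intervals to confirm that the timestamp keeps moving. This is a minimal Python sketch; the function name is hypothetical, and the timestamps are copied from the output above.

```python
from datetime import datetime
from typing import List


def update_intervals(timestamps: List[str]) -> List[int]:
    """Return the gaps, in seconds, between consecutive lastUpdateTime samples."""
    parsed = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%SZ") for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


observed = [
    "2020-12-11T16:08:55Z",
    "2020-12-11T16:09:15Z",
    "2020-12-11T16:10:13Z",
    "2020-12-11T16:10:52Z",
    "2020-12-11T16:11:28Z",
]
intervals = update_intervals(observed)
still_changing = any(delta != 0 for delta in intervals)
print(intervals, still_changing)
```

A settled CSV would produce intervals of zero between samples; here every sample differs from the previous one.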
  
  
The attached memory metrics graph also shows memory rising:
https://bugzilla.redhat.com/attachment.cgi?id=1738462

Can you confirm that the fix didn't make 4.6.8 and that it should be available in the next release?

Comment 5 Ben Luddy 2020-12-11 17:46:04 UTC
Hi Ruairi, the 4.6 backport only merged a couple hours ago due to a test infrastructure problem. The progress of that backport is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1906416. At the moment, it's awaiting QE verification. Since the backport is also marked as urgent, it should be verified soon and I'd expect it to be present for the following z-release.

Comment 6 Ruairi Hayes 2020-12-15 16:52:36 UTC
Hi bluddy, 
Do you have a timeline for when the 4.6.9 release, which will have this fix in it, is scheduled?
Thanks,

Ruairi

Comment 8 Daniel Sover 2020-12-21 20:03:11 UTC
4.6.9 is now released and has this hotfix in the payload -- ready for testing

Comment 17 errata-xmlrpc 2021-02-24 15:41:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

