Bug 2097067

Summary: ClusterVersion history pruner does not always retain initial completed update entry
Product: OpenShift Container Platform
Reporter: Jack Ottofaro <jack.ottofaro>
Component: Cluster Version Operator
Assignee: Jack Ottofaro <jack.ottofaro>
Status: CLOSED ERRATA
QA Contact: Evgeni Vakhonin <evakhoni>
Severity: medium
Priority: medium
Version: 4.11
CC: aos-team-ota, wking
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-08-10 11:17:59 UTC
Type: Bug
Bug Blocks: 2108292

Description Jack Ottofaro 2022-06-14 20:25:17 UTC
Description of problem:

The CVO's ClusterVersion (CV) history pruner has a maxHistory of 50. Whenever an entry is added and the history length exceeds maxHistory, the pruner should remove the first partial update that was added. If there are no partial updates, it simply removes the entry at index 0, which would be the first completed update added.

Instead, when the history length exceeds maxHistory, the entry at index 0 is removed regardless of whether it is a partial or a completed update.

How reproducible:

The Test_pruneStatusHistory unit test case was modified to demonstrate this behavior.

Comment 2 Jack Ottofaro 2022-06-22 18:41:16 UTC
I added a new var, MaxHistory, here [1]. Perhaps build your own release, drop MaxHistory to 3 (or something reasonable), and perform that many upgrades. The first completed installed version will end up at index MaxHistory and should always be retained. Whenever a version must be removed, it should be the one at index MaxHistory-1. On the other hand, if the first installed version had not completed, it would be removed.

Otherwise you're looking at a lot of upgrades, although you don't have to wait for them to complete.

[1] https://github.com/openshift/cluster-version-operator/blob/dc927a4c63e2d9fb7f469ecb77503687a60c6564/pkg/cvo/status.go#L35

Comment 4 Evgeni Vakhonin 2022-06-23 17:56:33 UTC
Reproduced with an update from 4.11.0-0.nightly-2022-06-21-040754 to 4.11.0-0.nightly-2022-06-21-151125.

Ran the following in a loop in Python:
1) Upgrade to 4.11.0-0.nightly-2022-06-21-151125:
$ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image=.....
2) As soon as the message of the CVO's 'Progressing' condition in .status.conditions[] contained '4.11.0-0.nightly-2022-06-21-151125', commanded a rollback to 06-21-040754.
3) Looped until the history length was 50:

45 : 06-23T12:02:36Z 06-21-040754 - Partial 
46 : 06-23T10:47:20Z 06-21-151125 - Completed 
47 : 06-23T10:43:54Z 06-21-040754 - Partial 
48 : 06-23T10:22:19Z 06-21-151125 - Partial 
49 : 06-23T09:13:15Z 06-21-040754 - Completed <-- this is the first history entry, the initially installed version

4) Did one final rollback to 06-21-040754:

46 : 06-23T12:02:36Z 06-21-040754 - Partial 
47 : 06-23T10:47:20Z 06-21-151125 - Completed 
48 : 06-23T10:43:54Z 06-21-040754 - Partial 
49 : 06-23T10:22:19Z 06-21-151125 - Partial 
                                            <-- the first history entry was removed

Result: the "Completed" entry from T09:13:15 was removed instead of the next "Partial" entry from T10:22:19.

Comment 5 Jack Ottofaro 2022-06-23 18:45:47 UTC
(In reply to Evgeni Vakhonin from comment #4)

Can you attach the CVO log after running the test? Also, send me your Python script and I'll give it a run. Thanks.

Comment 6 Evgeni Vakhonin 2022-06-23 18:49:09 UTC
Verifying with an update from 4.11.0-0.nightly-2022-06-22-235234 to 4.11.0-0.nightly-2022-06-23-044003, using the same method as comment #4.

Looped until the history length was 50:

45 : 2022-06-23T18:12:47Z 06-22-235234 - Partial 
46 : 2022-06-23T18:12:28Z 06-23-044003 - Partial 
47 : 2022-06-23T18:12:11Z 06-22-235234 - Partial 
48 : 2022-06-23T18:11:50Z 06-23-044003 - Partial    <-- this should be removed
49 : 2022-06-23T17:40:13Z 06-22-235234 - Completed  <-- this should not

After one more iteration:

46 : 2022-06-23T18:12:47Z 06-22-235234 - Partial 
47 : 2022-06-23T18:12:28Z 06-23-044003 - Partial 
48 : 2022-06-23T18:12:11Z 06-22-235234 - Partial 
                                                          <-- success!!
49 : 2022-06-23T17:40:13Z 06-22-235234 - Completed 


Verified successfully!

Comment 12 errata-xmlrpc 2022-08-10 11:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069