Bug 1802553 - CVO allows new unforced updates even when it is currently midway through a partial update. It should require a force to retarget mid-update
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Lalatendu Mohanty
QA Contact: liujia
URL:
Whiteboard:
Duplicates: 1947566 2069480 2083988
Depends On:
Blocks:
Reported: 2020-02-13 11:55 UTC by Andre Costa
Modified: 2023-03-24 17:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-18 20:29:17 UTC
Target Upstream Version:
Embargoed:



Description Andre Costa 2020-02-13 11:55:42 UTC
We have a customer who triggered an upgrade from 4.1.27 to 4.3 while the intermediate 4.2 versions were still in a Partial state. We have asked the customer for CVO details to better understand the procedure they followed, but we might need to implement a way to either stop the upgrade if the customer makes a mistake, or block the upgrade if the customer changes the channel in the console to a version that the upgrade does not support, as in this case.

Comment 1 Lalatendu Mohanty 2020-02-13 12:01:20 UTC
From the version object

"history": [                                                                                                       
      {                                                                                                                
        "state": "Partial",                                
        "startedTime": "2020-02-13T08:15:27Z",                                                                         
        "completionTime": null,                            
        "version": "4.3.0",                                
        "image": "quay.io/openshift-release-dev/ocp-release@sha256:3a516480dfd68e0f87f702b4d7bdd6f6a0acfdac5cd2e9767b838ceede34d70d",
        "verified": true                                   
      },                                                   
      {                                                    
        "state": "Partial",                                
        "startedTime": "2020-01-29T16:03:01Z",                                                                         
        "completionTime": "2020-02-13T08:15:27Z",                                                                      
        "version": "4.2.16",                               
        "image": "quay.io/openshift-release-dev/ocp-release@sha256:e5a6e348721c38a78d9299284fbb5c60fb340135a86b674b038500bf190ad514",
        "verified": true                                   
      },                                                   
      {                                                    
        "state": "Partial",                                
        "startedTime": "2020-01-13T13:05:10Z",                                                                         
        "completionTime": "2020-01-29T16:03:01Z",                                                                      
        "version": "4.2.13",                               
        "image": "quay.io/openshift-release-dev/ocp-release@sha256:782b41750f3284f3c8ee2c1f8cb896896da074e362cf8a472846356d1617752d",
        "verified": true                                   
      },                                                   
      {                                                    
        "state": "Partial",                                
        "startedTime": "2019-12-11T12:38:42Z",                                                                         
        "completionTime": "2020-01-13T13:05:10Z",                                                                      
        "version": "4.2.10",                               
        "image": "quay.io/openshift-release-dev/ocp-release@sha256:dc2e38fb00085d6b7f722475f8b7b758a0cb3a02ba42d9acf8a8298a6d510d9c",
        "verified": true                                   
      },

Comment 2 Abhinav Dahiya 2020-02-13 17:32:27 UTC
Users are allowed to `FORCE` updates, and the CVO is expected to move forward, because otherwise users can get stuck.

`oc adm upgrade` prevents upgrades when there's already one in flight, but again users can force it.

Comment 3 Scott Dodson 2020-02-13 18:30:58 UTC
Please get back to us with exactly which upgrade actions were previously taken on this cluster. At first glance it appears that they have gone against our recommendations and applied updates not found in the graph.

Comment 4 W. Trevor King 2020-02-17 23:08:00 UTC
> oc adm upgrade prevents upgrades when there's already one in flight...

Linking the source for this [1], in case other folks are wondering where it is ;).

[1]: https://github.com/openshift/oc/blob/5d7a12f03389b03b651f963cb5ee8ddfa9cff559/pkg/cli/admin/upgrade/upgrade.go#L295-L300

Comment 5 W. Trevor King 2020-02-17 23:34:56 UTC
I don't actually see a CVO-side precondition for this. I'd expect one in [1], but the only ClusterVersion precondition we have now is the one around the Upgradeable condition [2]. Do we need to grow a CVO-side precondition for "and we're not currently Progressing"? Do we have one that I'm just missing? Is there some reason why we want to guard against this client-side but not in the CVO?

[1]: https://github.com/openshift/cluster-version-operator/tree/2afd105d0291006f940022b048e927ab3778ebf6/pkg/payload/precondition/clusterversion
[2]: https://github.com/openshift/cluster-version-operator/blob/2afd105d0291006f940022b048e927ab3778ebf6/pkg/payload/precondition/clusterversion/upgradeable.go#L30-L33

Comment 6 Lalatendu Mohanty 2020-02-19 13:56:26 UTC
+1 to a CVO-side precondition that blocks upgrading while the current status is "Progressing". That's the fix we should do as part of this bug.

Comment 7 W. Trevor King 2020-03-13 23:59:29 UTC
Moving to high.  Allowing folks to retarget to a 4.3 release when they are only partially through a 4.1 -> 4.2 update is really risky.

Comment 8 Clayton Coleman 2020-03-16 14:34:17 UTC
Any precondition check has to be overridable by the user, so as long as the precondition allows override we're ok doing this.
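For readers following along, here is a minimal sketch of the kind of check being discussed in comments 5 through 8: refuse a retarget while Progressing=True unless the user forces it. This is not CVO code. The function names are hypothetical, and the only thing taken from the cluster is the shape of the ClusterVersion status.conditions entries (type, status, message).

```python
from typing import Optional


def find_condition(conditions, cond_type):
    """Return the first condition dict with the given type, or None."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def check_not_progressing(conditions, force=False) -> Optional[str]:
    """Hypothetical precondition: reject a new target while an update is in flight.

    Returns None when the retarget may proceed, or a human-readable reason
    when it should be refused.
    """
    progressing = find_condition(conditions, "Progressing")
    if progressing is None or progressing.get("status") != "True":
        return None  # no update in flight, nothing to guard against
    if force:
        return None  # the user explicitly waived the guard
    return "already updating: " + progressing.get("message", "")


conditions = [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "True",
     "message": "Working towards 4.6.5: 11% complete"},
]
print(check_not_progressing(conditions))
print(check_not_progressing(conditions, force=True))
```

The force flag short-circuits the check rather than skipping the lookup, so the override requirement from comment 8 stays visible in one place.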

Comment 9 Lalatendu Mohanty 2020-05-14 11:52:10 UTC
As per the discussion on the PR, this does not seem to be as important as we initially thought, hence moving this to 4.6.0.

Comment 12 Lalatendu Mohanty 2020-08-17 18:58:17 UTC
Reducing the severity of the bug, as we have not seen this issue reproduced much.

Comment 13 Lalatendu Mohanty 2020-08-21 18:58:51 UTC
Not critical for 4.6, hence moving to 4.7.

Comment 14 Lalatendu Mohanty 2020-09-02 17:43:08 UTC
The right way to do this without breaking the API would be to set Upgradeable=False if the upgrade is not supported. I am going to do this for this bug and see whether folks like this approach.

Comment 15 Lalatendu Mohanty 2020-09-03 10:51:31 UTC
In this case we will only set Upgradeable=False for y-stream upgrades (i.e. between minor versions).

Comment 16 W. Trevor King 2020-09-04 04:32:00 UTC
Only setting Upgradeable=False during minor bumps would guard against extreme 4.(y-1) with 4.(y+1) version skew.  But I'd also be ok setting Upgradeable=False during all updates.  It would be easier to code that way, and minor version bumps are exciting enough that I'm ok forcing folks to consolidate their cluster on a well-defined jumping-off point before attempting them.
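To make the two options concrete, a rough sketch of the decision follows. Everything here is hypothetical (the function names and the minor_only switch are made up for illustration), and the version parsing only looks at the first two dot-separated fields, which is an assumption that happens to hold for the version strings quoted in this bug.

```python
def minor_of(version):
    """Return (major, minor) as ints from strings like '4.6.5' or '4.7.0-0.nightly-...'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgradeable_while_updating(current, target, minor_only=True):
    """Hypothetical policy: should Upgradeable stay True while moving current -> target?

    minor_only=True  blocks retargeting only during minor bumps (comment 15).
    minor_only=False blocks retargeting during every update (the simpler option).
    """
    if current == target:
        return True  # not updating at all
    if not minor_only:
        return False  # any in-flight update blocks a new unforced target
    return minor_of(current) == minor_of(target)


print(upgradeable_while_updating("4.5.20", "4.6.5"))                     # minor bump
print(upgradeable_while_updating("4.2.13", "4.2.16"))                    # patch-level update
print(upgradeable_while_updating("4.2.13", "4.2.16", minor_only=False))  # block everything
```

With minor_only=False the version comparison disappears entirely, which is the "easier to code" point above.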

Comment 17 W. Trevor King 2020-09-13 05:20:05 UTC
Lala is working on this, but no PR yet.

Comment 18 W. Trevor King 2020-10-02 23:16:49 UTC
PR is up, master is open for 4.7.  Just needs review.

Comment 19 W. Trevor King 2020-10-25 15:46:18 UTC
Clayton is not convinced [1].  We'll keep hunting for a consensus fix next sprint.

[1]: https://github.com/openshift/cluster-version-operator/pull/460#pullrequestreview-503280048

Comment 20 liujia 2020-11-26 09:35:30 UTC
Tried to reproduce it with the following steps (unexpected result):

1. install ocp v4.5.20

2. patch upstream and channel in cv for the 1st upgrade
# ./oc get clusterversion -o json|jq .items[].spec
{
  "channel": "stable-4.6",
  "clusterID": "a2fcbdf7-a8a4-4685-b0c8-2dc328203478",
  "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
}

3. upgrade the cluster from v4.5.20 to v4.6.5
# ./oc adm upgrade --to 4.6.5
Updating to 4.6.5
# ./oc adm upgrade
info: An upgrade is in progress. Working towards 4.6.5: 11% complete

4. patch channel in cv for the 2nd upgrade while 1st upgrade is ongoing.
# ./oc get clusterversion -o json|jq .items[].spec
{
  "channel": "stable-4.7",
  "clusterID": "a2fcbdf7-a8a4-4685-b0c8-2dc328203478",
  "desiredUpdate": {
    "force": false,
    "image": "registry.svc.ci.openshift.org/ocp/release@sha256:b8154e802c17dae57d1cfb0580e6a79544712cea0f77e01ae6171854f75975ea",
    "version": "4.6.5"
  },
  "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
}

5. try to upgrade the cluster to v4.7 through cli without --force, failed(expected)
# ./oc adm upgrade --to 4.7.0-0.nightly-2020-11-26-042221
error: already upgrading.

  Reason: 
  Message: Working towards 4.6.5: 11% complete

If you want to upgrade anyway, use --allow-upgrade-with-warnings.
# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.20    True        True          110s    Working towards 4.6.5: 11% complete

6. try to upgrade the cluster to v4.7 through web-console, succeed(unexpected and reproduced)
# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.20    True        True          13m     Working towards 4.7.0-0.nightly-2020-11-25-114114: 15% complete
# ./oc get clusterversion -o json|jq .items[].status.history
[
  {
    "completionTime": null,
    "image": "registry.svc.ci.openshift.org/ocp/release@sha256:bf37e13af0e254d0b744b62ace0dcf5560230374d7877a8fde16cf9134ec7862",
    "startedTime": "2020-11-26T09:22:49Z",
    "state": "Partial",
    "verified": false,
    "version": "4.7.0-0.nightly-2020-11-25-114114"
  },
  {
    "completionTime": "2020-11-26T09:22:49Z",
    "image": "registry.svc.ci.openshift.org/ocp/release@sha256:b8154e802c17dae57d1cfb0580e6a79544712cea0f77e01ae6171854f75975ea",
    "startedTime": "2020-11-26T09:19:00Z",
    "state": "Partial",
    "verified": false,
    "version": "4.6.5"
  },
  {
    "completionTime": "2020-11-26T09:02:15Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:78b878986d2d0af6037d637aa63e7b6f80fc8f17d0f0d5b077ac6aca83f792a0",
    "startedTime": "2020-11-26T08:24:11Z",
    "state": "Completed",
    "verified": false,
    "version": "4.5.20"
  }
]
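One way to spot this pattern from a status.history dump like the one above is to look for entries that are Partial but already have a completionTime, since the entry currently being applied is the one with a null completionTime. The helper below is only a sketch with a made-up name, and the sample data is trimmed down from the output in this comment.

```python
def superseded_updates(history):
    """Return versions whose update was replaced by a newer target before finishing.

    history is newest-first, as in ClusterVersion status.history.
    """
    return [
        entry["version"]
        for entry in history
        if entry["state"] == "Partial" and entry["completionTime"] is not None
    ]


history = [
    {"version": "4.7.0-0.nightly-2020-11-25-114114", "state": "Partial",
     "completionTime": None},
    {"version": "4.6.5", "state": "Partial",
     "completionTime": "2020-11-26T09:22:49Z"},
    {"version": "4.5.20", "state": "Completed",
     "completionTime": "2020-11-26T09:02:15Z"},
]
print(superseded_updates(history))  # ['4.6.5']
```

Run against the customer history in comment 1, the same check would flag 4.2.16, 4.2.13 and 4.2.10.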

Comment 21 W. Trevor King 2020-11-26 14:35:09 UTC
(In reply to liujia from comment #20)
> 6. try to upgrade the cluster to v4.7 through web-console,
> succeed(unexpected and reproduced)

Right.  I don't think we want to rely on client-side guards (like oc has today) for this.  I'd rather have the CVO itself say "sorry, I'm in the middle of 4.y->4.(y+1), so I'm not going to pick up your requested 4.(y+2) target".  We could just hold it in ClusterVersion.spec while finishing out the 4.(y+1) target and then pick it up.  And folks could force if they wanted to waive the CVO-side guard.  But I would like a CVO-side guard of some sort to close out this bug.
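The "hold it in spec and pick it up later" idea can be sketched as a small target-selection function. This is purely illustrative and not how the CVO is implemented: the function name and arguments are hypothetical, and the version strings are just examples.

```python
def next_working_target(working, requested, progressing, force=False):
    """Hypothetical: choose the release to work towards on this sync.

    working     -- release currently being applied
    requested   -- release named in ClusterVersion spec
    progressing -- True while the update to `working` is still in flight
    force       -- True if the user waived the guard
    """
    if requested == working:
        return working
    if progressing and not force:
        return working  # keep finishing; the request simply stays in spec
    return requested


# Mid-update, unforced: keep working towards 4.6.5.
print(next_working_target("4.6.5", "4.7.0", progressing=True))
# Once the 4.6.5 update has finished, the held request is picked up.
print(next_working_target("4.6.5", "4.7.0", progressing=False))
# Forcing waives the guard even mid-update.
print(next_working_target("4.6.5", "4.7.0", progressing=True, force=True))
```

Because the request is never rejected, only deferred, the spec does not need to be edited again once the in-flight update settles.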

Comment 22 Lalatendu Mohanty 2021-03-22 15:05:51 UTC
I am planning to close https://bugzilla.redhat.com/show_bug.cgi?id=1802553, as this does not seem to be an issue impacting clusters, and putting in a guard for Y+2 does not seem critical to me at this point in time. Also, we did not reach any agreement with Clayton on how we should fix this.

Comment 24 W. Trevor King 2021-04-08 20:26:37 UTC
*** Bug 1947566 has been marked as a duplicate of this bug. ***

Comment 27 W. Trevor King 2022-03-29 19:15:08 UTC
*** Bug 2069480 has been marked as a duplicate of this bug. ***

Comment 28 W. Trevor King 2022-03-30 16:59:18 UTC
*** Bug 2069480 has been marked as a duplicate of this bug. ***

Comment 29 W. Trevor King 2022-05-12 00:13:12 UTC
*** Bug 2083988 has been marked as a duplicate of this bug. ***

Comment 37 Lalatendu Mohanty 2023-01-18 20:29:17 UTC
Moving this bug to an enhancement request [1].

[1] https://issues.redhat.com/browse/OTA-861

