Bug 1931025 - 4.5.15 and later cluster-version operator does not sync ClusterVersion status before exiting, leaving 'verified: false' even for verified updates
Summary: 4.5.15 and later cluster-version operator does not sync ClusterVersion status...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: W. Trevor King
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On: 1927515
Blocks:
Reported: 2021-02-20 06:46 UTC by W. Trevor King
Modified: 2021-03-11 06:55 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator was not syncing ClusterVersion during graceful shutdowns. Consequence: During updates, the outgoing cluster-version operator was likely to exit after verifying the incoming release, but before pushing the 'verified: true' value into ClusterVersion history. Fix: The cluster-version operator now allows some additional time to perform a final ClusterVersion status synchronization during graceful shutdowns. Result: The ClusterVersion 'verified' values are again consistently 'true' for releases which were verified before being applied, returning to the behavior we had before 4.5.15 and 4.6.0.
Clone Of: 1927515
Environment:
Last Closed: 2021-03-11 06:55:27 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 525 | 0 | None | closed | Bug 1931025: pkg/cvo: Use shutdownContext for final status synchronization | 2021-02-27 21:42:17 UTC
Red Hat Product Errata | RHBA-2021:0714 | 0 | None | None | None | 2021-03-11 06:55:35 UTC

Description W. Trevor King 2021-02-20 06:46:21 UTC
+++ This bug was initially created as a clone of Bug #1927515 +++

+++ This bug was initially created as a clone of Bug #1916384 +++

--- Additional comment from wking on 2021-01-20 04:02:12 UTC ---

Thanks for pushing this, Neil :).  I've pushed up a PR for master/4.7 which has what I expect to be a fix.  Once that is verified for 4.7, we'll backport to 4.6, and then to 4.5.  Filling out a formal impact statement:

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* 4.5.15's bug 1872906 and 4.6.0's bug 1843505 broke the outgoing ClusterVersion status sync.  Clusters updating out of those releases, regardless of which release they are updating to, will be impacted by this bug.  To avoid the bug, we could theoretically block all edges into impacted releases, but that's an awful lot of releases, and as discussed below, the impact isn't particularly terrible.

What is the impact?  Is it serious enough to warrant blocking edges?
* Late-breaking changes to ClusterVersion status may not be pushed into the cluster.  Because it takes some time to pull down and verify the release image, and because the incoming CVO knows what version it's been asked to run, the version name and release image are unlikely to be corrupted, but 'verified' might be reported as 'false' when in reality the incoming release was successfully verified (the outgoing CVO just exited without attempting to sync that 'verified: true' out to the cluster).  This corrupted data is unfortunate, but has no known in-cluster consumers, and the main "was the target signed?" condition can be confirmed later by manually looking up the signature for the target release image.  That is probably limited enough to not be worth blocking edges.
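For the manual signature lookup mentioned above, a minimal sketch of building the lookup URL from a by-digest release pullspec might look like the following. The base URL is an assumption taken from the signature store layout described in the OpenShift documentation, and the helper name `signature_url` is hypothetical; this only builds the URL and does not fetch or verify the signature itself.

```python
# Assumed signature store base URL (from OpenShift documentation, not from this bug).
SIGNATURE_BASE = (
    "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release"
)


def signature_url(image, base=SIGNATURE_BASE, index=1):
    """Return the assumed signature URL for a by-digest release image pullspec."""
    if "@" not in image:
        raise ValueError("expected a by-digest pullspec like repo@sha256:<hex>")
    # Everything after the last '@' is the digest, e.g. 'sha256:47df...'.
    digest = image.rsplit("@", 1)[1]
    algo, sep, encoded = digest.partition(":")
    if not sep or not algo or not encoded:
        raise ValueError("digest must look like <algorithm>:<hex>")
    # The store separates algorithm and hash with '=' rather than ':'.
    return f"{base}/{algo}={encoded}/signature-{index}"
```

Passing the `status.history[].image` value of the release in question gives the URL to fetch; the downloaded signature would still need to be checked against the release-signing keys by hand.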

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* Updating out to fixed releases will avoid the problem for future updates.  There's no repairing updates out of impacted releases short of manually forcing 'verified' values, and that's probably not something we want to recommend.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* Yes, from 4.5.14 (and earlier) into 4.5.15, 4.6, and later.
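To illustrate the failure mode in the impact statement, here is a toy model, not the cluster-version operator's actual code. It assumes the final status sync is skipped because it is tied to the already-cancelled run context, and that the fix gives the final sync its own short-lived shutdown context (the linked pull request title mentions `shutdownContext`). All class and function names below are hypothetical.

```python
import time


class Context:
    """Tiny cancel/deadline holder, loosely modelled on Go's context.Context."""

    def __init__(self, timeout=None):
        self._cancelled = False
        self._deadline = None if timeout is None else time.monotonic() + timeout

    def cancel(self):
        self._cancelled = True

    def done(self):
        expired = self._deadline is not None and time.monotonic() >= self._deadline
        return self._cancelled or expired


class Operator:
    """Holds in-memory status and the copy that has been pushed to the cluster."""

    def __init__(self):
        self.status = {"verified": False}   # what the operator knows
        self.cluster = {"verified": False}  # what ClusterVersion would show

    def sync_status(self, ctx):
        # A sync attempted on a finished context is dropped.
        if ctx.done():
            return False
        self.cluster = dict(self.status)
        return True


def update_and_exit(op, use_shutdown_context, grace_seconds=5.0):
    """Verify the incoming release, then shut down before the next periodic sync."""
    run_ctx = Context()
    op.status["verified"] = True  # verification succeeded, but only in memory
    run_ctx.cancel()              # graceful shutdown requested
    # Assumed fix: the final sync gets its own context with a grace period.
    final_ctx = Context(timeout=grace_seconds) if use_shutdown_context else run_ctx
    return op.sync_status(final_ctx)
```

With `use_shutdown_context=False` the pushed status keeps `verified: False` even though verification succeeded; with `True` the final sync lands before exit.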

Comment 2 Yang Yang 2021-03-01 06:17:31 UTC
Verified with 4.5.0-0.nightly-2021-02-26-170201

Steps to verify it:
1. Install a cluster with 4.5.0-0.nightly-2021-02-26-170201
2. Create a dummy cincy server with 4.5.0-0.nightly-2021-02-26-170201 and 4.6.19
3. Patch ClusterVersion's spec.upstream to use the cincy server
4. Upgrade the cluster to 4.6.19

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-03-01T04:17:22Z"
    generation: 3
    managedFields:
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:clusterID: {}
      manager: cluster-bootstrap
      operation: Update
      time: "2021-03-01T04:17:22Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:channel: {}
          f:upstream: {}
      manager: kubectl-edit
      operation: Update
      time: "2021-03-01T06:07:17Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:desiredUpdate:
            .: {}
            f:force: {}
            f:image: {}
            f:version: {}
      manager: oc
      operation: Update
      time: "2021-03-01T06:07:51Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:availableUpdates: {}
          f:conditions: {}
          f:desired:
            .: {}
            f:channels: {}
            f:image: {}
            f:url: {}
            f:version: {}
          f:history: {}
          f:observedGeneration: {}
          f:versionHash: {}
      manager: cluster-version-operator
      operation: Update
      time: "2021-03-01T06:12:52Z"
    name: version
    resourceVersion: "50753"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 5b136b58-5a12-40c0-9b59-45d3f462f387
  spec:
    channel: stable-4.6
    clusterID: 16a9d8a3-a65d-4dda-a23f-dc717ed35a75
    desiredUpdate:
      force: false
      image: quay.io/openshift-release-dev/ocp-release@sha256:47df4bfe1cfd6d63dd2e880f00075ed1d37f997fd54884ed823ded9f5d96abfc
      version: 4.6.19
    upstream: https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy4.json
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-03-01T05:00:59Z"
      message: Done applying 4.5.0-0.nightly-2021-02-26-170201
      status: "True"
      type: Available
    - lastTransitionTime: "2021-03-01T06:08:28Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2021-03-01T06:07:59Z"
      message: 'Working towards 4.6.19: 15% complete'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2021-03-01T06:07:17Z"
      status: "True"
      type: RetrievedUpdates
    desired:
      channels:
      - stable-4.6
      image: quay.io/openshift-release-dev/ocp-release@sha256:47df4bfe1cfd6d63dd2e880f00075ed1d37f997fd54884ed823ded9f5d96abfc
      url: https://access.redhat.com/errata/RHBA-2021:0634
      version: 4.6.19
    history:
    - completionTime: null
      image: quay.io/openshift-release-dev/ocp-release@sha256:47df4bfe1cfd6d63dd2e880f00075ed1d37f997fd54884ed823ded9f5d96abfc
      startedTime: "2021-03-01T06:07:59Z"
      state: Partial
      verified: true        <--- 'verified' is now true.
      version: 4.6.19
    - completionTime: "2021-03-01T05:00:59Z"
      image: registry.ci.openshift.org/ocp/release@sha256:e54366af2e363c90249dceb97a1496d3b4249da69c5400ab383eca63799db762
      startedTime: "2021-03-01T04:17:39Z"
      state: Completed
      verified: false
      version: 4.5.0-0.nightly-2021-02-26-170201
    observedGeneration: 3
    versionHash: llINEEKbEPQ=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

'verified: true' is visible for the 4.6.19 history entry, so moving this bug to VERIFIED.
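The same check can be scripted instead of read off the YAML by eye. The sketch below assumes the document has been obtained as JSON (for example from `oc get clusterversion -o json`) and already parsed into a dict; the helper names are hypothetical, and only fields visible in the output above are used.

```python
def first_cluster_version(doc):
    """Accept a single ClusterVersion or the List wrapper that `oc get` prints."""
    if doc.get("kind") == "List":
        return doc["items"][0]
    return doc


def history_verification(doc):
    """Return (version, state, verified) for each history entry, newest first."""
    cluster_version = first_cluster_version(doc)
    history = cluster_version.get("status", {}).get("history") or []
    return [
        (entry.get("version"), entry.get("state"), bool(entry.get("verified")))
        for entry in history
    ]


# Trimmed copy of the history shown in the output above.
SAMPLE = {
    "kind": "List",
    "items": [
        {
            "kind": "ClusterVersion",
            "status": {
                "history": [
                    {"version": "4.6.19", "state": "Partial", "verified": True},
                    {
                        "version": "4.5.0-0.nightly-2021-02-26-170201",
                        "state": "Completed",
                        "verified": False,
                    },
                ]
            },
        }
    ],
}

for version, state, verified in history_verification(SAMPLE):
    print(f"{version}: state={state} verified={verified}")
```

The nightly install entry remaining `verified: false` matches the output above; the entry of interest for this bug is the update target, 4.6.19.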

Comment 5 errata-xmlrpc 2021-03-11 06:55:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.34 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0714

