Bug 1916384 - 4.5.15 and later cluster-version operator does not sync ClusterVersion status before exiting, leaving 'verified: false' even for verified updates
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: W. Trevor King
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks: 1927515
 
Reported: 2021-01-14 16:51 UTC by Neil Girard
Modified: 2022-01-17 07:08 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator was not syncing ClusterVersion during graceful shutdowns. Consequence: During updates, the outgoing cluster-version operator was likely to exit after verifying the incoming release, but before pushing the 'verified: true' value into ClusterVersion history. Fix: The cluster-version operator now allows some additional time to perform a final ClusterVersion status synchronization during graceful shutdowns. Result: The ClusterVersion 'verified' values are again consistently 'true' for releases which were verified before being applied, returning to the behavior we had before 4.5.15 and 4.6.0.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:53:18 UTC
Target Upstream Version:
Embargoed:


Attachments
2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log (647.48 KB, text/plain)
2021-01-19 22:45 UTC, W. Trevor King


Links:
  GitHub openshift/cluster-version-operator pull 501 (closed): Bug 1916384: pkg/cvo: Use shutdownContext for final status synchronization (last updated 2021-02-17 00:37:25 UTC)
  Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:53:39 UTC)

Description Neil Girard 2021-01-14 16:51:11 UTC
Description of problem:
It seems that when installing or upgrading to OCP 4.6.9 (and maybe others), the verified flag in the CVO's history shows verified as false.

  history:
  - completionTime: '2021-01-06T14:45:13Z'
    image: repo.ercot.com:443/ocp4/openshift4@sha256:43d5c84169a4b3ff307c29d7374f6d69a707de15e9fa90ad352b432f77c0cead
    startedTime: '2021-01-05T21:57:09Z'
    state: Completed
    verified: false
    version: 4.6.9


How reproducible:
always

Steps to Reproduce:
1. Upgrade cluster to 4.6.9

Actual results:
verified = false

Expected results:
verified = true

Additional info:

Attempted to create a 4.6.8 cluster in quick labs and it too showed not verified.

    history:
    - completionTime: "2021-01-12T20:39:40Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf
      startedTime: "2021-01-12T20:05:46Z"
      state: Completed
      verified: false
      version: 4.6.8

Comment 1 W. Trevor King 2021-01-14 18:39:45 UTC
Verification is something an outgoing CVO performs on an incoming update target request.  On install, there is no outgoing CVO to perform the verification.  Verification for the install must be performed by the user, e.g. by checking signatures on the installer they use to install the cluster, and then relying on transitive trust for the rest of the cluster (e.g. that the signed installer is using a compiled-in, by-digest release pullspec).  So 'verified: false' for the original, installed-version history entry is as designed, not a bug.
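
For example (the exact output wording varies by installer version), one way to see that compiled-in, by-digest release pullspec for the installer binary is:

  $ openshift-install version   # the output includes the installer version and its pinned 'release image' pullspec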

Comment 2 Neil Girard 2021-01-14 19:27:54 UTC
Hello,

Ok, the install case makes sense.  Customer upgraded from 4.5.21 to 4.6.9 and the verified flag was false.  Is there a way to fix this or a way to determine why it is false from that upgrade?

Thanks,
Neil Girard

Comment 3 W. Trevor King 2021-01-14 20:54:35 UTC
> Customer upgraded from 4.5.21 to 4.6.9 and the verified was false.  Is there a way to fix this or a way to determine why it is false from that upgrade?

'verified' is set false when the outgoing CVO performs its validation checks (target release was signed, Upgradeable=False conditions were ok, etc.), at least one of those checks failed, but the user set --force to waive the failed guard (more background in bug 1825396).  We do not record the specific guard(s) which failed and were waived.  If the user used 'oc adm upgrade --force ...', or otherwise set ClusterVersion's spec.desiredUpdate.force [1], then ask them to try again without forcing, and the CVO will explain why it is refusing the requested update.

[1]: https://github.com/openshift/api/blob/790e0a84e3e01dc3f6218e655b0e92b809ae7fd0/config/v1/types_cluster_version.go#L238-L250
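
For example, a quick way to check whether a force flag is currently set on the requested update (empty output means spec.desiredUpdate.force is unset):

  $ oc get clusterversion version -o jsonpath='{.spec.desiredUpdate.force}{"\n"}'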

Comment 4 Neil Girard 2021-01-18 14:22:03 UTC
Hello,

Customer has stated they did not use --force when upgrading, but somehow the update resulted in the flag being false.

Thanks,
Neil Girard

Comment 5 W. Trevor King 2021-01-19 00:20:37 UTC
Can you set a tail on the outgoing CVO, trigger the update, and then attach the CVO logs here?  Example of tailing the CVO in [1]:

  $ POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)"
  $ oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log"

although you may be able to do something more elegant than POD with:

  $ oc -n openshift-cluster-version logs -f --selector k8s-app=cluster-version-operator

That should have the CVO explaining why it is failing to validate the requested update target.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1884334#c16

Comment 6 Neil Girard 2021-01-19 16:25:56 UTC
Hello,

Customer is already at the level they wish to be.  Is it possible to trigger upgrading to the same version?  I can save the scripts for the next time the customer wishes to upgrade and have them try it at that time.

Thanks,
Neil Girard

Comment 7 W. Trevor King 2021-01-19 21:48:42 UTC
> Is it possible to trigger upgrading to the same version?

I installed a 4.5.21 cluster, and then bumped the channel to stable-4.6:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "stable-4.6"}]'

*->4.6.9 has been blocked [1], but 4.6.8 is still currently recommended:

  $ oc adm upgrade
  Cluster version is 4.5.21

  Updates:

  VERSION IMAGE
  4.5.22  quay.io/openshift-release-dev/ocp-release@sha256:38d0bcb5443666b93a0c117f41ce5d5d8b3602b411c574f4e164054c43408a01
  4.5.23  quay.io/openshift-release-dev/ocp-release@sha256:368970a5052e600b467de7ab2868a04cdd6cdb991eb9dfbf3473a199d6fe4a4e
  4.5.24  quay.io/openshift-release-dev/ocp-release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11
  4.6.8   quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf

Start tailing the CVO in the background:

  $ oc -n openshift-cluster-version logs -f --selector k8s-app=cluster-version-operator >cvo.log &

Trigger the update to 4.6.8:

  $ oc adm upgrade --to 4.6.8

Wait until it accepts the update and gets into it:

  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.6.8: 11% complete

  No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.

Check the history:

  $ oc get -o json clusterversion version | jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .version + " " + (.verified | tostring)'
  2021-01-19T21:17:30Z - 4.6.8 false
  2021-01-19T20:50:19Z 2021-01-19T21:14:15Z 4.5.21 false

So yeah, I'd have expected that 4.6.8 target to be 'verified: true', but it's not.  Checking the pod and verification history for my cluster:

  $ oc -n openshift-cluster-version get -o json events | jq -r '.items[] | select(.reason == "SuccessfulCreate" or (.reason | contains("VerifyPayload"))) | .firstTimestamp + " " + (.count | tostring) + " " + .message' | sort
  2021-01-19T20:49:53Z 1 Created pod: cluster-version-operator-844c77584d-8frtb
  2021-01-19T20:50:26Z 1 Created pod: cluster-version-operator-59bc479b54-7bp7b
  2021-01-19T21:17:18Z 1 Created pod: version-4.6.8-c49dj-zn6bm
  2021-01-19T21:17:39Z 1 Created pod: cluster-version-operator-86b768cb58-v96lm
  2021-01-19T21:17:51Z 1 verifying payload version="4.6.8" image="quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf"
  2021-01-19T21:17:57Z 1 Created pod: cluster-version-operator-56ffb56bdc-r9d2s
  2021-01-19T21:17:59Z 1 verifying payload version="4.6.8" image="quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf"

Hmm.  I grabbed logs from the currently-running CVO:

  $ POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)"
  $ echo "${POD}"
  cluster-version-operator-56ffb56bdc-r9d2s
  $ oc -n openshift-cluster-version logs "${POD}" > "$(date --utc --iso=s)-${POD}.log"

And checking its start time and version:

  $ oc -n openshift-cluster-version get -o json pod "${POD}" | jq -r '.metadata.creationTimestamp + " " + .spec.containers[0].image'
  2021-01-19T21:17:57Z quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf
  $ oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf | head -n1
  Name:      4.6.8

$ grep sync_worker '2021-01-19T21:23:51+00:00-cluster-version-operator-56ffb56bdc-r9d2s.log' | grep -v 'Update work is equal to current target; no change required' | grep -A2 'Loading payload'
I0119 21:17:59.916165       1 sync_worker.go:502] Loading payload
I0119 21:18:00.873910       1 sync_worker.go:569] Skipping preconditions for a local operator image payload.
I0119 21:18:00.874021       1 sync_worker.go:602] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf with hash BYHEg9BBpfk=

We probably shouldn't be recording a VerifyPayload event if we're about to skip verification because of the local payload.  I'll run a fresh update and loop the tail, to see if I can get logs from the intermediate pods. 

[1]: https://github.com/openshift/cincinnati-graph-data/pull/603

Comment 8 W. Trevor King 2021-01-19 22:45:55 UTC
Created attachment 1748863 [details]
2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log

Ok, take two:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "stable-4.6"}]'
  $ while sleep 1; do POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)" && oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log"; done &
  $ oc adm upgrade --to 4.6.8

Wait for the CVO to get into the update...  Confirm it's progressing:

  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.6.8: 11% complete

  No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.

Kill the background logger:

  $ fg
  while sleep 1; do
      POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)" && oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log";
  done
  ^C

Confirm 4.6.8 was not verified (this is the bug):

  $ oc get -o json clusterversion version | jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .version + " " + (.verified | tostring)'
  2021-01-19T22:23:59Z - 4.6.8 false
  2021-01-19T21:53:38Z 2021-01-19T22:19:33Z 4.5.21 false

History of pods and verification events:

  $ oc -n openshift-cluster-version get -o json events | jq -r '.items[] | select(.reason == "SuccessfulCreate" or (.reason | contains("VerifyPayload"))) | .firstTimestamp + " " + (.count | tostring) + " " + .message' | sort
  2021-01-19T21:53:12Z 1 Created pod: cluster-version-operator-844c77584d-47v82
  2021-01-19T21:53:45Z 1 Created pod: cluster-version-operator-59bc479b54-h54zb
  2021-01-19T22:23:59Z 1 Created pod: version-4.6.8-hdlrb-hj59t
  2021-01-19T22:24:17Z 1 Created pod: cluster-version-operator-86b768cb58-26cv6
  2021-01-19T22:24:20Z 1 verifying payload version="4.6.8" image="quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf"
  2021-01-19T22:24:23Z 1 Created pod: cluster-version-operator-8546dd559-6zxmp
  2021-01-19T22:24:25Z 1 verifying payload version="4.6.8" image="quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf"

Gathered logs:

  $ head -n1 2021-01-19T*.log
  ==> 2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log <==
  I0119 22:15:04.457399       1 start.go:21] ClusterVersionOperator v4.5.0-202011132127.p0-dirty

  ==> 2021-01-19T22:24:17+00:00-cluster-version-operator-86b768cb58-26cv6.log <==
  I0119 22:24:19.133039       1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256

  ==> 2021-01-19T22:24:23+00:00-cluster-version-operator-8546dd559-6zxmp.log <==
  I0119 22:24:24.592206       1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256

Not sure why we don't have logs for cluster-version-operator-844c77584d-47v82... Ah, never got scheduled:

  $ oc -n openshift-cluster-version get -o json events | jq -r '.items[] | select(.involvedObject.name == "cluster-version-operator-844c77584d-47v82") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort 
  2021-01-19T21:53:12Z 2 FailedScheduling: no nodes available to schedule pods
  2021-01-19T21:53:45Z 1 FailedScheduling: skip schedule deleting pod: openshift-cluster-version/cluster-version-operator-844c77584d-47v82

That's weird, but seems orthogonal.  Log entries discussing verification:

  $ grep sync_worker 2021-01-19T*.log | grep -v 'Update work is equal to current target; no change required' | grep -A2 'Loading payload'
  2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log:I0119 22:19:03.311946       1 sync_worker.go:477] Loading payload
  2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log:I0119 22:19:03.806839       1 sync_worker.go:518] Skipping preconditions for a local operator image payload.
  2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log:I0119 22:19:03.806867       1 sync_worker.go:547] Payload loaded from registry.ci.openshift.org/ocp/release@sha256:7684a3daa92827b313a682b3615f2a348a941f8557b5b509b736226f40e01e07 with hash YkD2j0aj5ug=
  --
  2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log:I0119 22:23:59.787724       1 sync_worker.go:477] Loading payload
  2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log:I0119 22:24:15.636222       1 sync_worker.go:547] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf with hash BYHEg9BBpfk=
  2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log:I0119 22:24:15.649806       1 sync_worker.go:639] Precreated resource clusteroperator "etcd" (68 of 617)
  --
  2021-01-19T22:24:17+00:00-cluster-version-operator-86b768cb58-26cv6.log:I0119 22:24:20.182443       1 sync_worker.go:502] Loading payload
  2021-01-19T22:24:17+00:00-cluster-version-operator-86b768cb58-26cv6.log:I0119 22:24:20.768077       1 sync_worker.go:569] Skipping preconditions for a local operator image payload.
  2021-01-19T22:24:17+00:00-cluster-version-operator-86b768cb58-26cv6.log:I0119 22:24:20.768136       1 sync_worker.go:602] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf with hash BYHEg9BBpfk=
  --
  2021-01-19T22:24:23+00:00-cluster-version-operator-8546dd559-6zxmp.log:I0119 22:24:25.249971       1 sync_worker.go:502] Loading payload
  2021-01-19T22:24:23+00:00-cluster-version-operator-8546dd559-6zxmp.log:I0119 22:24:25.839554       1 sync_worker.go:569] Skipping preconditions for a local operator image payload.
  2021-01-19T22:24:23+00:00-cluster-version-operator-8546dd559-6zxmp.log:I0119 22:24:25.839637       1 sync_worker.go:602] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf with hash BYHEg9BBpfk=

So you can see cluster-version-operator-59bc479b54-h54zb verifying itself via the local-image opt-out at 22:19:03 and then loading the 4.6.8 target at 22:23:59 without being all that chatty about any verification.  The subsequent two CVO pods are from 4.6.8, so they both load while skipping verification via the local-image opt-out.  I'm attaching 2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log in case folks want to poke around there, while I go off and re-read the 4.5 verification code to see what sort of logging I should expect to find...

Comment 9 W. Trevor King 2021-01-19 23:54:43 UTC
The VerifyPayload* events are new in 4.6 [1], so that's why we don't see them in 4.5.  In 4.5.21, verified-ness comes from RetrievePayload [2].  Looks like we don't log anything on successful signature validation in RetrievePayload [3], but I don't see anything about "that failed verification" in the logs either.  You can see a bit of the signature verification going on though:

  $ grep -1 'store.go' '2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log'
  I0119 22:23:59.787724       1 sync_worker.go:477] Loading payload
  I0119 22:23:59.787746       1 store.go:74] use cached most recent signature config maps
  I0119 22:23:59.797665       1 store.go:65] remember most recent signature config maps: signatures-managed
  I0119 22:23:59.797699       1 store.go:116] searching for sha256-6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf in signature config map signatures-managed
  I0119 22:23:59.918826       1 cvo.go:500] Finished syncing available updates "openshift-cluster-version/version" (131.142709ms)

And while pkg/verify/verify.go includes logging for some failure modes, it has none for successfully verifying the signature [4,5].  A later precondition check [6] does show up in the logs, where it passes because there is no Upgradeable condition on the ClusterVersion object [7]:

  $ grep -i precondition '2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log'
  I0119 22:19:03.806839       1 sync_worker.go:518] Skipping preconditions for a local operator image payload.
  I0119 22:24:15.636190       1 upgradeable.go:51] Precondition ClusterVersionUpgradeable passed: no Upgradeable condition on ClusterVersion.

So timeline:

* 22:23:59: "Loading payload" logged.
* 22:23:59: store.go logs, as the CVO hunts for a signature.
* 22:23:59: startedTime history entry with 'verified: false'.  I expected us to log these history merges, but I guess they're V(5) [8] and the bump to --v=5 is new in 4.6 [9].
* 22:24:15: Precondition ClusterVersionUpgradeable passed
* 22:24:15: "Payload loaded from" logged.

Still digging...

[1]: https://github.com/openshift/cluster-version-operator/pull/411
[2]: https://github.com/openshift/cluster-version-operator/blob/0a34ac3632dd869e9975fc6e9ebdcb95ef9374b8/pkg/cvo/sync_worker.go#L485-L507
[3]: https://github.com/openshift/cluster-version-operator/blob/0a34ac3632dd869e9975fc6e9ebdcb95ef9374b8/pkg/cvo/updatepayload.go#L91-L104
[4]: https://github.com/openshift/cluster-version-operator/blob/0a34ac3632dd869e9975fc6e9ebdcb95ef9374b8/pkg/verify/verify.go#L303-L306
[5]: https://github.com/openshift/cluster-version-operator/blob/0a34ac3632dd869e9975fc6e9ebdcb95ef9374b8/pkg/verify/verify.go#L331-L349
[6]: https://github.com/openshift/cluster-version-operator/blob/0a34ac3632dd869e9975fc6e9ebdcb95ef9374b8/pkg/cvo/sync_worker.go#L528
[7]: https://github.com/openshift/cluster-version-operator/blob/0a34ac3632dd869e9975fc6e9ebdcb95ef9374b8/pkg/payload/precondition/clusterversion/upgradeable.go#L51-L52
[8]: https://github.com/openshift/cluster-version-operator/blob/0a34ac3632dd869e9975fc6e9ebdcb95ef9374b8/pkg/cvo/status.go#L87
[9]: https://github.com/openshift/cluster-version-operator/pull/448

Comment 10 W. Trevor King 2021-01-20 03:31:06 UTC
Aha [1]:

  $ grep 'unable to perform final sync' '2021-01-19T22:22:42+00:00-cluster-version-operator-59bc479b54-h54zb.log'
  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

So we're probably performing the status/history merge correctly, but then exiting before we successfully flush the state into the in-cluster object.  The bug would have been introduced into 4.5.z with [2], bug 1872906, which went out with 4.5.15 [3].  The solution is to give that flush a bit of extra time, similar to bug 1906100.

[1]: https://github.com/openshift/cluster-version-operator/blob/0a34ac3632dd869e9975fc6e9ebdcb95ef9374b8/pkg/cvo/cvo.go#L344
[2]: https://github.com/openshift/cluster-version-operator/pull/446
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1872906#c10

Comment 11 W. Trevor King 2021-01-20 04:02:12 UTC
Thanks for pushing this, Neil :).  I've pushed up a PR for master/4.7 with what I expect to be the fix.  Once that is verified for 4.7, we'll backport to 4.6, and then to 4.5.  Filling out a formal impact statement:

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* 4.5.15's bug 1872906 and 4.6.0's bug 1843505 broke the outgoing ClusterVersion status sync.  Clusters updating out of those releases, regardless of which release they are updating to, will be impacted by this bug.  To avoid the bug, we could theoretically block all edges into impacted releases, but that's an awful lot of releases, and as discussed below, the impact isn't particularly terrible.

What is the impact?  Is it serious enough to warrant blocking edges?
* Late-breaking changes to ClusterVersion status may not be pushed into the cluster.  Because it takes some time to pull down and verify the release image, and because the incoming CVO knows what version it's been asked to run, the version name and release image are unlikely to be corrupted, but 'verified' might be reported as 'false' when in reality the incoming release was successfully verified (the outgoing CVO just exited without attempting to sync that 'verified: true' out to the cluster).  This corrupted data is unfortunate, but it has no known in-cluster consumers, and the main "was the target signed?" condition can be confirmed later by manually looking up the signature for the target release image.  That is probably limited enough to not be worth blocking edges.
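  For example, a minimal sketch of that manual lookup, assuming the standard mirror.openshift.com signature store:

    $ DIGEST="$(oc get clusterversion version -o jsonpath='{.status.history[0].image}' | sed 's/.*sha256://')"
    $ curl -s "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=${DIGEST}/signature-1" | gpg --decrypt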

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* Updating out to fixed releases will avoid the problem for future updates.  There's no repairing updates out of impacted releases short of manually forcing 'verified' values, and that's probably not something we want to recommend.
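  For example, one way to check whether a given release image already carries the cluster-version-operator fix (using a 4.7.0 pullspec purely as an illustration) is to list the commits built into the release and compare the CVO commit against the linked pull request:

    $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.7.0-x86_64 | grep cluster-version-operator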

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* Yes, from 4.5.14 (and earlier) into 4.5.15, 4.6, and later.

Comment 12 Yang Yang 2021-01-22 08:28:18 UTC
Updating the reproduced status from QE:

Cluster upgrades from 4.5.21 -> 4.6.8 -> 4.6.12 -> 4.7.0-fc.3. The issue is reproduced when upgrading from 4.6.8 to 4.6.12.

  history:
    - completionTime: null
      image: quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537
      startedTime: "2021-01-22T07:48:23Z"
      state: Partial
      verified: false
      version: 4.7.0-fc.3
    - completionTime: "2021-01-22T07:37:38Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:5c3618ab914eb66267b7c552a9b51c3018c3a8f8acf08ce1ff7ae4bfdd3a82bd
      startedTime: "2021-01-22T06:40:58Z"
      state: Completed
      verified: false
      version: 4.6.12    -- issue reproduced
    - completionTime: "2021-01-22T02:52:13Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf
      startedTime: "2021-01-22T01:47:41Z"
      state: Completed
      verified: true       -- different from comment #7
      version: 4.6.8
    - completionTime: "2021-01-21T11:37:56Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:7684a3daa92827b313a682b3615f2a348a941f8557b5b509b736226f40e01e07
      startedTime: "2021-01-21T11:02:11Z"
      state: Completed
      verified: false   -- expected based on comment #1
      version: 4.5.21

Cluster upgrades from 4.5.14 -> 4.5.15 -> 4.5.24. The issue is reproduced when upgrading from 4.5.15.
  history:
    - completionTime: null
      image: quay.io/openshift-release-dev/ocp-release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11
      startedTime: "2021-01-22T08:02:38Z"
      state: Partial
      verified: false
      version: 4.5.24
    - completionTime: "2021-01-22T06:42:53Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:1df294ebe5b84f0eeceaa85b2162862c390143f5e84cda5acc22cc4529273c4c
      startedTime: "2021-01-22T06:08:49Z"
      state: Completed
      verified: true
      version: 4.5.15
    - completionTime: "2021-01-22T05:29:55Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:95cfe9273aecb9a0070176210477491c347f8e69e41759063642edf8bb8aceb6
      startedTime: "2021-01-22T05:02:22Z"
      state: Completed
      verified: false
      version: 4.5.14

Comment 13 Yang Yang 2021-01-25 07:41:14 UTC
Reproduced it when upgrading from 4.7.0-fc.1 to 4.7.0-fc.2

history:
    - completionTime: null
      image: quay.io/openshift-release-dev/ocp-release@sha256:2f00e3016ca5678e51e9d79d4d3ac5a2926e0c09a8e75df19ea983b9cd6c5d05
      startedTime: "2021-01-25T05:56:51Z"
      state: Partial
      verified: false
      version: 4.7.0-fc.2
    - completionTime: "2021-01-25T03:28:13Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:bf3a3f8bb1690ba1fa33e0a75437f377dde00a3dd19fe0398573f6bed48cfe04
      startedTime: "2021-01-25T02:57:19Z"
      state: Completed
      verified: false
      version: 4.7.0-fc.1
    observedGeneration: 3

Comment 15 Yang Yang 2021-02-07 06:33:42 UTC
I'm trying to verify it by upgrading from 4.7.0-0.nightly-2021-02-05-161444 to 4.7.0-0.nightly-2021-02-05-221250.

Install a cluster using 4.7.0-0.nightly-2021-02-05-161444 and then update the Cincinnati upstream:
 
    # oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"}}' --type=merge

    # oc adm upgrade 
    Cluster version is 4.7.0-0.nightly-2021-02-05-161444

    Updates:

    VERSION                           IMAGE
    4.7.0-0.nightly-2021-02-05-221250 registry.ci.openshift.org/ocp/release@sha256:4ae3e3c8aed290eae504af397c0fe4f141c7d30ed7276da0cedf9d271d1f8de4

Upgrading...

    # oc adm upgrade --to=4.7.0-0.nightly-2021-02-05-221250

Checking clusterversion status...

    conditions:
    - lastTransitionTime: "2021-02-07T06:17:02Z"
      message: 'The update cannot be verified: unable to locate a valid signature for one or more sources'
      reason: ImageVerificationFailed
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-02-07T06:13:21Z"
      message: 'Unable to apply 4.7.0-0.nightly-2021-02-05-221250: the image may not be safe to use'
      reason: ImageVerificationFailed
      status: "True"
      type: Progressing
    - lastTransitionTime: "2021-02-07T05:55:10Z"
      status: "True"
      type: RetrievedUpdates

    history:
    - completionTime: null
      image: registry.ci.openshift.org/ocp/release@sha256:4ae3e3c8aed290eae504af397c0fe4f141c7d30ed7276da0cedf9d271d1f8de4
      startedTime: "2021-02-07T06:13:21Z"
      state: Partial
      verified: false
      version: 4.7.0-0.nightly-2021-02-05-221250


It looks like the target nightly build cannot be verified. Is there any way to get it to work, or do I have to wait for the next fc build?

Comment 16 W. Trevor King 2021-02-08 03:52:53 UTC
CI-registry nightlies like the one you used are apparently not signed:

  $ curl -sI https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=4ae3e3c8aed290eae504af397c0fe4f141c7d30ed7276da0cedf9d271d1f8de4/signature-1 | head -n1
  HTTP/1.1 404 Not Found

But nightlies which are promoted to quay.io and indexed on the mirrors [1] are signed:

  $ curl -s https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.7.0-0.nightly-2021-02-04-031352/release.txt | grep Pull   
  Pull From: quay.io/openshift-release-dev/ocp-release-nightly@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4
  $ curl -s https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4/signature-1 | gpg --decrypt
  {"critical": {"image": {"docker-manifest-digest": "sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4"}, "type": "atomic container signature", "identity": {"docker-reference": "quay.io/openshift-release-dev/ocp-release-nightly:4.7.0-0.nightly-2021-02-04-031352-x86_64"}}, "optional": {"creator": "Red Hat OpenShift Signing Authority 0.0.1"}}gpg: Signature made Wed Feb  3 23:35:08 2021 PST
  gpg:                using RSA key 938A80CAF21541EB
  gpg: Can't check signature: No public key

0x938A80CAF21541EB is the beta-2 key [2].

[1]: https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/
[2]: https://access.redhat.com/security/team/key

Comment 17 Yang Yang 2021-02-08 07:54:43 UTC
Trevor, thanks a lot. 

The latest 4.7 build on the mirror server does not have the fix, so I'll get back to this when the next 4.7 builds are available.

Comment 18 Yang Yang 2021-02-10 08:07:49 UTC
Verified by upgrading from 4.7.0-0.nightly-2021-02-06-022227 to 4.7.0-rc.0

Steps to verify it:

1) Install a cluster with 4.7.0-0.nightly-2021-02-06-022227
2) Create a dummy cincy server with 4.7.0-0.nightly-2021-02-06-022227 and 4.7.0-rc.0 (a sketch of the graph JSON follows this list)
3) Upgrade the cluster
4) Check that the verified state is true
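
A minimal sketch of what the dummy update graph served by that server can look like (this assumes the plain Cincinnati graph JSON format of nodes plus index-pair edges; the pullspecs are the ones from the history below, and the file name cincy4.json is only illustrative):

  $ cat cincy4.json
  {
    "nodes": [
      {"version": "4.7.0-0.nightly-2021-02-06-022227",
       "payload": "registry.ci.openshift.org/ocp/release@sha256:ee7e4d58dd6e001d4f621e5cbf893d203910c2d3c2dce88b8ac65538eb1f1411",
       "metadata": {}},
      {"version": "4.7.0-rc.0",
       "payload": "quay.io/openshift-release-dev/ocp-release@sha256:497fa748b89619aba312a926a0be0ad155c4b894ca3e2824a00167421e3441b0",
       "metadata": {}}
    ],
    "edges": [[0, 1]]
  }

Hosting that JSON anywhere the cluster can reach (in this case, a raw GitHub URL) and pointing spec.upstream at it should be enough for the CVO to offer the 4.7.0-rc.0 edge.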

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-02-10T06:56:41Z"
    generation: 3
    managedFields:
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:clusterID: {}
      manager: cluster-bootstrap
      operation: Update
      time: "2021-02-10T06:56:41Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:channel: {}
          f:upstream: {}
      manager: kubectl-edit
      operation: Update
      time: "2021-02-10T07:53:52Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:desiredUpdate:
            .: {}
            f:force: {}
            f:image: {}
            f:version: {}
      manager: oc
      operation: Update
      time: "2021-02-10T08:02:12Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:availableUpdates: {}
          f:conditions: {}
          f:desired:
            .: {}
            f:channels: {}
            f:image: {}
            f:url: {}
            f:version: {}
          f:history: {}
          f:observedGeneration: {}
          f:versionHash: {}
      manager: cluster-version-operator
      operation: Update
      time: "2021-02-10T08:02:35Z"
    name: version
    resourceVersion: "44932"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: a03003cc-1cd4-44e4-bcca-940158ecff9c
  spec:
    channel: nightly-4.7
    clusterID: 2d39a2e8-56b6-4cf1-8f04-a4ffc9262e83
    desiredUpdate:
      force: false
      image: quay.io/openshift-release-dev/ocp-release@sha256:497fa748b89619aba312a926a0be0ad155c4b894ca3e2824a00167421e3441b0
      version: 4.7.0-rc.0
    upstream: https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy4.json
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-02-10T07:24:27Z"
      message: Done applying 4.7.0-0.nightly-2021-02-06-022227
      status: "True"
      type: Available
    - lastTransitionTime: "2021-02-10T07:24:27Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2021-02-10T08:02:20Z"
      message: 'Working towards 4.7.0-rc.0: 70 of 668 done (10% complete)'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2021-02-10T08:00:08Z"
      status: "True"
      type: RetrievedUpdates
    desired:
      channels:
      - nightly-4.7
      image: quay.io/openshift-release-dev/ocp-release@sha256:497fa748b89619aba312a926a0be0ad155c4b894ca3e2824a00167421e3441b0
      url: https://access.redhat.com/errata/RHBA-2020:5633
      version: 4.7.0-rc.0
    history:
    - completionTime: null
      image: quay.io/openshift-release-dev/ocp-release@sha256:497fa748b89619aba312a926a0be0ad155c4b894ca3e2824a00167421e3441b0
      startedTime: "2021-02-10T08:02:20Z"
      state: Partial
      verified: true            <----------- It's as expected
      version: 4.7.0-rc.0
    - completionTime: "2021-02-10T07:24:27Z"
      image: registry.ci.openshift.org/ocp/release@sha256:ee7e4d58dd6e001d4f621e5cbf893d203910c2d3c2dce88b8ac65538eb1f1411
      startedTime: "2021-02-10T06:56:41Z"
      state: Completed
      verified: false
      version: 4.7.0-0.nightly-2021-02-06-022227
    observedGeneration: 3
    versionHash: Q_5mmldRekY=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


The verified state is now true. Moving this to VERIFIED.

Comment 19 Neil Girard 2021-02-10 21:09:37 UTC
Thanks guys for taking a long look at it.  It was definitely troubling.  I'm excited to see the fix!

Comment 20 Patrick Strick 2021-02-10 23:15:18 UTC
For reference in case you need to manually verify the release's signature, the following steps can be performed:

## Steps to verify cluster version release signature

1. Get SHA for running version  
`oc get clusterversion -o jsonpath='{.items[].status.history[0].image}'`
2. Copy the SHA value which is given after `sha256:` to use in step 3.
3. Get and verify the signature used to sign the release package that your cluster shows it is running.  
   1. On a RHEL 7 or 8 host, import the Red Hat public keys into your gpg keyring.  
   `gpg --import /etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-beta`
   `gpg --import /etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release`
   2. Get and verify the signature  
   `curl -s https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=[SHA_VALUE]/signature-1 | gpg --decrypt`
4. Confirm that the SHA in the `.critical.image["docker-manifest-digest"]` value matches the SHA for your running version from step 1.
5. Confirm that the RSA key used to decrypt the signature from step 3 is one of the product signing keys listed on https://access.redhat.com/security/team/key.
6. You can also verify that the Primary key fingerprint matches the corresponding entry on https://access.redhat.com/security/team/key.

## Example

1. Get SHA for running version
    ```sh
    $ oc get clusterversion -o jsonpath='{.items[].status.history[0].image}'

    quay.io/openshift-release-dev/ocp-release@sha256:c8746322036ee50c9b015404f6bcae3003f537b128641bd29ad9ab4048c7b274
    ```
2. Copy `c8746322036ee50c9b015404f6bcae3003f537b128641bd29ad9ab4048c7b274`
3. Get and verify the signature used to sign the release package that your cluster shows it is running.
    ```sh
    $ curl -s https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=c8746322036ee50c9b015404f6bcae3003f537b128641bd29ad9ab4048c7b274/signature-1 | gpg --decrypt

    {"critical": {"image": {"docker-manifest-digest": "sha256:c8746322036ee50c9b015404f6bcae3003f537b128641bd29ad9ab4048c7b274"}, "type": "atomic container signature", "identity": {"docker-reference": "quay.io/openshift-release-dev/ocp-release-nightly:4.7.0-fc.5-x86_64"}}, "optional": {"creator": "Red Hat OpenShift Signing Authority 0.0.1"}}gpg: Signature made Mon 01 Feb 2021 02:20:39 AM EST
    gpg:                using RSA key 938A80CAF21541EB
    gpg: Good signature from "Red Hat, Inc. (beta key 2) <security>" [unknown]
    gpg: WARNING: This key is not certified with a trusted signature!
    gpg:          There is no indication that the signature belongs to the owner.
    Primary key fingerprint: B08B 659E E86A F623 BC90  E8DB 938A 80CA F215 41EB
    ```
4. Confirm that the SHA from your cluster matches the SHA in the encrypted signature.  
    ```
    {"critical": {"image": {"docker-manifest-digest": "sha256: **c8746322036ee50c9b015404f6bcae3003f537b128641bd29ad9ab4048c7b274** "}

    ==  
    
    quay.io/openshift-release-dev/ocp-release@sha256: **c8746322036ee50c9b015404f6bcae3003f537b128641bd29ad9ab4048c7b274**
    ```
5. Confirm the RSA key used to decrypt the signature is a Red Hat product signing key.
On page https://access.redhat.com/security/team/key, RSA key `938A80CAF21541EB` is listed as `4096R/938a80caf21541eb (2009-02-24): Red Hat, Inc. (beta key 2) <security>`

Comment 23 errata-xmlrpc 2021-02-24 15:53:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

