Bug 1906100 - Disconnected cluster upgrades are failing from the cli, when signature retrieval is being blackholed instead of quickly rejected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jack Ottofaro
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1918494
 
Reported: 2020-12-09 16:58 UTC by Jonathan Edwards
Modified: 2021-08-06 08:41 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:41:51 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 493 0 None closed Bug 1906100: use child context to verify payload signature on forced update 2021-02-18 03:28:06 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:42:08 UTC

Description Jonathan Edwards 2020-12-09 16:58:13 UTC
Disconnected OpenShift CLI-based upgrades seem to be failing more regularly now. In these disconnected environments the cluster has no access to the api.openshift.com, mirror.openshift.com, or storage.googleapis.com endpoints to load the signatures, so we run the upgrade against the sha with the --force flag and the --allow-explicit-upgrade option:
# oc version
Client Version: 4.6.6
Server Version: 4.6.1
Kubernetes Version: v1.19.0+d59ce34
# oc adm upgrade --allow-explicit-upgrade --force --allow-upgrade-with-warnings --to-image=quay.io/openshift-release-dev/ocp-release@sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39

In the past this would proceed to pull the image, extract the version tag, and then continue with the remainder of the upgrade logic. What we're seeing now (4.5.2+ and 4.6.z) is that the upgrade fails somewhere around downloading the image and loops with a "failed to download" message. The CVO log complains that it cannot validate the signature (as it did in the past) and, after downloading the update, loops complaining that the download failed:

I1209 15:50:51.579895       1 cvo.go:406] Started syncing cluster version "openshift-cluster-version/version" (2020-12-09 15:50:51.579874617 +0000 UTC m=+1375900.207690373)
I1209 15:50:51.579991       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66
c8300662ef30563d7104f39", Force:true}
I1209 15:50:51.580041       1 sync_worker.go:222] Update work is equal to current target; no change required
I1209 15:50:51.580057       1 status.go:159] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:12, Step:"RetrievePayload", Failure:error(nil), Fraction:0, Complete
d:0, Reconciling:false, Initial:false, VersionHash:"", LastProgress:time.Time{wall:0xbfec5a2365dd8863, ext:1375566263091063, loc:(*time.Location)(0x26b0400)}, Actual:v1.Release{Version:"", I
mage:"quay.io/openshift-release-dev/ocp-release@sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39", URL:"", Channels:[]string(nil)}, Verified:false}
I1209 15:50:51.580163       1 status.go:79] merge into existing history completed=false desired=v1.Release{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:c7e8f18e8116356
701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63743125521, loc:
(*time.Location)(0x26b0400)}}, CompletionTime:(*v1.Time)(nil), Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d71
04f39", Verified:false}
I1209 15:50:51.580293       1 cvo.go:408] Finished syncing cluster version "openshift-cluster-version/version" (416.053µs)
I1209 15:50:55.260829       1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version
I1209 15:50:59.548377       1 sigstore.go:95] unable to load signature: Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=c7e8f18e8116356701bd23ae3a23fb9
892dd5ea66c8300662ef30563d7104f39/signature-1": context deadline exceeded
I1209 15:50:59.548369       1 sigstore.go:95] unable to load signature: Get "https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=c7e8f18e8116356701b
d23ae3a23fb9892dd5ea66c8300662ef30563d7104f39/signature-1": dial tcp 172.217.15.112:443: i/o timeout
I1209 15:50:59.548432       1 verify.go:154] error retrieving signature for sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39: Get "https://mirror.openshift.com/pub/ope
nshift-v4/signatures/openshift/release/sha256=c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39/signature-1": context deadline exceeded
I1209 15:50:59.548451       1 verify.go:173] Failed to retrieve signatures for sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39 (should never happen)
W1209 15:50:59.548458       1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d710
4f39" that failed verification: The update cannot be verified: context deadline exceeded
W1209 15:50:59.548588       1 updatepayload.go:206] failed to prune jobs: context deadline exceeded
E1209 15:50:59.548712       1 sync_worker.go:348] unable to synchronize image (waiting 2m50.956499648s): Unable to download and prepare the update: context deadline exceeded
I1209 15:50:59.548767       1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39" failure=Unable to download and prepare the update: context deadline exceeded
I1209 15:51:05.345520       1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version

I have validated that we can pull the images fine with podman on the host, as well as on the nodes of the cluster through the ImageContentSourcePolicy. I have also validated that we can pull all the images within the update in the same manner. I have also tried putting the signature in a ConfigMap in the openshift-config-managed namespace and running without the --force flag, but I do not have a validated procedure for this.

Comment 1 Jonathan Edwards 2020-12-09 17:03:48 UTC
That should read 4.5.21+ and 4.6.1+; earlier 4.5 upgrades in this manner were okay.

Comment 2 Vadim Rutkovsky 2020-12-09 18:01:58 UTC
>I1209 15:50:59.548432       1 verify.go:154] error retrieving signature for sha256:c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39: Get "https://mirror.openshift.com/pub/ope
nshift-v4/signatures/openshift/release/sha256=c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39/signature-1": context deadline exceeded

Your cluster can't reach https://mirror.openshift.com to verify the signature

Comment 3 Vadim Rutkovsky 2020-12-09 18:02:59 UTC
You must have access to mirror.openshift.com to pass this step, or set "force" in ClusterVersion to skip it.

Please attach must-gather

Comment 4 W. Trevor King 2020-12-09 18:52:10 UTC
The fact that your cluster is trying to pull signatures over HTTPS at all suggests you may have fumbled something when you attempted to add the signature ConfigMap to the openshift-config-managed namespace.  Please provide steps for how you attempted that, or we can try to reconstruct based on the must-gather Vadim requested in comment 3.

Comment 5 Jonathan Edwards 2020-12-09 21:11:27 UTC
Thanks - yes, reattempting the openshift-config-managed ConfigMap seems to have worked fine as per:
https://docs.openshift.com/container-platform/4.6/updating/updating-restricted-network-cluster.html#updating-restricted-network-image-signature-configmap

# cat<<EOF | oc create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: release-image-4.6.6
  namespace: openshift-config-managed
  labels:
    release.openshift.io/verification-signatures: ""
binaryData:
  sha256-c7e8f18e8116356701bd23ae3a23fb9892dd5ea66c8300662ef30563d7104f39: owGbwMvMwMEoOU9/4l9n2UDGtYxJSWLxRQW5xZnpukWphboebpUmlY56SZl58fvXTahWSi7KLMlMTsxRslKoVsrMTUxPBbNS8pOzU4t0cxPzMtNSi0t0UzLTgRRQSqk4I9HI1Mwq2TzVIs3QItXC0NDM2NTM3MAwKcXIODHVONHIOC3J0sLSKCXFNDXRzCzZwtjAwMzMKDXN2MDUzDjF3NDAJM3YUqlWR0GppLIAZJ1SYkl+bmayQnJ+XkliZl5qkQLQtXmJJaVFqUpAVZkpqXklmSWVyA4rSk1LLUrNSwZrLyxNrNTLzNfPL0jNK87ITCsBSuekJhan6qaklunnJxfA+FYmemZ6ZroVFmbxZiZKtSAn5BeUZObnQf2fXJQKdEoRyMyg1BQFj8QSBX+gmcEgMxWCgW7KzEtXcCwtycgHhlqlgoGegZ4h0JhOJlFmVgZQeMLDnWNzCv9vdt9TPXOTrDSnc2wvXpY1X/6H/UrLipMBFXU3s7i31tyP/8+h7PhRcUk2z1ePBWf782+rrGRPZ5g9b7l8XkX7hD2/Tq66+Erix9WW/woHup/9DGV/+rbxbBPb17+/vsypOeeaEnutcMEzBu6j/yXn5c0+Iu+dsyXj80eH2tNMGb+t426tKtBLStLsXmy/5YGq1cUvH7qP/5ptkx2WcGP95I3yYlOecGhPWWtxkGG19N59ERZ/w53O7Ko69LVI73jP5KCH2t4yKiKPPossvZOwzHDOftXukIVL0hTVBNWNjm75lNi/Kj7BYhLH/f47j7gqrn3jrv5y8crc57GSEa8WTe/6wfvFS3u5yEnDB4Gu2W+sfiZcVdFvvGmnNI83RPvQMf3f5yd8SzBSYI10mPCvPq6k+bvtnexXRl1WqhcPf1y5SJLxv67kLm+nzdfdu/pD/Qz3RuybWJj48+b9z9zN68o0depFjigWVDz4vaU8a3LtoTS+NcGzolfclHdhm3hJ5YuJfI/87Ycnf5W9cjpjFrr4eb/FmzNHEsP7WF/Efo51nP52vqW3sk/p2TzF8zkZV1fqtu56dzzxdS3XZYb/u2R99Y8JnP8d3yz9acWOiCgnHtarra7GTD+U9wl/YQ33Cnt05ea7P2yK/8OcZl9YICyonbhx1uLSoAMzAoIN5iVVcR2zKCnNk+8wqVtjIPwsYLUsy0XGlvWx8WV3AA==
EOF

# oc get cm -n openshift-config-managed
NAME                                                  DATA   AGE
bound-sa-token-signing-certs                          1      33d
console-public                                        1      33d
csr-controller-ca                                     1      33d
default-ingress-cert                                  1      33d
grafana-dashboard-cluster-total                       1      33d
grafana-dashboard-etcd                                1      33d
grafana-dashboard-k8s-resources-cluster               1      33d
grafana-dashboard-k8s-resources-namespace             1      33d
grafana-dashboard-k8s-resources-node                  1      33d
grafana-dashboard-k8s-resources-pod                   1      33d
grafana-dashboard-k8s-resources-workload              1      33d
grafana-dashboard-k8s-resources-workloads-namespace   1      33d
grafana-dashboard-node-cluster-rsrc-use               1      33d
grafana-dashboard-node-rsrc-use                       1      33d
grafana-dashboard-prometheus                          1      33d
kube-apiserver-aggregator-client-ca                   1      33d
kube-apiserver-client-ca                              1      33d
kube-apiserver-server-ca                              1      33d
kubelet-bootstrap-kubeconfig                          1      33d
kubelet-serving-ca                                    1      33d
monitoring-shared-config                              4      33d
oauth-openshift                                       1      33d
ocp-upgrade-4.6.6                                     0      22h
release-image-4.6.6                                   1      135m
release-verification                                  3      33d
sa-token-signing-certs                                2      33d
service-ca                                            1      33d
signatures-managed                                    0      33d
trusted-ca-bundle                                     1      33d

--
I can see the binary data count is 0 in the previous ocp-upgrade-4.6.6 ConfigMap. Also, is the name of the ConfigMap significant?

It seems, though, that the --force flag is no longer respected to bypass signature validation. Is this by design?
I will recommend moving to the ConfigMap method going forward.

Comment 6 W. Trevor King 2020-12-09 22:12:22 UTC
> ... is the name of the configmap significant?

No, the cluster-version operator just hunts for the release.openshift.io/verification-signatures label in the openshift-config-managed namespace [1].

> it seems though that the --force flag doesn't seem to be respected to bypass signature validation now.

The cluster-version operator should still run all the checks, e.g. attempting to hunt down valid signatures.  But when a check fails, forcing should waive the failure and carry on with the update regardless.  Maybe we have a bug there around context timeouts...

[1]: https://github.com/openshift/library-go/blob/19c8a18cddcd49ee18b34531a18122f0e3844cfa/pkg/verify/store/configmap/configmap.go#L25-L33
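
The lookup can be pictured as a simple label filter. This is a stdlib-only sketch, not the actual library-go code (which goes through the Kubernetes client); the types here are illustrative stand-ins:

```go
package main

import "fmt"

// configMap is a minimal stand-in for a Kubernetes ConfigMap's metadata
// plus its binaryData payload.
type configMap struct {
	Name       string
	Labels     map[string]string
	BinaryData map[string][]byte
}

// signatureStores returns the ConfigMaps carrying the label the CVO hunts
// for in openshift-config-managed; the ConfigMap name is irrelevant.
func signatureStores(cms []configMap) []configMap {
	const label = "release.openshift.io/verification-signatures"
	var out []configMap
	for _, cm := range cms {
		if _, ok := cm.Labels[label]; ok {
			out = append(out, cm)
		}
	}
	return out
}

func main() {
	cms := []configMap{
		{Name: "release-image-4.6.6", Labels: map[string]string{"release.openshift.io/verification-signatures": ""}},
		{Name: "trusted-ca-bundle"},
	}
	for _, cm := range signatureStores(cms) {
		// Only the labelled ConfigMap matches, regardless of its name.
		fmt.Println(cm.Name)
	}
}
```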

Comment 7 Johnny Liu 2020-12-29 09:49:23 UTC
I also hit the same issue here when upgrading a disconnected cluster from 4.6.9 to 4.7.0-0.nightly-2020-12-21-131655 with --force option.

Comment 9 To Hung Sze 2021-01-04 14:32:49 UTC
Added upgrade-blocker keyword as it may block upgrade regression testing.

Comment 10 Lalatendu Mohanty 2021-01-05 16:43:20 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 11 Lalatendu Mohanty 2021-01-05 16:54:06 UTC
As per the documentation we suggest the signature ConfigMap route to update air-gapped clusters. So removing the UpgradeBlocker keyword and setting the severity to high.

Comment 12 Lalatendu Mohanty 2021-01-05 18:06:35 UTC
As the documented steps work fine, reducing the severity to medium and the priority to low.

Comment 13 Johnny Liu 2021-01-06 03:57:29 UTC
For our internal testing, especially for unsigned release images, a forced upgrade of an air-gapped cluster is still often required, so I think it is better to fix this ASAP.

Comment 14 Jack Ottofaro 2021-01-06 14:59:13 UTC
I believe the context timeout is short-circuiting any further processing of the update that would normally occur with "force" set to true. Perhaps a child context is needed at [1] that can be handled there locally depending on the value of force.

[1] https://github.com/openshift/cluster-version-operator/blob/1e51a0e4750ca110d4659f33bce210a3de6844b9/pkg/cvo/updatepayload.go#L91

Comment 15 To Hung Sze 2021-01-12 03:38:51 UTC
@lmohanty 
Does Johnny's input above answer the questions you asked when you set the NeedInfo flag?

Comment 16 W. Trevor King 2021-01-13 21:26:28 UTC
Jack's proposal in comment 14 makes sense to me.  Picking the timeout for a child context sounds fiddly, but we could also pass down two Context arguments if we feel too jumpy making a local decision about how much time is on a single Context that got passed in.  Probably worth working out the chain down from wherever is setting the current timeout before we pick where to set the child timeout.

Comment 18 Johnny Liu 2021-01-20 05:38:57 UTC
Retest this bug with 4.7.0-0.nightly-2021-01-19-095812, still fail.

01-20 12:22:45 Command: oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315 --force
01-20 12:22:45 warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
01-20 12:22:45 warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
01-20 12:22:45 Updating to release image registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315

01-20 12:27:45 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 12:32:46 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 12:37:46 Status: Unable to apply registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: could not download the update Progress: True Available: True

01-20 12:42:47 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 12:47:48 Status: Unable to apply registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: could not download the update Progress: True Available: True

01-20 12:52:49 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 12:57:49 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 13:02:50 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 13:07:50 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 13:12:51 Status: Unable to apply registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: could not download the update Progress: True Available: True

01-20 13:17:51 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 13:22:52 Status: Unable to apply registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: could not download the update Progress: True Available: True

01-20 13:27:54 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

01-20 13:32:54 Status: Working towards registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315: downloading update Progress: True Available: True

Comment 19 W. Trevor King 2021-01-20 06:22:02 UTC
> Retest this bug with 4.7.0-0.nightly-2021-01-19-095812, still fail.

Looks like you were trying to update from an unspecified release to 4.7.0-0.nightly-2021-01-19-095812:

  $ oc adm release info registry.ci.openshift.org/ocp/release@sha256:ac57098ad18ed07977b54b90be79dc44f34eb03e42e0be2a95963a316bcde315 | head -n1
  Name:           4.7.0-0.nightly-2021-01-19-095812

But the bug fix needs to be in the outgoing release to matter, because it's the outgoing release that's trying to verify the desired target. Can you install 4.7.0-0.nightly-2021-01-19-095812 and then try to update out to some other release? (It doesn't really matter what the target is, as long as the target is accepted and a CVO is launched to start attempting to apply it.)

Comment 20 Johnny Liu 2021-01-20 07:08:12 UTC
I was upgrading from 4.6.9 to 4.7.0-0.nightly-2021-01-19-095812, so the outgoing release is 4.6.9. Per your statement, once the 4.7 nightly build fixes this issue, we need to backport the fix to 4.6?

Comment 21 W. Trevor King 2021-01-20 07:29:08 UTC
This bug targets 4.7, so we should be able to verify as it stands with 4.7.0-0.nightly-2021-01-19-095812 -> whatever.  Once this bug is VERIFIED, we can clone the bug back to 4.6.z to fix 4.6.(fixed) -> whatever.  And then we may want to keep backporting to 4.5 and earlier, although 4.4 might go end-of-life before we get back that far [1].  Looking at the pkg/cvo/updatepayload.go history, this is not a regression.  Although it's possible that 4.5->whatever, etc. are not vulnerable for some other reason.  Would be good to check.

[1]: https://access.redhat.com/support/policy/updates/openshift#dates

Comment 22 Johnny Liu 2021-01-20 10:31:36 UTC
Reproduced this bug upgrading from 4.7.0-0.nightly-2021-01-13-124141 to 4.7.0-0.nightly-2021-01-19-095812.
Verified this bug upgrading from 4.7.0-0.nightly-2021-01-18-053817 to 4.7.0-0.nightly-2021-01-19-095812.

Comment 25 errata-xmlrpc 2021-02-24 15:41:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

