Bug 1825396 - verified=true even when 'force' was set to override some preconditions
Summary: verified=true even when 'force' was set to override some preconditions
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: ---
Assignee: Jack Ottofaro
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-04-17 21:52 UTC by W. Trevor King
Modified: 2021-07-14 20:34 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-14 20:34:55 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description W. Trevor King 2020-04-17 21:52:43 UTC
On 4.3.10 -> 4.3.13, starting a verified update (with force false) and then updating to set force true will cause the cluster-version operator to bypass any blocking preconditions (appropriate) but not update the history to set verified false (bug).  Getting stuck on a precondition (like bug 1822752):

$ oc get -o json clusterversion version | jq -r .status.desired.version
4.3.10
$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
...no output...
$ oc patch scc privileged --type json -p '[{"op": "add", "path": "/users/-", "value": "kubeadmin"}]'
$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
2020-04-17T21:16:13Z kube-apiserver DefaultSecurityContextConstraints_Mutated
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.3"}]'
$ oc adm upgrade --to 4.3.13
Updating to 4.3.13
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-17T20:54:15Z RetrievedUpdates True
2020-04-17T21:11:52Z Available True Done applying 4.3.10
2020-04-17T21:17:47Z Progressing True Unable to apply 4.3.13: it may not be safe to apply this update
2020-04-17T21:17:47Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
2020-04-17T21:18:07Z Failing True Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]

Patching force to waive the failures:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate/force", "value": true}]'
clusterversion.config.openshift.io/version patched

Update moves forward:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-17T20:54:15Z RetrievedUpdates True
2020-04-17T21:11:52Z Available True Done applying 4.3.10
2020-04-17T21:17:47Z Progressing True Unable to apply 4.3.13: the update could not be applied
2020-04-17T21:17:47Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
2020-04-17T21:20:14Z Failing True Could not update deployment "openshift-cluster-version/cluster-version-operator" (5 of 498)
$ oc get --watch clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.10    True        True          3m42s   Working towards 4.3.13: 13% complete
version   4.3.10    True        True          4m42s   Working towards 4.3.13: 16% complete
version   4.3.10    True        True          6m12s   Working towards 4.3.13: 23% complete
...

But history still shows verified true:

$ oc get -o json clusterversion version | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-04-17T21:17:47Z  4.3.13 Partial true
2020-04-17T20:54:15Z 2020-04-17T21:11:52Z 4.3.10 Completed false

We need to set verified=false if any precondition failures were waived.

Comment 1 Lalatendu Mohanty 2020-05-19 11:41:26 UTC
Moving this to 4.6.0 as we have higher priority bugs for 4.5.

Comment 2 Lalatendu Mohanty 2020-05-19 11:41:55 UTC
Also setting sev/prio to low as it is does not impact the functionality and relatively a corner case.

Comment 4 Abhinav Dahiya 2020-07-02 00:43:53 UTC
> We need to set verified=false if any precondition failures were waived.
> https://github.com/openshift/api/blob/de5b010b2b386867fc61c4f718080755f9707748/config/v1/types_cluster_version.go#L180-L184
```
	// verified indicates whether the provided update was properly verified
	// before it was installed. If this is false the cluster may not be trusted.
	// +kubebuilder:validation:Required
	// +required
	Verified bool `json:"verified"`
```

Why would be set verified false if integrity checks i.e. correctly signed passes for a release image even though the preconditions like not-upradeable are true ??

Comment 5 Abhinav Dahiya 2020-07-02 00:45:14 UTC
I think the verified field is in place only for Signature checking and not if other aspect of viability of the upgrade have passed/failed.

Comment 6 W. Trevor King 2020-07-02 01:21:38 UTC
The fact that there's ambiguity around whether non-signature checks are counted sounds like at least "we need to clarify the API godocs" to me.  I would like to have a record of whether these preconditions are waived.  As long as the trusted keys are RH-managed, we can already recover signed-ness in post-mortems from history[].Image digests, because we can look for a matching signature in our store.  But it's harder to reconstruct the state of Upgradeable conditions or other preconditions that may have been waved.  I'd be even happier if the API was a string summarizing overridden conditions (signatures, preconditions, etc.), but... schema growth is so much work :p.  So:

a. I'm in the minority, Abhinav's position that verified is just about signatures is the consensus, and this bug should be an API bug about clarifying the godocs.
b. My initial understanding that all upgrade verification, including preconditions and eventually preflights, feeds verified is the consensus.  This bug should be about clarifying API godocs and adjusting CVO behavior to match.
c. Other options?

Thoughts?

Comment 7 Jack Ottofaro 2020-07-02 15:45:17 UTC
(In reply to W. Trevor King from comment #6)
...
> a. I'm in the minority, Abhinav's position that verified is just about
> signatures is the consensus, and this bug should be an API bug about
> clarifying the godocs.
> b. My initial understanding that all upgrade verification, including
> preconditions and eventually preflights, feeds verified is the consensus. 
> This bug should be about clarifying API godocs and adjusting CVO behavior to
> match.
> c. Other options?
> 
> Thoughts?

What is Abinav's "position" and your "initial understanding" based on? I'm really asking what's the requirement and from where does it come? Asked differently, do we know why the Verified field was originally created?

Comment 8 W. Trevor King 2020-07-02 15:53:28 UTC
Verified landed in the API via [1,2] to track forcing.  At the time, the only check around to be forced was signature verification.  Now we also have preconditions.  In the future, we will have preflights.  The API godocs which establish the semantics do not sufficiently distinguish between the "signatures only" and "all pre-update condition checks" interpretation, and Ahbinav is already here.  I guess we could loop in Clayton too and see if he has an opinion?  They were the only two involved in [2].

[1]: https://github.com/openshift/api/commit/ab4ff93d200a6c3e9d2fcdf0a161a6ada9dfd716
[2]: https://github.com/openshift/api/pull/293

Comment 9 Jack Ottofaro 2020-07-10 12:21:15 UTC
Adding UpcomingSprint keyword. The bug needs additional work which will not be completed by the end of this Sprint.

Comment 10 Jack Ottofaro 2020-07-13 20:39:25 UTC
Based on the provided history and the current api doc's use of the term "verified", I think we should use this bug to clarify the godocs (Trevor's option a). If we feel that use of --force should be extended beyond just signature verification (option b), work that through a RFE Jira to determine implementation, e.g. whether we add another flag or simply expand scope of --force, etc.

Comment 11 W. Trevor King 2020-07-17 22:09:55 UTC
My options in comment 6 were just about the 'verified' semantics.  We can have a whole separate discussion about the --force semantics.  oc seems a bit inconsistent around it today, with [1] saying it can waive verification errors or precondition failures, [2] just talking about signature verification (and including an "If ..." non-sentence), and [3] also just talking about verification.

[1]: https://github.com/openshift/oc/blob/c33851e73e8e2ed66e00da81fa30cee493ac76e2/pkg/cli/admin/upgrade/upgrade.go#L196
[2]: https://github.com/openshift/oc/blob/c33851e73e8e2ed66e00da81fa30cee493ac76e2/pkg/cli/admin/upgrade/upgrade.go#L57-L61
[3]: https://github.com/openshift/oc/blob/c33851e73e8e2ed66e00da81fa30cee493ac76e2/pkg/cli/admin/upgrade/upgrade.go#L83

Comment 12 Jack Ottofaro 2020-07-21 13:47:14 UTC
(In reply to W. Trevor King from comment #11)
> My options in comment 6 were just about the 'verified' semantics.

Exactly and hence my first and main point/opinion that this bug should be an API bug about clarifying the godocs.

Beyond that I'm just saying that further consideration of the semantics of 'verified' and/or the "--force" option seem more appropriate as a Jira.

Comment 13 Lalatendu Mohanty 2020-07-30 14:41:01 UTC
> Exactly and hence my first and main point/opinion that this bug should be an API bug about clarifying the godocs.

+1 Verified only used for image signature verification.


> Beyond that I'm just saying that further consideration of the semantics of 'verified' and/or the "--force" option seem more appropriate as a Jira.

I do not see the benefit mixing Verified with --force.

Comment 14 Jack Ottofaro 2020-07-30 19:57:30 UTC
Adding UpcomingSprint keyword. Bug resolution still being discussed.

Comment 15 W. Trevor King 2020-08-01 19:55:53 UTC
We discussed this a bit in the 2020-07-30 arch call.  Didn't actually spend much time on 'verified', but 'force' got kicked around a bit.  Erica raised concerns about force covering both inability to find a valid signature (which could be "restricted-network, and admins haven't bothered to copy the associate signature into a local ConfigMap", not so bad, or "admin accidentally pointing cluster at a malicious image pullspec", which is very bad) and failing preconditions (e.g. operator A claims Upgradeable=False for a minor bump, but admins are confident they aren't actually impacted).  Abhinav pointed out that the goal was to keep the API and CVO implementation simple, and split intelligent checks and scoped overrides out into clients.  Colin pointed out that expecting client-side validation via 'oc' might lead to surprises if/when folks write their own code to drive the less-specific ClusterVersion API.  I pointed out, sort of a corollary to Colin's point, that if we rely on intelligent client-side verification, we'd need to re-implement it in all clients (e.g. both 'oc' and the web console), or accept that some clients (like the web-console today) will expose a less-rich interface than others (like 'oc' today).

On the 'verified' front, we just summarized the existing discussion from the bug, mostly as given in comment 6.

So, lots of discussion, but still no clear consensus around intended semantics for either 'force' or 'verified', as far as I can tell.

Comment 16 W. Trevor King 2020-08-21 22:27:41 UTC
Still working on a consensus opinion for the intended semantics.

Comment 17 W. Trevor King 2020-09-12 21:01:06 UTC
Comment 16 is still current.

Comment 18 W. Trevor King 2020-10-02 23:17:08 UTC
Comment 16 is still current.

Comment 19 W. Trevor King 2020-10-25 15:50:58 UTC
Comment 16 is still current.

Comment 20 Lalatendu Mohanty 2020-11-16 13:11:33 UTC
Clearing the needinfo as I believe this has been answered already.

Comment 21 W. Trevor King 2021-04-07 20:39:32 UTC
I've opened [1] to catch the API docs up with the current reality, which will hopefully unblock the "still no clear consensus" from comment 15.

[1]: https://github.com/openshift/api/pull/885

Comment 23 W. Trevor King 2021-07-14 20:34:55 UTC
api#885 just landed, including:

  Verified does not cover upgradeable checks that depend on the cluster state at the time when the update target was accepted.

which brings those docs into line with the long-standing CVO implementation.


Note You need to log in before you can comment on or make changes to this bug.