Bug 1822922

Summary: cluster-version operator stops applying manifests when blocked by a precondition check
Product: OpenShift Container Platform
Reporter: Scott Dodson <sdodson>
Component: Cluster Version Operator
Assignee: W. Trevor King <wking>
Status: CLOSED DEFERRED
QA Contact: liujia <jiajliu>
Severity: high
Priority: unspecified
Version: 4.4
CC: adahiya, aos-bugs, jiajliu, jokerman, wking
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Clone Of: 1822752
Last Closed: 2020-07-15 00:34:06 UTC
Bug Depends On: 1822752, 2064991
Bug Blocks: 1822923

Description Scott Dodson 2020-04-10 13:58:12 UTC
+++ This bug was initially created as a clone of Bug #1822752 +++

For example, blocking on 4.3.10 -> 4.3.11 via the mechanism in bug 1821905:

$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
...no output...
$ oc patch scc privileged --type json -p '[{"op": "add", "path": "/users/-", "value": "kubeadmin"}]'
$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
2020-04-09T18:12:01Z kube-apiserver DefaultSecurityContextConstraints_Mutated
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.3"}]'
$ oc adm upgrade --to 4.3.11
Updating to 4.3.11
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-09T18:00:34Z Available True Done applying 4.3.10
2020-04-09T18:13:10Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
2020-04-09T18:27:21Z RetrievedUpdates True
2020-04-09T18:27:49Z Progressing True Unable to apply 4.3.11: it may not be safe to apply this update
2020-04-09T18:28:19Z Failing True Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
$ oc get -o json clusterversion version | jq -r '.status.availableUpdates[].version'
4.3.11
$ oc adm upgrade --to 4.3.10
error: The update 4.3.10 is not one of the available updates: 4.3.11
$ oc -n openshift-cluster-version get -o json pods | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator")).metadata.name'
cluster-version-operator-7bbc4c5dcc-w287k
$ oc -n openshift-cluster-version get -o json pods | jq -r '.items[] | select(.metadata.name == "cluster-version-operator-7bbc4c5dcc-w287k").spec.containers[].image'
registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
$ oc image info -o json registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd | jq -r '.config.config.Labels["io.openshift.release"]'
4.3.10
$ oc -n openshift-cluster-version logs cluster-version-operator-7bbc4c5dcc-w287k
I0409 17:38:45.549400       1 start.go:19] ClusterVersionOperator v4.3.10-202003311428-dirty
...
I0409 18:26:05.631324       1 sync_worker.go:634] Done syncing for clusteroperator "service-ca" (388 of 498)
I0409 18:26:05.631365       1 task_graph.go:524] Graph is complete
I0409 18:26:05.631390       1 task_graph.go:587] No more work for 1
I0409 18:26:05.631402       1 task_graph.go:587] No more work for 0
I0409 18:26:05.631414       1 task_graph.go:603] Workers finished
I0409 18:26:05.631425       1 task_graph.go:611] Result of work: []
...
I0409 18:27:47.543958       1 sync_worker.go:471] Running sync 4.3.11 (force=false) on generation 3 in state Updating at attempt 0
...
I0409 18:40:26.944180       1 sync_worker.go:471] Running sync 4.3.11 (force=false) on generation 3 in state Updating at attempt 4
I0409 18:40:26.944219       1 sync_worker.go:477] Loading payload
I0409 18:40:27.247572       1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/0vmj3337PDbtCKhSPEWayQ"
E0409 18:40:27.549821       1 precondition.go:49] Precondition "ClusterVersionUpgradeable" failed: Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]

So that is the 4.3.10 CVO saying:

> They've asked me to update to 4.3.11, let me stop applying manifests and take a look at the preconditions.  Oh no!  A precondition check is failing!  I will complain about it until it gets fixed, but in the meantime I will do nothing about manifest reconciliation and hope that nobody in the cluster is stomping on manifests which I'm supposed to be monitoring.

I think that's a bug, and that we want the CVO pod to continue to reconcile manifests while it works through update preconditions to vet the proposed target.

It should also be possible for admins to say "ah, precondition failed, please forget I asked and return to the source version".  As the "error: The update 4.3.10 is not one of the available updates: 4.3.11" shows, that is not currently possible either (without using --force or other risky stuff).
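
A minimal sketch of the behaviour being asked for, written in Python for brevity. Every name in it (`SyncState`, `sync_once`, `preconditions_ok`) is hypothetical and nothing here is taken from the CVO source. It only illustrates two ideas from the paragraphs above: a failed precondition on the target should leave the operator reconciling the current release, and retargeting the current release should clear the pending request.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SyncState:
    """Hypothetical stand-in for the operator's view of the cluster."""
    current: str                  # release whose manifests are applied
    target: Optional[str] = None  # release the admin asked for, if any
    log: List[str] = field(default_factory=list)


def sync_once(state: SyncState, preconditions_ok: Callable[[str], bool]) -> None:
    """One pass of a sync loop that never skips reconciliation."""
    if state.target == state.current:
        # The admin retargeted the running release: forget the request.
        state.target = None
    if state.target is not None:
        if preconditions_ok(state.target):
            state.log.append(f"accepted {state.target}")
            state.current, state.target = state.target, None
        else:
            # Report the failure, then fall through to reconciliation.
            state.log.append(f"precondition failed for {state.target}")
    state.log.append(f"reconciled {state.current}")


# Replay the scenario from this report.
state = SyncState(current="4.3.10", target="4.3.11")
sync_once(state, lambda release: False)  # Upgradeable=False blocks 4.3.11
state.target = "4.3.10"                  # "please forget I asked"
sync_once(state, lambda release: False)
print(state.log)
```

In this sketch both passes end with a `reconciled 4.3.10` entry, and the second pass drops the 4.3.11 request without needing any override flag.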

--- Additional comment from W. Trevor King on 2020-04-09 14:54:14 EDT ---

Also, without dipping into .status.history, there is nothing in ClusterVersion to show that we're actually still running 4.3.10:

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 3
  name: version
  resourceVersion: "27378"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:28:19Z"
    message: 'Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated":
      Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
      Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:27:49Z"
    message: 'Unable to apply 4.3.11: it may not be safe to apply this update'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
      Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  history:
  - completionTime: null
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 3
  versionHash: vSLGMQhseGg=
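
The running release can still be recovered from .status.history by taking the newest entry whose state is Completed. A small sketch in Python, assuming the JSON form of the object above (for example from `oc get -o json clusterversion version`). The helper name `running_version` is made up for illustration, the sample dictionary is trimmed to the fields used here, and the code assumes history is ordered newest-first, as it is in the output above.

```python
from typing import Optional


def running_version(cluster_version: dict) -> Optional[str]:
    """Return the version of the newest Completed history entry, if any."""
    for entry in cluster_version.get("status", {}).get("history", []):
        if entry.get("state") == "Completed":
            return entry.get("version")
    return None


# Trimmed copy of the status shown above.
sample = {
    "status": {
        "desired": {"version": "4.3.11"},
        "history": [
            {"state": "Partial", "version": "4.3.11"},
            {"state": "Completed", "version": "4.3.10"},
        ],
    }
}

print(running_version(sample))  # 4.3.10
```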

--- Additional comment from W. Trevor King on 2020-04-09 14:57:06 EDT ---

Ok, so there is an unforced-ish way out:

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade
error: Already upgrading, pass --allow-upgrade-with-warnings to override.

  Reason: UpgradePreconditionCheckFailed
  Message: Unable to apply 4.3.11: it may not be safe to apply this update

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade --allow-upgrade-with-warnings
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 4
  name: version
  resourceVersion: "34618"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: ""
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:55:16Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:55:34Z"
    message: Cluster version is 4.3.10
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
      Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: 4.3.10
  history:
  - completionTime: "2020-04-09T18:55:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T18:55:16Z"
    state: Completed
    verified: false
    version: 4.3.10
  - completionTime: "2020-04-09T18:55:16Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 4
  versionHash: vSLGMQhseGg=

--- Additional comment from Scott Dodson on 2020-04-10 09:57:33 EDT ---

This should be backported to at least 4.3 when we fix this.

Comment 1 W. Trevor King 2020-05-15 05:19:03 UTC
No point in claiming this sprint until we have a fix in master, so punting to UpcomingSprint again.

Comment 2 W. Trevor King 2020-06-21 14:15:51 UTC
Blocking bugs will need to be addressed first; adding UpcomingSprint.

Comment 3 W. Trevor King 2020-07-06 22:15:37 UTC
Comment 2 is still current, restoring UpcomingSprint.

Comment 4 Scott Dodson 2020-07-15 00:34:06 UTC
This clone has been open for 3 months now with no master-branch fix merged. It can be re-opened once a master-branch fix has been committed, though we'd have to create a 4.5 bug in between. CLOSED DEFERRED