Bug 1822922 - cluster-version operator stops applying manifests when blocked by a precondition check
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On: 1822752 2064991
Blocks: 1822923
Reported: 2020-04-10 13:58 UTC by Scott Dodson
Modified: 2022-03-17 04:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1822752
Clones: 1822923
Environment:
Last Closed: 2020-07-15 00:34:06 UTC
Target Upstream Version:
Embargoed:


Description Scott Dodson 2020-04-10 13:58:12 UTC
+++ This bug was initially created as a clone of Bug #1822752 +++

For example, blocking on 4.3.10 -> 4.3.11 via the mechanism in bug 1821905:

$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
...no output...
$ oc patch scc privileged --type json -p '[{"op": "add", "path": "/users/-", "value": "kubeadmin"}]'
$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
2020-04-09T18:12:01Z kube-apiserver DefaultSecurityContextConstraints_Mutated
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.3"}]'
$ oc adm upgrade --to 4.3.11
Updating to 4.3.11
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-09T18:00:34Z Available True Done applying 4.3.10
2020-04-09T18:13:10Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
2020-04-09T18:27:21Z RetrievedUpdates True
2020-04-09T18:27:49Z Progressing True Unable to apply 4.3.11: it may not be safe to apply this update
2020-04-09T18:28:19Z Failing True Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
$ oc get -o json clusterversion version | jq -r '.status.availableUpdates[].version'
4.3.11
$ oc adm upgrade --to 4.3.10
error: The update 4.3.10 is not one of the available updates: 4.3.11
$ oc -n openshift-cluster-version get -o json pods | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator")).metadata.name'
cluster-version-operator-7bbc4c5dcc-w287k
$ oc -n openshift-cluster-version get -o json pods | jq -r '.items[] | select(.metadata.name == "cluster-version-operator-7bbc4c5dcc-w287k").spec.containers[].image'
registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
$ oc image info -o json registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd | jq -r '.config.config.Labels["io.openshift.release"]'
4.3.10
$ oc -n openshift-cluster-version logs cluster-version-operator-7bbc4c5dcc-w287k
I0409 17:38:45.549400       1 start.go:19] ClusterVersionOperator v4.3.10-202003311428-dirty
...
I0409 18:26:05.631324       1 sync_worker.go:634] Done syncing for clusteroperator "service-ca" (388 of 498)
I0409 18:26:05.631365       1 task_graph.go:524] Graph is complete
I0409 18:26:05.631390       1 task_graph.go:587] No more work for 1
I0409 18:26:05.631402       1 task_graph.go:587] No more work for 0
I0409 18:26:05.631414       1 task_graph.go:603] Workers finished
I0409 18:26:05.631425       1 task_graph.go:611] Result of work: []
...
I0409 18:27:47.543958       1 sync_worker.go:471] Running sync 4.3.11 (force=false) on generation 3 in state Updating at attempt 0
...
I0409 18:40:26.944180       1 sync_worker.go:471] Running sync 4.3.11 (force=false) on generation 3 in state Updating at attempt 4
I0409 18:40:26.944219       1 sync_worker.go:477] Loading payload
I0409 18:40:27.247572       1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/0vmj3337PDbtCKhSPEWayQ"
E0409 18:40:27.549821       1 precondition.go:49] Precondition "ClusterVersionUpgradeable" failed: Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]

So that is the 4.3.10 CVO saying:

> They've asked me to update to 4.3.11, let me stop applying manifests and take a look at the preconditions.  Oh no!  A precondition check is failing!  I will complain about it until it gets fixed, but in the meantime I will do nothing about manifest reconciliation and hope that nobody in the cluster is stomping on manifests which I'm supposed to be monitoring.

I think that's a bug, and that we want the CVO pod to continue to reconcile manifests while it works through update preconditions to vet the proposed target.

It should also be possible for admins to say "ah, precondition failed, please forget I asked and return to the source version".  As the "error: The update 4.3.10 is not one of the available updates: 4.3.11" shows, that is not currently possible either (without using --force or other risky stuff).
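
The stuck state described above can be checked for from a script. The following is a minimal sketch and not part of the original report: the function name `precondition_blocked` is hypothetical, and it assumes that the `Failing` condition carries the reason `UpgradePreconditionCheckFailed` while a precondition blocks the update, which is what this cluster's ClusterVersion status reports. It uses the jsonpath output of `oc` instead of jq.

```shell
# Hypothetical helper: succeed (exit 0) when ClusterVersion reports
# Failing=True with reason UpgradePreconditionCheckFailed.
precondition_blocked() {
  # Prints "<status> <reason>" for the Failing condition, for example
  # "True UpgradePreconditionCheckFailed".
  failing=$(oc get clusterversion version -o \
    jsonpath='{range .status.conditions[?(@.type=="Failing")]}{.status}{" "}{.reason}{end}') || return 2
  [ "$failing" = "True UpgradePreconditionCheckFailed" ]
}
```

A caller could then run something like `precondition_blocked && echo "update is blocked by a precondition"`.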

--- Additional comment from W. Trevor King on 2020-04-09 14:54:14 EDT ---

Also, without dipping into .status.history, there is nothing in ClusterVersion to show that we're actually still running 4.3.10:

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 3
  name: version
  resourceVersion: "27378"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:28:19Z"
    message: 'Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated":
      Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
      Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:27:49Z"
    message: 'Unable to apply 4.3.11: it may not be safe to apply this update'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
      Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  history:
  - completionTime: null
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 3
  versionHash: vSLGMQhseGg=
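
For reference, dipping into `.status.history` could look roughly like the sketch below. It is an illustration only: the function name `current_version` is hypothetical, and it assumes that history entries are listed newest-first, as they are in the output above, so that the first `Completed` entry is the release the cluster is actually running.

```shell
# Hypothetical helper: print the version of the most recent Completed
# entry in ClusterVersion .status.history.
current_version() {
  # jsonpath prints every matching version, separated by spaces.
  versions=$(oc get clusterversion version -o \
    jsonpath='{.status.history[?(@.state=="Completed")].version}') || return 1
  # Split on whitespace and keep only the first (newest) match.
  set -- $versions
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$1"
}
```

Under those assumptions, `current_version` should report 4.3.10 for the status shown above, even though `.status.desired.version` says 4.3.11.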

--- Additional comment from W. Trevor King on 2020-04-09 14:57:06 EDT ---

Ok, so there is an unforced-ish way out:

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade
error: Already upgrading, pass --allow-upgrade-with-warnings to override.

  Reason: UpgradePreconditionCheckFailed
  Message: Unable to apply 4.3.11: it may not be safe to apply this update

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade --allow-upgrade-with-warnings
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 4
  name: version
  resourceVersion: "34618"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: ""
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:55:16Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:55:34Z"
    message: Cluster version is 4.3.10
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
      Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: 4.3.10
  history:
  - completionTime: "2020-04-09T18:55:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T18:55:16Z"
    state: Completed
    verified: false
    version: 4.3.10
  - completionTime: "2020-04-09T18:55:16Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 4
  versionHash: vSLGMQhseGg=
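
The way out shown above can be wrapped in a small script. This is a hedged sketch rather than a supported procedure: the function name `revert_to_completed` is hypothetical, it assumes history is ordered newest-first as in the output above, and it reuses only the `oc adm upgrade` flags that appear in this comment.

```shell
# Hypothetical helper: retarget the cluster at the most recent Completed
# release image recorded in ClusterVersion .status.history.
revert_to_completed() {
  # jsonpath prints every Completed image, separated by spaces.
  images=$(oc get clusterversion version -o \
    jsonpath='{.status.history[?(@.state=="Completed")].image}') || return 1
  # Split on whitespace and keep only the first (newest) match.
  set -- $images
  if [ "$#" -eq 0 ]; then
    echo "no Completed release found in history" >&2
    return 1
  fi
  # Same flags as the manual command above.
  oc adm upgrade --to-image "$1" \
    --allow-explicit-upgrade --allow-upgrade-with-warnings
}
```

As with the manual command, `--allow-upgrade-with-warnings` overrides the "Already upgrading" guard, so this should only be run once it is clear that the blocked update never started applying manifests.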

--- Additional comment from Scott Dodson on 2020-04-10 09:57:33 EDT ---

This should be backported to at least 4.3 when we fix this.

Comment 1 W. Trevor King 2020-05-15 05:19:03 UTC
No point in claiming this sprint until we have a fix in master, so punting to UpcomingSprint again.

Comment 2 W. Trevor King 2020-06-21 14:15:51 UTC
Blocking bugs will need to be addressed first; adding UpcomingSprint

Comment 3 W. Trevor King 2020-07-06 22:15:37 UTC
Comment 2 is still current, restoring UpcomingSprint.

Comment 4 Scott Dodson 2020-07-15 00:34:06 UTC
This clone has been open for 3 months now with no master branch fix merged. It can be re-opened once a master branch fix has been committed, though we'd have to create a 4.5 clone in between. CLOSED DEFERRED

