Bug 1951339
| Summary: | Cluster-version operator does not manage operand container environments when manifest lacks opinions | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
| Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.z | CC: | aos-bugs, jokerman, mbargenq, yanyang |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:01:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:
Cause: The cluster-version operator did not reconcile env and envFrom for manifests which did not set those properties.
Consequence: Cluster administrators were able to manipulate those properties, potentially adding broken values, without the cluster-version operator stomping their changes.
Fix: The cluster-version operator now clears env and envFrom if they are unset in the manifest.
Result: The cluster will automatically recover from invalid cluster-admin injections to these properties.
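A minimal sketch of how the fixed behavior can be exercised, following the reproduction steps below. The exact commands are not recorded in this bug; the container index (0) and the sync-cycle wait are assumptions:

$ oc -n openshift-marketplace patch deployment marketplace-operator --type json \
    -p '[{"op": "add", "path": "/spec/template/spec/containers/0/envFrom", "value": [{"secretRef": {"name": "openshift-reg"}}]}]'
$ sleep 300    # allow the cluster-version operator at least one reconcile pass
$ oc -n openshift-marketplace get deployment marketplace-operator \
    -o jsonpath='{.spec.template.spec.containers[0].envFrom}{"\n"}'    # empty once the CVO has cleared the injected stanza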
Description
W. Trevor King
2021-04-20 00:00:16 UTC
Reproducing it with 4.8.0-fc.0.

Steps to reproduce:
1. Install a cluster with 4.8.0-fc.0
2. Update the marketplace deployment by injecting an env variable for the containers:
   envFrom:
   - secretRef:
       name: openshift-reg
3. Check the results: marketplace-operator is recreated with a config error

# oc get all
NAME                                        READY   STATUS                       RESTARTS   AGE
pod/certified-operators-lddkh               1/1     Running                      0          155m
pod/community-operators-82cdb               1/1     Running                      0          16m
pod/marketplace-operator-587b7bf94d-2sg5n   1/1     Running                      0          153m
pod/marketplace-operator-7686ffb476-84x6s   0/1     CreateContainerConfigError   0          47s
pod/qe-app-registry-24l67                   1/1     Running                      0          156m
pod/redhat-marketplace-xx8jl                1/1     Running                      0          156m
pod/redhat-operators-rccr2                  1/1     Running                      0          156m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.236.173   <none>        50051/TCP           179m
service/community-operators            ClusterIP   172.30.65.242    <none>        50051/TCP           179m
service/marketplace-operator-metrics   ClusterIP   172.30.177.155   <none>        8383/TCP,8081/TCP   3h8m
service/qe-app-registry                ClusterIP   172.30.163.173   <none>        50051/TCP           160m
service/redhat-marketplace             ClusterIP   172.30.32.12     <none>        50051/TCP           179m
service/redhat-operators               ClusterIP   172.30.131.49    <none>        50051/TCP           179m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           3h8m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-587b7bf94d   1         1         1       3h8m
replicaset.apps/marketplace-operator-7686ffb476   1         1         0       47s

The deployment is broken by:

# oc describe pod/marketplace-operator-7686ffb476-84x6s
  Warning  Failed  <invalid> (x10 over 75s)  kubelet  Error: secret "openshift-reg" not found

The deployment rollout is not terminated immediately.

# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T04:07:30Z",
    "lastUpdateTime": "2021-04-23T04:07:30Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T03:31:14Z",
    "lastUpdateTime": "2021-04-23T06:38:54Z",
    "message": "ReplicaSet \"marketplace-operator-7686ffb476\" is progressing.",
    "reason": "ReplicaSetUpdated",
    "status": "True",
    "type": "Progressing"
  }
]

The CVO does not log the errors.

# grep 'deployment.*marketplace' cvo.log | grep -v Throttling
I0423 06:46:33.540152 1 sync_worker.go:767] Running sync for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:46:33.954966 1 sync_worker.go:779] Done syncing for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:50:05.461871 1 sync_worker.go:767] Running sync for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:50:05.851594 1 sync_worker.go:779] Done syncing for deployment "openshift-marketplace/marketplace-operator" (521 of 678)

> # oc get deployment -ojson | jq -r '.items[].status.conditions'
> [
> {
> "lastTransitionTime": "2021-04-23T04:07:30Z",
> "lastUpdateTime": "2021-04-23T04:07:30Z",
> "message": "Deployment has minimum availability.",
> "reason": "MinimumReplicasAvailable",
> "status": "True",
> "type": "Available"
> },
> {
> "lastTransitionTime": "2021-04-23T03:31:14Z",
> "lastUpdateTime": "2021-04-23T06:38:54Z",
> "message": "ReplicaSet \"marketplace-operator-7686ffb476\" is
> progressing.",
> "reason": "ReplicaSetUpdated",
> "status": "True",
> "type": "Progressing"
> }
> ]
>
The deployment rollout eventually times out.
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
{
"lastTransitionTime": "2021-04-23T09:35:15Z",
"lastUpdateTime": "2021-04-23T09:35:15Z",
"message": "Deployment has minimum availability.",
"reason": "MinimumReplicasAvailable",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2021-04-23T10:33:25Z",
"lastUpdateTime": "2021-04-23T10:33:25Z",
"message": "ReplicaSet \"marketplace-operator-7686ffb476\" has timed out progressing.",
"reason": "ProgressDeadlineExceeded",
"status": "False",
"type": "Progressing"
}
]
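While waiting for the rollout to time out, a query like the following narrows the output to just the Progressing condition (a variant of the jq commands already shown above, not a command captured in this bug):

$ oc -n openshift-marketplace get deployment marketplace-operator -o json \
    | jq -r '.status.conditions[] | select(.type == "Progressing") | .reason + ": " + .message'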
Verifying with 4.8.0-0.nightly-2021-04-22-225832.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-225832   True        False         13m     Cluster version is 4.8.0-0.nightly-2021-04-22-225832

Update the marketplace deployment by injecting the env variables:
envFrom:
- secretRef:
    name: openshift-reg

The deployment has an error.

# oc get all
NAME                                        READY   STATUS                       RESTARTS   AGE
pod/certified-operators-59phh               1/1     Running                      0          15m
pod/community-operators-6pfwq               1/1     Running                      0          16m
pod/marketplace-operator-548df64d-5nctr     1/1     Running                      0          16m
pod/marketplace-operator-649ccff7f4-lfvnm   0/1     CreateContainerConfigError   0          7s
pod/qe-app-registry-vmrr7                   1/1     Running                      0          2m13s
pod/redhat-marketplace-hxhfn                1/1     Running                      0          15m
pod/redhat-operators-882cr                  1/1     Running                      0          15m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.30.200    <none>        50051/TCP           39m
service/community-operators            ClusterIP   172.30.102.1     <none>        50051/TCP           39m
service/marketplace-operator-metrics   ClusterIP   172.30.208.23    <none>        8383/TCP,8081/TCP   47m
service/qe-app-registry                ClusterIP   172.30.219.78    <none>        50051/TCP           19m
service/redhat-marketplace             ClusterIP   172.30.126.70    <none>        50051/TCP           39m
service/redhat-operators               ClusterIP   172.30.113.251   <none>        50051/TCP           39m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           47m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-548df64d     1         1         1       47m
replicaset.apps/marketplace-operator-649ccff7f4   1         1         0       8s

# oc get pod/marketplace-operator-649ccff7f4-lfvnm -ojson | jq -r '.status.containerStatuses[].state'
{
  "waiting": {
    "message": "secret \"openshift-reg\" not found",
    "reason": "CreateContainerConfigError"
  }
}

# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T08:27:09Z",
    "lastUpdateTime": "2021-04-23T08:27:09Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T07:55:53Z",
    "lastUpdateTime": "2021-04-23T08:42:59Z",
    "message": "ReplicaSet \"marketplace-operator-649ccff7f4\" is progressing.",
    "reason": "ReplicaSetUpdated",
    "status": "True",
    "type": "Progressing"
  }
]

After a while, the operator is reconciled.
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T08:27:09Z",
    "lastUpdateTime": "2021-04-23T08:27:09Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T07:55:53Z",
    "lastUpdateTime": "2021-04-23T08:46:14Z",
    "message": "ReplicaSet \"marketplace-operator-548df64d\" has successfully progressed.",
    "reason": "NewReplicaSetAvailable",
    "status": "True",
    "type": "Progressing"
  }
]

# oc get all
NAME                                      READY   STATUS    RESTARTS   AGE
pod/certified-operators-59phh             1/1     Running   0          21m
pod/community-operators-6pfwq             1/1     Running   0          22m
pod/marketplace-operator-548df64d-5nctr   1/1     Running   0          22m
pod/qe-app-registry-vmrr7                 1/1     Running   0          8m18s
pod/redhat-marketplace-hxhfn              1/1     Running   0          21m
pod/redhat-operators-882cr                1/1     Running   0          21m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.30.200    <none>        50051/TCP           45m
service/community-operators            ClusterIP   172.30.102.1     <none>        50051/TCP           45m
service/marketplace-operator-metrics   ClusterIP   172.30.208.23    <none>        8383/TCP,8081/TCP   53m
service/qe-app-registry                ClusterIP   172.30.219.78    <none>        50051/TCP           25m
service/redhat-marketplace             ClusterIP   172.30.126.70    <none>        50051/TCP           45m
service/redhat-operators               ClusterIP   172.30.113.251   <none>        50051/TCP           45m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           53m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-548df64d     1         1         1       53m
replicaset.apps/marketplace-operator-649ccff7f4   0         0         0       6m13s

To verify commit#8132d6a, I changed the image digest to an invalid one and changed progressDeadlineSeconds to 5, but I was unlucky and did not get ProgressDeadlineExceeded.

@Trevor, could you please hint how I can reach the ProgressDeadlineExceeded scenario? Thanks.

Possibly cordoning all the nodes and then killing something with replicas > 2 that is managed by the CVO? Looking for candidates:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1385644983909879808/artifacts/e2e-aws/gather-extra/artifacts/deployments.json | jq -r '.items[] | select(.spec.replicas >= 2).metadata | .namespace + " " + .name'
...
openshift-console downloads
...

So something like:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done
$ POD="$(oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep downloads | head -n1)"
$ oc -n openshift-console delete pod "${POD}"

and then wait 10 minutes [1] for the default progressDeadlineSeconds to expire.

[1]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds

Verifying commit#8132d6a.
Cordon all of the nodes:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done

Delete a deployment:

$ oc delete deployment.apps/downloads -n openshift-console

Wait 600s for ProgressDeadlineExceeded:

$ oc get deployment.apps/downloads -ojson | jq -r '.status.conditions'
[
  {
    "lastTransitionTime": "2021-04-25T09:49:57Z",
    "lastUpdateTime": "2021-04-25T09:49:57Z",
    "message": "Deployment does not have minimum availability.",
    "reason": "MinimumReplicasUnavailable",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-25T09:59:58Z",
    "lastUpdateTime": "2021-04-25T09:59:58Z",
    "message": "ReplicaSet \"downloads-7fdddb48b4\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]

Describe the downloads pod:

$ oc describe pod/downloads-7fdddb48b4-wnljj
Events:
  Type     Reason            Age   From                Message
  ----     ------            ----  ----                -------
  Warning  FailedScheduling  12m   default-scheduler   0/6 nodes are available: 6 node(s) were unschedulable.
  Warning  FailedScheduling  12m   default-scheduler   0/6 nodes are available: 6 node(s) were unschedulable.

The ProgressDeadlineExceeded error is not logged in the pod events or in the CVO log file.

> Delete a deployment...

If you delete the Deployment, it will remove both pods. The CVO will create a replacement Deployment, but with 0 of 2 pods ready, that Deployment will be, as you saw, Available=False as well as ProgressDeadlineExceeded. Comment 6 called for leaving the Deployment alone, but deleting a single pod. I expect that will leave the Deployment Available=True while still getting us ProgressDeadlineExceeded. But in either case, Available=False and ProgressDeadlineExceeded should both be sufficient to get the CVO to mention the issue in ClusterVersion conditions. If it doesn't, can you include the ClusterVersion YAML?

The approach in comment #6 did not get ProgressDeadlineExceeded, so I deleted the deployment.
# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-04-23T07:55:27Z"
    generation: 1
    name: version
    resourceVersion: "1694483"
    uid: 98e037f9-ed02-491d-894b-1f5965638c00
  spec:
    channel: stable-4.7
    clusterID: 93f31b0e-ef15-4b2f-b441-fe5c6c786231
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-04-23T08:22:33Z"
      message: Done applying 4.8.0-0.nightly-2021-04-22-225832
      status: "True"
      type: Available
    - lastTransitionTime: "2021-04-26T01:33:35Z"
      message: 'Some cluster operators are still updating: machine-config, monitoring'
      reason: ClusterOperatorsNotAvailable
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-04-23T08:22:33Z"
      message: 'Error while reconciling 4.8.0-0.nightly-2021-04-22-225832: some cluster operators have not yet rolled out'
      reason: ClusterOperatorsNotAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2021-04-23T07:55:27Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-04-22-225832 not found in the "stable-4.7" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2021-04-25T09:50:05Z"
      message: 'Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details'
      reason: PoolUpdating
      status: "False"
      type: Upgradeable
    desired:
      image: registry.ci.openshift.org/ocp/release@sha256:018e7acbdde984b80c0ed7eaeea7d3f27dc4442dc15e7f9c6de02ffaf9e9eee9
      version: 4.8.0-0.nightly-2021-04-22-225832
    history:
    - completionTime: "2021-04-23T08:22:33Z"
      image: registry.ci.openshift.org/ocp/release@sha256:018e7acbdde984b80c0ed7eaeea7d3f27dc4442dc15e7f9c6de02ffaf9e9eee9
      startedTime: "2021-04-23T07:55:27Z"
      state: Completed
      verified: false
      version: 4.8.0-0.nightly-2021-04-22-225832
    observedGeneration: 1
    versionHash: 14fG0Q-Un8E=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

# oc -n openshift-console get all
NAME                             READY   STATUS    RESTARTS   AGE
pod/console-b5f64d9c7-lncw9      1/1     Running   0          18h
pod/console-b5f64d9c7-mk89l      1/1     Running   0          18h
pod/downloads-7fdddb48b4-wnljj   0/1     Pending   0          16h
pod/downloads-7fdddb48b4-zl2g2   0/1     Pending   0          16h

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/console     ClusterIP   172.30.87.18     <none>        443/TCP   2d18h
service/downloads   ClusterIP   172.30.119.245   <none>        80/TCP    2d18h

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/console     2/2     2            2           2d18h
deployment.apps/downloads   0/2     2            0           16h

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/console-585b4f86d6     0         0         0       2d17h
replicaset.apps/console-5dcbfdc595     0         0         0       18h
replicaset.apps/console-6d466d4bcc     0         0         0       2d18h
replicaset.apps/console-b5f64d9c7      2         2         2       18h
replicaset.apps/downloads-7fdddb48b4   2         2         0       16h

NAME                                 HOST/PORT                                                                          PATH   SERVICES    PORT    TERMINATION          WILDCARD
route.route.openshift.io/console     console-openshift-console.apps.yangyang0423-1.qe.gcp.devcluster.openshift.com             console     https   reencrypt/Redirect   None
route.route.openshift.io/downloads   downloads-openshift-console.apps.yangyang0423-1.qe.gcp.devcluster.openshift.com           downloads   http    edge/Redirect        None

Hmm.
Poking around in a cluster-bot cluster:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.8.0-0.nightly-2021-04-25-195440
$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
$ POD="$(oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep downloads | head -n1)"
$ oc -n openshift-console delete pod "${POD}"
$ oc -n openshift-console get -o json deployment downloads | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:38:50Z Progressing=True NewReplicaSetAvailable: ReplicaSet "downloads-8dcc96bf7" has successfully progressed.
2021-04-26T03:59:55Z Available=False MinimumReplicasUnavailable: Deployment does not have minimum availability.

Hrm, I didn't even have to wait 10m. ClusterVersion doesn't say anything about this issue:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:22:45Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-04-26T03:56:13Z Available=True : Done applying 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T03:56:13Z Failing=False :
2021-04-26T03:56:13Z Progressing=False : Cluster version is 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T04:00:48Z Upgradeable=False PoolUpdating: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

I wonder what the deal is with machine-config?

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-555bdd56b2ffb5c9b65c53df4805605b   False     True       False      3              0                   3                     0                      36m
worker   rendered-worker-5d9d6269bbbb1561d7ae1b34c447f4c9   False     True       False      3              0                   3                     0                      36m

3 of 3 in each pool are updated, but Updated=False and Updating=True because they are cordoned, and therefore not ready. That doesn't feel right. But anyway, back to the CVO:

$ POD="$(oc -n openshift-cluster-version get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods)"
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*state\|downloads\|Result of work' | grep -v 'Throttling request'
...
I0426 04:08:49.958758 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:08:56.715714 1 sync_worker.go:767] Running sync for customresourcedefinition "consoleclidownloads.console.openshift.io" (51 of 677)
I0426 04:08:56.816355 1 sync_worker.go:779] Done syncing for customresourcedefinition "consoleclidownloads.console.openshift.io" (51 of 677)
I0426 04:08:57.115298 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 04:09:15.087322 1 sync_worker.go:767] Running sync for route "openshift-console/downloads" (448 of 677)
I0426 04:09:15.122404 1 sync_worker.go:779] Done syncing for route "openshift-console/downloads" (448 of 677)
I0426 04:09:15.122460 1 sync_worker.go:767] Running sync for service "openshift-console/downloads" (449 of 677)
I0426 04:09:15.321107 1 sync_worker.go:779] Done syncing for service "openshift-console/downloads" (449 of 677)
I0426 04:09:18.161172 1 task_graph.go:555] Result of work: []
...
Hrm, no downloads deployment. Ah, because the console folks moved it out from under the CVO to control it directly in [1]. So sorry for the stale choice of deployment. Uncordon so the downloads deployment can recover:

$ for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done

And go looking for another deployment:

$ oc adm release extract --to manifest
$ grep -ho 'replicas: .*' $(grep -rl 'kind: Deployment' manifest) | sort | uniq -c
     40 replicas: 1

So how to trigger Available=True with ProgressDeadlineExceeded... Maybe if I cordon everyone during an update, we'll be unable to schedule... But I don't want the old pod to go away first, because then we'd be Available=False. I see the default maxSurge is 25% [2]. And maxUnavailable rounds down [3]. So maybe our RollingUpdate deployments are surging.

$ grep RollingUpdate manifest/*deployment*
manifest/0000_50_olm_07-olm-operator.deployment.ibm-cloud-managed.yaml: type: RollingUpdate
manifest/0000_50_olm_07-olm-operator.deployment.yaml: type: RollingUpdate
manifest/0000_50_olm_08-catalog-operator.deployment.ibm-cloud-managed.yaml: type: RollingUpdate
manifest/0000_50_olm_08-catalog-operator.deployment.yaml: type: RollingUpdate
$ grep name manifest/0000_50_olm_07-olm-operator.deployment.yaml | head -n2
  name: olm-operator
  namespace: openshift-operator-lifecycle-manager

Ok, so I'm going to cordon all the nodes again, and then touch OLM to trigger a rolling update, which will hopefully surge a new pod, and then complain about that pod being stuck:

$ for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq .spec.template.metadata.annotations
{
  "workload.openshift.io/management": "{\"effect\": \"PreferredDuringScheduling\"}"
}
$ oc -n openshift-operator-lifecycle-manager patch deployment olm-operator --type json -p '[{"op": "add", "path": "/spec/template/metadata/annotations/poke", "value": "123"}]'
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:23:18Z Progressing=True ReplicaSetUpdated: ReplicaSet "olm-operator-6c99f456cf" is progressing.
2021-04-26T03:31:28Z Available=True MinimumReplicasAvailable: Deployment has minimum availability.
$ oc -n openshift-operator-lifecycle-manager get pods
NAME                            READY   STATUS    RESTARTS   AGE
...
olm-operator-589957f7bd-z4tlv   1/1     Running   0          88m
olm-operator-6c99f456cf-fr8t6   0/1     Pending   0          54s
...

Perfect :). Waiting for 10m...

Now:

$ oc -n openshift-operator-lifecycle-manager get pods | grep olm-operator
olm-operator-589957f7bd-z4tlv   1/1     Running   0          97m
olm-operator-6c99f456cf-fr8t6   0/1     Pending   0          10m
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:31:28Z Available=True MinimumReplicasAvailable: Deployment has minimum availability.
2021-04-26T05:00:31Z Progressing=False ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
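The Available=True plus ProgressDeadlineExceeded combination above falls out of the default RollingUpdate rounding; a quick back-of-the-envelope using the Kubernetes defaults (these numbers are not taken from the OLM manifests themselves):

  replicas = 1, maxSurge = 25%, maxUnavailable = 25%
  surge pods       = ceil(1 * 0.25)  = 1   # one extra pod may be created; it stays Pending while the nodes are cordoned
  unavailable pods = floor(1 * 0.25) = 0   # the existing pod must keep running, so the Deployment stays Available=True

The old pod keeps serving while the surge pod sits Pending, and once the default progressDeadlineSeconds (600s) expires, the Progressing condition flips to ProgressDeadlineExceeded.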
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
I0426 04:53:35.383334 1 task_graph.go:555] Result of work: []
I0426 04:56:26.340740 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:56:33.347692 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 04:56:53.696766 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:53.996634 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:54.547242 1 task_graph.go:555] Result of work: []
I0426 04:59:45.504344 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:59:52.619062 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 05:00:13.211034 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:00:13.363510 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)

The CVO sticks there for around five minutes waiting for machine-config to become happy, which never happens, because we're still cordoned. Once that times out, we wait out the cool-down between sync cycles, start in again, and notice the stuck deployment:

$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*in state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
I0426 04:56:53.696766 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:53.996634 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:54.547242 1 task_graph.go:555] Result of work: []
I0426 04:59:45.504344 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 05:00:13.211034 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:00:13.363510 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:05:27.417819 1 task_graph.go:555] Result of work: [Cluster operator machine-config is not available]
I0426 05:08:44.045069 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 05:09:11.750990 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
E0426 05:09:11.901500 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.

It takes us another 5m to give up on the stuck Deployment (we should probably fail-fast on ProgressDeadlineExceeded to avoid stacking timeouts, but we can adjust that in follow-up work).
But eventually we do give up:

$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*in state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
E0426 05:12:19.268254 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:12:37.640209 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:12:57.047203 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:13:22.513803 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:13:45.891628 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:14:00.301317 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:14:23.508240 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
I0426 05:14:25.958826 1 task_graph.go:555] Result of work: [Cluster operator machine-config is not available deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.]
I0426 05:14:25.958900 1 sync_worker.go:939] Update error 501 of 677: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.)
E0426 05:14:25.958922 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:22:45Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-04-26T03:56:13Z Available=True : Done applying 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T03:56:13Z Progressing=False WorkloadNotProgressing: Error while reconciling 4.8.0-0.nightly-2021-04-25-195440: the workload openshift-operator-lifecycle-manager/olm-operator cannot roll out
2021-04-26T04:48:57Z Upgradeable=False PoolUpdating: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
2021-04-26T05:05:28Z Failing=True WorkloadNotProgressing: deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.

Hooray, there we go :)

[1]: https://github.com/openshift/console-operator/pull/508
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge
[3]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable

Trevor, appreciate your help and detailed information.

Verifying commit#8132d6a.

Cordon all of the nodes:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done

Update the catalog-operator deployment:

$ oc patch deployment catalog-operator --type json -p '[{"op": "add", "path": "/spec/template/metadata/annotations/poke", "value": "123"}]'

Check that the new catalog-operator pod is pending:

$ oc get po
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-675bc6d7cb-kx5bm   1/1     Running   0          3d1h
catalog-operator-7b46fdf5-txzhx     0/1     Pending   0          15m
olm-operator-65b47b65f5-sj7xl       1/1     Running   0          3d1h
packageserver-676bc98c9-kgv5m       1/1     Running   0          3d1h
packageserver-676bc98c9-xqvsp       1/1     Running   0          3d1h

Wait 10m...
Check the deployment state:

$ oc get deployment catalog-operator -ojson | jq -r '.status.conditions[]'
{
  "lastTransitionTime": "2021-04-23T08:26:40Z",
  "lastUpdateTime": "2021-04-23T08:26:40Z",
  "message": "Deployment has minimum availability.",
  "reason": "MinimumReplicasAvailable",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2021-04-26T10:04:38Z",
  "lastUpdateTime": "2021-04-26T10:04:38Z",
  "message": "ReplicaSet \"catalog-operator-7b46fdf5\" has timed out progressing.",
  "reason": "ProgressDeadlineExceeded",
  "status": "False",
  "type": "Progressing"
}

After a while, check the CVO conditions:

$ oc get clusterversion -o json | jq -r '.items[].status.conditions[]'
{
  "lastTransitionTime": "2021-04-23T08:22:33Z",
  "message": "Done applying 4.8.0-0.nightly-2021-04-22-225832",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2021-04-26T10:14:20Z",
  "message": "deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet \"catalog-operator-7b46fdf5\" has timed out progressing.",
  "reason": "WorkloadNotProgressing",
  "status": "True",
  "type": "Failing"
}
{
  "lastTransitionTime": "2021-04-23T08:22:33Z",
  "message": "Error while reconciling 4.8.0-0.nightly-2021-04-22-225832: the workload openshift-operator-lifecycle-manager/catalog-operator cannot roll out",
  "reason": "WorkloadNotProgressing",
  "status": "False",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2021-04-23T07:55:27Z",
  "message": "Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-04-22-225832 not found in the \"stable-4.7\" channel",
  "reason": "VersionNotFound",
  "status": "False",
  "type": "RetrievedUpdates"
}
{
  "lastTransitionTime": "2021-04-26T08:05:10Z",
  "message": "Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details",
  "reason": "PoolUpdating",
  "status": "False",
  "type": "Upgradeable"
}

Check the CVO log:

$ oc -n openshift-cluster-version logs cluster-version-operator-6f4c448795-54z97 | grep 'deployment.*catalog\|Result of work' | tail -5
E0426 11:42:19.121497 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (502 of 677): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
E0426 11:42:41.875235 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (502 of 677): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
I0426 11:42:45.607570 1 task_graph.go:555] Result of work: [deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing. Cluster operator monitoring is not available]
I0426 11:42:45.607688 1 sync_worker.go:939] Update error 502 of 677: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.)
E0426 11:42:45.607736 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.

The two commits work as designed. Moving this to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438