Bug 1884334 - CVO marks an upgrade as failed when an operator takes more than 20 minutes to roll out
Summary: CVO marks an upgrade as failed when an operator takes more than 20 minutes to roll out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jack Ottofaro
QA Contact: Yang Yang
URL:
Whiteboard:
Duplicates: 1825006 1888754
Depends On:
Blocks: 1884464 ocp-42-45-z-tracker
 
Reported: 2020-10-01 17:30 UTC by Scott Dodson
Modified: 2021-02-24 15:22 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:22:14 UTC
Target Upstream Version:
Embargoed:


Attachments
kubeapi failure (94.48 KB, image/png)
2020-10-12 06:53 UTC, Yang Yang
network failure (100.25 KB, image/png)
2020-10-12 06:54 UTC, Yang Yang
machine-config failure (94.42 KB, image/png)
2020-10-12 07:10 UTC, Yang Yang
Gathered CVO logs from a 4.5.15 -> 4.6.0-rc.4 update (904.84 KB, application/gzip)
2020-10-15 00:44 UTC, W. Trevor King
CVO logs for upgrading from 4.6.16 to 4.7.0-0.nightly-2021-02-04-031352 (1.84 MB, application/gzip)
2021-02-07 01:56 UTC, Yang Yang


Links
GitHub openshift/cluster-version-operator pull 464 (closed): Bug 1884334: pkg/cvo/status: Raise Operator leveling grace-period to 40 minutes (last updated 2021-02-16 20:13:07 UTC)
GitHub openshift/cluster-version-operator pull 486 (closed): Bug 1884334: UpdateError: enhance for ability to determine when upgrade failing (last updated 2021-02-16 20:13:07 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:22:54 UTC)

Description Scott Dodson 2020-10-01 17:30:07 UTC
This bug was initially created as a copy of Bug #1862524

I am copying this bug because: 



Currently the CVO marks an upgrade as failed whenever an operator takes longer than 20 minutes to roll out. It's very common on clusters of any size to take more than 20 minutes to roll out operators that run daemonsets on all hosts, in particular the MCO, network, and DNS operators. By moving this threshold to 40 minutes we'll significantly reduce the noise so we can focus on upgrades that have real problems.

There's follow-up work to make more significant implementation changes here, but we'll push those out more slowly:

https://issues.redhat.com/browse/OTA-247
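
For reference, a minimal sketch of the kind of leveling grace-period check this change widens from 20 to 40 minutes. Names here are illustrative only, not the actual pkg/cvo/status.go code:

package main

import (
	"fmt"
	"time"
)

// operatorLevelingGracePeriod is the window an operator gets to level before the
// CVO reports Failing=True; this bug raises it from 20 to 40 minutes.
const operatorLevelingGracePeriod = 40 * time.Minute

// shouldReportFailing is a hypothetical helper: it only trips once the sync worker
// has gone longer than the grace period without making progress.
func shouldReportFailing(lastProgress, now time.Time) bool {
	return now.Sub(lastProgress) > operatorLevelingGracePeriod
}

func main() {
	lastProgress := time.Now().Add(-27 * time.Minute)                // e.g. a 27-minute network-operator rollout
	fmt.Println(shouldReportFailing(lastProgress, time.Now()))       // false with 40m, true with the old 20m
}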

Comment 3 Yang Yang 2020-10-09 10:38:16 UTC
I'm verifying this by setting up a 4.5.14 cluster with 3 master nodes and 20 worker nodes, then triggering an upgrade to 4.6.0-0.nightly-2020-10-08-210814. The upgrade completed successfully; it took 27 minutes for the network operator to upgrade.

Comment 4 Yang Yang 2020-10-10 06:03:31 UTC
Hi Scott,

What's the use case that leads to this failure? I'm not able to reproduce it. I set up a cluster consisting of 3 master nodes and 20 worker nodes on 4.5.12 and upgraded to 4.6.0-fc.8. The upgrade was successful even though it took more than 20 minutes for the network and machine-config operators to upgrade.

Comment 5 Scott Dodson 2020-10-10 18:02:35 UTC
It would be whenever any one operator takes longer than 20 minutes to upgrade. I would certainly have expected a cluster with 20 workers to have exceeded 20 minutes on those operators unless maybe you were using a local registry that was exceptionally fast? I was able to exceed 20 minutes in a 12 node cluster in AWS installed from the CI registry.

I'll see if I can confirm it when upgrading between those two versions.

Comment 6 W. Trevor King 2020-10-10 22:14:10 UTC
The updates will succeed with or without this patch.  The distinction is that before this patch, a single operator taking longer than 20m would lead to ClusterVersion reporting Failing=True.  With this patch, ClusterVersion will only report Failing=True if an operator takes more than 40m.  Otherwise cluster status, including the timing of the eventual successful completion, will be identical.  The use case is to reduce false-alarm Failing=True reports for things like the network operator that are expected to take a while during healthy updates.

Comment 7 Yang Yang 2020-10-12 06:48:00 UTC
Thanks Scott and Trevor for the clarification.

I attempted to verify it with the following procedure:

1. Install an AWS cluster with 3 master nodes and 20 worker nodes on 4.5.14
2. Do a cluster upgrade to 4.6.0-0.nightly-2020-10-10-041109
3. Check Failing status

The CVO reported Failing=True when an operator had only been updating for around 13 minutes. Attaching the screenshots.

Comment 8 Yang Yang 2020-10-12 06:53:46 UTC
Created attachment 1720840 [details]
kubeapi failure

Comment 9 Yang Yang 2020-10-12 06:54:35 UTC
Created attachment 1720841 [details]
network failure

Comment 10 Yang Yang 2020-10-12 07:10:40 UTC
Created attachment 1720854 [details]
machine-config failure

Comment 11 Yang Yang 2020-10-12 08:30:04 UTC
The CLI command I used to upgrade the cluster is:
# oc adm upgrade --to-image='registry.svc.ci.openshift.org/ocp/release@sha256:a33eff631dd29cfe412d9cdcfaec68844a18b902e657708e16dbd85203da4eed' --force --allow-explicit-upgrade

The must-gather tarball is online:

https://drive.google.com/file/d/1XfGhdudsKotzevRHOH9VjKUd5_0OiSrY/view?usp=sharing

Comment 12 W. Trevor King 2020-10-12 18:25:58 UTC
From comment 11's must-gather:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-10-12T05:52:30Z 2020-10-12T07:19:30Z 4.6.0-0.nightly-2020-10-10-041109 Completed false
2020-10-12T04:53:42Z 2020-10-12T05:21:41Z 4.5.14 Completed false

All nightlies should have acceptable signatures on them, so no need to --force the update.  Or maybe ART stops publishing nightly signatures once they begin cutting RCs?

From the screenshots, timestamps for the Failing=True were:

* 06:05:21Z Kube-API-server operator still updating
* 06:36:51Z Network operator still updating
* 07:08:05Z Machine-config operator still updating

There won't be anything in the captured CVO logs about that:

$ head -n1 namespaces/openshift-cluster-version/pods/cluster-version-operator-6fbfc576b9-4szqt/cluster-version-operator/cluster-version-operator/logs/current.log 
2020-10-12T07:12:27.038577910Z I1012 07:12:27.022793       1 start.go:21] ClusterVersionOperator 4.6.0-202010100331.p0-c35a4e1

Maybe we should grow update CI coverage using clusters with more compute nodes, so we can get the usual event monitoring and Prometheus captures we collect in CI...

Comment 13 Stephen Cuppett 2020-10-13 12:36:03 UTC
Setting target release to the active development branch (4.7.0). For any fixes, where required and requested, cloned BZs will be created for those release maintenance streams where appropriate once they are identified.

https://coreos.slack.com/archives/CEGKQ43CP/p1602592336386900

Comment 15 W. Trevor King 2020-10-14 21:36:17 UTC
*** Bug 1825006 has been marked as a duplicate of this bug. ***

Comment 16 W. Trevor King 2020-10-15 00:44:47 UTC
Created attachment 1721660 [details]
Gathered CVO logs from a 4.5.15 -> 4.6.0-rc.4 update

Attempt at a reproducer:

cluster-bot: launch 4.5.15

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.6"}]'

$ oc -n openshift-machine-api get machinesets
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-p3gs8vk-f76d1-75z4s-worker-b   1         1         1       1           37m
ci-ln-p3gs8vk-f76d1-75z4s-worker-c   1         1         1       1           37m
ci-ln-p3gs8vk-f76d1-75z4s-worker-d   1         1         1       1           37m
$ oc -n openshift-machine-api patch machineset ci-ln-p3gs8vk-f76d1-75z4s-worker-b --type json -p '[{"op": "add", "path": "/spec/replicas", "value": 18}]'

Wait for the MachineSet to fill the request...

$ oc -n openshift-machine-api get --watch machineset ci-ln-p3gs8vk-f76d1-75z4s-worker-b
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-p3gs8vk-f76d1-75z4s-worker-b   18        18        1       1           40m
...
ci-ln-p3gs8vk-f76d1-75z4s-worker-b   18        18        18      18          42m

Start tailing the CVO:

$ while sleep 1; do POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)" && oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log"; done

In another terminal, initiate the update:

$ oc adm upgrade --to 4.6.0-rc.4

A ways in:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2020-10-14T22:58:20Z RetrievedUpdates=True : 
2020-10-14T22:53:33Z Available=True : Done applying 4.5.15
2020-10-15T00:10:46Z Failing=True ClusterOperatorNotAvailable: Cluster operator network is still updating
2020-10-14T23:17:33Z Progressing=True ClusterOperatorNotAvailable: Unable to apply 4.6.0-rc.4: the cluster operator network has not yet successfully rolled out

At that point:

$ grep 'Running sync.*in state\|Running sync for clusteroperator.*network\|error running apply for clusteroperator.*network\|Done syncing for clusteroperator.*network' 2020-10-14T23:29:57+0000-cluster-version-operator-655f8f6749-dmffp.log
I1014 23:18:15.432717       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 0
I1014 23:24:21.153789       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 1
I1014 23:30:50.343295       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 2
I1014 23:38:09.389702       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 3
I1014 23:47:15.730610       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 4
I1014 23:47:54.142905       1 sync_worker.go:701] Running sync for clusteroperator "network" (493 of 618)
E1014 23:52:57.646403       1 task.go:81] error running apply for clusteroperator "network" (493 of 618): Cluster operator network is still updating
I1014 23:56:02.416775       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 5
I1014 23:56:46.828845       1 sync_worker.go:701] Running sync for clusteroperator "network" (493 of 618)
E1015 00:01:44.328551       1 task.go:81] error running apply for clusteroperator "network" (493 of 618): Cluster operator network is still updating
I1015 00:04:53.587571       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 6
I1015 00:05:31.899081       1 sync_worker.go:701] Running sync for clusteroperator "network" (493 of 618)
E1015 00:10:35.500661       1 task.go:81] error running apply for clusteroperator "network" (493 of 618): Cluster operator network is still updating
I1015 00:13:51.624648       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 7

So that is 23:47Z through 00:10Z, which is indeed short of the target 40m.  And later:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2020-10-14T22:58:20Z RetrievedUpdates=True : 
2020-10-14T22:53:33Z Available=True : Done applying 4.5.15
2020-10-15T00:19:46Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is still updating
2020-10-14T23:17:33Z Progressing=True ClusterOperatorNotAvailable: Unable to apply 4.6.0-rc.4: the cluster operator machine-config has not yet successfully rolled out
$ oc get -o json clusteroperator machine-config | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2020-10-15T00:16:43Z Progressing=True : Working towards 4.6.0-rc.4
2020-10-15T00:16:44Z Degraded=False : 
2020-10-15T00:14:45Z Available=False : Cluster not available for 4.5.15
2020-10-14T22:24:10Z Upgradeable=True AsExpected: 
$ grep 'Running sync.*in state\|Running sync for clusteroperator.*machine-config\|error running apply for clusteroperator.*machine-config\|Done syncing for clusteroperator.*machine-config' 2020-10-14T23:29:57+0000-cluster-version-operator-655f8f6749-dmffp.log
I1014 23:18:15.432717       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 0
I1014 23:24:21.153789       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 1
I1014 23:30:50.343295       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 2
I1014 23:38:09.389702       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 3
I1014 23:47:15.730610       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 4
I1014 23:56:02.416775       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 5
I1015 00:04:53.587571       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 6
I1015 00:13:51.624648       1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 7
I1015 00:14:38.875574       1 sync_worker.go:701] Running sync for clusteroperator "machine-config" (522 of 618)

I've attached the resulting logs, after removing empty files and ones that only contained a subset of another (e.g. the first of two files for the same pod, after the first stream was interrupted by an API-server hiccup or whatever).

Comment 17 W. Trevor King 2020-10-15 02:25:30 UTC
So:

$ grep 'Running sync for clusteroperator.*machine-config\|error running apply for clusteroperator.*machine-config\|Done syncing for clusteroperator.*machine-config' 2020-10-15T00:17:37+0000-cluster-version-operator-655f8f6749-dmffp.log
I1015 00:14:38.875574       1 sync_worker.go:701] Running sync for clusteroperator "machine-config" (522 of 618)
E1015 00:19:33.537783       1 task.go:81] error running apply for clusteroperator "machine-config" (522 of 618): Cluster operator machine-config is still updating

goes right into:

2020-10-15T00:19:46Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is still updating

without any of the intended grace period.  Looking at the linked PR, the grace period is compared against LastProgress.  Checking the code:

$ git --no-pager log --oneline -1
0fa9b047dc (HEAD -> master, wking/master, origin/release-4.8, origin/release-4.7, origin/master, origin/HEAD) Merge pull request #472 from wking/demote-status-change-logging
$ git grep LastProgress | grep -v test
pkg/cvo/status.go:	if len(history) == 0 || status.Failure == nil || status.Reconciling || status.LastProgress.IsZero() {
pkg/cvo/status.go:	if now.Sub(status.LastProgress) > 40*time.Minute || now.Sub(history[0].StartedTime.Time) > time.Hour {
pkg/cvo/sync_worker.go:	LastProgress time.Time
pkg/cvo/sync_worker.go:		status.LastProgress = time.Now()

Ah, because we were over an hour out from the 23:17Z update start time.  And there's really nothing in that logic that is specific to ClusterOperator.  So... this will take some shuffling...
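
To make the arithmetic concrete, here is a small standalone sketch (mine, not CVO code) using the timestamps quoted above; it shows the machine-config sync was only minutes old, but the now.Sub(history[0].StartedTime.Time) > time.Hour clause had already been exceeded, so the 40-minute grace period never applied:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps taken from the logs and conditions quoted above.
	updateStart, _ := time.Parse(time.RFC3339, "2020-10-14T23:17:33Z")  // Progressing=True transition
	mcoSyncStart, _ := time.Parse(time.RFC3339, "2020-10-15T00:14:38Z") // Running sync for machine-config
	failingSet, _ := time.Parse(time.RFC3339, "2020-10-15T00:19:46Z")   // Failing=True transition

	// The machine-config operator had only been syncing for a few minutes...
	fmt.Println(failingSet.Sub(mcoSyncStart)) // 5m8s, well inside the 40m grace period

	// ...but the update as a whole was already past the one-hour clause,
	// so Failing=True was set immediately.
	fmt.Println(failingSet.Sub(updateStart) > time.Hour) // true
}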

Comment 18 W. Trevor King 2020-10-25 15:54:21 UTC
I didn't have time to refactor to address the issue discussed in comment 17.  Hopefully next sprint.

Comment 20 W. Trevor King 2020-11-25 00:09:13 UTC
(In reply to jamo luhrsen from comment #19)
> ... wondering if this failure is the same as this bug?...

Nope.  From your job:

level=info msg=Waiting up to 40m0s for the cluster at https://api.ci-op-q4qfig05-2659c.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize...
...
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is still updating

So you are hitting the installer's 40m timeout for going from bootstrap-teardown to install-complete [1].

[1]: https://github.com/openshift/installer/blob/095b6e6e67fd105b53f1ced7bb783aa1eaf1847f/cmd/openshift-install/create.go#L353
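
For clarity, that installer timeout is just a bounded wait on cluster initialization; a rough, generic sketch of that pattern (not the installer's actual implementation) looks like:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForClusterInit sketches a bounded wait like the installer's 40-minute window
// between bootstrap teardown and install-complete (illustrative only).
func waitForClusterInit(ctx context.Context, initialized func() bool) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return errors.New("failed to initialize the cluster: timed out")
		case <-ticker.C:
			if initialized() {
				return nil
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 40*time.Minute)
	defer cancel()
	// Stand-in check; the real installer watches the ClusterVersion conditions.
	fmt.Println(waitForClusterInit(ctx, func() bool { return true }))
}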

Comment 21 W. Trevor King 2020-11-25 00:10:23 UTC
I think the proxy vs. machine-config thing is bug 1899979.

Comment 22 jamo luhrsen 2020-11-25 00:26:27 UTC
(In reply to W. Trevor King from comment #21)
> I think the proxy vs. machine-config thing is bug 1899979.

Thank you. I'll update 1899979 with the job info.

Comment 23 jamo luhrsen 2020-11-25 04:10:22 UTC
(In reply to jamo luhrsen from comment #22)
> (In reply to W. Trevor King from comment #21)
> > I think the proxy vs. machine-config thing is bug 1899979.
> 
> Thank you. I'll update 1899979 with the job info.

Actually, I think it looks more like bug 1901034, which I just updated with a few details.

Comment 24 W. Trevor King 2020-12-04 15:55:56 UTC
End of sprint; didn't get to this one.  Failing=True when things aren't that bad is annoying though, so hopefully next sprint.  Comment 17 breaks down the current issue, which is looking at status.LastProgress and history, but not distinguishing between a long-pending manifest and a failing manifest.

Comment 27 Lalatendu Mohanty 2021-01-08 16:41:44 UTC
Raising the priority to high as this will help us bring more clarity with respect to actual update failures in telemetry/datahub.

Comment 28 W. Trevor King 2021-01-21 19:24:31 UTC
*** Bug 1888754 has been marked as a duplicate of this bug. ***

Comment 30 Yang Yang 2021-02-05 06:40:32 UTC
I'm verifying it by upgrading from 4.6.16 to 4.7.0-0.nightly-2021-02-04-031352. 

1. The network operator did not reach the 40-minute timeout, but Failing=True fired; eventually it became ready. Is this expected behaviour?

In the middle of the upgrade, we can see Failing=True fire:

status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-02-05T02:13:16Z"
      message: Done applying 4.6.16
      status: "True"
      type: Available
    - lastTransitionTime: "2021-02-05T04:55:09Z"
      message: Cluster operator network is not available
      reason: ClusterOperatorNotAvailable
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-02-05T04:19:53Z"
      message: 'Unable to apply 4.7.0-0.nightly-2021-02-04-031352: the cluster operator network has not yet successfully rolled out'
      reason: ClusterOperatorNotAvailable
      status: "True"
      type: Progressing

The upgrade eventually ends in success:

status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-02-05T02:13:16Z"
      message: Done applying 4.7.0-0.nightly-2021-02-04-031352
      status: "True"
      type: Available
    - lastTransitionTime: "2021-02-05T05:07:54Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2021-02-05T05:40:37Z"
      message: Cluster version is 4.7.0-0.nightly-2021-02-04-031352
      status: "False"
      type: Progressing

2. While the clusterversion reports Progressing=False and the target version is available, 2 operators still report Progressing=True. Does that deserve a fix?

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-04-031352   True        False         60s     Cluster version is 4.7.0-0.nightly-2021-02-04-031352

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      2m42s
baremetal                                  4.7.0-0.nightly-2021-02-04-031352   True        False         False      60m
cloud-credential                           4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h5m
cluster-autoscaler                         4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h59m
config-operator                            4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
console                                    4.7.0-0.nightly-2021-02-04-031352   True        False         False      9m16s
csi-snapshot-controller                    4.7.0-0.nightly-2021-02-04-031352   True        False         False      23m
dns                                        4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h59m
etcd                                       4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h44m
image-registry                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h24m
ingress                                    4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h37m
insights                                   4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
kube-apiserver                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h44m
kube-controller-manager                    4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h58m
kube-scheduler                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h58m
kube-storage-version-migrator              4.7.0-0.nightly-2021-02-04-031352   True        False         False      16m
machine-api                                4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h55m
machine-approver                           4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
machine-config                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      2m38s
marketplace                                4.7.0-0.nightly-2021-02-04-031352   True        False         False      15m
monitoring                                 4.7.0-0.nightly-2021-02-04-031352   True        False         False      56m
network                                    4.7.0-0.nightly-2021-02-04-031352   True        True          False      31m
node-tuning                                4.7.0-0.nightly-2021-02-04-031352   True        False         False      58m
openshift-apiserver                        4.7.0-0.nightly-2021-02-04-031352   True        False         False      2m54s
openshift-controller-manager               4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h59m
openshift-samples                          4.7.0-0.nightly-2021-02-04-031352   True        False         False      59m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-02-04-031352   True        False         False      9m3s
service-ca                                 4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
storage                                    4.7.0-0.nightly-2021-02-04-031352   True        True          False      15m

Comment 31 Yang Yang 2021-02-05 11:05:51 UTC
Please ignore comment #30.

Trying to verify this by upgrading from 4.6.16 to 4.7.0-0.nightly-2021-02-04-031352.

Installed a GCP cluster with 3 workers, then scaled the worker nodes out to 20.

While the cluster is upgrading, run:

$ while sleep 2; do oc get clusterversion -oyaml | grep Failing -a5; done
 
    - lastTransitionTime: "2021-02-05T10:34:15Z"
      message: Cluster operator network is not available
      reason: ClusterOperatorNotAvailable
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-02-05T09:50:45Z"
      message: 'Unable to apply 4.7.0-0.nightly-2021-02-04-031352: the cluster operator network has not yet successfully rolled out'
      reason: ClusterOperatorNotAvailable
      status: "True"
      type: Progressing


# oc logs pod/cluster-version-operator-595847559-h2jl8 | grep 'Running sync.*in state\|Running sync for clusteroperator.*network\|error running apply for clusteroperator.*network\|Done syncing for clusteroperator.*network'
I0205 09:51:14.513873       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 0
I0205 09:57:18.731310       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 1
I0205 10:03:48.470422       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 2
I0205 10:10:57.148100       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 3
I0205 10:19:46.907951       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 4
I0205 10:28:30.790725       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 5
I0205 10:30:03.104291       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:34:12.704584       1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:37:13.280623       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 6
I0205 10:37:55.240904       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:42:55.194336       1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:45:47.166156       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 7
I0205 10:46:29.126361       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:51:29.079421       1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:54:41.979907       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 8
I0205 10:55:23.939235       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
I0205 10:55:23.939533       1 sync_worker.go:777] Done syncing for clusteroperator "network" (532 of 668)


Every time the 'error running apply for clusteroperator "network"' message appears, Failing=True fires.

I'm wondering when the network ClusterOperator starts to sync; is it at the same time as "Updating at attempt 0"?

Comment 32 Jack Ottofaro 2021-02-05 16:35:34 UTC
(In reply to Yang Yang from comment #31)
> Please ignore comment#30
> 
> Trying to verify it by upgrading from 4.6.16 to
> 4.7.0-0.nightly-2021-02-04-031352
> 
> Install a gcp cluster with 3 workers and then scale out worker nodes to 20
> 
If a CO is Available=False, as network was in your test, then Failing=True is set.

Not sure about your question but can you provide the CVO log from that run?

Comment 33 Yang Yang 2021-02-07 01:56:28 UTC
Created attachment 1755411 [details]
CVO logs for upgrading from 4.6.16 to 4.7.0-0.nightly-2021-02-04-031352

Comment 34 Yang Yang 2021-02-07 02:15:36 UTC
(In reply to Jack Ottofaro from comment #32)
> (In reply to Yang Yang from comment #31)
> > Please ignore comment#30
> > 
> > Trying to verify it by upgrading from 4.6.16 to
> > 4.7.0-0.nightly-2021-02-04-031352
> > 
> > Install a gcp cluster with 3 workers and then scale out worker nodes to 20
> > 
> If a CO is Available=False as network was in your test then Failing=True.
> Not sure about your question but can you provide the CVO log from that run?

In my testing, we can see,

I0205 10:30:03.104291       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:34:12.704584       1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available

Does that mean the CO network started to sync at 10:30:03.104291 and finished at 10:34:12.704584? 

If that's true, the network operator had only been syncing for about 4 minutes when it reported ClusterOperatorNotAvailable, yet Failing=True fired.

From your PR description, ClusterOperatorNotAvailable should not result in cluster Failing=True unless the syncing time has exceeded 40 minutes. Does the syncing time refer to an individual CO's syncing time?

Comment 35 W. Trevor King 2021-02-07 05:19:34 UTC
Nit: attachment from comment 33 has a bunch of empty files; no need to include those.  And there are a few sequentially-growing files for cluster-version-operator-595847559-h2jl8, all starting at 09:51:13.692221.  The final, largest one of those contains all the lines from the earlier, smaller two, as well as additional lines, so it's the only file from that pod which needs to be included.  All the files for cluster-version-operator-b5997dd8c-9xgjx are just "unable to retrieve container logs for cri-o://...", which isn't all that useful.  Not clear what was going on with that pod.  Events or Pod YAML might say, but probably not all that important.  So the only two files which impact this bug are:

  2021-02-05T09:58:21+0000-cluster-version-operator-595847559-h2jl8.log
  2021-02-05T11:03:07+0000-cluster-version-operator-595847559-2sczw.log

Timestamps in the filenames are when the gathers started.  Getting times for when the pods started:

  $ head -n1 2021-02-05T09:58:21* 2021-02-05T11:03:07*
  ==> 2021-02-05T09:58:21+0000-cluster-version-operator-595847559-h2jl8.log <==
  I0205 09:51:13.692221       1 start.go:21] ClusterVersionOperator 4.7.0-202102032256.p0-9f3368d

  ==> 2021-02-05T11:03:07+0000-cluster-version-operator-595847559-2sczw.log <==
  I0205 11:09:10.872857       1 start.go:21] ClusterVersionOperator 4.7.0-202102032256.p0-9f3368d

So your 10:30 bits in comment 34 are from the h2jl8 log.  [1] will try to explain the CVO's major moves, but digging in by hand, because it's easier to paste text into bugs than share screenshots, with notes inline:

  $ grep 'Running sync.*in state\|Result of work\|clusteroperator "network"' 2021-02-05T09:58:21* | grep -v Precreated
  I0205 09:51:14.513873       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 0
  I0205 09:51:15.393281       1 task_graph.go:555] Result of work: []
  I0205 09:56:56.427293       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is still updating]

Attempt 0 timed out waiting for kube-apiserver.  CVO isn't mad, but it's just going to start the sync again to ensure the earlier manifests haven't drifted since it last checked them.

  I0205 09:57:18.731310       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 1
  I0205 09:57:18.958491       1 task_graph.go:555] Result of work: []
  I0205 10:03:00.646547       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is still updating]

Still waiting for kube-apiserver after attempt 1.

  I0205 10:03:48.470422       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 2
  I0205 10:03:48.642119       1 task_graph.go:555] Result of work: []
  I0205 10:09:30.398527       1 task_graph.go:555] Result of work: [Cluster operator config-operator is still updating Cluster operator machine-api is still updating Cluster operator openshift-apiserver is still updating]

Attempt 2 got farther; now we're waiting for config-operator, machine-api, and openshift-apiserver.

  I0205 10:10:57.148100       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 3
  I0205 10:10:57.380076       1 task_graph.go:555] Result of work: []
  I0205 10:16:39.089117       1 task_graph.go:555] Result of work: [Cluster operator cluster-autoscaler is still updating Cluster operator image-registry is still updating Cluster operator monitoring is still updating Cluster operator console is still updating Cluster operator operator-lifecycle-manager-packageserver is still updating Cluster operator marketplace is still updating]

And farther still; now we're waiting for cluster-autoscaler, etc.

  I0205 10:19:46.907951       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 4
  I0205 10:19:47.140019       1 task_graph.go:555] Result of work: []
  I0205 10:25:28.821398       1 task_graph.go:555] Result of work: [Cluster operator insights is degraded]

The CVO starts to get suspicious.  Shouldn't be Failing yet, but we will eventually go Failing=True if insights stays degraded for long enough.  We probably just got interested in the insights operator in this attempt 4 (having been blocked by other, earlier manifests in previous attempts), so it probably hasn't been long enough for us to actually be mad at it yet.

  I0205 10:28:30.790725       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 5
  I0205 10:28:31.013985       1 task_graph.go:555] Result of work: []
  I0205 10:30:03.104291       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
  E0205 10:34:12.704584       1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
  I0205 10:34:12.704720       1 task_graph.go:555] Result of work: [Cluster operator network is not available]

First round where we got far enough in to be curious about the network operator, which is when the network operator's 40-minute clock starts.  But the clock doesn't matter, because Available=False is seriously bad.  We should immediately go Failing=True, even though we are only 4m into the 40-minute clock.

  I0205 10:37:13.280623       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 6
  I0205 10:37:13.576075       1 task_graph.go:555] Result of work: []
  I0205 10:37:55.240904       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
  E0205 10:42:55.194336       1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
  I0205 10:42:55.194430       1 task_graph.go:555] Result of work: [Cluster operator network is not available]

Still seriously bad, so still Failing=True, even though we're now 8m into the network operator's 40 minute clock.

  I0205 10:45:47.166156       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 7
  I0205 10:45:47.460320       1 task_graph.go:555] Result of work: []
  I0205 10:46:29.126361       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
  E0205 10:51:29.079421       1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
  I0205 10:51:29.079558       1 task_graph.go:555] Result of work: [Cluster operator network is not available]

Still seriously bad, so still Failing=True.

  I0205 10:54:41.979907       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 8
  I0205 10:54:42.193448       1 task_graph.go:555] Result of work: []
  I0205 10:55:23.939235       1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
  I0205 10:55:23.939533       1 sync_worker.go:777] Done syncing for clusteroperator "network" (532 of 668)
  I0205 11:00:23.895880       1 task_graph.go:555] Result of work: [Cluster operator machine-config is still updating]

Ahh, network is finally happy again.  In fact, not only is it back to Available=True, but it's completed its update.  We should go back to Failing=False.  Now we're waiting on machine-config, but there is nothing concerning about its current state; it just hasn't leveled yet.

  I0205 11:03:27.017620       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 9
  I0205 11:03:27.382521       1 task_graph.go:555] Result of work: []
  I0205 11:08:59.439678       1 task_graph.go:555] Result of work: [Cluster operator authentication is degraded]

Likely the machine-config operator rolling the nodes while it updated freaked out the auth operator.  It's been a while since we asked auth to update, so maybe we go straight to Failing=True again here?  Jack, we might want to reset times for operators any time we successfully reconcile their manifests.  I don't think that blocks this particular bug; it's just a follow-up idea.

Moving on to the next pod:

  $ grep 'Running sync.*in state\|Result of work' 2021-02-05T11:03:07* | head     
  I0205 11:09:15.852963       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 0
  I0205 11:09:17.919567       1 task_graph.go:555] Result of work: []
  I0205 11:14:57.767259       1 task_graph.go:555] Result of work: [Cluster operator openshift-apiserver is degraded]

A newly degraded operator, but we don't preserve operator clocks between CVO pods, so openshift-apiserver is only 5 minutes into its clock.  We set Failing=False because openshift-apiserver is only Degraded=True, and that's not too bad.

  I0205 11:15:20.071310       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 1
  I0205 11:15:20.265234       1 task_graph.go:555] Result of work: []
  I0205 11:21:01.984970       1 task_graph.go:555] Result of work: [Cluster operator openshift-apiserver is degraded]

12 minutes into the openshift-apiserver clock, it's still just Degraded=True, so we stay Failing=False.

  I0205 11:21:49.201918       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 2
  I0205 11:21:49.391717       1 task_graph.go:555] Result of work: []
  I0205 11:24:16.786081       1 task_graph.go:555] Result of work: []
  I0205 11:27:07.743276       1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Reconciling at attempt 0

Update complete :)

Looks like we don't log ClusterVersion condition changes, but my guesses from above are that we went Failing=True at 10:34, Failing=False at 11:00, Failing=True at 11:08, Failing=False at 11:14, and update-complete at 11:24.  If that matches what you saw, it's probably enough to verify this bug.  If it diverges, tell me where.

[1]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/hack/log-explainer.py

Comment 36 Yang Yang 2021-02-07 08:08:33 UTC
Trevor, thanks a lot for walking me through the log. Really appreciate it.

That's exactly what I saw.

So, from my understanding (not sure if this is correct):

1) If a CO is Available=False, then Failing=True fires even if the 40-minute window has not been exceeded.
2) If a CO is Degraded=True, then Failing=True fires only when it goes beyond the 40-minute window.
3) If a CO is still updating, then Failing=True fires only when it goes beyond the 40-minute window.

Comment 37 Yang Yang 2021-02-07 11:37:43 UTC
Based on comment #35, it works as designed. Moving it to the verified state.

Comment 38 W. Trevor King 2021-02-08 03:33:00 UTC
> 1) if a co is in a state of Available=False, then Failing=True fires even if it does not exceed 40 min time slot.
> 2) If a co is in a state of degraded=True, then Failing=True fires only when it goes beyond 40 min time slot.

Yes to both of these, with some corner cases like "the CVO does not actually care about your ClusterOperator state until it is deep enough into the update graph [1] to get to that manifest".

> 3) If a co is in a state of still updating, then Failing=True fires only when it goes beyond 40 min time slot.

If a ClusterOperator is Available=True and Degraded=False, we will now happily wait forever for it to level up without going Failing=True [2].  This is useful for things like networking's DNS DaemonSet, which can take quite a while to roll out over a cluster with hundreds of compute nodes.

[1]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/docs/user/reconciliation.md#manifest-graph
[2]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/cvo/internal/operatorstatus.go#L253
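
Putting those rules together as a sketch of my own (an illustration of the behaviour described in comments 35 through 38, not the actual operatorstatus.go code; the 40-minute value comes from PR 464):

package main

import (
	"fmt"
	"time"
)

// failingForOperator sketches how the CVO decides whether a single not-yet-leveled
// ClusterOperator should surface as ClusterVersion Failing=True during an update.
func failingForOperator(available, degraded bool, sinceInterested time.Duration) bool {
	switch {
	case !available:
		// Available=False is treated as seriously bad: Failing=True immediately.
		return true
	case degraded:
		// Degraded=True only trips Failing once the grace period is exhausted.
		return sinceInterested > 40*time.Minute
	default:
		// Available=True, Degraded=False, just not leveled yet: wait indefinitely.
		return false
	}
}

func main() {
	fmt.Println(failingForOperator(false, false, 4*time.Minute)) // true: network Available=False at 10:34
	fmt.Println(failingForOperator(true, true, 5*time.Minute))   // false: openshift-apiserver Degraded=True at 11:14
	fmt.Println(failingForOperator(true, false, 90*time.Minute)) // false: a slow-but-healthy DaemonSet rollout
}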

Comment 41 errata-xmlrpc 2021-02-24 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

