Bug 1763821 - [upgrade-4.1-4.2] Canceling the task graph partway through should be an error even if no tasks fail
Summary: [upgrade-4.1-4.2] Canceling the task graph partway through should be an error ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1763822 1763823
 
Reported: 2019-10-21 16:54 UTC by W. Trevor King
Modified: 2020-01-23 11:08 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1763822 1763823 (view as bug list)
Environment:
Last Closed: 2020-01-23 11:08:30 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
Github openshift cluster-version-operator pull 255 (closed): Bug 1763821: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail (last updated 2020-05-28 05:55:54 UTC)
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:08:49 UTC)

Description W. Trevor King 2019-10-21 16:54:42 UTC
From [1]:

2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073       1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty
...
2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0
...
2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604       1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432)
2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220       1 task_graph.go:583] Canceled worker 0
2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360       1 task_graph.go:583] Canceled worker 3
...
2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715       1 task_graph.go:603] Workers finished
2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759       1 task_graph.go:611] Result of work: []
2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846       1 task_graph.go:539] Stopped graph walker due to cancel
...
2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0
...

Here the CVO cancels some workers, sees that there are no errors, and decides "upgrade complete" despite never having attempted to push the bulk of its manifests. With this change, the result of work will include the worker-canceled errors, and we'll take another update round instead of declaring success and moving into reconciling.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log
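
For illustration only, here is a minimal Go sketch of the pattern the commit message describes, assuming invented names (runTasks, taskCh, errCh) rather than the CVO's real pkg/payload/task_graph identifiers: when the context is canceled partway through a parallel run, each canceled worker contributes an error to the aggregated result, so the caller can no longer mistake a partial run ("Result of work: []") for a successful one.

// Minimal sketch, not the CVO's actual task-graph code.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// runTasks fans tasks out to a pool of workers and aggregates their errors.
func runTasks(ctx context.Context, tasks []string, workers int) []error {
	taskCh := make(chan string)
	errCh := make(chan error, workers) // one potential cancel error per worker
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for task := range taskCh {
				_ = task // pretend to push the manifest here
				time.Sleep(10 * time.Millisecond)
				if ctx.Err() != nil {
					// The fix: a canceled worker reports an error instead of
					// returning silently, so the result of work is never empty
					// after a partial run.
					errCh <- fmt.Errorf("worker %d canceled: %w", id, ctx.Err())
					return
				}
			}
		}(i)
	}

	// Feed tasks until they run out or the context is canceled.
	go func() {
		defer close(taskCh)
		for _, task := range tasks {
			select {
			case <-ctx.Done():
				return
			case taskCh <- task:
			}
		}
	}()

	wg.Wait()
	close(errCh)
	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errs
}

func main() {
	// Cancel partway through, as happens when a sync round is interrupted.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Millisecond)
	defer cancel()

	errs := runTasks(ctx, make([]string, 100), 4)
	// With the fix, errs is non-empty after a cancel, so the caller retries
	// the update instead of declaring success and moving into reconciling.
	fmt.Printf("result of work: %d errors\n", len(errs))
}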

Comment 1 Clayton Coleman 2019-10-21 17:37:47 UTC
Even worse, we'll switch into reconciling mode, which means we might upgrade nodes before the control plane; that is not allowed.

This is a 4.1 to 4.2 upgrade blocker.

Comment 2 liujia 2019-10-22 07:45:16 UTC
I tried a normal upgrade from 4.1.20 to 4.2.0. The upgrade actually finished, and there was no worker-cancel message in the CVO log.

According to the discussion with the developers on Slack (https://coreos.slack.com/archives/CEGKQ43CP/p1571710624054300), this is a race condition that only occurs some of the time during e2e tests. The symptom is that although the e2e job reports the upgrade as successful/finished, the cluster operators did not finish syncing (the cluster upgrade is still "Progressing: Working towards 0.0.1-2019-10-21-095122: 25% complete"). That check point is already covered by our upgrade test case, so I don't think QE needs an additional case for it.

As for reproducing and verifying this bug, it reproduces more easily in e2e tests such as the example jobs in https://coreos.slack.com/archives/CEKNRGF25/p1571655534427500, so we can run a regression test against the target build once the PR lands.

Please feel free to correct me.

Comment 4 liujia 2019-10-25 03:14:51 UTC
I tried upgrading from 4.2.1 to 4.3.0-0.nightly-2019-10-24-203507, and it failed.
Checking our CI test results on https://openshift-release.svc.ci.openshift.org, the 4.2.1 to 4.3 upgrade path is still not available.
So regression testing for this bug is blocked.

Comment 5 liujia 2019-10-31 08:54:06 UTC
Regression test passed: upgraded from v4.2.2 to 4.3.0-0.nightly-2019-10-31-022441 successfully.
# oc get clusterversion -o json | jq '.items[0].status.history'
[
  {
    "completionTime": "2019-10-31T07:00:44Z",
    "image": "registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-10-31-022441",
    "startedTime": "2019-10-31T06:13:16Z",
    "state": "Completed",
    "verified": false,
    "version": "4.3.0-0.nightly-2019-10-31-022441"
  },
  {
    "completionTime": "2019-10-31T03:23:31Z",
    "image": "registry.svc.ci.openshift.org/ocp/release@sha256:dc782b44cac3d59101904cc5da2b9d8bdb90e55a07814df50ea7a13071b0f5f0",
    "startedTime": "2019-10-31T02:58:46Z",
    "state": "Completed",
    "verified": false,
    "version": "4.2.2"
  }
]

No extra test case is needed per comment 2, so removing needtestcase.

Comment 7 errata-xmlrpc 2020-01-23 11:08:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

