1768255 – installer reports 100% complete but failing components

Summary: installer reports 100% complete but failing components

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Jack Ottofaro
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1829118 1897612
Depends On:
Blocks:	dit

Reported:	2019-11-03 18:06 UTC by Ben Parees
Modified:	2021-02-24 15:11 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Previously, the install/upgrade progressing message would be similar to: "Working towards 4.3.0-0.ci-2019-11-01-122324: 100% complete". Due to rounding, the percent shown could be 100 before the install/upgrade was actually complete. With this fix, we no longer round up, and the progressing message now reads: "Working towards 4.3.0-0.ci-2019-11-01-122324: 660 of 668 done (98% complete)".
Clone Of:
Environment:
Last Closed:	2021-02-24 15:10:48 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 497	0	None	closed	Bug 1768255: replace Fraction with Done and Total	2021-02-18 20:10:10 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:11:52 UTC

Description Ben Parees 2019-11-03 18:06:19 UTC

Description of problem:

The install log shows:

level=info msg="Cluster operator console Available is False with DeploymentAvailableFailedUpdate: DeploymentAvailable: 1 replicas ready at version 4.3.0-0.ci-2019-11-01-122324"
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Working towards 4.3.0-0.ci-2019-11-01-122324: 100% complete"

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.3/127

I would expect this to show less than 100% complete.

(I am opening a separate bug against console for the fact that it did not become available)


Similar behavior seen in:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.3/127

Comment 1 Abhinav Dahiya 2019-11-04 17:58:55 UTC

We didn't incorrectly mark the cluster as ready to use, so this is a cosmetic fix.

Comment 2 W. Trevor King 2019-11-23 05:35:01 UTC

The cosmetic fix will happen in the CVO, whose message the installer is just passing along.

Comment 3 W. Trevor King 2020-01-31 19:31:10 UTC

I would really like to say:

  Working towards 4.3.0-0.ci-2019-11-01-122324: $n of $m objects applied (100%) in the current sync round

with maybe batched updates so we don't push too often (e.g. push when there has been a change but the current ClusterVersion status is >30s old).  For the specific job that led to this bug [1]:

  2019-11-01T15:44:07.358978842Z E1101 15:44:07.358590       1 task.go:77] error running apply for clusteroperator "console" (332 of 486): Cluster operator console has not yet reported success
  ...
  2019-11-01T15:44:07.35927455Z I1101 15:44:07.359244       1 task_graph.go:611] Result of work: [Cluster operator console has not yet reported success]

Presumably the ClusterOperator was at the back of its manifest block, and we successfully pushed all 485 other manifests in that sync round.  Including "in the current sync round" would also mitigate bug 1690816.

I'd also like that 'Result of work' to go into .extensions or some such on the cluster object, for folks who want a dive into the details of the sticking manifests.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.3/127/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-01-122324-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-cluster-version/pods/cluster-version-operator-6c89697849-h9p7t/cluster-version-operator/cluster-version-operator/logs/current.log
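
As a minimal sketch of the batching rule suggested above (push only when something changed and the current ClusterVersion status is more than 30s old): the names `shouldPush` and `minStatusAge` are hypothetical and are not taken from the CVO source; only the 30s threshold comes from this comment.

```go
package main

import (
	"fmt"
	"time"
)

// minStatusAge is the 30s threshold floated in this comment.
// The constant name is an assumption made for this sketch.
const minStatusAge = 30 * time.Second

// shouldPush is a hypothetical helper: it reports whether a new status
// should be pushed, given whether progress changed since the last push
// and how old the currently published status is.
func shouldPush(changed bool, statusAge time.Duration) bool {
	return changed && statusAge > minStatusAge
}

func main() {
	fmt.Println(shouldPush(true, 10*time.Second))  // changed, but status is still fresh
	fmt.Println(shouldPush(true, 31*time.Second))  // changed and status is stale
	fmt.Println(shouldPush(false, 2*time.Minute)) // nothing changed, nothing to push
}
```

A real implementation would also need to flush the final state when a sync round ends, so that the last change is not left unpublished; that part is omitted here.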

Comment 5 W. Trevor King 2020-05-19 03:26:47 UTC

*** Bug 1829118 has been marked as a duplicate of this bug. ***

Comment 6 W. Trevor King 2020-05-19 03:30:14 UTC

[1] would make collecting information like "which manifests have we failed to push?" easier, but it's unlikely to land during freeze.  Punting to UpcomingSprint; hopefully we'll make some progress here once master and 4.6 split off from 4.5.

[1]: https://github.com/openshift/cluster-version-operator/pull/264

Comment 7 Lalatendu Mohanty 2020-06-09 12:56:36 UTC

We do not have time to fix the bug in this sprint as we are working on higher priority bugs and features.  Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.

Comment 8 Lalatendu Mohanty 2020-07-09 14:08:24 UTC

We do not have time to fix the bug in this sprint as we are working on higher priority bugs and features.  Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.

Comment 9 Jack Ottofaro 2020-07-30 20:00:48 UTC

We do not have time to fix the bug in this sprint as we are working on higher priority bugs and features.  Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.

Comment 10 W. Trevor King 2020-08-25 17:14:33 UTC

This is more feature-y, and we're past feature freeze for 4.6.  Moving this to 4.7.

Comment 12 W. Trevor King 2020-09-13 04:58:39 UTC

I still want comment 3, but this is a cosmetic issue, so it's taken a backseat to more impactful bugs.  Maybe next sprint...

Comment 13 W. Trevor King 2020-10-04 02:42:41 UTC

Comment 12 is still current.

Comment 14 W. Trevor King 2020-11-13 17:53:25 UTC

*** Bug 1897612 has been marked as a duplicate of this bug. ***

Comment 16 W. Trevor King 2020-12-16 20:52:02 UTC

Replacing Fraction [1] with done/total [2] and rendering done and total (and a locally-computed percent) with comment 3's:

  Working towards 4.3.0-0.ci-2019-11-01-122324: $n of $m objects applied (100%) in the current sync round

seems like it wouldn't be that bad.  I don't see a way to get to [3]'s Complete without syncing all the resources.  If there is a way, it's probably its own bug, because we don't want to transition to reconciling mode before we have reconciled all the manifests.

[1]: https://github.com/openshift/cluster-version-operator/blob/1e51a0e4750ca110d4659f33bce210a3de6844b9/pkg/cvo/sync_worker.go#L92
[2]: https://github.com/openshift/cluster-version-operator/blob/1e51a0e4750ca110d4659f33bce210a3de6844b9/pkg/cvo/sync_worker.go#L783-L784
[3]: https://github.com/openshift/cluster-version-operator/blob/1e51a0e4750ca110d4659f33bce210a3de6844b9/pkg/cvo/sync_worker.go#L753
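
A rough illustration of why rendering done and total with a truncated percent avoids the misleading message. This is a sketch, not the code from the linked pull request: `progressMessage` and `roundedPercent` are hypothetical names, the handling of a non-positive total is an assumption, and the rounding helper only approximates how the old Fraction-based message could reach 100% early. The 485-of-486 figures come from the log in comment 3, and the message format matches the Doc Text.

```go
package main

import (
	"fmt"
	"math"
)

// roundedPercent approximates the old behaviour: rounding to the
// nearest integer lets 485 of 486 show up as 100.
func roundedPercent(done, total int) int {
	return int(math.Round(float64(done) * 100 / float64(total)))
}

// progressMessage renders done and total plus a locally computed
// percent. Integer division truncates, so 100% only appears when
// done == total.
func progressMessage(version string, done, total int) string {
	if total <= 0 {
		// Assumption for this sketch: with no known total, omit the counts.
		return fmt.Sprintf("Working towards %s", version)
	}
	percent := done * 100 / total
	return fmt.Sprintf("Working towards %s: %d of %d done (%d%% complete)",
		version, done, total, percent)
}

func main() {
	fmt.Println(roundedPercent(485, 486)) // 100, even though one manifest is outstanding
	fmt.Println(progressMessage("4.3.0-0.ci-2019-11-01-122324", 485, 486))
	fmt.Println(progressMessage("4.3.0-0.ci-2019-11-01-122324", 660, 668))
}
```

With truncation, the second line reports 99% for 485 of 486, and the third reproduces the "660 of 668 done (98% complete)" example from the Doc Text.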

Comment 17 Lalatendu Mohanty 2021-01-08 15:57:11 UTC

Reducing the severity of the bug as this is a cosmetic issue and is not causing the CI jobs to fail.

Comment 19 liujia 2021-01-22 09:30:34 UTC

Installation monitor against 4.7.0-0.nightly-2021-01-21-172657

...
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.7.0-0.nightly-2021-01-21-172657: 640 of 664 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.7.0-0.nightly-2021-01-21-172657: 642 of 664 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.7.0-0.nightly-2021-01-21-172657: 644 of 664 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.7.0-0.nightly-2021-01-21-172657: 649 of 664 done (97% complete)
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.7.0-0.nightly-2021-01-21-172657: 658 of 664 done (99% complete)
level=debug msg=Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2
level=debug msg=Cluster is initialized

Upgrade monitor against v4.6 to 4.7.0-0.nightly-2021-01-21-172657
version   4.6.13   True   True   3s    Working towards 4.7.0-0.nightly-2021-01-21-172657: downloading update
version   4.6.13   True   True   63s   Working towards 4.7.0-0.nightly-2021-01-21-172657: 70 of 664 done (10% complete)
..
version   4.6.13   True   True   4m6s   Working towards 4.7.0-0.nightly-2021-01-21-172657: 96 of 664 done (14% complete)
version   4.6.13   True   True   18m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 116 of 664 done (17% complete)
version   4.6.13   True   True   19m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 174 of 664 done (26% complete), waiting on machine-api, openshift-apiserver
version   4.6.13   True   True   21m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 175 of 664 done (26% complete)
version   4.6.13   True   True   23m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 358 of 664 done (53% complete)
version   4.6.13   True   True   24m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 497 of 664 done (74% complete)
version   4.6.13   True   True   25m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 516 of 664 done (77% complete)
version   4.6.13   True   True   26m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 518 of 664 done (78% complete), waiting on cluster-autoscaler
version   4.6.13   True   True   34m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 527 of 664 done (79% complete)
version   4.6.13   True   True   35m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 527 of 664 done (79% complete), waiting on network
version   4.6.13   True   True   44m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 556 of 664 done (83% complete)
version   4.6.13   True   True   45m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 191 of 664 done (28% complete)
version   4.6.13   True   True   50m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 556 of 664 done (83% complete)
version   4.6.13   True   True   51m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 556 of 664 done (83% complete), waiting on machine-config
version   4.6.13   True   True   55m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 3 of 664 done (0% complete)
version   4.6.13   True   True   56m   Working towards 4.7.0-0.nightly-2021-01-21-172657: 175 of 664 done (26% complete)
version   4.7.0-0.nightly-2021-01-21-172657   True   False   0s    Cluster version is 4.7.0-0.nightly-2021-01-21-172657

New reporting progress looks good.

Comment 22 errata-xmlrpc 2021-02-24 15:10:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633