Bug 1771316

Summary: cluster version object does not reflect the cluster's correct status
Product: OpenShift Container Platform Reporter: liujia <jiajliu>
Component: Cluster Version Operator    Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA QA Contact: liujia <jiajliu>
Severity: medium Docs Contact:
Priority: low    
Version: 4.3.0    CC: adahiya, aos-bugs, jokerman, wking
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-04 11:15:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description liujia 2019-11-12 07:26:00 UTC
Description of problem:
Upgraded a v4.2.4 cluster to a v4.3 nightly build. During the upgrade, the clusterversion object did not show the upgrade progress; it reported the target version as if the update had already completed.

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.4     True        True          21s     Working towards 4.3.0-0.nightly-2019-11-12-000306: 1% complete

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.4     True        False         36s     Cluster version is 4.3.0-0.nightly-2019-11-12-000306

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.4     True        False         115s    Cluster version is 4.3.0-0.nightly-2019-11-12-000306
....

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.4     True        False         5m14s   Cluster version is 4.3.0-0.nightly-2019-11-12-000306

# ./oc adm upgrade
Cluster version is 4.3.0-0.nightly-2019-11-12-000306

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: currently installed version 4.3.0-0.nightly-2019-11-12-000306 not found in the "stable-4.2" channel

Checked the history; it shows the upgrade is not complete yet.
# oc get clusterversion -o json|jq .items[0].status.history
[
  {
    "completionTime": null,
    "image": "registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306",
    "startedTime": "2019-11-12T03:55:56Z",
    "state": "Partial",
    "verified": false,
    "version": "4.3.0-0.nightly-2019-11-12-000306"
  },
  {
    "completionTime": "2019-11-12T03:51:56Z",
    "image": "registry.svc.ci.openshift.org/ocp/release@sha256:cebce35c054f1fb066a4dc0a518064945087ac1f3637fe23d2ee2b0c433d6ba8",
    "startedTime": "2019-11-12T03:34:41Z",
    "state": "Completed",
    "verified": false,
    "version": "4.2.4"
  }
]
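The mismatch above can be checked mechanically: an entry with state "Partial" (or a null completionTime) means the rollout is unfinished, whatever the top-level status line says. A minimal sketch using the history captured above; `update_in_progress` is a hypothetical helper, not CVO code:

```python
import json

# clusterversion history as captured in this bug (abridged to the
# fields the check needs): the 4.3 entry is still Partial with no
# completionTime, yet the status line already said "Cluster version is 4.3...".
history = json.loads("""
[
  {"completionTime": null,
   "startedTime": "2019-11-12T03:55:56Z",
   "state": "Partial",
   "version": "4.3.0-0.nightly-2019-11-12-000306"},
  {"completionTime": "2019-11-12T03:51:56Z",
   "startedTime": "2019-11-12T03:34:41Z",
   "state": "Completed",
   "version": "4.2.4"}
]
""")

def update_in_progress(history):
    """Return the version of an unfinished update, or None.

    A "Partial" state or a null completionTime means the cluster has
    not finished rolling out that version, regardless of what the
    top-level status message claims.
    """
    for entry in history:
        if entry["state"] == "Partial" or entry["completionTime"] is None:
            return entry["version"]
    return None

print(update_in_progress(history))
```

Against the history above this prints the nightly version, confirming the update was still in flight while the status claimed otherwise.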

Checked the clusteroperator status; some of the operators are still updating.
# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.4                               True        False         False      15m
cloud-credential                           4.2.4                               True        False         False      29m
cluster-autoscaler                         4.2.4                               True        False         False      25m
console                                    4.2.4                               True        False         False      18m
dns                                        4.2.4                               True        False         False      29m
image-registry                             4.2.4                               True        False         False      21m
ingress                                    4.2.4                               True        False         False      20m
insights                                   4.2.4                               True        False         False      30m
kube-apiserver                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      28m
kube-controller-manager                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      27m
kube-scheduler                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      27m
machine-api                                4.3.0-0.nightly-2019-11-12-000306   True        False         False      30m
machine-config                             4.2.4                               True        False         False      29m
marketplace                                4.2.4                               True        False         False      25m
monitoring                                 4.2.4                               True        False         False      19m
network                                    4.2.4                               True        False         False      29m
node-tuning                                4.3.0-0.nightly-2019-11-12-000306   True        False         False      95s
openshift-apiserver                        4.3.0-0.nightly-2019-11-12-000306   True        False         False      26m
openshift-controller-manager               4.2.4                               True        False         False      27m
openshift-samples                          4.3.0-0.nightly-2019-11-12-000306   True        False         False      56s
operator-lifecycle-manager                 4.2.4                               True        False         False      28m
operator-lifecycle-manager-catalog         4.2.4                               True        False         False      28m
operator-lifecycle-manager-packageserver   4.2.4                               True        False         False      28m
service-ca                                 4.2.4                               True        False         False      29m
service-catalog-apiserver                  4.3.0-0.nightly-2019-11-12-000306   True        False         False      26m
service-catalog-controller-manager         4.2.4                               True        False         False      26m
storage                                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      107s

But the upgrade is actually still ongoing, and the operators above do eventually finish updating.
# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-12-000306   True        False         70m     Cluster version is 4.3.0-0.nightly-2019-11-12-000306

# ./oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-11-12T03:55:56Z|2019-11-12T04:33:54Z|Completed|4.3.0-0.nightly-2019-11-12-000306
2019-11-12T03:34:41Z|2019-11-12T03:51:56Z|Completed|4.2.4

# ./oc get co|grep "4.3.0-0.nightly-2019-11-12-000306"|wc -l
27

# ./oc get co|grep "4.2.4"|wc -l
0
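The grep|wc pipelines above tally how many operators report each version. The same tally can be sketched in a few lines, assuming the second whitespace-separated column of `oc get co` is the version (the `versions` helper and the abridged sample rows are illustrative, not part of oc):

```python
from collections import Counter

# Sample rows abridged from the `oc get co` output earlier in this
# report (header stripped): NAME, VERSION, AVAILABLE, PROGRESSING,
# DEGRADED, SINCE.
co_output = """\
authentication   4.2.4                               True  False  False  15m
kube-apiserver   4.3.0-0.nightly-2019-11-12-000306   True  False  False  28m
machine-config   4.2.4                               True  False  False  29m
node-tuning      4.3.0-0.nightly-2019-11-12-000306   True  False  False  95s
"""

def versions(table):
    """Tally the VERSION column of `oc get co` output."""
    return Counter(line.split()[1] for line in table.splitlines() if line.strip())

print(versions(co_output))
```

Once every operator sits in the target-version bucket and none remain on the old version, the rollout has genuinely converged, which is what the 27/0 counts above show.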

Version-Release number of the following components:
4.3.0-0.nightly-2019-11-12-000306
# ./oc version
Client Version: openshift-clients-4.3.0-201910250623-48-g8c30708e
Server Version: 4.3.0-0.nightly-2019-11-12-000306
Kubernetes Version: v1.16.2

How reproducible:
Not 100% reproducible; hit once.

Steps to Reproduce:
1. Upgrade ocp from v4.2.4 to 4.3.0-0.nightly-2019-11-12-000306
# ./oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 --allow-explicit-upgrade --force
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306

Actual results:
The clusterversion object does not show the correct status during the upgrade.

Expected results:
The clusterversion object should show the correct status during the upgrade.

Additional info:
When the status was incorrect, I checked that the CVO itself was running normally.
# ./oc get all -n openshift-cluster-version
NAME                                            READY   STATUS      RESTARTS   AGE
pod/cluster-version-operator-7446bc8f98-kdwnr   1/1     Running     0          11m
pod/version--gvw9x-phzjb                        0/1     Completed   0          11m

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cluster-version-operator   ClusterIP   172.30.240.16   <none>        9099/TCP   33m

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-version-operator   1/1     1            1           33m

NAME                                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-version-operator-7446bc8f98   1         1         1       11m
replicaset.apps/cluster-version-operator-77545448d6   0         0         0       33m
replicaset.apps/cluster-version-operator-7b786f87c6   0         0         0       33m

NAME                       COMPLETIONS   DURATION   AGE
job.batch/version--gvw9x   1/1           15s        11m

Comment 6 W. Trevor King 2019-11-14 02:44:52 UTC
Looking at the attached must-gather fragment:

$ sha1sum log.tgz 
0aa3c5a2c7e8a1cae260ee8a9ce158f57d4088b3  log.tgz
$ tar xzf log.tgz
$ grep 'Running sync.*in state\|Result of work' log/namespaces/openshift-cluster-version/pods/cluster-version-operator-7446bc8f98-p4vjs/cluster-version-operator/cluster-version-operator/logs/current.log | head -n6
2019-11-12T04:22:23.005709831Z I1112 04:22:23.003748       1 sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Updating at attempt 0
2019-11-12T04:28:08.055820161Z I1112 04:28:08.055806       1 task_graph.go:611] Result of work: [Cluster operator monitoring is still updating]
2019-11-12T04:28:33.341970344Z I1112 04:28:33.341890       1 sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Updating at attempt 1
2019-11-12T04:33:49.389389153Z I1112 04:33:49.389357       1 task_graph.go:611] Result of work: []
2019-11-12T04:36:41.915469438Z I1112 04:36:41.915369       1 sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Reconciling at attempt 0
2019-11-12T04:37:14.276571228Z I1112 04:37:14.276551       1 task_graph.go:611] Result of work: []

So that seems reasonable, and 04:33Z matches the upgrade-complete from your comment 0 history.  I'm looking to see if I can find anything from earlier...
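The sync/result pairs pulled out by the grep above can also be paired up programmatically, which makes it easier to see which attempt still had blockers. A minimal sketch assuming the log format shown; the `summarize` helper is hypothetical, not CVO code:

```python
import re

# Abridged sync-worker lines from the must-gather excerpt above.
lines = [
    "sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Updating at attempt 0",
    "task_graph.go:611] Result of work: [Cluster operator monitoring is still updating]",
    "sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Updating at attempt 1",
    "task_graph.go:611] Result of work: []",
    "sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Reconciling at attempt 0",
    "task_graph.go:611] Result of work: []",
]

STATE_RE = re.compile(r"in state (\w+) at attempt (\d+)")
RESULT_RE = re.compile(r"Result of work: \[(.*)\]")

def summarize(lines):
    """Pair each sync attempt with the result line that follows it."""
    events = []
    for line in lines:
        m = STATE_RE.search(line)
        if m:
            events.append({"state": m.group(1),
                           "attempt": int(m.group(2)),
                           "errors": None})
            continue
        m = RESULT_RE.search(line)
        if m and events:
            events[-1]["errors"] = m.group(1) or None
    return events

for e in summarize(lines):
    print(e["state"], e["attempt"], e["errors"])
```

For this excerpt, attempt 0 ends with the monitoring operator still updating, attempt 1 finishes clean, and the worker then moves to Reconciling, matching the 04:33Z completion noted above.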

Comment 7 W. Trevor King 2019-11-14 02:58:39 UTC
What we might need is a dump of the CVO logs while it is reporting the early 'Cluster version is ...', before that pod gets killed (probably because the control-plane machine it was on is updated), rescheduled elsewhere, and the old logs are blown away.

Comment 8 liujia 2019-11-14 03:39:56 UTC
OK, got it. I will capture more logs before the original CVO pod is killed if I hit this again.

Comment 9 Scott Dodson 2020-02-03 18:27:34 UTC
Can we please confirm whether or not this still happens both in the latest 4.4 builds and the 4.3 release builds?

Comment 10 liujia 2020-02-04 08:11:06 UTC
Monitored an upgrade from v4.3.0 to the latest 4.4.0-0.nightly-2020-02-04-002939; did not hit the issue.
Monitored another upgrade from v4.2.16 to the latest v4.3.0; did not hit the issue.

Comment 12 errata-xmlrpc 2020-05-04 11:15:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581