Bug 1765219 - [Disruptive] Cluster upgrade should maintain a functioning cluster
Summary: [Disruptive] Cluster upgrade should maintain a functioning cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Stefan Schimanski
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-24 14:39 UTC by Fabio Bertinatto
Modified: 2023-09-14 05:44 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:09:08 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2020:0062 - Last Updated: 2020-01-23 11:09:35 UTC

Comment 1 Abhinav Dahiya 2019-10-30 16:14:58 UTC
This should have been fixed already by https://github.com/openshift/cluster-version-operator/pull/265

Comment 3 liujia 2019-11-05 08:41:19 UTC
From job [1], I did not see any obvious log indicating whether it is related to the CRD API version issue. And from recent CI build jobs, I noticed a similar failure happening again in [2].

@Stefan Schimanski
I'm not quite sure how QE should verify this bug against PR #265. Could you help confirm?


[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/9490
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10456

Comment 4 liujia 2019-12-06 03:19:17 UTC
Still seeing this test failure in recent CI tests.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12127/

[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:130]: during upgrade
Unexpected error:
    <*errors.errorString | 0xc000be4360>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-05-213858: 13% complete",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-05-213858: 13% complete
occurred
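
For triaging an upgrade that stalls like this at "13% complete", a minimal sketch of the checks one might run against the affected cluster (assuming cluster-admin access via oc; these are standard 4.x commands, not commands taken from this job's artifacts):

# Overall upgrade progress as reported by the cluster-version operator
oc get clusterversion
oc adm upgrade

# Which cluster operators are still progressing or degraded
oc get clusteroperators

# What the CVO itself says it is waiting on
oc -n openshift-cluster-version logs deployment/cluster-version-operator --tail=100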

Comment 5 Stefan Schimanski 2019-12-11 16:02:17 UTC
@liujia: Not every run failing with "Cluster upgrade should maintain a functioning cluster" is due to the CRD topic fixed in https://github.com/openshift/cluster-version-operator/pull/265. This was about one very specific case where upgrade could fail.

What #265 fixed was about failing updates of very early CRDs before kube-apiserver was updated to 4.3.

Looking at https://search.svc.ci.openshift.org/?search=resource+may+have+been+deleted&maxAge=168h&context=2&type=all suggests to me that this is no longer the case. Moving back to MODIFIED.
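
As a rough way to tell whether a particular failed upgrade hit the specific case #265 addressed, one could look for the same error string used in the CI search above, either in the live cluster's CVO logs or in the downloaded job artifacts. This is only a sketch; the artifact path is illustrative, and the exact log wording is assumed to match the search query:

# Live cluster: does the CVO log the CRD apply failure that #265 targeted?
oc -n openshift-cluster-version logs deployment/cluster-version-operator | grep "resource may have been deleted"

# Downloaded CI artifacts / must-gather (path is illustrative)
grep -r "resource may have been deleted" ./artifacts/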

Comment 6 Xingxing Xia 2019-12-13 08:02:57 UTC
While verifying bug 1779237#c4 yesterday, the tested cluster also failed, stuck as shown below for 20+ hours:
oc get clusterversion
version   4.2.0-0.nightly-2019-12-11-171302   True        True          22h     Working towards 4.3.0-0.nightly-2019-12-12-021332: 13% complete
[xxia 2019-12-13 15:47:48 my]$ oc get co | grep -v "True        False         False"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress                                    4.3.0-0.nightly-2019-12-12-021332   False       True          True       21h
kube-apiserver                             4.3.0-0.nightly-2019-12-12-021332   True        True          True       23h
kube-controller-manager                    4.3.0-0.nightly-2019-12-12-021332   True        True          True       23h
kube-scheduler                             4.3.0-0.nightly-2019-12-12-021332   True        False         True       23h
machine-config                             4.2.0-0.nightly-2019-12-11-171302   False       True          True       21h
monitoring                                 4.3.0-0.nightly-2019-12-12-021332   False       True          True       21h
network                                    4.3.0-0.nightly-2019-12-12-021332   True        True          True       23h
[xxia 2019-12-13 15:48:14 my]$ oc get no
NAME                                         STATUS                        ROLES    AGE   VERSION
ip-10-0-135-186.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   23h   v1.14.6+cebabbf4a
ip-10-0-141-151.us-east-2.compute.internal   Ready                         master   23h   v1.14.6+cebabbf4a
ip-10-0-147-188.us-east-2.compute.internal   Ready                         worker   23h   v1.14.6+cebabbf4a
ip-10-0-153-212.us-east-2.compute.internal   NotReady,SchedulingDisabled   master   23h   v1.14.6+cebabbf4a
ip-10-0-170-139.us-east-2.compute.internal   Ready                         master   23h   v1.14.6+cebabbf4a
[xxia 2019-12-13 15:48:31 my]$ oc describe co kube-apiserver
...
Status:
  Conditions:
    Last Transition Time:  2019-12-12T10:12:55Z
    Message:               NodeControllerDegraded: The master node(s) "ip-10-0-153-212.us-east-2.compute.internal" not ready
    Reason:                NodeControllerDegradedMasterNodesReady
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-12-13T03:06:56Z
    Message:               Progressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 9
    Reason:                Progressing
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-12-12T08:11:12Z
    Message:               Available: 3 nodes are active; 3 nodes are at revision 7; 0 nodes have achieved new revision 9
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-12-12T08:10:06Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
...
[xxia 2019-12-13 15:52:22 my]$ oc get po -n openshift-kube-apiserver -l apiserver --show-labels
NAME                                                        READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ip-10-0-141-151.us-east-2.compute.internal   3/3     Running   0          22h   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-ip-10-0-153-212.us-east-2.compute.internal   3/3     Running   0          22h   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-ip-10-0-170-139.us-east-2.compute.internal   3/3     Running   0          22h   apiserver=true,app=openshift-kube-apiserver,revision=7

I checked https://openshift-release.svc.ci.openshift.org/ and clicked the latest payload, 4.3.0-0.nightly-2019-12-13-032731. In https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-13-032731 I saw "4.2.10 (changes) - Failed"; clicking it, https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12586 also shows:
Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-13-032731: 13% complete

I'm not sure if this failure is the same as the original bug report, but the 4.2 to 4.3 upgrade indeed does not work now.
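
Given the state above (one master NotReady,SchedulingDisabled and machine-config still at 4.2), a minimal sketch of follow-up checks; the node name is taken from the output above, everything else assumes the standard 4.x resources rather than anything specific to this cluster:

# Why is the master stuck NotReady after drain/reboot?
oc describe node ip-10-0-153-212.us-east-2.compute.internal

# Machine-config rollout state and the operator's degraded message
oc get machineconfigpools
oc get clusteroperator machine-config -o yaml

# Machine-config-daemon pod running on the stuck node
oc -n openshift-machine-config-operator get pods -o wide | grep 10-0-153-212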

Comment 8 liujia 2019-12-13 08:17:56 UTC
(In reply to Stefan Schimanski from comment #5)
> @liujia: Not every run failing with "Cluster upgrade should maintain a
> functioning cluster" is due to the CRD topic fixed in
> https://github.com/openshift/cluster-version-operator/pull/265. This was
> about one very specific case where upgrade could fail.
> 
> What #265 fixed was about failing updates of very early CRDs before
> kube-apiserver was updated to 4.3.
> 
> Looking
> https://search.svc.ci.openshift.org/
> ?search=resource+may+have+been+deleted&maxAge=168h&context=2&type=all
> suggests me that this is not the case anymore. Moving back to modified.

Thanks for pointing out the concrete error info for the failed test [Disruptive] Cluster upgrade should maintain a functioning cluster. I double-checked and confirmed that the same error info was not present in job https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12127/ from the last verification.

Comment 9 liujia 2019-12-13 08:23:55 UTC
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-
> openshift-origin-installer-e2e-aws-upgrade/12586 also shows:
> Cluster did not complete upgrade: timed out waiting for the condition:
> Working towards 4.3.0-0.nightly-2019-12-13-032731: 13% complete
> 
> I'm not sure if this failure is same as original bug report. But 4.2 to 4.3
> upgrade indeed does not work now.

I double-checked https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12586; it is not the same as the CRD API version issue fixed in PR #265. So @xxia, you could file a new bug for your new issue. I will verify this bug.

Comment 10 Xingxing Xia 2019-12-13 09:17:25 UTC
(In reply to liujia from comment #9)
> So @xxia, you could file a new bug for your new issue
Thanks, it should be the same as bug 1778904.

Comment 12 errata-xmlrpc 2020-01-23 11:09:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 13 Red Hat Bugzilla 2023-09-14 05:44:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

